Sample Size Calculations: Thinking about Effect Size

Today’s blog is brought to us by Georgette Asherman, founder of Direct Effects. She has been a professional statistician in pharmaceuticals, consumer products, business services and public policy for over 10 years. She has been associated with organizations such as Unilever, Bristol-Myers Squibb, Chase Manhattan Credit Card Services, and the New Jersey Department of Health. In recent years she has developed an interest in quantitative aspects of modern biological sciences. She has worked in clinical and non-clinical biostatistics, chemistry data analysis and instrument capability studies. Previous business experience includes direct mail and credit risk modeling, satisfaction and preference studies, and other market research activities. On the policy side she has been involved in public health survey analysis, data management, and sampling design for audits of compliance. She holds an M.S. in Statistics from Rutgers University and a B.A. from Cornell University. She is a member of the American Statistical Association and the New York Area SAS Users Group. Her contributed article and contact information follow below.

———————————————————–

A major activity for statisticians in pharmaceuticals is making recommendations about sample size. Why does one control/active study have 28 subjects in each treatment arm while another has 58? This is usually attributed to a method called ‘power analysis.’ The statistician will say something like ‘58 subjects are needed to get a significant effect at a critical value (significance level) of .05 with 90% power.’ This sounds authoritative, but what does it mean in terms of clinical activity? The statistician will say that there are enough subjects to find the effect 90% of the time if it really exists; in other words, it will be overlooked only 10% of the time. The clinicians and scientists accept this, but most likely are still perplexed.

In these discussions, the main idea of performing the study, finding a desired effect, can be overlooked. In the straightforward example of a control/active study, the effect size, also called the delta, is the difference between the means of the endpoint of interest. This delta is an assumption about an unknown, not an observed value. In designing a study, the investigator has to think about what impact would make this new active product worthwhile. Usually a big impact would be desirable and easy to see. But a well-designed study should find a small delta if it is of clinical interest, such as a drop of a few points in average cholesterol or blood pressure.

The starting assumption (or null hypothesis) is a difference of 0 between the means of the two groups. (There can be other null hypotheses, but this is the most common and straightforward.) With real people and real lab results, we shouldn’t expect to see exactly the same average for two groups, even when the impact of both treatments is the same. The difference between the active mean and the control mean will be a continuous real number such as 6.91 or 1.24 or -3.21. Since there are different types of averages, the word ‘mean’ is used for the classic technique of the sum divided by the number of observations. The statistical test will show whether this observed difference implies a real difference that isn’t 0.

We can do a study with a number of subjects that is convenient and affordable, like 10 or 20. This happens in academic research sometimes. We will observe differences of means like 6.91 or 1.24 or -3.21. This data can go into a statistical package and may show a significant effect. This ‘significance’ merely means that we can comfortably reject the hypothesis that the true difference is 0. But if there is no significant effect, how do we know that we aren’t missing out on finding the true effect? Should we recruit more people and redo the study? Since there was no power calculation, it is hard to judge.
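To make the risk of a small, convenient study concrete, here is a minimal sketch of how power could be approximated for a given number of subjects per arm. It assumes two equal-sized groups, a known common standard deviation, and a normal approximation for a two-sided test; the function name approx_power and the example inputs (a delta of 5 and a standard deviation of 10) are illustrative assumptions, not values from any real study.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation with equal group sizes and a known common SD.
    The tiny chance of a significant result in the wrong direction is ignored.
    """
    z = NormalDist()
    # Standard error of the difference between the two group means
    se = sd * sqrt(2 / n_per_group)
    # Cutoff for a two-sided test at the chosen significance level
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta / se - z_crit)


# Hypothetical inputs: assumed true difference of 5, assumed SD of 10
for n in (10, 20, 50, 100):
    print(f"n per group = {n:3d}  approximate power = {approx_power(n, 5, 10):.2f}")
```

Under these assumed inputs, studies with 10 or 20 subjects per arm have well under a 50% chance of detecting the effect, so a non-significant result from such a study says little about whether the effect exists.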

The ‘power’ of power analysis is the framework it provides. Since we can’t test the whole population, we will never know the true effect size. We observe a non-zero difference which can be larger or smaller than the true effect size. We can only say that it is not zero, not that this difference equals the true effect size from the active product. But suggesting a possible effect size grounds the results in the clinical context. The power number, either pre-set or calculated, shows the strength of the design. The sample size is obtained from these inputs.

These calculations are now done with several popular software packages. The desired effect size and power determine the recommended sample size. As the power goes up, the required number of subjects increases. As the effect size goes up, the required number of subjects decreases. This sounds odd to some people. But if the true effect is bigger, the observed difference stands out more clearly from random variation, so fewer subjects are needed to distinguish it from 0. Detecting smaller effects, closer to 0, requires more subjects.
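The two relationships described above can be seen in a short sketch. It uses the common normal-approximation formula for comparing two means with equal group sizes; the function name n_per_group and the input values are assumptions chosen for illustration. Commercial packages typically use the t distribution and may recommend slightly more subjects.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.90):
    """Approximate subjects per arm for a two-sided, two-group comparison of means.

    Normal approximation: n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2,
    rounded up to a whole subject.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance cutoff
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical inputs: assumed SD of 10, a range of deltas and power levels
for power in (0.80, 0.90):
    for delta in (2.5, 5, 10):
        n = n_per_group(delta, sd=10, power=power)
        print(f"power = {power:.2f}  delta = {delta:4}  n per group = {n}")
```

Reading down the output, raising the power increases the required sample size, while doubling the delta cuts it to roughly a quarter.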

Besides effect size and power, a sample size calculation requires the critical value, or significance level, and a suggested standard deviation, which is a measure of the spread of the data. The critical value is typically .05 or .01. The power analysis can be done for two-sided or one-sided tests. Two-sided tests identify a non-zero difference in either direction, while one-sided tests look for a change in one specified direction, either positive or negative. How do we know a standard deviation if we haven’t collected any data? Statisticians rely on previous studies or published results.
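As a rough illustration of how these remaining inputs move the answer, the sketch below varies the assumed standard deviation, the significance level, and the number of sides. It relies on a normal approximation with equal group sizes; the function name sample_size_per_arm, the two_sided flag, and all input values are hypothetical and would normally come from previous studies or published results.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(delta, sd, alpha=0.05, power=0.90, two_sided=True):
    """Approximate subjects per arm for comparing two means (normal approximation)."""
    z = NormalDist()
    # A two-sided test splits alpha between both tails; a one-sided test does not
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical inputs: assumed delta of 5, a few plausible SDs and alpha levels
for sd in (8, 10, 12):
    for alpha in (0.05, 0.01):
        two = sample_size_per_arm(5, sd, alpha=alpha, two_sided=True)
        one = sample_size_per_arm(5, sd, alpha=alpha, two_sided=False)
        print(f"sd = {sd:2d}  alpha = {alpha:.2f}  two-sided n = {two:3d}  one-sided n = {one:3d}")
```

A larger assumed standard deviation or a stricter significance level raises the required sample size, and a one-sided test needs fewer subjects than a two-sided test at the same level. This is why the choice of standard deviation from earlier work deserves as much scrutiny as the effect size.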

This is a very brief summary of a very large topic. While I compare two means as an example, the idea is applicable to treatment means within a subject, more than two treatment groups, and non-clinical studies in vivo or in vitro. Besides testing means, power analysis is used for counts such as safety results. Most of all, don’t be afraid to ask questions.

Georgette Asherman
Applied Statistician
Direct Effects, LLC.
www.directeffects.net
201 673-4301