Graphical presentation of data, best practices

‘Show the data, don’t conceal them’ was the first article in a series published in the British Journal of Pharmacology on best practices in statistical reporting. The current set of articles in this series can be obtained at http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291476-5381/homepage/statistical_reporting.htm.

“Show the data…” deals with the graphical presentation of data. The article urges authors to present data in their raw form, so that all the characteristics of the data distribution are visible. The use of the so-called ‘dynamite plunger plot’, in which the mean of the data is represented by a bar with a ‘T’ at the top to show variability, is discouraged: such plots can give a false impression of symmetry in the data. The article concludes that data can be better presented and compared using simple dotplots.
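As a quick illustration of the recommendation (this example is mine, not from the article, and the data are simulated), a dotplot that shows every observation can be drawn in base R with stripchart():

    # Simulated response values for two hypothetical groups
    set.seed(1)
    values <- c(rnorm(10, mean = 5, sd = 1), rnorm(10, mean = 6.5, sd = 2))
    group  <- rep(c("Control", "Treated"), each = 10)

    # Dotplot: every observation is visible, unlike a bar-and-error-bar plot
    stripchart(values ~ group, vertical = TRUE, method = "jitter",
               pch = 16, xlab = "Group", ylab = "Response")

    # Overlay the group means for reference
    points(1:2, tapply(values, group, mean), pch = "-", cex = 3)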

For those interested, the previous post in this blog was also on the graphical representation of data and included simple R code for generating different ‘types’ of dotplots.

Treatment comparisons for pretest – posttest experimental studies

Recently, I faced a question on the statistical analysis method for an animal study involving the testing of lipid-lowering agents. The study was designed along the following lines. Animals were randomized to receive one of the study agents. Prior to initiation of treatment, lipid levels (LDL, HDL and total cholesterol) were measured (baseline). Study drugs were administered daily for xx days and the lipid levels were measured again at the end of the study (post treatment).

This kind of study design is often labelled as a pretest – posttest design and is quite common in the medical field for comparing different treatments.

In many clinical studies of cholesterol-lowering agents, the percent change from baseline is analyzed for between-group differences. Hence, I suggested using percent change from baseline as the outcome measure in an ANOVA as the statistical analysis method for the above animal experiment. On further scanning of the literature, however, I noted that in spite of the widespread use of the pretest – posttest experimental design, there is a lack of consensus on the best method for analyzing the data.

The possible choices for the outcome measure in the analysis of data from a pretest – posttest study are the post-treatment values (PV), the change between baseline and post-treatment values, referred to in the literature as change scores or gain scores (DIFF), or some measure of relative change between baseline and post-treatment values. Percent change from baseline (PC), mentioned above, is one measure of relative change; others include the symmetrized percent change (SPC) and the log ratio of post-treatment to baseline values (LR). Furthermore, the methods for statistical analysis include parametric and non-parametric versions of ANOVA or ANCOVA on any of these outcome measures. So there are indeed many possibilities for the analysis of data from a pretest – posttest study!
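To make these alternatives concrete, here is a small R sketch (with made-up baseline and post-treatment vectors; the variable names are mine) computing each candidate outcome measure. The SPC formula below is one common form, the change expressed relative to the average of the two values; the exact definition used by Berry & Ayers should be checked in their paper.

    # Hypothetical baseline and post-treatment LDL values for a group of animals
    baseline <- c(120, 135, 128, 142, 150, 133)
    post     <- c( 95, 110, 118, 120, 125, 112)

    PV   <- post                                          # post-treatment value
    DIFF <- post - baseline                               # change (gain) score
    PC   <- 100 * (post - baseline) / baseline            # percent change from baseline
    SPC  <- 200 * (post - baseline) / (post + baseline)   # symmetrized percent change (one common form)
    LR   <- log(post / baseline)                          # log ratio of post-treatment to baseline

    round(cbind(PV, DIFF, PC, SPC, LR), 2)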

Various simulation studies have provided pointers to guide the choice of the outcome measure and analysis method. The current thinking seems to be that an ANCOVA on PV has higher power than a simple ANOVA on PC, especially when there is little correlation between baseline and post-treatment values. However, SPC with either ANOVA or ANCOVA seems to be a good option in the case of additive or multiplicative correlation between baseline and post-treatment values.
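For concreteness, the two competing analyses can be set up in R along the following lines. This is only a sketch: the data are simulated, and the group structure, effect sizes and variable names are arbitrary assumptions.

    # Simulated pretest-posttest data: three treatment groups, post values related to baseline
    set.seed(42)
    n        <- 10
    group    <- factor(rep(c("Vehicle", "DrugA", "DrugB"), each = n))
    baseline <- rnorm(3 * n, mean = 130, sd = 15)
    effect   <- c(Vehicle = 0, DrugA = -15, DrugB = -25)[as.character(group)]
    post     <- 50 + 0.6 * baseline + effect + rnorm(3 * n, sd = 10)
    dat      <- data.frame(group, baseline, post)
    dat$PC   <- 100 * (dat$post - dat$baseline) / dat$baseline

    # ANOVA on percent change from baseline (PC)
    summary(aov(PC ~ group, data = dat))

    # ANCOVA on the post-treatment values (PV), with baseline as covariate
    anova(lm(post ~ baseline + group, data = dat))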

The simulation studies have tried to mimic various real scenarios. Vickers has studied in detail the situation where the outcome is continuous and there is an additive correlation between post-treatment and baseline values; SPC was not studied as an outcome measure (Vickers A. J. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Medical Research Methodology (2001) 1:6). Berry & Ayers, meanwhile, consider count data, include SPC as an outcome measure, and consider both parametric and non-parametric analysis methods in the presence of additive and multiplicative correlation between baseline and post-treatment values (Berry D. A. & Ayers G. D. Symmetrized Percent Change for Treatment Comparisons. The American Statistician (2006) 60:1, 27-31).

At the time the statistical analysis plan is designed, there is bound to be little information available on the extent and type of correlation that may exist between the baseline and post-treatment values. Hence there is a need for extensive review and/or simulations, taking into account the types of data and correlation structures encountered in different therapeutic areas, to identify (therapeutic-area-specific!) best practices for the choice of outcome measure and analysis method for the pretest – posttest design.
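As a starting point for such simulations, a crude power comparison between ANCOVA on post-treatment values and ANOVA on percent change can be coded in a few lines of R. All the numbers below (sample size, strength of the baseline effect, treatment effect) are arbitrary assumptions for illustration, not recommendations.

    # Crude power comparison: ANCOVA on post values vs. ANOVA on percent change
    set.seed(123)
    sim_once <- function(n = 15, slope = 0.3, effect = -10) {
      group    <- factor(rep(c("Control", "Treated"), each = n))
      baseline <- rnorm(2 * n, mean = 130, sd = 15)
      # post values depend additively on baseline, plus a treatment effect and noise
      post     <- 90 + slope * baseline + ifelse(group == "Treated", effect, 0) +
                  rnorm(2 * n, sd = 15)
      pc       <- 100 * (post - baseline) / baseline
      c(ancova_pv = anova(lm(post ~ baseline + group))["group", "Pr(>F)"],
        anova_pc  = anova(lm(pc ~ group))["group", "Pr(>F)"])
    }
    pvals <- replicate(2000, sim_once())
    rowMeans(pvals < 0.05)   # empirical power of each approach at the 5% level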

Calculating Standard Deviation

Most standard books on Biostatistics give two formulae for calculating the standard deviation (SD) of a set of observations.

The population SD (for a finite population) is defined as \sqrt{\frac{\sum_{i=1}^{N}\left( X_{i} - \mu \right)^{2}}{N}}, where X_1, X_2, …, X_N are the N observations and μ is the mean of the finite population.

The sample SD is defined as \sqrt{\frac{\sum_{i=1}^{n}\left( x_{i} - \bar{x} \right)^{2}}{n-1}}, where x_1, x_2, …, x_n are the n observations in the sample and \bar{x} is the sample mean.

While teaching a Biostatistics course, it becomes a little difficult to explain to students why there are two different formulae, and why we subtract ‘1’ from n when we calculate the SD from sample observations. These are students who are (or will be) collecting samples, performing experiments, obtaining sample data, and analyzing data. Hence it is important that they understand the concept from a practitioner’s point of view rather than a theoretical one.

In my experience, the best explanation has been to avoid a theoretical discussion of unbiasedness and to state that, to calculate the SD from sample data, we would also need to know the true (population) mean. Since the population mean is unknown, we use the same n observations in the sample to first calculate the sample mean \bar{x} as an estimate of the population mean μ. Our sample observations, however, tend to be closer to the sample mean than to the population mean, and hence the SD calculated using the first formula would underestimate the true variability. To correct for this underestimation we divide by n – 1 instead of n. Then the question: why n – 1? We have already used the n observations to calculate one quantity, i.e., \bar{x}. We are then left with only n – 1 ‘degrees of freedom’ for further calculations that use \bar{x}.
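A quick simulation in R makes the underestimation visible. The comparison below is on the variance scale, where dividing by n – 1 removes the bias exactly; the population (normal with SD = 10) and the sample size are arbitrary choices for illustration.

    # Draw many small samples from a population with known SD = 10 and compare the
    # average variance estimate when dividing by n versus n - 1
    set.seed(7)
    n       <- 5
    samples <- replicate(10000, rnorm(n, mean = 50, sd = 10))

    var_n       <- apply(samples, 2, function(x) sum((x - mean(x))^2) / n)
    var_n_minus <- apply(samples, 2, function(x) sum((x - mean(x))^2) / (n - 1))

    c(true_variance = 10^2,
      divide_by_n   = mean(var_n),        # systematically too small (around 80)
      divide_by_n_1 = mean(var_n_minus))  # close to 100
    # Note: R's built-in var() and sd() already use the n - 1 denominator.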

I also state three more points to complete the topic:

  1. That the quantity n – 1 is referred to as the degrees of freedom for computing the SD and we say that 1 degree of freedom has been used up while estimating μ using sample data.
  2. For most situations that the students would encounter, the complete population is not available and they would have to work with sample data. So in most (if not all) situations, they should be using the formula with n – 1 to calculate the SD.
  3. As the sample size increases, the differences arising out of using n or n – 1 should be of little practical importance. But it is still conceptually correct to use n – 1.

When the students were then asked how they would define degrees of freedom, one of them came up with the definition ‘number of observations minus number of calculations’, which is pretty much what it is as far as a practitioner is concerned!

M$ Excel for Statistical Analysis?

Many statisticians do not advocate using Microsoft Excel for statistical analysis, except perhaps for obtaining the simplest of data summaries and charts. Even the charts produced using the default options in Excel are considered chartjunk. However, many introductory courses in Statistics use Excel as a tool for their statistical computing labs, and there is no disputing the fact that Excel is an extremely easy-to-use software tool.

This post is based on a real situation that arose when illustrating t-tests in Excel (Microsoft Excel 2010) for a Biostatistics course.

The question was whether the onset of BRCA mutation-related breast cancer occurs at an earlier age in subsequent generations. The sample data provided were the ages (in years) at onset of BRCA mutation-related breast cancer for mother–daughter pairs. Here are the data:

Mother  Daughter
47      42
46      41
42      42
40      39
48      44
48      45
49      41
38      45
50      44
47      48
46      39
43      36
54      44
48      46
49      46
45      –
39      40
48      36
46      43
–       41
49      42
48      39
49      43
45      47
36      44

(– indicates a missing value)

Some students in the course used the Excel function T.TEST and obtained the one-tailed p-value for the paired t-test as 0.001696. Other students used the ‘Data Analysis’ add-in from the ‘Data’ tab and obtained the following results:

t-Test: Paired Two Sample for Means

                                Variable 1   Variable 2
Mean                            45.83333     42.375
Variance                        17.97101     10.07065
Observations                    24           24
Pearson Correlation             0.143927
Hypothesized Mean Difference    0
df                              23
t Stat                          1.247242
P(T<=t) one-tail                0.11243
t Critical one-tail             1.713872
P(T<=t) two-tail                0.22486
t Critical two-tail             2.068658

The one-tailed p-value reported here is 0.11243! Surprised at this discrepancy, we decided to verify the analysis by hand calculation. The correct p-value is 0.001696, the value obtained from the T.TEST function.

Now, what is the problem with the results from the ‘Data Analysis’ add-in? A closer look at the results table reveals that the reported degrees of freedom (df) is 23, which corresponds to 24 pairs. However, because of the missing values in the dataset, the number of usable (complete) pairs for analysis is only 23, and the correct df is therefore 22.

This shows that missing values in the data are not handled correctly by the ‘Data Analysis’ add-in. A search shows this problem with the add-in was reported as early as the year 2000, with Microsoft Excel 2000 (http://link.springer.com/content/esm/art:10.1007/s00180-014-0482-5/file/MediaObjects/180_2014_482_MOESM1_ESM.pdf). Unfortunately, the error has never been corrected in subsequent versions of the software. It looks like the bad charts are the least of Excel’s problems! Other problems that have been reported include poor numerical accuracy, poor random number generation tools and errors in the data analysis add-ins.
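As a cross-check outside Excel, the same paired analysis can be run in R, which drops incomplete pairs automatically. The vectors below encode the table above, with NA for the two missing values; the call should reproduce the hand-verified one-tailed p-value of about 0.0017 with 22 degrees of freedom.

    mother   <- c(47, 46, 42, 40, 48, 48, 49, 38, 50, 47, 46, 43, 54,
                  48, 49, 45, 39, 48, 46, NA, 49, 48, 49, 45, 36)
    daughter <- c(42, 41, 42, 39, 44, 45, 41, 45, 44, 48, 39, 36, 44,
                  46, 46, NA, 40, 36, 43, 41, 42, 39, 43, 47, 44)

    # One-sided paired t-test: are daughters diagnosed at younger ages than their mothers?
    # Pairs containing NA are excluded, leaving 23 complete pairs and df = 22.
    t.test(mother, daughter, paired = TRUE, alternative = "greater")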

Having said all of this, we certainly cannot deny that Excel has been, and continues to be, a very useful software tool for demonstrating and conducting basic data exploration and statistical analysis, especially for a non-statistician audience. The post at http://stats.stackexchange.com/questions/3392/excel-as-a-statistics-workbench provides a balanced view of the pros and cons of using Excel for data analysis. The take-away for us is to be extremely careful when using Microsoft Excel for data management and statistical analysis, and to doubly verify the results of any such data operation and analysis.

Why Statistics?

Sometime back I gave a talk on the topic ‘Why Statistics?’ in the course of a workshop on Clinical Research Methodology. Having found no crisp answers to the question during the research for the talk, and considering the importance of the topic, I thought ‘Why Statistics?’ would also be a good theme for a first blog post.

‘Statistics’ is a term used to refer both to the subject of statistics as well as to data and data summaries. I plan to discuss these different definitions of ‘statistics’ in a future post.

Statistics is a scientific discipline that is important not just in clinical research. Nowadays, with the increased emphasis on data analytics and evidence-based decisions in research and business, an awareness of the importance and right use of statistical methods is crucial in every domain of application.

As a scientific discipline, statistics can be defined as “the science of collecting, analyzing, presenting, and drawing inference from data”.

We should probably rewrite this definition as “…drawing inference from incomplete data”, because most often we use data from a random sample of a large population to draw conclusions about the population. Moreover, since different random samples drawn from the same population may give slightly different results due to the sampling variability, the definition of statistics can be further expanded as “the science of collecting, analyzing, and drawing inference from incomplete data, in the presence of variability”.

Why do we need to have a basic understanding of statistics?

Nowadays, in our professional as well as personal lives, we are constantly bombarded with ‘statistics’ (here statistics = data + analysis) in the course of our work or by the media.

We need a good understanding of statistics so that we are in a position to look critically at the origin of the data (the design of the study), the data themselves, the analysis of the data and the inference. Most importantly, we also need to know that, due to sampling variability, there will always be a certain amount of uncertainty in the inference from a statistical analysis. We need to keep this uncertainty in mind to make appropriate, informed decisions. Unfortunately, most often, this uncertainty either does not find its way into reports, especially media reports, or is not given sufficient importance. (Nice cartoon at http://adequatebird.com/wp-content/uploads/2010/01/phd012010s.gif posted in The LoveStats Blog.)

To put it in a nutshell, a right understanding of the science of statistics is essential to filter the truth from the lies and the damned lies!
