Graphical presentation of data, best practices

‘Show the data, don’t conceal them’ was the first article in a series published in the British Journal of Pharmacology dealing with best practices in statistical reporting. The current set of articles in this series can be obtained at http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291476-5381/homepage/statistical_reporting.htm.

“Show the data…” deals with the graphical presentation of data. The article urges authors to present data in their ‘raw form’, so that all the characteristics of the data distribution remain visible. The use of the so-called ‘dynamite plunger plot’ is discouraged. In a ‘dynamite plunger plot’ the mean of the data is represented by a bar, with a ‘T’ at the top of the bar to show variability. It is argued that dynamite plunger plots can give a false impression of symmetry in the data. The article concludes that data can be better presented and compared using simple dotplots.
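For illustration, here is a minimal base R sketch of the two kinds of plots, using the built-in chickwts dataset (which also features in the next post) rather than any data from the article:

## 'dynamite plunger plot': bars for the group means, 'T's for the standard errors
means <- tapply(chickwts$weight, chickwts$feed, mean)
sems  <- tapply(chickwts$weight, chickwts$feed, function(x) sd(x) / sqrt(length(x)))
bp <- barplot(means, ylim = c(0, max(means + sems) * 1.1),
              ylab = "Weight (g)", main = "Dynamite plunger plot")
arrows(bp, means, bp, means + sems, angle = 90, length = 0.05)

## the recommended alternative: a simple dotplot showing every observation
stripchart(weight ~ feed, data = chickwts, vertical = TRUE, method = "jitter",
           pch = 19, ylab = "Weight (g)", main = "Dotplot")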

For those interested, the previous post in this blog was also on the graphical representation of data and included simple R code for generating different ‘types’ of dotplots.

Graphs in R – Overlaying Data Summaries in Dotplots

Dotplots are useful for the graphical visualization of small to medium-sized datasets. These simple plots provide an overview of how the data are distributed, whilst also showing the individual observations. Simple dotplots can, however, be made more informative by overlaying them with data summaries and/or smoothed distributions.

This post is about creating such superimposed dotplots in R – we first see how to create these plots using just base R graphics, and then proceed to create them using the ggplot2 R package.

## First things first - dataset 'chickwts': Weights of
## chickens fed with any one of six different feed types

?chickwts
data(chickwts)  ## load the dataset

 

Graphs using base R:

## First some plot settings

par(cex.main=0.9,cex.lab=0.8,font.lab=2,cex.axis=0.8,font.axis=2,col.axis="grey50")

We first create a dotplot where the median of each group is also displayed as a horizontal line:

## Getting the dotplot first, expanding the x-axis to leave room for the line
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method = "stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")

## Then compute the group-wise medians
medians <- tapply(chickwts[,"weight"], chickwts[,"feed"], median)

## Now add line segments corresponding to the group-wise medians
loc <- 1:length(medians)
segments(loc-0.3, medians, loc+0.3, medians, col="red", lwd=3)

Next, we create a dotplot where the median is shown along with the 1st and 3rd quartiles, i.e., the ‘box’ of the boxplot of the data is overlaid on the dotplot:

## Getting the dotplot first, expanding the x-axis to leave room for the box
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method="stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")

## Now draw the box, but without the whiskers!
boxplot(weight ~ feed, data = chickwts, add=TRUE, range=0, whisklty = 0, staplelty = 0)

Plots similar to the ones created above, but using the ggplot2 R package instead:

## Load the ggplot2 package first
library(ggplot2)

## Data and plot settings
p <- ggplot(chickwts, aes(x=feed, y=weight)) +
labs(title = "Chicken weights after six weeks", x = "Feed Type", y = "Weight (g)") +
theme(axis.title.x = element_text(face="bold"), axis.text.x = element_text(face="bold")) +
theme(axis.title.y = element_text(face="bold"), axis.text.y = element_text(face="bold"))

We use the stat_summary function to plot the median as an ‘errorbar’, but we first need to define our own function that calculates the group-wise median and returns it in the format stat_summary expects, like so:

## define custom median function
plot.median <- function(x) {
  m <- median(x)
  c(y = m, ymin = m, ymax = m)
}

## dotplot with median line
p1 <- p + geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5) +
stat_summary(fun.data="plot.median", geom="errorbar", colour="red", width=0.5, size=1)
print(p1)

For the dotplot overlaid with the median and the 1st and 3rd quartiles, the ‘box’ from the boxplot is plotted using the geom_boxplot function:

## dotplot with box
p2 <- p + geom_boxplot(aes(ymin=..lower.., ymax=..upper..)) +
geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5)
print(p2)

Additionally, let’s plot a dotplot with a violin plot overlaid. There is no built-in violin plot in base R graphics!

## dotplot with violin plot
## and add some cool colors
p3 <- p + geom_violin(scale="width", adjust=1.5, trim = FALSE, fill="indianred1", color="darkred", size=0.8) +
geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5)
print(p3)

Treatment comparisons for pretest – posttest experimental studies

Recently, I faced a question on the statistical analysis method for an animal study involving the testing of lipid-lowering agents. The study was designed along the following lines. Animals were randomized to receive one of the study agents. Prior to initiation of treatment, lipid levels (LDL, HDL and total cholesterol) were measured (baseline). Study drugs were administered daily for xx days and the lipid levels were measured again at the end of the study (post treatment).

This kind of study design is often labelled as a pretest – posttest design and is quite common in the medical field for comparing different treatments.

In many clinical studies of cholesterol-lowering agents, the percent change from baseline is analyzed for between-group differences. Hence, I suggested an ANOVA on the percent change from baseline as the statistical analysis method for the above animal experiment. On further scanning of the literature, however, I noted that in spite of the widespread prevalence of the pretest – posttest experimental design, there is no consensus on the best method for analyzing the data.

The possible choices for the outcome measure in the analysis of data from a pretest – posttest study are the post-treatment values (PV); the change between baseline and post-treatment values, referred to in the literature as change scores or gain scores (DIFF); or some measure of relative change between baseline and post-treatment values. The percent change from baseline (PC) mentioned above is one such measure of relative change; others include the symmetrized percent change (SPC) and the log ratio of post-treatment to baseline values (LR). Furthermore, the statistical analysis can be a parametric or non-parametric version of ANOVA or ANCOVA on any of these outcome measures. So there are indeed many possibilities for the analysis of data from a pretest – posttest study!
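As a concrete illustration, here is a minimal sketch in R of the candidate outcome measures and two of the models mentioned above; the data frame and its columns are simulated purely for illustration and are not the animal study data:

## purely illustrative data: two treatments, 10 animals each (not the real study)
set.seed(42)
dat <- data.frame(treatment = factor(rep(c("A", "B"), each = 10)),
                  baseline  = rnorm(20, mean = 120, sd = 15))
dat$post <- dat$baseline - ifelse(dat$treatment == "B", 20, 5) + rnorm(20, sd = 10)

## the candidate outcome measures
dat <- transform(dat,
                 DIFF = post - baseline,                             # change score
                 PC   = 100 * (post - baseline) / baseline,          # percent change
                 SPC  = 200 * (post - baseline) / (post + baseline), # symmetrized percent change
                 LR   = log(post / baseline))                        # log ratio

## e.g., ANOVA on the percent change vs ANCOVA on the post-treatment values
summary(aov(PC ~ treatment, data = dat))
summary(aov(post ~ baseline + treatment, data = dat))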

Various simulation studies have provided pointers to guide the choice of the outcome measure and analysis method. The current thinking seems to be that an ANCOVA on PV has higher power than a simple ANOVA on PC, especially when there is little correlation between baseline and post-treatment values. However, SPC with either the ANOVA or the ANCOVA seems to be a good option in the case of additive or multiplicative correlation between baseline and post-treatment values.
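As an illustration of how such comparisons are made, here is a rough simulation sketch of the kind used in these studies; the group size, correlation and treatment effect below are arbitrary assumptions, so the resulting power estimates are for illustration only:

## Rough power simulation: ANOVA on percent change (PC) vs ANCOVA on
## post-treatment values (PV), under an additive correlation structure.
set.seed(1)
n    <- 20      # animals per group (assumed)
rho  <- 0.5     # assumed correlation between baseline and post-treatment values
nsim <- 2000

pvals <- replicate(nsim, {
  grp  <- factor(rep(c("control", "treated"), each = n))
  base <- rnorm(2 * n, mean = 100, sd = 15)
  ## post correlated with baseline, plus an assumed treatment effect of -10
  post <- 100 + rho * (base - 100) + rnorm(2 * n, sd = 15 * sqrt(1 - rho^2)) -
          ifelse(grp == "treated", 10, 0)
  pc   <- 100 * (post - base) / base
  c(anova_pc  = summary(aov(pc ~ grp))[[1]][["Pr(>F)"]][1],
    ancova_pv = summary(aov(post ~ base + grp))[[1]][["Pr(>F)"]][2])
})

rowMeans(pvals < 0.05)   # empirical power of the two approaches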

The simulation studies have tried to mimic various real scenarios. Vickers has studied in detail the situation where the outcome is continuous and there is an additive correlation between post treatment and baseline values. SPC as an outcome measure was not studied (Vickers A. J. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Medical Research Methodology (2001) 1:6). Meanwhile, Berry & Ayers consider count data, include SPC as an outcome measure and consider parametric and non-parametric analysis methods in the presence of additive and multiplicative correlation between baseline and post treatment values (Berry D. A. & Ayers G. D. Symmetrized Percent Change for Treatment Comparisons. The American Statistician (2006) 60:1, 27-31).

At the time the statistical analysis plan is designed, there is bound to be little information available on the extent and type of correlation between the baseline and post-treatment values. Hence there is a need for extensive reviews and/or simulations, taking into account the types of data and correlation structures encountered in different therapeutic areas, to identify (therapeutic-area-specific!) best practices for the choice of the outcome measure and analysis method for the pretest – posttest design.

Calculating Standard Deviation

Most standard books on Biostatistics give two formulae for calculating the standard deviation (SD) of a set of observations.

The population SD (for a finite population) is defined as \sqrt{\frac{\sum_{i=1}^{N} \left( X_i - \mu \right)^2}{N}}, where X_1, X_2, …, X_N are the N observations and μ is the mean of the finite population.

The sample SD is defined as \sqrt{\frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}{n-1}}, where x_1, x_2, …, x_n are the n observations in the sample and \bar{x} is the sample mean.
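Written out in R (using the built-in chickwts data purely as an example), the two formulae look like this; note that R’s own sd() function uses the n − 1 version:

## the two formulae in R; sd() divides by n - 1
sd_div_n  <- function(x) sqrt(sum((x - mean(x))^2) / length(x))
sd_div_n1 <- function(x) sqrt(sum((x - mean(x))^2) / (length(x) - 1))

x <- chickwts$weight
c(divide_by_n = sd_div_n(x), divide_by_n_minus_1 = sd_div_n1(x), builtin_sd = sd(x))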

While teaching a Biostatistics course, it can be a little difficult to explain to students why there are two different formulae, and why we subtract 1 from n when calculating the SD from sample observations. These are students who are (or will be) collecting samples, performing experiments, obtaining sample data, and analyzing data. Hence it is important that they understand the concept from a practitioner’s point of view rather than a theoretical one.

In my experience, the best explanation has been to avoid a theoretical discussion of unbiasedness etc. and to state that to calculate the SD from sample data, we would also need to know the true (population) mean. Since the population mean is unknown, we use the same n observations in the sample to first calculate the sample mean \bar{x} as an estimate of the population mean μ. Our sample observations, however, tend to be closer to the sample mean than to the population mean, and hence the SD calculated using the first formula would underestimate the true variability. To correct for this underestimation we divide by n – 1 instead of n. Then, why n – 1? We have already used the n observations to calculate one quantity, i.e., \bar{x}. We are then left with only n – 1 ‘degrees of freedom’ for further calculations that use \bar{x}.
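The underestimation is easy to demonstrate with a small simulation; in the sketch below (the population, sample size and number of replicates are arbitrary choices) we repeatedly draw small samples from a population with known SD and average the two estimates:

## Small simulation: dividing by n systematically underestimates the true SD
set.seed(123)
true_sd <- 10
n       <- 5
nsim    <- 10000
res <- replicate(nsim, {
  x <- rnorm(n, mean = 50, sd = true_sd)
  c(divide_by_n         = sqrt(sum((x - mean(x))^2) / n),
    divide_by_n_minus_1 = sd(x))
})
rowMeans(res)   # the divide-by-n average falls clearly short of the true SD of 10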

I also state three more points to complete the topic:

  1. The quantity n – 1 is referred to as the degrees of freedom for computing the SD, and we say that 1 degree of freedom has been used up in estimating μ from the sample data.
  2. For most situations that the students would encounter, the complete population is not available and they would have to work with sample data. So in most (if not all) situations, they should be using the formula with n – 1 to calculate the SD.
  3. As the sample size increases, the differences arising out of using n or n – 1 should be of little practical importance. But it is still conceptually correct to use n – 1.

When the students were then asked how they would define degrees of freedom, one of them came up with ‘number of observations minus number of calculations’, which is pretty much what it is as far as a practitioner is concerned!

Seeing India through Food – An Experiment in Multidimensional Scaling

The ‘Household Consumption of Various Goods and Services in India’ report from the National Sample Survey Office (NSSO), Government of India includes survey data on the monthly per capita quantity of consumption of selected food items. The report is available from http://mospi.nic.in/Mospi_New/Admin/publication.aspx (a registration is required). The per capita quantity of consumption data is provided for every state and union territory of India, separately for rural and urban sectors. Here is a snapshot of the data for the urban sector from the February 2012 report:

                   rice.kg  wheat.kg  arhar.kg  moong.kg  masur.kg  urd.kg
AndhraPradesh        8.764     0.656     0.412     0.095     0.032   0.173
ArunachalPradesh     10.49     0.763     0.066     0.079     0.296   0.007
Assam               10.246     1.109     0.044     0.105     0.363   0.047
Bihar                5.804     5.732     0.088     0.057     0.222   0.025
Chhattisgarh         7.643     2.686     0.685     0.024     0.053   0.054

The results of an analysis of this food consumption data using Multidimensional Scaling (MDS) are presented in this post. The objective of this analysis was to see what kind of clustering patterns emerge among the states of India, as far as food consumption goes.

MDS was carried out in R (version 3.1.2), using the isoMDS() function from the MASS package with the ‘manhattan’ distance measure.
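The essential steps can be sketched as follows, assuming the consumption data have been extracted from the report and saved to a CSV file with the states as row names (the file name below is an illustrative assumption):

## read the per capita consumption data (file name is illustrative)
library(MASS)
food <- read.csv("nsso_urban_food.csv", row.names = 1)

## manhattan distances between states, followed by non-metric MDS in 2 dimensions
d   <- dist(food, method = "manhattan")
fit <- isoMDS(d, k = 2)

## plot the two-dimensional configuration, labelling the points with the state names
plot(fit$points, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(fit$points, labels = rownames(food), cex = 0.7)

The plot below is a visualization of the results from the MDS (only for the sake of clarity, the union territories are not shown in the plot):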

[MDS plot: two-dimensional configuration of the Indian states based on food consumption]

Here is a map of India to compare the above plot with. Don’t you think that the relative positioning of the different states of India is nicely captured in the MDS analysis?

It is often said that the culinary diversity in India is the result of the diversity in geography, climate, economy, tradition and culture within the country. All of these factors also influence, to an extent, the consumption of basic food items, which is why geographically and culturally close regions cluster together even though the data analyzed were only on food.

Some more ‘food’ for thought:

  1. Though Goa is geographically close to the states of Karnataka and Maharashtra, with respect to food consumption pattern it appears closer to Kerala. This probably reflects the strong coastal influence on both Goa and Kerala. Karnataka and Maharashtra too have long coastlines, but the influence of their large inland areas may have diluted the coastal effect. In contrast, Kerala being a narrow state, the influence of the sea on its cuisine is strong. The relative positioning of Sikkim too is a bit off, and I am unable to explain why.
  2. What would an MDS plot using data from rural areas alone or from urban areas alone look like? This would be a topic for a future post.
  3. The NSSO has meanwhile made available a more recent report in 2014. We can explore how the data from the newer report compares to the 2012 data.

 

A short post: A Shiny Application for Visualizing Colors in R

Choosing the right colors for graphics is vital for good data visualization. There are 657 colors defined in R using color names. The names of these colors can be obtained using the R command:

 colors() 

Some examples of color names in R include exotic sounding ones such as ‘steelblue1’, ‘thistle’, ‘mintcream’ and so on! But how do these colors actually look onscreen?

Show R colors is a web application (app) developed using Shiny that can be used to quickly see how the named colors in R look onscreen, before they are used in graphics. Shiny (http://shiny.rstudio.com/) is an R package that makes it easy to build interactive apps from R. Very little knowledge of HTML, CSS, or JavaScript is required to develop simple apps, though some knowledge of these can help in creating more stunning applications.
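For the curious, here is a minimal single-file sketch of what such an app could look like (an illustration only, not the actual code behind ‘Show R colors’): the user picks one of the named colors and a swatch of that color is drawn on screen.

## minimal illustrative Shiny app: pick a named R color and see it onscreen
library(shiny)

ui <- fluidPage(
  titlePanel("Show R colors"),
  sidebarLayout(
    sidebarPanel(selectInput("col", "Choose a color:", choices = colors())),
    mainPanel(plotOutput("swatch"))
  )
)

server <- function(input, output) {
  output$swatch <- renderPlot({
    ## draw a filled rectangle in the selected color
    plot(0:1, 0:1, type = "n", axes = FALSE, xlab = "", ylab = "")
    rect(0, 0, 1, 1, col = input$col, border = NA)
    title(main = input$col)
  })
}

shinyApp(ui = ui, server = server)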

I developed this app as a course project for the ‘Developing Data Products’ course from Coursera and have since used it many times to choose colors for my graphs!

M$ Excel for Statistical Analysis?

Many statisticians do not advocate using Microsoft Excel for statistical analysis, except perhaps for obtaining the simplest of data summaries and charts. Even the charts produced using the default options in Excel are often considered chartjunk. However, many introductory Statistics courses use Excel in their statistical computing labs, and there is no disputing the fact that Excel is an extremely easy-to-use software tool.

This post is based on a real situation that arose when illustrating t-tests in Excel (Microsoft Excel 2010) for a Biostatistics course.

The question was whether the onset of BRCA mutation-related breast cancers happens at an earlier age in subsequent generations. The sample data provided were the ages (in years) at onset of BRCA mutation-related breast cancer for mother–daughter pairs. Here is the data:

Mother   Daughter
  47        42
  46        41
  42        42
  40        39
  48        44
  48        45
  49        41
  38        45
  50        44
  47        48
  46        39
  43        36
  54        44
  48        46
  49        46
  45
  39        40
  48        36
  46        43
            41
  49        42
  48        39
  49        43
  45        47
  36        44

Some students in the course used the Excel function T.TEST and obtained a one-tailed p-value of 0.001696 for the paired t-test. Other students used the ‘Data Analysis’ add-in from the ‘Data’ tab and obtained the following results:

t-Test: Paired Two Sample for Means

                               Variable 1   Variable 2
Mean                             45.83333       42.375
Variance                         17.97101     10.07065
Observations                           24           24
Pearson Correlation              0.143927
Hypothesized Mean Difference            0
df                                     23
t Stat                           1.247242
P(T<=t) one-tail                  0.11243
t Critical one-tail              1.713872
P(T<=t) two-tail                  0.22486
t Critical two-tail              2.068658

The one-tailed p-value reported here is 0.11243! Surprised by this discrepancy, we decided to verify the analysis with hand calculations. The correct p-value is indeed the 0.001696 obtained from the T.TEST function.

So what is the problem with the results from the ‘Data Analysis’ add-in? A closer look at the results table reveals that the reported degrees of freedom (df) is 23, which corresponds to 24 pairs. However, because of the missing values in the dataset, the number of usable pairs for analysis is only 23. The correct df is therefore 22.
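As an aside, the correct analysis is easy to reproduce in R, whose t.test() function drops incomplete pairs automatically; entering the two columns above with NA for the empty cells:

## ages at onset; NA marks the empty cells in the table above
mother   <- c(47, 46, 42, 40, 48, 48, 49, 38, 50, 47, 46, 43, 54,
              48, 49, 45, 39, 48, 46, NA, 49, 48, 49, 45, 36)
daughter <- c(42, 41, 42, 39, 44, 45, 41, 45, 44, 48, 39, 36, 44,
              46, 46, NA, 40, 36, 43, 41, 42, 39, 43, 47, 44)

## paired, one-tailed t-test; the two incomplete pairs are dropped, giving df = 22
t.test(mother, daughter, paired = TRUE, alternative = "greater")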

This shows that missing values in the data are not handled correctly by the ‘Data Analysis’ add-in. A search shows that this problem with the add-in was reported as early as 2000, for Microsoft Excel 2000 (http://link.springer.com/content/esm/art:10.1007/s00180-014-0482-5/file/MediaObjects/180_2014_482_MOESM1_ESM.pdf). Unfortunately, the error has never been corrected in subsequent versions of the software. It seems the bad charts are the least of Excel’s problems! Other reported issues include poor numerical accuracy, poor random number generation and errors in the data analysis add-ins.

Having said all of this, we certainly cannot deny that Excel has been, and continues to be, a very useful tool for demonstrating and conducting basic data exploration and statistical analysis, especially for a non-statistician audience. The post at http://stats.stackexchange.com/questions/3392/excel-as-a-statistics-workbench provides a balanced view of the pros and cons of using Excel for data analysis. The takeaway for us is to be extremely careful when using Microsoft Excel for data management and statistical analysis, and to double-check the results of any such data operation and analysis.