Forest Plot (with Horizontal Bands)

Forest plots are often used in clinical trial reports to show differences in the estimated treatment effect(s) across various patient subgroups. See, for example a review. The page on Clinical Trials Safety Graphics includes a SAS code for a forest plot that depicts the hazard ratios for various patient subgroups (this web page has links to other interesting clinical trial graphics, as well as links to notes on best practices for graphs, and is worth checking out).

The purpose of this blog post is to create the same forest plot using R.

It should be possible to create such a graphic from first principles, using either base R graphics or using the ggplot2 package such as posted here. However, there is a contributed package forestplot that makes it very easy to make forest plots interspersed with tables – we just need to supply the right arguments to the forestplot function in the package. Also, one question that arose was how easy it would be to get the horizontal grey bands for alternate rows in the forest plot. This too was not very difficult. Following is a short explanation of the entire process, as well as the relevant R code:

First, we store the data for the plot, ForestPlotData in any convenient file format and then read it into a dataframe in R:

workdir <- ""C:\\Path\\To\\Relevant\\Directory""
datafile <- file.path(workdir,"ForestPlotData.csv")
data <- read.csv(datafile, stringsAsFactors=FALSE)

Then format the data a bit so that the column labels and columns match the required graphical output:

## Labels defining subgroups are a little indented!
subgps <- c(4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33)
data$Variable[subgps] <- paste("  ",data$Variable[subgps]) 
## Combine the count and percent column
np <- ifelse(!$Count), paste(data$Count," (",data$Percent,")",sep=""), NA)
## The rest of the columns in the table. 
tabletext <- cbind(c("Subgroup","\n",data$Variable), 
                    c("No. of Patients (%)","\n",np), 
                    c("4-Yr Cum. Event Rate\n PCI","\n",data$PCI.Group), 
                    c("4-Yr Cum. Event Rate\n Medical Therapy","\n",data$Medical.Therapy.Group), 
                    c("P Value","\n",data$P.Value))

Finally, include the forestplot R package and call the forestplot function with appropriate arguments.

The way I got around to creating the horizontal band at every alternate row was by using settings for a very thick transparent line in the hrzl_lines argument! See below. The col=”#99999922″ option gives the light grey color to the line as well as sets it to be transparent.

A graphics device (here, a png file) with appropriate dimensions is first opened and the forest plot is saved to the device.

png(file.path(workdir,"Figures\\Forestplot.png"),width=960, height=640)
forestplot(labeltext=tabletext, graph.pos=3, 
           lower=c(NA,NA,data$Low), upper=c(NA,NA,data$High),
           title="Hazard Ratio",
           xlab="     <---PCI Better---    ---Medical Therapy Better--->",
           hrzl_lines=list("3" = gpar(lwd=1, col="#99999922"), 
                          "7" = gpar(lwd=60, lineend="butt", columns=c(2:6), col="#99999922"),
                          "15" = gpar(lwd=60, lineend="butt", columns=c(2:6), col="#99999922"),
                          "23" = gpar(lwd=60, lineend="butt", columns=c(2:6), col="#99999922"),
                          "31" = gpar(lwd=60, lineend="butt", columns=c(2:6), col="#99999922")),
                              xlab=gpar(cex = 1.2),
                              title=gpar(cex = 1.2)),
           col=fpColors(box="black", lines="black", zero = "gray50"),
           zero=1, cex=0.9, lineheight = "auto", boxsize=0.5, colgap=unit(6,"mm"),
 , ci.vertices=TRUE, ci.vertices.height = 0.4)

Here is the resulting forest plot:


Drug Interaction Studies – Statistical Analysis

This post is actually a continuation of the previous post, and is motivated by this article that discusses the graphics and statistical analysis for a two treatment, two period, two sequence (2x2x2) crossover drug interaction study of a new treatment versus the standard. Whereas the previous post was devoted to implementing some of the graphics presented in the article, in this post we try to recreate the statistical analysis calculations for the data from the drug interaction study. The statistical analysis is implemented with R.


The drug interaction study dataset can be obtained from ocdrug.dat.txt and a brief description of the dataset is at ocdrug.txt. We first save the files to a local directory and then read the data into R dataframe and assign appropriate column labels:

ocdrug <- read.table(paste(workdir,"ocdrug.dat.txt",sep=""),sep="") 
## “workdir” is the name of the variable containing the directory name where the data file is stored 
colnames(ocdrug) <- c("ID","Seq","Period","Tmnt","EE_AUC","EE_Cmax","NET_AUC","NET_Cmax")

## Give nice names to the treatments (OCD and OC) and the treatment sequence 
ocdrug$Seq <- factor(ifelse(ocdrug$Seq == 1,"OCD-OC","OC-OCD"))
ocdrug$Tmnt <- factor(ifelse(ocdrug$Tmnt == 0,"OC","OCD"), levels = c("OC", "OCD"))

Then log transform the PK parameters:

ocdrug$logEE_AUC <- log(ocdrug$EE_AUC)
ocdrug$logEE_Cmax <- log(ocdrug$EE_Cmax)
ocdrug$logNET_AUC <- log(ocdrug$NET_AUC)
ocdrug$logNET_Cmax <- log(ocdrug$NET_Cmax)


Analysis based on Normal Theory – Linear Mixed Effects Model

We illustrate the normal theory based mixed effects model for the analysis of EE AUC. The analysis for EE Cmax and for NET AUC and Cmax are similar. We use the recommended R package nlme for developing the linear mixed effects models. The models include terms for sequence, period and treatment.


options(contrasts = c(factor = "contr.treatment", ordered = "contr.poly"))
lme_EE_AUC <- lme(logEE_AUC ~ Seq+Period+Tmnt, random=(~1|ID), data=ocdrug)

From the fitted models, we can calculate the point estimates and confidence intervals for the geometric mean ratios for AUC and Cmax for EE and NET. This is illustrated for EE AUC:

tTable_EE_AUC <- summary(lme_EE_AUC)$tTable["TmntOCD",]
logEst <- tTable_EE_AUC[1]
se <- tTable_EE_AUC[2]
df <- tTable_EE_AUC[3]
t <- qt(.95,df)</pre>

## Estimate of geometric mean ratio
gmRatio_EE_AUC <- exp(logEst)
## 90% lower and upper confidence limits
LCL_EE_AUC <- exp(logEst-t*se); UCL_EE_AUC <- exp(logEst+t*se)

The calculations for Cmax and for NET are similar. We can bind all the results from normal theory model into a dataframe:

norm.out <- data.frame(type=rep("Normal Theory",4),
            cmpd=c(rep("EE",2), rep("NET",2)),
            ratio=c(gmRatio_EE_AUC, gmRatio_EE_Cmax, gmRatio_NET_AUC, gmRatio_NET_Cmax),
            LCL=c(LCL_EE_AUC, LCL_EE_Cmax, LCL_NET_AUC, LCL_NET_Cmax),
            UCL=c(UCL_EE_AUC, UCL_EE_Cmax, UCL_NET_AUC, UCL_NET_Cmax))
rownames(norm.out) <- paste(norm.out$cmpd,norm.out$PK,sep="_")

cat("\nSummary of normal theory based analysis\n")

As discussed in the article, the diagnostic plots to assess the appropriateness of the normal theory based linear model, i.e., normal probability plots, scatter plots of observed vs. fitted values, and scatter plots of studentized residuals vs. fitted values, show some deviations from assumptions, at least for Cmax. Hence, a distribution-free or non-parametric analysis may not be unreasonable.

Distribution-free Analysis

A distribution free analysis for bioequivalence is using the two sample Hodges-Lehmann point estimator and the two sample Moses confidence interval. Difference in treatment effects are estimated as the average difference in the two treatment sequence groups with respect to the within-subject period difference in the log AUC (or log Cmax) values. But we must first calculate these within-subject period differences. Transforming the dataset to a ‘wide’ format may be useful for the calculation of these within-subject differences:

ocdrug.wide <- reshape(ocdrug, v.names=c("EE_AUC", "EE_Cmax", "NET_AUC", "NET_Cmax", 
"logEE_AUC", "logEE_Cmax", "logNET_AUC", "logNET_Cmax", "Tmnt"), timevar="Period", 
idvar="ID", direction="wide", new.row.names=NULL)
ocdrug.wide$logEE_AUC_PeriodDiff <- with(ocdrug.wide, logEE_AUC.1 - logEE_AUC.2)

Similarly, the within-subject differences can be calculated for EE Cmax and NET AUC and Cmax. The two sample Hodges-Lehmann point estimator and the two sample Moses confidence interval can be calculated from these within subject differences using the the pairwiseCI function in the R-package pairwiseCI with the method=HL.diff option. Also, as specified in the article, we need to divide the estimates and CI bounds by 2 because the difference is taken twice (sequence and period).

npEst_EE_AUC <- unlist( ~ Seq, data=ocdrug.wide, 
                  method="HL.diff", conf.level=0.90)))
npRatio_EE_AUC <- as.numeric(exp(npEst_EE_AUC[1]/2))
npLCL_EE_AUC <- as.numeric(exp(npEst_EE_AUC[2]/2))
npUCL_EE_AUC <- as.numeric(exp(npEst_EE_AUC[3]/2))

Similarly, the estimates and confidence intervals can be obtained for EE Cmax and NET AUC and Cmax. We bind all the results obtained from distribution free theory analysis into a dataframe:

np.out <- data.frame(type=rep("Distribution Free",4),
            cmpd=c(rep("EE",2), rep("NET",2)),
            ratio=c(npRatio_EE_AUC, npRatio_EE_Cmax, npRatio_NET_AUC, npRatio_NET_Cmax),
            LCL=c(npLCL_EE_AUC, npLCL_EE_Cmax, npLCL_NET_AUC, npLCL_NET_Cmax),
            UCL=c(npUCL_EE_AUC, npUCL_EE_Cmax, npUCL_NET_AUC, npUCL_NET_Cmax))

rownames(np.out) <- paste(np.out$cmpd,np.out$PK,sep="_")

cat("\nSummary of distribution-free analysis\n")

Finally, we try to recreate figure 13 that graphically compares the results of the normal theory and distribution free analyses:

# We collect the quantities to plot in a dataframe and include a dummy x-variable
plotIt <- rbind(np.out,norm.out)
plotIt$x <- c(1,3,5,7,1.5,3.5,5.5,7.5)

# Then plot
png(filename = paste(workdir,"plotCIs.png",sep=""), width = 640, height = 640)
par(font.lab=2, font.axis=2, las=1, font=2, cex=1.6, cex.lab=0.8, cex.axis=0.8)
col=c(rep("magenta",4), rep("lightseagreen",4))
pch=c(rep(17,4), rep(16,4))
plot(ratio ~ x, data=plotIt, type="p", pch=pch, col=col, axes=FALSE, 
     xlim=c(0.5,8), ylim=c(0.7,1.35), xlab="", ylab="Ratio(OCD/OC)")
arrows(plotIt$x, plotIt$LCL, plotIt$x, plotIt$UCL, 
     code=3, length = 0.08, angle = 90, lwd=3, col=col, lty=lty)
axis(side=2, at=c(0.7,0.8,1, 1.25))
text(x=c(1,3,5,7), y=rep(0.7,4), pos=4, labels=rep(c("AUC","Cmax"),2), cex=0.8)
text(x=c(2,6), y=rep(0.65,2), pos=4, labels=c("EE","NET"), xpd=NA, cex=0.8)

legend("top",c(paste("Distribution-Free","\t\t"),"Normal Theory"), 
    col=unique(col), lty=unique(lty), pch=unique(pch), horiz=TRUE, bty="n", cex=0.8, lwd=2)



Summary of normal theory based analysis
         ratio  LCL  UCL
EE_AUC    1.16 1.11 1.23
EE_Cmax   1.10 1.01 1.19
NET_AUC   1.01 0.97 1.04
NET_Cmax  0.86 0.78 0.95

Summary of distribution-free analysis
         ratio  LCL  UCL
EE_AUC    1.16 1.11 1.24
EE_Cmax   1.08 0.99 1.20
NET_AUC   1.00 0.97 1.04
NET_Cmax  0.90 0.82 0.96

These results appear in tables 2-3 and figure 13 of the article.

Spaghetti plots with ggplot2 and ggvis

This post was motivated by this article that discusses the graphics and statistical analysis for a two treatment, two period, two sequence (2x2x2) crossover drug interaction study of a new drug versus the standard. I wanted to write about implementing those graphics and the statistical analysis in R. This post is devoted to the different ways of generating the spaghetti plot in R, and the statistical analysis part will follow in the next post.

Spaghetti plots, are often used to visualize repeated measures data. These graphs can be used to visualize time trends like this or to visualize the outcome of different treatments on the same subjects, as in figure3 of the article above. Briefly, in Spaghetti plots, the responses for the same subject, either over time or over different treatments, are connected by lines to show the subject-wise trends. Sometimes, different line types or colors are used to distinguish each subject profile. The plot looks like a plate of spaghetti, that’s probably the reason for the name.


The dataset for illustrating spaghetti plots can be obtained from ocdrug.dat.txt and a brief description of the dataset is at ocdrug.txt. I first saved the files to a local directory and then read the data into a R dataframe and assigned appropriate column labels:

ocdrug <- read.table(paste(workdir,"ocdrug.dat.txt",sep=""),sep="") 
## “workdir” is the name of the variable storing the directory name where the data file is stored 
colnames(ocdrug) <- c("ID","Seq","Period","Tmnt","EE_AUC","EE_Cmax","NET_AUC","NET_Cmax")

## Give nice names to the treatments (OCD and OC) and the treatment sequence 
ocdrug$Seq <- factor(ifelse(ocdrug$Seq == 1,"OCD-OC","OC-OCD"))
ocdrug$Tmnt <- factor(ifelse(ocdrug$Tmnt == 0,"OC","OCD"), levels = c("OCD", "OC"))


Spaghetti plot using ggplot2

It is possible to make a spaghetti plot using base R graphics using the function interaction.plot(). We however do not discuss this approach here, but go directly to the approach using ggplot2. We want to exactly reproduce figure 3 of the article that actually has four sub-figures. In base R, we can use mfrow(), but in ggplot2, one way to achieve this is to first create the 4 individual figures and arrange them using the grid.arrange() function in package gridExtra. First, we load the required packages,

require(gridExtra)  ## required to arrange ggplot2 plots in a grid

and create a theme common for all the graphs:

mytheme <- theme_classic() %+replace% 
        theme(axis.title.x = element_blank(), 
        axis.title.y = element_text(face="bold",angle=90))  

We then make the first sub-figure. This is for the EE_AUC. The y-axis is in log10 scale:

p1 <- ggplot(data = ocdrug, aes(x = Tmnt, y = EE_AUC, group = ID, colour = Seq)) +
    mytheme +
    coord_trans(y="log10", limy=c(1000,6000)) +
    labs(list(title = "AUC", y = paste("EE","\n","pg*hr/mL"))) + 
    geom_line(size=1) + theme(legend.position="none")

Making the remaining three graphs follows along similar lines. Note that in the graphs p2, p3 and p4 the points for some subjects (outliers?) are labeled. We can get the labels using geom_text() and choosing the subjects to be labeled. We also include a legend below graphs p3 and p4.

p2 <- ggplot(data = ocdrug, aes(x = Tmnt, y = EE_Cmax, group = ID, colour = Seq)) +
    mytheme +
    coord_trans(y="log10", limy=c(100,700)) +
    labs(list(title = "Cmax", y = paste("EE","\n","pg/mL"))) + 
    geom_line(size=1) + 
    geom_text(data=subset(ocdrug, ID %in% c(2,20)), aes(Tmnt,EE_Cmax,label=ID)) +

p3 <- ggplot(data = ocdrug, aes(x = Tmnt, y = NET_AUC, group = ID, colour = Seq)) +
    mytheme +
    coord_trans(y="log10", limy=c(80000,300000)) +    
    labs(list(y = paste("NET","\n","pg*hr/mL"))) + 
    geom_line(size=1) + 
    geom_text(data=subset(ocdrug, ID %in% c(18,22,20)), aes(label=ID), show_guide = F) +
    scale_colour_discrete(name="Sequence: ", labels=c("OCD then OC", "OC then OCD")) + 

p4 <- ggplot(data = ocdrug, aes(x = Tmnt, y = NET_Cmax, group = ID, colour = Seq)) +
    mytheme +
    coord_trans(y="log10", limy=c(10000,60000)) +
    labs(list(y = paste("NET","\n","pg/mL"))) + 
    geom_line(size=1) + 
    geom_text(data=subset(ocdrug, ID == 9), aes(label=ID), show_guide = F) +
    scale_colour_discrete(name="Sequence: ", labels=c("OCD then OC", "OC then OCD")) + 

Finally, we arrange plots p1 through p4 as a matrix, using the function grid.arrange() and save it to a .png file:

png(filename = paste(workdir,"ByTmnt_ggplot2.png",sep=""), width = 640, height = 640, bg="transparent")
grid.arrange(p1, p2, p3, p4, ncol = 2)

Creating an interactive spaghetti plot with ggvis

Having recreated figure3 of the article using ggplot2, I then wanted to make an interactive version of the plot. The R package ggvis can be used to provide some interactive features. Here is the user interaction that we wish to add:

  1. To be able to select which (of the four) plot to view
  2. To provide a tooltip to the user, that gives info on the subject ID when the cursor is pointed at a point or line in the graph

To create a plot in ggvis that includes a tooltip, we need to first create an identifier for each row in the dataset like so:

ocdrug$uid <- 1:nrow(ocdrug)  # Add an unique id column to use as the key
all_values <- function(x) {
  if(is.null(x)) return(NULL)
  row <- ocdrug[ocdrug$uid == x$uid,]
  paste0(names(row[1]), ": ", format(row[1]))


ocdrug <- group_by(ocdrug, ID) ## Data is grouped, by subjects ocdrug %>% 

ocdrug %>% 
  ggvis(x = ~Tmnt, y = input_select(c("EE: AUC" = "EE_AUC", "EE: Cmax" = "EE_Cmax",
            "NET: AUC" = "NET_AUC", "NET: Cmax" = "NET_Cmax"), 
            label="Y-aixs variable", map = %>%    ## choose which graph to display
  layer_paths(stroke = ~Seq) %>%    ## color lines by treatment sequence as before
  layer_points(fill = ~Seq) %>%        ## color points by treatment sequence as before
  layer_points(fill = ~Seq, key := ~uid) %>%    ## having to do it twice, 
         ## else the points just seemed to appear and disappear. Have not understood why?
  add_axis("x", title = "Group", title_offset = 50, grid=FALSE) %>%    ## Axes and legend
  add_axis("y", title = "", grid=FALSE) %>%
  scale_numeric("y", trans="log") %>%
  hide_legend("stroke") %>%
  add_legend("fill", title = "Sequence") %>%
  add_tooltip(all_values, "hover")    ## Finally add the tooltip

To display the interactive plot, copy the above code and paste it in an R session. The plot would appear in the browser. The R session should be kept open. The plot below is only a screenshot from the browser and is not interactive.


My original aim was actually to create an interactive version of the ggplot2 graphic that displays all the four graphs at once, but also includes a tooltip, instead of the text labels for selected subjects. I also wanted that pointing at one subject in one particular graph will highlight the profile for that subject, not only that graph, but in the remaining three graphs as well. It however looks like, now, ggvis does not support multiple graphs in the same page. A full-fledged Shiny app may be a solution for someone with no knowledge of html, css, Java etc… I welcome experts to share any other ideas by which such interactivity can be achieved.

Waterfall plots – what and how?

Waterfall plots” are nowadays often used in oncology clinical trials for a graphical representation of the quantitative response of each subject to treatment. For an informative article explaining waterfall plots see Understanding Waterfall Plots.

In this post, we illustrate the creation of waterfall plots in R.

In a typical waterfall plot, the x-axis serves as the baseline value of the response variable. For each subject, vertical bars are drawn from the baseline, either in the positive or negative direction to depict the change from baseline in the response for the subject. The y-axis thus represents the change from baseline in the response, usually expressed as a percentage, for e.g., percent change in the size of the tumor or percent change in some marker level. Most importantly, in a waterfall plot, the bars are ordered in the decreasing order of the percent change values.

Though waterfall plots have gained popularity in oncology, they can be used for data visualization in other clinical trials as well, where the response is expressed as a change from baseline.


Instead of a tumor growth dataset, we illustrate creation of waterfall plots for the visual depiction of a quality of life data. A quality of life dataset, dataqol2 is available with the R package QoLR.

dataqol2$id <- as.factor(dataqol2$id)
dataqol2$time <- as.factor(dataqol2$time)
dataqol2$arm <- as.factor(dataqol2$arm)

dataqol2 contains longitudinal data on scores for 2 quality of life measures (QoL and pain) for 60 subjects. In the case of QoL, higher scores are better since they imply better quality of life, and for pain, lower scores are better since they imply a decrease in pain. Each subject has these scores recorded at baseline (time = 0) and then at a maximum of 5 more time points post baseline. ‘arm’ represents the treatment arm to which the subjects were assigned. The dataset is in long format.

The rest of this post is on creating a waterfall plot in R for the QoL response variable.

Creating a waterfall plot using the barplot function in base R

The waterfall plot is basically an ‘ordered bar chart’, where each bar represents the change from baseline response measure for the corresponding subject.

As the first step, it would be helpful if we change the format of the dataset from ‘long’ to ‘wide’. We use the reshape function to do this. Also, we retain only the QoL scores, but not the pain scores:

qol2.wide <- reshape(dataqol2, v.names="QoL", idvar = "id", timevar = "time", direction = "wide", drop=c("date","pain"))

For each subject, we then find the best (largest) QoL score value post baseline, compute the best percentage change from baseline and order the dataframe in the decreasing order of the best percentage changes. We also remove subjects with missing percent change values:

qol2.wide$bestQoL <- apply(qol2.wide[,5:9], 1 ,function(x) ifelse(sum(! == 0, NA, max(x,na.rm=TRUE)))
qol2.wide$bestQoL.PerChb <- ((qol2.wide$bestQoL-qol2.wide$QoL.0)/qol2.wide$QoL.0)*100

o <- order(qol2.wide$bestQoL.PerChb,decreasing=TRUE,na.last=NA)
qol2.wide <- qol2.wide[o,]

Create the waterfall plot… Finally!

barplot(qol2.wide$bestQoL.PerChb, col="blue", border="blue", space=0.5, ylim=c(-100,100),
main = "Waterfall plot for changes in QoL scores", ylab="Change from baseline (%) in QoL score",
cex.axis=1.2, cex.lab=1.4)


Since we are depicting changes in quality of life scores, the higher the bar is in the positive direction, the better the improvement in the quality of life. So, the above figure shows that, for most subjects, there was improvement in the quality of life post baseline.

We can also color the bars differently by treatment arm, and include a legend. I used the choose_palette() function from the excellent colorspace R package to get some nice colors.

col <- ifelse(qol2.wide$arm == 0, "#BC5A42", "#009296")
barplot(qol2.wide$bestQoL.PerChb, col=col, border=col, space=0.5, ylim=c(-100,100),
main = "Waterfall plot for changes in QoL scores", ylab="Change from baseline (%) in QoL score",
cex.axis=1.2, cex.lab=1.4, legend.text=c(0,1),
args.legend=list(title="Treatment arm", fill=c("#BC5A42","#009296"), border=NA, cex=0.9))


Treatment arm 1 is associated with the largest post baseline increases in the quality of life score. Since waterfall plots are basically bar charts, they can be colored by other relevant subject attributes as well.

The above is a solution to creating waterfall plots using base R graphics function barplot. It is my aim to simultaneously also develop a solution using the ggplot2 package (and in the process, develop expertise in ggplot2). So here it is…

Creating a waterfall plot using ggplot2

We use the previously created qol2.wide dataframe, but in ggplot2, we also need an x variable. So:

x <- 1:nrow(qol2.wide)

Next we specify some plot settings, we color bars differently by treatment arm and allow the default colors of ggplot2, since I think they are quite nice. We also want to remove the x-axis, and put sensible limits for the y-axis:

b <- ggplot(qol2.wide, aes(x=x, y=bestQoL.PerChb, fill=arm, color=arm)) +
scale_fill_discrete(name="Treatment\narm") + scale_color_discrete(guide="none") +
labs(list(title = "Waterfall plot for changes in QoL scores", x = NULL, y = "Change from baseline (%) in QoL score")) +
theme_classic() %+replace%
theme(axis.line.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.y = element_text(face="bold",angle=90)) +
coord_cartesian(ylim = c(-100,100))

Finally, the actual bars are drawn using geom_bar(), and we specify the width of the bars and the space between bars. We specify stat="identity" because we want the heights of the bars to represent actual values in the data. See ?geom_bar

 b <- b + geom_bar(stat="identity", width=0.7, position = position_dodge(width=0.4))


Update: Readers pointed out about a ‘waterfall chart’ in finance that seems to be somewhat different than the graphic discussed in this post, and they seem to use the word ‘chart’ instead of ‘plot’. Here is some info, that also includes some R code for the waterfall chart used in finance. Here is yet another plot referred to as ‘waterfall plot’ that seems to be used to display spectra.

Waterfall seems to be quite a popular name for plots!

Graphs in R – Overlaying Data Summaries in Dotplots

Dotplots are useful for the graphical visualization of small to medium-sized datasets. These simple plots provide an overview of how the data is distributed, whilst also showing the individual observations. It is however possible to make the simple dotplots more informative by overlaying them with data summaries and/or smooth distributions.

This post is about creating such superimposed dotplots in R – we first see how to create these plots using just base R graphics, and then proceed to create them using the ggplot2 R package.

## First things first - dataset 'chickwts': Weights of
## chickens fed with any one of six different feed types

data(chickwts)  ## load the dataset


Graphs using base R:

## First some plot settings


We first create a dotplot where the median of each group is also displayed as a horizontal line:

## Getting the dotplot first, expanding the x-axis to leave room for the line
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method = "stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")

## Then compute the group-wise medians
medians <- tapply(chickwts[,"weight"], chickwts[,"feed"], median)

## Now add line segments corresponding to the group-wise medians
loc <- 1:length(medians)
segments(loc-0.3, medians, loc+0.3, medians, col="red", lwd=3)

Next , we create a dotplot where the median is shown, along with the 1st and 3rd quartile, i.e., the ‘box’ of the boxplot of the data is overlaid with the dotplot:

## Getting the dotplot first, expanding the x-axis to leave room for the box
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method="stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")

## Now draw the box, but without the whiskers!
boxplot(weight ~ feed, data = chickwts, add=TRUE, range=0, whisklty = 0, staplelty = 0)

Plots similar to ones created above, but using the ggplot2 R package instead:

## Load the ggplot2 package first

## Data and plot settings
p <- ggplot(chickwts, aes(x=feed, y=weight)) +
labs(list(title = "Chicken weights after six weeks", x = "Feed Type", y = "Weight (g)")) +
theme(axis.title.x = element_text(face="bold"), axis.text.x = element_text(face="bold")) +
theme(axis.title.y = element_text(face="bold"), axis.text.y = element_text(face="bold"))

We use the stat_summary function to plot the median line as an errorbar, but we need to define our own function that calculates the group-wise median and produces output in a format suitable for stat_summary like so:

## define custom median function
plot.median <- function(x) {
  m <- median(x)
  c(y = m, ymin = m, ymax = m)

## dotplot with median line
p1 <- p + geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5) +
stat_summary("plot.median", geom="errorbar", colour="red", width=0.5, size=1)

For the dotplot overlaid with the median and the 1st and 3rd quartile, the ‘box’ from the boxplot is plotted using geom_boxplot function:

## dotplot with box
p2 <- p + geom_boxplot(aes(ymin=..lower.., ymax=..upper..)) +
geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5)

Additionally, let’s also plot a dotplot with a violin plot overlaid. We cannot do this in base R!

## dotplot with violin plot
## and add some cool colors
p3 <- p + geom_violin(scale="width", adjust=1.5, trim = FALSE, fill="indianred1", color="darkred", size=0.8) +
geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5)

A short post: A Shiny Application for Visualizing Colors in R

Choosing the right colors for graphics is vital for good data visualization. There are 657 colors defined in R using color names. The names of these colors can be obtained using the R command:


Some examples of color names in R include exotic sounding ones such as ‘steelblue1’, ‘thistle’, ‘mintcream’ and so on! But how do these colors actually look onscreen?

Show R colors is a web application (app) developed using Shiny that can be used to quickly see how the named colors in R look onscreen, before they are used in graphics. Shiny ( is an R package that makes it easy to build interactive apps from R. Very little knowledge of HTML, CSS, or JavaScript is required to develop simple apps, though some knowledge of these can help in creating more stunning applications.

I developed this app as a course project for the ‘Developing Data Products’ course from Coursera and since then have used it various times to choose colors for my graphs!

M$ Excel for Statistical Analysis?

Many statisticians do not advocate using Microsoft Excel for statistical analysis, except, for maybe obtaining the most simplest of data summaries and charts. Even the charts produced using the default options in Excel are considered chartjunk. However, many introductory courses in Statistics use Excel as a tool for their statistical computing labs and there is no disputing the fact that Excel is an extremely easy to use software tool.

This post is based on a real situation that arose when illustrating t-tests in Excel (Microsoft Excel 2010) for a Biostatistics course.

The question was whether the onset of BRCA mutation-related breast cancers happens early in age in subsequent generations. The sample data provided was on the age (in years) of onset of BRCA mutation-related breast cancers for mother-daughter pairs. Here is the data:

Mother Daughter
47 42
46 41
42 42
40 39
48 44
48 45
49 41
38 45
50 44
47 48
46 39
43 36
54 44
48 46
49 46
39 40
48 36
46 43
49 42
48 39
49 43
45 47
36 44

Some students in the course used the Excel function t.test and obtained the one-tailed p-value for the paired t-test as 0.001696. Some of the other students used the ‘Data Analysis’ add-in from the ‘Data’ tab and obtained the following results:

t-Test: Paired Two Sample for Means
  Variable 1 Variable 2
Mean 45.83333 42.375
Variance 17.97101 10.07065
Observations 24 24
Pearson Correlation 0.143927
Hypothesized Mean Difference 0
df 23
t Stat 1.247242
P(T<=t) one-tail 0.11243
t Critical one-tail 1.713872
P(T<=t) two-tail 0.22486
t Critical two-tail 2.068658

The one-tailed p-value reported here is 0.11243! Surprised at this discrepancy, we decided to verify the analysis by hand calculations. The correct p-value is the 0.001696 obtained from the t.test function.

Now what is the problem with the results from the ‘Data Analysis’ add-in? A closer look at the results table additionally reveals the reported degrees of freedom (df) as 23. However, we can see that because of the missing values in the dataset, the number of usable pairs for analysis is 23. The correct df is therefore 22.

This shows that missing values in the data are not handled correctly by the ‘Data Analysis’ add-in. A search shows this problem with the add-in reported as early as in the year 2000 with Microsoft Excel 2000 ( Unfortunately, the error has never been corrected in the subsequent versions of the software. It looks like the bad charts are the least of the problems with Excel! Other problems that have been reported include poor numerical accuracy, poor random number generation tools and errors in the data analysis add-ins.

Having said all of this, we certainly cannot also deny that Excel has been, and continues to be a very useful software tool to demonstrate and conduct basic data exploration and statistical analysis, especially for a non-statistician audience. The post at provides a balanced view of the pros and cons of using Excel for data analysis. The take away for us is to be extremely careful when using Microsoft Excel for data management and statistical analysis and to doubly verify the results of any such data operation and analysis.