Spaghetti plots with ggplot2 and ggvis

This post was motivated by this article that discusses the graphics and statistical analysis for a two treatment, two period, two sequence (2x2x2) crossover drug interaction study of a new drug versus the standard. I wanted to write about implementing those graphics and the statistical analysis in R. This post is devoted to the different ways of generating the spaghetti plot in R, and the statistical analysis part will follow in the next post.

Spaghetti plots, are often used to visualize repeated measures data. These graphs can be used to visualize time trends like this or to visualize the outcome of different treatments on the same subjects, as in figure3 of the article above. Briefly, in Spaghetti plots, the responses for the same subject, either over time or over different treatments, are connected by lines to show the subject-wise trends. Sometimes, different line types or colors are used to distinguish each subject profile. The plot looks like a plate of spaghetti, that’s probably the reason for the name.


The dataset for illustrating spaghetti plots can be obtained from ocdrug.dat.txt and a brief description of the dataset is at ocdrug.txt. I first saved the files to a local directory and then read the data into a R dataframe and assigned appropriate column labels:

ocdrug <- read.table(paste(workdir,"ocdrug.dat.txt",sep=""),sep="") 
## “workdir” is the name of the variable storing the directory name where the data file is stored 
colnames(ocdrug) <- c("ID","Seq","Period","Tmnt","EE_AUC","EE_Cmax","NET_AUC","NET_Cmax")

## Give nice names to the treatments (OCD and OC) and the treatment sequence 
ocdrug$Seq <- factor(ifelse(ocdrug$Seq == 1,"OCD-OC","OC-OCD"))
ocdrug$Tmnt <- factor(ifelse(ocdrug$Tmnt == 0,"OC","OCD"), levels = c("OCD", "OC"))


Spaghetti plot using ggplot2

It is possible to make a spaghetti plot using base R graphics using the function interaction.plot(). We however do not discuss this approach here, but go directly to the approach using ggplot2. We want to exactly reproduce figure 3 of the article that actually has four sub-figures. In base R, we can use mfrow(), but in ggplot2, one way to achieve this is to first create the 4 individual figures and arrange them using the grid.arrange() function in package gridExtra. First, we load the required packages,

require(gridExtra)  ## required to arrange ggplot2 plots in a grid

and create a theme common for all the graphs:

mytheme <- theme_classic() %+replace% 
        theme(axis.title.x = element_blank(), 
        axis.title.y = element_text(face="bold",angle=90))  

We then make the first sub-figure. This is for the EE_AUC. The y-axis is in log10 scale:

p1 <- ggplot(data = ocdrug, aes(x = Tmnt, y = EE_AUC, group = ID, colour = Seq)) +
    mytheme +
    coord_trans(y="log10", limy=c(1000,6000)) +
    labs(list(title = "AUC", y = paste("EE","\n","pg*hr/mL"))) + 
    geom_line(size=1) + theme(legend.position="none")

Making the remaining three graphs follows along similar lines. Note that in the graphs p2, p3 and p4 the points for some subjects (outliers?) are labeled. We can get the labels using geom_text() and choosing the subjects to be labeled. We also include a legend below graphs p3 and p4.

p2 <- ggplot(data = ocdrug, aes(x = Tmnt, y = EE_Cmax, group = ID, colour = Seq)) +
    mytheme +
    coord_trans(y="log10", limy=c(100,700)) +
    labs(list(title = "Cmax", y = paste("EE","\n","pg/mL"))) + 
    geom_line(size=1) + 
    geom_text(data=subset(ocdrug, ID %in% c(2,20)), aes(Tmnt,EE_Cmax,label=ID)) +

p3 <- ggplot(data = ocdrug, aes(x = Tmnt, y = NET_AUC, group = ID, colour = Seq)) +
    mytheme +
    coord_trans(y="log10", limy=c(80000,300000)) +    
    labs(list(y = paste("NET","\n","pg*hr/mL"))) + 
    geom_line(size=1) + 
    geom_text(data=subset(ocdrug, ID %in% c(18,22,20)), aes(label=ID), show_guide = F) +
    scale_colour_discrete(name="Sequence: ", labels=c("OCD then OC", "OC then OCD")) + 

p4 <- ggplot(data = ocdrug, aes(x = Tmnt, y = NET_Cmax, group = ID, colour = Seq)) +
    mytheme +
    coord_trans(y="log10", limy=c(10000,60000)) +
    labs(list(y = paste("NET","\n","pg/mL"))) + 
    geom_line(size=1) + 
    geom_text(data=subset(ocdrug, ID == 9), aes(label=ID), show_guide = F) +
    scale_colour_discrete(name="Sequence: ", labels=c("OCD then OC", "OC then OCD")) + 

Finally, we arrange plots p1 through p4 as a matrix, using the function grid.arrange() and save it to a .png file:

png(filename = paste(workdir,"ByTmnt_ggplot2.png",sep=""), width = 640, height = 640, bg="transparent")
grid.arrange(p1, p2, p3, p4, ncol = 2)

Creating an interactive spaghetti plot with ggvis

Having recreated figure3 of the article using ggplot2, I then wanted to make an interactive version of the plot. The R package ggvis can be used to provide some interactive features. Here is the user interaction that we wish to add:

  1. To be able to select which (of the four) plot to view
  2. To provide a tooltip to the user, that gives info on the subject ID when the cursor is pointed at a point or line in the graph

To create a plot in ggvis that includes a tooltip, we need to first create an identifier for each row in the dataset like so:

ocdrug$uid <- 1:nrow(ocdrug)  # Add an unique id column to use as the key
all_values <- function(x) {
  if(is.null(x)) return(NULL)
  row <- ocdrug[ocdrug$uid == x$uid,]
  paste0(names(row[1]), ": ", format(row[1]))


ocdrug <- group_by(ocdrug, ID) ## Data is grouped, by subjects ocdrug %>% 

ocdrug %>% 
  ggvis(x = ~Tmnt, y = input_select(c("EE: AUC" = "EE_AUC", "EE: Cmax" = "EE_Cmax",
            "NET: AUC" = "NET_AUC", "NET: Cmax" = "NET_Cmax"), 
            label="Y-aixs variable", map = %>%    ## choose which graph to display
  layer_paths(stroke = ~Seq) %>%    ## color lines by treatment sequence as before
  layer_points(fill = ~Seq) %>%        ## color points by treatment sequence as before
  layer_points(fill = ~Seq, key := ~uid) %>%    ## having to do it twice, 
         ## else the points just seemed to appear and disappear. Have not understood why?
  add_axis("x", title = "Group", title_offset = 50, grid=FALSE) %>%    ## Axes and legend
  add_axis("y", title = "", grid=FALSE) %>%
  scale_numeric("y", trans="log") %>%
  hide_legend("stroke") %>%
  add_legend("fill", title = "Sequence") %>%
  add_tooltip(all_values, "hover")    ## Finally add the tooltip

To display the interactive plot, copy the above code and paste it in an R session. The plot would appear in the browser. The R session should be kept open. The plot below is only a screenshot from the browser and is not interactive.


My original aim was actually to create an interactive version of the ggplot2 graphic that displays all the four graphs at once, but also includes a tooltip, instead of the text labels for selected subjects. I also wanted that pointing at one subject in one particular graph will highlight the profile for that subject, not only that graph, but in the remaining three graphs as well. It however looks like, now, ggvis does not support multiple graphs in the same page. A full-fledged Shiny app may be a solution for someone with no knowledge of html, css, Java etc… I welcome experts to share any other ideas by which such interactivity can be achieved.


Waterfall plots – what and how?

Waterfall plots” are nowadays often used in oncology clinical trials for a graphical representation of the quantitative response of each subject to treatment. For an informative article explaining waterfall plots see Understanding Waterfall Plots.

In this post, we illustrate the creation of waterfall plots in R.

In a typical waterfall plot, the x-axis serves as the baseline value of the response variable. For each subject, vertical bars are drawn from the baseline, either in the positive or negative direction to depict the change from baseline in the response for the subject. The y-axis thus represents the change from baseline in the response, usually expressed as a percentage, for e.g., percent change in the size of the tumor or percent change in some marker level. Most importantly, in a waterfall plot, the bars are ordered in the decreasing order of the percent change values.

Though waterfall plots have gained popularity in oncology, they can be used for data visualization in other clinical trials as well, where the response is expressed as a change from baseline.


Instead of a tumor growth dataset, we illustrate creation of waterfall plots for the visual depiction of a quality of life data. A quality of life dataset, dataqol2 is available with the R package QoLR.

dataqol2$id <- as.factor(dataqol2$id)
dataqol2$time <- as.factor(dataqol2$time)
dataqol2$arm <- as.factor(dataqol2$arm)

dataqol2 contains longitudinal data on scores for 2 quality of life measures (QoL and pain) for 60 subjects. In the case of QoL, higher scores are better since they imply better quality of life, and for pain, lower scores are better since they imply a decrease in pain. Each subject has these scores recorded at baseline (time = 0) and then at a maximum of 5 more time points post baseline. ‘arm’ represents the treatment arm to which the subjects were assigned. The dataset is in long format.

The rest of this post is on creating a waterfall plot in R for the QoL response variable.

Creating a waterfall plot using the barplot function in base R

The waterfall plot is basically an ‘ordered bar chart’, where each bar represents the change from baseline response measure for the corresponding subject.

As the first step, it would be helpful if we change the format of the dataset from ‘long’ to ‘wide’. We use the reshape function to do this. Also, we retain only the QoL scores, but not the pain scores:

qol2.wide <- reshape(dataqol2, v.names="QoL", idvar = "id", timevar = "time", direction = "wide", drop=c("date","pain"))

For each subject, we then find the best (largest) QoL score value post baseline, compute the best percentage change from baseline and order the dataframe in the decreasing order of the best percentage changes. We also remove subjects with missing percent change values:

qol2.wide$bestQoL <- apply(qol2.wide[,5:9], 1 ,function(x) ifelse(sum(! == 0, NA, max(x,na.rm=TRUE)))
qol2.wide$bestQoL.PerChb <- ((qol2.wide$bestQoL-qol2.wide$QoL.0)/qol2.wide$QoL.0)*100

o <- order(qol2.wide$bestQoL.PerChb,decreasing=TRUE,na.last=NA)
qol2.wide <- qol2.wide[o,]

Create the waterfall plot… Finally!

barplot(qol2.wide$bestQoL.PerChb, col="blue", border="blue", space=0.5, ylim=c(-100,100),
main = "Waterfall plot for changes in QoL scores", ylab="Change from baseline (%) in QoL score",
cex.axis=1.2, cex.lab=1.4)


Since we are depicting changes in quality of life scores, the higher the bar is in the positive direction, the better the improvement in the quality of life. So, the above figure shows that, for most subjects, there was improvement in the quality of life post baseline.

We can also color the bars differently by treatment arm, and include a legend. I used the choose_palette() function from the excellent colorspace R package to get some nice colors.

col <- ifelse(qol2.wide$arm == 0, "#BC5A42", "#009296")
barplot(qol2.wide$bestQoL.PerChb, col=col, border=col, space=0.5, ylim=c(-100,100),
main = "Waterfall plot for changes in QoL scores", ylab="Change from baseline (%) in QoL score",
cex.axis=1.2, cex.lab=1.4, legend.text=c(0,1),
args.legend=list(title="Treatment arm", fill=c("#BC5A42","#009296"), border=NA, cex=0.9))


Treatment arm 1 is associated with the largest post baseline increases in the quality of life score. Since waterfall plots are basically bar charts, they can be colored by other relevant subject attributes as well.

The above is a solution to creating waterfall plots using base R graphics function barplot. It is my aim to simultaneously also develop a solution using the ggplot2 package (and in the process, develop expertise in ggplot2). So here it is…

Creating a waterfall plot using ggplot2

We use the previously created qol2.wide dataframe, but in ggplot2, we also need an x variable. So:

x <- 1:nrow(qol2.wide)

Next we specify some plot settings, we color bars differently by treatment arm and allow the default colors of ggplot2, since I think they are quite nice. We also want to remove the x-axis, and put sensible limits for the y-axis:

b <- ggplot(qol2.wide, aes(x=x, y=bestQoL.PerChb, fill=arm, color=arm)) +
scale_fill_discrete(name="Treatment\narm") + scale_color_discrete(guide="none") +
labs(list(title = "Waterfall plot for changes in QoL scores", x = NULL, y = "Change from baseline (%) in QoL score")) +
theme_classic() %+replace%
theme(axis.line.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.y = element_text(face="bold",angle=90)) +
coord_cartesian(ylim = c(-100,100))

Finally, the actual bars are drawn using geom_bar(), and we specify the width of the bars and the space between bars. We specify stat="identity" because we want the heights of the bars to represent actual values in the data. See ?geom_bar

 b <- b + geom_bar(stat="identity", width=0.7, position = position_dodge(width=0.4))


Update: Readers pointed out about a ‘waterfall chart’ in finance that seems to be somewhat different than the graphic discussed in this post, and they seem to use the word ‘chart’ instead of ‘plot’. Here is some info, that also includes some R code for the waterfall chart used in finance. Here is yet another plot referred to as ‘waterfall plot’ that seems to be used to display spectra.

Waterfall seems to be quite a popular name for plots!

Graphs in R – Overlaying Data Summaries in Dotplots

Dotplots are useful for the graphical visualization of small to medium-sized datasets. These simple plots provide an overview of how the data is distributed, whilst also showing the individual observations. It is however possible to make the simple dotplots more informative by overlaying them with data summaries and/or smooth distributions.

This post is about creating such superimposed dotplots in R – we first see how to create these plots using just base R graphics, and then proceed to create them using the ggplot2 R package.

## First things first - dataset 'chickwts': Weights of
## chickens fed with any one of six different feed types

data(chickwts)  ## load the dataset


Graphs using base R:

## First some plot settings


We first create a dotplot where the median of each group is also displayed as a horizontal line:

## Getting the dotplot first, expanding the x-axis to leave room for the line
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method = "stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")

## Then compute the group-wise medians
medians <- tapply(chickwts[,"weight"], chickwts[,"feed"], median)

## Now add line segments corresponding to the group-wise medians
loc <- 1:length(medians)
segments(loc-0.3, medians, loc+0.3, medians, col="red", lwd=3)

Next , we create a dotplot where the median is shown, along with the 1st and 3rd quartile, i.e., the ‘box’ of the boxplot of the data is overlaid with the dotplot:

## Getting the dotplot first, expanding the x-axis to leave room for the box
stripchart(weight ~ feed, data = chickwts, xlim=c(0.5,6.5), vertical=TRUE, method="stack", offset=0.8, pch=19,
main = "Chicken weights after six weeks", xlab = "Feed Type", ylab = "Weight (g)")

## Now draw the box, but without the whiskers!
boxplot(weight ~ feed, data = chickwts, add=TRUE, range=0, whisklty = 0, staplelty = 0)

Plots similar to ones created above, but using the ggplot2 R package instead:

## Load the ggplot2 package first

## Data and plot settings
p <- ggplot(chickwts, aes(x=feed, y=weight)) +
labs(list(title = "Chicken weights after six weeks", x = "Feed Type", y = "Weight (g)")) +
theme(axis.title.x = element_text(face="bold"), axis.text.x = element_text(face="bold")) +
theme(axis.title.y = element_text(face="bold"), axis.text.y = element_text(face="bold"))

We use the stat_summary function to plot the median line as an errorbar, but we need to define our own function that calculates the group-wise median and produces output in a format suitable for stat_summary like so:

## define custom median function
plot.median <- function(x) {
  m <- median(x)
  c(y = m, ymin = m, ymax = m)

## dotplot with median line
p1 <- p + geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5) +
stat_summary("plot.median", geom="errorbar", colour="red", width=0.5, size=1)

For the dotplot overlaid with the median and the 1st and 3rd quartile, the ‘box’ from the boxplot is plotted using geom_boxplot function:

## dotplot with box
p2 <- p + geom_boxplot(aes(ymin=..lower.., ymax=..upper..)) +
geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5)

Additionally, let’s also plot a dotplot with a violin plot overlaid. We cannot do this in base R!

## dotplot with violin plot
## and add some cool colors
p3 <- p + geom_violin(scale="width", adjust=1.5, trim = FALSE, fill="indianred1", color="darkred", size=0.8) +
geom_dotplot(binaxis='y', stackdir='center', method="histodot", binwidth=5)