Manipulate(d) Regression!

The R package ‘manipulate’ can be used to create interactive plots in RStudio. Though not as versatile as the ‘shiny’ package, ‘manipulate’ can be used to quickly add interactive elements to standard R plots. This can prove useful for demonstrating statistical concepts, especially to a non-statistician audience.

The R code at the end of this post uses the ‘manipulate’ package with a regression plot to illustrate the effect of outliers (and influential) points on the fitted linear regression model. The resulting manipulate(d) plot in RStudio includes a gear icon, which, when clicked, opens up a slider control. The slider can be used to move some data points. The plot changes interactively with the data.

Here are some static figures:

Initial state: It is possible to move two points in the scatter plot, one at the end and one at the center.Initial

An outlier at center has a limited influence on the fitted regression model.


An outlier at the ends of support of x and y ‘moves’ the regression line towards it and is also an influential point!


Here is the complete R code for generating the interactive plot. This is to be run in RStudio.


## First define a custom function that fits a linear regression line 
## to (x,y) points and overlays the regression line in a scatterplot.
## The plot is then 'manipulated' to change as y values change.

linregIllustrate <- function(x, y, e, h.max,{
  max.x <- max(x)
  med.x <- median(x)
  max.xind <- which(x == max.x)
  med.xind <- which(x == med.x)

  y1 <- y     ## Modified y
  y1[max.xind] <- y1[max.xind]+h.max  ## at the end
  y1[med.xind] <- y1[med.xind]  ## at the center
  plot(x, y1, xlim=c(min(x),max(x)+5), ylim=c(min(y1),max(y1)), pch=16, 
       xlab="X", ylab="Y")
  text(x[max.xind], y1[max.xind],"I'm movable!", pos=3, offset = 0.3, cex=0.7, font=2, col="red")
  text(x[med.xind], y1[med.xind],"I'm movable too!", pos=3, offset = 0.3, cex=0.7, font=2, col="red")
  m <- lm(y ~ x)  ## Regression with original set of points, the black line
  abline(m, lwd=2)

  m1 <- lm(y1 ~ x)  ## Regression with modified y, the dashed red line
  abline(m1, col="red", lwd=2, lty=2)

## Now generate some x and y data 
x <- rnorm(35,10,5)
e <- rnorm(35,0,5)
y <- 3*x+5+e

## Plot and manipulate the plot!
manipulate(linregIllustrate(x, y, e, h.max,, 
           h.max=slider(-100, 100, initial=0, step=10, label="Move y at the end"), 
 , 100, initial=0, step=10, label="Move y at the center"))

Drug Interaction Studies – Statistical Analysis

This post is actually a continuation of the previous post, and is motivated by this article that discusses the graphics and statistical analysis for a two treatment, two period, two sequence (2x2x2) crossover drug interaction study of a new treatment versus the standard. Whereas the previous post was devoted to implementing some of the graphics presented in the article, in this post we try to recreate the statistical analysis calculations for the data from the drug interaction study. The statistical analysis is implemented with R.


The drug interaction study dataset can be obtained from ocdrug.dat.txt and a brief description of the dataset is at ocdrug.txt. We first save the files to a local directory and then read the data into R dataframe and assign appropriate column labels:

ocdrug <- read.table(paste(workdir,"ocdrug.dat.txt",sep=""),sep="") 
## “workdir” is the name of the variable containing the directory name where the data file is stored 
colnames(ocdrug) <- c("ID","Seq","Period","Tmnt","EE_AUC","EE_Cmax","NET_AUC","NET_Cmax")

## Give nice names to the treatments (OCD and OC) and the treatment sequence 
ocdrug$Seq <- factor(ifelse(ocdrug$Seq == 1,"OCD-OC","OC-OCD"))
ocdrug$Tmnt <- factor(ifelse(ocdrug$Tmnt == 0,"OC","OCD"), levels = c("OC", "OCD"))

Then log transform the PK parameters:

ocdrug$logEE_AUC <- log(ocdrug$EE_AUC)
ocdrug$logEE_Cmax <- log(ocdrug$EE_Cmax)
ocdrug$logNET_AUC <- log(ocdrug$NET_AUC)
ocdrug$logNET_Cmax <- log(ocdrug$NET_Cmax)


Analysis based on Normal Theory – Linear Mixed Effects Model

We illustrate the normal theory based mixed effects model for the analysis of EE AUC. The analysis for EE Cmax and for NET AUC and Cmax are similar. We use the recommended R package nlme for developing the linear mixed effects models. The models include terms for sequence, period and treatment.


options(contrasts = c(factor = "contr.treatment", ordered = "contr.poly"))
lme_EE_AUC <- lme(logEE_AUC ~ Seq+Period+Tmnt, random=(~1|ID), data=ocdrug)

From the fitted models, we can calculate the point estimates and confidence intervals for the geometric mean ratios for AUC and Cmax for EE and NET. This is illustrated for EE AUC:

tTable_EE_AUC <- summary(lme_EE_AUC)$tTable["TmntOCD",]
logEst <- tTable_EE_AUC[1]
se <- tTable_EE_AUC[2]
df <- tTable_EE_AUC[3]
t <- qt(.95,df)</pre>

## Estimate of geometric mean ratio
gmRatio_EE_AUC <- exp(logEst)
## 90% lower and upper confidence limits
LCL_EE_AUC <- exp(logEst-t*se); UCL_EE_AUC <- exp(logEst+t*se)

The calculations for Cmax and for NET are similar. We can bind all the results from normal theory model into a dataframe:

norm.out <- data.frame(type=rep("Normal Theory",4),
            cmpd=c(rep("EE",2), rep("NET",2)),
            ratio=c(gmRatio_EE_AUC, gmRatio_EE_Cmax, gmRatio_NET_AUC, gmRatio_NET_Cmax),
            LCL=c(LCL_EE_AUC, LCL_EE_Cmax, LCL_NET_AUC, LCL_NET_Cmax),
            UCL=c(UCL_EE_AUC, UCL_EE_Cmax, UCL_NET_AUC, UCL_NET_Cmax))
rownames(norm.out) <- paste(norm.out$cmpd,norm.out$PK,sep="_")

cat("\nSummary of normal theory based analysis\n")

As discussed in the article, the diagnostic plots to assess the appropriateness of the normal theory based linear model, i.e., normal probability plots, scatter plots of observed vs. fitted values, and scatter plots of studentized residuals vs. fitted values, show some deviations from assumptions, at least for Cmax. Hence, a distribution-free or non-parametric analysis may not be unreasonable.

Distribution-free Analysis

A distribution free analysis for bioequivalence is using the two sample Hodges-Lehmann point estimator and the two sample Moses confidence interval. Difference in treatment effects are estimated as the average difference in the two treatment sequence groups with respect to the within-subject period difference in the log AUC (or log Cmax) values. But we must first calculate these within-subject period differences. Transforming the dataset to a ‘wide’ format may be useful for the calculation of these within-subject differences:

ocdrug.wide <- reshape(ocdrug, v.names=c("EE_AUC", "EE_Cmax", "NET_AUC", "NET_Cmax", 
"logEE_AUC", "logEE_Cmax", "logNET_AUC", "logNET_Cmax", "Tmnt"), timevar="Period", 
idvar="ID", direction="wide", new.row.names=NULL)
ocdrug.wide$logEE_AUC_PeriodDiff <- with(ocdrug.wide, logEE_AUC.1 - logEE_AUC.2)

Similarly, the within-subject differences can be calculated for EE Cmax and NET AUC and Cmax. The two sample Hodges-Lehmann point estimator and the two sample Moses confidence interval can be calculated from these within subject differences using the the pairwiseCI function in the R-package pairwiseCI with the method=HL.diff option. Also, as specified in the article, we need to divide the estimates and CI bounds by 2 because the difference is taken twice (sequence and period).

npEst_EE_AUC <- unlist( ~ Seq, data=ocdrug.wide, 
                  method="HL.diff", conf.level=0.90)))
npRatio_EE_AUC <- as.numeric(exp(npEst_EE_AUC[1]/2))
npLCL_EE_AUC <- as.numeric(exp(npEst_EE_AUC[2]/2))
npUCL_EE_AUC <- as.numeric(exp(npEst_EE_AUC[3]/2))

Similarly, the estimates and confidence intervals can be obtained for EE Cmax and NET AUC and Cmax. We bind all the results obtained from distribution free theory analysis into a dataframe:

np.out <- data.frame(type=rep("Distribution Free",4),
            cmpd=c(rep("EE",2), rep("NET",2)),
            ratio=c(npRatio_EE_AUC, npRatio_EE_Cmax, npRatio_NET_AUC, npRatio_NET_Cmax),
            LCL=c(npLCL_EE_AUC, npLCL_EE_Cmax, npLCL_NET_AUC, npLCL_NET_Cmax),
            UCL=c(npUCL_EE_AUC, npUCL_EE_Cmax, npUCL_NET_AUC, npUCL_NET_Cmax))

rownames(np.out) <- paste(np.out$cmpd,np.out$PK,sep="_")

cat("\nSummary of distribution-free analysis\n")

Finally, we try to recreate figure 13 that graphically compares the results of the normal theory and distribution free analyses:

# We collect the quantities to plot in a dataframe and include a dummy x-variable
plotIt <- rbind(np.out,norm.out)
plotIt$x <- c(1,3,5,7,1.5,3.5,5.5,7.5)

# Then plot
png(filename = paste(workdir,"plotCIs.png",sep=""), width = 640, height = 640)
par(font.lab=2, font.axis=2, las=1, font=2, cex=1.6, cex.lab=0.8, cex.axis=0.8)
col=c(rep("magenta",4), rep("lightseagreen",4))
pch=c(rep(17,4), rep(16,4))
plot(ratio ~ x, data=plotIt, type="p", pch=pch, col=col, axes=FALSE, 
     xlim=c(0.5,8), ylim=c(0.7,1.35), xlab="", ylab="Ratio(OCD/OC)")
arrows(plotIt$x, plotIt$LCL, plotIt$x, plotIt$UCL, 
     code=3, length = 0.08, angle = 90, lwd=3, col=col, lty=lty)
axis(side=2, at=c(0.7,0.8,1, 1.25))
text(x=c(1,3,5,7), y=rep(0.7,4), pos=4, labels=rep(c("AUC","Cmax"),2), cex=0.8)
text(x=c(2,6), y=rep(0.65,2), pos=4, labels=c("EE","NET"), xpd=NA, cex=0.8)

legend("top",c(paste("Distribution-Free","\t\t"),"Normal Theory"), 
    col=unique(col), lty=unique(lty), pch=unique(pch), horiz=TRUE, bty="n", cex=0.8, lwd=2)



Summary of normal theory based analysis
         ratio  LCL  UCL
EE_AUC    1.16 1.11 1.23
EE_Cmax   1.10 1.01 1.19
NET_AUC   1.01 0.97 1.04
NET_Cmax  0.86 0.78 0.95

Summary of distribution-free analysis
         ratio  LCL  UCL
EE_AUC    1.16 1.11 1.24
EE_Cmax   1.08 0.99 1.20
NET_AUC   1.00 0.97 1.04
NET_Cmax  0.90 0.82 0.96

These results appear in tables 2-3 and figure 13 of the article.

Treatment comparisons for pretest – posttest experimental studies

Recently, I faced a question on the statistical analysis method for an animal study involving testing of lipid lowering agents. The study was designed along the following lines. Animals were randomized to receive one of the study agents. Prior to initiation of treatment, lipid levels (LDL, HDL and total cholesterol) were measured (baseline). Study drugs were administered ­­daily for xx days and the lipid levels were again measured at end of study (post treatment).

This kind of study design is often labelled as a pretest – posttest design and is quite common in the medical field for comparing different treatments.

In many clinical studies of cholesterol lowering agents, the percent change from baseline is analyzed for between group differences. Hence, I suggested using percent change from baseline as the outcome measure in an ANOVA as the statistical analysis method for the above animal experiment. On further scanning of the literature, I however noted that in spite of the widespread prevalence of the pretest – posttest experimental design, there is a lack of consensus on the best method for the data analysis.

The possible choices for the outcome measure to use in the analysis of data from a pretest – posttest study could be the post-treatment values (PV) or the change between baseline and post treatment values, referred to in the literature as change scores or gain scores (DIFF) or any measure of relative change between baseline and post treatment values. Percentage change from baseline (PC) mentioned above is an example of a measure of relative change. Some of the other measures of relative change include symmetrized percent change (SPC) and log ratio of post-treatment to baseline (LR). Furthermore, the methods for statistical analysis include parametric and non-parametric versions of ANOVA or ANCOVA on any of the above outcome measure. So indeed there exist a lot of possibilities for the analysis of data from a pretest – posttest study!

Various simulation studies have provided us with pointers to guide the choice of the outcome measure and analysis method. The current thinking seems to be that an ANCOVA on PV has higher power than a simple ANOVA on PC, especially in the situation where little correlation exists between baseline and post treatment values. However, SPC with wither the ANOVA or ANCOVA seems to be a good option in the case of additive or multiplicative correlation between baseline and post treatment values.

The simulation studies have tried to mimic various real scenarios. Vickers has studied in detail the situation where the outcome is continuous and there is an additive correlation between post treatment and baseline values. SPC as an outcome measure was not studied (Vickers A. J. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Medical Research Methodology (2001) 1:6). Meanwhile, Berry & Ayers consider count data, include SPC as an outcome measure and consider parametric and non-parametric analysis methods in the presence of additive and multiplicative correlation between baseline and post treatment values (Berry D. A. & Ayers G. D. Symmetrized Percent Change for Treatment Comparisons. The American Statistician (2006) 60:1, 27-31).

At the time of design of the statistical analysis plan, there is bound to be little information available on the extent and type of correlation that may exist between the baseline and post treatment values. Hence there is a need to conduct extensive review and/or simulations, taking into account also the type of data and correlation structures encountered in different therapeutic areas, to identify (therapeutic area specific!) best practices for the choice of the outcome measure and analysis method for the pretest – posttest design.

Calculating Standard Deviation

Most standard books on Biostatistics give two formulae for calculating the standard deviation (SD) of a set of observations.

The population SD (for a finite population) is defined as \sqrt{\frac{\sum_{i=1}^{N}\left ( X_{i} -\mu \right )^{2}}{N}}, where X1, X2, …, XN are the N observations and μ is the mean of the finite population.

The sample SD is defined as \sqrt{\frac{\sum_{i=1}^{n}\left ( x_{i} -\bar{x} \right )^{2}}{n-1}}, where x1, x2, …, xn are the n observations in the sample and \bar{x} is the sample mean.

While teaching a Biostatistics course, it becomes a little difficult to explain to students about why there are two different formulae, and why do we subtract ‘1’ from n when we calculate the SD using sample observations. These are students who are (or will be) collecting samples, performing experiments, obtaining sample data, and analyzing data. Hence it is important that they understand the concept from a practitioners point, rather than a theoretical point of view.

In my experience, the best explanation has been to avoid a theoretical discussion of unbiasedness etc. and to state that if we want to calculate the SD from sample data, we also need to know the true (population) mean. Since the population mean is unknown, we use the same n observations in the sample to first calculate the sample mean \bar{x} as an estimate of the population mean μ. Our sample observations however tend to be more close to the sample mean rather than the population mean and hence the SD calculated using the first formula will underestimate the true variability. To correct for this underestimation we divide by n – 1 instead of n. Then the question of why n – 1? We have already used the n observations to calculate one quantity, i.e., \bar{x} . We are then left with only n – 1degrees of freedom’ for further calculations that use \bar{x} .

I also state three more points to complete the topic:

  1. That the quantity n – 1 is referred to as the degrees of freedom for computing the SD and we say that 1 degree of freedom has been used up while estimating μ using sample data.
  2. For most situations that the students would encounter, the complete population is not available and they would have to work with sample data. So in most (if not all) situations, they should be using the formula with n – 1 to calculate the SD.
  3. As the sample size increases, the differences arising out of using n or n – 1 should be of little practical importance. But it is still conceptually correct to use n – 1.

When the students were then asked as to how they would define degrees of freedom, one of them came up with a definition as ‘number of observations minus number of calculations’ which is pretty much what it is, as far as a practitioner is concerned!

M$ Excel for Statistical Analysis?

Many statisticians do not advocate using Microsoft Excel for statistical analysis, except, for maybe obtaining the most simplest of data summaries and charts. Even the charts produced using the default options in Excel are considered chartjunk. However, many introductory courses in Statistics use Excel as a tool for their statistical computing labs and there is no disputing the fact that Excel is an extremely easy to use software tool.

This post is based on a real situation that arose when illustrating t-tests in Excel (Microsoft Excel 2010) for a Biostatistics course.

The question was whether the onset of BRCA mutation-related breast cancers happens early in age in subsequent generations. The sample data provided was on the age (in years) of onset of BRCA mutation-related breast cancers for mother-daughter pairs. Here is the data:

Mother Daughter
47 42
46 41
42 42
40 39
48 44
48 45
49 41
38 45
50 44
47 48
46 39
43 36
54 44
48 46
49 46
39 40
48 36
46 43
49 42
48 39
49 43
45 47
36 44

Some students in the course used the Excel function t.test and obtained the one-tailed p-value for the paired t-test as 0.001696. Some of the other students used the ‘Data Analysis’ add-in from the ‘Data’ tab and obtained the following results:

t-Test: Paired Two Sample for Means
  Variable 1 Variable 2
Mean 45.83333 42.375
Variance 17.97101 10.07065
Observations 24 24
Pearson Correlation 0.143927
Hypothesized Mean Difference 0
df 23
t Stat 1.247242
P(T<=t) one-tail 0.11243
t Critical one-tail 1.713872
P(T<=t) two-tail 0.22486
t Critical two-tail 2.068658

The one-tailed p-value reported here is 0.11243! Surprised at this discrepancy, we decided to verify the analysis by hand calculations. The correct p-value is the 0.001696 obtained from the t.test function.

Now what is the problem with the results from the ‘Data Analysis’ add-in? A closer look at the results table additionally reveals the reported degrees of freedom (df) as 23. However, we can see that because of the missing values in the dataset, the number of usable pairs for analysis is 23. The correct df is therefore 22.

This shows that missing values in the data are not handled correctly by the ‘Data Analysis’ add-in. A search shows this problem with the add-in reported as early as in the year 2000 with Microsoft Excel 2000 ( Unfortunately, the error has never been corrected in the subsequent versions of the software. It looks like the bad charts are the least of the problems with Excel! Other problems that have been reported include poor numerical accuracy, poor random number generation tools and errors in the data analysis add-ins.

Having said all of this, we certainly cannot also deny that Excel has been, and continues to be a very useful software tool to demonstrate and conduct basic data exploration and statistical analysis, especially for a non-statistician audience. The post at provides a balanced view of the pros and cons of using Excel for data analysis. The take away for us is to be extremely careful when using Microsoft Excel for data management and statistical analysis and to doubly verify the results of any such data operation and analysis.