# Manipulate(d) Regression!

The R package ‘manipulate’ can be used to create interactive plots in RStudio. Though not as versatile as the ‘shiny’ package, ‘manipulate’ can be used to quickly add interactive elements to standard R plots. This can prove useful for demonstrating statistical concepts, especially to a non-statistician audience.

The R code at the end of this post uses the ‘manipulate’ package with a regression plot to illustrate the effect of outliers (and influential) points on the fitted linear regression model. The resulting manipulate(d) plot in RStudio includes a gear icon, which, when clicked, opens up a slider control. The slider can be used to move some data points. The plot changes interactively with the data.

Here are some static figures:

Initial state: It is possible to move two points in the scatter plot, one at the end and one at the center.

An outlier at center has a limited influence on the fitted regression model.

An outlier at the ends of support of x and y ‘moves’ the regression line towards it and is also an influential point!

Here is the complete R code for generating the interactive plot. This is to be run in RStudio.

library(manipulate)

## First define a custom function that fits a linear regression line
## to (x,y) points and overlays the regression line in a scatterplot.
## The plot is then 'manipulated' to change as y values change.

linregIllustrate <- function(x, y, e, h.max, h.med){
max.x <- max(x)
med.x <- median(x)
max.xind <- which(x == max.x)
med.xind <- which(x == med.x)

y1 <- y     ## Modified y
y1[max.xind] <- y1[max.xind]+h.max  ## at the end
y1[med.xind] <- y1[med.xind]+h.med  ## at the center
plot(x, y1, xlim=c(min(x),max(x)+5), ylim=c(min(y1),max(y1)), pch=16,
xlab="X", ylab="Y")
text(x[max.xind], y1[max.xind],"I'm movable!", pos=3, offset = 0.3, cex=0.7, font=2, col="red")
text(x[med.xind], y1[med.xind],"I'm movable too!", pos=3, offset = 0.3, cex=0.7, font=2, col="red")

m <- lm(y ~ x)  ## Regression with original set of points, the black line
abline(m, lwd=2)

m1 <- lm(y1 ~ x)  ## Regression with modified y, the dashed red line
abline(m1, col="red", lwd=2, lty=2)
}

## Now generate some x and y data
x <- rnorm(35,10,5)
e <- rnorm(35,0,5)
y <- 3*x+5+e

## Plot and manipulate the plot!
manipulate(linregIllustrate(x, y, e, h.max, h.med),
h.max=slider(-100, 100, initial=0, step=10, label="Move y at the end"),
h.med=slider(-100, 100, initial=0, step=10, label="Move y at the center"))


# Calculating Standard Deviation

Most standard books on Biostatistics give two formulae for calculating the standard deviation (SD) of a set of observations.

The population SD (for a finite population) is defined as $\sqrt{\frac{\sum_{i=1}^{N}\left ( X_{i} -\mu \right )^{2}}{N}}$, where X1, X2, …, XN are the N observations and μ is the mean of the finite population.

The sample SD is defined as $\sqrt{\frac{\sum_{i=1}^{n}\left ( x_{i} -\bar{x} \right )^{2}}{n-1}}$, where x1, x2, …, xn are the n observations in the sample and $\bar{x}$ is the sample mean.

While teaching a Biostatistics course, it becomes a little difficult to explain to students about why there are two different formulae, and why do we subtract ‘1’ from n when we calculate the SD using sample observations. These are students who are (or will be) collecting samples, performing experiments, obtaining sample data, and analyzing data. Hence it is important that they understand the concept from a practitioners point, rather than a theoretical point of view.

In my experience, the best explanation has been to avoid a theoretical discussion of unbiasedness etc. and to state that if we want to calculate the SD from sample data, we also need to know the true (population) mean. Since the population mean is unknown, we use the same n observations in the sample to first calculate the sample mean $\bar{x}$ as an estimate of the population mean μ. Our sample observations however tend to be more close to the sample mean rather than the population mean and hence the SD calculated using the first formula will underestimate the true variability. To correct for this underestimation we divide by n – 1 instead of n. Then the question of why n – 1? We have already used the n observations to calculate one quantity, i.e., $\bar{x}$. We are then left with only n – 1degrees of freedom’ for further calculations that use $\bar{x}$.

I also state three more points to complete the topic:

1. That the quantity n – 1 is referred to as the degrees of freedom for computing the SD and we say that 1 degree of freedom has been used up while estimating μ using sample data.
2. For most situations that the students would encounter, the complete population is not available and they would have to work with sample data. So in most (if not all) situations, they should be using the formula with n – 1 to calculate the SD.
3. As the sample size increases, the differences arising out of using n or n – 1 should be of little practical importance. But it is still conceptually correct to use n – 1.

When the students were then asked as to how they would define degrees of freedom, one of them came up with a definition as ‘number of observations minus number of calculations’ which is pretty much what it is, as far as a practitioner is concerned!

# M\$ Excel for Statistical Analysis?

Many statisticians do not advocate using Microsoft Excel for statistical analysis, except, for maybe obtaining the most simplest of data summaries and charts. Even the charts produced using the default options in Excel are considered chartjunk. However, many introductory courses in Statistics use Excel as a tool for their statistical computing labs and there is no disputing the fact that Excel is an extremely easy to use software tool.

This post is based on a real situation that arose when illustrating t-tests in Excel (Microsoft Excel 2010) for a Biostatistics course.

The question was whether the onset of BRCA mutation-related breast cancers happens early in age in subsequent generations. The sample data provided was on the age (in years) of onset of BRCA mutation-related breast cancers for mother-daughter pairs. Here is the data:

 Mother Daughter 47 42 46 41 42 42 40 39 48 44 48 45 49 41 38 45 50 44 47 48 46 39 43 36 54 44 48 46 49 46 45 39 40 48 36 46 43 41 49 42 48 39 49 43 45 47 36 44

Some students in the course used the Excel function t.test and obtained the one-tailed p-value for the paired t-test as 0.001696. Some of the other students used the ‘Data Analysis’ add-in from the ‘Data’ tab and obtained the following results:

 t-Test: Paired Two Sample for Means Variable 1 Variable 2 Mean 45.83333 42.375 Variance 17.97101 10.07065 Observations 24 24 Pearson Correlation 0.143927 Hypothesized Mean Difference 0 df 23 t Stat 1.247242 P(T<=t) one-tail 0.11243 t Critical one-tail 1.713872 P(T<=t) two-tail 0.22486 t Critical two-tail 2.068658

The one-tailed p-value reported here is 0.11243! Surprised at this discrepancy, we decided to verify the analysis by hand calculations. The correct p-value is the 0.001696 obtained from the t.test function.

Now what is the problem with the results from the ‘Data Analysis’ add-in? A closer look at the results table additionally reveals the reported degrees of freedom (df) as 23. However, we can see that because of the missing values in the dataset, the number of usable pairs for analysis is 23. The correct df is therefore 22.

This shows that missing values in the data are not handled correctly by the ‘Data Analysis’ add-in. A search shows this problem with the add-in reported as early as in the year 2000 with Microsoft Excel 2000 (http://link.springer.com/content/esm/art:10.1007/s00180-014-0482-5/file/MediaObjects/180_2014_482_MOESM1_ESM.pdf). Unfortunately, the error has never been corrected in the subsequent versions of the software. It looks like the bad charts are the least of the problems with Excel! Other problems that have been reported include poor numerical accuracy, poor random number generation tools and errors in the data analysis add-ins.

Having said all of this, we certainly cannot also deny that Excel has been, and continues to be a very useful software tool to demonstrate and conduct basic data exploration and statistical analysis, especially for a non-statistician audience. The post at http://stats.stackexchange.com/questions/3392/excel-as-a-statistics-workbench provides a balanced view of the pros and cons of using Excel for data analysis. The take away for us is to be extremely careful when using Microsoft Excel for data management and statistical analysis and to doubly verify the results of any such data operation and analysis.