Many statisticians do not advocate using Microsoft Excel for statistical analysis, except, perhaps, for obtaining the simplest of data summaries and charts. Even the charts produced using the default options in Excel are considered chartjunk. However, many introductory Statistics courses use Excel as the tool for their statistical computing labs, and there is no disputing the fact that Excel is an extremely easy-to-use software tool.
This post is based on a real situation that arose when illustrating t-tests in Excel (Microsoft Excel 2010) for a Biostatistics course.
The question was whether the onset of BRCA mutation-related breast cancers occurs at an earlier age in subsequent generations. The sample data provided was the age (in years) at onset of BRCA mutation-related breast cancer for mother-daughter pairs. Here is the data:
| Mother | Daughter |
|--------|----------|
| 47 | 42 |
| 46 | 41 |
| 42 | 42 |
| 40 | 39 |
| 48 | 44 |
| 48 | 45 |
| 49 | 41 |
| 38 | 45 |
| 50 | 44 |
| 47 | 48 |
| 46 | 39 |
| 43 | 36 |
| 54 | 44 |
| 48 | 46 |
| 49 | 46 |
| 45 |    |
| 39 | 40 |
| 48 | 36 |
| 46 | 43 |
| 41 |    |
| 49 | 42 |
| 48 | 39 |
| 49 | 43 |
| 45 | 47 |
| 36 | 44 |
Some students in the course used the Excel function T.TEST and obtained a one-tailed p-value of 0.001696 for the paired t-test. Other students used the ‘Data Analysis’ add-in from the ‘Data’ tab and obtained the following results:
| t-Test: Paired Two Sample for Means | Variable 1 | Variable 2 |
|-------------------------------------|------------|------------|
| Mean | 45.83333 | 42.375 |
| Variance | 17.97101 | 10.07065 |
| Observations | 24 | 24 |
| Pearson Correlation | 0.143927 | |
| Hypothesized Mean Difference | 0 | |
| df | 23 | |
| t Stat | 1.247242 | |
| P(T<=t) one-tail | 0.11243 | |
| t Critical one-tail | 1.713872 | |
| P(T<=t) two-tail | 0.22486 | |
| t Critical two-tail | 2.068658 | |
The one-tailed p-value reported here is 0.11243! Surprised at this discrepancy, we decided to verify the analysis by hand. The hand calculation confirms that the correct p-value is the 0.001696 obtained from the T.TEST function.
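As a cross-check outside Excel, the same paired t-test can be run with SciPy (a sketch, not part of the course lab; the two rows with a missing daughter age are dropped first, since a paired test needs complete pairs):

```python
from scipy import stats

# Age at onset for the 23 complete mother-daughter pairs; the rows
# with a missing daughter value (mothers aged 45 and 41) are excluded.
mother = [47, 46, 42, 40, 48, 48, 49, 38, 50, 47, 46, 43,
          54, 48, 49, 39, 48, 46, 49, 48, 49, 45, 36]
daughter = [42, 41, 42, 39, 44, 45, 41, 45, 44, 48, 39, 36,
            44, 46, 46, 40, 36, 43, 42, 39, 43, 47, 44]

# One-tailed paired t-test: the alternative is that the mean
# mother-minus-daughter difference is greater than zero,
# i.e. onset occurs at an earlier age in daughters.
t_stat, p_one_tail = stats.ttest_rel(mother, daughter, alternative='greater')
print(f"t = {t_stat:.3f}, one-tailed p = {p_one_tail:.6f}")
# t ≈ 3.284, p ≈ 0.001696 — agreeing with T.TEST, not with the add-in's 0.11243
```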
Now what is the problem with the results from the ‘Data Analysis’ add-in? A closer look at the results table additionally reveals the reported degrees of freedom (df) as 23, from 24 observations in each column. However, because of the two missing values in the dataset, only 23 complete pairs are usable for the analysis. The correct df is therefore 23 − 1 = 22.
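The hand calculation itself is short: take the within-pair differences for the 23 complete pairs, then form the t statistic from their mean and standard deviation, with df = n − 1. A minimal sketch in plain Python:

```python
import math

# Mother-minus-daughter age differences for the 23 complete pairs
diffs = [5, 5, 0, 1, 4, 3, 8, -7, 6, -1, 7, 7,
         10, 2, 3, -1, 12, 3, 7, 9, 6, -2, -8]

n = len(diffs)                     # 23 usable pairs
mean_d = sum(diffs) / n            # mean difference ≈ 3.435
var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
t_stat = mean_d / math.sqrt(var_d / n)
df = n - 1                         # 22, not the 23 the add-in reports

print(df, round(t_stat, 2))        # 22 3.28
```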
This shows that the ‘Data Analysis’ add-in does not handle missing values in the data correctly. A search shows this problem with the add-in was reported as early as 2000, for Microsoft Excel 2000 (http://link.springer.com/content/esm/art:10.1007/s00180-014-0482-5/file/MediaObjects/180_2014_482_MOESM1_ESM.pdf). Unfortunately, the error has never been corrected in subsequent versions of the software. It looks like the bad charts are the least of Excel's problems! Other issues that have been reported include poor numerical accuracy, poor random number generation tools and errors in the data analysis add-ins.
Having said all of this, we certainly cannot deny that Excel has been, and continues to be, a very useful software tool for demonstrating and conducting basic data exploration and statistical analysis, especially for a non-statistician audience. The post at http://stats.stackexchange.com/questions/3392/excel-as-a-statistics-workbench provides a balanced view of the pros and cons of using Excel for data analysis. The takeaway for us is to be extremely careful when using Microsoft Excel for data management and statistical analysis, and to independently verify the results of any such data operation and analysis.