# Forest Plot (with Horizontal Bands)

Forest plots are often used in clinical trial reports to show differences in the estimated treatment effect(s) across various patient subgroups. See, for example, this review. The page on Clinical Trials Safety Graphics includes SAS code for a forest plot that depicts the hazard ratios for various patient subgroups (the web page has links to other interesting clinical trial graphics, as well as to notes on best practices for graphs, and is worth checking out).

The purpose of this blog post is to create the same forest plot using R.

It should be possible to create such a graphic from first principles, using either base R graphics or the ggplot2 package, as posted here, for example. However, the contributed package forestplot makes it very easy to produce forest plots interspersed with tables – we just need to supply the right arguments to the forestplot function in the package. Also, one question that arose was how easy it would be to get the horizontal grey bands for alternate rows in the forest plot. This too was not very difficult. Following is a short explanation of the entire process, along with the relevant R code:

First, we store the data for the plot, ForestPlotData, in any convenient file format and then read it into a dataframe in R:

workdir <- "C:\\Path\\To\\Relevant\\Directory"
datafile <- file.path(workdir,"ForestPlotData.csv")
data <- read.csv(datafile, stringsAsFactors=FALSE)


Then format the data a bit so that the column labels and columns match the required graphical output:

## Labels defining subgroups are a little indented!
subgps <- c(4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33)
data$Variable[subgps] <- paste(" ",data$Variable[subgps])

## Combine the count and percent columns
np <- ifelse(!is.na(data$Count), paste(data$Count," (",data$Percent,")",sep=""), NA)

## The rest of the columns in the table
tabletext <- cbind(c("Subgroup","\n",data$Variable),
                   c("No. of Patients (%)","\n",np),
                   c("4-Yr Cum. Event Rate\n PCI","\n",data$PCI.Group),
                   c("4-Yr Cum. Event Rate\n Medical Therapy","\n",data$Medical.Therapy.Group),
                   c("P Value","\n",data$P.Value))

Finally, load the forestplot package and call the forestplot function with the appropriate arguments. The way I got the horizontal grey band at every alternate row was to specify a very thick, transparent line in the hrzl_lines argument; see below. The col="#99999922" option gives the line its light grey colour and also makes it transparent (the trailing "22" is the alpha component). A graphics device (here, a png file) with appropriate dimensions is first opened and the forest plot is written to the device.

library(forestplot)
png(file.path(workdir,"Figures\\Forestplot.png"), width=960, height=640)
forestplot(labeltext=tabletext, graph.pos=3,
           mean=c(NA,NA,data$Point.Estimate),
           lower=c(NA,NA,data$Low), upper=c(NA,NA,data$High),
           title="Hazard Ratio",
           xlab="     <---PCI Better---    ---Medical Therapy Better--->",
           hrzl_lines=list("3" = gpar(lwd=1, col="#99999922"),
                           "7" = gpar(lwd=60, lineend="butt", columns=c(2:6), col="#99999922"),
                           "15" = gpar(lwd=60, lineend="butt", columns=c(2:6), col="#99999922"),
                           "23" = gpar(lwd=60, lineend="butt", columns=c(2:6), col="#99999922"),
                           "31" = gpar(lwd=60, lineend="butt", columns=c(2:6), col="#99999922")),
           txt_gp=fpTxtGp(label=gpar(cex=1.25),
                          ticks=gpar(cex=1.1),
                          xlab=gpar(cex=1.2),
                          title=gpar(cex=1.2)),
           col=fpColors(box="black", lines="black", zero="gray50"),
           zero=1, cex=0.9, lineheight="auto", boxsize=0.5, colgap=unit(6,"mm"),
           lwd.ci=2, ci.vertices=TRUE, ci.vertices.height=0.4)
dev.off()


Here is the resulting forest plot:

# Manipulate(d) Regression!

The R package ‘manipulate’ can be used to create interactive plots in RStudio. Though not as versatile as the ‘shiny’ package, ‘manipulate’ can be used to quickly add interactive elements to standard R plots. This can prove useful for demonstrating statistical concepts, especially to a non-statistician audience.

The R code at the end of this post uses the ‘manipulate’ package with a regression plot to illustrate the effect of outlier (and influential) points on the fitted linear regression model. The resulting manipulate(d) plot in RStudio includes a gear icon which, when clicked, opens a slider control. The slider can be used to move some data points, and the plot changes interactively with the data.

Here are some static figures:

Initial state: It is possible to move two points in the scatter plot, one at the end and one at the center.

An outlier at center has a limited influence on the fitted regression model.

An outlier at the end of the support of x ‘moves’ the regression line towards it and is also an influential point!

Here is the complete R code for generating the interactive plot. This is to be run in RStudio.

library(manipulate)

## First define a custom function that fits a linear regression line
## to (x,y) points and overlays the regression line in a scatterplot.
## The plot is then 'manipulated' to change as y values change.

linregIllustrate <- function(x, y, e, h.max, h.med){
  max.x <- max(x)
  med.x <- median(x)
  max.xind <- which(x == max.x)
  med.xind <- which(x == med.x)

  y1 <- y     ## Modified y
  y1[max.xind] <- y1[max.xind] + h.max  ## at the end
  y1[med.xind] <- y1[med.xind] + h.med  ## at the center
  plot(x, y1, xlim=c(min(x), max(x)+5), ylim=c(min(y1), max(y1)), pch=16,
       xlab="X", ylab="Y")
  text(x[max.xind], y1[max.xind], "I'm movable!", pos=3, offset=0.3,
       cex=0.7, font=2, col="red")
  text(x[med.xind], y1[med.xind], "I'm movable too!", pos=3, offset=0.3,
       cex=0.7, font=2, col="red")

  m <- lm(y ~ x)  ## Regression with the original set of points, the solid black line
  abline(m, lwd=2)

  m1 <- lm(y1 ~ x)  ## Regression with the modified y, the dashed red line
  abline(m1, col="red", lwd=2, lty=2)
}

## Now generate some x and y data
x <- rnorm(35,10,5)
e <- rnorm(35,0,5)
y <- 3*x+5+e

## Plot and manipulate the plot!
manipulate(linregIllustrate(x, y, e, h.max, h.med),
           h.max=slider(-100, 100, initial=0, step=10, label="Move y at the end"),
           h.med=slider(-100, 100, initial=0, step=10, label="Move y at the center"))


# A Visualization of World Cuisines

In a previous post, we had ‘mapped’ the culinary diversity in India through a visualization of food consumption patterns. Since then, one of the topics in my to-do list was a visualization of world cuisines. The primary question was similar to that asked of the Indian cuisine: Are cuisines of geographically and culturally closer regions also similar? I recently came across an article on the analysis of recipe ingredients that distinguish the cuisines of the world. The analysis was conducted on a publicly available dataset consisting of ingredients for more than 13,000 recipes from the recipe website Epicurious. Each recipe was also tagged with the cuisine it belonged to, and there were a total of 26 different cuisines. This dataset was initially reported in an analysis of flavor network and principles of food pairing.

In this post, we take a (re)look at the Epicurious recipe dataset and perform an exploratory analysis and visualization of ingredient frequencies among cuisines. Ingredients that occur frequently in a region’s recipes will also have high consumption in that region, so an analysis of the ‘ingredient frequency’ of a cuisine should give us information similar to an analysis of ‘ingredient consumption’.

Outline of Analysis Method

Here is a part of the first few lines of data from the Epicurious dataset:

 Vietnamese  vinegar cilantro mint olive_oil cayenne fish lime_juice
 Vietnamese  onion cayenne fish black_pepper seed garlic
 Vietnamese  garlic soy_sauce lime_juice thai_pepper
 Vietnamese  cilantro shallot lime_juice fish cayenne ginger pea
 Vietnamese  coriander vinegar lemon lime_juice fish cayenne scallion
 Vietnamese  coriander lemongrass sesame_oil beef root fish
 …

Each row of the dataset lists the ingredients for one recipe and the first column gives the cuisine the recipe belongs to. As the first step in our analysis, we collect ALL the ingredients for each cuisine (over all the recipes for that cuisine). Then we calculate the frequency of occurrence of each ingredient in each cuisine and normalize the frequencies for each cuisine with the number of recipes available for that cuisine. This matrix of normalized ingredient frequencies is used for further analysis.
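As a toy illustration of this counting-and-normalizing step (two made-up cuisines with hypothetical recipes, not the actual dataset):

```r
## Toy sketch: ingredient counts per cuisine, normalized by the
## number of recipes in that cuisine (recipes here are made up)
recipes <- list(
  Vietnamese = list(c("fish", "lime_juice", "cilantro"), c("fish", "garlic")),
  Thai       = list(c("garlic", "thai_pepper"))
)
norm.freq <- lapply(recipes, function(r) {
  tab <- table(unlist(r))  ## occurrences of each ingredient over all recipes
  tab / length(r)          ## normalize by the number of recipes of the cuisine
})
norm.freq$Vietnamese["fish"]  ## fish occurs in 2 of the 2 Vietnamese recipes -> 1
```

The actual analysis below does the same thing at scale, with strsplit/table over the aggregated ingredient strings.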

We use two approaches for the exploratory analysis of the normalized ingredient frequencies: (1) heatmap and (2) principal component analysis (pca), followed by display using biplots. The complete R code for the analysis is given at the end of this post.

Results

There are a total of 350 ingredients occurring in the dataset (among all cuisines). Some of the ingredients occur in just one cuisine, which, though interesting, will not be of much use for the current analysis. For better visual display, we restrict attention to ingredients showing most variation in normalized frequency across cuisines. The results are shown below:

Heatmap:

Biplot:

The figures look self-explanatory and do show the clustering together of geographically nearby regions on the basis of commonly used ingredients. Moreover, we also notice the grouping together of regions with historical travel patterns (North European and American, Spanish_Portuguese and SouthAmerican/Mexican) or historical trading patterns (Indian and Middle Eastern).

We need to further test the stability of the groupings obtained here by including data from the Allrecipes dataset. Also, adding the third principal component might dissipate some of the crowding along the PC2 axis. These would be some of the tasks for the next post…

Here is the complete R code used for the analysis:

workdir <- "C:\\Path\\To\\Dataset\\Directory"
datafile <- file.path(workdir,"epic_recipes.txt")
data <- read.table(datafile, header=FALSE, fill=TRUE,
                   na.strings=c("", "NA"), stringsAsFactors = FALSE)

a <- aggregate(data[,-1], by=list(data[,1]), paste, collapse=",")
a$combined <- apply(a[,2:ncol(a)], 1, paste, collapse=",")
a$combined <- gsub(",NA","",a$combined) ## this column contains the totality of all ingredients for a cuisine

cuisines <- as.data.frame(table(data[,1])) ## Number of recipes for each cuisine
freq <- lapply(lapply(strsplit(a$combined,","), table), as.data.frame) ## Frequency of ingredients
names(freq) <- a[,1]
prop <- lapply(seq_along(freq), function(i) {
  colnames(freq[[i]])[2] <- names(freq)[i]
  freq[[i]][,2] <- freq[[i]][,2]/cuisines[i,2] ## proportion (normalized frequency)
  freq[[i]]
})
names(prop) <- a[,1] ## this is a list of 26 elements, one for each cuisine

final <- Reduce(function(...) merge(..., all=TRUE, by="Var1"), prop)
row.names(final) <- final[,1]
final <- final[,-1]
final[is.na(final)] <- 0 ## If ingredient missing in all recipes, proportion set to zero
final <- t(final) ## proportion matrix

s <- sort(apply(final, 2, sd), decreasing=TRUE)
## Selecting ingredients with maximum variation in frequency among cuisines and
## Using standardized proportions for final analysis
final_imp <- scale(subset(final, select=names(which(s > 0.1))))

## heatmap
library(gplots)
heatmap.2(final_imp, trace="none", margins = c(6,11), col=topo.colors(7),
key=TRUE, key.title=NA, keysize=1.2, density.info="none")

## PCA and biplot
p <- princomp(final_imp)
biplot(p,pc.biplot=TRUE, col=c("black","red"), cex=c(0.9,0.8),
xlim=c(-2.5,2.5), xlab="PC1, 39.7% explained variance", ylab="PC2, 24.5% explained variance")


# Drug Interaction Studies – Statistical Analysis

This post is actually a continuation of the previous post, and is motivated by this article that discusses the graphics and statistical analysis for a two treatment, two period, two sequence (2x2x2) crossover drug interaction study of a new treatment versus the standard. Whereas the previous post was devoted to implementing some of the graphics presented in the article, in this post we try to recreate the statistical analysis calculations for the data from the drug interaction study. The statistical analysis is implemented with R.

Dataset:

The drug interaction study dataset can be obtained from ocdrug.dat.txt, and a brief description of the dataset is in ocdrug.txt. We first save the files to a local directory, then read the data into an R dataframe and assign appropriate column labels:

ocdrug <- read.table(paste(workdir,"ocdrug.dat.txt",sep=""),sep="")
## "workdir" is the name of the variable containing the directory name
## (ending in a path separator) where the data file is stored
colnames(ocdrug) <- c("ID","Seq","Period","Tmnt","EE_AUC","EE_Cmax","NET_AUC","NET_Cmax")

## Give nice names to the treatments (OCD and OC) and the treatment sequence
ocdrug$Seq <- factor(ifelse(ocdrug$Seq == 1,"OCD-OC","OC-OCD"))
ocdrug$Tmnt <- factor(ifelse(ocdrug$Tmnt == 0,"OC","OCD"), levels = c("OC", "OCD"))


Then log transform the PK parameters:

ocdrug$logEE_AUC <- log(ocdrug$EE_AUC)
ocdrug$logEE_Cmax <- log(ocdrug$EE_Cmax)
ocdrug$logNET_AUC <- log(ocdrug$NET_AUC)
ocdrug$logNET_Cmax <- log(ocdrug$NET_Cmax)


Analysis based on Normal Theory – Linear Mixed Effects Model

We illustrate the normal theory based mixed effects model for the analysis of EE AUC. The analyses for EE Cmax and for NET AUC and Cmax are similar. We use the recommended R package nlme for fitting the linear mixed effects models. The models include terms for sequence, period and treatment.

require(nlme)

options(contrasts = c(factor = "contr.treatment", ordered = "contr.poly"))
lme_EE_AUC <- lme(logEE_AUC ~ Seq+Period+Tmnt, random=(~1|ID), data=ocdrug)


From the fitted models, we can calculate the point estimates and confidence intervals for the geometric mean ratios for AUC and Cmax for EE and NET. This is illustrated for EE AUC:

tTable_EE_AUC <- summary(lme_EE_AUC)$tTable["TmntOCD",]
logEst <- tTable_EE_AUC[1]
se <- tTable_EE_AUC[2]
df <- tTable_EE_AUC[3]
t <- qt(.95, df)

## Estimate of geometric mean ratio
gmRatio_EE_AUC <- exp(logEst)
## 90% lower and upper confidence limits
LCL_EE_AUC <- exp(logEst - t*se)
UCL_EE_AUC <- exp(logEst + t*se)

The calculations for Cmax and for NET are similar. We can bind all the results from the normal theory model into a dataframe:

norm.out <- data.frame(type=rep("Normal Theory",4),
                       cmpd=c(rep("EE",2), rep("NET",2)),
                       PK=rep(c("AUC","Cmax"),2),
                       ratio=c(gmRatio_EE_AUC, gmRatio_EE_Cmax, gmRatio_NET_AUC, gmRatio_NET_Cmax),
                       LCL=c(LCL_EE_AUC, LCL_EE_Cmax, LCL_NET_AUC, LCL_NET_Cmax),
                       UCL=c(UCL_EE_AUC, UCL_EE_Cmax, UCL_NET_AUC, UCL_NET_Cmax))
rownames(norm.out) <- paste(norm.out$cmpd, norm.out$PK, sep="_")

cat("\nSummary of normal theory based analysis\n")
cat("---------------------------------------\n")
print(round(norm.out[,4:6],2))

As discussed in the article, the diagnostic plots used to assess the appropriateness of the normal theory based linear model (normal probability plots, scatter plots of observed vs. fitted values, and scatter plots of studentized residuals vs. fitted values) show some deviations from the assumptions, at least for Cmax. Hence a distribution-free, non-parametric analysis seems reasonable.

Distribution-free Analysis

A distribution-free analysis for bioequivalence uses the two-sample Hodges-Lehmann point estimator and the two-sample Moses confidence interval. The difference in treatment effects is estimated as the average difference between the two treatment sequence groups with respect to the within-subject period difference in the log AUC (or log Cmax) values. But we must first calculate these within-subject period differences. Transforming the dataset to a ‘wide’ format makes this calculation convenient:

ocdrug.wide <- reshape(ocdrug,
                       v.names=c("EE_AUC", "EE_Cmax", "NET_AUC", "NET_Cmax",
                                 "logEE_AUC", "logEE_Cmax", "logNET_AUC", "logNET_Cmax", "Tmnt"),
                       timevar="Period", idvar="ID", direction="wide", new.row.names=NULL)
ocdrug.wide$logEE_AUC_PeriodDiff <- with(ocdrug.wide, logEE_AUC.1 - logEE_AUC.2)


Similarly, the within-subject differences can be calculated for EE Cmax and for NET AUC and Cmax. The two-sample Hodges-Lehmann point estimator and the two-sample Moses confidence interval can be calculated from these within-subject differences using the pairwiseCI function in the R package pairwiseCI with the method="HL.diff" option. Also, as specified in the article, we need to divide the estimates and CI bounds by 2 because the difference is taken twice (over sequence and period).

npEst_EE_AUC <- unlist(as.data.frame(pairwiseCI(logEE_AUC_PeriodDiff ~ Seq, data=ocdrug.wide,
method="HL.diff", conf.level=0.90)))
npRatio_EE_AUC <- as.numeric(exp(npEst_EE_AUC[1]/2))
npLCL_EE_AUC <- as.numeric(exp(npEst_EE_AUC[2]/2))
npUCL_EE_AUC <- as.numeric(exp(npEst_EE_AUC[3]/2))


Similarly, the estimates and confidence intervals can be obtained for EE Cmax and NET AUC and Cmax. We bind all the results obtained from distribution free theory analysis into a dataframe:

np.out <- data.frame(type=rep("Distribution Free",4),
cmpd=c(rep("EE",2), rep("NET",2)),
PK=rep(c("AUC","Cmax"),2),
ratio=c(npRatio_EE_AUC, npRatio_EE_Cmax, npRatio_NET_AUC, npRatio_NET_Cmax),
LCL=c(npLCL_EE_AUC, npLCL_EE_Cmax, npLCL_NET_AUC, npLCL_NET_Cmax),
UCL=c(npUCL_EE_AUC, npUCL_EE_Cmax, npUCL_NET_AUC, npUCL_NET_Cmax))

rownames(np.out) <- paste(np.out$cmpd,np.out$PK,sep="_")

cat("\nSummary of distribution-free analysis\n")
cat("-------------------------------------\n")
print(round(np.out[,4:6],2))


Finally, we try to recreate figure 13 that graphically compares the results of the normal theory and distribution free analyses:

# We collect the quantities to plot in a dataframe and include a dummy x-variable
plotIt <- rbind(np.out, norm.out)
plotIt$x <- c(1, 3, 5, 7, 1.5, 3.5, 5.5, 7.5)

# Then plot
png(filename = paste(workdir,"plotCIs.png",sep=""), width = 640, height = 640)
par(font.lab=2, font.axis=2, las=1, font=2, cex=1.6, cex.lab=0.8, cex.axis=0.8)
col <- c(rep("magenta",4), rep("lightseagreen",4))
lty <- c(rep(2,4), rep(1,4))
pch <- c(rep(17,4), rep(16,4))
plot(ratio ~ x, data=plotIt, type="p", pch=pch, col=col, axes=FALSE,
     xlim=c(0.5,8), ylim=c(0.7,1.35), xlab="", ylab="Ratio(OCD/OC)")
arrows(plotIt$x, plotIt$LCL, plotIt$x, plotIt$UCL,
       code=3, length = 0.08, angle = 90, lwd=3, col=col, lty=lty)
axis(side=2, at=c(0.7,0.8,1, 1.25))
text(x=c(1,3,5,7), y=rep(0.7,4), pos=4, labels=rep(c("AUC","Cmax"),2), cex=0.8)
text(x=c(2,6), y=rep(0.65,2), pos=4, labels=c("EE","NET"), xpd=NA, cex=0.8)
abline(h=c(0.8,1,1.25),lty=3)

legend("top",c(paste("Distribution-Free","\t\t"),"Normal Theory"),
col=unique(col), lty=unique(lty), pch=unique(pch), horiz=TRUE, bty="n", cex=0.8, lwd=2)
dev.off()


Results

Summary of normal theory based analysis
---------------------------------------
         ratio  LCL  UCL
EE_AUC    1.16 1.11 1.23
EE_Cmax   1.10 1.01 1.19
NET_AUC   1.01 0.97 1.04
NET_Cmax  0.86 0.78 0.95

Summary of distribution-free analysis
-------------------------------------
         ratio  LCL  UCL
EE_AUC    1.16 1.11 1.24
EE_Cmax   1.08 0.99 1.20
NET_AUC   1.00 0.97 1.04
NET_Cmax  0.90 0.82 0.96



These results appear in tables 2-3 and figure 13 of the article.

# Treatment comparisons for pretest – posttest experimental studies

Recently, I faced a question on the statistical analysis method for an animal study involving the testing of lipid lowering agents. The study was designed along the following lines. Animals were randomized to receive one of the study agents. Prior to initiation of treatment, lipid levels (LDL, HDL and total cholesterol) were measured (baseline). Study drugs were administered daily for xx days and the lipid levels were again measured at the end of the study (post treatment).

This kind of study design is often labelled as a pretest – posttest design and is quite common in the medical field for comparing different treatments.

In many clinical studies of cholesterol lowering agents, the percent change from baseline is analyzed for between group differences. Hence, I suggested using percent change from baseline as the outcome measure in an ANOVA as the statistical analysis method for the above animal experiment. On further scanning of the literature, I however noted that in spite of the widespread prevalence of the pretest – posttest experimental design, there is a lack of consensus on the best method for the data analysis.

The possible choices for the outcome measure in the analysis of data from a pretest – posttest study are the post-treatment values (PV); the change between baseline and post-treatment values, referred to in the literature as change scores or gain scores (DIFF); or any measure of relative change between baseline and post-treatment values. Percentage change from baseline (PC), mentioned above, is one example of a measure of relative change; others include symmetrized percent change (SPC) and the log ratio of post-treatment to baseline values (LR). Furthermore, the methods of statistical analysis include parametric and non-parametric versions of ANOVA or ANCOVA on any of the above outcome measures. So there exist many possibilities for the analysis of data from a pretest – posttest study!
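To make the candidate outcome measures concrete, here is a small sketch with made-up baseline and post-treatment values. The SPC formula shown divides the change by the average of the two values, which is one common symmetrized definition; exact definitions in the literature vary slightly, so check the cited references.

```r
## Hypothetical baseline and post-treatment LDL values, for illustration only
base <- c(150, 160, 145)
post <- c(120, 150, 130)

PV   <- post                                     ## post-treatment value
DIFF <- post - base                              ## change (gain) score
PC   <- 100 * (post - base) / base               ## percent change from baseline
SPC  <- 100 * (post - base) / ((post + base)/2)  ## symmetrized percent change (one common definition)
LR   <- log(post / base)                         ## log ratio of post to baseline
round(cbind(PV, DIFF, PC, SPC, LR), 2)
```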

Various simulation studies provide pointers to guide the choice of the outcome measure and analysis method. The current thinking seems to be that an ANCOVA on PV has higher power than a simple ANOVA on PC, especially when little correlation exists between baseline and post-treatment values. However, SPC with either ANOVA or ANCOVA seems to be a good option in the case of additive or multiplicative correlation between baseline and post-treatment values.
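The two competing analyses mentioned above might be set up along these lines. This is only a sketch on simulated data; the data-generating mechanism (an additive correlation between baseline and post-treatment values, with a fixed treatment effect) and all variable names are made up for illustration.

```r
set.seed(42)
n <- 30
grp  <- factor(rep(c("Control", "Drug"), each = n))
base <- rnorm(2*n, mean = 150, sd = 15)
## post-treatment value: baseline plus a treatment effect plus noise (additive correlation)
post <- base + ifelse(grp == "Drug", -20, 0) + rnorm(2*n, sd = 10)
pc   <- 100 * (post - base) / base   ## percent change from baseline

fit.ancova <- lm(post ~ base + grp)  ## ANCOVA on post-treatment values (PV)
fit.anova  <- aov(pc ~ grp)          ## ANOVA on percent change (PC)
summary(fit.ancova)
summary(fit.anova)
```

With data of this additive type, the ANCOVA on PV typically gives the narrower confidence interval for the treatment effect, in line with the simulation results cited above.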

The simulation studies have tried to mimic various real scenarios. Vickers has studied in detail the situation where the outcome is continuous and there is an additive correlation between post treatment and baseline values. SPC as an outcome measure was not studied (Vickers A. J. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Medical Research Methodology (2001) 1:6). Meanwhile, Berry & Ayers consider count data, include SPC as an outcome measure and consider parametric and non-parametric analysis methods in the presence of additive and multiplicative correlation between baseline and post treatment values (Berry D. A. & Ayers G. D. Symmetrized Percent Change for Treatment Comparisons. The American Statistician (2006) 60:1, 27-31).

At the time of design of the statistical analysis plan, there is bound to be little information available on the extent and type of correlation that may exist between the baseline and post treatment values. Hence there is a need to conduct extensive review and/or simulations, taking into account also the type of data and correlation structures encountered in different therapeutic areas, to identify (therapeutic area specific!) best practices for the choice of the outcome measure and analysis method for the pretest – posttest design.

# Calculating Standard Deviation

Most standard books on Biostatistics give two formulae for calculating the standard deviation (SD) of a set of observations.

The population SD (for a finite population) is defined as $\sqrt{\frac{\sum_{i=1}^{N}\left ( X_{i} -\mu \right )^{2}}{N}}$, where X1, X2, …, XN are the N observations and μ is the mean of the finite population.

The sample SD is defined as $\sqrt{\frac{\sum_{i=1}^{n}\left ( x_{i} -\bar{x} \right )^{2}}{n-1}}$, where x1, x2, …, xn are the n observations in the sample and $\bar{x}$ is the sample mean.
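A quick check in R: the built-in sd() uses the sample (n − 1) formula, so the population version has to be computed directly. The toy data below are chosen so the arithmetic works out exactly (the mean is 5 and the sum of squared deviations is 32).

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   ## small illustrative dataset; mean is 5
n <- length(x)

pop.sd    <- sqrt(sum((x - mean(x))^2) / n)  ## divide by N (population formula)
sample.sd <- sd(x)                           ## R's sd() divides by n - 1

pop.sd     ## 2
sample.sd  ## sqrt(32/7), about 2.14
```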

While teaching a Biostatistics course, it becomes a little difficult to explain to students why there are two different formulae, and why we subtract ‘1’ from n when we calculate the SD using sample observations. These are students who are (or will be) collecting samples, performing experiments, obtaining sample data, and analyzing data. Hence it is important that they understand the concept from a practitioner’s, rather than a theoretical, point of view.

In my experience, the best explanation avoids a theoretical discussion of unbiasedness and instead states that to calculate the SD from sample data, we would also need to know the true (population) mean. Since the population mean is unknown, we use the same n sample observations to first calculate the sample mean $\bar{x}$ as an estimate of the population mean μ. Our sample observations, however, tend to be closer to the sample mean than to the population mean, and hence the SD calculated using the first formula will underestimate the true variability. To correct for this underestimation, we divide by n – 1 instead of n. Then the question: why n – 1? We have already used the n observations to calculate one quantity, i.e., $\bar{x}$. We are then left with only n – 1 ‘degrees of freedom’ for further calculations that use $\bar{x}$.
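The underestimation itself is easy to demonstrate with a small simulation: draw many samples of size 5 from a normal population with SD 2 (so the true variance is 4), and average the two variance estimates.

```r
set.seed(1)
n <- 5
reps <- 20000
samples <- matrix(rnorm(n * reps, mean = 0, sd = 2), nrow = reps)

v.n  <- apply(samples, 1, function(x) sum((x - mean(x))^2) / n)        ## divide by n
v.n1 <- apply(samples, 1, function(x) sum((x - mean(x))^2) / (n - 1))  ## divide by n - 1

mean(v.n)   ## systematically below the true variance of 4 (about 4*(n-1)/n = 3.2)
mean(v.n1)  ## close to 4
```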

I also state three more points to complete the topic:

1. That the quantity n – 1 is referred to as the degrees of freedom for computing the SD and we say that 1 degree of freedom has been used up while estimating μ using sample data.
2. For most situations that the students would encounter, the complete population is not available and they would have to work with sample data. So in most (if not all) situations, they should be using the formula with n – 1 to calculate the SD.
3. As the sample size increases, the differences arising out of using n or n – 1 should be of little practical importance. But it is still conceptually correct to use n – 1.

When the students were then asked how they would define degrees of freedom, one of them came up with ‘the number of observations minus the number of calculations’, which is pretty much what it is, as far as a practitioner is concerned!

# Seeing India through Food – An Experiment in Multidimensional Scaling

The ‘Household Consumption of Various Goods and Services in India’ report from the National Sample Survey Office (NSSO), Government of India includes survey data on the monthly per capita quantity of consumption of selected food items. The report is available from http://mospi.nic.in/Mospi_New/Admin/publication.aspx (a registration is required). The per capita quantity of consumption data is provided for every state and union territory of India, separately for rural and urban sectors. Here is a snapshot of the data for the urban sector from the February 2012 report:

                  rice.kg wheat.kg arhar.kg moong.kg masur.kg urd.kg …
 AndhraPradesh      8.764    0.656    0.412    0.095    0.032  0.173
 ArunachalPradesh  10.490    0.763    0.066    0.079    0.296  0.007
 Assam             10.246    1.109    0.044    0.105    0.363  0.047
 Bihar              5.804    5.732    0.088    0.057    0.222  0.025
 Chhattisgarh       7.643    2.686    0.685    0.024    0.053  0.054
 …

The results of an analysis of this food consumption data using Multidimensional Scaling (MDS) are presented in this post. The objective behind the analysis was to see what kind of clustering patterns emerge among the states of India, as far as food consumption goes.

MDS was carried out in R (version 3.1.2) using the isoMDS() function from the MASS package with the ‘manhattan’ distance measure. The plot below is a visualization of the results of the MDS (for the sake of clarity, the union territories are not shown in the plot):
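The MDS step itself is short; here is a runnable sketch with a tiny stand-in matrix (the real analysis would substitute the full NSSO consumption matrix, with states in rows and food items in columns — the values below are only for illustration):

```r
library(MASS)  ## for isoMDS

## Tiny stand-in for the consumption matrix (states x food items)
food <- rbind(
  AndhraPradesh = c(8.764, 0.656, 0.412),
  Bihar         = c(5.804, 5.732, 0.088),
  Assam         = c(10.246, 1.109, 0.044),
  Chhattisgarh  = c(7.643, 2.686, 0.685)
)

d   <- dist(food, method = "manhattan")  ## manhattan distance between states
mds <- isoMDS(d, k = 2)                  ## non-metric MDS into two dimensions

plot(mds$points, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds$points, labels = rownames(food), cex = 0.7)
```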

Here is a map of India to compare the above plot with. Don’t you think the relative positioning of the different states of India is nicely captured by the MDS analysis?

It is often said that the culinary diversity in India is the result of the diversity in the geography, climate, economy, tradition and culture within the country. All of these factors also contribute to an extent to the consumption of basic food items, resulting in the clustering together of geographically and culturally nearby regions even though the data analyzed was on food alone.

Some more ‘food’ for thought:

1. Though Goa is geographically close to the states of Karnataka and Maharashtra, with respect to the food consumption pattern Goa seems closer to Kerala. This probably reflects the strong coastal influence on both Goa and Kerala. Karnataka and Maharashtra too have long coastlines, but the influence of the inland regions may have diminished the coastal effects; in contrast, Kerala being narrow, the influence of the sea on its cuisine is strong. The relative positioning of Sikkim, too, is a bit off; I have not been able to work out the reason for this.
2. What would an MDS plot using data from rural areas alone, or from urban areas alone, look like? This would be a topic for a future post.
3. The NSSO has meanwhile made available a more recent report in 2014. We can explore how the data from the newer report compares with the 2012 data.