A Visualization of World Cuisines

In a previous post, we had ‘mapped’ the culinary diversity in India through a visualization of food consumption patterns. Since then, one of the topics in my to-do list was a visualization of world cuisines. The primary question was similar to that asked of the Indian cuisine: Are cuisines of geographically and culturally closer regions also similar? I recently came across an article on the analysis of recipe ingredients that distinguish the cuisines of the world. The analysis was conducted on a publicly available dataset consisting of ingredients for more than 13,000 recipes from the recipe website Epicurious. Each recipe was also tagged with the cuisine it belonged to, and there were a total of 26 different cuisines. This dataset was initially reported in an analysis of flavor network and principles of food pairing.

In this post, we (re)look the Epicurious recipe dataset and perform an exploratory analysis and visualization of ingredient frequencies among cuisines. Ingredients that are frequently found in a region’s recipes would also have high consumption in that region, and so an analysis of the ‘ingredient frequency’ of a cuisine should give us similar info as an analysis of ‘ingredient consumption’.

Outline of Analysis Method

Here is a part of the first few lines of data from the Epicurious dataset:

 Vietnamese vinegar cilantro mint olive_oil cayenne fish lime_juice
Vietnamese onion cayenne fish black_pepper seed garlic
Vietnamese garlic soy_sauce lime_juice thai_pepper
Vietnamese cilantro shallot lime_juice fish cayenne ginger  pea
Vietnamese coriander vinegar lemon lime_juice fish cayenne  scallion
Vietnamese coriander lemongrass sesame_oil beef root fish

Each row of the dataset lists the ingredients for one recipe and the first column gives the cuisine the recipe belongs to. As the first step in our analysis, we collect ALL the ingredients for each cuisine (over all the recipes for that cuisine). Then we calculate the frequency of occurrence of each ingredient in each cuisine and normalize the frequencies for each cuisine with the number of recipes available for that cuisine. This matrix of normalized ingredient frequencies is used for further analysis.

We use two approaches for the exploratory analysis of the normalized ingredient frequencies: (1) heatmap and (2) principal component analysis (pca), followed by display using biplots. The complete R code for the analysis is given at the end of this post.


There are a total of 350 ingredients occurring in the dataset (among all cuisines). Some of the ingredients occur in just one cuisine, which, though interesting, will not be of much use for the current analysis. For better visual display, we restrict attention to ingredients showing most variation in normalized frequency across cuisines. The results are shown below:



The figures look self-explanatory and does show the clustering together of geographically nearby regions on the basis of commonly used ingredients. Moreover, we also notice the grouping together of regions with historical travel patterns (North Europe and American, Spanish_Portuguese and SouthAmerican/Mexican) or historical trading patterns (Indian and Middle East).

We need to further test the stability of the grouping obtained here by including data from the Allrecipes dataset. Also, probably taking the third principal component might dissipate some of the crowd along the PC2 axis. These would be some of the tasks for the next post…

Here is the complete R code used for the analysis:

workdir <- "C:\\Path\\To\\Dataset\\Directory"
datafile <- file.path(workdir,"epic_recipes.txt")
data <- read.table(datafile, fill=TRUE, col.names=1:max(count.fields(datafile)),
na.strings=c("", "NA"), stringsAsFactors = FALSE)

a <- aggregate(data[,-1], by=list(data[,1]), paste, collapse=",")
a$combined <- apply(a[,2:ncol(a)], 1, paste, collapse=",")
a$combined <- gsub(",NA","",a$combined) ## this column contains the totality of all ingredients for a cuisine

cuisines <- as.data.frame(table(data[,1])) ## Number of recipes for each cuisine
freq <- lapply(lapply(strsplit(a$combined,","), table), as.data.frame) ## Frequency of ingredients
names(freq) <- a[,1]
prop <- lapply(seq_along(freq), function(i) {
colnames(freq[[i]])[2] <- names(freq)[i]
freq[[i]][,2] <- freq[[i]][,2]/cuisines[i,2] ## proportion (normalized frequency)
names(prop) <- a[,1] ## this is a list of 26 elements, one for each cuisine

final <- Reduce(function(...) merge(..., all=TRUE, by="Var1"), prop)
row.names(final) <- final[,1]
final <- final[,-1]
final[is.na(final)] <- 0 ## If ingredient missing in all recipes, proportion set to zero
final <- t(final) ## proportion matrix

s <- sort(apply(final, 2, sd), decreasing=TRUE)
## Selecting ingredients with maximum variation in frequency among cuisines and
## Using standardized proportions for final analysis
final_imp <- scale(subset(final, select=names(which(s > 0.1)))) 

## heatmap 
heatmap.2(final_imp, trace="none", margins = c(6,11), col=topo.colors(7), 
key=TRUE, key.title=NA, keysize=1.2, density.info="none") 

## PCA and biplot 
p <- princomp(final_imp) 
biplot(p,pc.biplot=TRUE, col=c("black","red"), cex=c(0.9,0.8), 
xlim=c(-2.5,2.5), xlab="PC1, 39.7% explained variance", ylab="PC2, 24.5% explained variance") 


Seeing India through Food – An Experiment in Multidimensional Scaling

The ‘Household Consumption of Various Goods and Services in India’ report from the National Sample Survey Office (NSSO), Government of India includes survey data on the monthly per capita quantity of consumption of selected food items. The report is available from http://mospi.nic.in/Mospi_New/Admin/publication.aspx (a registration is required). The per capita quantity of consumption data is provided for every state and union territory of India, separately for rural and urban sectors. Here is a snapshot of the data for the urban sector from the February 2012 report:

rice.kg wheat.kg arhar.kg moong.kg masur.kg urd.kg
AndhraPradesh 8.764 0.656 0.412 0.095 0.032 0.173
ArunachalPradesh 10.49 0.763 0.066 0.079 0.296 0.007
Assam 10.246 1.109 0.044 0.105 0.363 0.047
Bihar 5.804 5.732 0.088 0.057 0.222 0.025
Chhattisgarh 7.643 2.686 0.685 0.024 0.053 0.054

The results of an analysis of this food consumption data using Multidimensional Scaling (MDS) is presented in this post. The objective behind this analysis was to see what kind of clustering patterns are seen among the states of India, as far as food consumption goes.

MDS was carried out using R (version 3.1.2) using the isoMDS() function from the MASS package and using the ‘manhattan’ distance measure. The plot below is a visualization of the results from the MDS (only for the sake of clarity, the union territories are not shown in the plot):


Here is a map of India to compare the above plot with. Don’t you think that the relative positioning of the different states in India nicely captured in the MDS analysis?

It is often said that the culinary diversity in India is the result of the diversity in the geography, climate, economy, tradition and culture within the country. All of these factors also contribute to an extent on the consumption of basic food items, resulting in the clustering together of geographically and culturally nearby regions even though the data analyzed was on food.

Some more ‘food’ for thought:

  1. Though Goa is geographically close to the states of Karnataka and Maharashtra, with respect to the food consumption pattern, Goa seems closer to Kerala. This is probably reflecting the strong coastal influence for both Goa and Kerala. Karnataka and Maharashtra too have a long coastline, but the influence of the inland may have diminished the coastal effects. In contrast, Kerala being narrow, the influence of the sea is strong in its cuisine. The relative positioning of Sikkim too, is a bit off. Not able to understand the reason for this.
  2. What would a MDS plot using data from rural areas alone or from urban areas alone look like? This would be a topic for a future post.
  3. The NSSO has meanwhile made available a more recent report in 2014. We can explore how the data from the newer report compares to the 2012 data.