PCA is used extensively in biology, and almost every article includes a PCA plot. So here I will show a trick that will hopefully improve the interpretation of a PCA plot.
We will use the USArrests dataset that is available in R. To get more information, see its help page in R. Its first rows look like this:
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
Here we run the PCA:
pcax = prcomp(USArrests, scale. = TRUE)
head(pcax$x)
##                   PC1        PC2         PC3          PC4
## Alabama    -0.9756604  1.1220012 -0.43980366  0.154696581
## Alaska     -1.9305379  1.0624269  2.01950027 -0.434175454
## Arizona    -1.7454429 -0.7384595  0.05423025 -0.826264240
## Arkansas    0.1399989  1.1085423  0.11342217 -0.180973554
## California -2.4986128 -1.5274267  0.59254100 -0.338559240
## Colorado   -1.4993407 -0.9776297  1.08400162  0.001450164
We can directly use R's built-in plot function to see the results, or the biplot function, which also shows what affects the data spread on a PCA plot. However, my main point in writing this post is to comment on the axis lengths. See the following two figures:
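If you want to produce such a pair yourself, here is a minimal sketch in base R. It assumes the two figures differ only in the shape of the plotting region; the sizes of 4 x 2 and 2 x 4 inches are arbitrary values picked for illustration, not the ones used for the figures in this post.

```r
# Recompute the PCA so the snippet runs on its own.
pcax <- prcomp(USArrests, scale. = TRUE)
scores <- pcax$x[, c("PC1", "PC2")]

# Wide plotting region (4 x 2 inches, an arbitrary illustrative size):
# the spread along PC1 is visually emphasised.
op <- par(pin = c(4, 2))
plot(scores, main = "Wide plotting region")

# Tall plotting region (2 x 4 inches): the same points,
# but now the spread along PC2 is visually emphasised.
par(pin = c(2, 4))
plot(scores, main = "Tall plotting region")

# Restore the previous graphical parameters.
par(op)
```

Both calls plot exactly the same scores; only the shape of the plotting region changes.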
Are they the same? It is the same plot, yes, but would you react the same way to both of them? No, right? The first one puts more weight on the x-axis. Here is my suggestion:
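library(tidyverse)
# Proportion of variance explained by PC1 and PC2
varexplained = summary(pcax)$importance[2, 1:2]
# Ratio of the variance explained by PC1 to that explained by PC2
varratio = unname(varexplained[1]/varexplained[2])
data.frame(pcax$x) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point() +
  theme_bw() +
  coord_fixed(ratio = 1/varratio)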
Here, I believe, we can now interpret the dispersion better, as the axis lengths are also proportional to the variance explained by each PC. PC1 explains 62% and PC2 around 25% of the variance, so the ratio between the x- and y-axes is about 2.5.
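To see where these numbers come from, here is a small sketch that computes them directly from the prcomp result instead of reading them off summary(). The variable names pc_var, var_explained and var_ratio are my own and only used for illustration.

```r
# Recompute the PCA so the snippet runs on its own.
pcax <- prcomp(USArrests, scale. = TRUE)

# The variance of each PC is the square of its standard deviation.
pc_var <- pcax$sdev^2

# Proportion of the total variance explained by each PC.
var_explained <- pc_var / sum(pc_var)
round(var_explained[1:2], 2)
# 0.62 0.25

# Ratio between the variance explained by PC1 and by PC2.
var_ratio <- var_explained[1] / var_explained[2]
round(var_ratio, 1)
# 2.5
```

These are the same proportions that the second row of summary(pcax)$importance reports, up to rounding.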