## PCA plots

In biology we use PCA extensively, and PCA plots appear in almost every article. Here I will show a trick that I hope makes a PCA plot easier to interpret.

We will use the USArrests dataset, which ships with R. To get more information, try ?USArrests. The first few rows look like this:

head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Here we run PCA on the scaled variables:

pcax = prcomp(USArrests, scale. = TRUE)
head(pcax$x)

##                   PC1        PC2         PC3          PC4
## Alabama    -0.9756604  1.1220012 -0.43980366  0.154696581
## Alaska     -1.9305379  1.0624269  2.01950027 -0.434175454
## Arizona    -1.7454429 -0.7384595  0.05423025 -0.826264240
## Arkansas    0.1399989  1.1085423  0.11342217 -0.180973554
## California -2.4986128 -1.5274267  0.59254100 -0.338559240
## Colorado   -1.4993407 -0.9776297  1.08400162  0.001450164

We can directly use R’s builtin function plot to see the results:

plot(pcax$x)

We could also use the biplot function, which additionally shows how each variable contributes to the spread of the data on the PCA plot. However, my main reason for writing this post is to comment on the axis lengths. See the following two figures:

plot(pcax$x)

plot(pcax$x)

Are they the same? They are the same plot, yes, but would you react the same way to both of them? Probably not. The first one puts more weight on the x-axis. Here is my suggestion:

library(tidyverse)
varexplained = summary(pcax)$importance[2, 1:2]
varratio = unname(varexplained[1]/varexplained[2])
data.frame(pcax$x) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point() +
  theme_bw() +
  coord_fixed(ratio = 1/varratio)

Here, I believe, we can interpret the dispersion better, because the axis lengths are now proportional to the variance explained by each PC. PC1 explains 62% and PC2 explains around 25% of the variance, so one unit on the x-axis is drawn about 2.5 times as long as one unit on the y-axis.
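
To make the arithmetic behind these numbers explicit, here is a minimal sketch that recomputes the proportions directly from the standard deviations stored in the prcomp object, instead of reading them from summary(). The names pca_fit, var_prop and axis_ratio are only illustrative and are not part of the code above. The last line is an assumed base-graphics counterpart, where asp plays the same role as ratio in coord_fixed.

```r
# Refit the PCA on the scaled data (same model as pcax above).
pca_fit <- prcomp(USArrests, scale. = TRUE)

# The variance of each PC is the square of its standard deviation;
# dividing by the total gives the proportion of variance explained.
var_prop <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(var_prop[1:2], 2)

# How many times more variance PC1 explains than PC2.
axis_ratio <- var_prop[1] / var_prop[2]
round(axis_ratio, 1)

# Base-graphics counterpart of coord_fixed(ratio = 1/varratio):
# asp is the y/x aspect ratio, so one unit on PC2 is drawn
# shorter than one unit on PC1 by a factor of axis_ratio.
plot(pca_fit$x[, 1:2], asp = 1 / axis_ratio)
```

The two rounded values should agree with the percentages and the ratio quoted in the text. If you prefer to stay in ggplot2, the same axis_ratio can be passed to coord_fixed in place of varratio.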