PCA is used extensively in biology, and PCA plots appear in almost every article. Here I will show a trick that I hope makes a PCA plot easier to interpret.

We will use the `USArrests` dataset that ships with R. For more information, try `?USArrests`.

`head(USArrests)`

```
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
```

Here we run the PCA, scaling each variable to unit variance:

```
# scale. = TRUE standardises each variable to unit variance before the PCA
pcax <- prcomp(USArrests, scale. = TRUE)
head(pcax$x)
```

```
##                   PC1        PC2         PC3          PC4
## Alabama    -0.9756604  1.1220012 -0.43980366  0.154696581
## Alaska     -1.9305379  1.0624269  2.01950027 -0.434175454
## Arizona    -1.7454429 -0.7384595  0.05423025 -0.826264240
## Arkansas    0.1399989  1.1085423  0.11342217 -0.180973554
## California -2.4986128 -1.5274267  0.59254100 -0.338559240
## Colorado   -1.4993407 -0.9776297  1.08400162  0.001450164
```
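As a quick sanity check on what the scaling option does, here is a minimal sketch that compares the call above with a PCA on columns standardised by hand. The names `pca_manual` and `same_scores` are my own and only for illustration.

```r
# PCA with prcomp's own centring and scaling
pcax <- prcomp(USArrests, scale. = TRUE)

# Same analysis after standardising each column by hand
# (scale() centres to mean 0 and divides by the standard deviation)
pca_manual <- prcomp(scale(USArrests))

# Component signs are arbitrary in PCA, so compare absolute scores
same_scores <- isTRUE(all.equal(abs(pcax$x), abs(pca_manual$x)))
same_scores

# Raw column variances are on very different scales,
# which is why scaling matters for this dataset
sapply(USArrests, var)
```

Without scaling, the variable with the largest raw variance would dominate the first component.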

We can directly use R’s built-in `plot` function to see the results:

`plot(pcax$x)`

Or even the `biplot` function, which also shows how the original variables contribute to the spread of the data on a PCA plot. However, my main point in writing this post is to comment on the axis lengths. See the following two figures:

`plot(pcax$x)`

`plot(pcax$x)`

Are they the same? It is the same plot, yes, but would you react to both in the same way? Probably not. The first one puts more weight on the x-axis. Here is my suggestion:

```
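library(tidyverse)
# Proportion of variance explained by PC1 and PC2 (row 2 of the importance table)
varexplained <- summary(pcax)$importance[2, 1:2]
varratio <- unname(varexplained[1] / varexplained[2])
# coord_fixed(ratio = y/x): draw one PC2 unit 1/varratio as long as one PC1 unit
data.frame(pcax$x) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point() +
  theme_bw() +
  coord_fixed(ratio = 1 / varratio)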
```

Here, I believe, we can interpret the dispersion better, because the axis scaling now reflects the variance explained by each PC. PC1 explains 62% and PC2 about 25% of the variance, so the ratio is about 2.5 and one unit on the x-axis is drawn about 2.5 times as long as one unit on the y-axis.
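To make the arithmetic explicit, here is a small sketch that derives the same percentages from the component standard deviations and writes them into the axis labels. The names `var_explained`, `var_ratio` and `axis_labels` are my own, and labelling the axes this way is an optional addition rather than part of the trick itself.

```r
library(ggplot2)

pcax <- prcomp(USArrests, scale. = TRUE)

# Fraction of total variance carried by each PC (squared sdev, normalised)
var_explained <- pcax$sdev^2 / sum(pcax$sdev^2)
var_ratio <- var_explained[1] / var_explained[2]

# Axis labels that carry the percentages, so the reader can see
# where the aspect ratio comes from
axis_labels <- sprintf("PC%d (%.0f%%)", 1:2, 100 * var_explained[1:2])

p <- ggplot(data.frame(pcax$x), aes(x = PC1, y = PC2)) +
  geom_point() +
  theme_bw() +
  coord_fixed(ratio = 1 / var_ratio) +
  labs(x = axis_labels[1], y = axis_labels[2])
print(p)
```

Computing the fractions from `pcax$sdev` avoids depending on the rounded values in the `summary()` table, though for this purpose either source should give essentially the same ratio.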
