# Principal Component Analysis (PCA)

In this video, the speaker discusses the concept of <mark style="color:blue;">**Principal Component Analysis (PCA)**</mark> as a powerful technique for analysing and understanding high-dimensional datasets.

{% embed url="https://www.youtube.com/watch?index=3&list=PLpl0vKOacRCFh9fMbutQMDVUnxSrDaAZl&v=0IjQ0SyQEOI" %}

The first example of a large, complex dataset to which <mark style="color:blue;">**Principal Component Analysis (PCA)**</mark> was applied comes from a research article titled "Genes mirror geography within Europe".

In this study, the researchers collected <mark style="color:yellow;">**genotypes of 3,192 individuals of European ancestry**</mark>, measuring approximately <mark style="color:yellow;">**500,000 SNP (Single Nucleotide Polymorphism) loci**</mark> for each individual.  This <mark style="color:yellow;">**results in a large data matrix**</mark> with dimensions 3,192 (n) by 500,000 (m).

To analyse this high-dimensional dataset, the researchers employed PCA, a dimensionality reduction technique.&#x20;

PCA projects the high-dimensional data onto a lower-dimensional space while preserving the most important information. In this case, the researchers projected the data onto a 2-dimensional space defined by the first two principal components (PC1 and PC2).

The resulting visualisation reveals that the relative location of individuals in the reduced gene space closely mirrors their geographic origins within Europe. This finding suggests that genetic variation among Europeans is strongly influenced by their geographic location and ancestral background.

### <mark style="color:purple;">Now, let's dive deeper into PCA and how it works</mark>

Principal Component Analysis (PCA) is a statistical technique that aims to reduce the dimensionality of a dataset while retaining most of its variation.&#x20;

It does so by identifying a new set of variables, called <mark style="color:blue;">**principal components (PCs)**</mark>, which are *<mark style="color:yellow;">**linear combinations of the original variables**</mark>*.&#x20;

These PCs are <mark style="color:blue;">**orthogonal (uncorrelated)**</mark> to each other and are ordered by the amount of variance they explain in the data.

The <mark style="color:blue;">**first principal component (PC1)**</mark> is the direction in the high-dimensional space along which the data varies the most.&#x20;

The <mark style="color:blue;">**second principal component (PC2)**</mark> is <mark style="color:yellow;">**orthogonal**</mark> to <mark style="color:blue;">**PC1**</mark> and captures the second most significant direction of variation, and so on.&#x20;

By projecting the data onto these PCs, we can effectively reduce the dimensionality of the dataset while preserving its most important features.

To implement PCA, the following steps are typically followed:

1. <mark style="color:purple;">**Standardise the data:**</mark> Ensure that all variables have zero mean and unit variance to avoid any single variable dominating the analysis due to its scale.
2. <mark style="color:purple;">**Compute the covariance matrix:**</mark> Calculate the covariance matrix of the standardised data, which captures the pairwise relationships between variables.
3. <mark style="color:purple;">**Find the eigenvectors and eigenvalues:**</mark> Compute the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the directions of the principal components, while the corresponding eigenvalues indicate the amount of variance explained by each PC.
4. <mark style="color:purple;">**Sort the eigenvectors:**</mark> Order the eigenvectors by their associated eigenvalues in descending order. The eigenvector with the highest eigenvalue becomes PC1, the second highest becomes PC2, and so on.
5. <mark style="color:purple;">**Project the data:**</mark> Transform the original data by projecting it onto the selected PCs. This step reduces the dimensionality of the data while preserving the most significant information.
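The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (no particular dataset is assumed); `np.linalg.eigh` is used because the covariance matrix is symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # synthetic data: 100 samples, 5 variables

# Step 1: standardise (zero mean, unit variance per variable)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardised data
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues (eigh is for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: sort eigenvectors by eigenvalue, descending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the first two principal components
scores = Z @ eigvecs[:, :2]
print(scores.shape)  # (100, 2)
```

Each row of `scores` gives one sample's (PC1, PC2) coordinates, ready for a 2D scatter plot.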

By applying PCA to high-dimensional datasets, researchers can identify the most important features, visualise the data in a lower-dimensional space, and gain insights into the underlying patterns and structures.&#x20;

PCA has wide-ranging applications in various fields, including biology, genetics, neuroscience, and many others, where high-dimensional data is becoming increasingly common.

### <mark style="color:purple;">Another explanation</mark>

Principal Component Analysis (PCA) is a method for compressing high-dimensional data into a lower-dimensional representation while preserving the essence of the original data.

The key steps in performing PCA are as follows:

1. <mark style="color:purple;">**Start with a high-dimensional dataset**</mark>, such as single-cell RNA sequencing data, where each cell is represented by a <mark style="color:yellow;">**vector of gene**</mark> expression levels.
2. <mark style="color:purple;">**Calculate the principal components (PCs)**</mark> of the data. Each PC is a linear combination of the original variables (<mark style="color:yellow;">**genes**</mark>) that captures a certain amount of variation in the data.&#x20;
3. The first <mark style="color:blue;">**PC (PC1)**</mark> captures the most variation, the <mark style="color:blue;">**second PC (PC2)**</mark> captures the second most variation, and so on.
4. Each <mark style="color:yellow;">**gene**</mark> is <mark style="color:blue;">**assigned a weight or "loading" for each PC**</mark>, which represents the gene's influence on that PC.  Genes with high loadings on a particular PC are the most important in driving the variation captured by that PC.
5. To <mark style="color:yellow;">**plot cells in the reduced-dimensional space**</mark>, a score is calculated for each cell along each PC. This is done by multiplying the cell's expression level for each gene by the gene's loading for that PC and summing these products across all genes.
6. The scores for the first two (or three) PCs are used as the coordinates for plotting each cell in a 2D (or 3D) scatter plot. Cells with similar expression profiles will cluster together in this reduced-dimensional space.
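As a minimal sketch of steps 4 and 5, assuming a tiny made-up expression matrix, a cell's PC score is exactly the loading-weighted sum of its expression values:

```python
import numpy as np

# Hypothetical expression matrix: 4 cells x 3 genes (made-up numbers)
expr = np.array([[5.0, 1.0, 0.0],
                 [4.0, 2.0, 1.0],
                 [0.0, 6.0, 5.0],
                 [1.0, 5.0, 6.0]])

Z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
eigvals, loadings = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
loadings = loadings[:, order]               # column j holds the gene loadings for PC j+1

# A cell's score on PC1 = sum over genes of (expression x loading)
pc1_scores = Z @ loadings[:, 0]

# The same value, written out term by term for the first cell
manual = sum(Z[0, g] * loadings[g, 0] for g in range(3))
print(np.isclose(pc1_scores[0], manual))    # True
```

The matrix product `Z @ loadings[:, 0]` computes that weighted sum for every cell at once.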

#### <mark style="color:green;">**Some key points about PCA**</mark>

* The <mark style="color:yellow;">**PCs are orthogonal (perpendicular) to each other**</mark>, meaning they capture independent sources of variation in the data.
* The number of PCs is at most the number of original variables (genes), but in practice, <mark style="color:yellow;">**most of the variation in the data is often captured by the first few PCs**</mark>.
* <mark style="color:yellow;">**PCA is an unsupervised method**</mark>, meaning it does not use any information about cell types or other labels in finding the principal components.
* The <mark style="color:yellow;">**loadings can be used to identify genes that are important**</mark> in distinguishing different cell types or states along each PC.

To assess the quality of a PCA, one can use diagnostic plots such as a <mark style="color:blue;">**scree plot**</mark>, which shows the amount of variation captured by each PC. A good PCA will have most of the variation captured by the first few PCs, with subsequent PCs capturing diminishing amounts of variation.

#### <mark style="color:green;">What is a Scree Plot?</mark>

A scree plot is a diagnostic tool used in Principal Component Analysis (PCA) to visualise the importance of each principal component.&#x20;

It plots the amount of variation each PC captures from the data.&#x20;

Typically, the PCs are ordered on the x-axis in descending order of their variance contribution, with the corresponding amount of variation explained on the y-axis.

This plot is crucial for determining the number of PCs that should be retained for further analysis. A common approach is to look for the "elbow" in the scree plot, a point after which the marginal gain in explained variance significantly drops, indicating that subsequent PCs contribute less to the explanation of data variation.

By effectively using a scree plot, researchers can ensure they are capturing the most meaningful components of their data, while also simplifying its complexity.
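The y-axis values of a scree plot are simply the eigenvalues expressed as fractions of the total variance. A sketch, assuming synthetic data with one dominant shared factor (so the "elbow" appears right after PC1):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: one shared factor plus small noise, so PC1 dominates
t = rng.normal(size=(200, 1))
X = t @ np.ones((1, 6)) + 0.1 * rng.normal(size=(200, 6))

Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1]

# Scree-plot values: fraction of total variance captured by each PC
explained = eigvals / eigvals.sum()
print(explained.round(3))                   # first entry close to 1, sharp drop after PC1
```

Plotting `explained` against the PC index (e.g. with `matplotlib`) gives the scree plot itself; here PC1 alone would clearly pass any elbow criterion.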

### <mark style="color:purple;">**And another explanation**</mark>

Principal Component Analysis (PCA) is a method for combining variables in a way that simplifies the analysis and maximises the amount of information retained in the combined variables.&#x20;

The main goal of PCA is to *<mark style="color:yellow;">**reduce the dimensionality of a dataset while preserving as much of the original variation as possible.**</mark>*

The basic idea behind PCA is to find the *<mark style="color:yellow;">**optimal linear combination of variables that accounts for the maximum amount of variability in the data**</mark>*. This linear combination is represented by the following equation:

<mark style="color:purple;">**Combined Variable**</mark> = <mark style="color:orange;">**α₁**</mark> × <mark style="color:green;">**Variable**</mark>₁ + <mark style="color:orange;">**α₂**</mark> × <mark style="color:green;">**Variable**</mark>₂ + ... + <mark style="color:orange;">**αₙ**</mark> × <mark style="color:green;">**Variable**</mark>ₙ

The <mark style="color:blue;">**coefficients**</mark>**&#x20;**<mark style="color:orange;">**α₁, α₂, ..., αₙ**</mark>**&#x20;**<mark style="color:blue;">**are called weights or loadings**</mark>, and they determine the importance of each variable in the combined variable.  PCA aims to find the optimal values for these weights that maximise the variance of the combined variable.

PCA finds the optimal weights by <mark style="color:yellow;">**computing the covariance matrix of the data and then finding its eigenvectors**</mark>.&#x20;

The eigenvector with the largest eigenvalue contains the optimal weights for the linear combination that maximises the variance of the combined variable.
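This claim can be checked numerically: the variance of the combined variable is largest when the weights come from the top eigenvector. A sketch with synthetic correlated data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic correlated data: 500 samples, 4 variables
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
w_best = eigvecs[:, np.argmax(eigvals)]     # optimal weights α₁..αₙ

# Variance of the combined variable using the top eigenvector...
var_best = (Z @ w_best).var(ddof=1)

# ...versus an arbitrary unit-length weight vector
w_other = rng.normal(size=4)
w_other /= np.linalg.norm(w_other)
var_other = (Z @ w_other).var(ddof=1)

print(var_best >= var_other)                # True
```

The variance achieved by the top eigenvector equals its eigenvalue; no other unit-length weight vector can exceed it.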

#### <mark style="color:green;">**What is an eigenvector?**</mark>

In the context of Principal Component Analysis (PCA), *<mark style="color:yellow;">**the eigenvectors of the covariance matrix point along the principal directions of variation in the data**</mark>*, and each corresponding eigenvalue quantifies the magnitude of the variance along that direction; the eigenvector with the largest eigenvalue points in the direction of maximum variance.&#x20;

Each <mark style="color:blue;">**principal component**</mark> is an eigenvector of the covariance matrix of the standardised data.&#x20;

The <mark style="color:blue;">**corresponding eigenvalue**</mark> represents the *<mark style="color:yellow;">**amount of variance in the data explained by that principal component**</mark>*.

The eigenvectors are orthogonal to each other, meaning they are uncorrelated and capture different directions of variability in the data.

When performing PCA, the *<mark style="color:yellow;">**eigenvectors are sorted in descending order based on their eigenvalues**</mark>*.&#x20;

The eigenvector with the largest eigenvalue becomes the first principal component (PC1), the eigenvector with the second largest eigenvalue becomes the second principal component (PC2), and so on.&#x20;

By selecting the top k eigenvectors, we can *<mark style="color:yellow;">**reduce the dimensionality of the data while retaining the most important information**</mark>*.
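Selecting the top k eigenvectors reduces the data from m columns to k, and the fraction of variance retained is simply the sum of the top k eigenvalues over the total. A sketch on synthetic data with a few underlying factors:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 6-dimensional data driven by 3 underlying factors
B = rng.normal(size=(6, 3))
X = rng.normal(size=(300, 3)) @ B.T
Z = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
reduced = Z @ eigvecs[:, :k]                # 300 x 2 instead of 300 x 6
retained = eigvals[:k].sum() / eigvals.sum()
print(reduced.shape)                        # (300, 2)
```

`retained` is the quantity a scree plot visualises cumulatively; choosing k is a trade-off between compactness and how much of this fraction one is willing to lose.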

#### <mark style="color:green;">**Several applications of PCA in biology**</mark>

<mark style="color:purple;">**Reducing the number of variables:**</mark> PCA can be used to combine correlated variables into fewer, more informative variables. For example, combining diastolic and systolic blood pressure into a single "blood pressure" variable and combining weight and height into a "body size" variable.

<mark style="color:purple;">**Identifying individuals with similar health profiles:**</mark> By reducing many clinical variables to just two principal components (PC1 and PC2), it becomes easier to identify individuals with similar health profiles based on their proximity in a 2D plot of the principal component scores.

<mark style="color:purple;">**Analysing gene expression data:**</mark> In datasets with thousands of genes measured across multiple individuals or single cells, PCA can be used to identify individuals or cells with similar gene expression profiles. By plotting the principal component scores in a 2D plot, distinct clusters can be identified, representing different cell types or cells undergoing specific biological processes.

In summary, PCA is a powerful tool for dimensionality reduction and data visualisation that helps in identifying patterns and relationships in high-dimensional datasets. By finding the optimal linear combination of variables that maximises the variance of the combined variable, PCA allows for a more intuitive understanding of complex biological data.

### <mark style="color:purple;">Unsupervised Learning Method</mark>

Principal Component Analysis (PCA) is an unsupervised learning method, which means it does not require any prior knowledge of class labels or groupings in the data.&#x20;

PCA aims to find the underlying structure and patterns in the data without being guided by a specific target variable or outcome.&#x20;

This distinguishes PCA from supervised learning techniques, such as classification and regression, which rely on labeled data to learn a mapping between input features and output variables.

The unsupervised nature of PCA makes it particularly useful for exploratory data analysis, data visualisation, and feature extraction.&#x20;

By identifying the principal components that capture the most variability in the data, PCA can help uncover hidden structures and relationships between variables, even when no explicit labels or categories are available.

### <mark style="color:purple;">A numerical example</mark>

Let's consider a simplified example using just three attributes from the heart disease dataset:&#x20;

1. age
2. trestbps (resting blood pressure)
3. chol (serum cholesterol)

Suppose we have the following data for five patients:

<table><thead><tr><th width="100" align="center">Patient</th><th width="72">Age</th><th width="109" align="center">Trestbps</th><th width="96">Chol</th></tr></thead><tbody><tr><td align="center">1</td><td>50</td><td align="center">120</td><td>200</td></tr><tr><td align="center">2</td><td>60</td><td align="center">140</td><td>220</td></tr><tr><td align="center">3</td><td>55</td><td align="center">130</td><td>180</td></tr><tr><td align="center">4</td><td>65</td><td align="center">150</td><td>240</td></tr><tr><td align="center">5</td><td>45</td><td align="center">110</td><td>190</td></tr></tbody></table>

<mark style="color:purple;">**Step 1:**</mark> Standardise the data by subtracting the mean and dividing by the standard deviation for each attribute.

<mark style="color:purple;">**Step 2:**</mark> Compute the covariance matrix of the standardised data.

<mark style="color:purple;">**Step 3:**</mark> Find the <mark style="color:blue;">**eigenvectors**</mark> and eigenvalues of the covariance matrix.

Suppose the <mark style="color:blue;">**eigenvectors**</mark> and <mark style="color:blue;">**eigenvalues**</mark> are:

```
Eigenvector 1: [0.58, 0.58, 0.58],   Eigenvalue 1: 2.5
Eigenvector 2: [0.71, -0.71, 0],     Eigenvalue 2: 0.4
Eigenvector 3: [-0.41, -0.41, 0.82], Eigenvalue 3: 0.1
```

(For three standardised variables, the eigenvalues sum to 3, the total variance.)

<mark style="color:purple;">**Step 4:**</mark> Sort the eigenvectors by their eigenvalues in *<mark style="color:yellow;">**descending order**</mark>*. In this case, the order is already correct.

<mark style="color:purple;">**Step 5:**</mark> Project the standardised data onto the top k eigenvectors (principal components) to reduce dimensionality. Let's choose <mark style="color:blue;">**k=2.**</mark>

The projected data points in the <mark style="color:blue;">**reduced 2D space**</mark> (PC1, PC2) could be:

<table><thead><tr><th width="125" align="center">Patient</th><th width="123" align="center">PC1</th><th data-type="number">PC2</th></tr></thead><tbody><tr><td align="center">1</td><td align="center">-1.2</td><td>-0.5</td></tr><tr><td align="center">2</td><td align="center">1.1</td><td>0.8</td></tr><tr><td align="center">3</td><td align="center">-0.3</td><td>-0.2</td></tr><tr><td align="center">4</td><td align="center">1.5</td><td>1.1</td></tr><tr><td align="center">5</td><td align="center">-1.1</td><td>-1.2</td></tr></tbody></table>

In this reduced space, patients with similar risk profiles will cluster together, making it easier to identify groups of patients with high or low risk of heart disease.&#x20;

This simplified example demonstrates how PCA can be used to reduce the dimensionality of the heart disease dataset while preserving the most important information for analysis and visualisation.
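The eigenvalues, eigenvectors, and projected coordinates above are illustrative ("suppose..."); the actual values can be computed by running the five steps directly on the table of five patients. A sketch in NumPy:

```python
import numpy as np

# The five patients from the table: age, trestbps, chol
X = np.array([[50, 120, 200],
              [60, 140, 220],
              [55, 130, 180],
              [65, 150, 240],
              [45, 110, 190]], dtype=float)

# Step 1: standardise
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 2-4: covariance matrix, eigendecomposition, sort descending
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the top k = 2 principal components
scores = Z @ eigvecs[:, :2]
print(scores.round(2))                      # one (PC1, PC2) row per patient
```

Because the standardised columns have zero mean, the resulting (PC1, PC2) scores are centred at the origin, and patients with similar profiles land close together in the 2D plot.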
