Principal Component Analysis (PCA)
In this video, the speaker discusses the concept of Principal Component Analysis (PCA) as a powerful technique for analysing and understanding high-dimensional datasets.
The first example of a large, complex dataset to which PCA was applied comes from a research article titled "Genes mirror geography within Europe".
In this study, the researchers collected genotypes of 3,192 individuals of European ancestry, measuring approximately 500,000 SNP (Single Nucleotide Polymorphism) loci for each individual. This results in a large data matrix with dimensions 3,192 (n) by 500,000 (m).
To analyse this high-dimensional dataset, the researchers employed PCA, a dimensionality reduction technique.
PCA projects the high-dimensional data onto a lower-dimensional space while preserving the most important information. In this case, the researchers projected the data onto a 2-dimensional space defined by the first two principal components (PC1 and PC2).
The resulting visualisation reveals that the relative location of individuals in the reduced gene space closely mirrors their geographic origins within Europe. This finding suggests that genetic variation among Europeans is strongly influenced by their geographic location and ancestral background.
Principal Component Analysis (PCA) is a statistical technique that aims to reduce the dimensionality of a dataset while retaining most of its variation.
It does so by identifying a new set of variables, called principal components (PCs), which are linear combinations of the original variables.
These PCs are orthogonal (uncorrelated) to each other and are ordered by the amount of variance they explain in the data.
The first principal component (PC1) is the direction in the high-dimensional space along which the data varies the most.
The second principal component (PC2) is orthogonal to PC1 and captures the second most significant direction of variation, and so on.
By projecting the data onto these PCs, we can effectively reduce the dimensionality of the dataset while preserving its most important features.
To implement PCA, the following steps are typically followed (a short code sketch follows the list):
Standardise the data: Ensure that all variables have zero mean and unit variance to avoid any single variable dominating the analysis due to its scale.
Compute the covariance matrix: Calculate the covariance matrix of the standardised data, which captures the pairwise relationships between variables.
Find the eigenvectors and eigenvalues: Compute the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the directions of the principal components, while the corresponding eigenvalues indicate the amount of variance explained by each PC.
Sort the eigenvectors: Order the eigenvectors by their associated eigenvalues in descending order. The eigenvector with the highest eigenvalue becomes PC1, the second highest becomes PC2, and so on.
Project the data: Transform the original data by projecting it onto the selected PCs. This step reduces the dimensionality of the data while preserving the most significant information.
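As an illustration, here is a minimal NumPy sketch of these five steps. The input matrix is random toy data standing in for a real dataset, and the variable names are arbitrary.

```python
import numpy as np

# Toy data: 100 samples x 5 variables (a stand-in for a real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Step 1: standardise each variable to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardised data
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh, since the covariance matrix is symmetric

# Step 4: sort eigenvectors by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the top k principal components
k = 2
scores = X_std @ eigvecs[:, :k]           # shape (100, 2)
print(eigvals / eigvals.sum())            # proportion of variance explained by each PC
```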
By applying PCA to high-dimensional datasets, researchers can identify the most important features, visualise the data in a lower-dimensional space, and gain insights into the underlying patterns and structures.
PCA has wide-ranging applications in various fields, including biology, genetics, neuroscience, and many others, where high-dimensional data is becoming increasingly common.
Principal Component Analysis (PCA) is a method for compressing high-dimensional data into a lower-dimensional representation while preserving the essence of the original data.
The key steps in performing PCA are as follows:
Start with a high-dimensional dataset, such as single-cell RNA sequencing data, where each cell is represented by a vector of gene expression levels.
Calculate the principal components (PCs) of the data. Each PC is a linear combination of the original variables (genes) that captures a certain amount of variation in the data.
The first PC (PC1) captures the most variation, the second PC (PC2) captures the second most variation, and so on.
Each gene is assigned a weight or "loading" for each PC, which represents the gene's influence on that PC. Genes with high loadings on a particular PC are the most important in driving the variation captured by that PC.
To plot cells in the reduced-dimensional space, a score is calculated for each cell along each PC. This is done by multiplying the cell's expression level for each gene by the gene's loading for that PC and summing these products across all genes (see the code sketch below).
The scores for the first two (or three) PCs are used as the coordinates for plotting each cell in a 2D (or 3D) scatter plot. Cells with similar expression profiles will cluster together in this reduced-dimensional space.
The PCs are orthogonal (perpendicular) to each other, meaning they capture independent sources of variation in the data.
The number of PCs is at most the number of original variables (genes), but in practice most of the variation in the data is often captured by the first few PCs.
PCA is an unsupervised method, meaning it does not use any information about cell types or other labels in finding the principal components.
The loadings can be used to identify genes that are important in distinguishing different cell types or states along each PC.
To assess the quality of a PCA, one can use diagnostic plots such as a scree plot, which shows the amount of variation captured by each PC. A good PCA will have most of the variation captured by the first few PCs, with subsequent PCs capturing diminishing amounts of variation.
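As a rough sketch of how this might look in code, scikit-learn's PCA class (assuming it is available) exposes the loadings as `components_`, the per-cell scores via `fit_transform`, and the per-PC variation via `explained_variance_ratio_`. The expression matrix below is random placeholder data, not a real single-cell dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder expression matrix: rows are cells, columns are genes
rng = np.random.default_rng(1)
expression = rng.poisson(lam=2.0, size=(300, 1000)).astype(float)

pca = PCA(n_components=10)
scores = pca.fit_transform(expression)   # per-cell coordinates along PC1..PC10
loadings = pca.components_               # per-gene weights (loadings) for each PC

# Genes with the largest absolute loadings drive the variation captured by PC1
top_genes_pc1 = np.argsort(np.abs(loadings[0]))[::-1][:10]
print(top_genes_pc1)

# Scores for the first two PCs give the coordinates for a 2D scatter plot
print(scores[:, :2])

# Proportion of variation captured by each PC (input for a scree plot)
print(pca.explained_variance_ratio_)
```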
A scree plot is a diagnostic tool used in Principal Component Analysis (PCA) to visualise the importance of each principal component.
It plots the amount of variation each PC captures from the data.
Typically, the PCs are ordered on the x-axis in descending order of their variance contribution, with the corresponding amount of variation explained on the y-axis.
This plot is crucial for determining the number of PCs that should be retained for further analysis. A common approach is to look for the "elbow" in the scree plot, a point after which the marginal gain in explained variance significantly drops, indicating that subsequent PCs contribute less to the explanation of data variation.
By effectively using a scree plot, researchers can ensure they are capturing the most meaningful components of their data, while also simplifying its complexity.
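A minimal matplotlib sketch of a scree plot is shown below; it assumes scikit-learn and matplotlib are available and uses random toy data purely to demonstrate the mechanics (real data would usually show a much clearer elbow).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Toy data standing in for a real dataset
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))

pca = PCA().fit(X)
var_explained = pca.explained_variance_ratio_ * 100

# One bar per PC, in descending order of variance explained
plt.bar(range(1, len(var_explained) + 1), var_explained)
plt.xlabel("Principal component")
plt.ylabel("Variation explained (%)")
plt.title("Scree plot")
plt.show()
```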
Principal Component Analysis (PCA) is a method for combining variables in a way that simplifies the analysis and maximises the amount of information retained in the combined variables.
The main goal of PCA is to reduce the dimensionality of a dataset while preserving as much of the original variation as possible.
The basic idea behind PCA is to find the optimal linear combination of variables that accounts for the maximum amount of variability in the data. This linear combination is represented by the following equation:
Combined Variable = α₁ × Variable₁ + α₂ × Variable₂ + ... + αₙ × Variableₙ
The coefficients α₁, α₂, ..., αₙ are called weights or loadings, and they determine the importance of each variable in the combined variable. PCA aims to find the optimal values for these weights that maximise the variance of the combined variable.
PCA finds the optimal weights by computing the covariance matrix of the data and then finding its eigenvectors.
The eigenvector with the largest eigenvalue contains the optimal weights for the linear combination that maximises the variance of the combined variable.
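The sketch below illustrates this numerically with two simulated, correlated variables: the eigenvector of the covariance matrix with the largest eigenvalue supplies the weights, and the variance of the resulting combined variable equals that eigenvalue. The data are simulated for illustration only.

```python
import numpy as np

# Two correlated variables (e.g. two related clinical measurements), simulated
rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
X = np.column_stack([x1, x2])

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# The eigenvector with the largest eigenvalue holds the optimal weights (loadings)
weights = eigvecs[:, np.argmax(eigvals)]

# Combined Variable = alpha1 * x1 + alpha2 * x2
combined = X @ weights

# The variance of the combined variable equals the largest eigenvalue
print(np.var(combined, ddof=1), eigvals.max())
```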
In the context of Principal Component Analysis (PCA), an eigenvector points in the direction of maximum variance in the data, and its corresponding eigenvalue quantifies the magnitude of this variance.
Each principal component is an eigenvector of the covariance matrix of the standardised data.
The corresponding eigenvalue represents the amount of variance in the data explained by that principal component.
The eigenvectors are orthogonal to each other, meaning they are uncorrelated and capture different directions of variability in the data.
When performing PCA, the eigenvectors are sorted in descending order based on their eigenvalues.
The eigenvector with the largest eigenvalue becomes the first principal component (PC1), the eigenvector with the second largest eigenvalue becomes the second principal component (PC2), and so on.
By selecting the top k eigenvectors, we can reduce the dimensionality of the data while retaining the most important information.
Several applications of PCA in biology include:
Reducing the number of variables: PCA can be used to combine correlated variables into fewer, more informative variables. For example, combining diastolic and systolic blood pressure into a single "blood pressure" variable and combining weight and height into a "body size" variable.
Identifying individuals with similar health profiles: By reducing many clinical variables to just two principal components (PC1 and PC2), it becomes easier to identify individuals with similar health profiles based on their proximity in a 2D plot of the principal component scores.
Analysing gene expression data: In datasets with thousands of genes measured across multiple individuals or single cells, PCA can be used to identify individuals or cells with similar gene expression profiles. By plotting the principal component scores in a 2D plot, distinct clusters can be identified, representing different cell types or cells undergoing specific biological processes.
In summary, PCA is a powerful tool for dimensionality reduction and data visualisation that helps in identifying patterns and relationships in high-dimensional datasets. By finding the optimal linear combination of variables that maximises the variance of the combined variable, PCA allows for a more intuitive understanding of complex biological data.
Principal Component Analysis (PCA) is an unsupervised learning method, which means it does not require any prior knowledge of class labels or groupings in the data.
PCA aims to find the underlying structure and patterns in the data without being guided by a specific target variable or outcome.
This distinguishes PCA from supervised learning techniques, such as classification and regression, which rely on labelled data to learn a mapping between input features and output variables.
The unsupervised nature of PCA makes it particularly useful for exploratory data analysis, data visualisation, and feature extraction.
By identifying the principal components that capture the most variability in the data, PCA can help uncover hidden structures and relationships between variables, even when no explicit labels or categories are available.
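To make the contrast concrete, the snippet below fits a PCA without any labels and, purely for comparison, a logistic regression that does require them. Both the features and the labels are random placeholders, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))        # features only
y = rng.integers(0, 2, size=100)     # class labels, needed only for supervised learning

PCA(n_components=2).fit(X)           # unsupervised: labels are never used
LogisticRegression().fit(X, y)       # supervised: learns a mapping from X to y
```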
Let's consider a simplified example using just three attributes from the heart disease dataset:
age
trestbps (resting blood pressure)
chol (serum cholesterol)
Suppose we have the following data for five patients:
| Patient | Age (years) | Trestbps (mm Hg) | Chol (mg/dl) |
|---------|-------------|------------------|--------------|
| 1 | 50 | 120 | 200 |
| 2 | 60 | 140 | 220 |
| 3 | 55 | 130 | 180 |
| 4 | 65 | 150 | 240 |
| 5 | 45 | 110 | 190 |
Step 1: Standardise the data by subtracting the mean and dividing by the standard deviation for each attribute.
Step 2: Compute the covariance matrix of the standardised data.
Step 3: Find the eigenvectors and eigenvalues of the covariance matrix.
Suppose the eigenvectors and their corresponding eigenvalues have been computed.
Step 4: Sort the eigenvectors by their eigenvalues in descending order, so that the eigenvector with the largest eigenvalue becomes PC1.
Step 5: Project the standardised data onto the top k eigenvectors (principal components) to reduce dimensionality. Let's choose k=2.
The projected data points in the reduced 2D space (PC1, PC2) could be:
| Patient | PC1 | PC2 |
|---------|-----|-----|
| 1 | -1.2 | -0.5 |
| 2 | 1.1 | 0.8 |
| 3 | -0.3 | -0.2 |
| 4 | 1.5 | 1.1 |
| 5 | -1.1 | -1.2 |
In this reduced space, patients with similar risk profiles will cluster together, making it easier to identify groups of patients with high or low risk of heart disease.
This simplified example demonstrates how PCA can be used to reduce the dimensionality of the heart disease dataset while preserving the most important information for analysis and visualisation.
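For completeness, here is a NumPy sketch that runs the five steps above on the five-patient table. Because the PC scores shown earlier are illustrative, the numbers this code produces will differ (and the signs of the components are arbitrary), but the procedure is the same.

```python
import numpy as np

# The five patients from the example: age, trestbps, chol
data = np.array([
    [50, 120, 200],
    [60, 140, 220],
    [55, 130, 180],
    [65, 150, 240],
    [45, 110, 190],
], dtype=float)

# Step 1: standardise each attribute
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Steps 2-3: covariance matrix and its eigen-decomposition
cov = np.cov(data_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort eigenvectors by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the top two principal components
pc_scores = data_std @ eigvecs[:, :2]
print(pc_scores)   # one (PC1, PC2) pair per patient
```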