The post Disclosing transcriptional profiles of a conserved GI stromal stem cell niche with Nadia appeared first on Dolomite Bio Blog.

In the human body, the development of distinct organs is regulated by signals from specific stromal (mesenchymal) niches. In the stomach and intestine, epithelial development is controlled by stem cells residing in discrete niches: the isthmus and the crypt, respectively. In both organs, Wnt signaling is the transduction pathway that regulates stem cell homeostasis and differentiation, and that promotes tumorigenesis when dysregulated. The sources of intestinal Wnt ligands (for example, Paneth cells) have been studied extensively, and Wnt transcriptional control through β-catenin was recently analyzed using our Nadia instrument. In contrast, little is known about the source of Wnt ligands in the stomach or about how their production is regulated.

A preprint by Kim et al. uses single-cell transcriptome analysis to identify a stem cell niche conserved between the stomach and intestine, focusing in particular on the role of stomach stromal cells. From the single-cell expression patterns, they characterized these conserved stromal cells by telocyte and pericyte markers and by high expression of Wnt ligands. They also showed that stromal Wnt expression is directly mediated by Hedgehog (Hh) signaling through GLI2 transcriptional activation. Finally, genetic analyses in mouse models showed that these conserved stromal Wnt signals play crucial roles in gut regeneration and development rather than in maintaining stem cell niche homeostasis.

1. Single cell analysis identified conserved gastrointestinal stromal populations.

2. Wnt secretion by pericyte-like cells during regeneration.

3. Hh-GLI2 activation of Wnt signaling in conserved stromal cells.

4. Redundant role of stromal Wnt secretion in gut development.

To study this heterogeneity, gastrointestinal stromal cells were isolated from mice, and the transcriptomic profiles of 4946 stomach and 3459 intestinal single cells were analyzed with the Drop-Seq protocol on our Nadia instrument. t-SNE (t-distributed stochastic neighbor embedding) and unsupervised hierarchical clustering distinguished 17 stomach and 12 intestinal stromal cell clusters with either similar or distinct transcriptomic profiles between the two organs.

Markov affinity-based graph imputation of cells (MAGIC) was then used to assess Wnt ligand expression in stem cell niche signaling. While other distinct stromal clusters also expressed Wnt ligands, a notably high enrichment of this signal was found in the conserved niche of the stomach and intestine. Fluorescence staining and in situ hybridization showed that these conserved niche cells express pericyte markers (Ng2/Cspg4, Pdgfr-β) and the telocyte marker Foxl1; the authors therefore termed this conserved stromal population pericyte-like cells.

Accordingly, the function of these pericyte-like cells as a stem cell niche was assessed through their Wnt signal expression. First, RNA fluorescence in situ hybridization confirmed the expression of Wnt ligands (Wnt2b and Wnt4) in these cell populations. Then, to test their essentiality as a stem cell niche, the authors deleted the Wnt transmembrane secretion protein Wntless (Wls). This resulted in a reduction of both stomach and intestinal stem cells; however, no changes in gastrointestinal proliferation were observed, suggesting alternative sources of Wnt signals. These pericyte-like stromal cells were therefore hypothesized to play a critical role in gastrointestinal regeneration. To evaluate this, the authors irradiated the gene-edited mice alongside control mice and observed impaired regeneration of progenitors and stem cells in both the stomach and intestine at 10 days post-irradiation. Thus, the pericyte-like stromal cells are critical as a stromal Wnt niche during regeneration.

The single cell RNA-seq data also demonstrated a high enrichment of the Hh pathway in the conserved stromal cells. In the gut, Hh is an essential signal for epithelial and stem cell proliferation and also regulates the Wnt ligands. To study its Wnt-regulating role, the authors activated Hh signaling in the conserved pericyte-like stromal cells. As a result, the proliferation rate increased in both the gastrointestinal epithelial cells and their progenitor or stem cells. Meanwhile, fluorescence in situ hybridization showed higher levels of Wnt ligands in these conserved stromal cells, strongly suggesting Hh activation of Wnt signaling. Interestingly, the Hh downstream transcription factor GLI2 was abundant in these conserved cells, which were enriched for both Hh and Wnt pathway signatures. A transcriptional reporter assay in cultured stomach and intestinal mesenchymal cells was therefore performed, confirming direct activation of stromal Wnt ligands (Wnt2b, Wnt9a) by GLI2.

Because Wnt inhibition had only a mild effect on the stem cells, and because Wnt ligands are also expressed by other stromal cell populations, the authors propose a non-essential role for these gastrointestinal stromal cells as a homeostatic Wnt stem cell niche. To further clarify the functions of these gut stromal Wnt signals in development, they investigated the gastrointestinal phenotype of Wls-deleted (Wlsfl/fl) mice. Hematoxylin-eosin staining showed a significant reduction of forestomach and intestinal lumen size coupled with epithelial depletion. Further study also demonstrated defects in stem cell development and epithelial proliferation as well as in Wnt target gene expression. In conclusion, the authors indicate a critical role for the gastrointestinal stromal Wnt niche during development but not for stem cell niche homeostasis.

In conclusion, this study disclosed the transcriptional profile and signaling pathway of a conserved gastrointestinal stromal stem cell niche and highlighted its redundancy within the stromal cell population. These pericyte-like cells regulate the regeneration and development of stomach and intestinal epithelial cells by expressing Wnt signals activated by the Hh-GLI2 transcription factor axis. Single cell analysis by Drop-seq on the Nadia instrument has proven its efficacy in clustering gastrointestinal niche cells and detecting pathway enrichment, combined with powerful data analysis such as the Seurat package.

If you are interested in studying stem cells, and human health in general, through single cell research, have a look at our solutions:

Single cell RNA-Seq with the Nadia platform.

RNAdia kit for Single Cell RNA-Seq on Nadia

Dolomite Bio also offers a single-cell analysis service using an in-house bioinformatics pipeline

Breaking down cost barriers for single cell research: Introducing RNAdia


The post Cell annotation in single cell data analysis appeared first on Dolomite Bio Blog.

Cells are the basic building blocks of tissues and organisms, carrying out highly specific and specialized functions. Cells specialize as a result of various intrinsic (molecular profiles) and extrinsic (spatial location) cues. Humans naturally want to classify everything into little boxes; thus we should ask ourselves: “What exactly is a cell type?”. Despite research efforts over the past 3-4 centuries, there is still a lot more to learn about cells.

The annotation question is largely about specifying the levels of analysis. What level are we classifying at? Functional? Regulatory? Morphological? How much variation are we expecting? Gene expression levels vary gradually over a spectrum, and without having a level of analysis in mind, we cannot draw boundaries between different expression “types”. More importantly, transcriptional differences that we use as separation boundaries may not have any biological functional relevance at all.

Moreover, if we take cell development and fate decisions into consideration, the question becomes even more challenging. The changes cells undergo throughout their life cycles are intricate and rapid. Even if we could rigorously classify cell types and states at our desired level of analysis, we would still need to make sense of such dynamic biological regulation.

Over the past decade or so, single-cell sequencing technologies have allowed us to zoom into the molecular mechanisms that regulate cell behavior, including fate decisions, developmental transitions, and responses to injuries and diseases. Thanks to the technology, we can conduct mixed cell experiments without experimentally purifying or separating the cells by type. We then computationally utilize unbiased algorithms to label cell types in multiple species, tissues, and contexts, and also construct systems biology models to predict cell behaviors during development.

We can broadly classify single-cell cell type annotation into two classes: manual and automated. In manual methods, we use marker features to define cell identities, one by one. In automated methods, we use computational tools to analyze the transcriptomic profiles and assign identities. Each option has its pros and cons.

Manual annotation is straightforward: it works through marker genes. You can do so in a “forward” fashion, using a list of marker genes as a reference, then querying and matching the reference against your selected cells.

We can do this within Seurat with its marker-feature functions (such as FindMarkers), or by differential expression analysis (DEA) of one cell population against the others.
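As a minimal numpy-only sketch of the one-vs-rest idea behind DEA-based marker finding (the `rank_markers` helper and the toy matrix below are illustrative stand-ins, not Seurat's actual statistics):

```python
import numpy as np

def rank_markers(counts, labels, cluster):
    """Rank genes by mean log fold change of `cluster` vs. all other cells.
    counts: (cells, genes) matrix; labels: per-cell cluster ids."""
    in_mean = counts[labels == cluster].mean(axis=0)
    out_mean = counts[labels != cluster].mean(axis=0)
    logfc = np.log2((in_mean + 1.0) / (out_mean + 1.0))  # pseudocount avoids /0
    return np.argsort(logfc)[::-1]  # gene indices, best marker first

# Toy example: gene 2 is strongly enriched in cluster 1
counts = np.array([[5, 1, 0], [4, 2, 1], [1, 0, 9], [0, 1, 8]])
labels = np.array([0, 0, 1, 1])
print(rank_markers(counts, labels, 1)[0])  # -> 2
```

Real tools add proper statistical tests (e.g. Wilcoxon rank-sum) and multiple-testing correction on top of this ranking idea.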

We can often find marker genes in single-cell atlases, especially if our sample comes from a biological context similar to the ones used in the atlas. Researchers are also recommended to mine the literature for further evidence. In the best-case scenario, if most cells in our cell population show high expression levels of canonical markers, then the annotation is almost complete.

We know too well that such a happy scenario is rarely a reality. There are times when classic marker genes are insufficient for classification: when we encounter a novel or rare cell type, or when the cells were poorly handled during the experiment. If we are fortunate enough to have avoided these issues, we may proceed with **enrichment analysis** to find out the cell functions.

Alternatively, you can use machine learning to crawl for similar cell populations within databases for single-cell studies. This method, though more advanced, will return not only the suggested cell type but also suggest similar signature genes and processes. These suggestions, akin to Google search suggestions, may help you dive deeper into your studies. And with that, let’s explore some automated tools.

Automated methods are more diverse than manual methods, and are based on one of three core principles:

- Marker-based: algorithms score and rank clusters based on the expression levels of marker genes, then assign each cluster to its respective cell type. Open-source tools include SCINA and CellAssign.
- Supervised classification: we use machine learning to train tools on reference single-cell studies; examples include scClassify and scPred. More, and better-quality, training datasets result in higher-quality tools.
- Correlation-based: as the name suggests, we calculate how strongly each cluster correlates with each reference cell type, rank the correlation scores, and assign each cluster to the cell type with the best score. The most familiar example is Seurat.
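The correlation-based principle fits in a few lines of numpy. This is a sketch only; the reference profiles, cell type names, and `annotate_by_correlation` helper are hypothetical:

```python
import numpy as np

def annotate_by_correlation(cluster_profile, reference_profiles, names):
    """Assign the reference cell type whose mean expression profile
    correlates best (Pearson) with the query cluster's mean profile."""
    scores = [np.corrcoef(cluster_profile, ref)[0, 1] for ref in reference_profiles]
    best = int(np.argmax(scores))
    return names[best], scores[best]

# Hypothetical reference: mean expression of 3 genes in two known cell types
refs = np.array([[9.0, 1.0, 1.0],   # "T cell"-like profile
                 [1.0, 9.0, 1.0]])  # "B cell"-like profile
label, score = annotate_by_correlation(np.array([8.0, 2.0, 1.0]), refs,
                                       ["T cell", "B cell"])
print(label)  # -> T cell
```

Production tools work on many clusters and thousands of genes at once, but the assignment rule is exactly this argmax over correlation scores.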

Tools are created to solve the problems their creators were facing, and none of them is perfect. Depending on the scale and type of your dataset, some methods may work better than others. You must always follow up and validate your cell annotation: you can test the statistical significance of cluster membership, and also integrate cell annotation with other experimental and computational methods. The best practice is to use one or two automated tools to quickly give your cells some identities, and then cross-check with a manual method and other validation methods.

If you need scRNA-seq-related help, Dolomite Bio offers end-to-end Single-Cell Consultancy Service (SCC) and Bioinformatics Service (BIS) that helps you through one or more steps of the workflow:

Sample preparation (SCC)

Library preparation (SCC)

Sequencing (SCC)

Computational Analysis (SCC + BIS)

Queries and suggestions for what we should write next in our blog series can be directed to: bioinformatics@dolomite-bio.com

- Trapnell, C. (2015) Defining cell types and states with single-cell genomics. Genome Research.
- Pasquini, G., Arias, J., Schafer, P., Busskamp, V. (2021) Automated methods for cell type annotation on scRNA-seq data. Computational and Structural Biotechnology Journal 19.
- Clarke, Z., Andrews, T., Atif, J., Pouyabahar, D., Innes, B., MacParland, S., Bader, G. (2021) Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods. Nature Protocols 16, 2749-2764.
- Amezquita, R., Lun, A., Hicks, S., Gottardo, R. (2021) Basics of Single-Cell Analysis with Bioconductor. Bioconductor.
- Nguyen, H. (2022) scRNA-Seq Cell Type Annotation: Common Approaches and Tools. BioTuring.
- Ianevski, A., Giri, A., Aittokallio, T. (2022) Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nature Communications 13.

Pre-processing of single cell data – A comparison of dropSeqPipe and Kallisto-bustools


The post UMAP method- Dimensionality Reduction in Single Cell Genomics appeared first on Dolomite Bio Blog.

In the previous blog, we introduced dimension reduction in single cell RNA-Sequencing (scRNA-Seq) as well as PCA and t-SNE dimension reduction methods. We have learned that single cell data are sparse, noisy, and high-dimensional and that dimension reduction is needed to turn the data into something more manageable. In this last blog of the series, we will discuss UMAP.

Uniform Manifold Approximation and Projection (UMAP) was first described by McInnes et al. in 2018. At first glance, UMAP and t-SNE are highly similar. UMAP can also be expressed in four equations, much like those used in t-SNE (to learn more, read article 3 in this series). Both start by constructing a high-dimensional representation of the data, then try to reconstruct a low-dimensional graph that is as close to the first one as possible.

However, there are a few key differences:

- Where t-SNE normalizes the similarities in both the high- and low-dimensional spaces, UMAP skips these normalization steps.
- Other mathematical changes (such as using k-nearest neighbor in lieu of perplexity equation, or Stochastic Gradient Descent in place of Gradient Descent) help UMAP reduce memory usage and shorten running time. The mathematical underpinning is interesting but is out of scope for this blog.
- The 4 main parameters that you should know about are: n_neighbors, min_dist, n_components, and metric. I will discuss the usage of each parameter in the next section.
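To make the k-nearest-neighbor step concrete, here is a brute-force numpy sketch of the neighborhood graph UMAP starts from (real implementations use approximate nearest-neighbor search for speed; the `knn_indices` helper and toy points are illustrative):

```python
import numpy as np

def knn_indices(X, n_neighbors):
    """For each point, indices of its n_neighbors nearest other points
    (Euclidean distance) -- the graph-building step UMAP starts from."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :n_neighbors]

# Two tight pairs of points: each point's nearest neighbor is its partner
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(knn_indices(X, 1).ravel())  # -> [1 0 3 2]
```

In UMAP, `n_neighbors` controls how large this neighborhood is, which is exactly why it trades off local against global structure.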

UMAP is a great nonlinear technique that tends to preserve more of the dataset's global structure than t-SNE (though not without cons; see the following section). Furthermore, plots generated by UMAP are more continuous in nature than those from t-SNE, helping it display cell lineages better. Overall, data can be categorized as binary, categorical, or continuous, and scRNA-Seq data tend to be continuous.

In t-SNE, one often tunes “perplexity”, a parameter that guesses the number of close neighbors each point has. In comparison, one tunes n_neighbours and min_dist in UMAP to balance local and global structures.

Unlike t-SNE, UMAP does not initialize randomly, so running UMAP multiple times generates the same results.

UMAP is rather light computationally: you can run it on a powerful laptop, whereas t-SNE on large data may require cluster computers.

Interpretability is where UMAP is lacking; the most interpretable method remains PCA. As its name alludes to, UMAP assumes a manifold data structure (to learn what a manifold is, take a look at this article), and it therefore tends to find manifold structure within data noise. The larger the dataset, the less the noise matters; hence UMAP is recommended for big datasets but not small ones.

When using UMAP, you should tune n_neighbors and min_dist to be suitable to your research question. I cannot comment precisely on how you should tune them, as this is a trial-and-error process that gets better with experience. The default values are 15 and 0.1 respectively.

In general, the rules are:

- n_neighbors values range from 2 (a very local view of the manifold) up to 200 (a quarter of the data). Tuning this parameter is a tradeoff between local versus global structure preservation.
- As min_dist is increased, the points are pushed apart into softer more general features, providing a better overarching view of the data at the loss of the more detailed topological structure. This also shows a tradeoff between local versus global structures.

You can also tune n_components in UMAP. It determines the number of dimensions in the lower-dimensional space. UMAP scales well in embedding dimension so n_components can be higher than 2 or 3 dimensions. This is an advantage of UMAP over t-SNE.

A detailed technical tutorial can be found on this website.

The three methods PCA, t-SNE, and UMAP all have their pros and cons. In general, for scRNA-Seq analysis, we would recommend the following:

- Perform quality control, feature selection and normalization on the count matrix. You can refer to this Seurat/R tutorial from Harvard University, here and here.
- Start your dimension reduction analysis with PCA. The default number of PCs is often between 30 and 50, but it's best to refer to the scree plot to determine the exact plateau.
- If global structure preservation is your goal, use PCA only. It is excellent at reducing the dimensionality of your dataset.
- However, if interpretation and local structure are important, PCA will likely be problematic. You will then need to look at t-SNE or UMAP.
- Use PCA + t-SNE on a smaller dataset.
- Use PCA + UMAP on a bigger dataset.

The content of these blogs is meant to be introductory. If you need more resources, I suggest the following:

If you need scRNA-seq-related help, Dolomite Bio offers end-to-end Single-Cell Consultancy Service that helps you through one or more steps of the workflow:

- Sample preparation
- Library preparation
- Sequencing
- Computational Analysis

Queries and suggestions for what we should write next in our blog series can be directed to: bioinformatics@dolomite-bio.com

Need help with your single cell data analysis? Check out Dolomite Bio’s new Bioinformatics Service


The post Dimensionality Reduction in Single Cell Genomics appeared first on Dolomite Bio Blog.

Introduced in 2009, single-cell RNA sequencing (scRNA-Seq) allows researchers to observe cellular heterogeneity of the transcriptome at the resolution of individual cells. This method produces vast amounts of data with expression levels of an individual gene for thousands of cells within a single sample.

ScRNA-Seq data are often compiled in a count matrix, in which each value represents the number of reads in a cell originating from the corresponding gene. The data is high-dimensional: each individual gene represents a **dimension**. And as there are so many genes and related information, we cannot (and should not) include all the details. We therefore select **features** (genes and their useful information) to develop **models** (meaningful subsets of data). Which genes are chosen will have a major impact on the performance of downstream analyses.

“The more the merrier” is not the case for scRNA-Seq. With more data than necessary, we need to train the models for longer and use up more computational resources. Instead, we use **dimension reduction**, a computational method that simplifies the data, reducing computational work, eliminating noise, and helping to visualize the highly complex data.
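To make **feature** selection concrete, here is a numpy-only toy sketch that keeps the most variable genes. The matrix and the `top_variable_genes` helper are hypothetical stand-ins; real highly-variable-gene selection (e.g. in Seurat or Scanpy) models the mean-variance relationship rather than using raw variance:

```python
import numpy as np

def top_variable_genes(counts, n_top):
    """Toy feature selection: keep the n_top genes with the highest
    variance across cells (a simplified stand-in for HVG selection)."""
    var = counts.var(axis=0)
    return np.sort(np.argsort(var)[::-1][:n_top])  # sorted gene indices

# 3 cells x 4 genes: genes 1 and 3 vary across cells, genes 0 and 2 barely do
counts = np.array([[0, 10, 0, 3], [0, 0, 1, 9], [0, 9, 0, 2]])
print(top_variable_genes(counts, 2))  # -> [1 3]
```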

This blog series will give you an introduction to dimension reduction and the three most common methods, PCA, t-SNE, and UMAP. We will describe each method’s strengths and weaknesses, as well as the best practice in handling scRNA-Seq data.

Single cell workflows typically produce large amounts of data, as they measure whole-genome expression of thousands of single cells within a sample. The computational analysis of such data is difficult due to various reasons:

- (1) Data is high dimensional.
- (2) Data is sparse and noisy – in Figure 1, you can see many zeros and only a few non-zeroes.
- (3) Different cell populations have unequal sizes.

To address some of these characteristics of scRNA-Seq data, we deploy normalization and dimension reduction to transform the original high-dimensional matrix into a lower-dimensional subspace enriched with useful signals. Issue 1 is solved with dimension reduction, whereas issues 2 and 3 are solved with normalization. In this series of blogs, we will first focus solely on dimension reduction. If conducted properly, dimension reduction can reduce levels of noise and complexity, and aid downstream analyses such as clustering and visualization.

In Figure 2 above, we have a recommended scRNA-Seq workflow, according to the consensus published by the researchers at ELIXIR-EXCELERATE. First, the raw reads (compiled in FASTQ files) must go through multiple rounds of quality control (QC). Common QC metrics include the number of unique molecular identifiers (UMIs), UMI count per cell, genes detected per cell, UMIs versus genes detected, mitochondrial count ratio, novelty score, and reads per cell.
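Assuming a cells x genes UMI count matrix, several of these QC metrics can be computed in a few lines of numpy (the matrix, mitochondrial gene mask, and thresholds below are toy values for illustration):

```python
import numpy as np

# counts: cells x genes UMI matrix; mito: boolean mask of mitochondrial genes
counts = np.array([[10, 5, 2], [0, 1, 30]])
mito = np.array([False, False, True])  # hypothetical: gene 2 is mitochondrial

umis_per_cell = counts.sum(axis=1)                       # library size
genes_per_cell = (counts > 0).sum(axis=1)                # detected genes
mito_ratio = counts[:, mito].sum(axis=1) / umis_per_cell  # mito fraction

# Toy thresholds: keep cells with enough UMIs and low mitochondrial content
keep = (umis_per_cell >= 5) & (mito_ratio < 0.5)
print(keep)  # -> [ True False]  (cell 1 is mostly mitochondrial reads)
```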

After the QC has been completed, we need to normalize the data. Normalization helps remove cell-specific bias, for instance differences in sequencing depth between cells. After normalization, dimension reduction is the next logical step: its purpose is to reduce the number of dimensions of the data and optimize the pipeline's memory footprint. The three main techniques are PCA, t-SNE, and UMAP. PCA is a full-fledged dimension reduction method that can be used on its own, but recent datasets typically require PCA as the starting point before using t-SNE or UMAP. After dimensionality reduction, you can conduct downstream analyses such as clustering and differential expression analysis.
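A minimal numpy sketch of library-size normalization followed by a log transform, similar in spirit to (but not identical with) Seurat's LogNormalize; the `lognorm` helper and toy counts are illustrative:

```python
import numpy as np

def lognorm(counts, scale=1e4):
    """Divide each cell by its library size, rescale, then log1p."""
    size = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / size * scale)

# Two cells with identical composition but 10x different sequencing depth
counts = np.array([[10.0, 0.0], [100.0, 0.0]])
norm = lognorm(counts)
print(np.allclose(norm[0], norm[1]))  # -> True: depth difference removed
```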

The workflow of choice depends on your research questions, sequencing batches, experimental conditions, sequencing methods, and many other factors. But in general, the above workflow is one of the most commonly used.

In this first blog in the series on dimensionality reduction, we have introduced you to the most basic concepts of dimensionality reduction in single cell sequencing. You will have learned what a typical scRNA-Seq dataset and a respective post-processing workflow look like. In the next blogs, we will talk in detail about the different popular dimension reduction methods: PCA, t-SNE, and UMAP. Stay tuned!

If you need scRNA-seq-related help, Dolomite Bio offers end-to-end Single-Cell Consultancy Service that helps you through one or more steps of the workflow:

- Sample preparation
- Library preparation
- Sequencing
- Computational Analysis

Queries and suggestions for what we should write next in our blog series can be directed to: bioinformatics@dolomite-bio.com

Need help with your single cell data analysis? Check out Dolomite Bio’s new Bioinformatics Service


The post t-SNE method- Dimensionality Reduction in Single Cell Genomics appeared first on Dolomite Bio Blog.

In the previous blog, we introduced dimension reduction in single-cell RNA-Sequencing (scRNA-Seq) and PCA as one dimension reduction method. We learned that single cell data are sparse, noisy, and high-dimensional and that dimension reduction is needed to turn the data into something more manageable. In this blog, we will discuss t-SNE.

T-distributed stochastic neighbor embedding (t-SNE) is a method that lets us visualize expression data on a cell-wise basis. First introduced by van der Maaten and Hinton in 2008, t-SNE is a probabilistic dimensionality reduction technique.

Some sources say the motivation behind t-SNE stems from the limitations of PCA.

If PCA aimed to maximize global structure and produced some local inconsistencies along the way (far away points end up being neighbors), t-SNE focuses on preserving the local relationships among data points (Fig 1).

PCA is a linear technique while t-SNE is nonlinear and can “unroll” structures more correctly. The algorithm is adaptive to the underlying data, so it carries out different transformations to different regions of data.

The math behind t-SNE involves four equations. The first sets the symmetry rule: t-SNE calculates the probability of similarity of points in high-dimensional space and then, analogously, in the corresponding low-dimensional space.

The second equation sets perplexity, which is a parameter that guesses the number of close neighbors each point has. In general, larger datasets mean larger perplexity. By setting perplexity, t-SNE balances the global and local structures of the data. Wattenberg et al. (2016) did an amazing job illustrating this point with multiple simple data sets.

The third uses the Student's t-distribution to compute the low-dimensional similarities, and the fourth calculates the Kullback-Leibler divergence that the algorithm minimizes. The detailed math is explained in the original van der Maaten paper, but it's out of scope for our blog today.

Essentially, what these 4 equations do is take a set of points in high-dimensional space and reorganize them as accurately as possible in the lower (typically 2D) dimensions.
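For reference, the four quantities described above can be written out (following van der Maaten and Hinton, 2008) as:

```latex
% High-dimensional similarities (Gaussian kernel, symmetrized over n points):
p_{j\mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                   {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}

% Perplexity fixes each bandwidth \sigma_i:
\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad
H(P_i) = -\sum_j p_{j\mid i} \log_2 p_{j\mid i}

% Low-dimensional similarities (Student's t-distribution, one degree of freedom):
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized by gradient descent (Kullback-Leibler divergence):
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```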

As a non-linear technique, t-SNE has the amazing ability to work with high-dimensional data. It is user-friendly and often produces more meaningful outputs than many other alternatives.

T-SNE preserves local structure. What this means is that points that are close to one another in the high-dimensional data set will tend to be close to one another in the lower-dimensional chart.

T-SNE is incredibly flexible to different types of input data. It is one recommended choice for scRNA-seq data visualization, for example here or here.

Although impressive, the t-SNE outputs are prone to misreading. T-SNE is a dimensionality reduction technique and NOT a clustering technique. T-SNE reduces the number of dimensions and then attempts to find patterns in the data by identifying observed clusters. However, after this process, the input features are no longer the same as they were, and you cannot make any inference based only on the t-SNE output. **Cluster size and cluster distance in t-SNE do not have any intrinsic meaning**, as illustrated brilliantly by Wattenberg et al. (2016) in this tutorial.

The algorithm of t-SNE means it is good at preserving the local relationships between points but not much of the global data structure.

Running t-SNE multiple times on the same dataset will likely result in different outputs. It’s best to run t-SNE repeatedly to choose the best perplexity parameter, and to average the output. You may also stabilize the overall output through setting a seed to override the random initialization.

Reading t-SNE may require you to recognize random noise (“odd” results). This skill comes with training and experience. Some of these artifacts are addressed and explained clearly by van der Maaten himself on his website.

T-SNE is computationally heavy. In very large datasets, we need to couple t-SNE with another technique to both (1) increase accuracy and (2) reduce the computational power required. One recommendation suggested coupling of t-SNE with PCA for dense data and TruncatedSVD for sparse data.

In scRNA-seq, we often couple t-SNE with PCA. The reason for this is that the mathematical goal of t-SNE is to capture the local relationships between cells (points of a network), whereas PCA will calculate the “true” distance between points in a high dimensional space (we call this the global structure). Using PCA as the first dimensionality reduction technique helps us project our scRNA-seq data into a lower-dimensional subspace. The distances become more real, and t-SNE can display more real relationships between points. Doing so will also reduce computational resources and time needed.
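A minimal sketch of this PCA + t-SNE coupling, assuming scikit-learn is installed; the random matrix is a stand-in for a normalized expression matrix, and 30 PCs / perplexity 30 are just the commonly used defaults mentioned in this series:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # 100 "cells" x 2000 "genes"

# Step 1: PCA compresses the data and removes much of the noise
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# Step 2: t-SNE embeds the PCs into 2D; random_state fixes the initialization
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print(X_tsne.shape)  # -> (100, 2)
```

Running t-SNE on the 30 PCs instead of all 2000 dimensions is what saves the computational resources and time mentioned above.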

Other best practices that you should also follow include:

(1) Set the perplexity between 5 and 50 (the default is often 30)

(2) Run t-SNE repeatedly and average the output

After the third blog in the series on dimensionality reduction, you now know what t-distributed Stochastic Neighbor Embedding is and how it is used in scRNA-Seq data analysis. In the next blog, we will discuss the last of the most popular dimensionality reduction methods: UMAP.

The content of these blogs is only meant to be introductory. If you would like to know more about the mathematical basis or the algorithms of t-SNE, I suggest the following resources:

– The original van der Maaten paper

– T-SNE Python tutorial

If you need scRNA-seq-related help, Dolomite Bio offers end-to-end Single-Cell Consultancy Service that helps you through one or more steps of the workflow:

- Sample preparation
- Library preparation
- Sequencing
- Computational Analysis

Need help with your single cell data analysis? Check out Dolomite Bio’s new Bioinformatics Service


The post PCA method – Dimensionality Reduction in Single Cell Genomics appeared first on Dolomite Bio Blog.

In the previous blog, we introduced dimension reduction in single cell RNA-Sequencing (scRNA-Seq). We have learned that single cell data are sparse, noisy, and high-dimensional and that dimension reduction is needed to turn the data into something more manageable. In this blog, we will discuss the dimension reduction method PCA.

Principal Component Analysis (PCA) is a method that helps you focus on key variables while ignoring noises and distractions. It compresses the original data and only captures the essence.

Through correlations between dimensions, PCA finds the minimum number of variables (principal components) that keep the most information. There is a principal component (PC) for each gene: if you have 300 genes, you have 300 dimensions and (up to) 300 components. To learn how PCs specifically apply to single cell research, take a look at this article.

How PCA captures variation between datapoints is nicely visualized by Figure 1 which labels the first 2 PCs with arrows, showing that those 2 PCs display the largest amount of variation. Other PCs of the data would have some component of PC1 or PC2.

It then orders the PCs by their degree of variability: PC1 spans the most variable dimension, PC2 the second most, PC3 the third most, and so on. A scree plot can show us how well the PCs explain the variation, and we can see that the variation plateaus off after a while. This plateau is where we set our cut-off point.

For a scRNA-Seq dataset, the number of PCs to be kept is typically between 30 and 50, as they usually explain almost all of the variation. Depending on the size of the dataset, you may need to keep more or fewer PCs. 30 is a good starting point for the analysis and can be adjusted according to the cut-off point in the scree plot.
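The numbers behind a scree plot, the per-PC explained variance ratios, can be computed with a numpy-only PCA sketch via SVD (random data stands in for a real expression matrix; the `pca_explained_variance` helper is illustrative):

```python
import numpy as np

def pca_explained_variance(X, n_pcs):
    """PCA via SVD on the centered matrix; returns the fraction of total
    variance explained by each of the first n_pcs components."""
    Xc = X - X.mean(axis=0)                      # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values
    ratios = s**2 / np.sum(s**2)                 # scree-plot heights
    return ratios[:n_pcs]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:, 0] *= 10  # one dominant direction of variation
ratios = pca_explained_variance(X, 5)
print(ratios[0] > ratios[1])  # -> True: PC1 captures the inflated axis
```

Plotting `ratios` against the PC index and looking for the plateau is exactly the cut-off procedure described above.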

The use of PCA is two-fold:

- PCA helps to filter out noise as a basis for downstream analysis and the number of PCs used can be determined through a Scree plot.
- PCA can also be used for visualization. Typically, the first 2 PCs are displayed, which explain the majority of the variance.

PCA is highly computationally efficient – quick and easy.

PCA preserves both long-range (global) and short-range (local) structure in the data.

PCA can be computed iteratively, and each component is independent of the others. To change the calculations from k dimensions to (k+1) dimensions, you only need to add a few more lines of calculation. This is also useful if you want to drop your least useful PC while still retaining most of your variance.

In analyzing actual scRNA-seq data, PCA can give you a quick “sanity check”. For example, to check if replicates are clustering together, or if different conditions produce unexpected effects.

PCA is good as the first method of dimensional reduction. Before using other reduction and clustering techniques, you can use PCA to select the top 10-50 principal components.

ScRNA-seq data is high-dimensional and highly nonlinear (lots of dropout 0s), while PCA is a linear technique. PCA assumes the original data is linear and normally distributed. These two assumptions are NOT applicable to scRNA-seq data. Under no circumstance should PCA be used as the only visualization technique. It is best used as the first dimensionality reduction method before t-SNE or UMAP is deployed. Before using PCA, data should also be scaled (for instance, with the ScaleData command in Seurat).
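A minimal sketch of that per-gene scaling step (z-score, then clip extreme values), analogous in spirit to Seurat's ScaleData but not its exact implementation; the `scale_genes` helper and toy matrix are illustrative:

```python
import numpy as np

def scale_genes(X, max_value=10.0):
    """Z-score each gene across cells, then clip outliers, so that
    highly expressed genes do not dominate the PCs."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)  # epsilon avoids /0
    return np.clip(Z, -max_value, max_value)

# Two genes with the same pattern but 100x different magnitude
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Z = scale_genes(X)
print(np.allclose(Z[:, 0], Z[:, 1]))  # -> True: both genes now on one scale
```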

The new features (components) created by PCA have no intrinsic meaning. Researchers who do not correctly understand PCA will try to assign real-world implications to the components, leading to incorrect interpretations.

Weighing these pros and cons, PCA is best used as the first dimensionality reduction technique. It is highly computationally efficient, so it gives us a quick sanity check to assess the next best course of action for a given scRNA-Seq dataset.

PCA will also make our work faster and easier if we have further need for t-SNE or UMAP, as it has already made the data more compact and useful.

However, you need to be mindful not to assign any real-world interpretation to the PCs. PCA obscures your features, and wrong interpretation will introduce further problems downstream.

After the second blog in the series on dimensionality reduction, you will have now learned what Principal Component Analysis is and how it is used for the analysis of scRNA-Seq data. In the next set of blogs, we will continue discussing the other popular methods of dimensionality reduction: t-SNE and UMAP.

The content of these blogs is only meant to be an introduction to the topic of dimension reduction. If you would like to know more about the mathematical basis or the algorithms of PCA, I suggest the following resources:

– Using PCA on Python/Scikit-learn

– Dimension reduction for scRNA-Seq

If you need scRNA-Seq related help, Dolomite Bio offers end-to-end Single-Cell Consultancy Service that helps you through one or more steps of the workflow:

- Sample preparation
- Library preparation
- Sequencing
- Computational Analysis

