
kmvanden/microbiome_analysis


Microbiome Analysis

Why Microbiome Data Are Statistically Challenging

Compositionality: microbiome data represent relative abundances, not absolute counts, which means that the data are constrained. Since all values are non-negative and sum to a fixed constant (i.e., sum to 1), the data exist in a simplex, not in standard Euclidean space. Furthermore, since the total abundance is fixed, spurious correlations and dependencies are created between taxa (i.e., an increase in the proportion of one taxon must be accompanied by decreases in the others, even if their absolute abundances did not change). Many statistical models assume that variables are independent and unconstrained in Euclidean space, and microbiome data violate these assumptions.

  • Solution: Log-ratio transformations move compositional data from the simplex into a standard Euclidean (unconstrained) space suitable for standard statistics. Alternatively, some methods model compositionality explicitly, like ALDEx2, which uses CLR transformation (a log-ratio transformation based on the geometric mean of all taxa in a given sample) and Monte Carlo sampling from a Dirichlet-multinomial distribution.
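The CLR transformation described above can be sketched in a few lines of pure Python (the `clr` helper and the pseudocount value are assumptions for illustration, not ALDEx2's implementation):

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's counts.

    A pseudocount is added so zeros don't break the log; each log value
    is centered on the log of the geometric mean over all taxa.
    """
    vals = [c + pseudocount for c in counts]
    log_vals = [math.log(v) for v in vals]
    gm_log = sum(log_vals) / len(log_vals)  # log of the geometric mean
    return [lv - gm_log for lv in log_vals]

sample = [10, 0, 90]          # raw counts for three taxa
transformed = clr(sample)
# CLR values for a sample always sum to zero (up to floating-point error),
# reflecting that one degree of freedom is lost to the compositional constraint
print(len(transformed))  # → 3
```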

High dimensionality: microbiome datasets typically contain many more features (taxa) than samples (p >> n), whereas most standard statistical tests assume that the number of samples exceeds the number of features. In small datasets, chance fluctuations in relative abundances can appear as significant differences, and a high number of features relative to the number of samples increases the risk of this happening. Additionally, with high dimensional data, the regression problem becomes underdetermined: you have more unknowns (coefficients) than equations (observations), so there are an infinite number of solutions for the coefficients. In logistic regression, this creates a flat optimization surface (ill-conditioned), which can result in non-convergence, numerical instability (very large coefficients) and overfitting.

  • Solution: Dimensionality reduction techniques (e.g., PCA or PCoA) reduce the number of features to a smaller set of composite (latent) features. Feature selection by filtering low-abundance or low-prevalence taxa, regularization techniques (e.g., Lasso) or recursive feature elimination can help to identify a smaller subset of features to be used in subsequent analyses.

Overdispersion and zero-inflation: in microbiome data, most taxa are absent (zero counts) in most samples, due to both rare taxa being below detection thresholds (rounded or sampling zeros) and due to true biological absence of certain taxa (true or structural zeros). Sparsity is a major contributor to the overdispersion observed in microbiome count data: a species may be absent in most samples, but have very large values in other samples when it is present. Overdispersion violates the assumptions made by models used with count data, like the Poisson distribution, which assume that the variance does not exceed the mean. Additionally, the sparsity of microbiome data also results in zero-inflation (i.e., more zeros than expected under typical count distributions), due to the presence of both sampling and structural zeros.

  • Solution: negative binomial models (e.g., in DESeq2) have a dispersion parameter that allows them to model variance independently from the mean and to therefore capture the overdispersion present in microbiome data. Zero-inflation can also be controlled for by using zero-inflated negative binomial (ZINB) models, which explicitly model the zero counts as separate processes (a structural zero process and a negative binomial count process). Additionally, feature filtering before analysis (e.g., removing features found in fewer than 10% of the samples) can help to reduce the impact of zero-inflation.
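The prevalence filter mentioned above (e.g., dropping taxa seen in fewer than 10% of samples) can be sketched as follows (pure Python; the function name and matrix layout are assumptions for illustration):

```python
def filter_low_prevalence(counts, min_prevalence=0.10):
    """counts: samples x taxa matrix (list of equal-length lists).
    Drop taxa observed (count > 0) in fewer than min_prevalence of samples."""
    n_samples = len(counts)
    n_taxa = len(counts[0])
    keep = [j for j in range(n_taxa)
            if sum(1 for i in range(n_samples) if counts[i][j] > 0)
               >= min_prevalence * n_samples]
    return [[row[j] for j in keep] for row in counts], keep

table = [
    [5, 0, 0],
    [3, 1, 0],
    [8, 0, 0],
    [2, 4, 0],
]
filtered, kept = filter_low_prevalence(table, min_prevalence=0.5)
print(kept)  # → [0, 1]  (taxon 2, absent everywhere, is dropped)
```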

Alpha Diversity

Alpha diversity measures the diversity within a single sample. Traditional alpha diversity metrics come from classical ecology and include:

  • Richness: measures the total number of unique taxa.
  • Shannon index: measures how hard it is to predict the identity of an individual randomly drawn from the community. Increasing richness and/or increasing evenness increases the uncertainty and thus the Shannon index. Sensitive to rare taxa since they increase diversity.
  • Simpson index: in its classic form (D = Σp²), measures the probability that two randomly drawn individuals are from the same taxon; increasing evenness decreases D (its complement, 1 − D, gives the probability that two draws are from different taxa). Abundant taxa have a large effect on the measure, whereas rare taxa contribute very little.
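All three metrics can be computed directly from a vector of counts; a minimal sketch (the helper name is an assumption, and the Gini-Simpson form 1 − Σp² is used for the Simpson value):

```python
import math

def alpha_diversity(counts):
    """Richness, Shannon index, and Gini-Simpson index for one sample."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    richness = len(props)
    shannon = -sum(p * math.log(p) for p in props)
    simpson = 1 - sum(p * p for p in props)  # P(two random draws differ)
    return richness, shannon, simpson

even = [25, 25, 25, 25]
uneven = [97, 1, 1, 1]
# Same richness, but the even community scores higher on Shannon and Gini-Simpson
r, sh, si = alpha_diversity(even)
print(r, round(sh, 3), round(si, 3))  # → 4 1.386 0.75
```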

Traditional diversity metrics assume an equal sampling effort (i.e., equal sequencing depth), but microbiome data typically have uneven sequencing depths.

  • Rarefaction: subsampling of each sample to the same sequencing depth (typically that of the shallowest sample) removes differences in sequencing depth (i.e., deeper samples won’t appear more diverse).
  • The use of rarefaction is disputed because it discards data (statistical power is decreased by shallower samples), introduces random variation (the subsampling is stochastic), can bias diversity estimation (especially for rare taxa) and does not correct for unobserved species.
  • Observed richness is very sensitive to sequencing depth and thus requires sampling depth to be normalized. Shannon diversity is moderately sensitive to sequencing depth and Simpson diversity is barely affected by sequencing depth.
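Rarefaction itself is a simple subsampling-without-replacement step; a minimal sketch (the `rarefy` helper is hypothetical, not the implementation of any particular package):

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample one sample's counts to a fixed depth without replacement."""
    rng = random.Random(seed)
    # expand counts into a pool of individual reads labeled by taxon index
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    draw = rng.sample(pool, depth)
    rare = [0] * len(counts)
    for taxon in draw:
        rare[taxon] += 1
    return rare

deep = [500, 300, 200]  # 1000 reads total
rarefied = rarefy(deep, depth=100)
print(sum(rarefied))  # → 100
```

Note that the result is stochastic: two rarefactions with different seeds generally give slightly different counts, which is one of the criticisms listed above.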

To avoid rarefaction and to better model the underlying microbial diversity, newer methods infer latent (unseen) diversity using statistical methods.

  • Chao1: estimates total species richness. Chao1 infers the number of unseen species by using the observed counts of singletons and doubletons and the assumption that individuals are randomly and independently sampled from the community.
  • Breakaway: estimates true species richness by fitting a rational function model (a ratio of polynomials) to the frequency count data (frequency of the frequencies) using weighted nonlinear least squares to model the unobserved portion of the community (unseen taxa). Weights are assigned based on the estimated variance of each frequency count (i.e., high variance points like singletons receive less weight).
  • DivNet: estimates Shannon and Simpson diversity by modeling latent relative abundances (true but unobserved proportions) of the taxa in the community. It assumes that the additive log-ratio (alr)-transformed latent proportions follow a multivariate normal distribution and that the observed counts result from a multinomial sampling process of the inverse alr-transformed latent proportions. DivNet uses maximum likelihood estimation to jointly estimate the mean vector (μ) and the covariance matrix (Σ) that define the latent distribution, and from that, derives the diversity estimates (the latent proportions most likely to have produced the observed counts) and their associated uncertainty.
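Of the estimators above, Chao1 has a simple closed form based on singleton and doubleton counts; a sketch of the bias-corrected variant, which remains defined when no doubletons are observed (function name assumed for illustration):

```python
def chao1(counts):
    """Chao1 richness estimate from one sample's counts.

    Bias-corrected form: S_obs + F1*(F1 - 1) / (2*(F2 + 1)),
    where F1 = number of singletons and F2 = number of doubletons.
    """
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

sample = [10, 4, 1, 1, 1, 2, 2, 0]  # 7 observed taxa, 3 singletons, 2 doubletons
print(chao1(sample))  # → 8.0
```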

Alpha diversity values often exhibit heteroscedasticity (unequal variances across groups) and non-normal distributions. As a result, non-parametric statistical tests are typically used. These tests do not assume normality and are based on ranked data rather than raw values.

  • Wilcoxon Rank-Sum test for comparing two groups.
  • Kruskal-Wallis test for comparing more than two groups.

Betta: a regression framework that performs statistical testing on alpha diversity estimates produced by breakaway or DivNet, while explicitly accounting for sampling variability in those estimates. It uses the alpha diversity values and their standard errors to fit a linear model, where the data points are weighted with respect to the precision (inverse variance) of each estimate. This allows hypothesis testing or covariate modeling that properly incorporates uncertainty due to unobserved taxa, sequencing depth, and variability in taxon detection.

Betta and standard Wilcoxon and Kruskal-Wallis tests cannot account for the correlation of repeated measures (i.e., they assume that observations are independent).

  • Linear mixed-effects models (LMM): model both fixed effects (e.g., time and treatment) and random effects (e.g., subject id). Assumes that data can be transformed to be approximately normal (e.g., using a CLR transformation).
    • Estimated marginal means can be used to compute the mean response for each factor level, adjusted for other predictors in the model.
    • Breakaway and DivNet diversity estimates should be weighted by inverse variance to incorporate the computed uncertainty.
  • Generalized additive mixed models (GAMM): model both fixed effects (e.g., time and treatment) and random effects (e.g., subject id). They allow non-linear relationships between predictors and the response via smooth functions (splines). GAMMs can overfit if there are few time points per subject, so the number of basis functions/knots (k) for the spline should be chosen in accordance with the number of time points and the estimated degrees of freedom (edf) for each smooth term should be checked.
    • Breakaway and DivNet diversity estimates should be weighted by inverse variance to incorporate the computed uncertainty.

Beta-Diversity

Beta diversity measures between-sample differences in whole microbiome communities (i.e., it asks whether samples from different groups are more different from each other than samples from within a given group).

Bray-Curtis distance: measures the compositional dissimilarity between two samples based on the abundances of taxa present in at least one of the samples. It is computed as the weighted sum of absolute differences, where weights are the abundances, thus the metric is dominated by abundant taxa.

  • Values range from 0 (samples are identical) to 1 (completely dissimilar/no shared taxa)
  • Pre-processing: low prevalence taxa should be removed (sensitive to sparsity) and counts should be transformed to relative abundances to avoid bias from sequencing depth.
  • Limitations: does not explicitly account for compositionality and tends to be biased toward abundant taxa.

Canberra distance: measures compositional dissimilarity between two samples based on the abundances of taxa present in at least one of the samples. It sums the ratios of the absolute differences to the sums of the abundances for each taxon, thereby giving relatively equal weight to rare and abundant taxa (i.e., rare taxa contribute proportionally more than they would in Bray-Curtis).

  • Values range from 0 (samples are identical) to 1 (completely dissimilar/no shared taxa)
  • Pre-processing: low prevalence taxa should be removed (sensitive to sparsity) and counts should be transformed to relative abundances to avoid bias from sequencing depth.
  • Limitations: does not explicitly account for compositionality.

Jaccard distance: measures the dissimilarity between two samples based on presence/absence of taxa, ignoring their abundances. It quantifies how many taxa are shared versus unique to each sample.

  • Values range from 0 (samples have identical sets of taxa) to 1 (no shared taxa)
  • Pre-processing: low prevalence taxa should be removed (sensitive to sparsity) and counts should be converted to presence/absence data (any taxon with a count > 0 becomes 1 (present), otherwise it becomes 0 (absent)). No normalization is needed since it does not depend on abundance.
  • Limitations: ignores abundance information and presence/absence data can be heavily influenced by sampling noise and sequencing depth.

Euclidean distance: the geometric (straight-line) distance between two samples in multidimensional space, where each dimension corresponds to a feature and the coordinate of a sample along each axis is determined by the abundance of that feature.

  • Values range from zero (samples are identical) to larger positive values (greater overall differences in abundance)
  • Pre-processing: low prevalence taxa should be removed (sensitive to sparsity) and counts should be log transformed (reduces the influence of very abundant taxa)
  • Limitations: sensitive to magnitude differences (differences are squared) and thus influenced by highly abundant taxa and sequencing depths.

Aitchison distance: the Euclidean distance between samples after centered log-ratio (CLR) transformation (which projects the compositional data from the simplex to Euclidean space).

  • Values range from zero (log-ratios are identical) to large values (greater differences in log-ratio structure)
  • Pre-processing: counts need to be CLR-transformed (handles the compositional constraint of microbiome data).
  • Limitations: interpretation is less intuitive (represents how much log-ratios between taxa, rather than absolute or relative abundances, differ between samples).
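The simpler of the distance measures above can be sketched in a few lines of pure Python (helper names and the pseudocount are assumptions for illustration):

```python
import math

def bray_curtis(a, b):
    """Sum of absolute differences over the total abundance of both samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

def jaccard(a, b):
    """1 - (shared taxa / taxa present in either sample); presence/absence only."""
    pa = {i for i, x in enumerate(a) if x > 0}
    pb = {i for i, x in enumerate(b) if x > 0}
    return 1 - len(pa & pb) / len(pa | pb)

def aitchison(a, b, pseudocount=0.5):
    """Euclidean distance between CLR-transformed samples."""
    def clr(v):
        logs = [math.log(x + pseudocount) for x in v]
        m = sum(logs) / len(logs)
        return [l - m for l in logs]
    ca, cb = clr(a), clr(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))

s1 = [10, 0, 30, 60]  # relative abundances (both samples sum to 100)
s2 = [10, 5, 30, 55]
print(round(bray_curtis(s1, s2), 3), round(jaccard(s1, s2), 2))  # → 0.05 0.25
```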

DivNet: estimates Bray-Curtis and Euclidean distances by modeling latent relative abundances (true but unobserved proportions) of the taxa in each community. It assumes that the additive log-ratio (alr)-transformed latent proportions follow a multivariate normal distribution and that the observed counts result from a multinomial sampling process of the inverse alr-transformed latent proportions. DivNet uses maximum likelihood estimation to jointly estimate the mean vector (μ) and the covariance matrix (Σ) that define the latent distribution. Bray-Curtis and Euclidean distances are then calculated from the inferred latent relative abundances, capturing both observed data and sampling variability of each sample.

Ordination methods: reduce high dimensional data into a few dimensions (typically 2 to 3) while preserving relationships between the samples as much as possible in order to allow visualization and interpretation of patterns in beta diversity.

  • Principal Coordinate Analysis (PCoA): represents samples in a low-dimensional, Euclidean space, while preserving the pairwise distances (or dissimilarities) between them as faithfully as possible. It assumes that the input distance matrix (not the raw data) can be faithfully represented in Euclidean space. The resulting axes represent the amount of variation in the distance matrix explained by each coordinate (interpretable when using metric distances).

  • Non-metric Multidimensional Scaling (NMDS): represents samples in a low-dimensional space based on rank orders of pairwise distances (not the distances themselves). NMDS does not assume that the data can be projected into Euclidean space and calculates a stress value to quantify how well the configuration preserves the rank-order of the original dissimilarities. It works with any distance metric, but the axes have no direct meaning (no percentage of variance explained).

  • Principal Component Analysis (PCA): represents samples in a low-dimensional, Euclidean space by preserving the variance in the original feature table (e.g., taxa abundances, not pairwise distances). PCA assumes the data lie in Euclidean space and that linear combinations of features meaningfully capture variation. PCA is valid only when data have been transformed appropriately (e.g., CLR-transformed data). The resulting axes are ordered by the amount of variance they explain in the data.

  • Constrained ordination analysis: attempts to explain variation in community composition using external variables (e.g., treatment group) by combining multivariate regression with ordination to determine the portion of the community variation that is constrained by these variables.

    • Redundancy Analysis (RDA): models variation in the feature table and assumes that the data can be represented using Euclidean distances and linear relationships.
    • Distance-based Redundancy Analysis (db-RDA): models variation in distance matrices and does not require the distance to be Euclidean.

testBetaDiversity: simulates multiple plausible versions of the latent relative compositions for each sample from the estimated multivariate normal distribution. For each simulation, it calculates Bray-Curtis and Euclidean distances between samples, reflecting both observed and unobserved diversity. To test whether within-group versus between-group distances differ, it permutes the sample labels across the simulations to construct a null distribution. This approach explicitly accounts for sampling uncertainty and compositional structure.

PERMANOVA: uses a single, fixed distance matrix computed from observed data (doesn’t account for sampling variability or latent uncertainty). In order to test whether the centroids of the groups differ in multivariate space, a null distribution is generated by permuting sample labels.

  • Does not assume normality (i.e., non-parametric).
  • Assumes equal dispersion (variance) across groups. If this assumption is violated (e.g., one group is more dispersed), it can yield false positives even when group centroids are not meaningfully different, thus equal dispersion should be verified (betadisper).
  • Repeated measurements should be handled by restricting permutations within a given subject (strata = subject_id) to prevent false positives arising from correlation (non-independence) among repeated measurements from the same subject. strata does not model correlation between the repeated measurements.
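The permutation logic behind PERMANOVA can be sketched as follows. This uses a simplified statistic (mean between-group distance minus mean within-group distance) rather than the pseudo-F computed by adonis, so it illustrates the idea of label permutation on a fixed distance matrix, not the exact method:

```python
import random
from itertools import combinations

def permutation_test(dist, labels, n_perm=999, seed=0):
    """PERMANOVA-style permutation test on a fixed distance matrix.

    Statistic: mean between-group distance minus mean within-group
    distance. The null distribution is built by shuffling sample labels.
    """
    def stat(lab):
        within, between = [], []
        for i, j in combinations(range(len(lab)), 2):
            (within if lab[i] == lab[j] else between).append(dist[i][j])
        return sum(between) / len(between) - sum(within) / len(within)

    observed = stat(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# two well-separated groups of four samples each: within-group distances
# are 0.1, between-group distances are 0.9
n = 8
dist = [[0.0 if i == j else (0.1 if (i < 4) == (j < 4) else 0.9)
         for j in range(n)] for i in range(n)]
obs, p = permutation_test(dist, ["A"] * 4 + ["B"] * 4)
print(round(obs, 2))  # → 0.8
```

Restricting permutations within subjects (the `strata` idea) would amount to shuffling labels only inside each subject's block rather than across the whole label vector.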

Differential Abundance Analysis

The identification of individual microbial taxa whose relative abundances differ significantly between groups.

ALDEx2 (ANOVA-Like Differential Expression): assumes raw counts are the result of a random sampling process from an underlying set of true feature abundances that can be modeled using a multinomial distribution (i.e., the observed counts are drawn based on the true (but unknown) proportions of each feature). Since the true feature abundances are unknown, ALDEx2 uses a Dirichlet distribution to model the uncertainty about what the true proportions could be given the observed counts. The Dirichlet distribution represents a probability distribution over possible parameter values of the multinomial distributions (i.e., over plausible feature proportions).

ALDEx2 uses Monte Carlo sampling to draw 128 (by default) probability vectors from the Dirichlet posterior, each representing a plausible instance of the underlying feature composition. The 128 sampled compositions are transformed using the centered log-ratio (CLR) transformation to project the data into an unconstrained (Euclidean) space. The Wilcoxon test is then applied to each feature in each of the 128 Monte Carlo instances and the results are aggregated to produce estimates of the expected effect sizes (median per feature CLR differences between groups across all Monte Carlo instances) and p-values. Multiple testing correction is performed using the Benjamini-Hochberg FDR procedure.
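The Dirichlet Monte Carlo step can be sketched in pure Python (Dirichlet draws via normalized gamma variates, with an assumed 0.5 prior; a simplified illustration of the idea, not ALDEx2's implementation):

```python
import math
import random

def dirichlet_clr_instances(counts, n_mc=128, seed=0):
    """Draw Monte Carlo instances of plausible compositions from a
    Dirichlet posterior (observed counts plus a 0.5 prior), then
    CLR-transform each instance."""
    rng = random.Random(seed)
    instances = []
    for _ in range(n_mc):
        # a Dirichlet draw is a vector of gamma variates scaled to sum to 1
        gammas = [rng.gammavariate(c + 0.5, 1.0) for c in counts]
        total = sum(gammas)
        logs = [math.log(g / total) for g in gammas]
        m = sum(logs) / len(logs)
        instances.append([l - m for l in logs])  # CLR: center on geometric mean
    return instances

mc = dirichlet_clr_instances([120, 3, 0, 40])
print(len(mc), len(mc[0]))  # → 128 4
```

Each of the 128 rows is one plausible CLR-transformed composition; a test statistic computed on every row and then aggregated (e.g., by the median) reflects the sampling uncertainty in the observed counts.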

MaAsLin2 (Multivariable Association with Linear Models): normalizes and transforms the count data according to user-defined settings. It is not inherently compositionally aware, but if normalization is set to CLR, MaAsLin2 will use total sum scaling to convert the data to relative abundances (which adjusts for differences in sequencing depth) and then will apply CLR transformation (which projects the data from the simplex into Euclidean space, allowing valid application of linear models).

MaAsLin2 fits a separate general linear model to each taxon to model how its abundance varies with respect to metadata variables (e.g., condition, sex, BMI), while adjusting for potential confounders. MaAsLin2 assumes that the relationships between covariates and abundance are linear, that samples are independent (unless modeled with random effects), that there is no multicollinearity (i.e., covariates are not highly correlated), and that residuals are normally distributed and homoscedastic (features have similar variances).

To test whether a covariate significantly affects abundance, MaAsLin2 uses t-tests with Satterthwaite-approximated degrees of freedom if random effects are included, and standard t-tests if the models do not include random effects to assess the significance of the regression coefficients. The effect size is the regression coefficient from the model fit for each feature and covariate. Multiple testing correction is performed using the Benjamini-Hochberg FDR procedure.

Corncob (COunt RegressioN for Correlated Observations with the Beta-binomial): assumes that the true underlying relative abundance of a taxon across samples follows a beta distribution (a probability distribution defined by two shape parameters, mean and dispersion) and that the observed count of a taxon in a given sample can be modeled as a binomial draw from the total number of reads in that sample, using the sample-specific true abundance as the success probability. Corncob can incorporate random effects by allowing subject-specific deviations to the mean abundance model (modeled as normally distributed offsets added to the linear predictor).

For each taxon, the mean and the dispersion parameters of the beta distribution are modeled using logistic regression and estimated by iteratively maximizing a log-likelihood function (i.e., finding the parameter values that maximize the probability of observing the actual data), since the beta-binomial model does not have a closed-form solution. Corncob uses the BFGS optimization algorithm, which approximates the Hessian matrix (the matrix of second derivatives of the log-likelihood with respect to the model parameters); the Hessian represents the curvature of the log-likelihood surface and reflects the model's confidence in the parameter estimates.

To test whether a covariate significantly affects the abundance and/or variability, corncob compares the log likelihoods of the full model (with covariate) to the reduced model (without covariate) using the likelihood ratio test (LRT) and evaluates the resulting test statistic using parametric bootstrapping or comparison to a chi-squared distribution. The effect size is the coefficient from the beta regression on the logit scale (i.e., the log odds change in relative abundance per unit change in the covariate). Multiple testing correction is performed using the Benjamini-Hochberg FDR procedure.

DESeq2: accounts for differences in sequencing depth by computing a size factor for each sample, based on the assumption that most taxa are not differentially abundant. For each sample, the count for each taxon is divided by the geometric mean for that taxon across all the samples, and the median of these ratios in a given sample is the size factor for that sample. The counts in each sample are then divided by the size factor to get normalized counts and these counts are assumed to follow a negative binomial distribution (a Poisson distribution with an extra parameter for dispersion).
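The median-of-ratios size factor calculation can be sketched as follows (a simplified illustration of the idea, not DESeq2's code; function name assumed):

```python
import math

def size_factors(counts):
    """Median-of-ratios size factors (DESeq2-style sketch).

    counts: samples x taxa matrix. Taxa with a zero count in any sample
    are excluded, since their geometric mean would be undefined.
    """
    n_taxa = len(counts[0])
    usable = [j for j in range(n_taxa) if all(row[j] > 0 for row in counts)]
    geo_mean = {j: math.exp(sum(math.log(row[j]) for row in counts) / len(counts))
                for j in usable}
    factors = []
    for row in counts:
        # ratio of each taxon's count to its cross-sample geometric mean;
        # the median of these ratios is the sample's size factor
        ratios = sorted(row[j] / geo_mean[j] for j in usable)
        mid = len(ratios) // 2
        med = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(med)
    return factors

table = [
    [100, 200, 400],
    [50, 100, 200],  # sample 2 sequenced at half the depth of sample 1
]
factors_out = size_factors(table)
print([round(f, 3) for f in factors_out])  # → [1.414, 0.707]
```

Dividing each sample's counts by its size factor puts the two samples on a common scale despite the twofold depth difference.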

Since dispersion and log fold change (effect size) can be noisy, especially for low abundance taxa and for small sample sizes, DESeq2 shrinks (regularizes) the dispersion estimate for each taxon toward the overall dispersion-mean trend line (for all taxa) using empirical Bayes shrinkage based on the precision of the dispersion estimate for each taxon and shrinks the log fold changes toward zero based on their uncertainty (standard error).

To test whether a covariate significantly affects the abundance, DESeq2 compares the group variable coefficient of the full model using the Wald test (test statistic assumed to follow a standard normal distribution) or compares the log likelihoods of the full model (with covariate) to the reduced model (without covariate) using the LRT (test statistic assumed to follow a chi-squared distribution). Multiple testing correction is performed using the Benjamini-Hochberg FDR procedure.

LinDA (Linear Models for Differential Abundance): applies CLR transformation (to project the data from the simplex into Euclidean space, allowing valid application of linear models) to either count or proportion data. LinDA assumes that the relationships between covariates and transformed abundance are linear, that samples are independent (unless modeled with random effects), that there is no multicollinearity (i.e., covariates are not highly correlated), and that residuals are normally distributed and homoscedastic (features have similar variances).

After fitting an ordinary least squares model or a linear mixed-effects model (when repeated measures are present) to each taxon, LinDA corrects for compositional bias by subtracting the mode of the regression coefficients across all taxa from each taxon’s coefficient (based on the assumption that most taxa are not associated with the covariate and thus the central tendency of their estimated effects represents compositional bias).

To assess significance, LinDA performs t-tests on the regression coefficients derived from each taxon-specific model. The effect size corresponds to the estimated regression coefficient on the CLR scale, representing the expected change in log-transformed abundance per unit change in the covariate. Multiple testing correction is performed using the Benjamini-Hochberg FDR procedure.

glmmTMB (Generalized linear mixed models using Template Model Builder): can directly model the count data by fitting either a negative binomial model (counts with overdispersion relative to Poisson) or a zero-inflated negative binomial model (counts with overdispersion and excess zeros) to each taxon. glmmTMB assumes that samples are independent (unless modeled with random effects) and that there is no multicollinearity (i.e., covariates are not highly correlated).

To assess significance, glmmTMB performs Wald z-tests (assumes asymptotic normality of coefficients) on the regression coefficients derived from each taxon-specific model. The effect size corresponds to the estimated regression coefficient. Multiple testing correction needs to be applied manually after fitting the models independently to each taxon (e.g., using the Benjamini-Hochberg FDR procedure).
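The Benjamini-Hochberg procedure referenced throughout this section can be sketched as follows (function name assumed for illustration):

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values).

    Rank the p-values ascending; the adjusted value for rank i is
    min over ranks j >= i of p_(j) * m / j, capped at 1.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

q = benjamini_hochberg([0.01, 0.04, 0.03, 0.50])
print([round(x, 4) for x in q])  # → [0.04, 0.0533, 0.0533, 0.5]
```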

Generalized Additive Mixed Models (GAMMs): generalized additive models (GAMs) extend generalized linear models (GLMs) by allowing non-linear relationships between the response and the covariates through the addition of smooth functions. However, any fixed covariates that are not included in a smooth term are assumed to have a linear effect on the scale of the link function (e.g., identity for Gaussian, log for Poisson and logit for binomial).

Smooth functions in GAMs are represented using basis functions like splines. For example, cubic regression splines are constructed from piecewise cubic polynomials joined at points called knots (more knots means more flexibility, but also a higher risk of overfitting). GAMs estimate the smooth functions by maximizing a penalized likelihood that combines the likelihood from the data with a penalty on wiggliness (to minimize overfitting) using penalized iteratively reweighted least squares (P-IRLS).

GAMMs extend GAMs by including random effects to account for repeated measurements or hierarchical structure. Likelihood is iteratively approximated using penalized quasi-likelihood, wherein the model is repeatedly converted into a working linear mixed model in which smooth terms are represented as penalized random-effects-like components. Each iteration fits the mixed model using lme(), updates the smoothness parameters and random effect variances, and repeats this process until convergence is reached.

The parametric terms (covariates not included in the smooth terms) and smooth terms are both estimated using mixed model inference. The regression coefficients of the parametric terms represent their effect sizes and their significance is determined using t-tests. The significance of the smooth terms is determined using a Wald-like test based on the estimated spline coefficients and their covariance matrix, with degrees of freedom derived from the effective degrees of freedom of the spline. Since GAMMs are fit to each taxon separately, multiple testing correction needs to be performed (e.g., using the Benjamini-Hochberg FDR procedure).

sPLS-DA and DIABLO

sPLS-DA (sparse Partial Least Squares Discriminant Analysis) is a supervised, multivariate method to reduce dimensionality, classify samples based on a categorical outcome and select the most discriminative features. The method works well with data that is high-dimensional and/or multicollinear.

  • Dimensionality is reduced by projecting the data onto a few latent components and through feature selection (only the most informative variables per component are retained).
  • Multicollinearity does not cause instability because the method does not require inversion of the covariance matrix (sPLS-DA operates in latent space rather than directly estimating coefficients from the raw variable space).
  • Weighted linear combinations of features are constructed and the combined signal of multiple correlated features is shared across the latent component, which avoids arbitrarily favoring one of the correlated features over the others.

DIABLO (Data Integration for Biomarker discovery using Latent cOmponents) extends the framework of sPLS-DA to integrate multiple omics datasets (blocks) measured on the same samples.

  • DIABLO constructs latent components in a manner similar to that of sPLS-DA, but the components in DIABLO also maximize the correlation between the data blocks in addition to separating the classes.

Data used in both sPLS-DA and DIABLO analysis should be scaled to unit variance to ensure that high-variance variables do not dominate. This is particularly important for multi-block analysis, since different data types can have very different levels of variance.

Both sPLS-DA and DIABLO estimate regression coefficients from latent features, not directly from the original features (DIABLO estimates the coefficients independently for each omics block). The latent components are computed as weighted linear combinations of the original features that maximize the covariance between the predictors and a dummy-coded class membership matrix (which optimizes the separation of the classes in the latent space) and, in the case of DIABLO, also maximize the correlation between components across blocks (as specified by a user-defined design matrix).

Soft-thresholding is applied to retain only the top keepX features per component that contribute the most strongly to class separation (i.e., those that have the highest absolute weights/loadings) and additionally to inter-block correlation in the case of DIABLO. The weights of the selected features are then iteratively optimized to maximize class separation (and agreement between omics layers for DIABLO) in the latent component space (shared latent space for DIABLO).

Once convergence is reached for a given component (i.e., when the feature weights stabilize below a set tolerance or a set number of iterations is reached), the explained variance is removed, and a new orthogonal component is estimated to capture the remaining variance that was not explained by the earlier components.

Multi-Omics Factor Analysis v2 (MOFA2)

MOFA2 is an unsupervised, probabilistic, multivariate method for integrating multiple omics datasets (views) measured on the same (or partially overlapping) set of samples. MOFA2 constructs latent factors that capture the most important sources of variation within and across data views using Bayesian inference. The latent factors can be shared across views (capturing covariation between omics datasets) or view-specific (capturing variation unique to one omics dataset), and are modeled as linear combinations of features.

MOFA2 initializes a probabilistic model structure that includes latent factors, feature weights (loadings) per view and noise and likelihood parameters (Gaussian for continuous data). Variational inference (an iterative optimization algorithm) is used to approximate the posterior distribution of the parameters by optimizing the Evidence Lower BOund (ELBO), which balances fit to the data with model complexity via regularization from priors. Sparsity-inducing priors, such as Automatic Relevance Determination (ARD), which shrinks irrelevant weights and factor-view combinations toward zero, and spike-and-slab priors, which shrink irrelevant weights exactly to zero, promote feature selection and enhance interpretability of the latent factors. The parameters are iteratively updated until convergence or stopping criteria (e.g., maximum number of iterations) is reached.

Network Construction and comparison for Microbiome data (NetCoMi)

Microbial association networks are graphs that model statistical associations (edges) between microbial taxa (nodes) based on abundance data. These associations can represent correlations (marginal relationships) or conditional dependencies (direct relationships). Edges can be weighted (indicating the strength of the association) and signed (where a positive sign suggests co-occurrence and a negative sign suggests co-exclusion).

Centrality measures are used to identify important nodes (hubs) within the network.

  • Degree: the number of adjacent (directly connected) nodes.
  • Betweenness: the fraction of shortest paths between all pairs of nodes that pass through the given node.
  • Closeness: the reciprocal of the sum of the shortest path distances from the node to all other nodes.
  • Eigenvector: a measure of how well-connected a node is to other well-connected nodes.
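
All four measures are available in networkx, for example (a toy graph; node names are arbitrary):

```python
import networkx as nx

# Toy association network: nodes are taxa, edges are associations
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")])

degree      = dict(G.degree())                # number of direct neighbours
betweenness = nx.betweenness_centrality(G)    # fraction of shortest paths through node
closeness   = nx.closeness_centrality(G)      # based on summed shortest-path distances
eigenvector = nx.eigenvector_centrality(G)    # connectivity to well-connected nodes
```

Here node A is the hub: it has the highest degree and lies on most shortest paths.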

Clusters are groups of taxa that are more densely interconnected with each other than with taxa outside the group (i.e., they are densely connected subgraphs within the larger network). Clustering can be performed using:

  • Hierarchical clustering (based on dissimilarity): a dendrogram is created based on pairwise dissimilarity and then cut at a specified height to define the clusters.
  • Edge-betweenness (based on dissimilarity): iteratively removes edges with the highest betweenness to reveal community structure.
  • Modularity-based clustering (based on similarity): for example, fast-greedy modularity optimization (default in NetCoMi) partitions the network into modules by maximizing modularity (a measure that favors many edges within modules and few edges between them).
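
As an illustration, fast-greedy modularity clustering (the analogue of NetCoMi's igraph-based default) on a toy network with networkx:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy network: two densely connected groups (triangles) joined by a single edge
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),   # cluster 1
                  ("D", "E"), ("E", "F"), ("D", "F"),   # cluster 2
                  ("C", "D")])                          # bridge between clusters

# Greedy modularity maximization: many edges within modules, few between them
clusters = greedy_modularity_communities(G)
```

The two triangles are recovered as separate clusters, since cutting the single bridge edge maximizes modularity.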

Sparse Correlations for Compositional data (SparCC): uses log-ratios of relative abundances to recover the log-ratios of the unobserved absolute abundances (the true microbial counts). Because both relative abundances in a sample are divided by the same per-sample total, the total cancels and the identity is exact: log(Arel/Brel) = log(Aabs/Babs) = log(Aabs) - log(Babs).

Working with log-ratios removes the constant-sum constraint inherent in compositional data, allowing the use of standard correlation and covariance tools. SparCC additionally assumes that the true (absolute) microbial abundances follow a log-normal distribution.

SparCC uses an iterative algorithm to approximate the covariance matrix of the log-transformed absolute abundances under the assumption of sparsity (i.e., that most taxa are not strongly correlated). The algorithm first estimates the basis variances of the individual taxa and the covariances between all taxon pairs from the log-ratio variances, assuming most correlations are near zero. It then identifies the most strongly correlated pairs (which violate this assumption), excludes them, and recomputes the basis variances from the remaining pairs. These steps are repeated iteratively until the covariance matrix converges.

Finally, the covariance matrix is scaled by the standard deviation of each taxon to produce a correlation matrix. These correlations represent marginal relationships (i.e., they describe how two taxa vary together without adjusting for the influence of the other taxa).
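
The first pass of this procedure (before any strongly correlated pairs are excluded) can be sketched in numpy on simulated data; this illustrates the estimating equations under the zero-correlation approximation, not the actual SparCC implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical relative abundances: 40 samples x 10 taxa, rows sum to 1
counts = rng.lognormal(mean=2.0, sigma=1.0, size=(40, 10))
rel = counts / counts.sum(axis=1, keepdims=True)

d = rel.shape[1]
log_rel = np.log(rel)

# Variation matrix: t_ij = var(log(x_i / x_j)) across samples
T = np.var(log_rel[:, :, None] - log_rel[:, None, :], axis=0)

# One pass of the basis-variance estimate, assuming most correlations ~ 0:
# t_i = sum_j t_ij = (d - 2) * w_i^2 + sum_k w_k^2
t = T.sum(axis=1)
S = t.sum() / (2 * d - 2)       # sum of all basis variances
w2 = (t - S) / (d - 2)          # basis variance of each taxon

# Basis correlations from t_ij = w_i^2 + w_j^2 - 2 * rho_ij * w_i * w_j
w = np.sqrt(w2)
rho = (w2[:, None] + w2[None, :] - T) / (2 * np.outer(w, w))
```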

Correlation inference for Compositional data through Lasso (CCLasso): estimates correlations between taxa by modeling the covariance matrix of the latent (unobserved) log absolute abundances from log-ratio-transformed compositional data. It assumes that most taxon pairs are uncorrelated (i.e., the true correlation matrix is sparse).

Because this latent covariance matrix cannot be estimated directly from compositional data, CCLasso formulates a convex optimization problem: a least-squares loss on the transformed data with an L1 (Lasso) penalty applied to the off-diagonal entries of the correlation matrix. This sparsity-inducing penalty shrinks small coefficients toward zero, producing a solution that is sparse and numerically stable. CCLasso selects the optimal sparsity level using k-fold cross-validation, minimizing a loss function that balances model fit and model sparsity.

In the resulting correlation matrix, non-zero entries indicate correlations between taxa (i.e., marginal associations that describe how two taxa vary together without adjusting for the other taxa in the network).

Sparse InversE Covariance estimation for Ecological Association and Statistical Inference (SPIEC-EASI): estimates conditional dependencies between taxa by approximating the inverse covariance matrix (i.e., the precision matrix) of log-transformed compositional data (typically using the centered log-ratio transformation). It assumes that most taxon pairs are conditionally independent (i.e., most taxa are not directly associated).

Instead of directly inverting the covariance matrix, which can be numerically unstable with high-dimensional or highly correlated data, SPIEC-EASI formulates a convex optimization problem to estimate a sparse precision matrix. It does this by applying an L1 penalty to promote sparsity, shrinking small coefficients toward zero and producing a solution that is sparse and numerically stable, using one of two methods:

  • Graphical Lasso (glasso): directly estimates the precision matrix by solving a penalized maximum likelihood problem with an L1 penalty applied to the precision matrix entries.
  • Meinshausen-Bühlmann Neighborhood Selection (mb): performs a series of L1-penalized regressions, where each taxon is regressed against all others and the resulting associations are then symmetrized to construct the network.

SPIEC-EASI selects the optimal sparsity level using the Stability Approach to Regularization Selection (StARS), which chooses the regularization parameter that balances network sparsity with stability, assessed as the variability of the selected edges across random subsamples of the data.

In the resulting precision matrix, non-zero entries indicate conditional dependencies between taxa (i.e., the two taxa are directly associated after accounting for all other taxa in the network).
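
A minimal sketch of the mb approach with scikit-learn on simulated data (using a fixed L1 penalty instead of StARS; an illustration of the idea, not the SPIEC-EASI implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
# Hypothetical relative abundances: 100 samples x 8 taxa
counts = rng.lognormal(mean=2.0, sigma=0.5, size=(100, 8))
rel = counts / counts.sum(axis=1, keepdims=True)

# Centered log-ratio (CLR) transformation: log abundance minus the per-sample
# mean log abundance (i.e., the log of the geometric mean)
log_rel = np.log(rel)
clr = log_rel - log_rel.mean(axis=1, keepdims=True)

n, d = clr.shape
beta = np.zeros((d, d))

# Neighborhood selection: L1-penalized regression of each taxon on all others
for j in range(d):
    others = np.delete(np.arange(d), j)
    fit = Lasso(alpha=0.05).fit(clr[:, others], clr[:, j])
    beta[j, others] = fit.coef_

# Symmetrize: keep an edge if either of the two regressions selected it
adjacency = (beta != 0) | (beta.T != 0)
np.fill_diagonal(adjacency, False)
```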

Enterotyping

The unsupervised categorization of microbiome samples into clusters (enterotypes) based on community composition (i.e., the relative abundance of taxa that dominate different groups).

Dirichlet-Multinomial Mixture (DMM) model: one of the most widely used statistical models for enterotyping.

The DMM model assumes that each cluster (enterotype) has its own Dirichlet-Multinomial distribution. For each sample within a given cluster, a probability vector p of taxa abundances is randomly drawn from that cluster’s Dirichlet distribution. The drawn probability vector and the sample’s own library size n define the multinomial distribution from which that sample’s counts are generated. Samples are then probabilistically assigned to the cluster whose Dirichlet-Multinomial distribution best explains their observed counts.
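
The assignment step can be sketched with scipy: compute the Dirichlet-Multinomial log-likelihood of a sample under each cluster's parameters and convert to posterior probabilities (the cluster parameters and counts below are hypothetical, and equal cluster priors are assumed):

```python
import numpy as np
from scipy.special import gammaln

def dm_log_likelihood(x, alpha):
    # Dirichlet-Multinomial log-likelihood of count vector x given Dirichlet
    # parameters alpha (dropping the multinomial coefficient, which is the
    # same for every cluster and cancels when comparing them)
    A, n = alpha.sum(), x.sum()
    return (gammaln(A) - gammaln(n + A)
            + np.sum(gammaln(x + alpha) - gammaln(alpha)))

# Two hypothetical enterotypes: one dominated by taxon 0, one by taxon 2
alphas = [np.array([10.0, 2.0, 1.0, 1.0]),
          np.array([1.0, 2.0, 10.0, 1.0])]

# A sample whose counts are dominated by taxon 0
x = np.array([60.0, 15.0, 5.0, 10.0])

log_liks = np.array([dm_log_likelihood(x, a) for a in alphas])
posterior = np.exp(log_liks - log_liks.max())
posterior /= posterior.sum()          # assuming equal cluster priors
cluster = int(np.argmax(posterior))   # assigned to the taxon-0-dominated cluster
```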

The Dirichlet parameters for each cluster are estimated from the observed count data using maximum likelihood, implemented via the Expectation-Maximization algorithm (EM). During each iteration, the Dirichlet parameters are updated based on a weighted average of the observed counts across all samples, where the weights correspond to each sample’s posterior probability of belonging to that cluster. This process is repeated iteratively to maximize the overall likelihood of the observed data and is continued until convergence is reached.

The model fitting process is repeated separately for a range of user-defined cluster counts (e.g., k = 1 to k = 5), and the optimal number of clusters is selected using a model selection criterion, such as Bayesian Information Criterion or Laplace approximation.

About

Code and statistical explanations for commonly used microbiome analyses
