MOFAcellular - Multicellular Factor Analysis

Background

In Ramirez et al, 2023, we recently showed a repurposed use of the statistical framework of multi-omics factor analysis (MOFA) and MOFA+ to analyse cross-condition single-cell atlases. Specifically, we represented the cross-conditional (e.g. healthy vs. diseased) single-cell transcriptomics data as a multi-view structure, where each cell type represents an individual view that contains summarized information per sample (e.g. pseudobulk). We then applied MOFA to estimate a latent space that captures the variability of the data across samples and cell types.

In this tutorial, we will guide the generation of a multi-view structure from single-cell transcriptomics data and a basic application of MOFA to capture multicellular variability. We make use of the MuData/muon infrastructure.

Load Packages

Install mofa, decoupler, and omnipath via pip with the following commands:

pip install "decoupler>=0.1.4"
pip install mofax
pip install muon
pip install omnipath
[1]:
import numpy as np
import pandas as pd

import scanpy as sc

import plotnine as p9

import liana as li

# load muon and mofax
import muon as mu
import mofax as mofa
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

Load & Prep Data

As a simple example, we will look at ~25k PBMCs from 8 pooled patient lupus samples, each before and after IFN-beta stimulation (Kang et al., 2018; GSE96583). Note that by focusing on PBMCs, for the purpose of this tutorial, we assume that coordinated events occur among them.

This dataset is downloaded from a link on Figshare; preprocessed for pertpy.

[2]:
adata = li.testing.datasets.kang_2018()
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/liana/testing/datasets.py:36: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
[3]:
adata
[3]:
AnnData object with n_obs × n_vars = 24673 × 15706
    obs: 'nCount_RNA', 'nFeature_RNA', 'tsne1', 'tsne2', 'condition', 'cluster', 'cell_type', 'patient', 'nCount_SCT', 'nFeature_SCT', 'integrated_snn_res.0.4', 'seurat_clusters', 'sample', 'cell_abbr'
    var: 'name'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'

Define columns of interest from .obs

Note that we use cell abbreviations because MOFA will use them as labels for the views.

[4]:
sample_key = 'sample'
condition_key = 'condition'
groupby = 'cell_abbr'

Basic QC

Note that this data has been largely pre-processed & annotated, we refer the user to the Quality Control and other relevant chapters from the best-practices book for information about pre-processing and annotation steps.

[5]:
# filter cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
[6]:
# Store the counts to use them later
adata.layers["counts"] = adata.X.copy()

# Normalize and find marker genes per cell type
# We will use those to remove potential noise from the data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.tl.rank_genes_groups(adata, groupby=groupby)
... storing 'sample' as categorical

Showcase the data

[7]:
# Show pre-computed UMAP
sc.pl.umap(adata, color=[condition_key, sample_key, 'cell_type', groupby], frameon=False, ncols=2)
../_images/notebooks_mofacellular_17_0.png

Create a Multi-View Structure

To construct a multi-view structure, we need to define the views. In this case, we will use the adata_to_views function from liana to create a list of views (stored in a MuData object), where each view corresponds to a cell type. Simply put, we summarize the samples to pseudobulks by cell type, and then we create a view for each cell type.

We refer users to decoupler’s get_pseudobulk function for more information about filtering and aggregation options.

[8]:
mdata = li.multi.adata_to_views(adata,
                                groupby=groupby,
                                sample_key=sample_key,
                                keep_stats=True,
                                obs_keys=['condition', 'patient'], # add those to mdata.obs
                                min_prop=0.05, # min nnz values (filter features)
                                min_smpls=3, # min samples per view (filter features)
                                min_cells=25, # min cells per view (filter samples)
                                min_counts=100, # min counts per view (filter samples)
                                mode='sum', # mode of aggregation
                                verbose=True,
                                large_n=5, # edgeR-like filtering
                                min_total_count=15,
                                min_count=10,
                                layer='counts',
                                )
100%|██████████| 8/8 [00:04<00:00,  1.87it/s]

We see that we have 7 modalities (views) that correspond to the sufficiently abundant cell types & genes in the data.

We can also explore the statistics at the pseodobulk level, when keep_stats=True:

[9]:
mdata.uns['psbulk_stats'].head()
[9]:
CD14:psbulk_n_cells CD14:psbulk_counts CD4T:psbulk_n_cells CD4T:psbulk_counts DCs:psbulk_n_cells DCs:psbulk_counts NK:psbulk_n_cells NK:psbulk_counts CD8T:psbulk_n_cells CD8T:psbulk_counts B:psbulk_n_cells B:psbulk_counts FGR3:psbulk_n_cells FGR3:psbulk_counts
ctrl&101 207.0 531210.0 353.0 403847.0 25.0 95330.0 65.0 57550.0 29.0 27892.0 106.0 124446.0 72.0 197588.0
ctrl&1015 859.0 1793578.0 951.0 1113616.0 40.0 146189.0 179.0 177960.0 85.0 94175.0 459.0 556570.0 164.0 438020.0
ctrl&1016 393.0 742765.0 502.0 528665.0 27.0 88298.0 90.0 79082.0 504.0 457417.0 125.0 152335.0 81.0 208274.0
ctrl&1039 126.0 301741.0 201.0 249014.0 NaN NaN NaN NaN NaN NaN 26.0 35016.0 NaN NaN
ctrl&107 210.0 466961.0 180.0 216841.0 NaN NaN 37.0 33243.0 NaN NaN 44.0 52655.0 NaN NaN

Pre-process the pseudobulk profiles

(Optional) Let’s remove what one would consider background marker genes. The intuition here is that marker genes from some cell types should not be deregulated across conditions in others, and thus, they should not be captured by the latent space.

[10]:
# create dictionary of markers for each cell type
markers = {}
top_n = 25
for cell_type in mdata.mod.keys():
    markers[cell_type] = (sc.get.rank_genes_groups_df(adata, group=cell_type).
                          sort_values("scores", key=abs, ascending=False).
                          head(top_n)['names'].
                          tolist()
                          )
[11]:
li.multi.filter_view_markers(mdata, markers=markers, var_column=None, inplace=True)
mdata.update()

Next, we will normalize each of the views independently to ensure that samples are comparable. We will also identify the highly-variable genes per view (i.e. across samples) - this is optional & dependent on our assumptions.

[12]:
for view in mdata.mod.keys():

    sc.pp.normalize_total(mdata.mod[view], target_sum=1e4)
    sc.pp.log1p(mdata.mod[view])

    sc.pp.highly_variable_genes(mdata.mod[view])
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.

We will use the filter_view_markers to remove genes that are considered background.

We suggest users to also consider internal muon functions (e.g. mu.pp.filter_obs & mu.pp.filter_var) to remove cells and variables that are not informative.

Note

View Representation

MOFA supports the flexible representation of views, where each view can represent a different type of features (e.g. genes, proteins, metabolites, etc.). In this case, we simply allow for different genes to be used in each view.

Fitting a MOFA model

Now that the single-cell data is transformed into a multi-view representation, we can use MOFA to run a multicellular factor analysis.

We will attempt to capture the variability across samples and the different cell-types by reducing the data into a number of factors, where each factor captures the coordinated gene expression across cell types.

[13]:
mu.tl.mofa(mdata,
           use_obs='union',
           convergence_mode='medium',
           n_factors=5,
           seed=1337,
           outfile='models/mofacellx.h5ad',
           use_var='highly_variable'
           )

        #########################################################
        ###           __  __  ____  ______                    ###
        ###          |  \/  |/ __ \|  ____/\    _             ###
        ###          | \  / | |  | | |__ /  \ _| |_           ###
        ###          | |\/| | |  | |  __/ /\ \_   _|          ###
        ###          | |  | | |__| | | / ____ \|_|            ###
        ###          |_|  |_|\____/|_|/_/    \_\              ###
        ###                                                   ###
        #########################################################



Loaded view='CD14' group='group1' with N=16 samples and D=4061 features...
Loaded view='CD4T' group='group1' with N=16 samples and D=5457 features...
Loaded view='DCs' group='group1' with N=16 samples and D=1389 features...
Loaded view='NK' group='group1' with N=16 samples and D=1371 features...
Loaded view='CD8T' group='group1' with N=16 samples and D=787 features...
Loaded view='B' group='group1' with N=16 samples and D=1997 features...
Loaded view='FGR3' group='group1' with N=16 samples and D=1757 features...


Model options:
- Automatic Relevance Determination prior on the factors: True
- Automatic Relevance Determination prior on the weights: True
- Spike-and-slab prior on the factors: False
- Spike-and-slab prior on the weights: True
Likelihoods:
- View 0 (CD14): gaussian
- View 1 (CD4T): gaussian
- View 2 (DCs): gaussian
- View 3 (NK): gaussian
- View 4 (CD8T): gaussian
- View 5 (B): gaussian
- View 6 (FGR3): gaussian




######################################
## Training the model with seed 1337 ##
######################################



Converged!



#######################
## Training finished ##
#######################


Warning: Output file models/mofacellx.h5ad already exists, it will be replaced
Saving model in models/mofacellx.h5ad...
Saved MOFA embeddings in .obsm['X_mofa'] slot and their loadings in .varm['LFs'].

Exploring the MOFA model

For convenience, we provide simple getter function to access the model parameters, in addition to those available via the MuData API & the MOFA model itself.

[14]:
# obtain factor scores
factor_scores = li.ut.get_factor_scores(mdata, obsm_key='X_mofa', obs_keys=['condition', 'patient'])
factor_scores.head()
[14]:
sample Factor1 Factor2 Factor3 Factor4 Factor5 condition patient
0 ctrl&101 0.783439 -0.003390 -0.011637 -0.041593 -0.025567 ctrl patient_101
1 ctrl&1015 0.615917 0.000925 -0.013685 -0.020768 -0.008345 ctrl patient_1015
2 ctrl&1016 0.618631 -0.012139 -0.025390 -0.049042 -0.018271 ctrl patient_1016
3 ctrl&1039 0.684361 -0.020115 -0.009010 -0.010894 0.094717 ctrl patient_1039
4 ctrl&107 0.737042 0.076451 0.075220 -0.005448 0.000316 ctrl patient_107

Let’s check if any of the factors are associated with the sample condition:

[15]:
 # we use a paired t-test as the samples are paired
from scipy.stats import ttest_rel
[16]:
# split in control and stimulated
group1 = factor_scores[factor_scores['condition']=='ctrl']
group2 = factor_scores[factor_scores['condition']=='stim']

# get all columns that contain factor & loop
factors = [col for col in factor_scores.columns if 'Factor' in col]
for factor in factors:
    print(ttest_rel(group1[factor], group2[factor]))
TtestResult(statistic=55.78013342160475, pvalue=1.5609246865653282e-10, df=7)
TtestResult(statistic=0.43713843548812376, pvalue=0.6751792840118616, df=7)
TtestResult(statistic=-0.8234150426439611, pvalue=0.43742190236387524, df=7)
TtestResult(statistic=-2.0057407304296926, pvalue=0.08489746148937447, df=7)
TtestResult(statistic=0.335859769656824, pvalue=0.7468188651111132, df=7)

We can see that the first factor is associated with the sample condition, let’s plot the factor scores:

[17]:
# scatterplot
(p9.ggplot(factor_scores) +
 p9.aes(x='condition', colour='condition', y='Factor1') +
 p9.geom_violin() +
 p9.geom_jitter(size=4, width=0.2) +
 p9.theme_bw() +
 p9.scale_colour_manual(values=['#1f77b4', '#c20019'])
 )
Fontsize 0.00 < 1.0 pt not allowed by FreeType. Setting fontsize = 1 pt
../_images/notebooks_mofacellular_45_1.png

Now that we have identified a factor that is associated with the sample condition, we can check the features with the highest loadings associated with each cell type:

[18]:
variable_loadings =  li.ut.get_variable_loadings(mdata, varm_key='LFs', view_sep=':') # get loadings
# order features by absolute value for Factor 1
variable_loadings = variable_loadings.sort_values(by='Factor1', key=lambda x: abs(x), ascending=False)
variable_loadings.head()
[18]:
view variable Factor1 Factor2 Factor3 Factor4 Factor5
1592 CD14 IFIT3 -2.540583 -0.316007 -0.139563 -0.341058 0.438565
857 CD14 CXCL11 -2.526854 0.111960 -0.300244 0.035542 0.054032
1590 CD14 IFIT1 -2.513345 -0.175255 -0.173300 -0.365148 0.295961
13856 B ISG15 -2.450193 -0.764503 -0.985759 -0.235561 3.109997
3016 CD14 RSAD2 -2.440152 -0.091798 0.019702 -0.162527 0.084759
[19]:
# get top genes with highest absolute loadings across all views
top_genes = variable_loadings['variable'].head(30)
top_loadings = variable_loadings[variable_loadings['variable'].isin(top_genes)]
# ^ Note that the genes with the lowest loadings are equally interesting

# plot them
# dotplot of variable, view, loadings
(p9.ggplot(top_loadings) +
 p9.aes(x='view', y='variable', fill='Factor1') +
 p9.geom_tile() +
 p9.scale_fill_gradient2(low='#1f77b4', mid='lightgray', high='#c20019') +
 p9.theme_minimal() +
 p9.theme(axis_text_x=p9.element_text(angle=90, hjust=0.5), figure_size=(5, 5))
 )
Fontsize 0.00 < 1.0 pt not allowed by FreeType. Setting fontsize = 1 pt
../_images/notebooks_mofacellular_49_1.png

We can see that some genes are present in only some of the cell type views and that some of the genes with highest (absolute) loadings tend to be cell type specific. Many genes are also be shared across views, but it’s challenging to to decipher if they are shared because they are coordinated by the same biological process or because of technical issues with the data. Thus, an essential step in the analysis is to select the genes for each cell type (view) in a way that is consistent with the biological knowledge of those cell types.

NB: the interpretation of the sign of the feature loadings is relevant only to the sign of the factor scores themselves. In other words, a negative sign would mean that the features are negatively associated with the scores of a given factor, and the contrary for positive loadings!

Explore the model

Finally, we can also explore the MOFA model itself and we will specifically check the variance explained by each cell type.

[20]:
model = mofa.mofa_model("models/mofacell.h5ad")
model
[20]:
MOFA+ model: mofacell.h5ad
Samples (cells): 16
Features: 17206
Groups: group1 (16)
Views: B (2047), CD14 (4126), CD4T (5509), CD8T (838), DCs (1446), FGR3 (1817), NK (1423)
Factors: 5
Expectations: W, Z
[21]:
mofa.plot_r2(model, x='View')
[21]:
../_images/notebooks_mofacellular_54_0.png

We can see that Factor 1 captures the majority of the variance explained, and that CD14+ Monocytes have the highest R-squared for the first factor. In other words, CD14+ Monocytes are the cell type with the highest variance explained by the model and specifically factor 1, thus this suggests that the variability within this view is mostly associated with the difference between samples from the control and condition.

Outlook & Further Analysis

This tutorial is just a short introduction of the use of MOFA, we thus refer the users to the available MOFA & muon tutorials for more applications & details.

[22]:
model.close()
[ ]: