MOFAcellular - Multicellular Factor Analysis¶
Background¶
In Ramirez et al, 2023, we recently showed a repurposed use of the statistical framework of multi-omics factor analysis (MOFA) and MOFA+ to analyse cross-condition single-cell atlases. Specifically, we represented the cross-conditional (e.g. healthy vs. diseased) single-cell transcriptomics data as a multi-view structure, where each cell type represents an individual view that contains summarized information per sample (e.g. pseudobulk). We then applied MOFA to estimate a latent space that captures the variability of the data across samples and cell types.
In this tutorial, we will guide the generation of a multi-view structure from single-cell transcriptomics data and a basic application of MOFA to capture multicellular variability. We make use of the MuData/muon infrastructure.
Load Packages¶
mofa, decoupler, omnipath, and marsilea can be installed via pip with the following commands:
pip install "decoupler>=0.1.4"
pip install mofax
pip install muon
pip install omnipath
pip install marsilea
[15]:
import numpy as np
import pandas as pd
import scanpy as sc
import plotnine as p9
import liana as li
# load muon and mofax
import muon as mu
import mofax as mofa
import decoupler as dc
Load & Prep Data¶
As a simple example, we will look at ~25k PBMCs from 8 pooled patient lupus samples, each before and after IFN-beta stimulation (Kang et al., 2018; GSE96583). Note that by focusing on PBMCs, for the purpose of this tutorial, we assume that coordinated events occur among them.
This dataset is downloaded from a link on Figshare; preprocessed for pertpy.
[2]:
adata = li.testing.datasets.kang_2018()
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/liana/testing/datasets.py:36: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
[3]:
adata
[3]:
AnnData object with n_obs × n_vars = 24673 × 15706
obs: 'nCount_RNA', 'nFeature_RNA', 'tsne1', 'tsne2', 'condition', 'cluster', 'cell_type', 'patient', 'nCount_SCT', 'nFeature_SCT', 'integrated_snn_res.0.4', 'seurat_clusters', 'sample', 'cell_abbr'
var: 'name'
obsm: 'X_pca', 'X_umap'
layers: 'counts'
Define columns of interest from .obs
Note that we use cell abbreviations because MOFA will use them as labels for the views.
[4]:
sample_key = 'sample'
condition_key = 'condition'
groupby = 'cell_abbr'
Basic QC¶
Note that this data has been largely pre-processed & annotated, we refer the user to the Quality Control and other relevant chapters from the best-practices book for information about pre-processing and annotation steps.
[5]:
# filter cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
[6]:
# Store the counts to use them later
adata.layers["counts"] = adata.X.copy()
# Normalize and find marker genes per cell type
# We will use those to remove potential noise from the data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.tl.rank_genes_groups(adata, groupby=groupby)
... storing 'sample' as categorical
Showcase the data¶
[7]:
# Show pre-computed UMAP
sc.pl.umap(adata, color=[condition_key, sample_key, 'cell_type', groupby], frameon=False, ncols=2)

Create a Multi-View Structure¶
To construct a multi-view structure, we need to define the views. In this case, we will use the adata_to_views
function from liana to create a list of views (stored in a MuData object), where each view corresponds to a cell type. Simply put, we summarize the samples to pseudobulks by cell type, and then we create a view for each cell type.
We refer users to decoupler’s get_pseudobulk
function for more information about filtering and aggregation options.
[8]:
mdata = li.multi.adata_to_views(adata,
groupby=groupby,
sample_key=sample_key,
keep_stats=True,
obs_keys=['condition', 'patient'], # add those to mdata.obs
min_prop=0.05, # min nnz values (filter features)
min_smpls=3, # min samples per view (filter features)
min_cells=25, # min cells per view (filter samples)
min_counts=100, # min counts per view (filter samples)
mode='sum', # mode of aggregation
verbose=True,
large_n=5, # edgeR-like filtering
min_total_count=15,
min_count=10,
layer='counts',
)
100%|██████████| 8/8 [00:03<00:00, 2.31it/s]
We see that we have 7 modalities (views) that correspond to the sufficiently abundant cell types & genes in the data.
We can also explore the statistics at the pseodobulk level, when keep_stats=True
:
[9]:
mdata.uns['psbulk_stats'].head()
[9]:
CD14:psbulk_n_cells | CD14:psbulk_counts | CD4T:psbulk_n_cells | CD4T:psbulk_counts | DCs:psbulk_n_cells | DCs:psbulk_counts | NK:psbulk_n_cells | NK:psbulk_counts | CD8T:psbulk_n_cells | CD8T:psbulk_counts | B:psbulk_n_cells | B:psbulk_counts | FGR3:psbulk_n_cells | FGR3:psbulk_counts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ctrl&101 | 207.0 | 531210.0 | 353.0 | 403847.0 | 25.0 | 95330.0 | 65.0 | 57550.0 | 29.0 | 27892.0 | 106.0 | 124446.0 | 72.0 | 197588.0 |
ctrl&1015 | 859.0 | 1793578.0 | 951.0 | 1113616.0 | 40.0 | 146189.0 | 179.0 | 177960.0 | 85.0 | 94175.0 | 459.0 | 556570.0 | 164.0 | 438020.0 |
ctrl&1016 | 393.0 | 742765.0 | 502.0 | 528665.0 | 27.0 | 88298.0 | 90.0 | 79082.0 | 504.0 | 457417.0 | 125.0 | 152335.0 | 81.0 | 208274.0 |
ctrl&1039 | 126.0 | 301741.0 | 201.0 | 249014.0 | NaN | NaN | NaN | NaN | NaN | NaN | 26.0 | 35016.0 | NaN | NaN |
ctrl&107 | 210.0 | 466961.0 | 180.0 | 216841.0 | NaN | NaN | 37.0 | 33243.0 | NaN | NaN | 44.0 | 52655.0 | NaN | NaN |
Pre-process the pseudobulk profiles¶
(Optional) Let’s remove what one would consider background marker genes. The intuition here is that marker genes from some cell types should not be deregulated across conditions in others, and thus, they should not be captured by the latent space.
[10]:
# create dictionary of markers for each cell type
markers = {}
top_n = 25
for cell_type in mdata.mod.keys():
markers[cell_type] = (sc.get.rank_genes_groups_df(adata, group=cell_type).
sort_values("scores", key=abs, ascending=False).
head(top_n)['names'].
tolist()
)
[11]:
li.multi.filter_view_markers(mdata, markers=markers, var_column=None, inplace=True)
mdata.update()
Next, we will normalize each of the views independently to ensure that samples are comparable. We will also identify the highly-variable genes per view (i.e. across samples) - this is optional & dependent on our assumptions.
[12]:
for view in mdata.mod.keys():
sc.pp.normalize_total(mdata.mod[view], target_sum=1e4)
sc.pp.log1p(mdata.mod[view])
sc.pp.highly_variable_genes(mdata.mod[view])
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
/home/dbdimitrov/anaconda3/envs/mofatalk/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:206: UserWarning: Received a view of an AnnData. Making a copy.
We will use the filter_view_markers
to remove genes that are considered background.
We suggest users to also consider internal muon
functions (e.g. mu.pp.filter_obs
& mu.pp.filter_var
) to remove cells and variables that are not informative.
Note
View Representation
MOFA supports the flexible representation of views, where each view can represent a different type of features (e.g. genes, proteins, metabolites, etc.). In this case, we simply allow for different genes to be used in each view.
Fitting a MOFA model¶
Now that the single-cell data is transformed into a multi-view representation, we can use MOFA to run a multicellular factor analysis.
We will attempt to capture the variability across samples and the different cell-types by reducing the data into a number of factors, where each factor captures the coordinated gene expression across cell types.
[13]:
mu.tl.mofa(mdata,
use_obs='union',
convergence_mode='medium',
n_factors=5,
seed=1337,
outfile='models/mofacellx.h5ad',
use_var='highly_variable'
)
#########################################################
### __ __ ____ ______ ###
### | \/ |/ __ \| ____/\ _ ###
### | \ / | | | | |__ / \ _| |_ ###
### | |\/| | | | | __/ /\ \_ _| ###
### | | | | |__| | | / ____ \|_| ###
### |_| |_|\____/|_|/_/ \_\ ###
### ###
#########################################################
Loaded view='CD14' group='group1' with N=16 samples and D=4061 features...
Loaded view='CD4T' group='group1' with N=16 samples and D=5457 features...
Loaded view='DCs' group='group1' with N=16 samples and D=1389 features...
Loaded view='NK' group='group1' with N=16 samples and D=1371 features...
Loaded view='CD8T' group='group1' with N=16 samples and D=787 features...
Loaded view='B' group='group1' with N=16 samples and D=1997 features...
Loaded view='FGR3' group='group1' with N=16 samples and D=1757 features...
Model options:
- Automatic Relevance Determination prior on the factors: True
- Automatic Relevance Determination prior on the weights: True
- Spike-and-slab prior on the factors: False
- Spike-and-slab prior on the weights: True
Likelihoods:
- View 0 (CD14): gaussian
- View 1 (CD4T): gaussian
- View 2 (DCs): gaussian
- View 3 (NK): gaussian
- View 4 (CD8T): gaussian
- View 5 (B): gaussian
- View 6 (FGR3): gaussian
######################################
## Training the model with seed 1337 ##
######################################
Converged!
#######################
## Training finished ##
#######################
Warning: Output file models/mofacellx.h5ad already exists, it will be replaced
Saving model in models/mofacellx.h5ad...
Saved MOFA embeddings in .obsm['X_mofa'] slot and their loadings in .varm['LFs'].
Exploring the MOFA model¶
For convenience, we provide simple getter function to access the model parameters, in addition to those available via the MuData API & the MOFA model itself.
Explore Metadata Associations to the Factor Scores¶
[21]:
dc.get_metadata_associations(
mdata,
obs_keys = ['patient', 'condition'], # Metadata columns to associate to PCs
obsm_key='X_mofa', # Where the PCs are stored
uns_key='mofa_anova', # Where the results are stored
inplace=True,
)
dc.plot_associations(
mdata,
uns_key='mofa_anova', # Summary statistics from the anova tests
obsm_key='X_mofa', # where the PCs are stored
stat_col='p_adj', # Which summary statistic to plot
obs_annotation_cols = ['patient', 'condition'], # which sample annotations to plot
titles=['Principle component scores', 'Adjusted p-values from ANOVA'],
figsize=(7, 5),
n_factors=10,
)

[22]:
# obtain factor scores
factor_scores = li.ut.get_factor_scores(mdata, obsm_key='X_mofa', obs_keys=['condition', 'patient'])
factor_scores.head()
[22]:
sample | Factor1 | Factor2 | Factor3 | Factor4 | Factor5 | condition | patient | |
---|---|---|---|---|---|---|---|---|
0 | ctrl&101 | 0.783439 | -0.003390 | -0.011637 | -0.041593 | -0.025567 | ctrl | patient_101 |
1 | ctrl&1015 | 0.615917 | 0.000925 | -0.013685 | -0.020768 | -0.008345 | ctrl | patient_1015 |
2 | ctrl&1016 | 0.618631 | -0.012139 | -0.025390 | -0.049042 | -0.018271 | ctrl | patient_1016 |
3 | ctrl&1039 | 0.684361 | -0.020115 | -0.009010 | -0.010894 | 0.094717 | ctrl | patient_1039 |
4 | ctrl&107 | 0.737042 | 0.076451 | 0.075220 | -0.005448 | 0.000316 | ctrl | patient_107 |
Let’s check if any of the factors are associated with the sample condition:
[23]:
# we use a paired t-test as the samples are paired
from scipy.stats import ttest_rel
[24]:
# split in control and stimulated
group1 = factor_scores[factor_scores['condition']=='ctrl']
group2 = factor_scores[factor_scores['condition']=='stim']
# get all columns that contain factor & loop
factors = [col for col in factor_scores.columns if 'Factor' in col]
for factor in factors:
print(ttest_rel(group1[factor], group2[factor]))
TtestResult(statistic=55.78013342160475, pvalue=1.5609246865653282e-10, df=7)
TtestResult(statistic=0.43713843548812376, pvalue=0.6751792840118616, df=7)
TtestResult(statistic=-0.8234150426439611, pvalue=0.43742190236387524, df=7)
TtestResult(statistic=-2.0057407304296926, pvalue=0.08489746148937447, df=7)
TtestResult(statistic=0.335859769656824, pvalue=0.7468188651111132, df=7)
We can see that the first factor is associated with the sample condition, let’s plot the factor scores:
[25]:
# scatterplot
(p9.ggplot(factor_scores) +
p9.aes(x='condition', colour='condition', y='Factor1') +
p9.geom_violin() +
p9.geom_jitter(size=4, width=0.2) +
p9.theme_bw() +
p9.scale_colour_manual(values=['#1f77b4', '#c20019'])
)
Fontsize 0.00 < 1.0 pt not allowed by FreeType. Setting fontsize = 1 pt

Explore Feature loadings¶
Now that we have identified a factor that is associated with the sample condition, we can check the features with the highest loadings associated with each cell type:
[26]:
variable_loadings = li.ut.get_variable_loadings(mdata, varm_key='LFs', view_sep=':') # get loadings
# order features by absolute value for Factor 1
variable_loadings = variable_loadings.sort_values(by='Factor1', key=lambda x: abs(x), ascending=False)
variable_loadings.head()
[26]:
view | variable | Factor1 | Factor2 | Factor3 | Factor4 | Factor5 | |
---|---|---|---|---|---|---|---|
1592 | CD14 | IFIT3 | -2.540583 | -0.316007 | -0.139563 | -0.341058 | 0.438565 |
857 | CD14 | CXCL11 | -2.526854 | 0.111960 | -0.300244 | 0.035542 | 0.054032 |
1590 | CD14 | IFIT1 | -2.513345 | -0.175255 | -0.173300 | -0.365148 | 0.295961 |
13856 | B | ISG15 | -2.450193 | -0.764503 | -0.985759 | -0.235561 | 3.109997 |
3016 | CD14 | RSAD2 | -2.440152 | -0.091798 | 0.019702 | -0.162527 | 0.084759 |
[27]:
# get top genes with highest absolute loadings across all views
top_genes = variable_loadings['variable'].head(30)
top_loadings = variable_loadings[variable_loadings['variable'].isin(top_genes)]
# ^ Note that the genes with the lowest loadings are equally interesting
# plot them
# dotplot of variable, view, loadings
(p9.ggplot(top_loadings) +
p9.aes(x='view', y='variable', fill='Factor1') +
p9.geom_tile() +
p9.scale_fill_gradient2(low='#1f77b4', mid='lightgray', high='#c20019') +
p9.theme_minimal() +
p9.theme(axis_text_x=p9.element_text(angle=90, hjust=0.5), figure_size=(5, 5))
)
Fontsize 0.00 < 1.0 pt not allowed by FreeType. Setting fontsize = 1 pt

We can see that some genes are present in only some of the cell type views and that some of the genes with highest (absolute) loadings tend to be cell type specific. Many genes are also be shared across views, but it’s challenging to to decipher if they are shared because they are coordinated by the same biological process or because of technical issues with the data. Thus, an essential step in the analysis is to select the genes for each cell type (view) in a way that is consistent with the biological knowledge of those cell types.
NB: the interpretation of the sign of the feature loadings is relevant only to the sign of the factor scores themselves. In other words, a negative sign would mean that the features are negatively associated with the scores of a given factor, and the contrary for positive loadings!
Explore the model¶
Finally, we can also explore the MOFA model itself and we will specifically check the variance explained by each cell type.
[28]:
model = mofa.mofa_model("models/mofacell.h5ad")
model
[28]:
MOFA+ model: mofacell.h5ad
Samples (cells): 16
Features: 17206
Groups: group1 (16)
Views: B (2047), CD14 (4126), CD4T (5509), CD8T (838), DCs (1446), FGR3 (1817), NK (1423)
Factors: 5
Expectations: W, Z
[29]:
mofa.plot_r2(model, x='View')
[29]:

We can see that Factor 1 captures the majority of the variance explained, and that CD14+ Monocytes have the highest R-squared for the first factor. In other words, CD14+ Monocytes are the cell type with the highest variance explained by the model and specifically factor 1, thus this suggests that the variability within this view is mostly associated with the difference between samples from the control and condition.
Outlook & Further Analysis¶
This tutorial is just a short introduction of the use of MOFA, we thus refer the users to the available MOFA & muon tutorials for more applications & details.
[30]:
model.close()
[ ]: