Summarizing cell and gene metadata

This notebook provides examples of basic axis metadata handling using Pandas. The Census stores obs (cell) and var (gene) metadata in SOMADataFrame objects via the TileDB-SOMA API (see its documentation). Each of these can be queried and read into a Pandas DataFrame.

Note that a Pandas DataFrame is an in-memory object, so queries should be small enough for their results to fit in memory.
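
As a rough sanity check before reading a column, its in-memory size can be estimated from the row count. The sketch below assumes fixed-width 8-byte values (for example int64 or float64) and ignores index and per-object overhead, so it gives only a lower bound. `estimate_column_bytes` is a hypothetical helper, not part of any library, and the row count is the total number of cells reported at the end of this notebook.

```python
def estimate_column_bytes(n_rows: int, n_columns: int = 1, bytes_per_value: int = 8) -> int:
    """Lower-bound size of fixed-width columns, ignoring index and object overhead."""
    return n_rows * n_columns * bytes_per_value


# Total cell count reported by the "Full Census metadata stats" example.
N_CELLS = 61_656_118

size_bytes = estimate_column_bytes(N_CELLS)
print(f"One 8-byte column over {N_CELLS:,} cells needs at least {size_bytes / 2**30:.2f} GiB")
```

String-valued columns can take considerably more memory than this estimate, which is one more reason to request only the columns you need.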

Contents

Opening the Census
Summarizing cell metadata
1. Example: Summarize all cell types
2. Example: Summarize a subset of cell types, selected with a value_filter
Full Census metadata stats

⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data which is described in the Census schema.
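
To illustrate the effect of is_primary_data, the sketch below filters a small, made-up DataFrame that stands in for a query result. The rows are invented for illustration only; the column names match those used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for a small obs query result; the values are made up for illustration.
toy_obs_df = pd.DataFrame(
    {
        "cell_type_ontology_term_id": ["CL:0000540", "CL:0000540", "CL:0000235", "CL:0000540"],
        "is_primary_data": [True, False, True, True],
    }
)

# Keep only the primary copy of each cell, then summarize as usual.
primary_df = toy_obs_df[toy_obs_df["is_primary_data"]]
counts = primary_df["cell_type_ontology_term_id"].value_counts()
print(counts)
```

When querying the Census directly, the same effect is achieved by adding is_primary_data == True to a value_filter, as one of the later examples does.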

Opening the Census

The cellxgene_census Python package provides a convenient API for opening the latest version of the Census. If you open the Census, you should also close it. open_soma() returns a context manager, so you can open and close it in several ways, much like a Python file handle. Using it in a with statement is preferred, because the Census is then closed automatically even if an error is raised.

You can learn more about the cellxgene_census methods by accessing their corresponding documentation via help(). For example, help(cellxgene_census.open_soma).

[1]:

import cellxgene_census

# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...

# or, directly open the census (don't forget to close it!)
census = cellxgene_census.open_soma()

The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
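
The difference between the two ways of opening shown above can be sketched without touching the Census at all. `FakeHandle` below is a hypothetical stand-in for an open handle, not part of cellxgene_census; it only records whether close() was called, so that both patterns can be checked when an error occurs.

```python
class FakeHandle:
    """Hypothetical stand-in for an open Census handle; not part of cellxgene_census."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


# 1. Context manager: the handle is closed automatically, even though an error is raised.
managed = FakeHandle()
try:
    with managed as handle:
        raise RuntimeError("query failed")
except RuntimeError:
    pass

# 2. Manual open: wrap the work in try/finally to get the same guarantee.
manual = FakeHandle()
try:
    pass  # ... query the handle here ...
finally:
    manual.close()

print(managed.closed, manual.closed)
```

If you open the Census directly, as the rest of this notebook does, the try/finally pattern is a reasonable way to make sure close() still runs when a query fails.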

Summarizing cell metadata

Once the Census is open, you can use its TileDB-SOMA methods, as it is itself a SOMACollection. You can thus access the SOMADataFrame objects that encode cell and gene metadata.

Tips:

You can read an entire SOMADataFrame into a Pandas DataFrame using soma_df.read().concat().to_pandas(), allowing the use of the standard Pandas API.
Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g., column_names=["cell_type_ontology_term_id"]).
You can also further refine query results by using a value_filter, which will filter the census for matching records.
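
The tips above can be combined into a single direct TileDB-SOMA read. This is a minimal sketch: `read_obs_subset` is a hypothetical helper name, and it assumes the Census layout in which each organism lives under census["census_data"] and exposes its cell metadata SOMADataFrame as `.obs`.

```python
def read_obs_subset(census, organism, column_names, value_filter=None):
    """Read selected obs columns for one organism into a Pandas DataFrame.

    Assumes `census` is an open Census handle laid out as
    census["census_data"][organism].obs (a SOMADataFrame).
    """
    obs = census["census_data"][organism].obs
    # Requesting only the needed columns and filtering at read time
    # keeps the result small before it is materialized in memory.
    return obs.read(column_names=column_names, value_filter=value_filter).concat().to_pandas()
```

With an open Census, a call such as read_obs_subset(census, "homo_sapiens", ["cell_type_ontology_term_id"], value_filter="is_primary_data == True") would return only that column for primary cells. The get_obs() convenience function used in the examples below covers the same use case.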

Example: Summarize all cell types

This example reads the cell metadata (obs) into a Pandas DataFrame, and summarizes it in a variety of ways using the Pandas API.

[2]:

# Read the cell_type_ontology_term_id column of obs, for all human cells, into a Pandas DataFrame.
obs_df = cellxgene_census.get_obs(census, "homo_sapiens", column_names=["cell_type_ontology_term_id"])

# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()

# Display only the first 10, as there are a LOT!
print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)

# Using Pandas API, count the instances of each cell type term and return the top 10.
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print("\nThe top 10 cell types and their counts are:")
print(top_10)

There are 613 cell types in the Census! The first 10 are: ['CL:0000525', 'CL:2000060', 'CL:0008036', 'CL:0002488', 'CL:0002343', 'CL:0000084', 'CL:0001078', 'CL:0000815', 'CL:0000235', 'CL:3000001']

The top 10 cell types and their counts are:
cell_type_ontology_term_id
CL:0000540    7665340
CL:0000679    1894047
CL:0000128    1881077
CL:0000624    1508920
CL:0000625    1477453
CL:0000235    1419507
CL:0000057    1397813
CL:0000860    1369142
CL:0000003    1308000
CL:4023040    1229658
Name: count, dtype: int64

Example: Summarize a subset of cell types, selected with a value_filter

This example utilizes a SOMA “value filter” to read the subset of cells with tissue_ontology_term_id equal to UBERON:0002048 (lung tissue), and summarizes the query result using Pandas.

[3]:

# Count cell_type occurrences for cells with tissue == 'lung'

# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE = "UBERON:0002048"

obs_df = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    column_names=["cell_type_ontology_term_id"],
    value_filter=f"tissue_ontology_term_id == '{LUNG_TISSUE}'",
)

# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()

print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census where tissue_ontology_term_id == {LUNG_TISSUE}! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)

# Use Pandas API to count, and grab 10 most common
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print(f"\nTop 10 cell types where tissue_ontology_term_id == {LUNG_TISSUE}")
print(top_10)

There are 185 cell types in the Census where tissue_ontology_term_id == UBERON:0002048! The first 10 are: ['CL:0002063', 'CL:0000775', 'CL:0001044', 'CL:0001050', 'CL:0000814', 'CL:0000071', 'CL:0000192', 'CL:0002503', 'CL:0000235', 'CL:0002370']

Top 10 cell types where tissue_ontology_term_id == UBERON:0002048
cell_type_ontology_term_id
CL:0000003    562038
CL:0000583    526859
CL:0000625    323985
CL:0000624    323610
CL:0000235    266333
CL:0002063    255425
CL:0000860    205013
CL:0000623    164944
CL:0001064    149067
CL:0002632    132243
Name: count, dtype: int64

You can also define much more complex value filters. For example:

combine terms with the "and" and "or" operators
use the "in" operator to query on multiple values
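
As a sketch of how such filters are assembled, the snippet below builds three filter strings with ordinary Python f-strings. The term IDs are ones that already appear in this notebook; which combinations are useful for a given analysis is left as an assumption.

```python
LUNG_TISSUE = "UBERON:0002048"
CARDIAC_VENTRICLE = "UBERON:0002082"
CELL_TYPES = ["CL:0000235", "CL:0000624"]

# "and": both conditions must hold.
lung_primary = f"tissue_ontology_term_id == '{LUNG_TISSUE}' and is_primary_data == True"

# "or": either condition may hold.
lung_or_ventricle = (
    f"tissue_ontology_term_id == '{LUNG_TISSUE}' or tissue_ontology_term_id == '{CARDIAC_VENTRICLE}'"
)

# "in": match any value in a list; a Python list of strings formats directly into the filter.
selected_types_in_lung = (
    f"cell_type_ontology_term_id in {CELL_TYPES} and tissue_ontology_term_id == '{LUNG_TISSUE}'"
)

for value_filter in (lung_primary, lung_or_ventricle, selected_types_in_lung):
    print(value_filter)
```

Each of these strings can be passed as the value_filter argument of get_obs().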

[4]:

# You can also do more complex queries, such as testing for inclusion in a list of values and "and" operations
VENTRICLES = ["UBERON:0002082", "UBERON:0002084", "UBERON:0002080"]

obs_df = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    column_names=["cell_type_ontology_term_id"],
    value_filter=f"tissue_ontology_term_id in {VENTRICLES} and is_primary_data == True",
)

# Use Pandas API to summarize
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
display(top_10)

cell_type_ontology_term_id
CL:0000746    49929
CL:0008034    33361
CL:0002548    33180
CL:0002131    30915
CL:0000115    30054
CL:0000003    18391
CL:0000763    14408
CL:0000669    13552
CL:0000057     9690
CL:0002144     9025
Name: count, dtype: int64

Full Census metadata stats

This example queries all organisms in the Census, and summarizes the diversity of various metadata labels.

[5]:

COLS_TO_QUERY = [
    "cell_type_ontology_term_id",
    "assay_ontology_term_id",
    "tissue_ontology_term_id",
]

obs_df = {
    name: cellxgene_census.get_obs(census, name, column_names=COLS_TO_QUERY) for name in census["census_data"].keys()
}

# Use Pandas API to summarize each organism
print(f"Complete census contains {sum(len(df) for df in obs_df.values())} cells.")
for organism, df in obs_df.items():
    print(organism)
    for col in COLS_TO_QUERY:
        print(f"\tUnique {col} values: {len(df[col].unique())}")

Complete census contains 61656118 cells.
mus_musculus
        Unique cell_type_ontology_term_id values: 248
        Unique assay_ontology_term_id values: 9
        Unique tissue_ontology_term_id values: 66
homo_sapiens
        Unique cell_type_ontology_term_id values: 613
        Unique assay_ontology_term_id values: 19
        Unique tissue_ontology_term_id values: 220

Close the census

[6]:

census.close()