Summarizing cell and gene metadata

This notebook shows basic handling of axis metadata with Pandas. The Census stores obs (cell) and var (gene) metadata in SOMADataFrame objects via the TileDB-SOMA API (documentation); these can be queried and read into a Pandas DataFrame using TileDB-SOMA.

Note that a Pandas DataFrame is an in-memory object, so queries should be small enough for their results to fit in memory.

Contents

  1. Opening the Census

  2. Summarizing cell metadata

    1. Example: Summarize all cell types

    2. Example: Summarize a subset of cell types, selected with a value_filter

  3. Full Census metadata stats

⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data which is described in the Census schema.

Opening the Census

The cellxgene_census Python package contains a convenient API to open the latest version of the Census. If you open the Census, you should close it. open_soma() returns a context, so you can open and close it in several ways, much like a Python file handle. The context manager is preferred, as it closes the Census automatically even if an error is raised.

You can learn more about the cellxgene_census methods by accessing their documentation via help(), for example, help(cellxgene_census.open_soma).

[1]:
import cellxgene_census

# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...

# or, directly open the census (don't forget to close it!)
census = cellxgene_census.open_soma()
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
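The release message above suggests pinning census_version. A minimal sketch of doing that while still guaranteeing the Census is closed might look like the following. The helper name run_with_pinned_census and its opener parameter are hypothetical (the parameter exists only so the function can be exercised without network access); open_soma(census_version=...) is the call named in the message.

```python
def run_with_pinned_census(work, census_version="2023-07-25", opener=None):
    """Open one pinned Census release, run work(census), then close it.

    `work` is any callable taking the open Census; `opener` is a hypothetical
    hook that defaults to cellxgene_census.open_soma.
    """
    if opener is None:
        # Imported lazily so the helper can be defined without the package.
        import cellxgene_census

        opener = cellxgene_census.open_soma

    # The context manager closes the Census even if work() raises.
    with opener(census_version=census_version) as census:
        return work(census)
```

For instance, run_with_pinned_census(lambda census: list(census["census_data"].keys())) would return the organism names stored in that release.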

Summarizing cell metadata

Once the Census is open, you can use its TileDB-SOMA methods, since the Census is itself a SOMACollection. This gives you access to the SOMADataFrame objects that encode cell and gene metadata.

Tips:

  • You can read an entire SOMADataFrame into a Pandas DataFrame using soma_df.read().concat().to_pandas(), allowing the use of the standard Pandas API.

  • Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g., column_names=["cell_type_ontology_term_id"]).

  • You can also further refine query results by using a value_filter, which will filter the census for matching records.
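Put together, the tips above correspond to a read along the following lines. This is a sketch only: the helper name read_obs_columns is hypothetical, and it assumes the per-organism experiment is reached as census["census_data"][organism] with its cell metadata at .obs.

```python
def read_obs_columns(census, organism, column_names, value_filter=None):
    """Read selected obs columns for one organism into a Pandas DataFrame."""
    # Assumed layout: census["census_data"][organism] is the organism's
    # experiment, and .obs is its cell-metadata SOMADataFrame.
    obs = census["census_data"][organism].obs

    # Requesting only the needed columns (and, optionally, only rows matching
    # value_filter) keeps the query fast and the result small.
    reader = obs.read(column_names=column_names, value_filter=value_filter)

    # concat() merges the streamed Arrow tables; to_pandas() converts the result.
    return reader.concat().to_pandas()
```

The examples below use cellxgene_census.get_obs() instead, which takes the same column_names and value_filter arguments.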

Example: Summarize all cell types

This example reads the cell metadata (obs) into a Pandas DataFrame and summarizes it in a variety of ways using the Pandas API.

[2]:
# Read entire _obs_ into a pandas dataframe.
obs_df = cellxgene_census.get_obs(census, "homo_sapiens", column_names=["cell_type_ontology_term_id"])

# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()

# Display only the first 10, as there are a LOT!
print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)

# Using Pandas API, count the instances of each cell type term and return the top 10.
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print("\nThe top 10 cell types and their counts are:")
print(top_10)
There are 613 cell types in the Census! The first 10 are: ['CL:0000525', 'CL:2000060', 'CL:0008036', 'CL:0002488', 'CL:0002343', 'CL:0000084', 'CL:0001078', 'CL:0000815', 'CL:0000235', 'CL:3000001']

The top 10 cell types and their counts are:
cell_type_ontology_term_id
CL:0000540    7665340
CL:0000679    1894047
CL:0000128    1881077
CL:0000624    1508920
CL:0000625    1477453
CL:0000235    1419507
CL:0000057    1397813
CL:0000860    1369142
CL:0000003    1308000
CL:4023040    1229658
Name: count, dtype: int64
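Raw counts are easier to compare when paired with each cell type's share of all cells. A small sketch using only standard Pandas calls follows; the helper name top_cell_types and the toy data are illustrative and are not Census output.

```python
import pandas as pd


def top_cell_types(obs_df, n=10):
    """Return the n most frequent cell type terms with counts and fractions."""
    # value_counts() sorts from most to least frequent by default.
    counts = obs_df["cell_type_ontology_term_id"].value_counts()
    summary = pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})
    return summary.head(n)


# Toy stand-in for the obs_df returned by get_obs(); not real Census data.
toy_obs = pd.DataFrame(
    {"cell_type_ontology_term_id": ["CL:0000540"] * 3 + ["CL:0000679"]}
)
print(top_cell_types(toy_obs, n=2))
```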

Example: Summarize a subset of cell types, selected with a value_filter

This example uses a SOMA “value filter” to read the subset of cells with tissue_ontology_term_id equal to UBERON:0002048 (lung tissue), and summarizes the query result using Pandas.

[3]:
# Count cell_type occurrences for cells with tissue == 'lung'

# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE = "UBERON:0002048"

obs_df = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    column_names=["cell_type_ontology_term_id"],
    value_filter=f"tissue_ontology_term_id == '{LUNG_TISSUE}'",
)

# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()

print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census where tissue_ontology_term_id == {LUNG_TISSUE}! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)

# Use Pandas API to count, and grab 10 most common
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print(f"\nTop 10 cell types where tissue_ontology_term_id == {LUNG_TISSUE}")
print(top_10)
There are 185 cell types in the Census where tissue_ontology_term_id == UBERON:0002048! The first 10 are: ['CL:0002063', 'CL:0000775', 'CL:0001044', 'CL:0001050', 'CL:0000814', 'CL:0000071', 'CL:0000192', 'CL:0002503', 'CL:0000235', 'CL:0002370']

Top 10 cell types where tissue_ontology_term_id == UBERON:0002048
cell_type_ontology_term_id
CL:0000003    562038
CL:0000583    526859
CL:0000625    323985
CL:0000624    323610
CL:0000235    266333
CL:0002063    255425
CL:0000860    205013
CL:0000623    164944
CL:0001064    149067
CL:0002632    132243
Name: count, dtype: int64

You can also define much more complex value filters. For example:

  • combine terms with and and or

  • use the in operator to query on multiple values

[4]:
# You can also do more complex queries, such as testing for inclusion in a list of values and "and" operations
VENTRICLES = ["UBERON:0002082", "UBERON:0002084", "UBERON:0002080"]

obs_df = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    column_names=["cell_type_ontology_term_id"],
    value_filter=f"tissue_ontology_term_id in {VENTRICLES} and is_primary_data == True",
)

# Use Pandas API to summarize
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
display(top_10)
cell_type_ontology_term_id
CL:0000746    49929
CL:0008034    33361
CL:0002548    33180
CL:0002131    30915
CL:0000115    30054
CL:0000003    18391
CL:0000763    14408
CL:0000669    13552
CL:0000057     9690
CL:0002144     9025
Name: count, dtype: int64
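Because a value_filter is a plain string, a mistyped ontology term ID may not raise an error; it can simply match no cells. One way to guard against that is to check the IDs before building the filter. The sketch below is hypothetical: build_tissue_filter is not part of cellxgene_census, and the pattern assumes IDs consist of a prefix, a colon, and seven digits, as the UBERON and CL IDs in this notebook do.

```python
import re

# Assumed ID shape: letters (or underscores), a colon, then exactly 7 digits.
TERM_ID_PATTERN = re.compile(r"[A-Za-z_]+:\d{7}")


def build_tissue_filter(tissue_ids, primary_only=True):
    """Build a value_filter string selecting cells from the given tissues."""
    malformed = [t for t in tissue_ids if not TERM_ID_PATTERN.fullmatch(t)]
    if malformed:
        raise ValueError(f"Malformed ontology term IDs: {malformed}")

    # The list's repr is interpolated, as in the f-string query above.
    expr = f"tissue_ontology_term_id in {list(tissue_ids)}"
    if primary_only:
        # Keep only the primary copy of cells that appear in several datasets.
        expr += " and is_primary_data == True"
    return expr


print(build_tissue_filter(["UBERON:0002082", "UBERON:0002080"]))
```

The returned string can be passed as the value_filter argument of get_obs().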

Full Census metadata stats

This example queries all organisms in the Census and summarizes the diversity of various metadata labels.

[5]:
COLS_TO_QUERY = [
    "cell_type_ontology_term_id",
    "assay_ontology_term_id",
    "tissue_ontology_term_id",
]

obs_df = {
    name: cellxgene_census.get_obs(census, name, column_names=COLS_TO_QUERY) for name in census["census_data"].keys()
}

# Use Pandas API to summarize each organism
print(f"Complete census contains {sum(len(df) for df in obs_df.values())} cells.")
for organism, df in obs_df.items():
    print(organism)
    for col in COLS_TO_QUERY:
        print(f"\tUnique {col} values: {len(df[col].unique())}")
Complete census contains 61656118 cells.
mus_musculus
        Unique cell_type_ontology_term_id values: 248
        Unique assay_ontology_term_id values: 9
        Unique tissue_ontology_term_id values: 66
homo_sapiens
        Unique cell_type_ontology_term_id values: 613
        Unique assay_ontology_term_id values: 19
        Unique tissue_ontology_term_id values: 220
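The same per-organism counts can be collected into a single table, which is easier to compare than printed lines. A minimal sketch, assuming a dict of DataFrames keyed by organism like obs_df above; the helper name summarize_unique_values and the toy frames are illustrative only.

```python
import pandas as pd


def summarize_unique_values(obs_by_organism, columns):
    """Build an organism-by-column table of unique value counts."""
    # DataFrame.nunique() counts distinct non-null values in each column
    # (unlike len(unique()), it does not count missing values).
    per_organism = {
        organism: df[columns].nunique() for organism, df in obs_by_organism.items()
    }
    # Transpose so each row is an organism and each column a metadata field.
    return pd.DataFrame(per_organism).T


# Toy frames standing in for the get_obs() results; not real Census data.
toy = {
    "mus_musculus": pd.DataFrame(
        {
            "cell_type_ontology_term_id": ["CL:0000540", "CL:0000540"],
            "tissue_ontology_term_id": ["UBERON:0002048", "UBERON:0002082"],
        }
    ),
    "homo_sapiens": pd.DataFrame(
        {
            "cell_type_ontology_term_id": ["CL:0000540", "CL:0000679"],
            "tissue_ontology_term_id": ["UBERON:0002048", "UBERON:0002048"],
        }
    ),
}
columns = ["cell_type_ontology_term_id", "tissue_ontology_term_id"]
print(summarize_unique_values(toy, columns))
```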

Close the census

[6]:
census.close()