Summarizing cell and gene metadata
This notebook provides examples for basic axis metadata handling using Pandas. The Census stores obs (cell) and var (gene) metadata in SOMADataFrame objects via the TileDB-SOMA API (documentation), which can be queried and read as a Pandas DataFrame using TileDB-SOMA.
Note that a Pandas DataFrame is an in-memory object, so queries should be small enough for their results to fit in memory.
Contents
Opening the Census
Summarizing cell metadata
Example: Summarize all cell types
Example: Summarize a subset of cell types, selected with a value_filter
Full Census metadata stats
⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data, which is described in the Census schema.
Opening the Census
The cellxgene_census Python package contains a convenient API to open the latest stable release of the Census. If you open the Census, you should close it. open_soma() returns a context, so you can open and close it in several ways, like a Python file handle. The context manager is preferred, as it automatically closes the Census if an error is raised.
You can learn more about the cellxgene_census methods by accessing their documentation via help(), for example help(cellxgene_census.open_soma).
[1]:
import cellxgene_census
# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...
# or, directly open the census (don't forget to close it!)
census = cellxgene_census.open_soma()
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
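To see why the context manager is the safer pattern, the following minimal sketch uses a hypothetical stand-in class, DemoHandle, in place of a real Census handle. It is not part of cellxgene_census and only mimics open/close behavior. It illustrates that a with block closes the handle even when the body raises, whereas an explicit open needs try/finally to get the same guarantee.

```python
class DemoHandle:
    """Hypothetical stand-in for an open Census handle (illustration only)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow the exception


# Context manager: the handle is closed even though the body raises.
handle = DemoHandle()
try:
    with handle:
        raise RuntimeError("simulated query failure")
except RuntimeError:
    pass
closed_by_with = handle.closed

# Explicit open: try/finally is needed for the same guarantee.
manual = DemoHandle()
try:
    pass  # queries would go here
finally:
    manual.close()
closed_manually = manual.closed
```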
Summarizing cell metadata
Once the Census is open you can use its TileDB-SOMA methods, as it is itself a SOMACollection. You can thus access the SOMADataFrame objects encoding cell and gene metadata.
Tips:
You can read an entire SOMADataFrame into a Pandas DataFrame using soma_df.read().concat().to_pandas(), allowing the use of the standard Pandas API.
Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g., column_names=["cell_type_ontology_term_id"]).
You can also further refine query results by using a value_filter, which will filter the census for matching records.
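The three tips can be combined into one small helper. This is a sketch rather than part of the cellxgene_census API: the name read_obs_columns is made up for illustration, and it assumes that census["census_data"][organism].obs is the cell metadata SOMADataFrame and that its read() method accepts column_names and value_filter.

```python
def read_obs_columns(census, organism, column_names, value_filter=None):
    """Read selected obs columns for one organism into a Pandas DataFrame.

    Hypothetical helper: assumes `census` is an open Census handle whose
    obs SOMADataFrame lives at census["census_data"][organism].obs.
    """
    obs = census["census_data"][organism].obs
    # Request only the needed columns to keep the query fast and small.
    read_kwargs = {"column_names": column_names}
    if value_filter is not None:
        read_kwargs["value_filter"] = value_filter
    # read() -> concat() -> to_pandas(), as described in the first tip.
    return obs.read(**read_kwargs).concat().to_pandas()
```

With an open Census, a call such as read_obs_columns(census, "homo_sapiens", ["cell_type_ontology_term_id"]) would be expected to return the same kind of DataFrame as the get_obs calls used in the examples.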
Example: Summarize all cell types
This example reads the cell metadata (obs) into a Pandas DataFrame, and summarizes it in a variety of ways using the Pandas API.
[2]:
# Read the `cell_type_ontology_term_id` column of _obs_ for all human cells into a pandas dataframe.
obs_df = cellxgene_census.get_obs(census, "homo_sapiens", column_names=["cell_type_ontology_term_id"])
# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()
# Display only the first 10, as there are a LOT!
print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)
# Using Pandas API, count the instances of each cell type term and return the top 10.
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print("\nThe top 10 cell types and their counts are:")
print(top_10)
There are 613 cell types in the Census! The first 10 are: ['CL:0000525', 'CL:2000060', 'CL:0008036', 'CL:0002488', 'CL:0002343', 'CL:0000084', 'CL:0001078', 'CL:0000815', 'CL:0000235', 'CL:3000001']
The top 10 cell types and their counts are:
cell_type_ontology_term_id
CL:0000540 7665340
CL:0000679 1894047
CL:0000128 1881077
CL:0000624 1508920
CL:0000625 1477453
CL:0000235 1419507
CL:0000057 1397813
CL:0000860 1369142
CL:0000003 1308000
CL:4023040 1229658
Name: count, dtype: int64
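Because a Pandas DataFrame lives in memory, a column that is too large to load at once can be summarized incrementally instead. The sketch below assumes that iterating over soma_df.read() yields Arrow tables one batch at a time, as TileDB-SOMA readers do; the helper name count_values_in_batches is hypothetical.

```python
from collections import Counter


def count_values_in_batches(soma_df, column):
    """Count occurrences of each value in `column`, one read batch at a time.

    Hypothetical helper: assumes `soma_df.read(...)` is iterable and that
    each item is an Arrow table exposing `.column(name).to_pylist()`.
    """
    counts = Counter()
    for batch in soma_df.read(column_names=[column]):
        # Only one batch is held in memory at any moment.
        counts.update(batch.column(column).to_pylist())
    return counts
```

With an open Census, count_values_in_batches(census["census_data"]["homo_sapiens"].obs, "cell_type_ontology_term_id").most_common(10) would be expected to give the same counts as value_counts() above, without building the full DataFrame first.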
Example: Summarize a subset of cell types, selected with a value_filter
This example uses a SOMA “value filter” to read the subset of cells with tissue_ontology_term_id equal to UBERON:0002048 (lung tissue), and summarizes the query result using Pandas.
[3]:
# Count cell_type occurrences for cells with tissue == 'lung'
# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE = "UBERON:0002048"
obs_df = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    column_names=["cell_type_ontology_term_id"],
    value_filter=f"tissue_ontology_term_id == '{LUNG_TISSUE}'",
)
# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.
unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()
print(
    f"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census where tissue_ontology_term_id == {LUNG_TISSUE}! The first 10 are:",
    unique_cell_type_ontology_term_id[0:10].tolist(),
)
# Use Pandas API to count, and grab 10 most common
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
print(f"\nTop 10 cell types where tissue_ontology_term_id == {LUNG_TISSUE}")
print(top_10)
There are 185 cell types in the Census where tissue_ontology_term_id == UBERON:0002048! The first 10 are: ['CL:0002063', 'CL:0000775', 'CL:0001044', 'CL:0001050', 'CL:0000814', 'CL:0000071', 'CL:0000192', 'CL:0002503', 'CL:0000235', 'CL:0002370']
Top 10 cell types where tissue_ontology_term_id == UBERON:0002048
cell_type_ontology_term_id
CL:0000003 562038
CL:0000583 526859
CL:0000625 323985
CL:0000624 323610
CL:0000235 266333
CL:0002063 255425
CL:0000860 205013
CL:0000623 164944
CL:0001064 149067
CL:0002632 132243
Name: count, dtype: int64
You can also define much more complex value filters. For example:
combine terms with and and or
use the in operator to query on multiple values
[4]:
# You can also do more complex queries, such as testing for inclusion in a list of values and "and" operations
VENTRICLES = ["UBERON:0002082", "UBERON:0002084", "UBERON:0002080"]
obs_df = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    column_names=["cell_type_ontology_term_id"],
    value_filter=f"tissue_ontology_term_id in {VENTRICLES} and is_primary_data == True",
)
# Use Pandas API to summarize
top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]
display(top_10)
cell_type_ontology_term_id
CL:0000746 49929
CL:0008034 33361
CL:0002548 33180
CL:0002131 30915
CL:0000115 30054
CL:0000003 18391
CL:0000763 14408
CL:0000669 13552
CL:0000057 9690
CL:0002144 9025
Name: count, dtype: int64
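A filter built by interpolating a Python list into a string is easy to get subtly wrong, and a malformed term ID may simply match no cells instead of raising an error. The following sketch builds the same kind of filter and checks each ID first. The helper name build_tissue_filter and the ID pattern (letters, a colon, then seven digits) are assumptions made for illustration, not part of the Census schema or API.

```python
import re

# Assumption: ontology term IDs look like "PREFIX:1234567".
_TERM_ID = re.compile(r"[A-Za-z_]+:\d{7}")


def build_tissue_filter(term_ids, primary_only=True):
    """Build a value_filter string for a list of tissue ontology term IDs.

    Hypothetical helper: raises ValueError if any ID does not match the
    assumed pattern, e.g. when a letter O was typed instead of a zero.
    """
    bad = [t for t in term_ids if not _TERM_ID.fullmatch(t)]
    if bad:
        raise ValueError(f"Malformed ontology term IDs: {bad}")
    expr = f"tissue_ontology_term_id in {list(term_ids)}"
    if primary_only:
        # Exclude duplicate cells that appear in more than one dataset.
        expr += " and is_primary_data == True"
    return expr
```

The returned string can be passed as the value_filter argument of get_obs in the same way as the hand-written filter in the cell above.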
Full Census metadata stats
This example queries all organisms in the Census, and summarizes the diversity of various metadata labels.
[5]:
COLS_TO_QUERY = [
    "cell_type_ontology_term_id",
    "assay_ontology_term_id",
    "tissue_ontology_term_id",
]
obs_df = {
    name: cellxgene_census.get_obs(census, name, column_names=COLS_TO_QUERY) for name in census["census_data"].keys()
}
# Use Pandas API to summarize each organism
print(f"Complete census contains {sum(len(df) for df in obs_df.values())} cells.")
for organism, df in obs_df.items():
    print(organism)
    for col in COLS_TO_QUERY:
        print(f"\tUnique {col} values: {len(df[col].unique())}")
Complete census contains 61656118 cells.
mus_musculus
	Unique cell_type_ontology_term_id values: 248
	Unique assay_ontology_term_id values: 9
	Unique tissue_ontology_term_id values: 66
homo_sapiens
	Unique cell_type_ontology_term_id values: 613
	Unique assay_ontology_term_id values: 19
	Unique tissue_ontology_term_id values: 220
Close the census
[6]:
census.close()