Exploring pre-calculated summary cell counts

This tutorial describes how to access pre-calculated summary cell counts. Each Census contains a top-level dataframe, census_summary_cell_counts, that summarizes cell counts for various cell labels. You can read it into a Pandas DataFrame.

Contents

  1. Fetching the census_summary_cell_counts dataframe.

  2. Creating summary counts beyond pre-calculated values.

⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data, which is described in the Census schema.

Fetching the census_summary_cell_counts dataframe

[1]:
import cellxgene_census

census = cellxgene_census.open_soma()
census_summary_cell_counts = census["census_info"]["summary_cell_counts"].read().concat().to_pandas()

# Dropping the soma_joinid column as it isn't useful in this demo
census_summary_cell_counts = census_summary_cell_counts.drop(columns=["soma_joinid"])

census_summary_cell_counts
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
[1]:
      organism      category        ontology_term_id  unique_cell_count  total_cell_count  label
0     Homo sapiens  all             na                33364242           56400873          na
1     Homo sapiens  assay           EFO:0008722       264166             279635            Drop-seq
2     Homo sapiens  assay           EFO:0008780       25652              51304             inDrop
3     Homo sapiens  assay           EFO:0008919       89477              206754            Seq-Well
4     Homo sapiens  assay           EFO:0008931       78750              188248            Smart-seq2
...   ...           ...             ...               ...                ...               ...
1357  Mus musculus  tissue_general  UBERON:0002113    179684             208324            kidney
1358  Mus musculus  tissue_general  UBERON:0002365    15577              31154             exocrine gland
1359  Mus musculus  tissue_general  UBERON:0002367    37715              130135            prostate gland
1360  Mus musculus  tissue_general  UBERON:0002368    13322              26644             endocrine gland
1361  Mus musculus  tissue_general  UBERON:0002371    90225              144962            bone marrow

1362 rows × 6 columns
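Because the summary is an ordinary Pandas DataFrame, standard filtering and sorting apply to it. The sketch below uses a small hand-built stand-in containing four of the rows shown above, so it runs without opening the Census. The names summary and human_assays are illustrative and not part of the Census API. The same expressions should work on the real census_summary_cell_counts dataframe.

```python
import pandas as pd

# Stand-in for census_summary_cell_counts: four rows copied from the output above.
summary = pd.DataFrame(
    {
        "organism": ["Homo sapiens"] * 4,
        "category": ["assay"] * 4,
        "ontology_term_id": ["EFO:0008722", "EFO:0008780", "EFO:0008919", "EFO:0008931"],
        "unique_cell_count": [264166, 25652, 89477, 78750],
        "total_cell_count": [279635, 51304, 206754, 188248],
        "label": ["Drop-seq", "inDrop", "Seq-Well", "Smart-seq2"],
    }
)

# Keep one organism and one category, then rank the labels by total cell count.
human_assays = summary[(summary["organism"] == "Homo sapiens") & (summary["category"] == "assay")]
human_assays = human_assays.sort_values("total_cell_count", ascending=False).reset_index(drop=True)

print(human_assays[["label", "unique_cell_count", "total_cell_count"]])
```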

Creating summary counts beyond pre-calculated values

The dataframe above is precomputed from the experiments in the Census and provides a quick overview of the Census contents.

You can compute similar group statistics yourself using Pandas groupby functions.

The code below computes counts like those above, grouped by cell type, using the full obs dataframe of the homo_sapiens experiment.

Keep in mind that the Census is very large, and queries can return a significant amount of data. You can manage that by narrowing the request with column_names and value_filter in your query.
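As a minimal sketch of narrowing a query, the helper below passes both column_names and value_filter to obs.read(). The helper name read_obs_subset, the constants OBS_COLUMNS and OBS_FILTER, and the specific filter values are assumptions made for illustration. The filter string assumes obs exposes tissue_general and is_primary_data columns, as suggested by the summary categories and the note on duplicate cells.

```python
def read_obs_subset(experiment, column_names, value_filter):
    """Read a filtered, column-limited slice of obs into a Pandas DataFrame.

    `experiment` is assumed to be a Census experiment such as
    census["census_data"]["homo_sapiens"]. The helper name is hypothetical.
    """
    return (
        experiment.obs.read(column_names=column_names, value_filter=value_filter)
        .concat()
        .to_pandas()
    )


# Illustrative arguments: fetch only two columns, and only primary kidney cells.
OBS_COLUMNS = ["cell_type_ontology_term_id", "cell_type"]
OBS_FILTER = "tissue_general == 'kidney' and is_primary_data == True"

# Usage against an open census (it reads remote data, so it is left commented out):
# human = census["census_data"]["homo_sapiens"]
# kidney_obs = read_obs_subset(human, OBS_COLUMNS, OBS_FILTER)
```

Requesting fewer columns and fewer rows reduces both the download size and the memory needed for the resulting DataFrame.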

[2]:
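# Open the human experiment and read only the two obs columns needed for the grouping
human = census["census_data"]["homo_sapiens"]
obs_df = human.obs.read(column_names=["cell_type_ontology_term_id", "cell_type"]).concat().to_pandas()
# Count cells per cell type; for categorical columns, observed=True keeps only combinations that occur
obs_df.groupby(by=["cell_type_ontology_term_id", "cell_type"], as_index=False, observed=True).size()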
[2]:
     cell_type_ontology_term_id  cell_type                                     size
0    CL:0000001                  primary cultured cell                         80
1    CL:0000003                  native cell                                   1308000
2    CL:0000006                  neuronal receptor cell                        2502
3    CL:0000015                  male germ cell                                621
4    CL:0000019                  sperm                                         22
...  ...                         ...                                           ...
608  CL:4028006                  alveolar type 2 fibroblast cell               38250
609  CL:4030009                  epithelial cell of proximal tubule segment 1  777
610  CL:4030011                  epithelial cell of proximal tubule segment 3  989
611  CL:4030018                  kidney connecting tubule principal cell       107
612  CL:4030023                  respiratory hillock cell                      10170

613 rows × 3 columns
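The pre-calculated summary reports both unique_cell_count and total_cell_count. The sketch below shows one way the two could be derived with a single groupby. It assumes that unique cells are those with is_primary_data set to True, an assumption based on the duplicate-cell note at the top of this tutorial that should be checked against the Census schema. The toy_obs frame is invented for illustration and is not Census data.

```python
import pandas as pd

# Invented toy stand-in for an obs dataframe; not real Census data.
toy_obs = pd.DataFrame(
    {
        "cell_type": ["sperm", "sperm", "native cell", "native cell", "native cell"],
        "is_primary_data": [True, False, True, True, False],
    }
)

# total_cell_count: every row in the group.
# unique_cell_count: rows flagged as primary data (True sums as 1, False as 0).
toy_counts = toy_obs.groupby("cell_type", as_index=False).agg(
    total_cell_count=("is_primary_data", "size"),
    unique_cell_count=("is_primary_data", "sum"),
)

print(toy_counts)
```

To apply this to the real data, is_primary_data would need to be included in column_names when reading obs.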

Close the census when complete.

[3]:
census.close()