Querying and fetching the single-cell data and cell/gene metadata.
This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into common in-memory Python objects, including pandas.DataFrame and anndata.AnnData.
Contents
- Opening the census. 
- Querying expression data. 
- Querying cell metadata (obs). 
- Querying gene metadata (var). 
⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data, which is described in the Census schema.
Opening the census
The cellxgene_census Python package contains a convenient API to open the latest version of the Census.
[1]:
import cellxgene_census
census = cellxgene_census.open_soma()
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
You can learn more about the cellxgene_census methods by accessing their documentation via help(), for example help(cellxgene_census.open_soma).
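The message printed by open_soma() suggests pinning census_version for reproducibility. The following is a minimal sketch of that idea, assuming the object returned by open_soma() can be used as a context manager so that it is closed automatically. The helper name list_obs_columns and its open_soma parameter are hypothetical; the parameter only exists so the helper can be tried with a stand-in opener.

```python
def list_obs_columns(census_version="2023-07-25", open_soma=None):
    """Open a pinned Census release, list the human obs columns, then close it."""
    if open_soma is None:
        # Imported lazily so the helper can be exercised without the package.
        import cellxgene_census

        open_soma = cellxgene_census.open_soma

    # Leaving the `with` block closes the census, even if an error occurs.
    with open_soma(census_version=census_version) as census:
        return list(census["census_data"]["homo_sapiens"].obs.keys())
```

Calling list_obs_columns() with no arguments opens the pinned release over the network, so it needs the cellxgene_census package and an internet connection.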
Querying expression data
A convenient way to query and fetch expression data is to use the get_anndata method of the cellxgene_census API. This method combines column selection and value filtering (both covered in the metadata sections of this tutorial) to obtain slices of the expression data based on metadata queries.
The method returns an anndata.AnnData object. It takes as input a census object and the string for an organism, and for both cell and gene metadata we can specify filters and column selection with the following arguments:
- obs_column_names and var_column_names: a pair of arguments whose values are lists of strings indicating the columns to select for cell (obs) and gene (var) metadata, respectively.
- obs_value_filter: Python expression with selection conditions to fetch cells meeting a criterion. For full details see tiledb.QueryCondition.
- var_value_filter: Python expression with selection conditions to fetch genes meeting a criterion. For full details see tiledb.QueryCondition.
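Value filters are plain strings, so they can be assembled from Python values instead of being typed by hand. The sketch below is a hypothetical helper (build_value_filter is not part of cellxgene_census) that covers only equality and membership comparisons joined with and; anything more elaborate still needs a hand-written expression.

```python
def build_value_filter(**conditions):
    """Join column conditions into a value-filter string.

    A list or tuple becomes `column in [...]`; any other value becomes
    `column == value`. Conditions are combined with `and`.
    """
    parts = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            parts.append(f"{column} in {list(value)!r}")
        else:
            parts.append(f"{column} == {value!r}")
    return " and ".join(parts)


obs_filter = build_value_filter(
    cell_type="B cell",
    tissue_general="lung",
    disease="COVID-19",
    is_primary_data=True,
)
var_filter = build_value_filter(feature_id=["ENSG00000161798", "ENSG00000188229"])

print(obs_filter)
# cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19' and is_primary_data == True
print(var_filter)
# feature_id in ['ENSG00000161798', 'ENSG00000188229']
```

The resulting strings can be passed as obs_value_filter and var_value_filter respectively.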
For example, suppose we want to fetch the expression data for:
- Genes "ENSG00000161798" and "ENSG00000188229".
- All cells of type "B cell" from "lung" with "COVID-19", excluding duplicated cells.
- All gene metadata, plus the sex cell metadata column.
[2]:
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["sex"],
)
And now we can take a look at the results.
[3]:
adata
[3]:
AnnData object with n_obs × n_vars = 2313 × 2
    obs: 'sex', 'cell_type', 'tissue_general', 'disease', 'is_primary_data'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
[4]:
adata.obs
[4]:
| | sex | cell_type | tissue_general | disease | is_primary_data |
|---|---|---|---|---|---|
| 0 | male | B cell | lung | COVID-19 | True | 
| 1 | male | B cell | lung | COVID-19 | True | 
| 2 | unknown | B cell | lung | COVID-19 | True | 
| 3 | male | B cell | lung | COVID-19 | True | 
| 4 | unknown | B cell | lung | COVID-19 | True | 
| ... | ... | ... | ... | ... | ... | 
| 2308 | male | B cell | lung | COVID-19 | True | 
| 2309 | male | B cell | lung | COVID-19 | True | 
| 2310 | male | B cell | lung | COVID-19 | True | 
| 2311 | male | B cell | lung | COVID-19 | True | 
| 2312 | male | B cell | lung | COVID-19 | True | 
2313 rows × 5 columns
[5]:
adata.var
[5]:
| | soma_joinid | feature_id | feature_name | feature_length |
|---|---|---|---|---|
| 0 | 8626 | ENSG00000161798 | AQP5 | 1884 | 
| 1 | 27047 | ENSG00000188229 | TUBB4B | 2037 | 
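For a full description of get_anndata() refer to help(cellxgene_census.get_anndata).
Don’t forget to close the census when you are done with it. This tutorial does so at the very end, after the remaining metadata queries.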
Querying cell metadata (obs)
The human cell metadata of the Census, for RNA assays, is located at census["census_data"]["homo_sapiens"].obs. This is a SOMADataFrame and as such it can be materialized as a pandas.DataFrame via the methods read().concat().to_pandas(). See also the helper function cellxgene_census.get_obs, which removes some boilerplate.
The mouse cell metadata is at census["census_data"]["mus_musculus"].obs.
For slicing the cell metadata there are two relevant arguments that can be passed through read():
- column_names: list of strings indicating which metadata columns to fetch.
- value_filter: Python expression with selection conditions to fetch rows. It is similar to pandas.DataFrame.query(); for full details see tiledb.QueryCondition. In short:
  - Expressions are one or more comparisons.
  - Comparisons are one of <column> <op> <value> or <column> <op> <column>.
  - Expressions can combine comparisons using and, or, & or |.
  - op is one of < | > | <= | >= | == | != or in.
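The read().concat().to_pandas() chain and the two read() arguments described above can be combined as in the following minimal sketch. The function name read_obs is hypothetical and only wraps the calls named in the text; the helper cellxgene_census.get_obs serves the same purpose.

```python
def read_obs(census, organism="homo_sapiens", column_names=None, value_filter=None):
    """Materialize (a slice of) the cell metadata as a pandas.DataFrame.

    `census` is an open census object; `column_names` and `value_filter`
    are forwarded unchanged to SOMADataFrame.read().
    """
    obs = census["census_data"][organism].obs
    # read() returns an iterator of tables, concat() merges them into one,
    # and to_pandas() converts the result to a pandas.DataFrame.
    return (
        obs.read(column_names=column_names, value_filter=value_filter)
        .concat()
        .to_pandas()
    )
```

For instance, read_obs(census, column_names=["sex"], value_filter="sex == 'unknown'") would fetch only the sex column for cells whose sex is recorded as "unknown".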
 
To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata.
[6]:
keys = list(census["census_data"]["homo_sapiens"].obs.keys())
keys
[6]:
['soma_joinid',
 'dataset_id',
 'assay',
 'assay_ontology_term_id',
 'cell_type',
 'cell_type_ontology_term_id',
 'development_stage',
 'development_stage_ontology_term_id',
 'disease',
 'disease_ontology_term_id',
 'donor_id',
 'is_primary_data',
 'self_reported_ethnicity',
 'self_reported_ethnicity_ontology_term_id',
 'sex',
 'sex_ontology_term_id',
 'suspension_type',
 'tissue',
 'tissue_ontology_term_id',
 'tissue_general',
 'tissue_general_ontology_term_id']
soma_joinid is a special SOMADataFrame column that is used for join operations. The definitions of all other columns can be found in the Census schema.
All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.
For example, let’s see which values are available for sex. To do this, we can load all cell metadata while fetching only the sex column.
[7]:
sex_cell_metadata = cellxgene_census.get_obs(census, "homo_sapiens", column_names=["sex"])
sex_cell_metadata.drop_duplicates()
[7]:
| | sex |
|---|---|
| 0 | unknown | 
| 669 | female | 
| 385437 | male | 
As you can see, there are only three different values for sex: "male", "female" and "unknown".
With this information we can fetch all cell metadata for a specific sex value, for example "unknown".
[8]:
cell_metadata_all_unknown_sex = cellxgene_census.get_obs(census, "homo_sapiens", value_filter="sex == 'unknown'")
cell_metadata_all_unknown_sex
[8]:
| | soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | is_primary_data | self_reported_ethnicity | self_reported_ethnicity_ontology_term_id | sex | sex_ontology_term_id | suspension_type | tissue | tissue_ontology_term_id | tissue_general | tissue_general_ontology_term_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | f171db61-e57e-4535-a06a-35d8b6ef8f2b | 10x 3' v3 | EFO:0009922 | syncytiotrophoblast cell | CL:0000525 | 9th week post-fertilization human stage | HsapDv:0000046 | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | nucleus | decidua basalis | UBERON:0000453 | placenta | UBERON:0001987 | 
| 1 | 1 | f171db61-e57e-4535-a06a-35d8b6ef8f2b | 10x 3' v3 | EFO:0009922 | placental villous trophoblast | CL:2000060 | 9th week post-fertilization human stage | HsapDv:0000046 | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | nucleus | decidua basalis | UBERON:0000453 | placenta | UBERON:0001987 | 
| 2 | 2 | f171db61-e57e-4535-a06a-35d8b6ef8f2b | 10x 3' v3 | EFO:0009922 | syncytiotrophoblast cell | CL:0000525 | 9th week post-fertilization human stage | HsapDv:0000046 | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | nucleus | decidua basalis | UBERON:0000453 | placenta | UBERON:0001987 | 
| 3 | 3 | f171db61-e57e-4535-a06a-35d8b6ef8f2b | 10x 3' v3 | EFO:0009922 | syncytiotrophoblast cell | CL:0000525 | 9th week post-fertilization human stage | HsapDv:0000046 | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | nucleus | decidua basalis | UBERON:0000453 | placenta | UBERON:0001987 | 
| 4 | 4 | f171db61-e57e-4535-a06a-35d8b6ef8f2b | 10x 3' v3 | EFO:0009922 | extravillous trophoblast | CL:0008036 | 9th week post-fertilization human stage | HsapDv:0000046 | normal | PATO:0000461 | ... | False | unknown | unknown | unknown | unknown | nucleus | decidua basalis | UBERON:0000453 | placenta | UBERON:0001987 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 3251329 | 56274573 | 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf | microwell-seq | EFO:0030002 | cord blood hematopoietic stem cell | CL:2000095 | newborn human stage | HsapDv:0000082 | normal | PATO:0000461 | ... | True | Han Chinese | HANCESTRO:0027 | unknown | unknown | cell | umbilical cord blood | UBERON:0012168 | blood | UBERON:0000178 | 
| 3251330 | 56274574 | 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf | microwell-seq | EFO:0030002 | cord blood hematopoietic stem cell | CL:2000095 | newborn human stage | HsapDv:0000082 | normal | PATO:0000461 | ... | True | Han Chinese | HANCESTRO:0027 | unknown | unknown | cell | umbilical cord blood | UBERON:0012168 | blood | UBERON:0000178 | 
| 3251331 | 56274575 | 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf | microwell-seq | EFO:0030002 | cord blood hematopoietic stem cell | CL:2000095 | newborn human stage | HsapDv:0000082 | normal | PATO:0000461 | ... | True | Han Chinese | HANCESTRO:0027 | unknown | unknown | cell | umbilical cord blood | UBERON:0012168 | blood | UBERON:0000178 | 
| 3251332 | 56274576 | 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf | microwell-seq | EFO:0030002 | cord blood hematopoietic stem cell | CL:2000095 | newborn human stage | HsapDv:0000082 | normal | PATO:0000461 | ... | True | Han Chinese | HANCESTRO:0027 | unknown | unknown | cell | umbilical cord blood | UBERON:0012168 | blood | UBERON:0000178 | 
| 3251333 | 56274577 | 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf | microwell-seq | EFO:0030002 | cord blood hematopoietic stem cell | CL:2000095 | newborn human stage | HsapDv:0000082 | normal | PATO:0000461 | ... | True | Han Chinese | HANCESTRO:0027 | unknown | unknown | cell | umbilical cord blood | UBERON:0012168 | blood | UBERON:0000178 | 
3251334 rows × 21 columns
You can use both column_names and value_filter to perform specific queries. For example, let’s fetch the disease column for the cell_type "B cell" in the tissue_general "lung", restricted to non-duplicated cells.
[9]:
cell_metadata_b_cell = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data==True",
    column_names=["disease"],
)
cell_metadata_b_cell.value_counts()
[9]:
disease                                cell_type  tissue_general  is_primary_data
lung adenocarcinoma                    B cell     lung            True               42720
squamous cell lung carcinoma           B cell     lung            True               10631
non-small cell lung carcinoma          B cell     lung            True                8742
normal                                 B cell     lung            True                8187
COVID-19                               B cell     lung            True                2313
chronic obstructive pulmonary disease  B cell     lung            True                2083
lung large cell carcinoma              B cell     lung            True                1534
pulmonary emphysema                    B cell     lung            True                1512
pulmonary fibrosis                     B cell     lung            True                1474
pleomorphic carcinoma                  B cell     lung            True                1210
interstitial lung disease              B cell     lung            True                 332
small cell lung carcinoma              B cell     lung            True                 204
lymphangioleiomyomatosis               B cell     lung            True                 133
pneumonia                              B cell     lung            True                  50
Name: count, dtype: int64
Querying gene metadata (var)
The human gene metadata of the Census is located at census["census_data"]["homo_sapiens"].ms["RNA"].var. Similarly to the cell metadata, it is a SOMADataFrame and thus we can also use its method read().
The mouse gene metadata is at census["census_data"]["mus_musculus"].ms["RNA"].var.
Let’s take a look at the metadata available for column selection and row filtering.
[10]:
keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())
keys
[10]:
['soma_joinid', 'feature_id', 'feature_name', 'feature_length']
With the exception of soma_joinid, these columns are defined in the Census schema. As with the cell metadata, we can use the same operations to explore and fetch gene metadata.
For example, to get the feature_name and feature_length of the genes "ENSG00000161798" and "ENSG00000188229" we can do the following.
[11]:
gene_metadata = cellxgene_census.get_var(
    census,
    "homo_sapiens",
    value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    column_names=["feature_name", "feature_length"],
)
gene_metadata
[11]:
| | feature_name | feature_length | feature_id |
|---|---|---|---|
| 0 | AQP5 | 1884 | ENSG00000161798 | 
| 1 | TUBB4B | 2037 | ENSG00000188229 | 
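Once gene metadata is in memory, a lookup from feature_id to feature_name is often convenient. The sketch below uses a hypothetical helper, feature_name_by_id, that works on any table indexable by column name, such as the pandas.DataFrame shown above or a plain dict of lists. It assumes both the feature_id and feature_name columns are present. The example values are copied from the output above.

```python
def feature_name_by_id(table):
    """Map each feature_id to its feature_name, pairing the columns row by row."""
    return dict(zip(table["feature_id"], table["feature_name"]))


# Stand-in for the gene_metadata result, using the two rows shown above.
example = {
    "feature_id": ["ENSG00000161798", "ENSG00000188229"],
    "feature_name": ["AQP5", "TUBB4B"],
}

lookup = feature_name_by_id(example)
print(lookup["ENSG00000161798"])  # AQP5
```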
[12]:
census.close()