Understanding and filtering out duplicate cells

This tutorial explains why duplicate cells exist in the Census and shows different ways to handle them in Census queries using the is_primary_data cell metadata variable.

Contents

Why are there duplicate cells in the Census?
An example: duplicate cells in the Tabula Muris Senis data.
Filtering out duplicate cells.
1. Filtering out duplicate cells when reading the obs data frame.
2. Filtering out duplicate cells when creating an AnnData.
3. Filtering out duplicate cells for out-of-core operations.

Why are there duplicate cells in the Census?

Duplicate cells are labeled as False in the is_primary_data cell metadata variable. To learn more about this, please take a look at the corresponding section of the dataset schema.

The Census data is a concatenation of most RNA data from CZ CELLxGENE Discover and these data are ingested one dataset at a time. You can take a look at what data is included in the Census here.

In some cases, data from the same cell exists in different datasets; cells can therefore be duplicated throughout CELLxGENE Discover and, by extension, the Census.

The following are a few examples where cells are duplicated in CELLxGENE Discover:

There are datasets that combine data from other, pre-existing datasets.

For example, Tabula Sapiens has one dataset with all of its cells and separate datasets with cells divided by high-level lineage (i.e. immune, epithelial, stromal, endothelial).

A dataset may provide a meta-analysis of pre-existing datasets.

For example, Jin et al. performed a meta-analysis of COVID-19 data, and they included both the individual datasets and one concatenated dataset.

The Census includes all of these data to allow dataset-based queries, which would otherwise be limited if only non-duplicate cells were included.
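To make the idea concrete, here is a minimal sketch using made-up records rather than real Census data. The dataset names, the "cell" identifiers, and the cells list are all hypothetical; only the is_primary_data flag mirrors the Census schema. It shows how rows that describe the same cell in a second dataset are flagged as False, while a dataset-based query still returns them.

```python
# Toy records, not real Census data: each row is one cell as ingested from one dataset.
# "cell" is a hypothetical identifier used only to show which rows describe the same cell.
cells = [
    {"cell": "c1", "dataset": "atlas_all", "is_primary_data": True},
    {"cell": "c2", "dataset": "atlas_all", "is_primary_data": True},
    {"cell": "c3", "dataset": "atlas_all", "is_primary_data": True},
    # The lineage-specific dataset re-publishes two cells of the atlas,
    # so those rows are flagged as duplicates.
    {"cell": "c1", "dataset": "atlas_immune", "is_primary_data": False},
    {"cell": "c2", "dataset": "atlas_immune", "is_primary_data": False},
]

all_rows = len(cells)
primary_rows = sum(row["is_primary_data"] for row in cells)
unique_cells = len({row["cell"] for row in cells})

print(all_rows, primary_rows, unique_cells)  # 5 3 3

# A dataset-based query still returns every row of that dataset,
# including the rows flagged as duplicates.
immune_rows = [row for row in cells if row["dataset"] == "atlas_immune"]
print(len(immune_rows))  # 2
```

In this toy example, counting only the rows flagged True gives the number of distinct cells, while the lineage-specific dataset remains fully queryable on its own.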

An example: duplicate cells in the Tabula Muris Senis data

Let’s take a look at an example from the Census using the Tabula Muris Senis data. Some of its datasets contain duplicated cells.

We can obtain cell metadata for the main Tabula Muris Senis dataset: “All - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - 10x”, which contains the original (non-duplicated) cells.

And remember we must include the is_primary_data column.

[1]:

import cellxgene_census

tabula_muris_dataset_id = "48b37086-25f7-4ecd-be66-f5bb378e3aea"
census = cellxgene_census.open_soma()

tabula_muris_obs = cellxgene_census.get_obs(
    census,
    "mus_musculus",
    value_filter=f"dataset_id == '{tabula_muris_dataset_id}'",
    column_names=["tissue", "is_primary_data"],
)

The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.

Now let’s take a look at counts for the unique combinations of values.

[2]:

tabula_muris_obs.value_counts()

[2]:

tissue           is_primary_data  dataset_id
bone marrow      True             48b37086-25f7-4ecd-be66-f5bb378e3aea    40220
spleen           True             48b37086-25f7-4ecd-be66-f5bb378e3aea    35718
limb muscle      True             48b37086-25f7-4ecd-be66-f5bb378e3aea    28867
lung             True             48b37086-25f7-4ecd-be66-f5bb378e3aea    24540
kidney           True             48b37086-25f7-4ecd-be66-f5bb378e3aea    21647
tongue           True             48b37086-25f7-4ecd-be66-f5bb378e3aea    20680
mammary gland    True             48b37086-25f7-4ecd-be66-f5bb378e3aea    12295
thymus           True             48b37086-25f7-4ecd-be66-f5bb378e3aea     9275
bladder lumen    True             48b37086-25f7-4ecd-be66-f5bb378e3aea     8945
heart            True             48b37086-25f7-4ecd-be66-f5bb378e3aea     8613
trachea          True             48b37086-25f7-4ecd-be66-f5bb378e3aea     7976
liver            True             48b37086-25f7-4ecd-be66-f5bb378e3aea     7294
adipose tissue   True             48b37086-25f7-4ecd-be66-f5bb378e3aea     6777
pancreas         True             48b37086-25f7-4ecd-be66-f5bb378e3aea     6201
skin of body     True             48b37086-25f7-4ecd-be66-f5bb378e3aea     4454
large intestine  True             48b37086-25f7-4ecd-be66-f5bb378e3aea     1887
Name: count, dtype: int64

You can see all cells across the tissues are labelled as True for is_primary_data.

But what if we select cells from the dataset that only contains cells from the liver, “Liver - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - 10x”?

[3]:

tabula_muris_liver_dataset_id = "6202a243-b713-4e12-9ced-c387f8483dea"

tabula_muris_liver_obs = cellxgene_census.get_obs(
    census,
    "mus_musculus",
    value_filter=f"dataset_id == '{tabula_muris_liver_dataset_id}'",
    column_names=["tissue", "is_primary_data"],
)

And we take a look at counts for the unique combinations of values.

[4]:

tabula_muris_liver_obs.value_counts()

[4]:

tissue  is_primary_data  dataset_id
liver   False            6202a243-b713-4e12-9ced-c387f8483dea    7294
Name: count, dtype: int64

You can see that:

This dataset only contains cells from liver.
All cells are labelled as False for is_primary_data. This is because the cells are marked as duplicate cells of the main Tabula Muris Senis dataset.

Filtering out duplicate cells

In some cases you may want all cells for a specific biological context, for example “all natural killer cells from blood of female donors with COVID-19”, but be aware that such a query may return some duplicate cells.

We therefore recommend that you always look at is_primary_data and use that information based on your needs.

If you know a priori that you don’t want duplicated cells, this section shows you how to efficiently exclude them from your queries.

Filtering out duplicate cells when reading the obs data frame

Let’s say you are interested in the cell metadata of “all natural killer cells from blood of female donors with COVID-19” but you want to exclude duplicate cells. You can then use value_filter when reading the data frame to include only cells with is_primary_data as True.

Let’s first read the cell metadata including all cells:

[5]:

nk_cells = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="cell_type == 'natural killer cell' "
    "and disease == 'COVID-19' "
    "and sex == 'female' "
    "and tissue_general == 'blood'",
)

[6]:

nk_cells.shape

[6]:

(80935, 21)

And now we repeat the query only using cells marked as True for is_primary_data.

[7]:

nk_cells_primary = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="cell_type == 'natural killer cell' "
    "and disease == 'COVID-19' "
    "and tissue_general == 'blood' "
    "and sex == 'female' "
    "and is_primary_data == True",
)

[8]:

nk_cells_primary.shape

[8]:

(59109, 21)

You can see a clear reduction in the number of cells.
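As a quick worked check of that reduction, the short sketch below computes how many rows were dropped and what share of the unfiltered result they represent. The two counts are copied from the shape outputs above; the variable names are illustrative only.

```python
# Row counts taken from the two `shape` outputs above.
n_all = 80935      # query without the is_primary_data condition
n_primary = 59109  # query with is_primary_data == True

n_duplicates = n_all - n_primary
fraction_duplicates = n_duplicates / n_all

print(n_duplicates)                  # 21826
print(f"{fraction_duplicates:.1%}")  # 27.0%
```

For this particular query, roughly a quarter of the rows returned without the filter were duplicates.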

Filtering out duplicate cells when creating an AnnData

You can also utilize is_primary_data on the obs_value_filter of get_anndata.

Let’s repeat the process above, first querying with all cells included. To reduce bandwidth and memory usage, let’s fetch data for just one gene.

[9]:

adata = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    var_value_filter="feature_name == 'AQP5'",
    obs_value_filter="cell_type == 'natural killer cell' "
    "and disease == 'COVID-19' "
    "and sex == 'female' "
    "and tissue_general == 'blood'",
)

[10]:

len(adata.obs)

[10]:

And now we repeat the query only using cells marked as True for is_primary_data.

[11]:

adata_primary = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    var_value_filter="feature_name == 'AQP5'",
    obs_value_filter="cell_type == 'natural killer cell' "
    "and disease == 'COVID-19' "
    "and sex == 'female' "
    "and tissue_general == 'blood' "
    "and is_primary_data == True",
)

[12]:

len(adata_primary.obs)

[12]:

In this case you can also observe a clear reduction in the number of cells.
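Because the same conditions are reused for value_filter in get_obs and obs_value_filter in get_anndata, it may be convenient to assemble the filter string from a mapping instead of concatenating string literals by hand. The following is a minimal sketch; build_value_filter is a hypothetical helper, not part of cellxgene_census, and it assumes simple equality conditions on string values that contain no single quotes.

```python
def build_value_filter(conditions, primary_only=False):
    """Join equality conditions into one value_filter string.

    `conditions` maps a column name to the string value it must equal.
    Hypothetical helper for illustration; not part of cellxgene_census.
    """
    clauses = [f"{column} == '{value}'" for column, value in conditions.items()]
    if primary_only:
        # Keep only cells that are not flagged as duplicates.
        clauses.append("is_primary_data == True")
    # Joining with " and " guarantees whitespace between clauses.
    return " and ".join(clauses)


nk_conditions = {
    "cell_type": "natural killer cell",
    "disease": "COVID-19",
    "sex": "female",
    "tissue_general": "blood",
}

all_filter = build_value_filter(nk_conditions)
primary_filter = build_value_filter(nk_conditions, primary_only=True)

print(all_filter)
print(primary_filter)
```

The resulting strings could then be passed as value_filter or obs_value_filter in the queries shown above, so that the versions with and without duplicates differ only in the primary_only flag.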

Filtering out duplicate cells for out-of-core operations

Finally, we can use is_primary_data in the value_filter of the obs axis of an “Axis Query” to perform out-of-core operations.

In this example we only include the version with duplicated cells removed.

[13]:

import tiledbsoma

human = census["census_data"]["homo_sapiens"]

# initialize lazy query
query = human.axis_query(
    measurement_name="RNA",
    obs_query=tiledbsoma.AxisQuery(
        value_filter="cell_type == 'natural killer cell' "
        "and disease == 'COVID-19' "
        "and tissue_general == 'blood' "
        "and sex == 'female' "
        "and is_primary_data == True"
    ),
)

# get iterator for X
iterator = query.X("raw").tables()

# iterate in chunks
for chunk in iterator:
    print(chunk)

    # since this is a demo we stop right away
    break

pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[8448858,8448858,8448858,8448858,8448858,...,52812487,52812553,52812556,52812556,52812566]]
soma_dim_1: [[59,60,62,113,170,...,37033,37052,36904,36919,37033]]
soma_data: [[1,1,1,1,1,...,1,1,1,1,2]]
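The demo above stops after the first chunk. As a rough sketch of what an out-of-core computation over all chunks could look like, the example below accumulates the total count per gene one chunk at a time, so only the running totals stay in memory. It uses toy chunks written as plain dictionaries with the same three column names as the table above, and it assumes that soma_dim_1 indexes genes. The function sum_counts_per_gene and the toy values are hypothetical.

```python
from collections import defaultdict


def sum_counts_per_gene(chunks):
    """Accumulate soma_data per soma_dim_1 value across an iterable of chunks.

    Each chunk is a dict of equal-length lists keyed by column name.
    Hypothetical helper for illustration; not part of tiledbsoma.
    """
    totals = defaultdict(float)
    for chunk in chunks:
        for gene_idx, value in zip(chunk["soma_dim_1"], chunk["soma_data"]):
            totals[gene_idx] += value
    return dict(totals)


# Toy chunks standing in for the tables yielded by the iterator.
toy_chunks = [
    {"soma_dim_0": [0, 0, 1], "soma_dim_1": [59, 60, 59], "soma_data": [1.0, 1.0, 2.0]},
    {"soma_dim_0": [2, 3], "soma_dim_1": [60, 113], "soma_data": [1.0, 4.0]},
]

totals = sum_counts_per_gene(toy_chunks)
print(totals)  # {59: 3.0, 60: 2.0, 113: 4.0}
```

With real data, each pyarrow.Table chunk could be converted with its to_pydict() method before being passed in, although for large chunks a vectorized aggregation would likely be much faster than this pure-Python loop.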