Exploring all data from a tissue

This tutorial provides a series of examples of how to explore and query the Census in the context of a single tissue, lung. We will summarize cell and gene metadata, then fetch the single-cell expression counts and perform some basic data exploration with Scanpy.

Contents

Learning about the human lung data.
1. Learning about cells of the lung.
2. Learning about genes of the lung.
Fetching all single-cell human lung data from the Census.
Calculating QC metrics of the lung data.
Creating a normalized expression layer and embeddings.

⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data which is described in the Census schema.

Learning about the lung data in the Census

First we will open the Census. If you are not familiar with the basics of the Census API, you should take a look at the notebook Learning about the CZ CELLxGENE Census.

[1]:

import cellxgene_census
import numpy as np
import pandas as pd
import scanpy as sc

census = cellxgene_census.open_soma()

The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
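
The message above recommends pinning the release so that results stay reproducible. A minimal sketch of doing so, assuming network access and that the 2023-07-25 release is still hosted; `open_pinned_census` is a hypothetical helper name introduced for illustration, not part of the cellxgene_census API.

```python
CENSUS_VERSION = "2023-07-25"  # the release named in the message above


def open_pinned_census(version: str = CENSUS_VERSION):
    """Open a fixed Census release (hypothetical helper; needs network access)."""
    # Imported lazily so that defining the helper has no side effects.
    import cellxgene_census

    return cellxgene_census.open_soma(census_version=version)
```

Calling `census = open_pinned_census()` would then stand in for the unpinned `open_soma()` call in the cell above.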

Let’s first take a look at the number of cells from human lung:

[2]:

summary_table = census["census_info"]["summary_cell_counts"].read().concat().to_pandas()

summary_table.query("organism == 'Homo sapiens' & category == 'tissue_general' & label =='lung'")

[2]:

	soma_joinid	organism	category	ontology_term_id	unique_cell_count	total_cell_count	label
980	980	Homo sapiens	tissue_general	UBERON:0002048	2907156	6011592	lung

There you can see the total number of cells under total_cell_count and the number of unique cells under unique_cell_count (i.e. after removing cells that were included in multiple datasets).
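
To make the difference between the two columns concrete, the short calculation below uses the two values from the summary row above. It is plain arithmetic on copied numbers and does not query the Census.

```python
# Values copied from the summary row above.
total_cell_count = 6_011_592
unique_cell_count = 2_907_156

# Observations that are repeats of a cell already present in another dataset.
duplicate_cells = total_cell_count - unique_cell_count
unique_fraction = unique_cell_count / total_cell_count

print(f"duplicated observations: {duplicate_cells:,}")
print(f"unique fraction: {unique_fraction:.1%}")
```

Roughly half of the lung observations are duplicates, which is why the rest of this tutorial filters on is_primary_data.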

Let’s now take a look at the cell and gene information of this slice of the Census.

Learning about cells of lung data

Let’s load the cell metadata for all lung cells and select only the unique cells using is_primary_data.

[3]:

lung_obs = cellxgene_census.get_obs(
    census, "homo_sapiens", value_filter="tissue_general == 'lung' and is_primary_data == True"
)
lung_obs

[3]:

	soma_joinid	dataset_id	assay	assay_ontology_term_id	cell_type	cell_type_ontology_term_id	development_stage	development_stage_ontology_term_id	disease	disease_ontology_term_id	...	is_primary_data	self_reported_ethnicity	self_reported_ethnicity_ontology_term_id	sex	sex_ontology_term_id	suspension_type	tissue	tissue_ontology_term_id	tissue_general	tissue_general_ontology_term_id
0	5945423	9f222629-9e39-47d0-b83f-e08d610c7479	10x 3' v2	EFO:0009899	native cell	CL:0000003	unknown	unknown	normal	PATO:0000461	...	True	unknown	unknown	unknown	unknown	cell	lung	UBERON:0002048	lung	UBERON:0002048
1	5945426	9f222629-9e39-47d0-b83f-e08d610c7479	Drop-seq	EFO:0008722	ciliated columnar cell of tracheobronchial tree	CL:0002145	57-year-old human stage	HsapDv:0000151	pulmonary fibrosis	MONDO:0002771	...	True	unknown	unknown	male	PATO:0000384	cell	lung	UBERON:0002048	lung	UBERON:0002048
2	5945428	9f222629-9e39-47d0-b83f-e08d610c7479	10x 3' transcription profiling	EFO:0030003	CD8-positive, alpha-beta T cell	CL:0000625	unknown	unknown	squamous cell lung carcinoma	MONDO:0005097	...	True	unknown	unknown	unknown	unknown	cell	lung	UBERON:0002048	lung	UBERON:0002048
3	5945432	9f222629-9e39-47d0-b83f-e08d610c7479	10x 3' v2	EFO:0009899	CD4-positive, alpha-beta T cell	CL:0000624	unknown	unknown	lung adenocarcinoma	MONDO:0005061	...	True	unknown	unknown	unknown	unknown	cell	lung	UBERON:0002048	lung	UBERON:0002048
4	5945441	9f222629-9e39-47d0-b83f-e08d610c7479	10x 3' v2	EFO:0009899	CD8-positive, alpha-beta T cell	CL:0000625	unknown	unknown	lung adenocarcinoma	MONDO:0005061	...	True	unknown	unknown	unknown	unknown	cell	lung	UBERON:0002048	lung	UBERON:0002048
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2907151	56400868	8c42cfd0-0b0a-46d5-910c-fc833d83c45e	10x 3' v2	EFO:0009899	pericyte	CL:0000669	51-year-old human stage	HsapDv:0000145	normal	PATO:0000461	...	True	unknown	unknown	female	PATO:0000383	cell	lung	UBERON:0002048	lung	UBERON:0002048
2907152	56400869	8c42cfd0-0b0a-46d5-910c-fc833d83c45e	10x 3' v2	EFO:0009899	pericyte	CL:0000669	51-year-old human stage	HsapDv:0000145	normal	PATO:0000461	...	True	unknown	unknown	female	PATO:0000383	cell	lung	UBERON:0002048	lung	UBERON:0002048
2907153	56400870	8c42cfd0-0b0a-46d5-910c-fc833d83c45e	10x 3' v2	EFO:0009899	pericyte	CL:0000669	51-year-old human stage	HsapDv:0000145	normal	PATO:0000461	...	True	unknown	unknown	female	PATO:0000383	cell	lung	UBERON:0002048	lung	UBERON:0002048
2907154	56400871	8c42cfd0-0b0a-46d5-910c-fc833d83c45e	10x 3' v2	EFO:0009899	pericyte	CL:0000669	51-year-old human stage	HsapDv:0000145	normal	PATO:0000461	...	True	unknown	unknown	female	PATO:0000383	cell	lung	UBERON:0002048	lung	UBERON:0002048
2907155	56400872	8c42cfd0-0b0a-46d5-910c-fc833d83c45e	10x 3' v2	EFO:0009899	pericyte	CL:0000669	51-year-old human stage	HsapDv:0000145	normal	PATO:0000461	...	True	unknown	unknown	female	PATO:0000383	cell	lung	UBERON:0002048	lung	UBERON:0002048

2907156 rows × 21 columns

You can see that the number of rows matches the total number of unique lung cells in the Census. Now let’s take a deeper dive into the characteristics of these cells.

Datasets

First, let’s look at the datasets and collections from CELLxGENE Discover that contribute lung cells. For this we will use the dataset table at census["census_info"]["datasets"], which contains metadata for all datasets used to build this Census.

[4]:

census_datasets = (
    census["census_info"]["datasets"]
    .read(column_names=["collection_name", "dataset_title", "dataset_id", "soma_joinid"])
    .concat()
    .to_pandas()
)
census_datasets = census_datasets.set_index("dataset_id")
census_datasets

[4]:

	collection_name	dataset_title	soma_joinid
dataset_id
f171db61-e57e-4535-a06a-35d8b6ef8f2b	Spatial multiomics map of trophoblast developm...	donor_p13_trophoblasts	0
ecf2e08e-2032-4a9e-b466-b65b395f4a02	Spatial multiomics map of trophoblast developm...	All donors trophoblasts	1
74cff64f-9da9-4b2a-9b3b-8a04a1598040	Spatial multiomics map of trophoblast developm...	All donors all cell states (in vivo)	2
5af90777-6760-4003-9dba-8f945fec6fdf	Mapping single-cell transcriptomes in the intr...	Single-cell transcriptomic datasets of Renal c...	3
bd65a70f-b274-4133-b9dd-0d1431b6af34	Single-cell sequencing links multiregional imm...	Single-cell sequencing links multiregional imm...	4
...	...	...	...
f9ad5649-f372-43e1-a3a8-423383e5a8a2	Molecular characterization of selectively vuln...	Molecular characterization of selectively vuln...	588
456e8b9b-f872-488b-871d-94534090a865	Single-cell atlas of peripheral immune respons...	Single-cell atlas of peripheral immune respons...	589
2adb1f8a-a6b1-4909-8ee8-484814e2d4bf	Construction of a human cell landscape at sing...	Construction of a human cell landscape at sing...	590
e04daea4-4412-45b5-989e-76a9be070a89	A molecular cell atlas of the human lung from ...	Krasnow Lab Human Lung Cell Atlas, Smart-seq2	591
8c42cfd0-0b0a-46d5-910c-fc833d83c45e	A molecular cell atlas of the human lung from ...	Krasnow Lab Human Lung Cell Atlas, 10X	592

593 rows × 3 columns

The obs cell metadata pandas.DataFrame contains a column dataset_id that can be used for joining to the census_datasets pandas.DataFrame we just created.

So let’s take a look at the cell counts per dataset_id of the lung slice and then join to the dataset table to append the human-readable labels.

[5]:

dataset_cell_counts = pd.DataFrame(lung_obs[["dataset_id"]].value_counts())
dataset_cell_counts = dataset_cell_counts.rename(columns={0: "cell_counts"})
dataset_cell_counts = dataset_cell_counts.merge(census_datasets, on="dataset_id")

dataset_cell_counts

[5]:

	count	collection_name	dataset_title	soma_joinid
dataset_id
1e6a6ef9-7ec9-4c90-bbfb-2ad3c3165fd1	1028006	High-resolution single-cell atlas reveals dive...	The single-cell lung cancer atlas (LuCA) -- ex...	314
9f222629-9e39-47d0-b83f-e08d610c7479	784630	The integrated Human Lung Cell Atlas	An integrated cell atlas of the human lung in ...	56
f7c1c579-2dc0-47e2-ba19-8165c5a0e353	217738	A human cell atlas of fetal gene expression	Survey of human embryonic development	483
d8da613f-e681-4c69-b463-e94f5e66847f	116313	A molecular single-cell lung atlas of lethal C...	A molecular single-cell lung atlas of lethal C...	80
576f193c-75d0-4a11-bd25-8676587e6dc2	90384	HTAN MSK - Single cell profiling reveals novel...	Combined samples	377
d41f45c1-1b7b-4573-a998-ac5c5acb1647	82991	HTAN MSK - Transcriptional connectivity of reg...	Transcriptional connectivity of regulatory T c...	58
3dc61ca1-ce40-46b6-8337-f27260fd9a03	71752	A human fetal lung cell atlas uncovers proxima...	All cells	325
8c42cfd0-0b0a-46d5-910c-fc833d83c45e	60993	A molecular cell atlas of the human lung from ...	Krasnow Lab Human Lung Cell Atlas, 10X	592
2672b679-8048-4f5e-9786-f1b196ccfd08	57019	scRNA-seq assessment of the human lung, spleen...	Lung Parenchyma	416
9dbab10c-118d-496b-966a-67f1763a6b7d	49014	COVID-19 immune features revealed by a large-s...	Large-scale single-cell analysis reveals criti...	482
9968be68-ab65-4a38-9e1a-c9b6abece194	47909	Charting human development using a multi-endod...	Developing Human Atlas	78
3de0ad6d-4378-4f62-b37b-ec0b75a50d94	46500	LungMAP — Human data from a broad age healthy ...	Single-cell multiomic profiling of human lungs...	456
2f132ec9-24b5-422f-9be0-ccef03b4fe28	39778	SARS-CoV-2 receptor ACE2 and TMPRSS2 are prima...	Lung	312
1e5bd3b8-6a0e-4959-8d69-cafed30fe814	35699	Emphysema Cell Atlas	immune cells	130
53d208b0-2cfd-4366-9866-c3c6114081bc	35682	Tabula Sapiens	Tabula Sapiens - All Cells	475
1b9d8702-5af8-4142-85ed-020eb06ec4f6	35419	Cross-tissue immune cell analysis reveals tiss...	Global	411
4ed927e9-c099-49af-b8ce-a2652d069333	35284	Single-nucleus cross-tissue molecular referenc...	Single-nucleus cross-tissue molecular referenc...	367
2adb1f8a-a6b1-4909-8ee8-484814e2d4bf	33698	Construction of a human cell landscape at sing...	Construction of a human cell landscape at sing...	590
4b6af54a-4a21-46e0-bc8d-673c0561a836	18386	Emphysema Cell Atlas	non-immune cells	128
01209dce-3575-4bed-b1df-129f57fbc031	11059	Single-cell transcriptomics of human T cells r...	Single-cell transcriptomics of human T cells r...	531
e04daea4-4412-45b5-989e-76a9be070a89	8657	A molecular cell atlas of the human lung from ...	Krasnow Lab Human Lung Cell Atlas, Smart-seq2	591
f9846bb4-784d-4582-92c1-3f279e4c6f0c	176	A human fetal lung cell atlas uncovers proxima...	Fibroblast and smooth muscle	317
f64e1be1-de15-4d27-8da4-82225cd4c035	55	HTAN MSK - Single cell profiling reveals novel...	Immune cells	370
810ac45f-8969-4698-b42c-652f802f75c2	10	A human fetal lung cell atlas uncovers proxima...	Endothelium	320
0ba16f4b-cb87-4fa3-9363-19fc51eec6e7	4	A human fetal lung cell atlas uncovers proxima...	Myeloid	326

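These are all the datasets that contain lung cells; the number of cells from each is shown in the count column (the rename to cell_counts in the code above only takes effect on pandas versions that label that column 0). The table is sorted by cell count, so the collections contributing the most lung data appear at the top.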
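
The count-then-join pattern can be written so that the column name does not depend on the pandas version. The sketch below uses small made-up frames in place of lung_obs and census_datasets; all `toy_*` names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for lung_obs and census_datasets (hypothetical values).
toy_obs = pd.DataFrame({"dataset_id": ["ds_a", "ds_b", "ds_a", "ds_a", "ds_b"]})
toy_datasets = pd.DataFrame(
    {
        "dataset_id": ["ds_a", "ds_b", "ds_c"],
        "dataset_title": ["Atlas A", "Atlas B", "Atlas C"],
    }
).set_index("dataset_id")

# Name the counts explicitly instead of relying on the default column label,
# which differs between pandas versions.
toy_counts = toy_obs["dataset_id"].value_counts().rename("cell_counts").to_frame()

# Both frames are indexed by dataset_id, so an index join attaches the titles.
toy_counts = toy_counts.join(toy_datasets)
print(toy_counts)
```

Because the join is a left join on the counts, datasets without any matching cells (ds_c here) do not appear in the result.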

Assays

Let’s use similar logic to take a look at all the assays available for human lung data. The counts below show that most cells were profiled with 10x technologies, followed by sci-RNA-seq.

[6]:

lung_obs[["assay"]].value_counts()

[6]:

assay
10x 3' v2                                   1236968
10x 3' v3                                    702074
10x 5' v1                                    262323
sci-RNA-seq                                  217738
BD Rhapsody Whole Transcriptome Analysis     122902
10x 3' transcription profiling                97432
Drop-seq                                      65220
single cell library construction              58981
10x 5' v2                                     41852
microwell-seq                                 33698
Smart-seq2                                    25662
inDrop                                        25652
10x 3' v1                                      8638
Seq-Well                                       8016
Name: count, dtype: int64
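
The statement that 10x technologies dominate can be checked directly from the counts above. The snippet below copies those counts into a dictionary and computes the shares; it is a small illustration and does not query the Census.

```python
# Cell counts per assay, copied from the output above.
assay_counts = {
    "10x 3' v2": 1236968,
    "10x 3' v3": 702074,
    "10x 5' v1": 262323,
    "sci-RNA-seq": 217738,
    "BD Rhapsody Whole Transcriptome Analysis": 122902,
    "10x 3' transcription profiling": 97432,
    "Drop-seq": 65220,
    "single cell library construction": 58981,
    "10x 5' v2": 41852,
    "microwell-seq": 33698,
    "Smart-seq2": 25662,
    "inDrop": 25652,
    "10x 3' v1": 8638,
    "Seq-Well": 8016,
}

total_cells = sum(assay_counts.values())
# Every 10x assay label in this table starts with "10x".
tenx_cells = sum(n for assay, n in assay_counts.items() if assay.startswith("10x"))

print(f"10x share: {tenx_cells / total_cells:.1%}")
print(f"sci-RNA-seq share: {assay_counts['sci-RNA-seq'] / total_cells:.1%}")
```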

Disease

And now let’s take a look at diseased cell counts, with normal indicating non-diseased cells.

[7]:

lung_obs[["disease"]].value_counts()

[7]:

disease
normal                                   1164084
lung adenocarcinoma                       772120
COVID-19                                  331019
squamous cell lung carcinoma              209675
non-small cell lung carcinoma             120796
chronic obstructive pulmonary disease      55254
pulmonary fibrosis                         51343
interstitial lung disease                  45714
pneumonia                                  31923
pulmonary emphysema                        31792
small cell lung carcinoma                  31540
lung large cell carcinoma                  21167
cystic fibrosis                            17590
lymphangioleiomyomatosis                   12374
pleomorphic carcinoma                      10765
Name: count, dtype: int64

Sex

There doesn’t seem to be a strong sex bias, although male cells are somewhat more common than female cells.

[8]:

lung_obs[["sex"]].value_counts()

[8]:

sex
male       1402565
female     1122990
unknown     381601
Name: count, dtype: int64
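
To quantify the balance, the counts above can be turned into fractions, both over all cells and over cells with a known sex. This sketch rebuilds the counts as a small pandas.Series rather than reading them from lung_obs.

```python
import pandas as pd

# Counts copied from the output above.
sex_counts = pd.Series({"male": 1402565, "female": 1122990, "unknown": 381601})

# Fraction of all unique lung cells in each category.
sex_fraction = sex_counts / sex_counts.sum()

# Fraction among cells whose sex is annotated.
known = sex_counts.drop("unknown")
known_fraction = known / known.sum()

print(sex_fraction.round(3))
print(known_fraction.round(3))
```

On the real metadata, `lung_obs["sex"].value_counts(normalize=True)` would report the first set of fractions directly.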

Cell vs nucleus

The majority of the data come from cells rather than nuclei.

[9]:

lung_obs[["suspension_type"]].value_counts()

[9]:

suspension_type
cell               2468587
nucleus             438569
Name: count, dtype: int64

Cell types

Let’s take a look at the counts of the top 20 cell types.

[10]:

lung_obs[["cell_type"]].value_counts().head(20)

[10]:

cell_type
alveolar macrophage                                291507
native cell                                        263362
CD4-positive, alpha-beta T cell                    211456
CD8-positive, alpha-beta T cell                    189471
macrophage                                         154415
type II pneumocyte                                 128463
epithelial cell of lower respiratory tract         105090
classical monocyte                                 102303
natural killer cell                                 95953
T cell                                              92846
stromal cell                                        87714
B cell                                              81125
malignant cell                                      75917
plasma cell                                         64551
epithelial cell                                     59353
fibroblast                                          45305
capillary endothelial cell                          39416
regulatory T cell                                   36381
ciliated columnar cell of tracheobronchial tree     36049
epithelial cell of lung                             35467
Name: count, dtype: int64

Sub-tissues

We can look at the original tissue annotations that were mapped to “lung”.

[11]:

lung_obs[["tissue"]].value_counts()

[11]:

tissue
lung                         2576327
lung parenchyma               147410
alveolus of lung               54085
lingula of left lung           35284
upper lobe of right lung       32099
lower lobe of left lung        17854
right lung                     12880
upper lobe of left lung        10113
left lung                       9276
lower lobe of right lung        7981
middle lobe of right lung       3847
Name: count, dtype: int64

Learning about genes of lung data

Let’s load the gene metadata of the Census.

[12]:

lung_var = cellxgene_census.get_var(census, "homo_sapiens")
lung_var

[12]:

	soma_joinid	feature_id	feature_name	feature_length
0	0	ENSG00000121410	A1BG	3999
1	1	ENSG00000268895	A1BG-AS1	3374
2	2	ENSG00000148584	A1CF	9603
3	3	ENSG00000175899	A2M	6318
4	4	ENSG00000245105	A2M-AS1	2948
...	...	...	...	...
60659	60659	ENSG00000288719	RP4-669P10.21	4252
60660	60660	ENSG00000288720	RP11-852E15.3	7007
60661	60661	ENSG00000288721	RP5-973N23.5	7765
60662	60662	ENSG00000288723	RP11-553N16.6	1015
60663	60663	ENSG00000288724	RP13-546I2.2	625

60664 rows × 4 columns

You can see the total number of genes represented by the number of rows. This number is misleading, however, because it is the union of all genes across the Census, whereas the lung data come from only a subset of datasets.

So let’s take a look at the number of genes that were measured in each of those datasets.

To accomplish this we can use the “dataset presence matrix” at census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]. This is a boolean matrix N x M where N is the number of datasets and M is the number of genes in the Census.

So we can select the rows corresponding to the lung datasets and perform a row-wise sum.

[13]:

presence_matrix = cellxgene_census.get_presence_matrix(census, "Homo sapiens", "RNA")
presence_matrix = presence_matrix[dataset_cell_counts.soma_joinid, :]

[14]:

presence_matrix.sum(axis=1).A1

[14]:

array([17811, 50259, 44150, 34265, 22447, 23642, 26347, 20921, 24672,
       27705, 27243, 26323, 27181, 23203, 57042, 32610, 29620, 26454,
       23705, 38676, 47307, 23740, 22552, 20594, 19952], dtype=uint64)

[15]:

genes_measured = presence_matrix.sum(axis=1).A1
dataset_cell_counts["genes_measured"] = genes_measured
dataset_cell_counts

[15]:

	count	collection_name	dataset_title	soma_joinid	genes_measured
dataset_id
1e6a6ef9-7ec9-4c90-bbfb-2ad3c3165fd1	1028006	High-resolution single-cell atlas reveals dive...	The single-cell lung cancer atlas (LuCA) -- ex...	314	17811
9f222629-9e39-47d0-b83f-e08d610c7479	784630	The integrated Human Lung Cell Atlas	An integrated cell atlas of the human lung in ...	56	50259
f7c1c579-2dc0-47e2-ba19-8165c5a0e353	217738	A human cell atlas of fetal gene expression	Survey of human embryonic development	483	44150
d8da613f-e681-4c69-b463-e94f5e66847f	116313	A molecular single-cell lung atlas of lethal C...	A molecular single-cell lung atlas of lethal C...	80	34265
576f193c-75d0-4a11-bd25-8676587e6dc2	90384	HTAN MSK - Single cell profiling reveals novel...	Combined samples	377	22447
d41f45c1-1b7b-4573-a998-ac5c5acb1647	82991	HTAN MSK - Transcriptional connectivity of reg...	Transcriptional connectivity of regulatory T c...	58	23642
3dc61ca1-ce40-46b6-8337-f27260fd9a03	71752	A human fetal lung cell atlas uncovers proxima...	All cells	325	26347
8c42cfd0-0b0a-46d5-910c-fc833d83c45e	60993	A molecular cell atlas of the human lung from ...	Krasnow Lab Human Lung Cell Atlas, 10X	592	20921
2672b679-8048-4f5e-9786-f1b196ccfd08	57019	scRNA-seq assessment of the human lung, spleen...	Lung Parenchyma	416	24672
9dbab10c-118d-496b-966a-67f1763a6b7d	49014	COVID-19 immune features revealed by a large-s...	Large-scale single-cell analysis reveals criti...	482	27705
9968be68-ab65-4a38-9e1a-c9b6abece194	47909	Charting human development using a multi-endod...	Developing Human Atlas	78	27243
3de0ad6d-4378-4f62-b37b-ec0b75a50d94	46500	LungMAP — Human data from a broad age healthy ...	Single-cell multiomic profiling of human lungs...	456	26323
2f132ec9-24b5-422f-9be0-ccef03b4fe28	39778	SARS-CoV-2 receptor ACE2 and TMPRSS2 are prima...	Lung	312	27181
1e5bd3b8-6a0e-4959-8d69-cafed30fe814	35699	Emphysema Cell Atlas	immune cells	130	23203
53d208b0-2cfd-4366-9866-c3c6114081bc	35682	Tabula Sapiens	Tabula Sapiens - All Cells	475	57042
1b9d8702-5af8-4142-85ed-020eb06ec4f6	35419	Cross-tissue immune cell analysis reveals tiss...	Global	411	32610
4ed927e9-c099-49af-b8ce-a2652d069333	35284	Single-nucleus cross-tissue molecular referenc...	Single-nucleus cross-tissue molecular referenc...	367	29620
2adb1f8a-a6b1-4909-8ee8-484814e2d4bf	33698	Construction of a human cell landscape at sing...	Construction of a human cell landscape at sing...	590	26454
4b6af54a-4a21-46e0-bc8d-673c0561a836	18386	Emphysema Cell Atlas	non-immune cells	128	23705
01209dce-3575-4bed-b1df-129f57fbc031	11059	Single-cell transcriptomics of human T cells r...	Single-cell transcriptomics of human T cells r...	531	38676
e04daea4-4412-45b5-989e-76a9be070a89	8657	A molecular cell atlas of the human lung from ...	Krasnow Lab Human Lung Cell Atlas, Smart-seq2	591	47307
f9846bb4-784d-4582-92c1-3f279e4c6f0c	176	A human fetal lung cell atlas uncovers proxima...	Fibroblast and smooth muscle	317	23740
f64e1be1-de15-4d27-8da4-82225cd4c035	55	HTAN MSK - Single cell profiling reveals novel...	Immune cells	370	22552
810ac45f-8969-4698-b42c-652f802f75c2	10	A human fetal lung cell atlas uncovers proxima...	Endothelium	320	20594
0ba16f4b-cb87-4fa3-9363-19fc51eec6e7	4	A human fetal lung cell atlas uncovers proxima...	Myeloid	326	19952

You can see the number of genes measured in each dataset in the genes_measured column. Now let’s get the genes that were measured in all datasets.

[16]:

var_somaid = np.nonzero(presence_matrix.sum(axis=0).A1 == presence_matrix.shape[0])[0].tolist()

[17]:

lung_var = lung_var.query(f"soma_joinid in {var_somaid}")
lung_var

[17]:

	soma_joinid	feature_id	feature_name	feature_length
0	0	ENSG00000121410	A1BG	3999
3	3	ENSG00000175899	A2M	6318
8	8	ENSG00000128274	A4GALT	3358
10	10	ENSG00000094914	AAAS	4727
11	11	ENSG00000081760	AACS	16039
...	...	...	...	...
29951	29951	ENSG00000177272	KCNA3	2476
30157	30157	ENSG00000184709	LRRC26	1209
30185	30185	ENSG00000087250	MT3	1679
30202	30202	ENSG00000136352	NKX2-1	3165
30512	30512	ENSG00000231439	WASIR2	1054

11595 rows × 4 columns

The number of rows represents the genes that were measured in all lung datasets.
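
The two operations on the presence matrix (a row-wise sum for genes per dataset, and a column-wise test for genes present in every dataset) can be illustrated on a tiny dense array. The values below are made up, and a dense NumPy array is used instead of the sparse matrix returned by the Census, so no `.A1` flattening is needed.

```python
import numpy as np

# Toy presence matrix: 3 datasets (rows) x 5 genes (columns); hypothetical values.
toy_presence = np.array(
    [
        [1, 1, 0, 1, 1],
        [1, 0, 0, 1, 1],
        [1, 1, 1, 1, 0],
    ],
    dtype=bool,
)

# Row-wise sum: number of genes measured in each dataset.
genes_per_dataset = toy_presence.sum(axis=1)

# A gene is shared when its column sum equals the number of datasets.
n_datasets = toy_presence.shape[0]
shared_gene_idx = np.nonzero(toy_presence.sum(axis=0) == n_datasets)[0]

print(genes_per_dataset)
print(shared_gene_idx)
```

Only genes 0 and 3 are flagged in every row of the toy matrix, so they are the only ones kept, mirroring how var_somaid is built above.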

Summary of lung metadata

In the previous sections, using the Census we learned the following information:

The total number of unique lung cells and their composition for:
- Number of datasets.
- Number of sequencing technologies, most of which are 10x.
- Disease status: “normal” is the largest category (~40% of cells), and the most common diseases are “lung adenocarcinoma” and “COVID-19”.
- No strong sex bias.
- Mostly data from cells (~85%) rather than nuclei (~15%).
A total of ~12k genes were measured in every lung dataset.

Fetching all single-cell human lung data from the Census

Since loading the entire lung data is resource-intensive, for the sake of this exercise let’s load a subset of the lung data into an anndata.AnnData object and perform some exploratory analysis.

We will subset to 100,000 random unique cells using the lung_obs pandas.DataFrame we previously created.

[18]:

lung_cell_subsampled_n = 100000
lung_cell_subsampled_ids = lung_obs["soma_joinid"].sample(lung_cell_subsampled_n, random_state=1).tolist()
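
Fixing random_state makes the subsample reproducible: the same seed selects the same cells on every run. The sketch below demonstrates this on a small made-up Series of ids; `toy_ids` is a hypothetical stand-in for lung_obs["soma_joinid"].

```python
import pandas as pd

# Hypothetical ids standing in for lung_obs["soma_joinid"].
toy_ids = pd.Series(range(1_000, 1_050), name="soma_joinid")

# Two draws with the same seed return the same ids in the same order.
first_draw = toy_ids.sample(10, random_state=1).tolist()
second_draw = toy_ids.sample(10, random_state=1).tolist()

print(first_draw == second_draw)
# Sampling is without replacement by default, so no id repeats.
print(len(set(first_draw)))
```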

Now we can directly use the values of soma_joinid for querying the Census data and obtaining an AnnData object.

[19]:

lung_gene_ids = lung_var["soma_joinid"].to_numpy()
lung_adata = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_coords=lung_cell_subsampled_ids,
    var_coords=lung_gene_ids,
)

lung_adata.var_names = lung_adata.var["feature_name"]

[20]:

lung_adata

[20]:

AnnData object with n_obs × n_vars = 100000 × 11595
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

We are done with the Census, so let’s close it.

[21]:

census.close()
del census

Calculating QC metrics of the lung data

Now let’s take a look at some QC metrics.

Top genes per cell

[22]:

sc.pl.highest_expr_genes(lung_adata, n_top=20)

../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_39_0.png

Number of sequenced genes by assay

[23]:

sc.pp.calculate_qc_metrics(lung_adata, percent_top=None, log1p=False, inplace=True)
sc.pl.violin(lung_adata, "n_genes_by_counts", groupby="assay", jitter=0.4, rotation=90)

../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_41_0.png
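
The two per-cell metrics plotted in this section have simple definitions: n_genes_by_counts is the number of genes with at least one count in a cell, and total_counts is the sum of counts in that cell. The sketch below computes both on a tiny made-up matrix with NumPy; it illustrates the definitions and is not Scanpy's implementation.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); hypothetical values.
toy_counts = np.array(
    [
        [0, 3, 0, 1],
        [5, 0, 0, 0],
        [2, 2, 2, 2],
    ]
)

# Genes with at least one count in each cell.
n_genes_by_counts = (toy_counts > 0).sum(axis=1)
# Summed counts in each cell.
total_counts = toy_counts.sum(axis=1)

print(n_genes_by_counts)
print(total_counts)
```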

Total counts by assay

[24]:

sc.pl.violin(lung_adata, "total_counts", groupby="assay", jitter=0.4, rotation=90)

../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_43_0.png

You can see that Smart-seq2 is an outlier for total counts per cell, so let’s exclude it to see what the rest of the assays look like.

[25]:

sc.pl.violin(
    lung_adata[lung_adata.obs["assay"] != "Smart-seq2",],
    "total_counts",
    groupby="assay",
    jitter=0.4,
    rotation=90,
)

../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_45_0.png
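
Smart-seq2 reads cover full-length transcripts, so read counts tend to grow with gene length, and the feature_length column in the gene metadata makes a length correction possible. The sketch below shows one common correction, reads per kilobase, on made-up numbers. This is an assumption for illustration; the tutorial itself does not show a gene-length procedure, and all `toy_*` names are hypothetical.

```python
import numpy as np

# Hypothetical read counts for 2 cells x 3 genes from a full-length protocol.
toy_reads = np.array([[100.0, 50.0, 10.0], [20.0, 80.0, 40.0]])
# Gene lengths in base pairs, playing the role of var["feature_length"].
toy_lengths = np.array([4000.0, 1000.0, 2000.0])

# Reads per kilobase: divide each gene's column by its length in kb.
reads_per_kb = toy_reads / (toy_lengths / 1_000)

print(reads_per_kb)
```

After this step the values could be depth-normalized in the same way as the UMI-based assays.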

Creating a normalized expression layer and embeddings

Let’s perform a bread-and-butter normalization and take a look at UMAP embeddings. For all the steps below we’ll exclude Smart-seq2, as it requires an extra step to normalize by gene length.

[26]:

lung_adata = lung_adata[lung_adata.obs["assay"] != "Smart-seq2",].copy()
# Keep an independent copy of the raw counts so that the in-place
# normalization below cannot modify the layer used for gene selection.
lung_adata.layers["counts"] = lung_adata.X.copy()

Now let’s do some basic normalization:

1. Normalize by sequencing depth.
2. Transform to log scale.
3. Select 500 highly variable genes.
4. Scale values across the gene axis.

[27]:

sc.pp.normalize_total(lung_adata, target_sum=1e4)
sc.pp.log1p(lung_adata)
sc.pp.highly_variable_genes(lung_adata, n_top_genes=500, flavor="seurat_v3", layer="counts")
lung_adata = lung_adata[:, lung_adata.var.highly_variable]
sc.pp.scale(lung_adata, max_value=10)

/home/ssm-user/cellxgene-census/venv/lib/python3.10/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
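
The first two normalization steps can be sketched in plain NumPy on a toy matrix. This is an illustration under simplifying assumptions (a small dense array with made-up values), not Scanpy's implementation. It also shows why taking a copy of the raw counts before in-place operations keeps them intact.

```python
import numpy as np

# Hypothetical raw counts: 2 cells x 3 genes.
toy_X = np.array([[1.0, 3.0, 0.0], [2.0, 2.0, 4.0]])
# Independent copy of the raw counts; plain assignment would alias toy_X.
toy_counts_layer = toy_X.copy()

# Normalize by sequencing depth so that every cell sums to target_sum.
target_sum = 1e4
toy_X *= target_sum / toy_X.sum(axis=1, keepdims=True)

# Transform to log scale in place.
np.log1p(toy_X, out=toy_X)

print(toy_X)
print(toy_counts_layer)
```

The copied layer still holds the original integers after both in-place steps, which is what a count-based method such as the seurat_v3 flavor expects.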

And reduce dimensionality by obtaining UMAP embeddings.

[28]:

sc.tl.pca(lung_adata)
sc.pp.neighbors(lung_adata)
sc.tl.umap(lung_adata)

And plot these embeddings.

[29]:

from random import randint

# One random hex color per distinct cell type.
n_cell_types = len(lung_adata.obs["cell_type"].drop_duplicates())
colors = ["#%06X" % randint(0, 0xFFFFFF) for _ in range(n_cell_types)]

[30]:

sc.pl.umap(lung_adata, color="cell_type", palette=colors, legend_loc=None)

../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_54_1.png

Let’s color by assay.

[31]:

sc.pl.umap(lung_adata, color="assay")

../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_56_1.png

The high number of cell types makes the plot hard to read, so let’s look at the 20 most abundant cell types.

[32]:

top_cell_types = lung_adata.obs["cell_type"].value_counts()
top_cell_types = list(top_cell_types.reset_index().head(20)["cell_type"])

[33]:

lung_adata_top_cell_types = lung_adata[[i in top_cell_types for i in lung_adata.obs["cell_type"]], :]
sc.pl.umap(lung_adata_top_cell_types, color="cell_type")

../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_59_1.png
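
The top-N selection used above boils down to two steps: rank labels by frequency, then build a boolean mask for the rows carrying one of the top labels. A minimal sketch on a made-up Series follows; `toy_cell_types` is hypothetical, and `isin` is used here as an equivalent alternative to the list comprehension in the cell above.

```python
import pandas as pd

# Hypothetical cell type labels standing in for lung_adata.obs["cell_type"].
toy_cell_types = pd.Series(
    ["T cell", "B cell", "T cell", "macrophage", "T cell", "B cell", "fibroblast"]
)

# value_counts sorts by frequency, so the first entries are the most abundant.
top_2 = toy_cell_types.value_counts().head(2).index.tolist()

# Boolean mask selecting the rows that belong to one of the top labels.
mask = toy_cell_types.isin(top_2)

print(top_2)
print(int(mask.sum()))
```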

Let’s color this subset of the data by assay.

[34]:

sc.pl.umap(lung_adata_top_cell_types, color="assay")

../../_images/notebooks_analysis_demo_comp_bio_explore_and_load_lung_data_61_1.png