cellxgene_census.experimental.pp.get_highly_variable_genes

cellxgene_census.experimental.pp.get_highly_variable_genes(census: Collection, organism: str, measurement_name: str = 'RNA', X_name: str = 'raw', obs_value_filter: str | None = None, obs_coords: None | bytes | Slice[bytes] | Sequence[bytes] | float | Slice[float] | Sequence[float] | int | Slice[int] | Sequence[int] | slice | Slice[slice] | Sequence[slice] | str | Slice[str] | Sequence[str] | datetime64 | Slice[datetime64] | Sequence[datetime64] | TimestampType | Slice[TimestampType] | Sequence[TimestampType] | Array | ChunkedArray | ndarray[Any, dtype[integer]] | ndarray[Any, dtype[datetime64]] = None, var_value_filter: str | None = None, var_coords: None | bytes | Slice[bytes] | Sequence[bytes] | float | Slice[float] | Sequence[float] | int | Slice[int] | Sequence[int] | slice | Slice[slice] | Sequence[slice] | str | Slice[str] | Sequence[str] | datetime64 | Slice[datetime64] | Sequence[datetime64] | TimestampType | Slice[TimestampType] | Sequence[TimestampType] | Array | ChunkedArray | ndarray[Any, dtype[integer]] | ndarray[Any, dtype[datetime64]] = None, n_top_genes: int = 1000, flavor: Literal['seurat_v3'] = 'seurat_v3', span: float = 0.3, batch_key: str | Sequence[str] | None = None, max_loess_jitter: float = 1e-06, batch_key_func: Callable[[...], Any] | None = None) → DataFrame

Convenience wrapper around a tiledbsoma.Experiment query and the cellxgene_census.experimental.pp.highly_variable_genes() function. It builds and executes a query, then annotates the genes in the query result (the var dataframe) based on their variability.

Parameters:
  • census – The Census object, usually returned by open_soma().

  • organism – The organism to query, usually one of "Homo sapiens" or "Mus musculus".

  • measurement_name – The measurement object to query. Defaults to "RNA".

  • X_name – The X layer to query. Defaults to "raw".

  • obs_value_filter – Value filter for the obs metadata. Value is a filter query written in the SOMA value_filter syntax.

  • obs_coords – Coordinates for the obs axis, which is indexed by the soma_joinid value. May be an int, a list of int, or a slice. The default, None, selects all.

  • var_value_filter – Value filter for the var metadata. Value is a filter query written in the SOMA value_filter syntax.

  • var_coords – Coordinates for the var axis, which is indexed by the soma_joinid value. May be an int, a list of int, or a slice. The default, None, selects all.

  • n_top_genes – Number of top-ranked genes to mark as highly variable. Defaults to 1000.

  • flavor – Method used to annotate genes. Must be "seurat_v3".

  • span – If flavor="seurat_v3", the fraction of obs/cells used to estimate the LOESS variance model fit.

  • batch_key – If specified, gene selection will be done by batch and combined. Specify the obs column name, or list of column names, identifying the batches. If not specified, all gene selection is done as a single batch. If multiple batch keys are specified, and no batch_key_func is specified, the batch key will be generated by converting values to string and concatenating them.

  • max_loess_jitter – The maximum jitter to add to the data in case of LOESS failure (which can occur when the dataset has low entry counts).

  • batch_key_func – Optional function to create a user-defined batch key. Function will be called once per row in the obs dataframe. Function will receive a single argument: a pandas.Series containing values specified in the batch_key argument.
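
As an illustration of batch_key_func, the following is a minimal sketch of a function that derives a custom batch key from two obs columns. The function name batch_key_from_row, the grouping rule, and the sample values are hypothetical; the obs frame is a stand-in built with pandas, and obs.apply(..., axis=1) only mimics the documented behavior of calling the function once per row with a pandas.Series.

```python
import pandas as pd


def batch_key_from_row(row: pd.Series) -> str:
    """Hypothetical batch_key_func: batch by dataset and a coarse assay family."""
    assay = str(row["assay"])
    assay_family = "10x" if assay.startswith("10x") else "other"
    return f"{row['dataset_id']}|{assay_family}"


# Stand-in for the obs columns named in batch_key (values are made up).
obs = pd.DataFrame(
    {
        "dataset_id": ["d1", "d1", "d2"],
        "assay": ["10x 3' v3", "Smart-seq2", "10x 3' v2"],
    }
)

# Emulate "called once per row in the obs dataframe".
batch_keys = obs.apply(batch_key_from_row, axis=1)
print(batch_keys.tolist())  # ['d1|10x', 'd1|other', 'd2|10x']
```

Under these assumptions, the function would be passed as batch_key=["dataset_id", "assay"], batch_key_func=batch_key_from_row, so that each row's Series carries exactly the two columns it reads.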

Returns:

pandas.DataFrame containing annotations for all var values specified by the query.
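
The sketch below shows one way the returned frame might be consumed. It uses a mock DataFrame with made-up values; only the highly_variable column and the soma_joinid-based index are taken from this page, while the remaining column names are an assumption modeled on the scanpy "seurat_v3" output and may differ in the actual result.

```python
import pandas as pd

# Mock of the returned annotations (made-up values). The index stands in
# for the var soma_joinid; column names other than highly_variable are assumed.
hvg = pd.DataFrame(
    {
        "means": [0.5, 1.2, 0.1, 3.4],
        "variances": [1.0, 4.0, 0.2, 9.0],
        "highly_variable_rank": [1.0, float("nan"), float("nan"), 0.0],
        "variances_norm": [2.1, 0.9, 0.8, 3.3],
        "highly_variable": [True, False, False, True],
    },
    index=pd.Index([10, 11, 12, 13], name="soma_joinid"),
)

# Keep only the flagged genes, best-ranked first.
top = hvg[hvg.highly_variable].sort_values("highly_variable_rank")

# soma_joinid values suitable for use as var_coords in a follow-up query.
hvg_soma_ids = top.index.to_numpy()
print(hvg_soma_ids.tolist())  # [13, 10]
```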

Raises:

ValueError – if the flavor parameter is not "seurat_v3".

Examples

Fetch a pandas.DataFrame containing var annotations for a subset of the cells matching the obs_value_filter:

>>> hvg = get_highly_variable_genes(
        census,
        organism="Mus musculus",
        obs_value_filter="is_primary_data == True and tissue_general == 'lung'",
        n_top_genes=500
    )

Fetch an anndata.AnnData containing only the top 500 highly variable genes:

>>> with cellxgene_census.open_soma(census_version="stable") as census:
        organism = "mus_musculus"
        obs_value_filter = "is_primary_data == True and tissue_general == 'lung'"
        # Get the highly variable genes
        hvg = cellxgene_census.experimental.pp.get_highly_variable_genes(
            census,
            organism=organism,
            obs_value_filter=obs_value_filter,
            n_top_genes=500
        )
        # Fetch AnnData - all cells matching obs_value_filter, just the HVGs
        hvg_soma_ids = hvg[hvg.highly_variable].index.values
        adata = cellxgene_census.get_anndata(
            census, organism=organism, obs_value_filter=obs_value_filter, var_coords=hvg_soma_ids
        )
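
The obs_value_filter and var_value_filter strings used in the examples above can also be assembled programmatically. The helper below is a hypothetical convenience, not part of cellxgene_census; it is a minimal sketch that handles only equality and list-membership conditions joined with "and", and it does not escape quotes inside values.

```python
def value_filter(**conditions) -> str:
    """Hypothetical helper: join column conditions into a value_filter string."""
    parts = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            # Membership test, e.g. tissue_general in ['lung', 'heart']
            parts.append(f"{column} in {list(value)!r}")
        else:
            # Equality test; repr() quotes strings and leaves booleans bare.
            parts.append(f"{column} == {value!r}")
    return " and ".join(parts)


obs_value_filter = value_filter(is_primary_data=True, tissue_general="lung")
print(obs_value_filter)
# is_primary_data == True and tissue_general == 'lung'
```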

Lifecycle

experimental