cellxgene_census.experimental.ml.huggingface.GeneformerTokenizer

class cellxgene_census.experimental.ml.huggingface.GeneformerTokenizer(experiment: Experiment, *, obs_column_names: Sequence[str] | None = None, obs_attributes: Sequence[str] | None = None, max_input_tokens: int = 2048, special_token: bool = False, token_dictionary_file: str = '', gene_median_file: str = '', gene_mapping_file: str = '', **kwargs: Any)

Generate a Hugging Face Dataset containing Geneformer token sequences for each cell in CELLxGENE Census ExperimentAxisQuery results (human).

This class requires the Geneformer package to be installed separately with: pip install git+https://huggingface.co/ctheodoris/Geneformer@eb038a6

Example usage:

```
import cellxgene_census
import tiledbsoma
from cellxgene_census.experimental.ml.huggingface import GeneformerTokenizer

with cellxgene_census.open_soma(census_version="latest") as census:
    with GeneformerTokenizer(
        census["census_data"]["homo_sapiens"],
        # set obs_query to define some subset of Census cells:
        obs_query=tiledbsoma.AxisQuery(
            value_filter="is_primary_data == True and tissue_general == 'tongue'"
        ),
        obs_column_names=(
            "soma_joinid",
            "cell_type_ontology_term_id",
        ),
    ) as tokenizer:
        dataset = tokenizer.build()
```

Dataset item contents:
- input_ids: Geneformer token sequence for the cell
- length: length of the token sequence
- the specified obs_column_names (cell metadata from the experiment obs dataframe)
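For orientation, a hypothetical item with those fields might look like the dict below, assuming obs_column_names included "soma_joinid" and "cell_type_ontology_term_id"; every value shown is invented for illustration:

```python
# Hypothetical Dataset item; token numbers, joinid, and ontology term are made up.
item = {
    "input_ids": [16026, 9374, 20],  # invented Geneformer token numbers
    "length": 3,                     # always equals len(input_ids)
    "soma_joinid": 42,               # invented cell joinid
    "cell_type_ontology_term_id": "CL:0000066",
}
assert item["length"] == len(item["input_ids"])
```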

__init__(experiment: Experiment, *, obs_column_names: Sequence[str] | None = None, obs_attributes: Sequence[str] | None = None, max_input_tokens: int = 2048, special_token: bool = False, token_dictionary_file: str = '', gene_median_file: str = '', gene_mapping_file: str = '', **kwargs: Any) None

Initialize GeneformerTokenizer.

Args:
- experiment: Census Experiment to query
- obs_query: obs AxisQuery defining the set of Census cells to process (default: all)
- obs_column_names: obs dataframe columns (cell metadata) to propagate into attributes of each Dataset item
- max_input_tokens: maximum length of the Geneformer input token sequence (default 2048)
- special_token: whether to affix separator tokens to the sequence (default False)
- token_dictionary_file, gene_median_file: pickle files supplying the mapping of Ensembl human gene IDs onto Geneformer token numbers and median expression values. By default, these are loaded from the Geneformer package.
- gene_mapping_file: optional pickle file with a mapping of Census gene IDs onto the model's gene IDs
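As a rough sketch of what supplying a custom file involves, the snippet below writes and reloads a token dictionary, assuming the file is a pickled dict mapping Ensembl human gene IDs to Geneformer token numbers; the gene IDs and token numbers here are invented, not real Geneformer vocabulary entries:

```python
import os
import pickle
import tempfile

# Invented gene-ID -> token-number mapping, purely for illustration.
token_dict = {"ENSG00000000003": 5, "ENSG00000000005": 6}

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "token_dictionary.pkl")

# Write the mapping as a plain pickle file...
with open(path, "wb") as f:
    pickle.dump(token_dict, f)

# ...and read it back, as a consumer of token_dictionary_file would.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

A file produced this way could then be passed via the token_dictionary_file argument (and analogously for gene_median_file with median expression values).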

Methods

X(layer_name, *[, batch_size, partitions, ...])

Returns an X layer as a sparse read.

__init__(experiment, *[, obs_column_names, ...])

Initialize GeneformerTokenizer.

build([from_generator_kwargs])

Build the dataset from query results.

cell_item(cell_joinid, cell_Xrow)

Given the expression vector for one cell, compute the Dataset item providing the Geneformer inputs (token sequence and metadata).
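The token sequence follows Geneformer's rank-value encoding idea: normalize each expressed gene by its corpus-wide median expression, order genes by descending normalized value, map them to token numbers, and truncate. A simplified stand-alone sketch of that idea (not the actual cell_item implementation), assuming dense NumPy inputs:

```python
import numpy as np

def rank_value_encode(x, gene_tokens, gene_medians, max_input_tokens=2048):
    """Simplified sketch of Geneformer-style rank-value encoding.

    x: dense expression vector for one cell
    gene_tokens: token number for each gene position
    gene_medians: per-gene median nonzero expression across the corpus
    """
    nz = np.flatnonzero(x)                      # expressed genes only
    scores = x[nz] / gene_medians[nz]           # normalize by gene medians
    order = np.argsort(-scores, kind="stable")  # highest normalized value first
    ranked = nz[order][:max_input_tokens]       # truncate to the model's limit
    return [int(gene_tokens[i]) for i in ranked]

# Tiny invented example: 5 genes, 3 expressed.
x = np.array([0.0, 4.0, 1.0, 0.0, 2.0])
tokens = np.array([10, 11, 12, 13, 14])
medians = np.array([1.0, 2.0, 1.0, 1.0, 0.5])
# rank_value_encode(x, tokens, medians) -> [14, 11, 12]
```

Gene 4 ranks first despite lower raw counts because its normalized value (2.0 / 0.5 = 4.0) is highest, which is the point of median normalization: it emphasizes genes expressed above their usual level.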

close()

Releases resources associated with this query.

obs(*[, column_names, batch_size, ...])

Returns obs as an Arrow table iterator.

obs_joinids()

Returns obs soma_joinids as an Arrow array.

obsm(layer)

Returns an obsm layer as a sparse read.

obsp(layer)

Returns an obsp layer as a sparse read.

to_anndata(X_name, *[, column_names, ...])

Executes the query and returns the result as an in-memory AnnData object.

var(*[, column_names, batch_size, ...])

Returns var as an Arrow table iterator.

var_joinids()

Returns var soma_joinids as an Arrow array.

varm(layer)

Returns a varm layer as a sparse read.

varp(layer)

Returns a varp layer as a sparse read.

Attributes

indexer

A soma_joinid indexer for both obs and var axes.

model_cls_token

model_eos_token

n_obs

The number of obs axis query results.

n_vars

The number of var axis query results.

obs_column_names

max_input_tokens

special_token

model_gene_map

model_gene_tokens

model_gene_medians