cellxgene_census.experimental.ml.huggingface.GeneformerTokenizer
- class cellxgene_census.experimental.ml.huggingface.GeneformerTokenizer(experiment: Experiment, *, obs_column_names: Sequence[str] | None = None, obs_attributes: Sequence[str] | None = None, max_input_tokens: int = 4096, special_token: bool = True, token_dictionary_file: str = '', gene_median_file: str = '', gene_mapping_file: str = '', **kwargs: Any)
- Generate a Hugging Face Dataset containing Geneformer token sequences for each cell in CELLxGENE Census ExperimentAxisQuery results (human).

This class requires the Geneformer package to be installed separately with:
pip install 'transformers[torch]<4.50' git+https://huggingface.co/ctheodoris/Geneformer@ebc1e096

DEPRECATION NOTICE: this class is planned to be removed from the cellxgene_census API and migrated into git:cellxgene-census/tools/models/geneformer.

Example usage:

```
import cellxgene_census
import tiledbsoma
from cellxgene_census.experimental.ml.huggingface import GeneformerTokenizer

with cellxgene_census.open_soma(census_version="latest") as census:
    with GeneformerTokenizer(
        census["census_data"]["homo_sapiens"],
        # set obs_query to define some subset of Census cells:
        obs_query=tiledbsoma.AxisQuery(value_filter="is_primary_data == True and tissue_general == 'tongue'"),
        obs_column_names=(
            "soma_joinid",
            "cell_type_ontology_term_id",
        ),
    ) as tokenizer:
        dataset = tokenizer.build()
```

Dataset item contents:
- input_ids: Geneformer token sequence for the cell
- length: Length of the token sequence
- and the specified obs_column_names (cell metadata from the experiment obs dataframe)

- __init__(experiment: Experiment, *, obs_column_names: Sequence[str] | None = None, obs_attributes: Sequence[str] | None = None, max_input_tokens: int = 4096, special_token: bool = True, token_dictionary_file: str = '', gene_median_file: str = '', gene_mapping_file: str = '', **kwargs: Any) -> None
- Initialize GeneformerTokenizer.

Args:
- experiment: Census Experiment to query
- obs_query: obs AxisQuery defining the set of Census cells to process (default all)
- obs_column_names: obs dataframe columns (cell metadata) to propagate into attributes of each Dataset item
- max_input_tokens: maximum length of Geneformer input token sequence (default 4096)
- special_token: whether to affix separator tokens to the sequence (default True) 
- token_dictionary_file, gene_median_file: pickle files supplying the mapping of Ensembl human gene IDs onto Geneformer token numbers and median expression values. By default, these will be loaded from the Geneformer package. 
- gene_mapping_file: optional pickle file with mapping for Census gene IDs to model’s 
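
The token_dictionary_file and gene_median_file arguments are described above as pickle files holding gene-ID mappings. As a rough sketch of what preparing such files might look like, the snippet below pickles two plain dictionaries and reads them back. The dictionary layout (gene ID to token number, gene ID to median), the gene IDs, the values, and the file names are all assumptions made for illustration; check the Geneformer package for the actual format it expects.

```python
import pickle
import tempfile
from pathlib import Path

# Made-up gene IDs, token numbers, and medians, for illustration only.
token_dictionary = {"<pad>": 0, "<mask>": 1, "ENSG_A": 2, "ENSG_B": 3}
gene_medians = {"ENSG_A": 5.0, "ENSG_B": 1.0}

# Hypothetical file names inside a scratch directory.
workdir = Path(tempfile.mkdtemp())
token_path = workdir / "token_dictionary.pkl"
median_path = workdir / "gene_median_dictionary.pkl"

for path, obj in ((token_path, token_dictionary), (median_path, gene_medians)):
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)

# Read the files back to confirm they round-trip unchanged.
with open(token_path, "rb") as fh:
    loaded_tokens = pickle.load(fh)
with open(median_path, "rb") as fh:
    loaded_medians = pickle.load(fh)

# Sanity check: every gene with a median should also have a token.
missing = sorted(set(loaded_medians) - set(loaded_tokens))
print(missing)
```

Under these assumptions, the paths would then be passed as token_dictionary_file=str(token_path) and gene_median_file=str(median_path).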
 
Methods
- X(layer_name, *[, batch_size, partitions, ...]): Returns an X layer as a sparse read.
- __init__(experiment, *[, obs_column_names, ...]): Initialize GeneformerTokenizer.
- build([from_generator_kwargs]): Build the dataset from query results.
- cell_item(cell_joinid, cell_Xrow): Given the expression vector for one cell, compute the Dataset item providing the Geneformer inputs (token sequence and metadata).
- close(): Releases resources associated with this query.
- obs(*[, column_names, batch_size, ...]): Returns obs as an Arrow table iterator.
- obs_joinids(): Returns obs soma_joinids as an Arrow array.
- obs_scene_ids(): Returns a pyarrow array with scene ids that contain obs from this query.
- obsm(layer): Returns an obsm layer as a sparse read.
- obsp(layer): Returns an obsp layer as a sparse read.
- to_anndata(X_name, *[, column_names, ...]): Exports the query to an in-memory AnnData object.
- to_spatialdata(X_name, *[, column_names, ...]): Returns a SpatialData object containing the query results
- var(*[, column_names, batch_size, ...]): Returns var as an Arrow table iterator.
- var_joinids(): Returns var soma_joinids as an Arrow array.
- var_scene_ids(): Return a pyarrow array with scene ids that contain var from this query.
- varm(layer): Returns a varm layer as a sparse read.
- varp(layer): Returns a varp layer as a sparse read.

Attributes
- indexer: A soma_joinid indexer for both obs and var axes.
- model_cls_token
- model_eos_token
- n_obs: The number of obs axis query results.
- n_vars: The number of var axis query results.
- obs_column_names
- max_input_tokens
- special_token
- model_gene_map
- model_gene_tokens
- model_gene_medians
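
cell_item is documented as turning one cell's expression vector into a token sequence, using per-gene token numbers and median expression values. The sketch below shows one way such a rank-based encoding can work in pure Python: scale each count by the gene's median, order genes from highest to lowest, map them to tokens, and truncate to max_input_tokens, optionally wrapping the result in separator tokens. This is not the library's implementation. The helper name rank_value_tokens, the gene IDs, the token numbers, and the placement of the separator tokens (one at the start, one at the end, both counted toward the length limit) are assumptions for illustration.

```python
from typing import Dict, List, Optional


def rank_value_tokens(
    counts: Dict[str, float],
    gene_medians: Dict[str, float],
    gene_tokens: Dict[str, int],
    max_input_tokens: int = 4096,
    cls_token: Optional[int] = None,
    eos_token: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper: order a cell's genes by median-scaled expression."""
    # Keep only expressed genes that have both a token and a median.
    known = {
        g: c for g, c in counts.items()
        if c > 0 and g in gene_tokens and g in gene_medians
    }
    # Scale each count by the gene's median so that genes which are high in
    # every cell do not dominate the ranking. Any per-cell scaling factor is
    # omitted because it cannot change the order within a single cell.
    scaled = {g: c / gene_medians[g] for g, c in known.items()}
    # Highest scaled expression first; ties broken by gene ID for determinism.
    ranked = sorted(scaled, key=lambda g: (-scaled[g], g))
    tokens = [gene_tokens[g] for g in ranked]
    # Reserve room for the separator tokens, if requested (assumed behavior).
    n_special = int(cls_token is not None) + int(eos_token is not None)
    tokens = tokens[: max(max_input_tokens - n_special, 0)]
    if cls_token is not None:
        tokens.insert(0, cls_token)
    if eos_token is not None:
        tokens.append(eos_token)
    return tokens


# Toy data; gene IDs, medians, and token numbers are made up.
counts = {"ENSG_A": 10.0, "ENSG_B": 4.0, "ENSG_C": 0.0, "ENSG_D": 3.0}
medians = {"ENSG_A": 5.0, "ENSG_B": 1.0, "ENSG_C": 2.0, "ENSG_D": 1.0}
vocab = {"ENSG_A": 11, "ENSG_B": 12, "ENSG_C": 13, "ENSG_D": 14}

# Mirror the documented item fields: input_ids and length.
item = {"input_ids": rank_value_tokens(counts, medians, vocab, cls_token=2, eos_token=3)}
item["length"] = len(item["input_ids"])
print(item)  # {'input_ids': [2, 12, 14, 11, 3], 'length': 5}
```

In the toy data, ENSG_B has a lower raw count than ENSG_A but ranks first because its median is much smaller, which is the point of scaling by the median before ranking.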