cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe
- class cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe(experiment: Experiment, measurement_name: str = 'RNA', X_name: str = 'raw', obs_query: AxisQuery | None = None, var_query: AxisQuery | None = None, obs_column_names: Sequence[str] = (), batch_size: int = 1, shuffle: bool = True, seed: int | None = None, return_sparse_X: bool = False, soma_chunk_size: int | None = 64, use_eager_fetch: bool = True, shuffle_chunk_count: int | None = 2000, encoders: list[cellxgene_census.experimental.ml.encoders.Encoder] | None = None)
An torchdata.datapipes.iter.IterDataPipe that reads obs and X data from a tiledbsoma.Experiment, based upon the specified queries along the obs and var axes. Provides an iterator over these data when the object is passed to Python's built-in iter function.

>>> for batch in iter(ExperimentDataPipe(...)):
...     X_batch, y_batch = batch
The batch_size parameter controls the number of rows of obs and X data that are returned in each iteration. If the batch_size is 1, then each Tensor will have rank 1:

>>> (tensor([0., 0., 0., 0., 0., 1., 0., 0., 0.]),  # X data
     tensor([2415,    0,    0], dtype=torch.int64)) # obs data, encoded
For larger batch_size values, the returned Tensors will have rank 2:

>>> DataLoader(..., batch_size=3, ...):
    (tensor([[0., 0., 0., 0., 0., 1., 0., 0., 0.],   # X batch
             [0., 0., 0., 0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
     tensor([[2415,    0,    0],                     # obs batch
             [2416,    0,    4],
             [2417,    0,    3]], dtype=torch.int64))
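Putting this together, the following is a minimal sketch of end-to-end usage, assuming the standard Census opening pattern; the tissue filter, column selection, and batch size below are illustrative assumptions, not requirements. cellxgene_census.experimental.ml.experiment_dataloader wraps the datapipe in a torch.utils.data.DataLoader with compatible settings:

>>> import cellxgene_census
>>> import tiledbsoma as soma
>>> from cellxgene_census.experimental.ml import ExperimentDataPipe, experiment_dataloader
>>> census = cellxgene_census.open_soma()
>>> experiment = census["census_data"]["homo_sapiens"]
>>> datapipe = ExperimentDataPipe(
...     experiment,
...     measurement_name="RNA",
...     X_name="raw",
...     obs_query=soma.AxisQuery(value_filter="tissue_general == 'tongue'"),  # illustrative filter
...     obs_column_names=["cell_type"],  # illustrative column selection
...     batch_size=16,
... )
>>> loader = experiment_dataloader(datapipe)
>>> for X_batch, obs_batch in loader:
...     ...  # feed X_batch / obs_batch to the model's training step here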
The return_sparse_X parameter controls whether the X data is returned as a dense or sparse torch.Tensor. If the model supports use of sparse torch.Tensors, this will reduce memory usage.

The obs_column_names parameter determines the data columns that are returned in the obs Tensor. String-typed columns are encoded as integer values. If needed, these values can be decoded by obtaining the encoder for a given obs column name and calling its inverse_transform method:

>>> exp_data_pipe.obs_encoders["<obs_attr_name>"].inverse_transform(encoded_values)
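Continuing the sketch above (still assuming obs_column_names=["cell_type"]), encoded obs values can be decoded in bulk; the column index used here is an assumption to be checked against your own column selection:

>>> X_batch, obs_batch = next(iter(datapipe))
>>> codes = obs_batch[:, -1].numpy()  # assumes "cell_type" is the last obs column in this sketch
>>> cell_types = datapipe.obs_encoders["cell_type"].inverse_transform(codes)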
Lifecycle
experimental
- __init__(experiment: Experiment, measurement_name: str = 'RNA', X_name: str = 'raw', obs_query: AxisQuery | None = None, var_query: AxisQuery | None = None, obs_column_names: Sequence[str] = (), batch_size: int = 1, shuffle: bool = True, seed: int | None = None, return_sparse_X: bool = False, soma_chunk_size: int | None = 64, use_eager_fetch: bool = True, shuffle_chunk_count: int | None = 2000, encoders: list[cellxgene_census.experimental.ml.encoders.Encoder] | None = None) → None
Construct a new ExperimentDataPipe.

Deprecated: use TileDB-SOMA-ML instead.
- Parameters:
  - experiment – The tiledbsoma.Experiment from which to read data.
  - measurement_name – The name of the tiledbsoma.Measurement to read. Defaults to "RNA".
  - X_name – The name of the X layer to read. Defaults to "raw".
  - obs_query – The query used to filter along the obs axis. If not specified, all obs and X data will be returned, which can be very large.
  - var_query – The query used to filter along the var axis. If not specified, all var columns (genes/features) will be returned.
  - obs_column_names – The names of the obs columns to return. If custom encoders are passed, this parameter must not be used, since the columns will be inferred automatically from the encoders.
  - batch_size – The number of rows of obs and X data to return in each iteration. Defaults to 1. A value of 1 will result in torch.Tensors of rank 1 being returned (a single row); larger values will result in torch.Tensors of rank 2 (multiple rows).
  - shuffle – Whether to shuffle the obs and X data being returned. Defaults to True. For performance reasons, shuffling is not performed globally across all rows, but rather in chunks. More specifically, shuffle_chunk_count non-contiguous chunks are selected across all the observations in the query, the chunks are concatenated, and the associated observations are shuffled. The randomness of the shuffling is therefore determined by the (soma_chunk_size, shuffle_chunk_count) selection. The default values have been determined to yield a good trade-off between randomness and performance. Further tuning may be required for different types of models. Note that memory usage is correlated to the product soma_chunk_size * shuffle_chunk_count (see the sketch after this parameter list).
  - seed – The random seed used for shuffling. Defaults to None (no seed). This must be specified when using torch.nn.parallel.DistributedDataParallel to ensure data partitions are disjoint across worker processes.
  - return_sparse_X – Controls whether the X data is returned as a dense or sparse torch.Tensor. As X data is very sparse, setting this to True will reduce memory usage if the model supports use of sparse torch.Tensors. Defaults to False, since sparse torch.Tensors are still experimental in PyTorch.
  - soma_chunk_size – The number of obs/X rows to retrieve when reading data from SOMA. This impacts two aspects of this class's behavior: 1) the maximum memory utilization, with larger values providing better read performance but also requiring more memory; 2) the granularity of the global shuffling step (see the shuffle parameter for details). The default value of 64 works well in conjunction with the default shuffle_chunk_count value.
  - use_eager_fetch – Fetch the next SOMA chunk of obs and X data immediately after a previously fetched SOMA chunk is made available for processing via the iterator. This allows network (or filesystem) requests to be made in parallel with client-side processing of the SOMA data, potentially improving overall performance at the cost of doubling memory utilization. Defaults to True.
  - shuffle_chunk_count – The number of contiguous blocks (chunks) of rows sampled, then concatenated and shuffled. Larger numbers correspond to more randomness per training batch. If shuffle == False, this parameter is ignored. Defaults to 2000.
  - encoders – Custom encoders to use. If not specified, a LabelEncoder will be created and used for each column in obs_column_names. If specified, only columns for which an encoder has been registered will be returned in the obs tensor. Each encoder needs to have a unique name. If this parameter is specified, the obs_column_names parameter must not be used, since the columns will be inferred automatically from the encoders.
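As noted in the shuffle and seed descriptions above, the shuffle buffer holds roughly soma_chunk_size * shuffle_chunk_count rows in memory (128,000 rows with the defaults 64 * 2000), and a fixed seed is needed under torch.nn.parallel.DistributedDataParallel. The values below are illustrative assumptions, not recommendations:

>>> datapipe = ExperimentDataPipe(
...     experiment,
...     measurement_name="RNA",
...     X_name="raw",
...     obs_column_names=["cell_type"],
...     shuffle=True,
...     soma_chunk_size=32,        # illustrative: halves per-chunk memory vs. the default of 64
...     shuffle_chunk_count=1000,  # illustrative: 32 * 1000 = 32,000 rows buffered vs. 128,000 by default
...     seed=42,                   # any fixed value; required when training with DistributedDataParallel
... )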
Lifecycle
deprecated
Methods
- __init__(experiment[, measurement_name, ...]) – Construct a new ExperimentDataPipe.
- register_datapipe_as_function(function_name, ...)
- register_function(function_name, function)
- reset() – Reset the IterDataPipe to the initial state.
- set_getstate_hook(hook_fn)
- set_reduce_ex_hook(hook_fn)
- stats() – Get data loading stats for this cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe.

Attributes

- functions
- getstate_hook
- obs_encoders – Returns a dictionary of sklearn.preprocessing.LabelEncoder objects, keyed on obs column names, which were used to encode the obs column values.
- reduce_ex_hook
- repr_hook
- shape – Get the shape of the data that will be returned by this cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe.
- str_hook
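A brief, hedged illustration of the shape and stats accessors listed above, continuing the earlier sketch (the exact formatting of the stats output may vary by version):

>>> n_obs, n_vars = datapipe.shape  # rows (obs) and columns (var) the query will yield
>>> print(datapipe.stats())         # data-loading statistics accumulated during iteration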