cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe
- class cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe(experiment: Experiment, measurement_name: str = 'raw', X_name: str = 'X', obs_query: AxisQuery | None = None, var_query: AxisQuery | None = None, obs_column_names: Sequence[str] = (), batch_size: int = 1, shuffle: bool = False, seed: int | None = None, return_sparse_X: bool = False, soma_chunk_size: int | None = None, use_eager_fetch: bool = True)
A torchdata.datapipes.iter.IterDataPipe that reads obs and X data from a tiledbsoma.Experiment, based upon the specified queries along the obs and var axes. Provides an iterator over these data when the object is passed to Python’s built-in iter function.
>>> for batch in iter(ExperimentDataPipe(...)):
...     X_batch, y_batch = batch
The batch_size parameter controls the number of rows of obs and X data that are returned in each iteration. If the batch_size is 1, then each Tensor will have rank 1:
>>> (tensor([0., 0., 0., 0., 0., 1., 0., 0., 0.]),  # X data
     tensor([2415, 0, 0], dtype=torch.int64))        # obs data, encoded
For larger batch_size values, the returned Tensors will have rank 2:
>>> DataLoader(..., batch_size=3, ...):
    (tensor([[0., 0., 0., 0., 0., 1., 0., 0., 0.],  # X batch
             [0., 0., 0., 0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
     tensor([[2415, 0, 0],                           # obs batch
             [2416, 0, 4],
             [2417, 0, 3]], dtype=torch.int64))
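The rank convention above can be illustrated without PyTorch. The following is a minimal plain-Python sketch, not the library's implementation: `iter_batches` and `rank` are hypothetical helpers, nested lists stand in for Tensors, and the handling of a final partial batch is an assumption made for the illustration.

```python
from typing import Iterator, List, Sequence, Union

Row = List[float]


def iter_batches(rows: Sequence[Row], batch_size: int) -> Iterator[Union[Row, List[Row]]]:
    """Yield rows using the shape convention described above (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(rows), batch_size):
        chunk = [list(r) for r in rows[start:start + batch_size]]
        # batch_size == 1: drop the leading dimension, giving a rank-1 result.
        # Otherwise keep the list of rows, giving a rank-2 result.
        yield chunk[0] if batch_size == 1 else chunk


def rank(value) -> int:
    """Nesting depth of a non-empty list, the list analogue of a tensor's rank."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0]
    return depth


# Made-up X rows, for illustration only.
X = [[0.0, 1.0, 0.0], [0.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
single = list(iter_batches(X, batch_size=1))
multi = list(iter_batches(X, batch_size=2))
print(rank(single[0]), rank(multi[0]))
```

Code that consumes batches should therefore check the rank (or always use a batch_size greater than 1) before indexing along a batch dimension.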
The return_sparse_X parameter controls whether the X data is returned as a dense or sparse torch.Tensor. If the model supports use of sparse torch.Tensors, this will reduce memory usage.
The obs_column_names parameter determines the data columns that are returned in the obs Tensor. The first element is always the soma_joinid of the obs pandas.DataFrame (or, equivalently, the soma_dim_0 of the X matrix). The remaining elements are the obs columns specified by obs_column_names, and string-typed columns are encoded as integer values. If needed, these values can be decoded by obtaining the encoder for a given obs column name and calling its inverse_transform method:
>>> exp_data_pipe.obs_encoders["<obs_attr_name>"].inverse_transform(encoded_values)
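To make the encode/decode round trip concrete, here is a small sketch that uses a plain-Python stand-in rather than the encoders held in obs_encoders. `SimpleLabelEncoder` is a hypothetical class written for this illustration; it assumes codes are assigned by position in the sorted list of unique labels, and the cell type values are made up.

```python
from typing import Dict, Iterable, List


class SimpleLabelEncoder:
    """Plain-Python stand-in for an integer label encoder (illustration only)."""

    def fit(self, values: Iterable[str]) -> "SimpleLabelEncoder":
        # Sorted unique labels; each label's code is its index in this list.
        self.classes_: List[str] = sorted(set(values))
        self._codes: Dict[str, int] = {c: i for i, c in enumerate(self.classes_)}
        return self

    def transform(self, values: Iterable[str]) -> List[int]:
        # String labels -> integer codes (what ends up in the obs Tensor).
        return [self._codes[v] for v in values]

    def inverse_transform(self, codes: Iterable[int]) -> List[str]:
        # Integer codes -> original string labels.
        return [self.classes_[c] for c in codes]


cell_types = ["T cell", "B cell", "neuron", "B cell"]  # made-up obs values
encoder = SimpleLabelEncoder().fit(cell_types)
encoded = encoder.transform(cell_types)
decoded = encoder.inverse_transform(encoded)
print(encoded)
print(decoded)
```

With the real data pipe, only the final step is needed: pass the integer column taken from the obs Tensor to the inverse_transform method of the matching encoder.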
Lifecycle
experimental
- __init__(experiment: Experiment, measurement_name: str = 'raw', X_name: str = 'X', obs_query: AxisQuery | None = None, var_query: AxisQuery | None = None, obs_column_names: Sequence[str] = (), batch_size: int = 1, shuffle: bool = False, seed: int | None = None, return_sparse_X: bool = False, soma_chunk_size: int | None = None, use_eager_fetch: bool = True) -> None
Construct a new ExperimentDataPipe.
- Parameters:
experiment – The tiledbsoma.Experiment from which to read data.
measurement_name – The name of the tiledbsoma.Measurement to read. Defaults to "raw".
X_name – The name of the X layer to read. Defaults to "X".
obs_query – The query used to filter along the obs axis. If not specified, all obs and X data will be returned, which can be very large.
var_query – The query used to filter along the var axis. If not specified, all var columns (genes/features) will be returned.
obs_column_names – The names of the obs columns to return. The soma_joinid index “column” does not need to be specified and will always be returned. If not specified, only the soma_joinid will be returned.
batch_size – The number of rows of obs and X data to return in each iteration. Defaults to 1. A value of 1 will result in torch.Tensors of rank 1 being returned (a single row); larger values will result in torch.Tensors of rank 2 (multiple rows).
shuffle – Whether to shuffle the obs and X data being returned. Defaults to False (no shuffling). For performance reasons, shuffling is performed in two steps: 1) a global shuffling, where contiguous rows are grouped into chunks and the order of the chunks is randomized, and then 2) a local shuffling, where the rows within each chunk are shuffled. Since this class must retrieve data in chunks (to keep memory requirements to a fixed size), global shuffling ensures that a given row in the shuffled result can originate from any position in the non-shuffled result ordering. If shuffling only occurred within each chunk (i.e. “local” shuffling), the first chunk’s rows would always be returned first, the second chunk’s rows would always be returned second, and so on. The chunk size is determined by the soma_chunk_size parameter. Note that rows within a chunk will maintain proximity, even after shuffling, so some experimentation may be required to ensure the shuffling is sufficient for the model training process. To this end, soma_chunk_size can be treated as a tunable hyperparameter.
seed – The random seed used for shuffling. Defaults to None (no seed). This must be specified when using torch.nn.parallel.DistributedDataParallel to ensure data partitions are disjoint across worker processes.
return_sparse_X – Controls whether the X data is returned as a dense or sparse torch.Tensor. As X data is very sparse, setting this to True will reduce memory usage, if the model supports use of sparse torch.Tensors. Defaults to False, since sparse torch.Tensors are still experimental in PyTorch.
soma_chunk_size – The number of obs/X rows to retrieve when reading data from SOMA. This impacts two aspects of this class’s behavior: 1) The maximum memory utilization, with larger values providing better read performance, but also requiring more memory; 2) The granularity of the global shuffling step (see the shuffle parameter for details). If not specified, the value is set to utilize ~1 GiB of RAM per SOMA chunk read, based upon the number of var columns (genes/features) being requested and assuming X data sparsity of 95%; the number of rows per chunk will depend on the number of var columns being read.
use_eager_fetch – Fetch the next SOMA chunk of obs and X data immediately after a previously fetched SOMA chunk is made available for processing via the iterator. This allows network (or filesystem) requests to be made in parallel with client-side processing of the SOMA data, potentially improving overall performance at the cost of doubling memory utilization. Defaults to True.
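The two-step shuffle described for the shuffle parameter can be sketched on row indices alone. This is a minimal illustration of the idea, not the library's code: `chunked_shuffle` is a hypothetical function, and it uses Python's random module rather than whatever random number generator the class uses internally.

```python
import random
from typing import List, Optional


def chunked_shuffle(n_rows: int, chunk_size: int, seed: Optional[int] = None) -> List[int]:
    """Return row indices in a two-step (global, then local) shuffled order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    rng = random.Random(seed)
    # Group contiguous rows into chunks of at most chunk_size rows.
    chunks = [
        list(range(start, min(start + chunk_size, n_rows)))
        for start in range(0, n_rows, chunk_size)
    ]
    # 1) Global shuffle: randomize the order in which chunks are visited.
    rng.shuffle(chunks)
    order: List[int] = []
    for chunk in chunks:
        # 2) Local shuffle: randomize the rows inside the current chunk.
        rng.shuffle(chunk)
        order.extend(chunk)
    return order


order = chunked_shuffle(n_rows=10, chunk_size=4, seed=0)
print(order)
```

In the output, rows that shared a chunk still sit next to each other, which is the proximity effect noted above. Under this scheme a smaller chunk size mixes rows more thoroughly at the cost of more, smaller reads.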
Lifecycle
experimental
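The default soma_chunk_size sizing described under Parameters (about 1 GiB per chunk at an assumed 95% sparsity) amounts to a simple back-of-the-envelope calculation. The sketch below is an assumption-laden illustration, not the library's formula: `estimate_chunk_rows` is a hypothetical function, and `bytes_per_nonzero=16` is a guess at the storage cost of one stored value plus its index.

```python
def estimate_chunk_rows(
    n_var: int,
    target_bytes: int = 1024**3,   # ~1 GiB budget per chunk, as stated in the docs
    sparsity: float = 0.95,        # fraction of X entries assumed to be zero
    bytes_per_nonzero: int = 16,   # ASSUMPTION: value plus index storage per entry
) -> int:
    """Estimate how many obs/X rows fit in the memory budget for one chunk."""
    if n_var < 1:
        raise ValueError("n_var must be >= 1")
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    nonzeros_per_row = n_var * (1.0 - sparsity)
    bytes_per_row = nonzeros_per_row * bytes_per_nonzero
    # Always read at least one row, even for very wide queries.
    return max(1, int(target_bytes // bytes_per_row))


# More var columns per row means fewer rows fit in the same budget.
for n_var in (2_000, 20_000, 60_000):
    print(n_var, estimate_chunk_rows(n_var))
```

This shows why the number of rows per chunk depends on the var query: narrowing var_query to fewer genes lets each chunk hold proportionally more rows for the same memory budget.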
Methods
__init__(experiment[, measurement_name, ...]) – Construct a new ExperimentDataPipe.
register_datapipe_as_function(function_name, ...)
register_function(function_name, function)
reset() – Reset the IterDataPipe to the initial state.
set_getstate_hook(hook_fn)
set_reduce_ex_hook(hook_fn)
stats() – Get data loading stats for this cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe.
Attributes
functions
getstate_hook
obs_encoders – Returns a dictionary of sklearn.preprocessing.LabelEncoder objects, keyed on obs column names, which were used to encode the obs column values.
reduce_ex_hook
repr_hook
shape – Get the shape of the data that will be returned by this cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe.
str_hook