hdmf.data_utils module
- hdmf.data_utils.append_data(data, arg)
Add the single element arg to the end of data.
- hdmf.data_utils.extend_data(data, arg)
Add all the elements of the iterable arg to the end of data.
- Parameters:
data (list, DataIO, numpy.ndarray, h5py.Dataset) – The array to extend
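The list case of these two helpers can be sketched as follows. This is a simplified illustration, not the actual hdmf implementation, which also dispatches on DataIO, numpy arrays, and h5py datasets:

```python
import numpy as np

def append_data(data, arg):
    # Simplified sketch: add a single element to the end of data
    if isinstance(data, list):
        data.append(arg)
        return data
    # numpy arrays cannot grow in place, so a new array is returned
    return np.append(data, np.expand_dims(arg, axis=0), axis=0)

def extend_data(data, arg):
    # Simplified sketch: add all elements of the iterable arg to the end of data
    if isinstance(data, list):
        data.extend(arg)
        return data
    return np.append(data, arg, axis=0)

lst = [1, 2]
append_data(lst, 3)        # lst is now [1, 2, 3]
extend_data(lst, [4, 5])   # lst is now [1, 2, 3, 4, 5]
```

Note the asymmetry: lists are mutated in place, while the numpy path returns a new array.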
- class hdmf.data_utils.AbstractDataChunkIterator
Bases:
object
Abstract iterator class used to iterate over DataChunks.
Derived classes must ensure that all abstract methods and abstract properties are implemented, in particular, dtype, maxshape, __iter__, __next__, recommended_chunk_shape, and recommended_data_shape.
Iterating over AbstractContainer objects is not yet supported.
- abstract __iter__()
Return the iterator object
- abstract __next__()
Return the next data chunk or raise a StopIteration exception if all chunks have been retrieved.
HINT: numpy.s_ provides a convenient way to generate index tuples using standard array slicing. This is often useful to define the DataChunk.selection of the current chunk.
- Returns:
DataChunk object with the data and selection of the current chunk
- Return type:
DataChunk
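The numpy.s_ helper mentioned in the hint builds such index tuples from ordinary slicing syntax:

```python
import numpy as np

# np.s_ turns slicing syntax into plain slice objects / tuples of slices
selection = np.s_[0:10, :]
# selection == (slice(0, 10, None), slice(None, None, None))

# The same tuple can then be used to place a chunk into a target array
target = np.zeros((100, 5))
chunk = np.ones((10, 5))
target[selection] = chunk  # writes the chunk into rows 0..9
```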
- abstract recommended_chunk_shape()
Recommend the chunk shape for the data array.
- Returns:
NumPy-style shape tuple describing the recommended shape for the chunks of the target array or None. This may or may not be the same as the shape of the chunks returned in the iteration process.
- abstract recommended_data_shape()
Recommend the initial shape for the data array.
This is useful in particular to avoid repeated resizing of the target array when reading from this data iterator. This should typically be either the final size of the array or the known minimal shape of the array.
- Returns:
NumPy-style shape tuple indicating the recommended initial shape for the target array. This may or may not be the final full shape of the array, i.e., the array is allowed to grow. This should not be None.
- abstract property dtype
Define the data type of the array
- Returns:
NumPy style dtype or otherwise compliant dtype string
- abstract property maxshape
Property describing the maximum shape of the data array that is being iterated over
- Returns:
NumPy-style shape tuple indicating the maximum dimensions up to which the dataset may be resized. Axes with None are unlimited.
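A minimal implementation of this interface might look as follows. It is sketched as a standalone class so the example runs without hdmf installed; a real subclass would derive from AbstractDataChunkIterator and return DataChunk objects rather than plain (data, selection) pairs:

```python
import numpy as np

class RowChunkIterator:
    """Yield one row of a 2-D array at a time, mimicking the
    AbstractDataChunkIterator interface (hypothetical stand-in)."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self._row = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._row >= self._data.shape[0]:
            raise StopIteration
        selection = np.s_[self._row, ...]   # where this chunk belongs
        chunk = (self._data[selection], selection)
        self._row += 1
        return chunk  # a real implementation would return a DataChunk

    def recommended_chunk_shape(self):
        return None                  # no preferred chunking

    def recommended_data_shape(self):
        return self._data.shape      # final shape is known up front

    @property
    def dtype(self):
        return self._data.dtype

    @property
    def maxshape(self):
        return self._data.shape      # no unlimited axes in this sketch

it = RowChunkIterator(np.arange(6).reshape(3, 2))
chunks = list(it)   # 3 chunks, one per row, each of shape (2,)
```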
- class hdmf.data_utils.GenericDataChunkIterator(buffer_gb=None, buffer_shape=None, chunk_mb=None, chunk_shape=None, display_progress=False, progress_bar_class=None, progress_bar_options=None)
Bases:
AbstractDataChunkIterator
DataChunkIterator that lets the user specify chunk and buffer shapes.
Break a dataset into buffers containing multiple chunks to be written into an HDF5 dataset.
Basic users should set the buffer_gb argument to as much free RAM space as can be safely allocated. Advanced users are offered full control over the shape parameters for the buffer and the chunks; however, the chunk shape must perfectly divide the buffer shape along each axis.
HDF5 recommends chunk size in the range of 2 to 16 MB for optimal cloud performance. https://youtu.be/rcS5vt-mKok?t=621
- Parameters:
buffer_gb (float or int) – If buffer_shape is not specified, it will be inferred as the smallest chunk below the buffer_gb threshold. Defaults to 1 GB.
buffer_shape (tuple) – Manually defined shape of the buffer.
chunk_mb (float or int) – If chunk_shape is not specified, it will be inferred as the smallest chunk below the chunk_mb threshold. Defaults to 10 MB.
chunk_shape (tuple) – Manually defined shape of the chunks.
display_progress (bool) – Display a progress bar with iteration rate and estimated completion time.
progress_bar_class (Callable) – The progress bar class to use. Defaults to tqdm.tqdm if the tqdm package is installed.
progress_bar_options (dict) – Dictionary of keyword arguments to be passed directly to tqdm.
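Concretely, a subclass only needs to supply _get_data and _get_maxshape. The two methods are sketched below on a plain class so the example runs without hdmf; a real implementation would inherit from GenericDataChunkIterator:

```python
import numpy as np
from typing import Tuple

class InMemoryIterator:  # hypothetical stand-in for a GenericDataChunkIterator subclass
    def __init__(self, array):
        self._array = np.asarray(array)

    def _get_maxshape(self) -> Tuple[int, ...]:
        # Full bounds of the dataset, retrieved with minimal I/O
        return self._array.shape

    def _get_data(self, selection) -> np.ndarray:
        # Return an in-memory copy, not a lazily mapped view
        return np.array(self._array[selection])

it = InMemoryIterator(np.arange(12).reshape(4, 3))
buf = it._get_data(np.s_[0:2, :])  # rows 0 and 1, loaded into memory
```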
- abstract _get_data(selection: Tuple[slice]) → ndarray
Retrieve the data specified by the selection using minimal I/O.
The developer of a new implementation of the GenericDataChunkIterator must ensure the data is actually loaded into memory, and not simply mapped.
- Parameters:
selection (Tuple[slice]) – tuple of slices, each indicating the selection indexed with respect to maxshape for that axis. Each axis of tuple is a slice of the full shape from which to pull data into the buffer.
- Returns:
Array of data specified by selection
- Return type:
numpy.ndarray
- abstract _get_maxshape() → Tuple[int, ...]
Retrieve the maximum bounds of the data shape using minimal I/O.
- recommended_chunk_shape() → Tuple[int, ...]
Recommend the chunk shape for the data array.
- Returns:
NumPy-style shape tuple describing the recommended shape for the chunks of the target array or None. This may or may not be the same as the shape of the chunks returned in the iteration process.
- recommended_data_shape() → Tuple[int, ...]
Recommend the initial shape for the data array.
This is useful in particular to avoid repeated resizing of the target array when reading from this data iterator. This should typically be either the final size of the array or the known minimal shape of the array.
- Returns:
NumPy-style shape tuple indicating the recommended initial shape for the target array. This may or may not be the final full shape of the array, i.e., the array is allowed to grow. This should not be None.
- class hdmf.data_utils.DataChunkIterator(data=None, maxshape=None, dtype=None, buffer_size=1, iter_axis=0)
Bases:
AbstractDataChunkIterator
Custom iterator class used to iterate over chunks of data.
This default implementation of AbstractDataChunkIterator accepts any iterable and assumes that we iterate over a single dimension of the data array (default: the first dimension). DataChunkIterator supports buffered read, i.e., multiple values from the input iterator can be combined to a single chunk. This is useful for buffered I/O operations, e.g., to improve performance by accumulating data in memory and writing larger blocks at once.
Note
DataChunkIterator assumes that the iterator that it wraps returns one element along the iteration dimension at a time. I.e., the iterator is expected to return chunks that are one dimension lower than the array itself. For example, when iterating over the first dimension of a dataset with shape (1000, 10, 10), then the iterator would return 1000 chunks of shape (10, 10) one-chunk-at-a-time. If this pattern does not match your use-case then using
GenericDataChunkIterator or AbstractDataChunkIterator may be more appropriate.
Initialize the DataChunkIterator.
If ‘data’ is an iterator and ‘dtype’ is not specified, then next is called on the iterator in order to determine the dtype of the data.
- Parameters:
data (None) – The data object used for iteration
maxshape (tuple) – The maximum shape of the full data array. Use None to indicate unlimited dimensions
dtype (dtype) – The NumPy data type for the array
buffer_size (int) – Number of values to be buffered in a chunk
iter_axis (int) – The dimension to iterate over
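The buffered-read behavior (combining buffer_size consecutive values from the wrapped iterator into one larger chunk) can be sketched without hdmf as a generator; function name and return format are illustrative only:

```python
import numpy as np

def buffered_chunks(iterable, buffer_size=1):
    """Combine up to buffer_size consecutive elements into one chunk,
    mimicking DataChunkIterator's buffered read along axis 0."""
    buffer, start = [], 0
    for element in iterable:
        buffer.append(element)
        if len(buffer) == buffer_size:
            yield np.asarray(buffer), np.s_[start:start + len(buffer)]
            start += len(buffer)
            buffer = []
    if buffer:  # flush the final, possibly smaller, chunk
        yield np.asarray(buffer), np.s_[start:start + len(buffer)]

rows = (np.full(3, i) for i in range(5))  # iterator over 5 rows of shape (3,)
chunks = list(buffered_chunks(rows, buffer_size=2))
# chunk shapes: (2, 3), (2, 3), (1, 3); selections: 0:2, 2:4, 4:5
```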
- classmethod from_iterable(data=None, maxshape=None, dtype=None, buffer_size=1, iter_axis=0)
- Parameters:
data (None) – The data object used for iteration
maxshape (tuple) – The maximum shape of the full data array. Use None to indicate unlimited dimensions
dtype (dtype) – The NumPy data type for the array
buffer_size (int) – Number of values to be buffered in a chunk
iter_axis (int) – The dimension to iterate over
- next()
Return the next data chunk or raise a StopIteration exception if all chunks have been retrieved.
Tip
numpy.s_ provides a convenient way to generate index tuples using standard array slicing. This is often useful to define the DataChunk.selection of the current chunk.
- Returns:
DataChunk object with the data and selection of the current chunk
- Return type:
DataChunk
- recommended_chunk_shape()
Recommend a chunk shape.
To optimize iterative write the chunk should be aligned with the common shape of chunks returned by __next__ or if those chunks are too large, then a well-aligned subset of those chunks. This may also be any other value in case one wants to recommend chunk shapes to optimize read rather than write. The default implementation returns None, indicating no preferential chunking option.
- recommended_data_shape()
Recommend an initial shape of the data. This is useful when progressively writing data and we want to recommend an initial size for the dataset.
- property maxshape
Get a shape tuple describing the maximum shape of the array described by this DataChunkIterator.
Note
If an iterator is provided and no data has been read yet, then the first chunk will be read (i.e., next will be called on the iterator) in order to determine the maxshape. The iterator is expected to return single chunks along the iteration dimension; this means that maxshape will add an additional dimension along the iteration dimension. E.g., if we iterate over the first dimension and the iterator returns chunks of shape (10, 10), then the maxshape would be (None, 10, 10) or (len(self.data), 10, 10), depending on whether the size of the iteration dimension is known.
- Returns:
Shape tuple. None is used for dimensions where the maximum shape is not known or unlimited.
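The inference described in the note above (prepend the iteration dimension to the shape of one chunk) can be sketched as a small helper; the function name is illustrative, not part of the hdmf API:

```python
import numpy as np

def infer_maxshape(first_chunk, num_chunks=None, iter_axis=0):
    """Prepend the iteration dimension to the shape of one chunk.
    None marks an unlimited / unknown length along the iteration axis."""
    shape = list(first_chunk.shape)
    shape.insert(iter_axis, num_chunks)
    return tuple(shape)

chunk = np.zeros((10, 10))
infer_maxshape(chunk)                    # (None, 10, 10): length unknown
infer_maxshape(chunk, num_chunks=1000)   # (1000, 10, 10): length known
```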
- property dtype
Get the value data type
- Returns:
np.dtype object describing the datatype
- class hdmf.data_utils.DataChunk(data=None, selection=None)
Bases:
object
Class used to describe a data chunk. Used in DataChunkIterator.
- Parameters:
data (ndarray) – NumPy array with the data value(s) of the chunk
selection (None) – NumPy index tuple describing the location of the chunk
- astype(dtype)
Get a new DataChunk with the self.data converted to the given type
- property dtype
Data type of the values in the chunk
- Returns:
np.dtype of the values in the DataChunk
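A DataChunk pairs an array of values with the index tuple locating where those values belong, so a writer can simply assign target[chunk.selection] = chunk.data. A hypothetical stand-in class illustrates the shape of the API:

```python
import numpy as np

class Chunk:  # hypothetical stand-in for hdmf.data_utils.DataChunk
    def __init__(self, data=None, selection=None):
        self.data = data            # values of the chunk
        self.selection = selection  # numpy index tuple locating the chunk

    @property
    def dtype(self):
        return self.data.dtype

    def astype(self, dtype):
        # Return a new chunk with the same selection and converted values
        return Chunk(self.data.astype(dtype), self.selection)

target = np.zeros((4, 3))
chunk = Chunk(data=np.ones((2, 3)), selection=np.s_[1:3, :])
target[chunk.selection] = chunk.data  # rows 1 and 2 are now all ones
```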
- hdmf.data_utils.assertEqualShape(data1, data2, axes1=None, axes2=None, name1=None, name2=None, ignore_undetermined=True)
Ensure that the shape of data1 and data2 match along the given dimensions
- Parameters:
data1 (List, Tuple, numpy.ndarray, DataChunkIterator) – The first input array
data2 (List, Tuple, numpy.ndarray, DataChunkIterator) – The second input array
name1 – Optional string with the name of data1
name2 – Optional string with the name of data2
axes1 (int, Tuple(int), List(int), None) – The dimensions of data1 that should be matched to the dimensions of data2. Set to None to compare all axes in order.
axes2 – The dimensions of data2 that should be matched to the dimensions of data1. Must have the same length as axes1. Set to None to compare all axes in order.
ignore_undetermined – Boolean indicating whether non-matching unlimited dimensions should be ignored, i.e., if two dimensions don't match because we can't determine the shape of either one, should that case be ignored or treated as a mismatch
- Returns:
Bool indicating whether the check passed and a string with a message about the matching process
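The core comparison can be sketched as follows. This is a simplification (it returns only a boolean, not the full ShapeValidatorResult): the given axes are compared pairwise, and None is treated as an undetermined length that may be ignored:

```python
def shapes_match(shape1, shape2, axes1=None, axes2=None, ignore_undetermined=True):
    """Simplified sketch of the axis-wise comparison done by assertEqualShape."""
    axes1 = range(len(shape1)) if axes1 is None else axes1
    axes2 = range(len(shape2)) if axes2 is None else axes2
    for a1, a2 in zip(axes1, axes2):
        d1, d2 = shape1[a1], shape2[a2]
        if d1 is None or d2 is None:   # undetermined (unlimited) axis
            if not ignore_undetermined:
                return False
            continue
        if d1 != d2:
            return False
    return True

shapes_match((1000, 10), (10, 10), axes1=[1], axes2=[0])  # True: 10 == 10
shapes_match((None, 10), (500, 10))                       # True: None ignored
```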
- class hdmf.data_utils.ShapeValidatorResult(result=False, message=None, ignored=(), unmatched=(), error=None, shape1=(), shape2=(), axes1=(), axes2=())
Bases:
object
Class for storing results from validating the shape of multi-dimensional arrays.
This class is used to store results generated by ShapeValidator
- Variables:
result – Boolean indicating whether results matched or not
message – Message indicating the result of the matching procedure
- Parameters:
result (bool) – Result of the shape validation
message (str) – Message describing the result of the shape validation
ignored (tuple) – Axes that have been ignored in the validation process
unmatched (tuple) – List of axes that did not match during shape validation
error (str) – Error that may have occurred. One of ERROR_TYPE
shape1 (tuple) – Shape of the first array for comparison
shape2 (tuple) – Shape of the second array for comparison
axes1 (tuple) – Axes for the first array that should match
axes2 (tuple) – Axes for the second array that should match
- SHAPE_ERROR = {'AXIS_LEN_ERROR': 'Unequal length of axes.', 'AXIS_OUT_OF_BOUNDS': 'Axis index for comparison out of bounds.', 'NUM_AXES_ERROR': 'Unequal number of axes for comparison.', 'NUM_DIMS_ERROR': 'Unequal number of dimensions.', None: 'All required axes matched'}
Dict where the keys are the types of errors that may have occurred during shape comparison and the values are strings with default error messages for each type.
- class hdmf.data_utils.DataIO(data=None, dtype=None, shape=None)
Bases:
object
Base class for wrapping data arrays for I/O. Derived classes of DataIO are typically used to pass dataset-specific I/O parameters to the particular HDMFIO backend.
- Parameters:
data (ndarray or list or tuple or Dataset or Array or StrDataset or HDMFDataset or AbstractDataChunkIterator) – the data to be written
dtype (type or dtype) – the data type of the dataset. Not used if data is specified.
shape (tuple) – the shape of the dataset. Not used if data is specified.
- get_io_params()
Returns a dict with the I/O parameters specified in this DataIO.
- property data
Get the wrapped data object
- property dtype
Get the data type of the wrapped data object
- property shape
Get the shape of the wrapped data object
- append(arg)
Append the element arg to the end of the wrapped data
- extend(arg)
Add all the elements of the iterable arg to the end of the wrapped data
- __getitem__(item)
Delegate slicing to the data object
- property valid
Bool indicating whether the wrapped data object is valid
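The delegation behavior of DataIO can be sketched with a hypothetical minimal wrapper (class and attribute names are illustrative, not the hdmf implementation):

```python
import numpy as np

class ArrayIO:  # hypothetical stand-in for a DataIO-style wrapper
    def __init__(self, data=None, dtype=None, shape=None):
        self._data = data
        self._dtype = dtype   # only used when no data is wrapped
        self._shape = shape   # only used when no data is wrapped

    @property
    def data(self):
        return self._data

    @property
    def dtype(self):
        return self._data.dtype if self._data is not None else self._dtype

    @property
    def shape(self):
        return self._data.shape if self._data is not None else self._shape

    def __getitem__(self, item):
        return self._data[item]   # delegate slicing to the wrapped data

    @property
    def valid(self):
        return self._data is not None

wrapped = ArrayIO(data=np.arange(6).reshape(2, 3))
wrapped[0, 1]  # slicing is forwarded to the underlying array
```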