hdmf.data_utils module

hdmf.data_utils.append_data(data, arg)

Add the element arg to the end of data.

hdmf.data_utils.extend_data(data, arg)

Add all the elements of the iterable arg to the end of data.

Parameters:
  • data (list, DataIO, numpy.ndarray, h5py.Dataset) – The array to extend

  • arg – The element (append_data) or iterable of elements (extend_data) to add to the end of data
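A minimal usage sketch on a plain Python list (the other supported container types behave analogously):

    from hdmf.data_utils import append_data, extend_data

    data = [1, 2, 3]
    data = append_data(data, 4)       # -> [1, 2, 3, 4]
    data = extend_data(data, [5, 6])  # -> [1, 2, 3, 4, 5, 6]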

class hdmf.data_utils.AbstractDataChunkIterator

Bases: object

Abstract iterator class used to iterate over DataChunks.

Derived classes must ensure that all abstract methods and abstract properties are implemented, in particular, dtype, maxshape, __iter__, __next__, recommended_chunk_shape, and recommended_data_shape.

Iterating over AbstractContainer objects is not yet supported.

abstract __iter__()

Return the iterator object

abstract __next__()

Return the next data chunk or raise a StopIteration exception if all chunks have been retrieved.

HINT: numpy.s_ provides a convenient way to generate index tuples using standard array slicing. This is often useful to define the DataChunk.selection of the current chunk

Returns:

DataChunk object with the data and selection of the current chunk

Return type:

DataChunk
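As an illustration of the hint above, numpy.s_ converts standard slice syntax into the index tuple expected by DataChunk.selection:

    import numpy as np

    # rows 0-9, all columns; equivalent to (slice(0, 10), slice(None))
    selection = np.s_[0:10, :]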

abstract recommended_chunk_shape()

Recommend the chunk shape for the data array.

Returns:

NumPy-style shape tuple describing the recommended shape for the chunks of the target array or None. This may or may not be the same as the shape of the chunks returned in the iteration process.

abstract recommended_data_shape()

Recommend the initial shape for the data array.

This is useful in particular to avoid repeated resizing of the target array when reading from this data iterator. This should typically be either the final size of the array or the known minimal shape of the array.

Returns:

NumPy-style shape tuple indicating the recommended initial shape for the target array. This may or may not be the final full shape of the array, i.e., the array is allowed to grow. This should not be None.

abstract property dtype

Define the data type of the array

Returns:

NumPy style dtype or otherwise compliant dtype string

abstract property maxshape

Property describing the maximum shape of the data array that is being iterated over

Returns:

NumPy-style shape tuple indicating the maximum dimensions up to which the dataset may be resized. Axes with None are unlimited.
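Putting these requirements together, a minimal concrete subclass could look like the following sketch (RowIterator is a hypothetical example, not part of hdmf); it iterates over the first dimension of an in-memory array one row at a time:

    import numpy as np
    from hdmf.data_utils import AbstractDataChunkIterator, DataChunk

    class RowIterator(AbstractDataChunkIterator):
        """Hypothetical iterator over the first dimension of an in-memory array."""

        def __init__(self, data):
            self._data = np.asarray(data)
            self._index = 0

        def __iter__(self):
            return self

        def __next__(self):
            if self._index >= self._data.shape[0]:
                raise StopIteration
            # select the current row; np.s_ builds the index tuple for us
            selection = np.s_[self._index:self._index + 1]
            self._index += 1
            return DataChunk(data=self._data[selection], selection=selection)

        def recommended_chunk_shape(self):
            return None  # no preferred chunk shape

        def recommended_data_shape(self):
            return self._data.shape  # the full shape is known here

        @property
        def dtype(self):
            return self._data.dtype

        @property
        def maxshape(self):
            return self._data.shape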

class hdmf.data_utils.GenericDataChunkIterator(buffer_gb=None, buffer_shape=None, chunk_mb=None, chunk_shape=None, display_progress=False, progress_bar_options=None)

Bases: AbstractDataChunkIterator

DataChunkIterator that lets the user specify chunk and buffer shapes.

Break a dataset into buffers containing multiple chunks to be written into an HDF5 dataset.

Basic users should set the buffer_gb argument to as much free RAM space as can be safely allocated. Advanced users are offered full control over the shape parameters for the buffer and the chunks; however, the chunk shape must perfectly divide the buffer shape along each axis.

HDF5 recommends chunk size in the range of 2 to 16 MB for optimal cloud performance. https://youtu.be/rcS5vt-mKok?t=621

Parameters:
  • buffer_gb (float or int) – If buffer_shape is not specified, it will be inferred as the smallest chunk below the buffer_gb threshold. Defaults to 1 GB.

  • buffer_shape (tuple) – Manually defined shape of the buffer.

  • chunk_mb (float or int) – If chunk_shape is not specified, it will be inferred as the smallest chunk below the chunk_mb threshold. Defaults to 10 MB.

  • chunk_shape (tuple) – Manually defined shape of the chunks.

  • display_progress (bool) – Display a progress bar with iteration rate and estimated completion time.

  • progress_bar_options (None) – Dictionary of keyword arguments to be passed directly to tqdm.

abstract _get_data(selection: Tuple[slice]) → ndarray

Retrieve the data specified by the selection using minimal I/O.

The developer of a new implementation of the GenericDataChunkIterator must ensure the data is actually loaded into memory, and not simply mapped.

Parameters:

selection (Tuple[slice]) – Tuple of slices, each indicating the selection indexed with respect to maxshape for that axis. Each element of the tuple is a slice of the full shape from which to pull data into the buffer.

Returns:

Array of data specified by selection

Return type:

numpy.ndarray

abstract _get_maxshape() → Tuple[int, ...]

Retrieve the maximum bounds of the data shape using minimal I/O.

abstract _get_dtype() → dtype

Retrieve the dtype of the data using minimal I/O.
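As a hedged sketch (ArrayChunkIterator is a hypothetical name, not part of hdmf), the three abstract methods can be implemented for an in-memory NumPy array as follows; a real implementation would typically wrap an on-disk or remote source:

    import numpy as np
    from hdmf.data_utils import GenericDataChunkIterator

    class ArrayChunkIterator(GenericDataChunkIterator):
        """Hypothetical iterator over an in-memory NumPy array."""

        def __init__(self, array, **kwargs):
            self._array = array
            super().__init__(**kwargs)  # must run after the data source is set

        def _get_data(self, selection):
            return np.copy(self._array[selection])  # copy to force a real load

        def _get_maxshape(self):
            return self._array.shape

        def _get_dtype(self):
            return self._array.dtype

    # e.g., ArrayChunkIterator(np.random.rand(1000, 384), buffer_gb=0.5)

Each iteration then yields a DataChunk covering one buffer, which itself spans one or more chunks.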

recommended_chunk_shape() → Tuple[int, ...]

Recommend the chunk shape for the data array.

Returns:

NumPy-style shape tuple describing the recommended shape for the chunks of the target array or None. This may or may not be the same as the shape of the chunks returned in the iteration process.

recommended_data_shape() → Tuple[int, ...]

Recommend the initial shape for the data array.

This is useful in particular to avoid repeated resizing of the target array when reading from this data iterator. This should typically be either the final size of the array or the known minimal shape of the array.

Returns:

NumPy-style shape tuple indicating the recommended initial shape for the target array. This may or may not be the final full shape of the array, i.e., the array is allowed to grow. This should not be None.

property maxshape: Tuple[int, ...]

Property describing the maximum shape of the data array that is being iterated over

Returns:

NumPy-style shape tuple indicating the maximum dimensions up to which the dataset may be resized. Axes with None are unlimited.

property dtype: dtype

Define the data type of the array

Returns:

NumPy style dtype or otherwise compliant dtype string

class hdmf.data_utils.DataChunkIterator(data=None, maxshape=None, dtype=None, buffer_size=1, iter_axis=0)

Bases: AbstractDataChunkIterator

Custom iterator class used to iterate over chunks of data.

This default implementation of AbstractDataChunkIterator accepts any iterable and assumes that we iterate over a single dimension of the data array (default: the first dimension). DataChunkIterator supports buffered read, i.e., multiple values from the input iterator can be combined into a single chunk. This is useful for buffered I/O operations, e.g., to improve performance by accumulating data in memory and writing larger blocks at once.

Note

DataChunkIterator assumes that the iterator that it wraps returns one element along the iteration dimension at a time. I.e., the iterator is expected to return chunks that are one dimension lower than the array itself. For example, when iterating over the first dimension of a dataset with shape (1000, 10, 10), the iterator would return 1000 chunks of shape (10, 10), one chunk at a time. If this pattern does not match your use case, then using GenericDataChunkIterator or AbstractDataChunkIterator may be more appropriate.
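For example, wrapping a hypothetical generator that yields (10, 10) slices of a (1000, 10, 10) array might look like:

    import numpy as np
    from hdmf.data_utils import DataChunkIterator

    def slices():  # yields one (10, 10) slice at a time
        for _ in range(1000):
            yield np.random.rand(10, 10)

    # the iterator stacks slices along the first (iteration) dimension;
    # buffer_size=10 combines 10 slices into each returned DataChunk
    dci = DataChunkIterator(data=slices(), maxshape=(1000, 10, 10), buffer_size=10)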

Initialize the DataChunkIterator.

If ‘data’ is an iterator and ‘dtype’ is not specified, then next is called on the iterator in order to determine the dtype of the data.

Parameters:
  • data (None) – The data object used for iteration

  • maxshape (tuple) – The maximum shape of the full data array. Use None to indicate unlimited dimensions

  • dtype (dtype) – The Numpy data type for the array

  • buffer_size (int) – Number of values to be buffered in a chunk

  • iter_axis (int) – The dimension to iterate over

classmethod from_iterable(data=None, maxshape=None, dtype=None, buffer_size=1, iter_axis=0)

Parameters:
  • data (None) – The data object used for iteration

  • maxshape (tuple) – The maximum shape of the full data array. Use None to indicate unlimited dimensions

  • dtype (dtype) – The Numpy data type for the array

  • buffer_size (int) – Number of values to be buffered in a chunk

  • iter_axis (int) – The dimension to iterate over

next()

Return the next data chunk or raise a StopIteration exception if all chunks have been retrieved.

Tip

numpy.s_ provides a convenient way to generate index tuples using standard array slicing. This is often useful to define the DataChunk.selection of the current chunk

Returns:

DataChunk object with the data and selection of the current chunk

Return type:

DataChunk

recommended_chunk_shape()

Recommend a chunk shape.

To optimize iterative write, the chunk should be aligned with the common shape of chunks returned by __next__ or, if those chunks are too large, a well-aligned subset of those chunks. This may also be any other value in case one wants to recommend chunk shapes to optimize read rather than write. The default implementation returns None, indicating no preferential chunking option.

recommended_data_shape()

Recommend an initial shape of the data. This is useful when progressively writing data and we want to recommend an initial size for the dataset.

property maxshape

Get a shape tuple describing the maximum shape of the array described by this DataChunkIterator.

Note

If an iterator is provided and no data has been read yet, then the first chunk will be read (i.e., next will be called on the iterator) in order to determine the maxshape. The iterator is expected to return single chunks along the iteration dimension; this means that maxshape will add an additional dimension along the iteration dimension. E.g., if we iterate over the first dimension and the iterator returns chunks of shape (10, 10), then the maxshape would be (None, 10, 10) or (len(self.data), 10, 10), depending on whether the size of the iteration dimension is known.

Returns:

Shape tuple. None is used for dimensions where the maximum shape is not known or unlimited.

property dtype

Get the value data type

Returns:

np.dtype object describing the datatype

class hdmf.data_utils.DataChunk(data=None, selection=None)

Bases: object

Class used to describe a data chunk. Used in DataChunkIterator.

Parameters:
  • data (ndarray) – Numpy array with the data value(s) of the chunk

  • selection (None) – Numpy index tuple describing the location of the chunk

astype(dtype)

Get a new DataChunk with the self.data converted to the given type

property dtype

Data type of the values in the chunk

Returns:

np.dtype of the values in the DataChunk

get_min_bounds()

Helper function to compute the minimum dataset size required to fit the selection of this chunk.

Raises:

TypeError – If the selection is not a single int, slice, or tuple of slices.

Returns:

Tuple with the minimum shape required to store the selection
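A short sketch of both construction and get_min_bounds (illustrative values):

    import numpy as np
    from hdmf.data_utils import DataChunk

    # a chunk holding rows 2-4 of a larger 2D array
    chunk = DataChunk(data=np.zeros((3, 10)), selection=np.s_[2:5, 0:10])
    print(chunk.get_min_bounds())  # (5, 10): smallest dataset that fits the selection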

hdmf.data_utils.assertEqualShape(data1, data2, axes1=None, axes2=None, name1=None, name2=None, ignore_undetermined=True)

Ensure that the shapes of data1 and data2 match along the given dimensions

Parameters:
  • data1 (List, Tuple, numpy.ndarray, DataChunkIterator) – The first input array

  • data2 (List, Tuple, numpy.ndarray, DataChunkIterator) – The second input array

  • name1 – Optional string with the name of data1

  • name2 – Optional string with the name of data2

  • axes1 (int, Tuple(int), List(int), None) – The dimensions of data1 that should be matched to the dimensions of data2. Set to None to compare all axes in order.

  • axes2 – The dimensions of data2 that should be matched to the dimensions of data1. Must have the same length as axes1. Set to None to compare all axes in order.

  • ignore_undetermined – Boolean indicating whether non-matching unlimited dimensions should be ignored, i.e., if two dimensions don't match because we can't determine the shape of either one, should we ignore that case or treat it as no match

Returns:

A ShapeValidatorResult with a Bool result indicating whether the check passed and a string message describing the matching process
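A brief usage sketch; the returned object is a ShapeValidatorResult (described below):

    import numpy as np
    from hdmf.data_utils import assertEqualShape

    a = np.zeros((10, 5, 2))
    b = np.zeros((10, 5))

    # compare axes 0 and 1 of a against axes 0 and 1 of b
    res = assertEqualShape(a, b, axes1=[0, 1], axes2=[0, 1], name1='a', name2='b')
    print(res.result, res.message)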

class hdmf.data_utils.ShapeValidatorResult(result=False, message=None, ignored=(), unmatched=(), error=None, shape1=(), shape2=(), axes1=(), axes2=())

Bases: object

Class for storing results from validating the shape of multi-dimensional arrays.

This class is used to store results generated by ShapeValidator

Variables:
  • result – Boolean indicating whether results matched or not

  • message – Message indicating the result of the matching procedure

Parameters:
  • result (bool) – Result of the shape validation

  • message (str) – Message describing the result of the shape validation

  • ignored (tuple) – Axes that have been ignored in the validation process

  • unmatched (tuple) – List of axes that did not match during shape validation

  • error (str) – Error that may have occurred. One of ERROR_TYPE

  • shape1 (tuple) – Shape of the first array for comparison

  • shape2 (tuple) – Shape of the second array for comparison

  • axes1 (tuple) – Axes for the first array that should match

  • axes2 (tuple) – Axes for the second array that should match

SHAPE_ERROR = {'AXIS_LEN_ERROR': 'Unequal length of axes.', 'AXIS_OUT_OF_BOUNDS': 'Axis index for comparison out of bounds.', 'NUM_AXES_ERROR': 'Unequal number of axes for comparison.', 'NUM_DIMS_ERROR': 'Unequal number of dimensions.', None: 'All required axes matched'}

Dict where the keys are the types of errors that may have occurred during shape comparison and the values are strings with default error messages for the type.

class hdmf.data_utils.DataIO(data=None, dtype=None, shape=None)

Bases: object

Base class for wrapping data arrays for I/O. Derived classes of DataIO are typically used to pass dataset-specific I/O parameters to the particular HDMFIO backend.

get_io_params()

Returns a dict with the I/O parameters specified in this DataIO.

property data

Get the wrapped data object

property dtype

Get the dtype of the wrapped data object

property shape

Get the shape of the wrapped data object

append(arg)

extend(arg)

__getitem__(item)

Delegate slicing to the data object

property valid

bool indicating if the data object is valid
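A minimal usage sketch; backend-specific subclasses (e.g., hdmf.backends.hdf5.H5DataIO) add options such as compression:

    import numpy as np
    from hdmf.data_utils import DataIO

    wrapped = DataIO(data=np.arange(10))
    print(wrapped[2:5])   # slicing is delegated to the wrapped array
    print(wrapped.valid)  # True while a data object is wrapped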

exception hdmf.data_utils.InvalidDataIOError

Bases: Exception