hdmf.data_utils module

hdmf.data_utils.append_data(data, arg)

Add the element arg to the end of data.

hdmf.data_utils.extend_data(data, arg)

Add all the elements of the iterable arg to the end of data.

Parameters:
  • data (list, DataIO, numpy.ndarray, h5py.Dataset) – The array to extend

  • arg – The element (append_data) or iterable of elements (extend_data) to add to the end of data
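A minimal usage sketch on a plain Python list (the other supported container types behave analogously):

    from hdmf.data_utils import append_data, extend_data

    data = [1, 2, 3]
    data = append_data(data, 4)       # -> [1, 2, 3, 4]
    data = extend_data(data, [5, 6])  # -> [1, 2, 3, 4, 5, 6]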

class hdmf.data_utils.AbstractDataChunkIterator

Bases: object

Abstract iterator class used to iterate over DataChunks.

Derived classes must ensure that all abstract methods and abstract properties are implemented, in particular, dtype, maxshape, __iter__, __next__, recommended_chunk_shape, and recommended_data_shape.

Iterating over AbstractContainer objects is not yet supported.

abstract __iter__()

Return the iterator object

abstract __next__()

Return the next data chunk or raise a StopIteration exception if all chunks have been retrieved.

HINT: numpy.s_ provides a convenient way to generate index tuples using standard array slicing. This is often useful to define the DataChunk.selection of the current chunk

Returns:

DataChunk object with the data and selection of the current chunk

Return type:

DataChunk
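As an illustration of the hint above, numpy.s_ converts standard slice syntax into the index tuple expected by DataChunk.selection:

    import numpy as np

    # rows 0-9, all columns; equivalent to (slice(0, 10), slice(None))
    selection = np.s_[0:10, :]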

abstract recommended_chunk_shape()

Recommend the chunk shape for the data array.

Returns:

NumPy-style shape tuple describing the recommended shape for the chunks of the target array or None. This may or may not be the same as the shape of the chunks returned in the iteration process.

abstract recommended_data_shape()

Recommend the initial shape for the data array.

This is useful in particular to avoid repeated resizing of the target array when reading from this data iterator. This should typically be either the final size of the array or the known minimal shape of the array.

Returns:

NumPy-style shape tuple indicating the recommended initial shape for the target array. This may or may not be the final full shape of the array, i.e., the array is allowed to grow. This should not be None.

abstract property dtype

Define the data type of the array

Returns:

NumPy style dtype or otherwise compliant dtype string

abstract property maxshape

Property describing the maximum shape of the data array that is being iterated over

Returns:

NumPy-style shape tuple indicating the maximum dimensions up to which the dataset may be resized. Axes with None are unlimited.
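Putting these requirements together, a minimal concrete subclass could look like the following sketch (RowIterator is a hypothetical example, not part of hdmf); it iterates over the first dimension of an in-memory array one row at a time:

    import numpy as np
    from hdmf.data_utils import AbstractDataChunkIterator, DataChunk

    class RowIterator(AbstractDataChunkIterator):
        """Hypothetical iterator over the first dimension of an in-memory array."""

        def __init__(self, data):
            self._data = np.asarray(data)
            self._index = 0

        def __iter__(self):
            return self

        def __next__(self):
            if self._index >= self._data.shape[0]:
                raise StopIteration
            # select the current row; np.s_ builds the index tuple for us
            selection = np.s_[self._index:self._index + 1]
            self._index += 1
            return DataChunk(data=self._data[selection], selection=selection)

        def recommended_chunk_shape(self):
            return None  # no preferred chunk shape

        def recommended_data_shape(self):
            return self._data.shape  # the full shape is known here

        @property
        def dtype(self):
            return self._data.dtype

        @property
        def maxshape(self):
            return self._data.shape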

class hdmf.data_utils.GenericDataChunkIterator(buffer_gb=None, buffer_shape=None, chunk_mb=None, chunk_shape=None, display_progress=False, progress_bar_options=None)

Bases: AbstractDataChunkIterator

DataChunkIterator that lets the user specify chunk and buffer shapes.

Break a dataset into buffers containing multiple chunks to be written into an HDF5 dataset.

Basic users should set the buffer_gb argument to as much free RAM space as can be safely allocated. Advanced users are offered full control over the shape parameters for the buffer and the chunks; however, the chunk shape must perfectly divide the buffer shape along each axis.

HDF5 recommends chunk size in the range of 2 to 16 MB for optimal cloud performance. https://youtu.be/rcS5vt-mKok?t=621

Parameters:
  • buffer_gb (float or int) – If buffer_shape is not specified, it will be inferred as the smallest chunk below the buffer_gb threshold. Defaults to 1 GB.

  • buffer_shape (tuple) – Manually defined shape of the buffer.

  • chunk_mb (float or int) – If chunk_shape is not specified, it will be inferred as the smallest chunk below the chunk_mb threshold. Defaults to 10 MB.

  • chunk_shape (tuple) – Manually defined shape of the chunks.

  • display_progress (bool) – Display a progress bar with iteration rate and estimated completion time.

  • progress_bar_options (None) – Dictionary of keyword arguments to be passed directly to tqdm.

abstract _get_data(selection: Tuple[slice]) → ndarray

Retrieve the data specified by the selection using minimal I/O.

The developer of a new implementation of the GenericDataChunkIterator must ensure the data is actually loaded into memory, and not simply mapped.

Parameters:

selection (Tuple[slice]) – Tuple of slices, each indicating the selection indexed with respect to maxshape for that axis. Each element of the tuple is a slice of the full shape from which to pull data into the buffer.

Returns:

Array of data specified by selection

Return type:

numpy.ndarray

abstract _get_maxshape() → Tuple[int, ...]

Retrieve the maximum bounds of the data shape using minimal I/O.

abstract _get_dtype() → dtype

Retrieve the dtype of the data using minimal I/O.
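As a hedged sketch (ArrayChunkIterator is a hypothetical name, not part of hdmf), the three abstract methods can be implemented for an in-memory NumPy array as follows; a real implementation would typically wrap an on-disk or remote source:

    import numpy as np
    from hdmf.data_utils import GenericDataChunkIterator

    class ArrayChunkIterator(GenericDataChunkIterator):
        """Hypothetical iterator over an in-memory NumPy array."""

        def __init__(self, array, **kwargs):
            self._array = array
            super().__init__(**kwargs)  # must run after the data source is set

        def _get_data(self, selection):
            return np.copy(self._array[selection])  # copy to force a real load

        def _get_maxshape(self):
            return self._array.shape

        def _get_dtype(self):
            return self._array.dtype

    # e.g., ArrayChunkIterator(np.random.rand(1000, 384), buffer_gb=0.5)

Each iteration then yields a DataChunk covering one buffer, which itself spans one or more chunks.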

recommended_chunk_shape() → Tuple[int, ...]

Recommend the chunk shape for the data array.

Returns:

NumPy-style shape tuple describing the recommended shape for the chunks of the target array or None. This may or may not be the same as the shape of the chunks returned in the iteration process.

recommended_data_shape() → Tuple[int, ...]

Recommend the initial shape for the data array.

This is useful in particular to avoid repeated resizing of the target array when reading from this data iterator. This should typically be either the final size of the array or the known minimal shape of the array.

Returns:

NumPy-style shape tuple indicating the recommended initial shape for the target array. This may or may not be the final full shape of the array, i.e., the array is allowed to grow. This should not be None.

property maxshape: Tuple[int, ...]

Property describing the maximum shape of the data array that is being iterated over

Returns:

NumPy-style shape tuple indicating the maximum dimensions up to which the dataset may be resized. Axes with None are unlimited.

property dtype: dtype

Define the data type of the array

Returns:

NumPy style dtype or otherwise compliant dtype string

class hdmf.data_utils.DataChunkIterator(data=None, maxshape=None, dtype=None, buffer_size=1, iter_axis=0)

Bases: AbstractDataChunkIterator

Custom iterator class used to iterate over chunks of data.

This default implementation of AbstractDataChunkIterator accepts any iterable and assumes that we iterate over a single dimension of the data array (default: the first dimension). DataChunkIterator supports buffered read, i.e., multiple values from the input iterator can be combined into a single chunk. This is useful for buffered I/O operations, e.g., to improve performance by accumulating data in memory and writing larger blocks at once.

Note

DataChunkIterator assumes that the iterator that it wraps returns one element along the iteration dimension at a time. I.e., the iterator is expected to return chunks that are one dimension lower than the array itself. For example, when iterating over the first dimension of a dataset with shape (1000, 10, 10), the iterator would return 1000 chunks of shape (10, 10), one chunk at a time. If this pattern does not match your use case, then using GenericDataChunkIterator or AbstractDataChunkIterator may be more appropriate.
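For example, wrapping a hypothetical generator that yields (10, 10) slices of a (1000, 10, 10) array might look like:

    import numpy as np
    from hdmf.data_utils import DataChunkIterator

    def slices():  # yields one (10, 10) slice at a time
        for _ in range(1000):
            yield np.random.rand(10, 10)

    # the iterator stacks slices along the first (iteration) dimension;
    # buffer_size=10 combines 10 slices into each returned DataChunk
    dci = DataChunkIterator(data=slices(), maxshape=(1000, 10, 10), buffer_size=10)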

Initialize the DataChunkIterator.

If ‘data’ is an iterator and ‘dtype’ is not specified, then next is called on the iterator in order to determine the dtype of the data.

Parameters:
  • data (None) – The data object used for iteration

  • maxshape (tuple) – The maximum shape of the full data array. Use None to indicate unlimited dimensions

  • dtype (dtype) – The Numpy data type for the array

  • buffer_size (int) – Number of values to be buffered in a chunk

  • iter_axis (int) – The dimension to iterate over

classmethod from_iterable(data=None, maxshape=None, dtype=None, buffer_size=1, iter_axis=0)

Parameters:
  • data (None) – The data object used for iteration

  • maxshape (tuple) – The maximum shape of the full data array. Use None to indicate unlimited dimensions

  • dtype (dtype) – The Numpy data type for the array

  • buffer_size (int) – Number of values to be buffered in a chunk

  • iter_axis (int) – The dimension to iterate over

next()

Return the next data chunk or raise a StopIteration exception if all chunks have been retrieved.

Tip

numpy.s_ provides a convenient way to generate index tuples using standard array slicing. This is often useful to define the DataChunk.selection of the current chunk

Returns:

DataChunk object with the data and selection of the current chunk

Return type:

DataChunk

recommended_chunk_shape()

Recommend a chunk shape.

To optimize iterative write, the chunk should be aligned with the common shape of chunks returned by __next__ or, if those chunks are too large, a well-aligned subset of those chunks. This may also be any other value in case one wants to recommend chunk shapes to optimize read rather than write. The default implementation returns None, indicating no preferential chunking option.

recommended_data_shape()

Recommend an initial shape of the data. This is useful when progressively writing data and we want to recommend an initial size for the dataset.

property maxshape

Get a shape tuple describing the maximum shape of the array described by this DataChunkIterator.

Note

If an iterator is provided and no data has been read yet, then the first chunk will be read (i.e., next will be called on the iterator) in order to determine the maxshape. The iterator is expected to return single chunks along the iteration dimension; this means that maxshape will add an additional dimension along the iteration dimension. E.g., if we iterate over the first dimension and the iterator returns chunks of shape (10, 10), then the maxshape would be (None, 10, 10) or (len(self.data), 10, 10), depending on whether the size of the iteration dimension is known.

Returns:

Shape tuple. None is used for dimensions where the maximum shape is not known or unlimited.

property dtype

Get the value data type

Returns:

np.dtype object describing the datatype

class hdmf.data_utils.DataChunk(data=None, selection=None)

Bases: object

Class used to describe a data chunk. Used in DataChunkIterator.

Parameters:
  • data (ndarray) – Numpy array with the data value(s) of the chunk

  • selection (None) – Numpy index tuple describing the location of the chunk

astype(dtype)

Get a new DataChunk with the self.data converted to the given type

property dtype

Data type of the values in the chunk

Returns:

np.dtype of the values in the DataChunk

get_min_bounds()

Helper function to compute the minimum dataset size required to fit the selection of this chunk.

Raises:

TypeError – If the selection is not a single int, slice, or tuple of slices.

Returns:

Tuple with the minimum shape required to store the selection
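A short sketch of both construction and get_min_bounds (illustrative values):

    import numpy as np
    from hdmf.data_utils import DataChunk

    # a chunk holding rows 2-4 of a larger 2D array
    chunk = DataChunk(data=np.zeros((3, 10)), selection=np.s_[2:5, 0:10])
    print(chunk.get_min_bounds())  # (5, 10): smallest dataset that fits the selection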

hdmf.data_utils.assertEqualShape(data1, data2, axes1=None, axes2=None, name1=None, name2=None, ignore_undetermined=True)

Ensure that the shapes of data1 and data2 match along the given dimensions

Parameters:
  • data1 (List, Tuple, numpy.ndarray, DataChunkIterator) – The first input array

  • data2 (List, Tuple, numpy.ndarray, DataChunkIterator) – The second input array

  • name1 – Optional string with the name of data1

  • name2 – Optional string with the name of data2

  • axes1 (int, Tuple(int), List(int), None) – The dimensions of data1 that should be matched to the dimensions of data2. Set to None to compare all axes in order.

  • axes2 – The dimensions of data2 that should be matched to the dimensions of data1. Must have the same length as axes1. Set to None to compare all axes in order.

  • ignore_undetermined – Boolean indicating whether non-matching unlimited dimensions should be ignored, i.e., if two dimensions don't match because we can't determine the shape of either one, should we ignore that case or treat it as no match

Returns:

A ShapeValidatorResult with a Bool result indicating whether the check passed and a string message describing the matching process
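A brief usage sketch; the returned object is a ShapeValidatorResult (described below):

    import numpy as np
    from hdmf.data_utils import assertEqualShape

    a = np.zeros((10, 5, 2))
    b = np.zeros((10, 5))

    # compare axes 0 and 1 of a against axes 0 and 1 of b
    res = assertEqualShape(a, b, axes1=[0, 1], axes2=[0, 1], name1='a', name2='b')
    print(res.result, res.message)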

class hdmf.data_utils.ShapeValidatorResult(result=False, message=None, ignored=(), unmatched=(), error=None, shape1=(), shape2=(), axes1=(), axes2=())

Bases: object

Class for storing results from validating the shape of multi-dimensional arrays.

This class is used to store results generated by ShapeValidator

Variables:
  • result – Boolean indicating whether results matched or not

  • message – Message indicating the result of the matching procedure

Parameters:
  • result (bool) – Result of the shape validation

  • message (str) – Message describing the result of the shape validation

  • ignored (tuple) – Axes that have been ignored in the validation process

  • unmatched (tuple) – List of axes that did not match during shape validation

  • error (str) – Error that may have occurred. One of ERROR_TYPE

  • shape1 (tuple) – Shape of the first array for comparison

  • shape2 (tuple) – Shape of the second array for comparison

  • axes1 (tuple) – Axes for the first array that should match

  • axes2 (tuple) – Axes for the second array that should match

SHAPE_ERROR = {'AXIS_LEN_ERROR': 'Unequal length of axes.', 'AXIS_OUT_OF_BOUNDS': 'Axis index for comparison out of bounds.', 'NUM_AXES_ERROR': 'Unequal number of axes for comparison.', 'NUM_DIMS_ERROR': 'Unequal number of dimensions.', None: 'All required axes matched'}

Dict where the keys are the types of errors that may have occurred during shape comparison and the values are strings with default error messages for the type.

class hdmf.data_utils.DataIO(data=None, dtype=None, shape=None)

Bases: object

Base class for wrapping data arrays for I/O. Derived classes of DataIO are typically used to pass dataset-specific I/O parameters to the particular HDMFIO backend.

get_io_params()

Returns a dict with the I/O parameters specified in this DataIO.

property data

Get the wrapped data object

property dtype

Get the dtype of the wrapped data object

property shape

Get the shape of the wrapped data object

append(arg)

extend(arg)

__getitem__(item)

Delegate slicing to the data object

property valid

bool indicating if the data object is valid
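A minimal usage sketch; backend-specific subclasses (e.g., hdmf.backends.hdf5.H5DataIO) add options such as compression:

    import numpy as np
    from hdmf.data_utils import DataIO

    wrapped = DataIO(data=np.arange(10))
    print(wrapped[2:5])   # slicing is delegated to the wrapped array
    print(wrapped.valid)  # True while a data object is wrapped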

exception hdmf.data_utils.InvalidDataIOError

Bases: Exception