GenericDataChunkIterator Tutorial

This tutorial introduces GenericDataChunkIterator objects. It is written for beginners and does not cover the full capabilities and nuances of the class. It is designed to give you basic familiarity with how GenericDataChunkIterator works and to help you get started with creating a specific instance for your data format or API access pattern.

Introduction

The GenericDataChunkIterator class is a semi-abstract version of an AbstractDataChunkIterator that automatically handles the selection of buffer regions and resolves communication of compatible chunk regions within an H5DataIO wrapper. It does not, however, know how data (values) or metadata (data type, full shape) ought to be accessed directly. This is intentional: the class stays fully agnostic to a range of indexing methods and format-independent APIs, rather than making strong assumptions about how data ranges are to be sliced.

Constructing a simple child class

We will begin with a simple example: data access to a standard NumPy array. To create a GenericDataChunkIterator that accomplishes this, we begin by defining our child class.

import numpy as np

from hdmf.data_utils import GenericDataChunkIterator


class NumpyArrayDataChunkIterator(GenericDataChunkIterator):
    def __init__(self, array: np.ndarray, **kwargs):
        self.array = array
        super().__init__(**kwargs)

    def _get_data(self, selection):
        return self.array[selection]

    def _get_maxshape(self):
        return self.array.shape

    def _get_dtype(self):
        return self.array.dtype


# To instantiate this class on an array to allow iteration over buffer_shapes,
my_array = np.random.randint(low=0, high=10, size=(12, 6), dtype="int16")
my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array)

# and this iterator now behaves as a standard Python generator (i.e., it can only be exhausted once)
# that returns DataChunk objects for each buffer.
for buffer in my_custom_iterator:
    print(buffer.data)

[[6 4 8 4 8 0]
 [2 0 2 7 0 0]
 [4 3 5 2 8 8]
 [7 0 5 8 1 5]
 [9 3 1 8 4 9]
 [0 4 8 1 3 6]
 [5 3 6 8 2 3]
 [3 4 2 6 3 8]
 [9 0 7 3 2 0]
 [0 0 3 7 9 9]
 [3 4 6 2 0 4]
 [1 5 5 9 7 9]]

Intended use for advanced data I/O

The real use case for this class is when the amount of data stored on disk is larger than what can be read into RAM. The goal is therefore to read, at each iteration, only an amount of data at or below the buffer_gb argument (given in gigabytes; defaults to 1 GB).

# This design can be seen if we increase the amount of data in our example code
my_array = np.random.randint(low=0, high=10, size=(20000, 5000), dtype="int32")
my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array, buffer_gb=0.2)

for j, buffer in enumerate(my_custom_iterator, start=1):
    print(f"Buffer number {j} returns data from selection: {buffer.selection}")

Buffer number 1 returns data from selection: (slice(0, 12640, None), slice(0, 3160, None))
Buffer number 2 returns data from selection: (slice(0, 12640, None), slice(3160, 5000, None))
Buffer number 3 returns data from selection: (slice(12640, 20000, None), slice(0, 3160, None))
Buffer number 4 returns data from selection: (slice(12640, 20000, None), slice(3160, 5000, None))
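As a rough check of the output above, the size of each buffer can be computed directly from the printed selections. This is only a sketch: it assumes 1 GB is counted as 1e9 bytes, and the helper name selection_shape is hypothetical rather than part of HDMF.

```python
import numpy as np

buffer_gb = 0.2
maxshape = (20000, 5000)
itemsize = np.dtype("int32").itemsize  # 4 bytes per value

# Selections copied from the printed output above.
selections = [
    (slice(0, 12640), slice(0, 3160)),
    (slice(0, 12640), slice(3160, 5000)),
    (slice(12640, 20000), slice(0, 3160)),
    (slice(12640, 20000), slice(3160, 5000)),
]


def selection_shape(selection):
    """Shape of the block addressed by a tuple of slices (hypothetical helper)."""
    return tuple(axis_slice.stop - axis_slice.start for axis_slice in selection)


# Number of values in each buffer, and the corresponding size in GB (assuming 1 GB = 1e9 bytes).
element_counts = [int(np.prod(selection_shape(selection))) for selection in selections]
sizes_gb = [count * itemsize / 1e9 for count in element_counts]

for j, size_gb in enumerate(sizes_gb, start=1):
    print(f"Buffer number {j} holds {size_gb:.4f} GB (limit: {buffer_gb} GB)")

# Together, the four buffers cover every value of the full array exactly once.
total_elements = sum(element_counts)
print(total_elements == int(np.prod(maxshape)))
```

Every buffer stays at or below the requested buffer_gb, and the largest one is the first, since the remaining buffers are clipped at the edges of the array.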

Note

Technically, in this example the total data is still fully loaded into RAM from the initial NumPy array. A more accurate use case would be achieved by writing my_array to a temporary file on your system and loading it back with np.memmap, which is a subtype of NumPy array that does not immediately load the data.
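The idea in this note can be sketched with NumPy alone. The example below is a minimal illustration under stated assumptions: the file name my_temporary_array.dat and the small array shape are arbitrary choices made here, not part of the original tutorial.

```python
import os
import tempfile

import numpy as np

shape = (1200, 600)
dtype = "int16"
my_array = np.random.randint(low=0, high=10, size=shape, dtype=dtype)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this illustration.
    file_path = os.path.join(tmpdir, "my_temporary_array.dat")

    # Write the raw array bytes to disk.
    my_array.tofile(file_path)

    # Re-open the file lazily; values are read from disk only when indexed.
    lazy_array = np.memmap(file_path, dtype=dtype, mode="r", shape=shape)

    # Indexing with a tuple of slices reads just that region;
    # np.array(...) copies it into an ordinary in-memory array.
    selection = (slice(0, 100), slice(0, 50))
    block = np.array(lazy_array[selection])
    matches = np.array_equal(block, my_array[selection])

    # Release the memory map before the temporary directory is removed.
    del lazy_array

print(block.shape, matches)
```

Because np.memmap is a subclass of np.ndarray, an object like lazy_array could be passed as the array argument of the NumpyArrayDataChunkIterator defined earlier, so that each buffer is read from disk only when it is requested.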

Writing to an HDF5 file with full control of shape arguments

The true intention of returning data selections of this form, and within a DataChunk object, is to write these piecewise to an HDF5 dataset.

# This is where the underlying `chunk_shape` becomes important: it is critical to performance
# that it evenly divides the `buffer_shape` along every axis.
import h5py

maxshape = (20000, 5000)
buffer_shape = (10000, 2500)
chunk_shape = (1000, 250)

my_array = np.random.randint(low=0, high=10, size=maxshape, dtype="int32")
my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array, buffer_shape=buffer_shape, chunk_shape=chunk_shape)
out_file = "my_temporary_test_file.hdf5"
with h5py.File(name=out_file, mode="w") as f:
    dset = f.create_dataset(name="test", shape=maxshape, dtype=my_array.dtype, chunks=my_custom_iterator.chunk_shape)
    for buffer in my_custom_iterator:
        dset[buffer.selection] = buffer.data
# Remember to remove the temporary file after running this and exploring the contents!

Note

Here we explicitly set the chunks value in the HDF5 dataset object; however, a nice part of the design of this iterator is that these details are parsed automatically when it is wrapped in an hdmf.backends.hdf5.h5_utils.H5DataIO that is called within an hdmf.backends.hdf5.h5tools.HDF5IO context with a corresponding hdmf.container.Container.

Note

The nomenclature of HDMF and HDF5 overlaps here. In both, the term chunk refers to a subset of a dataset; however, in HDF5 a chunk is a piece of the dataset on disk, whereas in the context of DataChunk iteration it is a block of data in memory. As such, the requirements on the shape and size of chunks differ. In HDF5, chunks are pieces of a dataset that get compressed and cached together, and they should usually be small for optimal performance (typically 1 MB or less). In contrast, a DataChunk in HDMF acts as a block of data for writing to a dataset, and spans multiple HDF5 chunks to improve performance. To avoid repeated updates to the same chunk in the HDF5 file, DataChunk objects used for writing should align with the HDF5 chunks, i.e., the DataChunk.selection should fully cover one or more chunks in the HDF5 file. This is what the buffer of the GenericDataChunkIterator does: each iteration returns a single DataChunk object (by default up to 1 GB) that perfectly spans many HDF5 chunks (by default < 1 MB), which reduces the number of small I/O operations and improves performance. In practice, the buffer should usually be even larger than the default, i.e., as much free RAM as can be safely used.
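The alignment requirement can be illustrated with a few lines of plain Python, using the shapes from the HDF5 example above. This is a sketch, not part of HDMF: the helper name selection_aligns_with_chunks is hypothetical, and it assumes selections are tuples of slices with explicit start and stop values.

```python
import math

maxshape = (20000, 5000)
buffer_shape = (10000, 2500)
chunk_shape = (1000, 250)
itemsize = 4  # bytes per int32 value


def selection_aligns_with_chunks(selection, chunk_shape, maxshape):
    """Return True if the selection covers only whole chunks (hypothetical helper).

    Along each axis the selection must start on a chunk boundary and stop
    either on a chunk boundary or at the end of the dataset.
    """
    for axis_slice, chunk_length, axis_length in zip(selection, chunk_shape, maxshape):
        starts_on_boundary = axis_slice.start % chunk_length == 0
        stops_on_boundary = axis_slice.stop % chunk_length == 0 or axis_slice.stop == axis_length
        if not (starts_on_boundary and stops_on_boundary):
            return False
    return True


# How many HDF5 chunks one full buffer spans, and the size of each in MB (1 MB = 1e6 bytes).
chunks_per_buffer = math.prod(b // c for b, c in zip(buffer_shape, chunk_shape))
chunk_mb = math.prod(chunk_shape) * itemsize / 1e6
buffer_mb = math.prod(buffer_shape) * itemsize / 1e6

# A selection of the buffer shape used above, and a deliberately misaligned one.
aligned = selection_aligns_with_chunks((slice(0, 10000), slice(2500, 5000)), chunk_shape, maxshape)
misaligned = selection_aligns_with_chunks((slice(0, 10500), slice(0, 2500)), chunk_shape, maxshape)

print(chunks_per_buffer, chunk_mb, buffer_mb, aligned, misaligned)
```

With these shapes, each buffer covers a 10 by 10 grid of chunks, so every chunk in the file is written exactly once. The misaligned selection would end halfway through a row of chunks, which is the situation that forces a later write to update the same chunks again.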

Remove the test file

import os
if os.path.exists(out_file):
    os.remove(out_file)
