.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/plot_generic_data_chunk_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_plot_generic_data_chunk_tutorial.py:


.. _genericdci-tutorial:

GenericDataChunkIterator Tutorial
==================================

This is a tutorial for interacting with :py:class:`~hdmf.data_utils.GenericDataChunkIterator` objects.
It is written for beginners and does not describe the full capabilities and nuances of the functionality.
It is designed to give you basic familiarity with how :py:class:`~hdmf.data_utils.GenericDataChunkIterator`
works and to help you get started with creating a specific instance for your data format or API access pattern.

Introduction
------------
The :py:class:`~hdmf.data_utils.GenericDataChunkIterator` class represents a semi-abstract
version of a :py:class:`~hdmf.data_utils.AbstractDataChunkIterator` that automatically handles the
selection of buffer regions and resolves communication of compatible chunk regions within an
H5DataIO wrapper. It does not, however, know how data (values) or metadata (data type, full shape)
ought to be directly accessed. This is intentional: the class stays fully agnostic to a range of
indexing methods and format-independent APIs, rather than making strong assumptions about how data
ranges are to be sliced.

Constructing a simple child class
---------------------------------
We will begin with a simple example case of data access to a standard NumPy array.
To create a :py:class:`~hdmf.data_utils.GenericDataChunkIterator` that accomplishes this,
we begin by defining our child class, which supplies the three methods the base class cannot
provide on its own: ``_get_data``, ``_get_maxshape``, and ``_get_dtype``.

.. GENERATED FROM PYTHON SOURCE LINES 30-60

..
code-block:: Python

    import numpy as np
    from hdmf.data_utils import GenericDataChunkIterator


    class NumpyArrayDataChunkIterator(GenericDataChunkIterator):
        def __init__(self, array: np.ndarray, **kwargs):
            # Store the data source before the parent constructor runs
            self.array = array
            super().__init__(**kwargs)

        def _get_data(self, selection):
            # Return the values for the requested selection
            return self.array[selection]

        def _get_maxshape(self):
            # Full shape of the data source
            return self.array.shape

        def _get_dtype(self):
            # Data type of the data source
            return self.array.dtype


    # To instantiate this class on an array to allow iteration over buffer_shapes,
    my_array = np.random.randint(low=0, high=10, size=(12, 6), dtype="int16")
    my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array)

    # and this iterator now behaves as a standard Python generator (i.e., it can only be exhausted once)
    # that returns DataChunk objects for each buffer.
    for buffer in my_custom_iterator:
        print(buffer.data)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [[7 9 2 7 3 7]
     [3 6 0 7 2 6]
     [9 5 1 9 9 3]
     [4 6 0 9 4 4]
     [5 7 5 3 6 9]
     [2 2 9 3 3 6]
     [1 8 9 8 7 0]
     [4 0 4 3 3 5]
     [9 3 4 8 3 6]
     [3 1 6 7 1 8]
     [9 4 4 4 5 6]
     [0 3 6 9 1 5]]


.. GENERATED FROM PYTHON SOURCE LINES 62-67

Intended use for advanced data I/O
----------------------------------
The real use case for this class arises when the amount of data stored on a hard drive is larger
than what can be read into RAM. The goal is then to read, at each iteration, only an amount of data
whose size in gigabytes (GB) is at or below the ``buffer_gb`` argument (defaults to 1 GB).

.. GENERATED FROM PYTHON SOURCE LINES 67-75

.. code-block:: Python

    # This design can be seen if we increase the amount of data in our example code
    my_array = np.random.randint(low=0, high=10, size=(20000, 5000), dtype="int32")
    my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array, buffer_gb=0.2)

    for j, buffer in enumerate(my_custom_iterator, start=1):
        print(f"Buffer number {j} returns data from selection: {buffer.selection}")


.. rst-class:: sphx-glr-script-out

..
code-block:: none

    Buffer number 1 returns data from selection: (slice(0, 12640, None), slice(0, 3160, None))
    Buffer number 2 returns data from selection: (slice(0, 12640, None), slice(3160, 5000, None))
    Buffer number 3 returns data from selection: (slice(12640, 20000, None), slice(0, 3160, None))
    Buffer number 4 returns data from selection: (slice(12640, 20000, None), slice(3160, 5000, None))


.. GENERATED FROM PYTHON SOURCE LINES 76-80

.. note::
  Technically, in this example the total data is still fully loaded into RAM from the initial NumPy array.
  A more accurate use case would be achieved by writing ``my_array`` to a temporary file on your system
  and loading it back with ``np.memmap``, a subclass of NumPy arrays that does not immediately load the data.

.. GENERATED FROM PYTHON SOURCE LINES 82-86

Writing to an HDF5 file with full control of shape arguments
------------------------------------------------------------
The true intention of returning data selections of this form, and within a DataChunk object,
is to write these piecewise to an HDF5 dataset.

.. GENERATED FROM PYTHON SOURCE LINES 86-104

.. code-block:: Python

    # This is where the importance of the underlying `chunk_shape` comes in, and why it is critical to performance
    # that it perfectly subsets the `buffer_shape`.
    import h5py

    maxshape = (20000, 5000)
    buffer_shape = (10000, 2500)
    chunk_shape = (1000, 250)

    my_array = np.random.randint(low=0, high=10, size=maxshape, dtype="int32")
    my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array, buffer_shape=buffer_shape, chunk_shape=chunk_shape)
    out_file = "my_temporary_test_file.hdf5"
    with h5py.File(name=out_file, mode="w") as f:
        dset = f.create_dataset(name="test", shape=maxshape, dtype="int16", chunks=my_custom_iterator.chunk_shape)
        for buffer in my_custom_iterator:
            dset[buffer.selection] = buffer.data
    # Remember to remove the temporary file after running this and exploring the contents!


.. GENERATED FROM PYTHON SOURCE LINES 105-110

..
note::
  Here we explicitly set the ``chunks`` value in the HDF5 dataset object; however, a nice part of the design
  of this iterator is that when it is wrapped in a ``hdmf.backends.hdf5.h5_utils.H5DataIO`` that is called
  within a ``hdmf.backends.hdf5.h5tools.HDF5IO`` context with a corresponding ``hdmf.container.Container``,
  these details will be automatically parsed.

.. GENERATED FROM PYTHON SOURCE LINES 112-131

.. note::
  There is some overlap here in nomenclature between HDMF and HDF5. The term *chunk* in both
  HDMF and HDF5 refers to a subset of a dataset; however, in HDF5 a chunk is a piece of a dataset on disk,
  whereas in the context of :py:class:`~hdmf.data_utils.DataChunk` iteration it is a block of data in memory.
  As such, the requirements on the shape and size of chunks are different. In HDF5 these chunks are pieces
  of a dataset that get compressed and cached together, and they should usually be small in size for
  optimal performance (typically 1 MB or less). In contrast, a :py:class:`~hdmf.data_utils.DataChunk` in
  HDMF acts as a block of data for writing to a dataset, and spans multiple HDF5 chunks to improve performance.
  To avoid repeated updates to the same chunk in the HDF5 file,
  :py:class:`~hdmf.data_utils.DataChunk` objects for write should align with the chunks in the HDF5 file, i.e.,
  the ``DataChunk.selection`` should fully cover one or more chunks in the HDF5 file.
  This is what the ``buffer`` of the :py:class:`~hdmf.data_utils.GenericDataChunkIterator` does: upon
  each iteration it returns a single :py:class:`~hdmf.data_utils.DataChunk` object (by default up to 1 GB) that
  perfectly spans many HDF5 chunks (by default < 1 MB) to help reduce the number of small I/O operations
  and help improve performance. In practice, the ``buffer`` should usually be even larger than the default, i.e.,
  as much free RAM as can be safely used.

..
GENERATED FROM PYTHON SOURCE LINES 133-134

Remove the test file

.. GENERATED FROM PYTHON SOURCE LINES 134-137

.. code-block:: Python

    import os

    if os.path.exists(out_file):
        os.remove(out_file)


.. _sphx_glr_download_tutorials_plot_generic_data_chunk_tutorial.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_generic_data_chunk_tutorial.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_generic_data_chunk_tutorial.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_generic_data_chunk_tutorial.zip `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
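
The note on ``np.memmap`` in the section on advanced data I/O can be sketched as follows. This is a
minimal illustration that uses only NumPy and is not part of the generated tutorial: the file name
``my_temporary_memmap.dat``, the reduced array shape, and the example ``selection`` are assumptions
chosen to keep the example small. It writes an array to a temporary file, reopens it lazily, and reads
a single buffer-like region into memory.

.. code-block:: Python

    import os
    import tempfile

    import numpy as np

    # Hypothetical, deliberately small shape; a real use case would exceed available RAM
    shape = (2000, 500)
    dtype = "int32"
    source = np.random.randint(low=0, high=10, size=shape, dtype=dtype)

    # Hypothetical temporary location for the raw binary file
    tmp_dir = tempfile.mkdtemp()
    memmap_path = os.path.join(tmp_dir, "my_temporary_memmap.dat")

    # Write the array to disk through a writable memory map
    writer = np.memmap(memmap_path, dtype=dtype, mode="w+", shape=shape)
    writer[:] = source
    writer.flush()
    del writer

    # Reopen read-only; values are only read from disk when they are indexed
    lazy_array = np.memmap(memmap_path, dtype=dtype, mode="r", shape=shape)

    # Read one region, using the same tuple-of-slices form as ``buffer.selection``
    selection = (slice(0, 1000), slice(0, 250))
    buffer_data = np.array(lazy_array[selection])  # copy the region into RAM
    print(buffer_data.shape)

    # Release the memory map, then remove the temporary file and directory
    del lazy_array
    os.remove(memmap_path)
    os.rmdir(tmp_dir)

Because ``np.memmap`` is a subclass of ``np.ndarray``, an object like ``lazy_array`` should be usable as the
``array`` argument of the ``NumpyArrayDataChunkIterator`` defined earlier, provided it is kept open while
the iterator is in use.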