HERD: HDMF External Resources Data Structure

This is a user guide to interacting with the HERD class.

Introduction

The HERD class provides a way to organize and map user terms from their data (keys) to multiple entities from the external resources. A typical use case for external resources is to link data stored in datasets or attributes to ontologies. For example, you may have a dataset country storing locations. Using HERD allows us to link the country names stored in the dataset to an ontology of all countries, enabling more rigid standardization of the data and facilitating data query and introspection.

From a user’s perspective, one can think of HERD as a simple table in which each row associates a particular key stored in a particular object (i.e., an Attribute or Dataset in a file) with a particular entity (i.e., a term of an online resource). That is, (object, key) refers to parts inside a file, entity refers to an external resource outside the file, and HERD allows us to link the two. To reduce data redundancy and improve data integrity, HERD stores this data internally in a collection of interlinked tables:

KeyTable where each row describes a Key
FileTable where each row describes a File
EntityTable where each row describes an Entity
EntityKeyTable where each row describes an EntityKey
ObjectTable where each row describes an Object
ObjectKeyTable where each row describes an ObjectKey pair identifying which keys are used by which objects.

The HERD class then provides convenience functions to simplify interaction with these tables, allowing users to treat HERD as a single large table as much as possible.
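
To make the "single large table" view concrete, here is a minimal sketch in plain Python. It is not the HDMF implementation: the tables are ordinary lists of dicts, only three of the six tables are modeled, and the `flatten` helper is hypothetical. The link-table column names (`entities_idx`, `keys_idx`) follow the output shown later in this guide.

```python
# Toy model of normalized, interlinked tables (illustrative only; these are
# plain Python lists, not HDMF table objects).
keys = [{"key": "Homo sapiens"}, {"key": "Mus musculus"}]
entities = [
    {"entity_id": "NCBITaxon:9606"},
    {"entity_id": "NCBITaxon:10090"},
]
# Each row links one entity (by row index) to one key (by row index).
entity_keys = [
    {"entities_idx": 0, "keys_idx": 0},
    {"entities_idx": 1, "keys_idx": 1},
]


def flatten(keys, entities, entity_keys):
    """Join the link table against the key and entity tables (hypothetical helper)."""
    rows = []
    for link in entity_keys:
        row = {}
        row.update(keys[link["keys_idx"]])
        row.update(entities[link["entities_idx"]])
        rows.append(row)
    return rows


flat = flatten(keys, entities, entity_keys)
for row in flat:
    print(row["key"], "->", row["entity_id"])
```

Storing each key and entity once and linking them by index is what avoids repeating the same strings in every row; the flat view is rebuilt on demand.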

Rules to HERD

When using the HERD class, there are rules to how users store information in the interlinked tables.

1. Multiple Key objects can have the same name. They are disambiguated by the Object associated with each, meaning we may have keys with the same name in different objects, but for a particular object all keys must be unique.
2. In order to query specific records, the HERD class uses ‘(file, object_id, relative_path, field, key)’ as the unique identifier.
3. An Object can have multiple Key objects.
4. Multiple Object objects can use the same Key.
5. Do not use the private methods to add to the KeyTable, FileTable, EntityTable, ObjectTable, ObjectKeyTable, or EntityKeyTable individually.
6. URIs are optional, but highly recommended. If not known, an empty string may be used.
7. An entity ID should be the unique string identifying the entity in the given resource. This may or may not include a string representing the resource and a colon. Use the format provided by the resource. For example, Identifiers.org uses the ID ncbigene:22353 but the NCBI Gene uses the ID 22353 for the same term.
8. In the majority of cases, Object objects will have an empty string for ‘field’. The HERD class supports compound data_types. In that case, ‘field’ would be the field of the compound data_type that has an external reference.
9. In some cases, the attribute that needs an external reference is not an object with a ‘data_type’. The user must then use the nearest object that has a data type as the parent object. When adding an external resource for an object with a data type, users should not provide an attribute. When adding an external resource for an attribute of an object, users need to provide the name of the attribute.
10. The user must provide a File or an Object that has a File along the parent hierarchy.
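
The interplay between the unique identifier and the "same key name in different objects" rule can be illustrated with a small sketch. Everything here is hypothetical and uses only the standard library: `RecordId` mirrors the identifier tuple named above, and `ToyKeyStore` is a stand-in, not a HERD class.

```python
from typing import NamedTuple


class RecordId(NamedTuple):
    """Hypothetical tuple mirroring (file, object_id, relative_path, field, key)."""
    file: str
    object_id: str
    relative_path: str
    field: str
    key: str


class ToyKeyStore:
    """Toy store: one record per identifier, each holding a list of entity IDs."""

    def __init__(self):
        self.records = {}

    def add_ref(self, record_id, entity_id):
        # The same identifier always maps to the same record; a new
        # identifier creates a new record.
        self.records.setdefault(record_id, []).append(entity_id)


store = ToyKeyStore()
in_object_a = RecordId("file-1", "obj-a", "", "", "Homo sapiens")
in_object_b = RecordId("file-1", "obj-b", "", "", "Homo sapiens")

# Same key name in two different objects: two distinct records.
store.add_ref(in_object_a, "NCBITaxon:9606")
store.add_ref(in_object_b, "NCBITaxon:9606")

# Same identifier again: no new record, the key simply gains another entity.
# "example:human" is a made-up entity ID used only for this illustration.
store.add_ref(in_object_a, "example:human")

print(len(store.records))          # number of distinct key records
print(store.records[in_object_a])  # entities linked to the key in obj-a
```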

Creating an instance of the HERD class

from hdmf.common import HERD
from hdmf.common import DynamicTable, VectorData
from hdmf.term_set import TermSet
from hdmf import Container, HERDManager
from hdmf import Data
import numpy as np
import os

try:
    import linkml_runtime  # noqa: F401
except ImportError as e:
    raise ImportError("Please install linkml-runtime to run this example: pip install linkml-runtime") from e

try:
    dir_path = os.path.dirname(os.path.abspath(__file__))
    yaml_file = os.path.join(dir_path, 'example_term_set.yaml')
except NameError:
    dir_path = os.path.dirname(os.path.abspath('.'))
    yaml_file = os.path.join(dir_path, 'gallery/example_term_set.yaml')


# Class to represent a file
class HERDManagerContainer(Container, HERDManager):

    __fields__ = (
        {'name': 'external_resources', 'child': True, 'required_name': 'external_resources'},
    )

    def __init__(self, **kwargs):
        kwargs['name'] = 'HERDManagerContainer'
        super().__init__(**kwargs)


herd = HERD()
file = HERDManagerContainer(name='file')

Using the add_ref method

add_ref is a wrapper function provided by the HERD class that simplifies adding data. Using add_ref allows us to treat new entries similar to adding a new row to a flat table, with add_ref taking care of populating the underlying data structures accordingly.

data = Data(name="species", data=['Homo sapiens', 'Mus musculus'])
data.parent = file
herd.add_ref(
    container=data,
    key='Homo sapiens',
    entity_id='NCBITaxon:9606',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606'
)

herd.add_ref(
    container=data,
    key='Mus musculus',
    entity_id='NCBITaxon:10090',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10090'
)

Using the add_ref method with an attribute

It is important to keep in mind that when adding an Object to the ObjectTable, the parent object identified by Object.object_id must be the closest parent to the target object (i.e., Object.relative_path must be the shortest possible path and as such cannot contain any objects with a data_type and associated object_id).

A common example would be with the DynamicTable class, which holds VectorData objects as columns. If we wanted to add an external reference on a column from a DynamicTable, then we would use the column as the object and not the DynamicTable (Refer to rule 9).

genotypes = DynamicTable(name='genotypes', description='My genotypes')
genotypes.add_column(name='genotype_name', description="Name of genotypes")
genotypes.add_row(id=0, genotype_name='Rorb')
genotypes.parent = file
herd.add_ref(
    container=genotypes,
    attribute='genotype_name',
    key='Rorb',
    entity_id='MGI:1346434',
    entity_uri='http://www.informatics.jax.org/marker/MGI:1343464'
)

# Note: add_ref internally resolves the object to the closest parent, so that
# herd.add_ref(container=genotypes, attribute='genotype_name') and
# herd.add_ref(container=genotypes.genotype_name, attribute=None) will ultimately both use the object_id
# of the genotypes.genotype_name VectorData column and not the object_id of the genotypes table.
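
The closest-parent rule can be sketched with toy classes. This is a simplified illustration under stated assumptions, not how HDMF is implemented: `ToyContainer` and `resolve` are hypothetical, and the relative path is recorded here simply as the attribute name.

```python
import uuid


class ToyContainer:
    """Stand-in for an object that has a data_type and its own object_id."""

    def __init__(self, name, **attributes):
        self.name = name
        self.object_id = str(uuid.uuid4())
        for attr_name, value in attributes.items():
            setattr(self, attr_name, value)


def resolve(container, attribute=None):
    """Return (object_id, relative_path) for the closest typed parent."""
    if attribute is None:
        return container.object_id, ""
    value = getattr(container, attribute)
    if isinstance(value, ToyContainer):
        # The attribute is itself a typed object, so it becomes the object.
        return value.object_id, ""
    # A plain attribute has no object_id: keep the parent and record the path.
    return container.object_id, attribute


column = ToyContainer("genotype_name")
table = ToyContainer("genotypes", genotype_name=column, description="My genotypes")

via_table = resolve(table, "genotype_name")  # resolved through the table
via_column = resolve(column)                 # the column passed directly
plain = resolve(table, "description")        # attribute without an object_id
```

Both `via_table` and `via_column` end up pointing at the column, while `plain` stays on the table and carries a non-empty relative path.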

How add_ref resolves the file

A reference can only be added to a container that has already been added to a file (or, more accurately, to a Container that is an hdmf.container.HERDManager, which in most practical cases is the file). add_ref automatically resolves the file by walking up the container’s parent hierarchy to find the enclosing HERDManager (the file). If the container is not yet in a file, add_ref raises an informative error. In the example below the column is reachable from the file because its parent table has been added to the file.

col1 = VectorData(
    name='Species_Data',
    description='species from NCBI and Ensemble',
    data=['Homo sapiens', 'Ursus arctos horribilis'],
)

# Create a DynamicTable with this column and set the table parent to the file object created earlier
species = DynamicTable(name='species', description='My species', columns=[col1])
species.parent = file

herd.add_ref(
    container=species,
    attribute='Species_Data',
    key='Ursus arctos horribilis',
    entity_id='NCBITaxon:116960',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id'
)
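
The parent-hierarchy walk described above can be sketched as follows. This is a toy version using only the standard library: `ToyNode`, its `is_manager` flag, and `find_manager` are hypothetical names, and the exception type raised for an orphaned container is an assumption made for this sketch.

```python
class ToyNode:
    """Minimal stand-in for a container with a parent pointer."""

    def __init__(self, name, parent=None, is_manager=False):
        self.name = name
        self.parent = parent
        self.is_manager = is_manager  # True for the file-like root


def find_manager(node):
    """Walk up the parent chain until a manager (the file) is found."""
    current = node
    while current is not None:
        if current.is_manager:
            return current
        current = current.parent
    raise ValueError(f"'{node.name}' is not part of a file; set its parent first")


toy_file = ToyNode("file", is_manager=True)
toy_table = ToyNode("species", parent=toy_file)
toy_column = ToyNode("Species_Data", parent=toy_table)
orphan = ToyNode("unattached")

found = find_manager(toy_column)  # column -> table -> file

try:
    find_manager(orphan)
    orphan_rejected = False
except ValueError:
    orphan_rejected = True
```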

Visualize HERD

Users can visualize HERD as a flattened table or as separate tables.

# HERD as a flattened table
herd.to_dataframe()

# The individual interlinked tables:
herd.files.to_dataframe()
herd.objects.to_dataframe()
herd.entities.to_dataframe()
herd.keys.to_dataframe()
herd.object_keys.to_dataframe()
herd.entity_keys.to_dataframe()

	entities_idx	keys_idx
0	0	0
1	1	1
2	2	2
3	3	3

Using the get_key method

The get_key method will return a Key object. In the current version of HERD, duplicate keys are allowed; however, each key needs a unique linking Object. In other words, each combination of (file, container, relative_path, field, key) can exist only once in HERD.

# The get_key method can return the Key object directly when the key name is unique.
genotype_key_object = herd.get_key(key_name='Rorb')

# If the Key object has a duplicate name, then the user will need
# to provide the unique (file, container, relative_path, field, key) combination.
species_key_object = herd.get_key(file=file,
                                container=species['Species_Data'],
                                key_name='Ursus arctos horribilis')

# If the file is not provided, get_key will also check the Object for a File along the
# parent hierarchy, the same way add_ref always resolves the file.
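
Why a name alone is sometimes not enough can be seen in a small lookup sketch. The `find_key` function and the record layout are hypothetical and much simpler than HERD; the point is only that a duplicated key name needs extra context to resolve to one record.

```python
# Toy key records: the same key name appears in two different objects.
records = [
    {"object_id": "obj-a", "key": "Rorb"},
    {"object_id": "obj-b", "key": "Homo sapiens"},
    {"object_id": "obj-c", "key": "Homo sapiens"},
]


def find_key(records, key_name, object_id=None):
    """Return the single record matching key_name, optionally narrowed by object."""
    matches = [
        record for record in records
        if record["key"] == key_name
        and (object_id is None or record["object_id"] == object_id)
    ]
    if not matches:
        raise KeyError(key_name)
    if len(matches) > 1:
        raise ValueError(f"'{key_name}' is ambiguous; also provide the object")
    return matches[0]


unique = find_key(records, "Rorb")  # the name alone is enough

try:
    find_key(records, "Homo sapiens")  # two objects share this name
    ambiguous = False
except ValueError:
    ambiguous = True

resolved = find_key(records, "Homo sapiens", object_id="obj-c")
```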

Using the add_ref method with a key_object

Multiple Object objects can use the same Key. To use an existing key when adding new entries into HERD, pass the Key object instead of the ‘key_name’ to the add_ref method. If a ‘key_name’ is used, a new Key will be created.

herd.add_ref(
    container=genotypes,
    attribute='genotype_name',
    key=genotype_key_object,
    entity_id='ENSEMBL:ENSG00000198963',
    entity_uri='https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000198963'
)

Using the get_object_entities method

The get_object_entities method allows the user to retrieve all entities and key information associated with an Object in the form of a pandas DataFrame.

herd.get_object_entities(file=file,
                       container=genotypes['genotype_name'],
                       relative_path='')

	entity_id	entity_uri
0	MGI:1346434	http://www.informatics.jax.org/marker/MGI:1343464
1	ENSEMBL:ENSG00000198963	https://uswest.ensembl.org/Homo_sapiens/Gene/S...

Using the get_object_type method

The get_object_type method allows the user to retrieve all entities and key information associated with Objects of a given type (here, ‘Data’) in the form of a pandas DataFrame.

herd.get_object_type(object_type='Data')

	file_object_id	objects_idx	object_id	files_idx	object_type	relative_path	field	keys_idx	key	entities_idx	entity_id	entity_uri
0	b78acac8-c005-4c21-a068-c76936b55baa	0	1da12a4a-c301-46d2-8838-11cd033776f1	0	Data			0	Homo sapiens	0	NCBITaxon:9606	https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
1	b78acac8-c005-4c21-a068-c76936b55baa	0	1da12a4a-c301-46d2-8838-11cd033776f1	0	Data			1	Mus musculus	1	NCBITaxon:10090	https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...

Special Case: Using add_ref with compound data

In most cases, the field is left as an empty string, but if the dataset or attribute is a compound data_type, then we can use the ‘field’ value to differentiate the different columns of the dataset. For example, if a dataset has a compound data_type with columns/fields ‘x’, ‘y’, and ‘z’, and each column/field is associated with different ontologies, then use field=’x’ to denote that ‘x’ is using the external reference.

# Let's create a new instance of HERD.
herd = HERD()

data = Data(
    name='data_name',
    data=np.array(
        [('Mus musculus', 9, 81.0), ('Homo sapiens', 3, 27.0)],
        dtype=[('species', 'U14'), ('age', 'i4'), ('weight', 'f4')]
    )
)
data.parent = file

herd.add_ref(
    container=data,
    field='species',
    key='Mus musculus',
    entity_id='NCBITaxon:10090',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10090'
)

Using add_ref_termset

The add_ref_termset method allows users not only to validate terms (i.e., keys) but also to add references for an entire dataset, rather than for single entries as we saw earlier with add_ref.

# add_ref_termset has many optional fields,
# giving the user a range of control when adding references. Let's see an example.
herd = HERD()
terms = TermSet(term_schema_path=yaml_file)

herd.add_ref_termset(container=species,
                   attribute='Species_Data',
                   key='Ursus arctos horribilis',
                   termset=terms)

Using add_ref_termset for an entire dataset

As mentioned above, add_ref_termset supports iteratively validating and populating HERD.

# When populating HERD, users may have some terms that are not in the TermSet.
# As a result, add_ref_termset will return all of the missing terms in a dictionary.
# It is up to the user to either add these terms to the TermSet or remove them
# from the dataset.

herd = HERD()
terms = TermSet(term_schema_path=yaml_file)

herd.add_ref_termset(container=species,
                   attribute='Species_Data',
                   termset=terms)
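
The validate-then-populate loop can be sketched without HDMF. In the toy below, the term set is a plain dict, `add_terms` is a hypothetical helper, and the shape of the returned dictionary (a single `missing_terms` list) is an assumption made for this illustration rather than a statement about the library's return value.

```python
# Toy term set mapping a term to its entity ID (deliberately incomplete).
toy_termset = {
    "Homo sapiens": "NCBITaxon:9606",
    "Mus musculus": "NCBITaxon:10090",
}

# "Unknown species" is a made-up value that the toy term set does not contain.
dataset = ["Homo sapiens", "Mus musculus", "Unknown species"]


def add_terms(dataset, termset):
    """Collect references for known terms and report the unknown ones."""
    references = []
    missing = []
    for term in dataset:
        if term in termset:
            references.append((term, termset[term]))
        else:
            missing.append(term)
    return references, {"missing_terms": missing}


references, report = add_terms(dataset, toy_termset)
print(references)
print(report)
```

Returning the unknown terms instead of raising on the first one lets the whole dataset be processed in a single pass, after which the user can decide how to handle what is left over.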

Write HERD

HERD is written as a zip file containing the individual tables, each written as a TSV file. The user provides the path, which includes the name of the zip file.

herd.to_zip(path='./HERD.zip')

Read HERD

Users can read HERD from the zip file by providing the path to the file itself.

er_read = HERD.from_zip(path='./HERD.zip')
os.remove('./HERD.zip')
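
For readers curious about the storage layout, a zip archive of TSV files can be written and read back with the standard library alone. This is a rough sketch of the idea, not the code behind to_zip and from_zip: the helper names and the member file names (`keys.tsv`, `entities.tsv`) are assumptions made for illustration.

```python
import csv
import io
import os
import tempfile
import zipfile

# Toy tables as lists of dicts; all values are strings so they round-trip exactly.
tables = {
    "keys": [{"key": "Homo sapiens"}, {"key": "Mus musculus"}],
    "entities": [
        {"entity_id": "NCBITaxon:9606"},
        {"entity_id": "NCBITaxon:10090"},
    ],
}


def write_tables(tables, path):
    """Write each table as a tab-separated member of one zip archive."""
    with zipfile.ZipFile(path, "w") as archive:
        for name, rows in tables.items():
            buffer = io.StringIO()
            writer = csv.DictWriter(
                buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
            )
            writer.writeheader()
            writer.writerows(rows)
            archive.writestr(f"{name}.tsv", buffer.getvalue())


def read_tables(path):
    """Read every .tsv member of the archive back into lists of dicts."""
    restored = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            text = archive.read(member).decode("utf-8")
            reader = csv.DictReader(io.StringIO(text), delimiter="\t")
            restored[member[: -len(".tsv")]] = [dict(row) for row in reader]
    return restored


with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy_herd.zip")
    write_tables(tables, zip_path)
    restored = read_tables(zip_path)
```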
