TermSet

This is a user guide for interacting with the TermSet and TermSetWrapper classes. The TermSet and TermSetWrapper types are experimental and are subject to change in future releases. If you use these types, please provide feedback to the HDMF team so that we can improve the structure and overall capabilities.

Introduction

The TermSet class provides a way for users to create their own set of terms from brain atlases, species taxonomies, and anatomical, cell, and gene function ontologies.

Users can validate their data and attributes against their own set of terms, ensuring clean data that can later be used in line with the FAIR principles. The TermSet class provides a reusable and shareable pool of metadata to serve as references for any dataset or attribute. TermSet is used closely with HERD to map terms to data more efficiently.

To apply a TermSet, users wrap their data and attributes with the TermSetWrapper, which uses the user-provided TermSet to perform validation.

TermSet is built on LinkML, a modeling language that uses YAML-based schemas, giving TermSet a standardized structure and a variety of tools to help users manage their references.

How to make a TermSet Schema

Before users can take advantage of the TermSet class, they need to create a LinkML schema (YAML) that provides all of the permissible term values. Please refer to https://linkml.io/linkml/intro/tutorial06.html to learn more about how LinkML structures its schemas.

  1. The name of the schema is up to the user, e.g., the name could be “Species” if the term set will contain species terms.

  2. The prefixes will be the standardized prefix of your source, followed by the URI to the terms. For example, the NCBI Taxonomy is abbreviated as NCBI_TAXON, and Ensembl is simply Ensembl. As mentioned above, the URI needs to point to the terms; this allows the URI to later be coupled with the source id for the term to create a valid link to the term's source page.

  3. The schema uses LinkML enumerations to list all the possible terms. To define all of the permissible values, the user can define them manually in the schema, transfer them from a Google spreadsheet, or pull them into the schema dynamically from a LinkML-supported source.

For a complete example, please view the example_term_set.yaml used in this tutorial, which shows concisely how a term set schema looks.

Note

For more information regarding LinkML Enumerations, please refer to https://linkml.io/linkml/intro/tutorial06.html.

Note

For more information on how to properly format the Google spreadsheet to be compatible with LinkML, please refer to https://linkml.io/schemasheets/#examples.

Note

For more information on how to properly format the schema to support LinkML Dynamic Enumerations, please refer to https://linkml.io/linkml/schemas/enums.html#dynamic-enums.

from hdmf.common import DynamicTable, VectorData
import os

try:
    import linkml_runtime  # noqa: F401
except ImportError as e:
    raise ImportError("Please install linkml-runtime to run this example: pip install linkml-runtime") from e
from hdmf.term_set import TermSet, TermSetWrapper

# Resolve the paths to the example schema files. ``__file__`` is undefined when running
# interactively, so fall back to paths relative to the current directory.
try:
    dir_path = os.path.dirname(os.path.abspath(__file__))
    yaml_file = os.path.join(dir_path, 'example_term_set.yaml')
    schemasheets_folder = os.path.join(dir_path, 'schemasheets')
    dynamic_schema_path = os.path.join(dir_path, 'example_dynamic_term_set.yaml')
except NameError:
    dir_path = os.path.dirname(os.path.abspath('.'))
    yaml_file = os.path.join(dir_path, 'gallery/example_term_set.yaml')
    schemasheets_folder = os.path.join(dir_path, 'gallery/schemasheets')
    dynamic_schema_path = os.path.join(dir_path, 'gallery/example_dynamic_term_set.yaml')

# Use Schemasheets to create TermSet schema
# -----------------------------------------
# The :py:class:`~hdmf.term_set.TermSet` class builds off of LinkML Schemasheets, allowing users to convert
# a Google spreadsheet into a complete LinkML schema. Once the user has defined the necessary LinkML metadata within the
# spreadsheet, the spreadsheet needs to be saved as individual tsv files, i.e., one tsv file per spreadsheet tab. Please
# refer to the Schemasheets tutorial link above for more details on the required syntax structure within the sheets.
# Once the tsv files are in a folder, the user simply provides the path to the folder with ``schemasheets_folder``.
termset = TermSet(schemasheets_folder=schemasheets_folder)
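
# The folder passed via ``schemasheets_folder`` holds one tsv file per spreadsheet tab.
# A minimal sketch (assuming the layout described above) for listing the files that
# will be picked up:
for tsv_name in sorted(os.listdir(schemasheets_folder)):
    if tsv_name.endswith('.tsv'):
        print(tsv_name)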

# Use Dynamic Enumerations to populate TermSet
# --------------------------------------------
# The :py:class:`~hdmf.term_set.TermSet` class allows users to skip manually defining permissible values by pulling from
# a LinkML supported source. These sources contain multiple ontologies. A user can select a node from an ontology,
# in which case all the elements on the branch, starting from the chosen node, will be used as permissible values.
# Please refer to the LinkML Dynamic Enumeration tutorial for more information on these sources and how to set up Dynamic
# Enumerations within the schema. Once the schema is ready, the user provides a path to the schema and sets
# ``dynamic=True``. A new schema, with the populated permissible values, will be created in the same directory.
termset = TermSet(term_schema_path=dynamic_schema_path, dynamic=True)
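
# The dynamically populated TermSet can be inspected like any other TermSet, e.g., via
# ``view_set`` (described in the next section). A minimal usage sketch; the available
# terms depend on the ontology branch selected in the schema:
print(termset.view_set)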

Viewing TermSet values

TermSet has methods to retrieve terms. The view_set property returns a dictionary of all the terms and the corresponding information for each term. Users can also index specific terms from the TermSet. Note that linkml-runtime must be installed; you can do so by running pip install linkml-runtime.

terms = TermSet(term_schema_path=yaml_file)
print(terms.view_set)

# Retrieve a specific term
terms['Homo sapiens']
{'Homo sapiens': Term_Info(id='NCBI_TAXON:9606', description='the species is human', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606'), 'Mus musculus': Term_Info(id='NCBI_TAXON:10090', description='the species is a house mouse', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10090'), 'Ursus arctos horribilis': Term_Info(id='NCBI_TAXON:116960', description='the species is a grizzly bear', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=116960'), 'Myrmecophaga tridactyla': Term_Info(id='NCBI_TAXON:71006', description='the species is an anteater', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=71006')}

Term_Info(id='NCBI_TAXON:9606', description='the species is human', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606')
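
Individual fields of a retrieved term can also be accessed by attribute. A minimal sketch, assuming Term_Info exposes the id, description, and meaning fields shown in the output above:

term = terms['Homo sapiens']
print(term.id)       # e.g., NCBI_TAXON:9606
print(term.meaning)  # link to the term's source page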

Validate Data with TermSetWrapper

TermSetWrapper can be wrapped around data. To validate data, the user sets the data argument to the wrapped data; validation must pass for the data object to be created.

data = VectorData(
    name='species',
    description='...',
    data=TermSetWrapper(value=['Homo sapiens'], termset=terms)
    )
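
If a value is not in the TermSet, validation fails and the object is not created. A minimal sketch of handling a failed validation, assuming a ValueError is raised (the exact exception type and message may differ):

try:
    VectorData(
        name='species',
        description='...',
        data=TermSetWrapper(value=['Tyrannosaurus rex'], termset=terms)
    )
except ValueError as err:
    print(err)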

Validate Attributes with TermSetWrapper

As with datasets, TermSetWrapper can be wrapped around any attribute. To validate an attribute, the user sets the attribute to the wrapped value; validation must pass for the object to be created.

data = VectorData(
    name='species',
    description=TermSetWrapper(value='Homo sapiens', termset=terms),
    data=['Human']
    )

Validate on append with TermSetWrapper

As mentioned earlier, when using a TermSetWrapper, all new data is validated. This applies when adding new data with append and extend.

data = VectorData(
    name='species',
    description='...',
    data=TermSetWrapper(value=['Homo sapiens'], termset=terms)
    )

data.append('Ursus arctos horribilis')
data.extend(['Mus musculus', 'Myrmecophaga tridactyla'])

Validate Data in a DynamicTable

Validation of data in a DynamicTable is determined by which columns were initialized with a TermSetWrapper. The data is validated when the columns are created and when rows are added with DynamicTable.add_row.

col1 = VectorData(
    name='Species_1',
    description='...',
    data=TermSetWrapper(value=['Homo sapiens'], termset=terms),
)
col2 = VectorData(
    name='Species_2',
    description='...',
    data=TermSetWrapper(value=['Mus musculus'], termset=terms),
)
species = DynamicTable(name='species', description='My species', columns=[col1, col2])

Validate new rows in a DynamicTable with TermSetWrapper

Validating new rows in a DynamicTable is simple. The add_row method automatically checks each column for a TermSetWrapper. If a wrapper is present, the data for that column is validated against the column's TermSet. If any data is invalid, the row is not added and the user is prompted to fix the new data in order to populate the table.

species.add_row(Species_1='Mus musculus', Species_2='Mus musculus')
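
A row containing a value that is not in the TermSet is rejected and not added to the table. A minimal sketch, again assuming a ValueError is raised (the exact exception type and message may differ):

try:
    species.add_row(Species_1='Homo sapiens', Species_2='Canis lupus')
except ValueError as err:
    print(err)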

Validate new columns in a DynamicTable with TermSetWrapper

To add a column that is validated using TermSetWrapper, wrap the data in the add_column method as if you were making a new instance of VectorData.

species.add_column(name='Species_3',
                   description='...',
                   data=TermSetWrapper(value=['Ursus arctos horribilis', 'Mus musculus'], termset=terms))
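
The new column is then part of the table like any other column; a quick check of the table's column names (a minimal usage sketch):

print(species.colnames)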
