TermSet¶
This is a user guide for interacting with the TermSet and TermSetWrapper classes.
The TermSet and TermSetWrapper types are experimental and are subject to change in future releases. If you use these types, please provide feedback to the HDMF team so that we can improve their structure and overall capabilities.
Introduction¶
The TermSet class provides a way for users to create their own set of terms from brain atlases, species taxonomies, and anatomical, cell, and gene function ontologies. Users can validate their data and attributes against their own set of terms, ensuring clean data that can later be used in line with the FAIR principles. The TermSet class allows for a reusable and sharable pool of metadata to serve as references for any dataset or attribute. The TermSet class is used closely with HERD to more efficiently map terms to data.
In order to actually use a TermSet, users wrap data and attributes with a TermSetWrapper, which uses the user-provided TermSet to perform validation. TermSet is built upon resources from LinkML, a modeling language that uses YAML-based schemas, giving TermSet a standardized structure and a variety of tools to help users manage their references.
How to make a TermSet Schema¶
Before the user can take advantage of the TermSet class, the user needs to create a LinkML schema (YAML) that provides all the permissible term values. Please refer to https://linkml.io/linkml/intro/tutorial06.html to learn more about how LinkML structures its schemas.
The name of the schema is up to the user, e.g., the name could be “Species” if the term set will contain species terms.
The prefixes will be the standardized prefix of your source, followed by the URI to the terms. For example, the NCBI Taxonomy is abbreviated as NCBI_TAXON, and Ensembl is simply Ensembl. As mentioned above, the URI needs to point to the terms; this allows the URI to later be combined with the source id of a term to create a valid link to the term's source page.
The schema uses LinkML enumerations to list all the possible terms. To define the permissible values, the user can define them manually in the schema, transfer them from a Google spreadsheet, or pull them into the schema dynamically from a LinkML-supported source.
For a clear example, please view the example_term_set.yaml for this tutorial, which provides a concise example of how a term set schema looks.
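To illustrate the prefix-and-URI convention described above, the following hypothetical sketch (plain Python, not hdmf code) shows how a prefix URI and a term's source id combine into a link to the term's source page:

```python
# Hypothetical sketch: combining a prefix URI with a term's source id.
# The mapping mirrors the NCBI Taxonomy prefix mentioned above.
prefix_uris = {
    "NCBI_TAXON": "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=",
}

term_id = "NCBI_TAXON:9606"  # Homo sapiens
prefix, local_id = term_id.split(":", 1)
term_url = prefix_uris[prefix] + local_id
print(term_url)
```

This is why the prefix URI must point directly at the terms: appending the source id alone must yield a valid link.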
Note
For more information regarding LinkML Enumerations, please refer to https://linkml.io/linkml/intro/tutorial06.html.
Note
For more information on how to properly format the Google spreadsheet to be compatible with LinkML, please refer to https://linkml.io/schemasheets/#examples.
Note
For more information on how to properly format the schema to support LinkML Dynamic Enumerations, please refer to https://linkml.io/linkml/schemas/enums.html#dynamic-enums.
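For orientation, a minimal term set schema might look like the sketch below. The schema id, term, and description here are illustrative assumptions following the LinkML enumeration structure; see example_term_set.yaml for the actual file used in this tutorial.

```yaml
id: https://example.com/species   # illustrative schema id
name: Species
prefixes:
  NCBI_TAXON: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=
enums:
  Species:
    permissible_values:
      Homo sapiens:
        description: the species is human
        meaning: NCBI_TAXON:9606
```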
from hdmf.common import DynamicTable, VectorData
import os

try:
    import linkml_runtime  # noqa: F401
except ImportError as e:
    raise ImportError("Please install linkml-runtime to run this example: pip install linkml-runtime") from e

from hdmf.term_set import TermSet, TermSetWrapper

try:
    dir_path = os.path.dirname(os.path.abspath(__file__))
    yaml_file = os.path.join(dir_path, 'example_term_set.yaml')
    schemasheets_folder = os.path.join(dir_path, 'schemasheets')
    dynamic_schema_path = os.path.join(dir_path, 'example_dynamic_term_set.yaml')
except NameError:
    dir_path = os.path.dirname(os.path.abspath('.'))
    yaml_file = os.path.join(dir_path, 'gallery/example_term_set.yaml')
    schemasheets_folder = os.path.join(dir_path, 'gallery/schemasheets')
    dynamic_schema_path = os.path.join(dir_path, 'gallery/example_dynamic_term_set.yaml')
Use Schemasheets to create TermSet schema¶
The TermSet class builds off of LinkML Schemasheets, allowing users to convert a Google spreadsheet into a complete LinkML schema. Once the user has defined the necessary LinkML metadata within the spreadsheet, it needs to be saved as individual tsv files, i.e., one tsv file per spreadsheet tab. Please refer to the Schemasheets tutorial link above for more details on the required syntax within the sheets. Once the tsv files are in a folder, the user simply provides the path to the folder with schemasheets_folder.
termset = TermSet(schemasheets_folder=schemasheets_folder)
Use Dynamic Enumerations to populate TermSet¶
The TermSet class allows users to skip manually defining permissible values by pulling them from a LinkML-supported source. These sources contain multiple ontologies. A user can select a node from an ontology, and all the elements on the branch starting from that node will be used as permissible values. Please refer to the LinkML Dynamic Enumeration tutorial for more information on these sources and how to set up Dynamic Enumerations within the schema. Once the schema is ready, the user provides a path to the schema and sets dynamic=True. A new schema with the populated permissible values will be created in the same directory.
termset = TermSet(term_schema_path=dynamic_schema_path, dynamic=True)
Viewing TermSet values¶
TermSet has methods to retrieve terms. The view_set method returns a dictionary of all the terms and the corresponding information for each term. Users can also index specific terms from the TermSet. LinkML runtime needs to be installed; you can do so by running pip install linkml-runtime.
terms = TermSet(term_schema_path=yaml_file)
print(terms.view_set)
# Retrieve a specific term
terms['Homo sapiens']
{'Homo sapiens': Term_Info(id='NCBI_TAXON:9606', description='the species is human', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606'), 'Mus musculus': Term_Info(id='NCBI_TAXON:10090', description='the species is a house mouse', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10090'), 'Ursus arctos horribilis': Term_Info(id='NCBI_TAXON:116960', description='the species is a grizzly bear', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=116960'), 'Myrmecophaga tridactyla': Term_Info(id='NCBI_TAXON:71006', description='the species is an anteater', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=71006')}
Term_Info(id='NCBI_TAXON:9606', description='the species is human', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606')
Validate Data with TermSetWrapper¶
TermSetWrapper can be wrapped around data. To validate data, the user assigns the wrapped data in place of the raw data; validation must pass for the data object to be created.
data = VectorData(
name='species',
description='...',
data=TermSetWrapper(value=['Homo sapiens'], termset=terms)
)
Validate Attributes with TermSetWrapper¶
Similar to wrapping datasets, TermSetWrapper
can be wrapped around any attribute.
To validate attributes, the user will set the attribute to the wrapped value, in which validation must pass
for the object to be created.
data = VectorData(
name='species',
description=TermSetWrapper(value='Homo sapiens', termset=terms),
data=['Human']
)
Validate on append with TermSetWrapper¶
As mentioned above, when using a TermSetWrapper, all new data is validated. This holds when adding new data with append and extend.
data = VectorData(
name='species',
description='...',
data=TermSetWrapper(value=['Homo sapiens'], termset=terms)
)
data.append('Ursus arctos horribilis')
data.extend(['Mus musculus', 'Myrmecophaga tridactyla'])
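Conceptually, the append-time check behaves like the following stdlib-only sketch. This is a simplified stand-in, not the actual hdmf implementation; the class name and error message are invented for illustration.

```python
# Simplified, hypothetical stand-in for TermSetWrapper's append-time validation.
class ValidatingList:
    def __init__(self, values, permissible):
        self.permissible = set(permissible)  # the allowed terms
        self.data = []
        self.extend(values)                  # initial values are validated too

    def append(self, value):
        # Reject any value that is not a permissible term.
        if value not in self.permissible:
            raise ValueError(f"{value!r} is not a permissible term")
        self.data.append(value)

    def extend(self, values):
        for value in values:
            self.append(value)

terms_set = {'Homo sapiens', 'Mus musculus', 'Ursus arctos horribilis'}
col = ValidatingList(['Homo sapiens'], terms_set)
col.append('Ursus arctos horribilis')   # passes validation
try:
    col.append('Homo erectus')          # not in the term set
except ValueError as err:
    print(err)
```

The real wrapper validates against the TermSet's permissible values in the same spirit, so invalid terms are rejected at the moment they are added rather than at write time.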
Validate Data in a DynamicTable¶
Validation of data in a DynamicTable is determined by which columns were initialized with a TermSetWrapper. The data is validated when the columns are created and when they are modified using DynamicTable.add_row.
col1 = VectorData(
name='Species_1',
description='...',
data=TermSetWrapper(value=['Homo sapiens'], termset=terms),
)
col2 = VectorData(
name='Species_2',
description='...',
data=TermSetWrapper(value=['Mus musculus'], termset=terms),
)
species = DynamicTable(name='species', description='My species', columns=[col1, col2])
Validate new rows in a DynamicTable with TermSetWrapper¶
Validating new rows in a DynamicTable is simple. The add_row method automatically checks each column for a TermSetWrapper. If a wrapper is being used, the data for that column is validated using that column's TermSet from the TermSetWrapper. If the data is invalid, the row will not be added, and the user will be prompted to fix the new data in order to populate the table.
species.add_row(Species_1='Mus musculus', Species_2='Mus musculus')
Validate new columns in a DynamicTable with TermSetWrapper¶
To add a column that is validated using TermSetWrapper, wrap the data in the add_column method as if you were making a new instance of VectorData.
species.add_column(
    name='Species_3',
    description='...',
    data=TermSetWrapper(value=['Ursus arctos horribilis', 'Mus musculus'], termset=terms),
)