ScispaCy Integration
A bridge between PyOBO and scispacy.
scispacy implements a lexical index in
scispacy.linking_utils.KnowledgeBase which keeps track of labels, synonyms, and
definitions for entities. These are used to construct a TF-IDF index and implement
entity linking (also called named entity normalization (NEN) or grounding) in
scispacy.linking.EntityLinker.
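To make the idea of a TF-IDF lexical index concrete, the following is a minimal, standard-library sketch of linking a mention to an identifier by cosine similarity over character trigrams. It is not scispacy's implementation: `ToyLexicalIndex`, `char_ngrams`, and the `example:` identifiers are hypothetical names introduced only for illustration, and the real library uses its own vectorizer and nearest-neighbor search.

```python
import math
from collections import Counter

# hgnc:391 and its names come from this page; the "example:" entries are
# hypothetical placeholders that only serve as distractors.
NAMES = {
    "hgnc:391": ["AKT1", "AKT serine/threonine kinase 1"],
    "example:0001": ["placeholder kinase"],
    "example:0002": ["unrelated trait"],
}


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Split a string into overlapping, lowercased character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]


class ToyLexicalIndex:
    """A toy TF-IDF index over entity names (illustrative only)."""

    def __init__(self, names: dict[str, list[str]]) -> None:
        # One row per (identifier, name) pair, so synonyms are indexed too
        self.rows = [(curie, name) for curie, ns in names.items() for name in ns]
        counts = [Counter(char_ngrams(name)) for _, name in self.rows]
        # Document frequency: in how many names does each n-gram occur?
        df = Counter(gram for c in counts for gram in c)
        total = len(self.rows)
        self.idf = {g: math.log((1 + total) / (1 + k)) + 1 for g, k in df.items()}
        self.vectors = [self._vectorize(c) for c in counts]

    def _vectorize(self, counts: Counter) -> dict[str, float]:
        """Build an L2-normalized TF-IDF vector; unknown n-grams are ignored."""
        weights = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {g: w / norm for g, w in weights.items()} if norm else {}

    def link(self, mention: str, k: int = 3) -> list[tuple[str, float]]:
        """Return up to k (identifier, score) pairs, best first."""
        query = self._vectorize(Counter(char_ngrams(mention)))
        best: dict[str, float] = {}
        for (curie, _), vector in zip(self.rows, self.vectors):
            # Cosine similarity, since both vectors are unit length
            score = sum(w * vector.get(g, 0.0) for g, w in query.items())
            best[curie] = max(score, best.get(curie, 0.0))
        ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
        return [(curie, score) for curie, score in ranked[:k] if score > 0]


index = ToyLexicalIndex(NAMES)
results = index.link("AKT1")
print(results)
```

Indexing every synonym as its own row and keeping the best score per identifier mirrors the general approach of matching mentions against labels and synonyms alike.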
Constructing a Lexical Index
An ad hoc ScispaCy lexical index can be constructed on-the-fly by passing a
Bioregistry prefix to pyobo.get_scispacy_knowledgebase(). In the following
example, the prefix "to" is used to construct a lexical index for the Plant Trait
Ontology.
import pyobo
from scispacy.linking_utils import KnowledgeBase
kb: KnowledgeBase = pyobo.get_scispacy_knowledgebase("to")
The high-level PyOBO interface abstracts the differences between external ontologies
like the Plant Trait Ontology and databases that are converted to ontologies in
pyobo.sources like the HUGO Gene Nomenclature Committee (HGNC). Therefore, the same function also works with the "hgnc" prefix:
import pyobo
from scispacy.linking_utils import KnowledgeBase
kb: KnowledgeBase = pyobo.get_scispacy_knowledgebase("hgnc")
Alternatively, a reusable class can be defined as follows:
import pyobo
from scispacy.linking_utils import KnowledgeBase
class HGNCKnowledgeBase(KnowledgeBase):
    def __init__(self) -> None:
        super().__init__(pyobo.get_scispacy_entities("hgnc"))

kb = HGNCKnowledgeBase()
Constructing an Entity Linker
An entity linker can be constructed from a scispacy.linking_utils.KnowledgeBase
as follows:
import pyobo
from scispacy.linking import EntityLinker
kb = pyobo.get_scispacy_knowledgebase("hgnc")
linker = EntityLinker.from_kb(kb, filter_for_definitions=False)
Here, filter_for_definitions is set to False to retain entities that don’t have
a definition.
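The effect of this flag can be sketched with plain Python. The snippet below is a simplified illustration, not scispacy's code: `ToyEntity` and `select_candidates` are hypothetical stand-ins, the `example:0001` record is a placeholder, and the real linker may apply additional score thresholds when deciding which candidates to keep.

```python
from typing import NamedTuple, Optional


class ToyEntity(NamedTuple):
    """Hypothetical stand-in for a knowledge base entry."""

    concept_id: str
    canonical_name: str
    definition: Optional[str] = None


def select_candidates(
    entities: list[ToyEntity], filter_for_definitions: bool
) -> list[ToyEntity]:
    """Keep all entities, or only those that have a definition."""
    if not filter_for_definitions:
        return list(entities)
    return [entity for entity in entities if entity.definition]


entities = [
    # Assumption for illustration: this entry carries no definition
    ToyEntity("hgnc:391", "AKT1"),
    # Hypothetical placeholder entry that does carry a definition
    ToyEntity("example:0001", "placeholder", definition="A made-up entry."),
]

kept_all = select_candidates(entities, filter_for_definitions=False)
kept_defined = select_candidates(entities, filter_for_definitions=True)
print(len(kept_all), len(kept_defined))
```

When a resource provides few or no definitions, filtering on them would discard most candidates, which is why the flag is turned off here.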
PyOBO provides a convenience function pyobo.get_scispacy_entity_linker() that
wraps this workflow and automatically caches the resulting TF-IDF index in the
appropriately versioned folder of the PyOBO cache:
import pyobo
from scispacy.linking import EntityLinker
linker: EntityLinker = pyobo.get_scispacy_entity_linker("hgnc", filter_for_definitions=False)
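The caching behavior described above follows a common load-or-build pattern keyed by resource and version. The sketch below shows that general pattern only; `load_or_build`, the `index.pkl` file name, the folder layout, and the `v1`/`v2` version labels are all assumptions made for illustration and do not describe PyOBO's actual cache structure.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(
    cache_root: str, prefix: str, version: str, builder: Callable[[], Any]
) -> Any:
    """Load a cached object for (prefix, version), building it on a miss."""
    # Hypothetical layout: <cache_root>/<prefix>/<version>/index.pkl
    path = Path(cache_root) / prefix / version / "index.pkl"
    if path.is_file():
        with path.open("rb") as file:
            return pickle.load(file)
    obj = builder()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(obj, file)
    return obj


build_calls = []


def build_index() -> dict:
    """Stand-in for an expensive index construction step."""
    build_calls.append("built")
    return {"terms": ["AKT1"]}


with tempfile.TemporaryDirectory() as root:
    first = load_or_build(root, "hgnc", "v1", build_index)   # cache miss
    second = load_or_build(root, "hgnc", "v1", build_index)  # cache hit
    third = load_or_build(root, "hgnc", "v2", build_index)   # new version, miss

print(len(build_calls))
```

Keying the folder on the version means that a new release of the underlying resource never reuses an index built from older data.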
Full Workflow
Once an entity linker has been constructed, it can be used in series with a
spacy.Language object instantiated with spacy.load() to ground named
entities that were recognized by a model like en_core_web_sm:
import pyobo
import spacy
from scispacy.linking import EntityLinker
from tabulate import tabulate
linker: EntityLinker = pyobo.get_scispacy_entity_linker("hgnc", filter_for_definitions=False)
# now, put it all together with a NER model
nlp = spacy.load("en_core_web_sm")
text = (
    "RAC(Rho family)-alpha serine/threonine-protein kinase "
    "is an enzyme that in humans is encoded by the AKT1 gene."
)
doc = linker(nlp(text))
# one row per (recognized span, candidate grounding) pair
rows = [
    (
        span,
        span.start_char,
        span.end_char,
        f"`{curie} <https://bioregistry.io/{curie}>`_",
        score,
    )
    for span in doc.ents
    for curie, score in span._.kb_ents
]
# the headers match the five columns of each row
print(tabulate(rows, headers=["text", "start", "end", "curie", "score"], tablefmt="rst"))
| text | start | end | curie | score |
|---|---|---|---|---|
| AKT1 | 100 | 104 | hgnc:391 | 1 |
| AKT1 | 100 | 104 | | 0.776504 |
| AKT1 | 100 | 104 | | 0.764049 |
This example recognizes the AKT serine/threonine kinase 1 (AKT1) gene and provides three highly scored groundings, the best of which, hgnc:391, is correct.
Note
The groundings and scores are stored by ScispaCy in the hidden attribute
span._.kb_ents.
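Because each entry in span._.kb_ents is a (CURIE, score) pair, picking a single grounding per span is a small post-processing step. The following sketch works on plain tuples so that it runs without any model; `best_grounding` and `bioregistry_url` are hypothetical helpers, the scores are taken from the table above, and the two `example:` identifiers are placeholders because the table does not show the lower-ranked CURIEs.

```python
from typing import Optional


def best_grounding(
    kb_ents: list[tuple[str, float]], min_score: float = 0.0
) -> Optional[tuple[str, float]]:
    """Return the highest-scoring (curie, score) pair at or above min_score."""
    candidates = [(curie, score) for curie, score in kb_ents if score >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])


def bioregistry_url(curie: str) -> str:
    """Build a Bioregistry link using the URL pattern from the workflow above."""
    return f"https://bioregistry.io/{curie}"


# Same shape as span._.kb_ents; "example:" CURIEs are placeholders
kb_ents = [
    ("hgnc:391", 1.0),
    ("example:0001", 0.776504),
    ("example:0002", 0.764049),
]

top = best_grounding(kb_ents, min_score=0.8)
if top is not None:
    print(top[0], bioregistry_url(top[0]))
```

A score cutoff such as `min_score` is a design choice for the downstream application; the value used here is arbitrary.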
Functions
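| function | description |
|---|---|
| pyobo.get_scispacy_entities() | Iterate over entities in a given ontology via PyOBO. |
| pyobo.get_scispacy_entity_linker() | Get an entity linker for usage with ScispaCy. |
| pyobo.get_scispacy_knowledgebase() | Get a lexical index for usage with ScispaCy. |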