ScispaCy Integration

A bridge between PyOBO and scispacy.

ScispaCy implements a lexical index in scispacy.linking_utils.KnowledgeBase, which keeps track of labels, synonyms, and definitions for entities. These are used to construct a TF-IDF index and to implement entity linking, also called named entity normalization (NEN) or grounding, in scispacy.linking.EntityLinker.
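To make the idea of a TF-IDF lexical index concrete, the following is a toy sketch written with the standard library only. It is not ScispaCy's implementation: the class ToyLexicalIndex, the use of character trigrams, and the brute-force cosine similarity are all assumptions chosen for illustration, and the three entries are typed in by hand rather than loaded through PyOBO.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Split a lowercased, space-padded string into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]


class ToyLexicalIndex:
    """Hypothetical stand-in for a TF-IDF lexical index (not a scispacy class)."""

    def __init__(self, names: dict[str, list[str]]) -> None:
        # One entry per (identifier, label-or-synonym) pair
        self.entries = [(curie, name) for curie, ns in names.items() for name in ns]
        counts = [Counter(char_ngrams(name)) for _, name in self.entries]
        # Document frequency: in how many names each n-gram occurs
        df = Counter(gram for c in counts for gram in c)
        total = len(self.entries)
        self.idf = {g: math.log((1 + total) / (1 + k)) + 1 for g, k in df.items()}
        self.vectors = [self._vectorize(c) for c in counts]

    def _vectorize(self, counts: Counter) -> dict[str, float]:
        # Weight by TF-IDF and scale to unit length; unseen n-grams are ignored
        weights = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {g: w / norm for g, w in weights.items()} if norm else {}

    def link(self, mention: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Return up to top_k (identifier, cosine similarity) pairs, best first."""
        query = self._vectorize(Counter(char_ngrams(mention)))
        best: dict[str, float] = {}
        for (curie, _), vector in zip(self.entries, self.vectors):
            score = sum(w * vector.get(g, 0.0) for g, w in query.items())
            # Keep the best-matching name for each identifier
            best[curie] = max(score, best.get(curie, 0.0))
        ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
        return ranked[:top_k]


# Hand-typed toy entries, for illustration only
index = ToyLexicalIndex(
    {
        "hgnc:391": ["AKT1", "AKT serine/threonine kinase 1"],
        "hgnc:392": ["AKT2", "AKT serine/threonine kinase 2"],
        "hgnc:393": ["AKT3", "AKT serine/threonine kinase 3"],
    }
)
results = index.link("akt1")
```

Here results holds one (identifier, score) pair per entity, with hgnc:391 ranked first because the mention matches its label exactly after lowercasing. A real index would replace the brute-force loop with a proper nearest-neighbor search.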

Constructing a Lexical Index

An ad hoc ScispaCy lexical index can be constructed on-the-fly by passing a Bioregistry prefix to pyobo.get_scispacy_knowledgebase(). In the following example, the prefix "to" is used to construct a lexical index for the Plant Trait Ontology.

import pyobo
from scispacy.linking_utils import KnowledgeBase

kb: KnowledgeBase = pyobo.get_scispacy_knowledgebase("to")

The high-level PyOBO interface abstracts the differences between external ontologies like the Plant Trait Ontology and databases that are converted to ontologies in pyobo.sources, like the HUGO Gene Nomenclature Committee (HGNC). Therefore, the same call also works with the "hgnc" prefix:

import pyobo
from scispacy.linking_utils import KnowledgeBase

kb: KnowledgeBase = pyobo.get_scispacy_knowledgebase("hgnc")

Alternatively, a reusable class can be defined as follows:

import pyobo
from scispacy.linking_utils import KnowledgeBase


class HGNCKnowledgeBase(KnowledgeBase):
    def __init__(self) -> None:
        super().__init__(pyobo.get_scispacy_entities("hgnc"))


kb = HGNCKnowledgeBase()

Constructing an Entity Linker

An entity linker can be constructed from a scispacy.linking_utils.KnowledgeBase as follows:

import pyobo
from scispacy.linking import EntityLinker

kb = pyobo.get_scispacy_knowledgebase("hgnc")
linker = EntityLinker.from_kb(kb, filter_for_definitions=False)

Here, filter_for_definitions is set to False to retain entities that don’t have a definition.
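The effect of filter_for_definitions can be pictured with a toy filter. This is only a sketch of the behavior described above, not ScispaCy's code: ToyEntity, keep_candidates, and the example identifiers are hypothetical names made up for the illustration.

```python
from typing import NamedTuple, Optional


class ToyEntity(NamedTuple):
    """Hypothetical record with only the fields needed for this illustration."""

    concept_id: str
    canonical_name: str
    definition: Optional[str] = None


def keep_candidates(entities, filter_for_definitions=True):
    """Drop entities that lack a definition when filter_for_definitions is True."""
    if not filter_for_definitions:
        return list(entities)
    return [entity for entity in entities if entity.definition]


entities = [
    ToyEntity("example:1", "first entity", "A made-up definition."),
    ToyEntity("example:2", "second entity"),  # no definition available
]
strict = keep_candidates(entities, filter_for_definitions=True)
lenient = keep_candidates(entities, filter_for_definitions=False)
```

With the flag enabled, only the first entity survives. With it disabled, both remain available as grounding candidates, which matters for resources where many entries have no definition.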

PyOBO provides a convenience function pyobo.get_scispacy_entity_linker() that wraps this workflow and also automatically caches the TF-IDF index constructed in the process in the correctly versioned folder in the PyOBO cache.

import pyobo
from scispacy.linking import EntityLinker

linker: EntityLinker = pyobo.get_scispacy_entity_linker("hgnc", filter_for_definitions=False)

Full Workflow

Once an entity linker has been constructed, it can be used in series with a spacy.Language object instantiated with spacy.load() to ground named entities that were recognized by a model like en_core_web_sm:

import pyobo
import spacy
from scispacy.linking import EntityLinker
from tabulate import tabulate

linker: EntityLinker = pyobo.get_scispacy_entity_linker("hgnc", filter_for_definitions=False)

# now, put it all together with a NER model
nlp = spacy.load("en_core_web_sm")

text = (
    "RAC(Rho family)-alpha serine/threonine-protein kinase "
    "is an enzyme that in humans is encoded by the AKT1 gene."
)
doc = linker(nlp(text))

rows = [
    (
        span,
        span.start_char,
        span.end_char,
        f"`{curie} <https://bioregistry.io/{curie}>`_",
        score,
    )
    for span in doc.ents
    for curie, score in span._.kb_ents
]
print(tabulate(rows, headers=["text", "start", "end", "curie", "score"], tablefmt="rst"))

text	start	end	curie	score
AKT1	100	104	hgnc:391	1
AKT1	100	104	hgnc:392	0.776504
AKT1	100	104	hgnc:393	0.764049

This example recognizes the AKT serine/threonine kinase 1 (AKT1) gene and provides three highly scored groundings, the best of which, hgnc:391, is correct.

Note

The groundings and scores are stored by ScispaCy in the custom extension attribute span._.kb_ents, a list of (CURIE, score) pairs for each recognized span.
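Since each grounding is a (CURIE, score) pair, downstream code usually wants to split the CURIE and discard weak matches. The following is a minimal sketch under stated assumptions: Grounding and parse_groundings are hypothetical helpers, the 0.8 threshold is an arbitrary choice, and the input list is copied from the example table instead of being read from a live span.

```python
from typing import NamedTuple


class Grounding(NamedTuple):
    """Hypothetical record for one candidate grounding of a span."""

    text: str
    prefix: str
    identifier: str
    score: float


def parse_groundings(text, kb_ents, threshold=0.8):
    """Split each CURIE into prefix and identifier, keeping scores >= threshold."""
    groundings = []
    for curie, score in kb_ents:
        prefix, _, identifier = curie.partition(":")
        if score >= threshold:
            groundings.append(Grounding(text, prefix, identifier, score))
    # Highest-scoring grounding first
    return sorted(groundings, key=lambda g: g.score, reverse=True)


# Values copied from the example table; in practice they come from span._.kb_ents
kb_ents = [("hgnc:391", 1.0), ("hgnc:392", 0.776504), ("hgnc:393", 0.764049)]
confident = parse_groundings("AKT1", kb_ents)
everything = parse_groundings("AKT1", kb_ents, threshold=0.0)
```

With the default threshold, only hgnc:391 is kept for the AKT1 span. Passing threshold=0.0 returns all three candidates in descending score order.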

Functions

`get_scispacy_entities`(prefix, **kwargs)	Iterate over entities in a given ontology via `pyobo`.
`get_scispacy_entity_linker`(prefix, *[, ...])	Get an entity linker for usage with `scispacy`.
`get_scispacy_knowledgebase`(prefix, **kwargs)	Get a lexical index for usage with `scispacy`.