ScispaCy Integration

A bridge between PyOBO and scispacy.

ScispaCy implements a lexical index in scispacy.linking_utils.KnowledgeBase, which keeps track of labels, synonyms, and definitions for entities. These are used to construct a TF-IDF index and to implement entity linking (also called named entity normalization (NEN) or grounding) in scispacy.linking.EntityLinker.
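To make the idea of a TF-IDF lexical index more concrete, the following pure-Python sketch scores a mention against entity names using character trigrams and cosine similarity. It is an illustration only and not ScispaCy's implementation: the ToyLexicalIndex class, the trigrams helper, and the three hand-typed HGNC entries are assumptions made for this example.

```python
import math
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count character 3-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i : i + 3] for i in range(len(padded) - 2))


class ToyLexicalIndex:
    """A toy TF-IDF index over entity names (hypothetical, for illustration)."""

    def __init__(self, names_by_curie: dict[str, list[str]]) -> None:
        # Flatten to (curie, name) pairs so that synonyms are indexed as well.
        self.entries = [
            (curie, name)
            for curie, names in names_by_curie.items()
            for name in names
        ]
        counts = [trigrams(name) for _, name in self.entries]
        # Document frequency: the number of names in which each trigram occurs.
        df = Counter(gram for c in counts for gram in c)
        n = len(self.entries)
        self.idf = {g: math.log((1 + n) / (1 + d)) + 1 for g, d in df.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts: Counter) -> dict[str, float]:
        # Trigrams unseen at index time are dropped; the result is L2-normalised.
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {g: w / norm for g, w in vec.items()}

    def link(self, mention: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Return up to top_k (curie, cosine similarity) pairs, best first."""
        query = self._weigh(trigrams(mention))
        best: dict[str, float] = {}
        for (curie, _), vec in zip(self.entries, self.vectors):
            score = sum(w * vec.get(g, 0.0) for g, w in query.items())
            # Keep the best-scoring name (label or synonym) per entity.
            best[curie] = max(best.get(curie, 0.0), score)
        ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
        return ranked[:top_k]


# Hand-typed sample data, not loaded from PyOBO.
index = ToyLexicalIndex(
    {
        "hgnc:391": ["AKT1", "AKT serine/threonine kinase 1"],
        "hgnc:392": ["AKT2", "AKT serine/threonine kinase 2"],
        "hgnc:393": ["AKT3", "AKT serine/threonine kinase 3"],
    }
)
results = index.link("AKT1")
for curie, score in results:
    print(curie, round(score, 3))
```

An exact match on a label or synonym yields a similarity of 1, while near matches such as AKT2 receive lower but non-zero scores because they share some trigrams with the mention.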

Constructing a Lexical Index

An ad hoc ScispaCy lexical index can be constructed on the fly by passing a Bioregistry prefix to pyobo.get_scispacy_knowledgebase(). In the following example, the prefix "to" is used to construct a lexical index for the Plant Trait Ontology.

import pyobo
from scispacy.linking_utils import KnowledgeBase

kb: KnowledgeBase = pyobo.get_scispacy_knowledgebase("to")

The high-level PyOBO interface abstracts the differences between external ontologies, like the Plant Trait Ontology, and databases that are converted to ontologies in pyobo.sources, like the HUGO Gene Nomenclature Committee (HGNC). Therefore, the same call also works with the "hgnc" prefix:

import pyobo
from scispacy.linking_utils import KnowledgeBase

kb: KnowledgeBase = pyobo.get_scispacy_knowledgebase("hgnc")

Alternatively, a reusable class can be defined as follows:

import pyobo
from scispacy.linking_utils import KnowledgeBase


class HGNCKnowledgeBase(KnowledgeBase):
    def __init__(self) -> None:
        super().__init__(pyobo.get_scispacy_entities("hgnc"))


kb = HGNCKnowledgeBase()

Constructing an Entity Linker

An entity linker can be constructed from a scispacy.linking_utils.KnowledgeBase as follows:

import pyobo
from scispacy.linking import EntityLinker

kb = pyobo.get_scispacy_knowledgebase("hgnc")
linker = EntityLinker.from_kb(kb, filter_for_definitions=False)

Here, filter_for_definitions is set to False to retain entities that don’t have a definition.
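The effect of this flag can be pictured with a small pure-Python sketch. It does not call ScispaCy: ToyEntity and keep_candidates are hypothetical names, and the sketch assumes only that the flag drops candidate entities whose definition is missing.

```python
from typing import NamedTuple, Optional


class ToyEntity(NamedTuple):
    """A hypothetical stand-in for a knowledge base entry."""

    curie: str
    name: str
    definition: Optional[str] = None


def keep_candidates(
    candidates: list[ToyEntity], filter_for_definitions: bool
) -> list[ToyEntity]:
    """Drop candidates without a definition when the flag is set."""
    if not filter_for_definitions:
        return list(candidates)
    return [entity for entity in candidates if entity.definition is not None]


# Made-up entries for illustration only.
candidates = [
    ToyEntity("ex:1", "entity with a definition", "An example definition."),
    ToyEntity("ex:2", "entity without a definition"),
]

print(len(keep_candidates(candidates, filter_for_definitions=True)))
print(len(keep_candidates(candidates, filter_for_definitions=False)))
```

Resources whose entries mostly lack definitions would lose most of their candidates with the flag enabled, which is why the examples here disable it.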

PyOBO provides a convenience function, pyobo.get_scispacy_entity_linker(), that wraps this workflow and automatically caches the TF-IDF index it constructs in the correctly versioned folder of the PyOBO cache.

import pyobo
from scispacy.linking import EntityLinker

linker: EntityLinker = pyobo.get_scispacy_entity_linker("hgnc", filter_for_definitions=False)

Full Workflow

Once an entity linker has been constructed, it can be used in series with a spacy.Language object instantiated with spacy.load() to ground named entities that were recognized by a model like en_core_web_sm:

import pyobo
import spacy
from scispacy.linking import EntityLinker
from tabulate import tabulate

linker: EntityLinker = pyobo.get_scispacy_entity_linker("hgnc", filter_for_definitions=False)

# now, put it all together with a NER model
nlp = spacy.load("en_core_web_sm")

text = (
    "RAC(Rho family)-alpha serine/threonine-protein kinase "
    "is an enzyme that in humans is encoded by the AKT1 gene."
)
doc = linker(nlp(text))

rows = [
    (
        span,
        span.start_char,
        span.end_char,
        f"`{curie} <https://bioregistry.io/{curie}>`_",
        score,
    )
    for span in doc.ents
    for curie, score in span._.kb_ents
]
print(tabulate(rows, headers=["text", "start", "end", "curie", "score"], tablefmt="rst"))

====  =====  ===  ========  ========
text  start  end  curie     score
====  =====  ===  ========  ========
AKT1  100    104  hgnc:391  1
AKT1  100    104  hgnc:392  0.776504
AKT1  100    104  hgnc:393  0.764049
====  =====  ===  ========  ========

This example recognizes the AKT serine/threonine kinase 1 (AKT1) gene and provides three highly scored groundings, the best of which, hgnc:391, is correct.

Note

The groundings and scores are stored by ScispaCy in the custom extension attribute span._.kb_ents as (identifier, score) pairs.
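Because each span can carry several candidate groundings, downstream code often keeps only the best one above some cutoff. The sketch below shows one way to do that on a plain list of (CURIE, score) pairs shaped like span._.kb_ents, using the values from the table above. The helper names best_grounding and bioregistry_url and the 0.8 threshold are assumptions made for this example, not part of PyOBO or ScispaCy.

```python
from typing import Optional


def best_grounding(
    kb_ents: list[tuple[str, float]], threshold: float = 0.8
) -> Optional[tuple[str, float]]:
    """Return the highest-scoring (curie, score) pair at or above the threshold."""
    candidates = [(curie, score) for curie, score in kb_ents if score >= threshold]
    return max(candidates, key=lambda pair: pair[1], default=None)


def bioregistry_url(curie: str) -> str:
    """Build the Bioregistry resolver link used in the example table."""
    return f"https://bioregistry.io/{curie}"


# Values copied from the example output for the AKT1 span.
kb_ents = [("hgnc:391", 1.0), ("hgnc:392", 0.776504), ("hgnc:393", 0.764049)]

best = best_grounding(kb_ents)
if best is not None:
    curie, score = best
    print(curie, score, bioregistry_url(curie))
```

Returning None when nothing passes the threshold lets callers distinguish an ungrounded mention from a low-confidence one.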

Functions

get_scispacy_entities(prefix, **kwargs)

Iterate over entities in a given ontology via pyobo.

get_scispacy_entity_linker(prefix, *[, ...])

Get an entity linker for usage with scispacy.

get_scispacy_knowledgebase(prefix, **kwargs)

Get a lexical index for usage with scispacy.