spaCy models for biomedical text processing
scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.
Just looking to test out the models on your data? Check out our demo.
pip install scispacy
pip install <Model URL>
Model | Description | Install URL |
---|---|---|
en_core_sci_sm | A full spaCy pipeline for biomedical data. | Download |
en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors. | Download |
en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. | Download |
en_core_sci_lg | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors. | Download |
en_ner_craft_md | A spaCy NER model trained on the CRAFT corpus. | Download |
en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus. | Download |
en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | Download |
en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | Download |
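After installing one or more models from the table, it can help to confirm which ones are importable before calling `spacy.load`. The following is a minimal sketch, assuming each model is installed as a regular Python package under the name shown in the table. `installed_models` and `load_first_available` are hypothetical helper names used only for illustration; they are not part of scispaCy.

```python
from importlib.util import find_spec

# Model package names copied from the table above.
SCISPACY_MODELS = [
    "en_core_sci_sm",
    "en_core_sci_md",
    "en_core_sci_scibert",
    "en_core_sci_lg",
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
]


def installed_models(candidates=SCISPACY_MODELS):
    """Return the candidate packages that are importable in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


def load_first_available(candidates=SCISPACY_MODELS):
    """Load the first installed model (hypothetical convenience helper)."""
    available = installed_models(candidates)
    if not available:
        raise RuntimeError(
            "No scispaCy model found; install one with `pip install <Model URL>`."
        )
    # Imported lazily so the availability check works even without spaCy.
    import spacy

    return spacy.load(available[0])
```

Calling `installed_models()` returns an empty list when no model has been installed yet, which is a cheaper check than catching the error raised by `spacy.load`.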
Our models perform within 3% of published state-of-the-art dependency parsers and within 0.4% accuracy of state-of-the-art biomedical POS taggers.
model | UAS | LAS | POS | Mentions (F1) | Web UAS |
---|---|---|---|---|---|
en_core_sci_sm | 89.18 | 87.15 | 98.18 | 67.89 | 87.36 |
en_core_sci_md | 90.08 | 88.16 | 98.46 | 68.86 | 88.04 |
en_core_sci_lg | 89.97 | 88.18 | 98.51 | 68.98 | 87.89 |
en_core_sci_scibert | 92.12 | 90.58 | 98.18 | 67.70 | 92.58 |
model | F1 | Entity Types |
---|---|---|
en_ner_craft_md | 78.01 | GGP, SO, TAXON, CHEBI, GO, CL |
en_ner_jnlpba_md | 72.06 | DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
en_ner_bc5cdr_md | 84.28 | DISEASE, CHEMICAL |
en_ner_bionlp13cg_md | 77.84 | AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |
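Because each NER model covers a different set of entity types, picking a model usually starts from the label you need. Here is a small sketch that encodes the table above as a dictionary and looks up the models that emit a given label. `NER_MODEL_LABELS` and `models_for_label` are illustrative names, not part of the scispaCy API.

```python
# Entity types per model, copied from the table above.
NER_MODEL_LABELS = {
    "en_ner_craft_md": {"GGP", "SO", "TAXON", "CHEBI", "GO", "CL"},
    "en_ner_jnlpba_md": {"DNA", "CELL_TYPE", "CELL_LINE", "RNA", "PROTEIN"},
    "en_ner_bc5cdr_md": {"DISEASE", "CHEMICAL"},
    "en_ner_bionlp13cg_md": {
        "AMINO_ACID", "ANATOMICAL_SYSTEM", "CANCER", "CELL",
        "CELLULAR_COMPONENT", "DEVELOPING_ANATOMICAL_STRUCTURE",
        "GENE_OR_GENE_PRODUCT", "IMMATERIAL_ANATOMICAL_ENTITY",
        "MULTI-TISSUE_STRUCTURE", "ORGAN", "ORGANISM",
        "ORGANISM_SUBDIVISION", "ORGANISM_SUBSTANCE",
        "PATHOLOGICAL_FORMATION", "SIMPLE_CHEMICAL", "TISSUE",
    },
}


def models_for_label(label):
    """Return the names of the NER models whose label set includes `label`."""
    wanted = label.upper()
    return sorted(
        name for name, labels in NER_MODEL_LABELS.items() if wanted in labels
    )


print(models_for_label("DISEASE"))
print(models_for_label("cancer"))
```

The lookup is case-insensitive on the query only; an empty list means none of the four models lists that label.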
import scispacy
import spacy
nlp = spacy.load("en_core_sci_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature
myeloid cells with immunosuppressive activity.
They accumulate in tumor-bearing mice and humans
with different types of cancer, including hepatocellular
carcinoma (HCC).
"""
doc = nlp(text)
print(list(doc.sents))
# Output:
# ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.",
#  "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]
# Examine the entities extracted by the mention detector.
# Note that they don't have types like in spaCy, and they
# are more general (e.g. including verbs): these are any
# spans which might be an entity in UMLS, a large
# biomedical database.
print(doc.ents)
# Output:
# (Myeloid derived suppressor cells,
#  MDSC,
#  immature,
#  myeloid cells,
#  immunosuppressive activity,
#  accumulate,
#  tumor-bearing mice,
#  humans,
#  cancer,
#  hepatocellular carcinoma,
#  HCC)
# We can also visualise dependency parses
# (this renders automatically inside a Jupyter notebook):
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True)
# See below for the generated SVG.
# Zoom your browser in a bit!
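Beyond printing `doc.ents`, downstream code often needs each mention together with its character offsets. The sketch below collects those fields using standard spaCy `Span` attributes (`text`, `start_char`, `end_char`, `label_`). `entity_records` and `main` are hypothetical names for illustration, and `main` assumes `en_core_sci_sm` is installed. The label is included so the same helper also works with the typed NER models listed earlier.

```python
def entity_records(doc):
    """Return (text, start_char, end_char, label) for every span in doc.ents."""
    return [
        (ent.text, ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]


def main():
    # Imports are kept local so the helper above can be reused or tested
    # without spaCy or a model installed.
    import scispacy  # noqa: F401
    import spacy

    nlp = spacy.load("en_core_sci_sm")
    doc = nlp(
        "Myeloid derived suppressor cells (MDSC) are immature "
        "myeloid cells with immunosuppressive activity."
    )
    for text, start, end, label in entity_records(doc):
        print(f"{start:>4}-{end:<4} {label:<10} {text}")
```

Call `main()` in an environment where the model is available. The offsets index into the original string, so `doc.text[start:end]` recovers each mention.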
scispaCy models are trained on data from a variety of sources. In particular, we use: