scispacy

spaCy models for biomedical text processing

View the Project on GitHub allenai/scispacy

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.

Interactive Demo

Just looking to test out the models on your data? Check out our demo.

Installing

pip install scispacy
pip install <Model URL>
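
Because each model is installed as an ordinary pip package, one way to confirm that the install worked is to check whether a package with the model's name can be imported. The sketch below uses only the Python standard library. The helper name `is_model_installed` is hypothetical (it is not part of scispaCy or spaCy), and it assumes the installed package name matches the model name, for example `en_core_sci_sm`.

```python
import importlib.util


def is_model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found.

    Hypothetical helper: assumes the pip-installed model exposes a package
    whose name equals the model name (e.g. "en_core_sci_sm").
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Report the status of scispaCy itself and of one example model.
    for name in ("scispacy", "en_core_sci_sm"):
        status = "installed" if is_model_installed(name) else "missing"
        print(f"{name}: {status}")
```

A "missing" result for a model usually means the second `pip install` step has not been run in the active environment.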

Models

| Model | Description | Install URL |
|---|---|---|
| en_core_sci_sm | A full spaCy pipeline for biomedical data. | Download |
| en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors. | Download |
| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. | Download |
| en_core_sci_lg | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors. | Download |
| en_ner_craft_md | A spaCy NER model trained on the CRAFT corpus. | Download |
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus. | Download |
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | Download |
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | Download |

Performance

Our models achieve performance within 3% of published state-of-the-art dependency parsers and within 0.4% accuracy of state-of-the-art biomedical POS taggers.

| model | UAS | LAS | POS | Mentions (F1) | Web UAS |
|---|---|---|---|---|---|
| en_core_sci_sm | 89.18 | 87.15 | 98.18 | 67.89 | 87.36 |
| en_core_sci_md | 90.08 | 88.16 | 98.46 | 68.86 | 88.04 |
| en_core_sci_lg | 89.97 | 88.18 | 98.51 | 68.98 | 87.89 |
| en_core_sci_scibert | 92.12 | 90.58 | 98.18 | 67.70 | 92.58 |

| model | F1 | Entity Types |
|---|---|---|
| en_ner_craft_md | 78.01 | GGP, SO, TAXON, CHEBI, GO, CL |
| en_ner_jnlpba_md | 72.06 | DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN |
| en_ner_bc5cdr_md | 84.28 | DISEASE, CHEMICAL |
| en_ner_bionlp13cg_md | 77.84 | AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE |
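
The entity types listed for each NER model largely determine which model fits a given task. As a small illustration, the sketch below copies the entity types from the table into a dictionary and looks up which models cover a given label. The names `NER_ENTITY_TYPES` and `models_for_entity` are hypothetical helpers for this example and are not part of scispaCy.

```python
from typing import List

# Entity types per NER model, copied from the table above.
NER_ENTITY_TYPES = {
    "en_ner_craft_md": {"GGP", "SO", "TAXON", "CHEBI", "GO", "CL"},
    "en_ner_jnlpba_md": {"DNA", "CELL_TYPE", "CELL_LINE", "RNA", "PROTEIN"},
    "en_ner_bc5cdr_md": {"DISEASE", "CHEMICAL"},
    "en_ner_bionlp13cg_md": {
        "AMINO_ACID", "ANATOMICAL_SYSTEM", "CANCER", "CELL",
        "CELLULAR_COMPONENT", "DEVELOPING_ANATOMICAL_STRUCTURE",
        "GENE_OR_GENE_PRODUCT", "IMMATERIAL_ANATOMICAL_ENTITY",
        "MULTI-TISSUE_STRUCTURE", "ORGAN", "ORGANISM",
        "ORGANISM_SUBDIVISION", "ORGANISM_SUBSTANCE",
        "PATHOLOGICAL_FORMATION", "SIMPLE_CHEMICAL", "TISSUE",
    },
}


def models_for_entity(label: str) -> List[str]:
    """Return the models whose entity types include `label`, in table order."""
    return [name for name, types in NER_ENTITY_TYPES.items() if label in types]


if __name__ == "__main__":
    for label in ("DISEASE", "PROTEIN", "CANCER"):
        print(label, "->", models_for_entity(label))
```

Labels are matched exactly as written in the table, so a lookup for `DISEASE` finds only `en_ner_bc5cdr_md`, and a label that no model lists yields an empty list.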

Example Usage

import scispacy
import spacy

nlp = spacy.load("en_core_sci_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature 
myeloid cells with immunosuppressive activity. 
They accumulate in tumor-bearing mice and humans 
with different types of cancer, including hepatocellular 
carcinoma (HCC).
"""
doc = nlp(text)

print(list(doc.sents))
# Output:
# ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.",
#  "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]

# Examine the entities extracted by the mention detector.
# Note that they don't have types like in spaCy, and they
# are more general (e.g. including verbs). These are any
# spans which might be an entity in UMLS, a large
# biomedical database.
print(doc.ents)
# Output:
# (Myeloid derived suppressor cells,
#  MDSC,
#  immature,
#  myeloid cells,
#  immunosuppressive activity,
#  accumulate,
#  tumor-bearing mice,
#  humans,
#  cancer,
#  hepatocellular carcinoma,
#  HCC)

# We can also visualise dependency parses.
# (This renders automatically inside a Jupyter notebook!)
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True)

# See below for the generated SVG.
# Zoom your browser in a bit!

[Figure: dependency parse SVG generated by displaCy]
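
Unlike the mention detector used in the example above, the NER models listed under Models assign a type to each entity. The following sketch shows one way to read those types through spaCy's standard `ent.text` and `ent.label_` attributes. The function names `typed_entities` and `extract` are hypothetical, and `extract` assumes that scispaCy and the `en_ner_bc5cdr_md` model are already installed. No output is shown because it depends on the installed model version.

```python
from typing import List, Tuple


def typed_entities(doc) -> List[Tuple[str, str]]:
    """Collect (text, label) pairs from a processed spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def extract(text: str, model_name: str = "en_ner_bc5cdr_md") -> List[Tuple[str, str]]:
    """Load a model and return its typed entities for `text`.

    Assumes the named model has been installed with pip beforehand.
    """
    import spacy  # imported lazily so typed_entities works without spaCy

    nlp = spacy.load(model_name)
    return typed_entities(nlp(text))
```

With `en_ner_bc5cdr_md`, the labels in the returned pairs would come from that model's entity types, DISEASE and CHEMICAL. Passing a different `model_name` switches to another model's label set.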

Data Sources

scispaCy models are trained on data from a variety of sources. In particular, we use: