scispacy

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.

Interactive Demo

Just looking to test out the models on your data? Check out our demo.

Installing

pip install scispacy
pip install <Model URL>
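
Each model installs as an ordinary Python package named after the model, so one quick way to confirm that an install succeeded is to ask Python whether the package can be found. The sketch below is an illustrative check and not part of scispaCy; the helper name `is_model_installed` is hypothetical.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a top-level package called `model_name` can be found."""
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    # Model names taken from the table in the Models section.
    for name in ("en_core_sci_sm", "en_core_sci_md"):
        status = "installed" if is_model_installed(name) else "missing"
        print(f"{name}: {status}")
```

This only checks that the package is present; it does not load the model or verify that its version matches the installed spaCy.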

Models

Model	Description	Install URL
en_core_sci_sm	A full spaCy pipeline for biomedical data.	Download
en_core_sci_md	A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.	Download
en_core_sci_scibert	A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model.	Download
en_core_sci_lg	A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.	Download
en_ner_craft_md	A spaCy NER model trained on the CRAFT corpus.	Download
en_ner_jnlpba_md	A spaCy NER model trained on the JNLPBA corpus.	Download
en_ner_bc5cdr_md	A spaCy NER model trained on the BC5CDR corpus.	Download
en_ner_bionlp13cg_md	A spaCy NER model trained on the BIONLP13CG corpus.	Download

Performance

Our models achieve performance within 3% of published state-of-the-art dependency parsers and within 0.4% accuracy of state-of-the-art biomedical POS taggers.

model	UAS	LAS	POS	Mentions (F1)	Web UAS
en_core_sci_sm	89.18	87.15	98.18	67.89	87.36
en_core_sci_md	90.08	88.16	98.46	68.86	88.04
en_core_sci_lg	89.97	88.18	98.51	68.98	87.89
en_core_sci_scibert	92.12	90.58	98.18	67.70	92.58
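
For readers unfamiliar with the column names, UAS (unlabeled attachment score) is the percentage of tokens assigned the correct head, and LAS (labeled attachment score) additionally requires the correct dependency label. The following is a minimal sketch of those standard definitions, not the evaluation code used to produce the table; the function name is hypothetical, and because the source does not say how punctuation was handled, every token is counted here.

```python
from typing import Sequence, Tuple


def attachment_scores(
    gold_heads: Sequence[int],
    gold_labels: Sequence[str],
    pred_heads: Sequence[int],
    pred_labels: Sequence[str],
) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    n = len(gold_heads)
    if n == 0 or not (len(gold_labels) == len(pred_heads) == len(pred_labels) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    # A token counts towards UAS if its predicted head matches the gold head.
    head_ok = [g == p for g, p in zip(gold_heads, pred_heads)]
    # It counts towards LAS only if the dependency label matches as well.
    both_ok = [
        ok and gl == pl
        for ok, gl, pl in zip(head_ok, gold_labels, pred_labels)
    ]
    return 100.0 * sum(head_ok) / n, 100.0 * sum(both_ok) / n


# Toy four-token sentence: one wrong head, one wrong label.
uas, las = attachment_scores(
    gold_heads=[2, 0, 2, 3],
    gold_labels=["nsubj", "root", "obj", "amod"],
    pred_heads=[2, 0, 2, 2],
    pred_labels=["nsubj", "root", "iobj", "amod"],
)
print(uas, las)  # 75.0 50.0
```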

model	F1	Entity Types
en_ner_craft_md	78.01	GGP, SO, TAXON, CHEBI, GO, CL
en_ner_jnlpba_md	72.06	DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN
en_ner_bc5cdr_md	84.28	DISEASE, CHEMICAL
en_ner_bionlp13cg_md	77.84	AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE
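
With the NER models, each span in `doc.ents` carries one of the entity types listed above, which spaCy exposes as `ent.label_`. The sketch below groups entity texts by type. The helper names `entities_by_label` and `load_and_tag` are hypothetical, and `load_and_tag` assumes the chosen model (here `en_ner_bc5cdr_md`) has already been installed.

```python
from collections import defaultdict
from typing import Dict, List


def entities_by_label(doc) -> Dict[str, List[str]]:
    """Group the text of each entity in `doc.ents` under its label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


def load_and_tag(text: str, model_name: str = "en_ner_bc5cdr_md") -> Dict[str, List[str]]:
    """Load an installed NER model and return its entities grouped by type."""
    # Imported here so that entities_by_label can be used without spaCy installed.
    import spacy

    nlp = spacy.load(model_name)
    return entities_by_label(nlp(text))
```

For `en_ner_bc5cdr_md`, the keys of the returned dictionary would be drawn from `DISEASE` and `CHEMICAL`.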

Example Usage

import scispacy
import spacy

nlp = spacy.load("en_core_sci_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature 
myeloid cells with immunosuppressive activity. 
They accumulate in tumor-bearing mice and humans 
with different types of cancer, including hepatocellular 
carcinoma (HCC).
"""
doc = nlp(text)

print(list(doc.sents))
# Output:
# ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.",
#  "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]

# Examine the entities extracted by the mention detector.
# Note that they don't have types like in spaCy, and they
# are more general (e.g. including verbs) - these are any
# spans which might be an entity in UMLS, a large
# biomedical database.
print(doc.ents)
# Output:
# (Myeloid derived suppressor cells,
#  MDSC,
#  immature,
#  myeloid cells,
#  immunosuppressive activity,
#  accumulate,
#  tumor-bearing mice,
#  humans,
#  cancer,
#  hepatocellular carcinoma,
#  HCC)

# We can also visualise dependency parses
# (this renders automatically inside a Jupyter notebook):
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True)

# See below for the generated SVG.
# Zoom your browser in a bit!

[Image: the generated dependency parse SVG]

Data Sources

scispaCy models are trained on data from a variety of sources. In particular, we use:

The GENIA 1.0 Treebank, converted to basic Universal Dependencies using the Stanford Dependency Converter. We have made this dataset available along with the original raw data.
word2vec word vectors trained on the PubMed Central Open Access Subset.
The MedMentions Entity Linking dataset, used for training a mention detector.
OntoNotes 5.0, used to make the parser and tagger more robust to non-biomedical text. Unfortunately, this dataset is not publicly available.