scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.

    pip install scispacy
    pip install <Model URL>

|en_core_sci_sm||A full spaCy pipeline for biomedical data.||Download|
|en_core_sci_md||A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.||Download|
|en_core_sci_lg||A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.||Download|
|en_ner_craft_md||A spaCy NER model trained on the CRAFT corpus.||Download|
|en_ner_jnlpba_md||A spaCy NER model trained on the JNLPBA corpus.||Download|
|en_ner_bc5cdr_md||A spaCy NER model trained on the BC5CDR corpus.||Download|
|en_ner_bionlp13cg_md||A spaCy NER model trained on the BIONLP13CG corpus.||Download|
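Before calling `spacy.load`, it can help to check which of the models listed above are actually installed. The following is a minimal sketch, not part of scispaCy: the helper name `installed_models` is hypothetical, and it assumes that a pip-installed model is importable as a Python package under its model name.

```python
import importlib.util

# Model names taken from the table above.
SCISPACY_MODELS = [
    "en_core_sci_sm",
    "en_core_sci_md",
    "en_core_sci_lg",
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
]


def installed_models(names=SCISPACY_MODELS):
    """Return the names that can be found as importable packages.

    Hypothetical helper: assumes each model was installed with pip and is
    exposed as a top-level package named after the model.
    """
    return [name for name in names if importlib.util.find_spec(name) is not None]


# Prints an empty list if no model has been installed yet.
print(installed_models())
```

Any name in the returned list can then be passed to `spacy.load`.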
Our models achieve performance within 3% of published state-of-the-art dependency parsers and within 0.4% accuracy of state-of-the-art biomedical POS taggers.
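Dependency parsers are usually compared by unlabeled and labeled attachment score (UAS and LAS), which are the column names used in the parser results table. As a rough illustration of what those numbers measure, here is a minimal sketch. The function name `attachment_scores` and the toy sentence are hypothetical, and the sketch scores every token, whereas published evaluations may follow other conventions (for example, excluding punctuation).

```python
from typing import Sequence, Tuple


def attachment_scores(
    gold_heads: Sequence[int],
    gold_labels: Sequence[str],
    pred_heads: Sequence[int],
    pred_labels: Sequence[str],
) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens.

    UAS counts tokens whose predicted head matches the gold head.
    LAS additionally requires the dependency label to match.
    """
    n = len(gold_heads)
    if n == 0:
        raise ValueError("cannot score an empty sentence")
    if not (len(gold_labels) == len(pred_heads) == len(pred_labels) == n):
        raise ValueError("all inputs must have one entry per token")

    head_hits = 0
    labeled_hits = 0
    for g_head, g_label, p_head, p_label in zip(
        gold_heads, gold_labels, pred_heads, pred_labels
    ):
        if g_head == p_head:
            head_hits += 1
            if g_label == p_label:
                labeled_hits += 1
    return 100.0 * head_hits / n, 100.0 * labeled_hits / n


# Toy four-token sentence: heads are token indices, labels are made up.
gold_heads = [1, 1, 3, 1]
gold_labels = ["nsubj", "ROOT", "det", "dobj"]
pred_heads = [1, 1, 1, 1]  # one wrong head (token 2)
pred_labels = ["nsubj", "ROOT", "det", "obj"]  # one wrong label (token 3)

uas, las = attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=75.0 LAS=50.0
```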
|model||UAS||LAS||POS||Mentions (F1)||Web UAS|
|model||F1||Entity Types|
|en_ner_craft_md||76.60||GGP, SO, TAXON, CHEBI, GO, CL|
|en_ner_jnlpba_md||74.26||DNA, CELL_TYPE, CELL_LINE, RNA, PROTEIN|
|en_ner_bionlp13cg_md||78.28||CANCER, ORGAN, TISSUE, ORGANISM, CELL, AMINO_ACID, GENE_OR_GENE_PRODUCT, SIMPLE_CHEMICAL, ANATOMICAL_SYSTEM, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, DEVELOPING_ANATOMICAL_STRUCTURE, ORGANISM_SUBDIVISION, CELLULAR_COMPONENT|
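The NER scores above are F1 values. The exact evaluation protocol is not stated here, so the following is only a sketch of one common convention: a predicted entity counts as correct when its start offset, end offset and label all match a gold entity exactly. The function name `span_f1` and the example spans are hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def span_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """Exact-match F1 between gold and predicted entity spans (0.0 to 1.0)."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up spans using labels from the en_ner_jnlpba_md row.
gold = [(0, 4, "PROTEIN"), (10, 12, "DNA"), (20, 22, "CELL_TYPE")]
pred = [(0, 4, "PROTEIN"), (10, 12, "RNA")]  # second span has the wrong label

# One of two predictions is correct (precision 1/2) and one of three gold
# spans is found (recall 1/3).
print(f"F1 = {100 * span_f1(gold, pred):.2f}")  # F1 = 40.00
```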

    import scispacy
    import spacy

    nlp = spacy.load("en_core_sci_sm")
    text = """
    Myeloid derived suppressor cells (MDSC) are immature
    myeloid cells with immunosuppressive activity.
    They accumulate in tumor-bearing mice and humans
    with different types of cancer, including hepatocellular
    carcinoma (HCC).
    """
    doc = nlp(text)

    print(list(doc.sents))
    # >>> ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.",
    #      "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]

    # Examine the entities extracted by the mention detector.
    # Note that they don't have types like in spaCy, and they
    # are more general (e.g. including verbs) - these are any
    # spans which might be an entity in UMLS, a large
    # biomedical database.
    print(doc.ents)
    # >>> (Myeloid derived suppressor cells, MDSC, immature, myeloid cells,
    #      immunosuppressive activity, accumulate, tumor-bearing mice, humans,
    #      cancer, hepatocellular carcinoma, HCC)

    # We can also visualise dependency parses
    # (This renders automatically inside a jupyter notebook!):
    from spacy import displacy
    displacy.render(next(doc.sents), style='dep', jupyter=True)

    # See below for the generated SVG.
    # Zoom your browser in a bit!

scispaCy models are trained on data from a variety of sources. In particular, we use: