allennlp.data.tokenizers

class allennlp.data.tokenizers.token.Token(text: str = None, idx: int = None, pos: str = None, tag: str = None, dep: str = None, ent_type: str = None, text_id: int = None) → None[source]

Bases: object

A simple token representation, keeping track of the token’s text, its character offset in the passage it was taken from, its POS tag, dependency relation, and entity type. These fields match spacy’s exactly, so we can just use a spacy token for this.

Parameters:

text : str, optional

The original text represented by this token.

idx : int, optional

The character offset of this token into the tokenized passage.

pos : str, optional

The coarse-grained part of speech of this token.

tag : str, optional

The fine-grained part of speech of this token.

dep : str, optional

The dependency relation for this token.

ent_type : str, optional

The entity type (i.e., the NER tag) for this token.

text_id : int, optional

If your tokenizer returns integers instead of strings (e.g., because you’re doing byte encoding, or some hash-based embedding), set this with the integer. If this is set, we will bypass the vocabulary when indexing this token, regardless of whether text is also set. You can also set text with the original text, if you want, so that you can still use a character-level representation in addition to a hash-based word embedding.

The other fields on Token follow the fields on spacy’s Token object; this is one we added, similar to spacy’s lex_id.
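
For illustration, here is a minimal sketch of constructing tokens directly (the specific field values are made up; this assumes an AllenNLP version matching this documentation):

from allennlp.data.tokenizers.token import Token

# Only `text` is commonly set; every field defaults to None.
token = Token(text="Seattle", idx=0, pos="PROPN", ent_type="GPE")
print(token.text, token.idx, token.pos, token.ent_type)

# A tokenizer that produces integers (e.g., byte encoding) can set text_id,
# which bypasses the vocabulary when this token is indexed.
byte_token = Token(text="S", text_id=83)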

This module contains various classes for performing tokenization, stemming, and filtering.

class allennlp.data.tokenizers.tokenizer.Tokenizer[source]

Bases: allennlp.common.registrable.Registrable

A Tokenizer splits strings of text into tokens. Typically, this either splits text into word tokens or character tokens, and those are the two tokenizer subclasses we have implemented here, though you could imagine wanting to do other kinds of tokenization for structured or other inputs.

As part of tokenization, concrete implementations of this API will also handle stemming, stopword filtering, adding start and end tokens, or other kinds of things you might want to do to your tokens. See the parameters to, e.g., WordTokenizer, or whichever tokenizer you want to use.

If the base input to your model is words, you should use a WordTokenizer, even if you also want to have a character-level encoder to get an additional vector for each word token. Splitting word tokens into character arrays is handled separately, in the token_representations.TokenRepresentation class.

default_implementation = 'word'
classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.tokenizer.Tokenizer[source]
tokenize(text: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

The only public method for this class. Actually implements splitting text into tokens.

Returns: tokens : List[Token]
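
Since Tokenizer is Registrable, a concrete tokenizer can also be built from configuration via from_params. A minimal sketch, assuming an AllenNLP version matching this documentation (the registered name "word" is taken from default_implementation above):

from allennlp.common.params import Params
from allennlp.data.tokenizers.tokenizer import Tokenizer

# "word" selects the WordTokenizer; any remaining keys (none here) are handed
# to that class's own from_params.
tokenizer = Tokenizer.from_params(Params({"type": "word"}))
print([t.text for t in tokenizer.tokenize("Tokenize me.")])
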
class allennlp.data.tokenizers.word_tokenizer.WordTokenizer(word_splitter: allennlp.data.tokenizers.word_splitter.WordSplitter = None, word_filter: allennlp.data.tokenizers.word_filter.WordFilter = <allennlp.data.tokenizers.word_filter.PassThroughWordFilter object>, word_stemmer: allennlp.data.tokenizers.word_stemmer.WordStemmer = <allennlp.data.tokenizers.word_stemmer.PassThroughWordStemmer object>, start_tokens: typing.List[str] = None, end_tokens: typing.List[str] = None) → None[source]

Bases: allennlp.data.tokenizers.tokenizer.Tokenizer

A WordTokenizer handles the splitting of strings into words as well as any desired post-processing (e.g., stemming, filtering, etc.). Note that we leave one particular piece of post-processing for later: the decision of whether or not to lowercase the token. This is for two reasons: (1) if you want to make two different casing decisions for whatever reason, you won’t have to run the tokenizer twice, and more importantly (2) if you want to lowercase words for your word embedding, but retain capitalization in a character-level representation, we need to retain the capitalization here.

Parameters:

word_splitter : WordSplitter, optional

The WordSplitter to use for splitting text strings into word tokens. The default is to use the SpacyWordSplitter with default parameters.

word_filter : WordFilter, optional

The WordFilter to use for, e.g., removing stopwords. Default is to do no filtering.

word_stemmer : WordStemmer, optional

The WordStemmer to use. Default is no stemming.

start_tokens : List[str], optional

If given, these tokens will be added to the beginning of every string we tokenize.

end_tokens : List[str], optional

If given, these tokens will be added to the end of every string we tokenize.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_tokenizer.WordTokenizer[source]
tokenize(text: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

Does whatever processing is required to convert a string of text into a sequence of tokens.

At a minimum, this uses a WordSplitter to split the text into words. It may also do stemming or stopword removal, depending on the parameters given to the constructor.
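
As a quick sketch of using this class (the @start@ and @end@ strings are arbitrary markers chosen for illustration, not anything WordTokenizer requires):

from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

# Defaults: spacy splitting, no filtering, no stemming, original casing kept.
tokenizer = WordTokenizer(start_tokens=["@start@"], end_tokens=["@end@"])
tokens = tokenizer.tokenize("Don't lowercase here; leave that to the indexer.")
print([t.text for t in tokens])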

class allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer(byte_encoding: str = None, lowercase_characters: bool = False, start_tokens: typing.List[str] = None, end_tokens: typing.List[str] = None) → None[source]

Bases: allennlp.data.tokenizers.tokenizer.Tokenizer

A CharacterTokenizer splits strings into character tokens.

Parameters:

byte_encoding : str, optional (default=None)

If not None, we will use this encoding to encode the string as bytes, and use the byte sequence as characters, instead of the unicode characters in the python string. E.g., the character ‘á’ would be a single token if this option is None, but it would be two tokens if this option is set to "utf-8".

If this is not None, the tokens returned by tokenize will carry integer ids (via text_id) rather than strings, and we will bypass the vocabulary in the TokenIndexer.

lowercase_characters : bool, optional (default=False)

If True, we will lowercase all of the characters in the text before doing any other operation. You probably do not want to do this, as character vocabularies are generally not very large to begin with, but it’s an option if you really want it.

start_tokens : List[str], optional

If given, these tokens will be added to the beginning of every string we tokenize. If using byte encoding, this should actually be a List[int], not a List[str].

end_tokens : List[str], optional

If given, these tokens will be added to the end of every string we tokenize. If using byte encoding, this should actually be a List[int], not a List[str].

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer[source]
tokenize(text: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

The only public method for this class. Actually implements splitting text into tokens.

Returns: tokens : List[Token]
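
A small sketch contrasting unicode-character and byte-level tokenization (assuming an AllenNLP version matching this documentation):

from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer

# Unicode characters: 'á' stays a single token.
char_tokenizer = CharacterTokenizer()
print([t.text for t in char_tokenizer.tokenize("más")])  # ['m', 'á', 's']

# Byte-level: under utf-8, 'á' encodes to two bytes, so "más" yields four
# tokens, carried as integer ids that bypass the vocabulary.
byte_tokenizer = CharacterTokenizer(byte_encoding="utf-8")
print(len(byte_tokenizer.tokenize("más")))  # 4
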
class allennlp.data.tokenizers.word_filter.PassThroughWordFilter[source]

Bases: allennlp.data.tokenizers.word_filter.WordFilter

Does not filter words; it’s a no-op. This is the default word filter.

filter_words(words: typing.List[allennlp.data.tokenizers.token.Token]) → typing.List[allennlp.data.tokenizers.token.Token][source]

Returns a filtered list of words.

class allennlp.data.tokenizers.word_filter.StopwordFilter[source]

Bases: allennlp.data.tokenizers.word_filter.WordFilter

Uses a list of stopwords to filter.

filter_words(words: typing.List[allennlp.data.tokenizers.token.Token]) → typing.List[allennlp.data.tokenizers.token.Token][source]

Returns a filtered list of words.

class allennlp.data.tokenizers.word_filter.WordFilter[source]

Bases: allennlp.common.registrable.Registrable

A WordFilter removes words from a token list. Typically, this is for stopword removal, though you could feasibly use it for more domain-specific removal if you want.

Word removal happens before stemming, so keep that in mind if you’re designing a list of words to be removed.

default_implementation = 'pass_through'
filter_words(words: typing.List[allennlp.data.tokenizers.token.Token]) → typing.List[allennlp.data.tokenizers.token.Token][source]

Returns a filtered list of words.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_filter.WordFilter[source]
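
Because WordFilter is Registrable, you can plug in your own filter and select it by name in configuration. A hypothetical sketch (the name "short_words" and the length cutoff are made up for illustration):

from typing import List

from allennlp.data.tokenizers.token import Token
from allennlp.data.tokenizers.word_filter import WordFilter
from allennlp.data.tokenizers.word_tokenizer import WordTokenizer


@WordFilter.register("short_words")
class ShortWordFilter(WordFilter):
    # Drops tokens shorter than three characters (illustrative only).
    def filter_words(self, words: List[Token]) -> List[Token]:
        return [word for word in words if len(word.text) >= 3]


tokenizer = WordTokenizer(word_filter=ShortWordFilter())
print([t.text for t in tokenizer.tokenize("It is a very large corpus.")])
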
class allennlp.data.tokenizers.word_splitter.JustSpacesWordSplitter[source]

Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter that assumes you’ve already done your own tokenization somehow and have separated the tokens by spaces. We just split the input string on whitespace and return the resulting list. We use a somewhat odd name here to avoid coming too close to the more commonly used SpacyWordSplitter.

Note that we use sentence.split(), which means that the amount of whitespace between the tokens does not matter. This will never result in spaces being included as tokens.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_splitter.WordSplitter[source]
split_words(sentence: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

Splits sentence into a list of Token objects.
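
A small sketch of the behavior described above:

from allennlp.data.tokenizers.word_splitter import JustSpacesWordSplitter

splitter = JustSpacesWordSplitter()
# The amount of whitespace between tokens does not matter.
print([t.text for t in splitter.split_words("already   tokenized  text")])
# ['already', 'tokenized', 'text']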

class allennlp.data.tokenizers.word_splitter.LettersDigitsWordSplitter[source]

Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter which keeps runs of (unicode) letters and runs of digits together, while every other non-whitespace character becomes a separate word.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_splitter.WordSplitter[source]
split_words(sentence: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

Splits sentence into a list of Token objects.
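
A short, hedged sketch of what this splitter does (assuming an AllenNLP version matching this documentation):

from allennlp.data.tokenizers.word_splitter import LettersDigitsWordSplitter

splitter = LettersDigitsWordSplitter()
# Letter runs and digit runs stay together; every other non-whitespace
# character becomes its own token, so this should come out roughly as
# ['over', '9', ',', '000', '!'].
print([t.text for t in splitter.split_words("over 9,000!")])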

class allennlp.data.tokenizers.word_splitter.NltkWordSplitter[source]

Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter that uses nltk’s word_tokenize method.

I found that nltk is very slow, so I switched to using my own simple one, which is a good deal faster. But I’m adding this one back so that there’s consistency with older versions of the code, if you really want it.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_splitter.WordSplitter[source]
split_words(sentence: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

Splits sentence into a list of Token objects.

class allennlp.data.tokenizers.word_splitter.SimpleWordSplitter[source]

Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

Does really simple tokenization. NLTK was too slow, so we wrote our own simple tokenizer instead. This just does an initial split(), followed by some heuristic filtering of each whitespace-delimited token, separating contractions and punctuation. We assume lower-cased, reasonably well-formed English sentences as input.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_splitter.WordSplitter[source]
split_words(sentence: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

Splits a sentence into word tokens. We handle four kinds of things: words with punctuation that should be ignored as a special case (Mr., Mrs., etc.), contractions/genitives (isn’t, don’t, Matt’s), and beginning and ending punctuation (“antennagate”, (parentheticals), and such.).

The basic outline is to split on whitespace, then check each of these cases. First, we strip off beginning punctuation, then strip off ending punctuation, then strip off contractions. When we strip something off the beginning of a word, we can add it to the list of tokens immediately. When we strip it off the end, we have to save it to be added after the word itself has been added. Before stripping off any part of a token, we first check to be sure the token isn’t in our list of special cases.
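
A short sketch of the cases described above (the exact output depends on this splitter's internal special-case and contraction lists, so the comments below are approximate):

from allennlp.data.tokenizers.word_splitter import SimpleWordSplitter

splitter = SimpleWordSplitter()
# Special-cased words like "mr." keep their punctuation, contractions such as
# "isn't" are split apart, and the surrounding quotes become their own tokens.
print([t.text for t in splitter.split_words('"mr. smith isn\'t here."')])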

class allennlp.data.tokenizers.word_splitter.SpacyWordSplitter(language: str = 'en', pos_tags: bool = False, parse: bool = False, ner: bool = False) → None[source]

Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter that uses spaCy’s tokenizer. It’s fast and reasonable; this is the recommended WordSplitter.

Parameters:

language : str, optional

We use spacy to tokenize strings; this option specifies which language to use. By default we use English.

pos_tags : bool, optional

By default we do not load spacy’s tagging model, to save loading time and memory. Set this to True if you want to have access to spacy’s POS tags in the returned tokens.

parse : bool, optional

By default we do not load spacy’s parsing model, to save loading time and memory. Set this to True if you want to have access to spacy’s dependency parse tags in the returned tokens.

ner : bool, optional

By default we do not load spacy’s NER model, to save loading time and memory. Set this to True if you want to have access to spacy’s NER tags in the returned tokens.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_splitter.WordSplitter[source]
split_words(sentence: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

Splits sentence into a list of Token objects.
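
A small sketch (loading the tagger is optional and off by default, per the pos_tags parameter above):

from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter

splitter = SpacyWordSplitter(pos_tags=True)
for token in splitter.split_words("AllenNLP uses spaCy."):
    # With pos_tags=True, the returned tokens carry spacy's part-of-speech
    # information alongside the text and character offset.
    print(token.text, token.idx, token.pos)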

class allennlp.data.tokenizers.word_splitter.WordSplitter[source]

Bases: allennlp.common.registrable.Registrable

A WordSplitter splits strings into words. This is typically called a “tokenizer” in NLP, because splitting strings into characters is trivial, but we use Tokenizer to refer to the higher-level object that splits strings into tokens (which could just be character tokens). So, we’re using “word splitter” here for this.

default_implementation = 'spacy'
classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_splitter.WordSplitter[source]
split_words(sentence: str) → typing.List[allennlp.data.tokenizers.token.Token][source]

Splits sentence into a list of Token objects.

class allennlp.data.tokenizers.word_stemmer.PassThroughWordStemmer[source]

Bases: allennlp.data.tokenizers.word_stemmer.WordStemmer

Does not stem words; it’s a no-op. This is the default word stemmer.

stem_word(word: allennlp.data.tokenizers.token.Token) → allennlp.data.tokenizers.token.Token[source]

Returns a new Token with word.text replaced by a stemmed word.

class allennlp.data.tokenizers.word_stemmer.PorterStemmer[source]

Bases: allennlp.data.tokenizers.word_stemmer.WordStemmer

Uses NLTK’s PorterStemmer to stem words.

stem_word(word: allennlp.data.tokenizers.token.Token) → allennlp.data.tokenizers.token.Token[source]

Returns a new Token with word.text replaced by a stemmed word.

class allennlp.data.tokenizers.word_stemmer.WordStemmer[source]

Bases: allennlp.common.registrable.Registrable

A WordStemmer lemmatizes words. This means that we map words to their root form, so that, e.g., “have”, “has”, and “had” all have the same internal representation.

You should think carefully about whether and how much stemming you want in your model. Kind of the whole point of using word embeddings is so that you don’t have to do this, but in a highly inflected language, or in a low-data setting, you might need it anyway. The default WordStemmer does nothing, just returning the word token as-is.

default_implementation = 'pass_through'
classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.tokenizers.word_stemmer.WordStemmer[source]
stem_word(word: allennlp.data.tokenizers.token.Token) → allennlp.data.tokenizers.token.Token[source]

Returns a new Token with word.text replaced by a stemmed word.
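
A minimal sketch of stemming, both on a single token and inside a WordTokenizer (assuming an AllenNLP version matching this documentation):

from allennlp.data.tokenizers.token import Token
from allennlp.data.tokenizers.word_stemmer import PorterStemmer
from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

stemmer = PorterStemmer()
print(stemmer.stem_word(Token("running")).text)  # a stemmed form such as 'run'

# Inside WordTokenizer, stemming runs after word filtering (see WordFilter above).
tokenizer = WordTokenizer(word_stemmer=PorterStemmer())
print([t.text for t in tokenizer.tokenize("The cats were running.")])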