Text tokenization utility class.

This class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, based on word count, based on tf-idf, and so on.
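The two output forms described above can be illustrated with a minimal pure-Python sketch. It is not the class's implementation: the helper names `build_index`, `to_sequence`, and `to_vector` are hypothetical, tokenization is reduced to whitespace splitting, and only the binary and count modes are shown.

```python
from collections import Counter

def build_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() orders by count, descending; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, index):
    """Turn a text into the list of its word indices."""
    return [index[w] for w in text.split() if w in index]

def to_vector(text, index, mode="binary"):
    """Turn a text into a fixed-length vector with one slot per index."""
    vec = [0] * (len(index) + 1)  # slot 0 stays unused
    for word, count in Counter(text.split()).items():
        if word in index:
            vec[index[word]] = 1 if mode == "binary" else count
    return vec

texts = ["the cat sat", "the cat ate the fish"]
index = build_index(texts)
seq = to_sequence(texts[1], index)
binary_vec = to_vector(texts[1], index, mode="binary")
count_vec = to_vector(texts[1], index, mode="count")
```

The sequence keeps word order and repeats, while the vector has one fixed position per vocabulary entry and discards order.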

num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
lower: boolean. Whether to convert the texts to lowercase.
split: str. Separator for word splitting.
char_level: if True, every character will be treated as a token.
oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.
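How num_words and oov_token interact can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: `encode` is a hypothetical helper, a word is assumed to be kept only when its index is below num_words (which is why only num_words-1 words survive), and the out-of-vocabulary token is assumed to occupy index 1 when it is given.

```python
def encode(words, word_index, num_words=None, oov_token=None):
    """Replace words by indices, applying the num_words cut-off."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    out = []
    for word in words:
        i = word_index.get(word)
        in_vocab = i is not None and (num_words is None or i < num_words)
        if in_vocab:
            out.append(i)
        elif oov_index is not None:
            out.append(oov_index)  # unknown or too-rare word
        # with no oov_token, the word is dropped silently
    return out

words = ["the", "cat", "sat", "on", "the", "mat"]

# Without an OOV token: indices 1 and 2 survive num_words=3.
plain_index = {"the": 1, "cat": 2, "sat": 3, "mat": 4}
dropped = encode(words, plain_index, num_words=3)

# With an OOV token (assumed to take index 1, shifting real words up by one).
oov_index_map = {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4, "mat": 5}
replaced = encode(words, oov_index_map, num_words=3, oov_token="<OOV>")
```

Under these assumptions, setting oov_token preserves sequence length, at the cost of one vocabulary slot.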

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized.
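The default filtering and splitting step can be approximated as follows. This is a sketch, not the library's code: `to_word_sequence` and `DEFAULT_FILTERS` are names chosen for illustration, and the filter string is assumed to match the default described above (all punctuation except ', plus tab and newline).

```python
# Assumed default filter set: punctuation without the apostrophe, plus \t and \n.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters by the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings left by consecutive separators.
    return [tok for tok in text.translate(table).split(split) if tok]

tokens = to_word_sequence("Don't panic, it's fine!\n")
```

Replacing filtered characters with the separator, rather than deleting them, keeps words such as "panic,it's" from fusing together.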

0 is a reserved index that won't be assigned to any word.
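One common reason to keep 0 free is that it can then serve as a padding value that never collides with a real word. The sketch below assumes that use; `pad_left` is a hypothetical helper, not part of this class.

```python
def pad_left(sequences, value=0):
    """Left-pad integer sequences with `value` so they share one length."""
    width = max((len(s) for s in sequences), default=0)
    return [[value] * (width - len(s)) + list(s) for s in sequences]

# Word indices start at 1, so every 0 in the result is unambiguous padding.
padded = pad_left([[1, 2, 3], [4]])
```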




Updates internal vocabulary based on a list of se