tfds.features.text.TokenTextEncoder


TextEncoder backed by a list of tokens.

Inherits From: TextEncoder


By default, tokenization splits on (and drops) non-alphanumeric characters using the regex "\W+".
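The default splitting behavior can be sketched in plain Python. This is an illustration of the documented "\W+" rule, not the tfds tokenizer itself:

```python
import re

def default_tokenize(text):
    # Split on runs of non-alphanumeric characters and drop the empty
    # strings left at the edges -- a sketch of the documented "\W+" rule.
    return [tok for tok in re.split(r"\W+", text) if tok]

print(default_tokenize("hello, world!"))  # ['hello', 'world']
```

Note that `\W` excludes the underscore, so a token like `a_b` survives as one piece.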

Args

vocab_list: list<str>, the list of tokens in the vocabulary.
oov_buckets: int, the number of ids to reserve for out-of-vocabulary (OOV) hash buckets. Tokens that are OOV are hash-modded into an OOV bucket by encode.
oov_token: str, the string substituted for OOV ids by decode.
lowercase: bool, whether to lowercase all text and tokens.
tokenizer: Tokenizer, responsible for converting incoming text into a list of tokens.
strip_vocab: bool, whether to strip whitespace from the beginning and end of each element of vocab_list.
decode_token_separator: str, the string used to join tokens when decoding.

Attributes

lowercase

oov_token

tokenizer

tokens

vocab_size: Size of the vocabulary. encode produces ids in [1, vocab_size); id 0 is reserved for padding.

Methods

decode


Decodes a list of integers into text.

encode


Encodes text into a list of integers.
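A minimal sketch of the documented encode behavior: known tokens map to their 1-based vocabulary index, and OOV tokens are hash-modded into one of the bucket ids just past the vocabulary. The helper below is hypothetical (tfds uses its own tokenizer and hashing; Python's built-in hash() stands in here):

```python
import re

def encode(text, vocab, oov_buckets=1):
    # Sketch: 1-based ids for in-vocab tokens; OOV tokens hash-modded
    # into one of oov_buckets ids after the vocabulary.
    ids = []
    for token in (t for t in re.split(r"\W+", text) if t):
        if token in vocab:
            ids.append(vocab.index(token) + 1)
        else:
            ids.append(len(vocab) + 1 + hash(token) % oov_buckets)
    return ids

vocab = ["hello", "world"]
print(encode("hello there world", vocab))  # [1, 3, 2] with one OOV bucket
```

With the default single OOV bucket, every OOV token collapses to the same id (here 3), so decode can only render it as the oov_token placeholder.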

load_from_file


Load from file. Inverse of save_to_file.

save_to_file


Store to file. Inverse of load_from_file.
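The save/load pair can be sketched as a vocabulary round-trip. The one-token-per-line format and the ".tokens" suffix below are assumptions for illustration; the actual on-disk format is a tfds internal:

```python
import os
import tempfile

def save_vocab(vocab, filename_prefix):
    # Persist the vocabulary one token per line (assumed format).
    with open(filename_prefix + ".tokens", "w") as f:
        f.write("\n".join(vocab))

def load_vocab(filename_prefix):
    # Inverse of save_vocab: rebuild the token list from the file.
    with open(filename_prefix + ".tokens") as f:
        return f.read().split("\n")

prefix = os.path.join(tempfile.mkdtemp(), "encoder")
save_vocab(["hello", "world"], prefix)
print(load_vocab(prefix))  # ['hello', 'world']
```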