tfds.features.text.TokenTextEncoder

TextEncoder backed by a list of tokens.

Inherits From: TextEncoder

tfds.features.text.TokenTextEncoder(
    vocab_list, oov_buckets=1, oov_token='UNK', lowercase=False, tokenizer=None,
    strip_vocab=True, decode_token_separator=' '
)

By default (when no tokenizer is passed), tokenization splits on (and drops) non-alphanumeric characters with the regex "\W+".

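A minimal usage sketch (the vocabulary and sentence below are illustrative):

import tensorflow_datasets as tfds

encoder = tfds.features.text.TokenTextEncoder(
    vocab_list=['hello', 'world'], lowercase=True)

# The default tokenizer splits on "\W+", so the punctuation is dropped.
ids = encoder.encode('Hello, world!')
print(ids)                  # e.g. [1, 2]; ids start at 1
print(encoder.decode(ids))  # 'hello world'
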
Args:

  • vocab_list: list<str>, list of tokens.
  • oov_buckets: int, the number of ints to reserve for OOV hash buckets. Tokens that are OOV will be hash-modded into an OOV bucket in encode.
  • oov_token: str, the string to use for OOV ids in decode.
  • lowercase: bool, whether to make all text and tokens lowercase.
  • tokenizer: Tokenizer, responsible for converting incoming text into a list of tokens.
  • strip_vocab: bool, whether to strip whitespace from the beginning and end of elements of vocab_list.
  • decode_token_separator: str, the string used to separate tokens when decoding.

Attributes:

  • lowercase
  • oov_token
  • tokenizer
  • tokens
  • vocab_size: Size of the vocabulary. encode produces ints in [1, vocab_size); see the example below.

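Assuming id 0 is reserved for padding (consistent with the [1, vocab_size) range above), vocab_size is expected to be the number of vocabulary tokens plus oov_buckets plus one:

encoder = tfds.features.text.TokenTextEncoder(
    vocab_list=['hello', 'world'], oov_buckets=1)
print(encoder.vocab_size)  # expected: 2 tokens + 1 OOV bucket + 1 pad id = 4
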
Methods

decode

decode(
    ids
)

Decodes a list of integers into text.

encode

encode(
    s
)

Encodes text into a list of integers.

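A short sketch of the expected out-of-vocabulary behaviour with the default single OOV bucket: unknown tokens are hash-modded into an OOV id on encode, and decode renders that id as oov_token.

encoder = tfds.features.text.TokenTextEncoder(vocab_list=['hello', 'world'])

ids = encoder.encode('hello moon')
print(ids)                  # e.g. [1, 3]; 'moon' falls into the OOV bucket
print(encoder.decode(ids))  # e.g. 'hello UNK'
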
load_from_file

@classmethod
load_from_file(
    cls, filename_prefix
)

Load from file. Inverse of save_to_file.

save_to_file

save_to_file(
    filename_prefix
)

Store to file. Inverse of load_from_file.
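
A round-trip sketch (the prefix '/tmp/my_encoder' is illustrative; the exact files written under that prefix are an implementation detail):

encoder = tfds.features.text.TokenTextEncoder(vocab_list=['hello', 'world'])
encoder.save_to_file('/tmp/my_encoder')

restored = tfds.features.text.TokenTextEncoder.load_from_file('/tmp/my_encoder')
assert restored.encode('hello world') == encoder.encode('hello world')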