tfds.features.text.TokenTextEncoder

Class TokenTextEncoder

TextEncoder backed by a list of tokens.

Inherits From: TextEncoder

Tokenization splits on (and drops) non-alphanumeric characters with regex "\W+".
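
The splitting rule above can be illustrated with Python's standard re module. This is a minimal sketch of the pattern only, not the library's Tokenizer; the helper name simple_tokenize is hypothetical.

```python
import re

# The pattern quoted in the documentation: one or more non-word characters.
NON_ALPHANUM = re.compile(r"\W+")


def simple_tokenize(text):
    """Split on runs of non-word characters and drop empty pieces."""
    return [piece for piece in NON_ALPHANUM.split(text) if piece]


tokens = simple_tokenize("Hello, world! It's 2024.")
print(tokens)
```

Note that in Python's re module `\W` does not match the underscore, so a string such as "snake_case" stays a single token under this pattern.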

__init__

__init__(
    vocab_list,
    oov_buckets=1,
    oov_token='UNK',
    lowercase=False,
    tokenizer=None,
    strip_vocab=True,
    decode_token_separator=' '
)

Constructs a TokenTextEncoder.

To load from a file saved with TokenTextEncoder.save_to_file, use TokenTextEncoder.load_from_file.

Args:

  • vocab_list: list<str>, list of tokens.
  • oov_buckets: int, the number of ints to reserve for out-of-vocabulary (OOV) hash buckets. During encode, each OOV token is hashed, modulo oov_buckets, into one of these buckets.
  • oov_token: str, the string to use for OOV ids in decode.
  • lowercase: bool, whether to make all text and tokens lowercase.
  • tokenizer: Tokenizer, responsible for converting incoming text into a list of tokens.
  • strip_vocab: bool, whether to strip whitespace from the beginning and end of elements of vocab_list.
  • decode_token_separator: str, the string used to separate tokens when decoding.
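
To make the roles of vocab_list and oov_buckets concrete, here is a rough sketch of how tokens might be mapped to ids. It is an illustration under stated assumptions, not the library's code: the function toy_encode is hypothetical, id 0 is assumed to be reserved for padding, and MD5 is used only as an example of a stable hash.

```python
import hashlib


def toy_encode(tokens, vocab_list, oov_buckets=1):
    """Map tokens to ids; id 0 is assumed to be reserved for padding."""
    # In-vocabulary tokens get ids 1..len(vocab_list).
    token_to_id = {token: index + 1 for index, token in enumerate(vocab_list)}
    ids = []
    for token in tokens:
        if token in token_to_id:
            ids.append(token_to_id[token])
            continue
        if oov_buckets <= 0:
            raise ValueError("Out-of-vocabulary token: %r" % token)
        # Hash the token, then take it modulo the number of OOV buckets.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % oov_buckets
        # OOV ids follow directly after the in-vocabulary ids.
        ids.append(len(vocab_list) + 1 + bucket)
    return ids


vocab = ["hello", "world"]
ids = toy_encode(["hello", "there", "world"], vocab, oov_buckets=1)
print(ids)
```

With a single bucket every OOV token shares one id; raising oov_buckets spreads unknown tokens over more ids at the cost of a larger id range.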

Properties

lowercase

oov_token

tokenizer

tokens

vocab_size

Size of the vocabulary. encode produces ints in the range [1, vocab_size).

Methods

decode

decode(ids)

Decodes a list of integers into text.
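
As a rough illustration of how oov_token and decode_token_separator come into play when decoding, the sketch below reverses a 1-based id mapping. The function toy_decode is hypothetical, and treating id 0 as padding to be skipped is an assumption of this sketch rather than documented behavior.

```python
def toy_decode(ids, vocab_list, oov_token="UNK", decode_token_separator=" "):
    """Turn ids back into text; ids beyond the vocabulary become oov_token."""
    pieces = []
    for int_id in ids:
        if int_id == 0:
            # Assumption: id 0 is padding and carries no token.
            continue
        index = int_id - 1  # ids are assumed to start at 1
        if index < len(vocab_list):
            pieces.append(vocab_list[index])
        else:
            pieces.append(oov_token)
    return decode_token_separator.join(pieces)


text = toy_decode([1, 3, 2], ["hello", "world"])
print(text)
```

Because every OOV id decodes to the same oov_token, decoding is lossy for text that contained unknown tokens.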

encode

encode(s)

Encodes text into a list of integers.

load_from_file

@classmethod
load_from_file(
    cls,
    filename_prefix
)

Loads a TokenTextEncoder from a file written by save_to_file. Inverse of save_to_file.

save_to_file

save_to_file(filename_prefix)

Saves the encoder to a file derived from filename_prefix. Inverse of load_from_file.
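
The inverse relationship between the two methods can be pictured with a small round trip. This sketch does not reproduce the library's on-disk format: the functions toy_save_to_file and toy_load_from_file, the ".toy.json" suffix, and the JSON layout are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def toy_save_to_file(filename_prefix, vocab_list, oov_buckets=1, oov_token="UNK"):
    """Write the encoder settings to <filename_prefix>.toy.json (hypothetical)."""
    path = filename_prefix + ".toy.json"
    state = {
        "vocab_list": list(vocab_list),
        "oov_buckets": oov_buckets,
        "oov_token": oov_token,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    return path


def toy_load_from_file(filename_prefix):
    """Read back the settings written by toy_save_to_file."""
    with open(filename_prefix + ".toy.json", encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    prefix = os.path.join(tmp_dir, "vocab")
    toy_save_to_file(prefix, ["hello", "world"], oov_buckets=2)
    restored = toy_load_from_file(prefix)

print(restored)
```

The point of the sketch is only that both calls take the same filename_prefix, so the caller never needs to know the full file name.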