tfds.features.text.TokenTextEncoder

Class TokenTextEncoder

TextEncoder backed by a list of tokens.

Inherits From: TextEncoder

Tokenization splits on (and drops) non-alphanumeric characters with regex "\W+".
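The default splitting behavior described above can be reproduced with the standard library alone. This is a minimal sketch of that tokenization rule, not the library's Tokenizer class itself (which has additional options):

```python
import re

def tokenize(text):
    # Split on (and drop) runs of non-alphanumeric characters,
    # mirroring the default "\W+" tokenization described above.
    return [t for t in re.split(r"\W+", text) if t]

tokens = tokenize("Hello, world! It's 2019.")
# Punctuation is dropped; "It's" splits into two tokens.
```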

__init__

__init__(
    vocab_list,
    oov_buckets=1,
    oov_token='UNK',
    lowercase=False,
    tokenizer=None,
    strip_vocab=True,
    decode_token_separator=' '
)

Constructs a TokenTextEncoder.

To load from a file saved with TokenTextEncoder.save_to_file, use TokenTextEncoder.load_from_file.

Args:

  • vocab_list: list<str>, list of tokens.
  • oov_buckets: int, the number of integer ids to reserve for OOV hash buckets. Tokens that are OOV will be hash-modded into an OOV bucket in encode.
  • oov_token: str, the string to use for OOV ids in decode.
  • lowercase: bool, whether to make all text and tokens lowercase.
  • tokenizer: Tokenizer, responsible for converting incoming text into a list of tokens.
  • strip_vocab: bool, whether to strip whitespace from the beginning and end of elements of vocab_list.
  • decode_token_separator: str, the string used to separate tokens when decoding.
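To make the arguments above concrete, here is a pure-Python sketch of the encoding scheme they describe. It is illustrative only: the real TokenTextEncoder may assign ids differently (for example, reserving ids for padding), and its OOV hashing is not specified here, so the exact ids below are an assumption of this sketch:

```python
import re

class SketchTokenEncoder:
    """Illustrative sketch of a token-list encoder; NOT the real class."""

    def __init__(self, vocab_list, oov_buckets=1, oov_token="UNK",
                 lowercase=False, strip_vocab=True,
                 decode_token_separator=" "):
        if strip_vocab:
            # strip_vocab: strip whitespace from vocab elements.
            vocab_list = [t.strip() for t in vocab_list]
        if lowercase:
            vocab_list = [t.lower() for t in vocab_list]
        self._tokens = list(vocab_list)
        self._ids = {t: i for i, t in enumerate(self._tokens)}
        self._oov_buckets = oov_buckets
        self._oov_token = oov_token
        self._lowercase = lowercase
        self._sep = decode_token_separator

    def encode(self, s):
        if self._lowercase:
            s = s.lower()
        ids = []
        for tok in (t for t in re.split(r"\W+", s) if t):
            if tok in self._ids:
                ids.append(self._ids[tok])
            else:
                # OOV tokens are hash-modded into one of the OOV buckets
                # (hash() here stands in for the library's hash function).
                ids.append(len(self._tokens) + hash(tok) % self._oov_buckets)
        return ids

    def decode(self, ids):
        # Ids beyond the vocabulary decode to oov_token; tokens are
        # joined with decode_token_separator.
        out = [self._tokens[i] if i < len(self._tokens) else self._oov_token
               for i in ids]
        return self._sep.join(out)

enc = SketchTokenEncoder(["hello", "world"])
ids = enc.encode("hello unseen")
text = enc.decode(ids)
```

Note how encode and decode are not exact inverses: any OOV token decodes to oov_token, which is why oov_token exists at all.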

Properties

lowercase

oov_token

tokenizer

tokens

vocab_size

Methods

decode

decode(ids)

encode

encode(s)

load_from_file

@classmethod
load_from_file(
    cls,
    filename_prefix
)

save_to_file

save_to_file(filename_prefix)
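The save/load pair is designed as a round trip: save_to_file writes files under a filename prefix, and the load_from_file classmethod reconstructs the encoder from that same prefix. The real on-disk format is an implementation detail of the library; this sketch assumes a simple one-token-per-line text file purely to illustrate the prefix-based round-trip pattern:

```python
import os
import tempfile

# Hypothetical helpers: the actual file format and extension used by
# TokenTextEncoder.save_to_file are not specified here.
def save_vocab(vocab_list, filename_prefix):
    path = filename_prefix + ".tokens"
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab_list))
    return path

def load_vocab(filename_prefix):
    path = filename_prefix + ".tokens"
    with open(path, encoding="utf-8") as f:
        return f.read().split("\n")

prefix = os.path.join(tempfile.mkdtemp(), "encoder")
save_vocab(["hello", "world"], prefix)
vocab = load_vocab(prefix)
```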