tfds.features.text.TokenTextEncoder

Class TokenTextEncoder

Inherits From: TextEncoder

Defined in core/features/text/text_encoder.py.

TextEncoder backed by a list of tokens.

Tokenization splits on (and drops) non-alphanumeric characters using the regex "\W+".
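
A minimal usage sketch, assuming the module path shown on this page (tfds.features.text; newer TFDS releases expose the same classes under tfds.deprecated.text). The vocabulary, input string, and id values in the comments are illustrative:

import tensorflow_datasets as tfds

encoder = tfds.features.text.TokenTextEncoder(vocab_list=["hello", "world"])

# The default "\W+" tokenization drops the comma and exclamation mark,
# leaving ["hello", "world"], both of which are in the vocabulary.
ids = encoder.encode("hello, world!")
print(ids)                  # e.g. [1, 2]; in-vocab tokens get small positive ids
print(encoder.decode(ids))  # "hello world"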

__init__

__init__(
    vocab_list,
    oov_buckets=1,
    oov_token='UNK',
    lowercase=False,
    tokenizer=None
)

Constructs a TokenTextEncoder.

To load from a file saved with TokenTextEncoder.save_to_file, use TokenTextEncoder.load_from_file.

Args:

  • vocab_list: list<str>, list of tokens.
  • oov_buckets: int, the number of integer ids to reserve for OOV hash buckets. Tokens that are out of vocabulary (OOV) will be hash-modded into an OOV bucket in encode (see the example after this list).
  • oov_token: str, the string to use for OOV ids in decode.
  • lowercase: bool, whether to make all text and tokens lowercase.
  • tokenizer: Tokenizer, responsible for converting incoming text into a list of tokens.
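
A sketch of how these arguments interact, assuming the defaults documented above; the vocabulary, input text, and exact id values are illustrative:

import tensorflow_datasets as tfds

encoder = tfds.features.text.TokenTextEncoder(
    vocab_list=["cat", "dog"],
    oov_buckets=1,       # one id reserved for all out-of-vocabulary tokens
    oov_token="UNK",     # string emitted for OOV ids by decode
    lowercase=True)      # "Cat" and "cat" map to the same id

ids = encoder.encode("Cat dog zebra")
print(ids)                  # "zebra" is hash-modded into the single OOV bucket
print(encoder.decode(ids))  # "cat dog UNK"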

Properties

lowercase

oov_token

tokenizer

tokens

vocab_size
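
A sketch of inspecting these properties on a freshly constructed encoder with default arguments; the vocabulary is illustrative, and the vocab_size comment assumes the TFDS text-encoder convention of reserving id 0 for padding:

import tensorflow_datasets as tfds

encoder = tfds.features.text.TokenTextEncoder(vocab_list=["cat", "dog"])
print(encoder.tokens)      # the underlying token list
print(encoder.vocab_size)  # tokens + OOV buckets + the reserved padding id
print(encoder.lowercase)   # False (the default)
print(encoder.oov_token)   # 'UNK'
print(encoder.tokenizer)   # the default Tokenizer built when none is passed in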

Methods

decode

decode(ids)

encode

encode(s)
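
encode tokenizes a string and returns a list of integer ids; decode maps ids back to a space-joined string. A sketch of the round trip, which is lossy because dropped punctuation and (with lowercase=True) the original casing are not recoverable; the inputs are illustrative:

import tensorflow_datasets as tfds

encoder = tfds.features.text.TokenTextEncoder(
    vocab_list=["hello", "world"], lowercase=True)

ids = encoder.encode("Hello, world!")
print(ids)                  # a list of Python ints
print(encoder.decode(ids))  # "hello world" (punctuation and casing are gone)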

load_from_file

@classmethod
load_from_file(
    cls,
    filename_prefix
)

save_to_file

save_to_file(filename_prefix)
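
A sketch of persisting an encoder and restoring it with the classmethod; the filename prefix is illustrative, and save_to_file writes files under that prefix:

import tensorflow_datasets as tfds

encoder = tfds.features.text.TokenTextEncoder(vocab_list=["cat", "dog"])
encoder.save_to_file("/tmp/my_vocab")

# Later (or in another process), reconstruct an equivalent encoder.
restored = tfds.features.text.TokenTextEncoder.load_from_file("/tmp/my_vocab")
print(restored.tokens == encoder.tokens)  # expected: True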