tfds.features.text.Tokenizer

Class Tokenizer

Defined in core/features/text/text_encoder.py.

Splits a string into tokens, and joins them back.

__init__

__init__(
    alphanum_only=True,
    reserved_tokens=None
)

Constructs a Tokenizer.

Note that the Tokenizer is invertible if alphanum_only=False, i.e. s == t.join(t.tokenize(s)) for a Tokenizer t and any string s.

Args:

  • alphanum_only: bool, if True, only parse out alphanumeric tokens (non-alphanumeric characters are dropped); otherwise, keep all characters (individual tokens will still be either all alphanumeric or all non-alphanumeric).
  • reserved_tokens: list<str>, a list of strings that, if present in the input string s, will be preserved as whole tokens, even if they contain mixed alphanumeric/non-alphanumeric characters.
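
The effect of the two arguments can be illustrated with a small standalone sketch. It is not the library's implementation: sketch_tokenize and sketch_join are hypothetical helpers built on Python's re module, and the sketch assumes that "alphanumeric" corresponds to the regex class \w (which also matches underscores).

```python
import re


def sketch_tokenize(s, alphanum_only=True, reserved_tokens=None):
    """Hypothetical stand-in that mimics the documented splitting behavior."""
    reserved = list(reserved_tokens or [])
    if reserved:
        # The capture group makes re.split keep the reserved tokens in its output.
        pattern = "(" + "|".join(re.escape(t) for t in reserved) + ")"
        pieces = re.split(pattern, s)
    else:
        pieces = [s]

    tokens = []
    for piece in pieces:
        if piece in reserved:
            # Reserved tokens are preserved whole, even if they mix character types.
            tokens.append(piece)
        elif alphanum_only:
            # Keep only runs of word characters; everything else is dropped.
            tokens.extend(re.findall(r"\w+", piece))
        else:
            # Keep everything: alternate runs of word and non-word characters.
            tokens.extend(re.split(r"(\W+)", piece))
    # re.split can yield empty strings at the boundaries; discard them.
    return [t for t in tokens if t]


def sketch_join(tokens):
    """Inverse of sketch_tokenize for the alphanum_only=False case only."""
    return "".join(tokens)


s = "Hello, <EOS> world!"
print(sketch_tokenize(s, alphanum_only=True, reserved_tokens=["<EOS>"]))
print(sketch_tokenize(s, alphanum_only=False, reserved_tokens=["<EOS>"]))
```

With alphanum_only=False no character is discarded, so concatenating the tokens reproduces s. With alphanum_only=True the punctuation and whitespace are lost and the original string cannot be recovered.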

Properties

alphanum_only

reserved_tokens

Methods

join

join(tokens)

Joins tokens into a string.

load_from_file

@classmethod
load_from_file(
    cls,
    filename_prefix
)

save_to_file

save_to_file(filename_prefix)

tokenize

tokenize(s)

Splits a string into tokens.