tfds.features.text.Tokenizer

Class Tokenizer

Splits a string into tokens, and joins them back.

__init__

__init__(
    alphanum_only=True,
    reserved_tokens=None
)

Constructs a Tokenizer.

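Note that the Tokenizer is invertible if alphanum_only=False, i.e. for a Tokenizer t and any string s, s == t.join(t.tokenize(s)). With alphanum_only=True, non-alphanumeric characters are dropped, so the original string cannot in general be recovered.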

Args:

  • alphanum_only: bool, if True, only parse out alphanumeric tokens (non-alphanumeric characters are dropped); otherwise, keep all characters (individual tokens will still be either all alphanumeric or all non-alphanumeric).
  • reserved_tokens: list<str>, a list of strings that, if any appear in the string being tokenized, will be preserved as whole tokens, even if they contain mixed alphanumeric/non-alphanumeric characters.
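
The effect of the two arguments can be illustrated with a small standard-library sketch. This is not the tfds implementation: `sketch_tokenize` is a hypothetical stand-in written only to mirror the behavior described above, and it assumes that "alphanumeric" corresponds to the regex class `\w` (which also accepts underscores).

```python
import re

# Capturing group so re.split keeps the non-alphanumeric runs as pieces.
# Assumption: \w (which includes "_") approximates "alphanumeric" here.
_NON_ALNUM_RUN = re.compile(r"(\W+)")


def sketch_tokenize(s, alphanum_only=True, reserved_tokens=None):
    """Hypothetical stand-in for the documented behavior, not the tfds code."""
    reserved = list(reserved_tokens or [])
    if reserved:
        # Split out reserved tokens first; longest first so overlaps match greedily.
        alternatives = "|".join(
            re.escape(tok) for tok in sorted(reserved, key=len, reverse=True)
        )
        pieces = re.split("(" + alternatives + ")", s)
    else:
        pieces = [s]

    tokens = []
    for piece in pieces:
        if piece in reserved:
            tokens.append(piece)  # kept whole, even if it mixes character classes
            continue
        for run in _NON_ALNUM_RUN.split(piece):
            if not run:
                continue  # re.split can yield empty strings at the edges
            if alphanum_only and _NON_ALNUM_RUN.fullmatch(run):
                continue  # drop punctuation and whitespace runs
            tokens.append(run)
    return tokens


text = "Hi <EOS> there, you!"
alnum_tokens = sketch_tokenize(text, reserved_tokens=["<EOS>"])
all_tokens = sketch_tokenize(text, alphanum_only=False, reserved_tokens=["<EOS>"])
print(alnum_tokens)  # ['Hi', '<EOS>', 'there', 'you']
print(all_tokens)    # ['Hi', ' ', '<EOS>', ' ', 'there', ', ', 'you', '!']
```

In the second call every character of the input ends up in some token, which is what makes the alphanum_only=False mode invertible.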

Properties

alphanum_only

reserved_tokens

Methods

join

join(tokens)

Joins tokens into a string.
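
A minimal sketch of why the round trip s == t.join(t.tokenize(s)) holds only when alphanum_only=False. The function `sketch_round_trip` is hypothetical and uses plain regular expressions rather than the Tokenizer class; the single-space separator in the alphanumeric-only branch is an assumption, since the separator is not specified here.

```python
import re


def sketch_round_trip(s, alphanum_only):
    """Tokenize then join; a hypothetical stand-in for t.join(t.tokenize(s))."""
    if alphanum_only:
        tokens = re.findall(r"\w+", s)   # non-alphanumeric runs are discarded
        return " ".join(tokens)          # assumed separator: a single space
    tokens = re.findall(r"\w+|\W+", s)   # every character lands in some token
    return "".join(tokens)               # plain concatenation restores the input


original = "Hello, world!"
print(sketch_round_trip(original, alphanum_only=False))  # Hello, world!
print(sketch_round_trip(original, alphanum_only=True))   # Hello world
```

With alphanum_only=True the comma and exclamation mark are gone after tokenizing, so no join strategy can bring them back.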

load_from_file

@classmethod
load_from_file(
    cls,
    filename_prefix
)

Loads a Tokenizer from file; the inverse of save_to_file.

save_to_file

save_to_file(filename_prefix)

Saves the Tokenizer to file; the inverse of load_from_file.

tokenize

tokenize(s)

Splits a string into tokens.