TF 2.0 is out! Get hands-on practice at TF World, Oct 28-31. Use code TF20 for 20% off select passes. Register now

tfds.features.text.SubwordTextEncoder

View source

Class SubwordTextEncoder

Invertible TextEncoder using word pieces with a byte-level fallback.

Inherits From: TextEncoder

Used in the tutorials:

Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded.

The vocabulary is "trained" on a corpus and all wordpieces are stored in a vocabulary file. To generate a vocabulary from a corpus, use tfds.features.text.SubwordTextEncoder.build_from_corpus.

Typical usage:

# Build
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator, target_vocab_size=2**15)
encoder.save_to_file(vocab_filename)

# Load
encoder = tfds.features.text.SubwordTextEncoder.load_from_file(vocab_filename)
ids = encoder.encode("hello world")
text = encoder.decode([1, 2, 3, 4])

__init__

View source

__init__(vocab_list=None)

Constructs a SubwordTextEncoder from a vocabulary list.

Args:

  • vocab_list: list<str>, list of subwords for the vocabulary. Note that an underscore at the end of a subword indicates the end of the word (i.e. a space will be inserted afterwards when decoding). Underscores in the interior of subwords are disallowed and should use the underscore escape sequence.

Properties

subwords

vocab_size

Methods

build_from_corpus

View source

@classmethod
build_from_corpus(
    cls,
    corpus_generator,
    target_vocab_size,
    max_subword_length=20,
    max_corpus_chars=None,
    reserved_tokens=None
)

Builds a SubwordTextEncoder based on the corpus_generator.

Args:

  • corpus_generator: generator yielding str, from which subwords will be constructed.
  • target_vocab_size: int, approximate size of the vocabulary to create.
  • max_subword_length: int, maximum length of a subword. Note that memory and compute scale quadratically in the length of the longest token.
  • max_corpus_chars: int, the maximum number of characters to consume from corpus_generator for the purposes of building the subword vocabulary.
  • reserved_tokens: list<str>, list of tokens that will always be treated as whole tokens and not split up. Note that these must contain a mix of alphanumeric and non-alphanumeric characters (e.g. "") and not end in an underscore.

Returns:

SubwordTextEncoder.

decode

View source

decode(ids)

Decodes a list of integers into text.

encode

View source

encode(s)

Encodes text into a list of integers.

load_from_file

View source

@classmethod
load_from_file(
    cls,
    filename_prefix
)

Extracts list of subwords from file.

save_to_file

View source

save_to_file(filename_prefix)

Save the vocabulary to a file.