TextEncoder backed by a list of tokens.
Inherits From: TextEncoder
tfds.deprecated.text.TokenTextEncoder(
    vocab_list, oov_buckets=1, oov_token='UNK', lowercase=False,
    tokenizer=None, strip_vocab=True, decode_token_separator=' '
)
Tokenization splits on (and drops) non-alphanumeric characters with regex "\W+".
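For example, a minimal sketch of constructing an encoder and encoding text with the default tokenizer (the ids shown assume ids are assigned in vocabulary order starting at 1, with 0 reserved for padding):

```python
import tensorflow_datasets as tfds

# Two-token vocabulary; ids are assumed to follow list order starting at 1,
# with 0 reserved for padding.
encoder = tfds.deprecated.text.TokenTextEncoder(vocab_list=["hello", "world"])

# The default tokenizer splits on the regex "\W+", so the comma and
# exclamation mark are dropped.
print(encoder.encode("hello, world!"))  # expected: [1, 2]
```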
Args | |
---|---|
`vocab_list` | `list<str>`, list of tokens. |
`oov_buckets` | `int`, the number of ints to reserve for OOV hash buckets. Tokens that are OOV will be hash-modded into an OOV bucket in `encode`. |
`oov_token` | `str`, the string to use for OOV ids in `decode`. |
`lowercase` | `bool`, whether to make all text and tokens lowercase. |
`tokenizer` | `Tokenizer`, responsible for converting incoming text into a list of tokens. |
`strip_vocab` | `bool`, whether to strip whitespace from the beginning and end of elements of `vocab_list`. |
`decode_token_separator` | `str`, the string used to separate tokens when decoding. |
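To illustrate `oov_buckets`, `oov_token`, and `lowercase` together, here is a hedged sketch; the exact id an OOV token lands on depends on the hash function, so only its range (the bucket ids after the vocabulary) is predictable:

```python
import tensorflow_datasets as tfds

encoder = tfds.deprecated.text.TokenTextEncoder(
    vocab_list=["hello", "world"],
    oov_buckets=1,       # one id reserved for all out-of-vocabulary tokens
    oov_token="UNK",     # string emitted by decode() for OOV ids
    lowercase=True,      # "Hello" and "hello" map to the same id
)

# "Hello" is lowercased and found in the vocabulary; "goodbye" is OOV and is
# hash-modded into the single OOV bucket (the id following the vocab ids).
print(encoder.encode("Hello goodbye"))  # e.g. [1, 3] (assumes 2 vocab ids + 1 bucket)
```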
Attributes | |
---|---|
`lowercase` | |
`oov_token` | |
`tokenizer` | |
`tokens` | |
`vocab_size` | Size of the vocabulary. `decode` produces ints in `[1, vocab_size)`. |
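As a sanity check on how these attributes relate (the exact formula is an assumption consistent with ids starting at 1 and OOV buckets placed after the vocabulary; verify against your installed TFDS version):

```python
import tensorflow_datasets as tfds

encoder = tfds.deprecated.text.TokenTextEncoder(vocab_list=["a", "b", "c"],
                                                oov_buckets=1)

# vocab_size is assumed to be len(vocab_list) + oov_buckets + 1 (the extra 1
# accounts for the reserved 0/padding id).
print(encoder.vocab_size)  # expected: 5 under that assumption
print(encoder.tokens)      # assumed to return the vocabulary: ['a', 'b', 'c']
```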
Methods
decode
decode(
    ids
)
Decodes a list of integers into text.
encode
encode(
    s
)
Encodes text into a list of integers.
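A hedged round-trip sketch combining `encode` and `decode`; because OOV ids decode to the `oov_token`, the round trip is lossy for out-of-vocabulary words:

```python
import tensorflow_datasets as tfds

encoder = tfds.deprecated.text.TokenTextEncoder(vocab_list=["hello", "world"])

ids = encoder.encode("hello there world")
text = encoder.decode(ids)

# "there" is OOV, so it comes back as the oov_token, with tokens joined by
# the decode_token_separator (both defaults shown in the constructor above).
print(text)  # e.g. "hello UNK world"
```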
load_from_file
@classmethod
load_from_file(
    filename_prefix
)
Load from file. Inverse of save_to_file.
save_to_file
save_to_file(
    filename_prefix
)
Store to file. Inverse of load_from_file.
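A sketch of persisting and restoring the encoder; the prefix below is a hypothetical path, and the exact files written under it are an implementation detail:

```python
import tensorflow_datasets as tfds

encoder = tfds.deprecated.text.TokenTextEncoder(vocab_list=["hello", "world"])

# filename_prefix is a hypothetical path; save_to_file writes the vocabulary
# and options under this prefix.
encoder.save_to_file("/tmp/my_token_encoder")

# load_from_file reconstructs an equivalent encoder from the same prefix.
restored = tfds.deprecated.text.TokenTextEncoder.load_from_file(
    "/tmp/my_token_encoder")
assert restored.encode("hello world") == encoder.encode("hello world")
```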