Generates skip-gram token and label paired Tensors from the input tensor.

    tfa.text.skip_gram_sample(
        input_tensor,
        min_skips=1,
        max_skips=5,
        start=0,
        limit=-1,
        emit_self_as_target=False,
        vocab_freq_table=None,
        vocab_min_count=None,
        vocab_subsampling=None,
        corpus_size=None,
        batch_size=None,
        batch_capacity=None,
        seed=None,
        name=None
    )

Generates skip-gram ("token", "label") pairs using each element in the
rank-1 input_tensor as a token. The window size used for each token will
be randomly selected from the range [min_skips, max_skips], inclusive.
See https://arxiv.org/abs/1301.3781 for more details about skip-gram.
For example, given

    input_tensor = ["the", "quick", "brown", "fox", "jumps"]
    min_skips = 1
    max_skips = 2
    emit_self_as_target = False

the output (tokens, labels) pairs for the token "quick" will be randomly
selected from either

    (tokens=["quick", "quick"], labels=["the", "brown"])

for 1 skip, or

    (tokens=["quick", "quick", "quick"], labels=["the", "brown", "fox"])

for 2 skips.
If emit_self_as_target = True, each token will also be emitted as a label
for itself. From the previous example, the output will be either

    (tokens=["quick", "quick", "quick"], labels=["the", "quick", "brown"])

for 1 skip, or

    (tokens=["quick", "quick", "quick", "quick"],
     labels=["the", "quick", "brown", "fox"])

for 2 skips.
The same process is repeated for each element of input_tensor, and the
results are concatenated together into the two output rank-1 Tensors (one
for all the tokens, another for all the corresponding labels).
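As a minimal usage sketch (assuming TensorFlow 2.x eager execution and the
tensorflow_addons package; output lengths vary with the random window sizes):

    import tensorflow as tf
    import tensorflow_addons as tfa

    corpus = tf.constant(["the", "quick", "brown", "fox", "jumps"])
    # Pair each token with 1-2 randomly chosen neighbors on each side.
    tokens, labels = tfa.text.skip_gram_sample(
        corpus, min_skips=1, max_skips=2, seed=42)
    # tokens and labels are rank-1 string Tensors of equal length.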
If vocab_freq_table is specified, tokens in input_tensor that are not
present in the vocabulary are discarded. Tokens whose frequency counts are
below vocab_min_count are also discarded. Tokens whose frequency
proportions in the corpus exceed vocab_subsampling may be randomly
down-sampled. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details
about subsampling.
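A hedged sketch of vocabulary filtering and subsampling (the counts and
thresholds below are made-up toy values; tf.lookup.StaticHashTable is one
InitializableLookupTableBase subclass usable as vocab_freq_table):

    import tensorflow as tf
    import tensorflow_addons as tfa

    keys = tf.constant(["the", "quick", "brown", "fox"])
    counts = tf.constant([100, 30, 20, 5], dtype=tf.int64)
    freq_table = tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(keys, counts), default_value=-1)

    tokens, labels = tfa.text.skip_gram_sample(
        tf.constant(["the", "quick", "brown", "fox", "jumps"]),
        min_skips=1, max_skips=2,
        vocab_freq_table=freq_table,
        vocab_min_count=10,      # drops "fox" (count 5); "jumps" is not in the vocab
        vocab_subsampling=0.05,  # frequent tokens like "the" may be randomly dropped
        corpus_size=155,         # sum of all counts; same units as the table values
        seed=42)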
Due to the random window sizes used for each token, the lengths of the
outputs are non-deterministic, unless batch_size is specified to batch the
outputs to always return Tensors of length batch_size.
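For instance (a sketch under the same assumptions as above, reusing the
corpus tensor from the first sketch):

    tokens, labels = tfa.text.skip_gram_sample(corpus, min_skips=1, max_skips=3)
    # The length differs from run to run, but the two outputs always line up.
    assert tokens.shape == labels.shape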
Args:
  input_tensor: A rank-1 Tensor from which to generate skip-gram candidates.
  min_skips: int or scalar Tensor specifying the minimum window size to
    randomly use for each token. Must be >= 0 and <= max_skips. If
    min_skips and max_skips are both 0, the only label outputted will be
    the token itself when emit_self_as_target = True; otherwise there is
    no output.
  max_skips: int or scalar Tensor specifying the maximum window size to
    randomly use for each token. Must be >= 0.
  start: int or scalar Tensor specifying the position in input_tensor from
    which to start generating skip-gram candidates.
  limit: int or scalar Tensor specifying the maximum number of elements in
    input_tensor to use in generating skip-gram candidates. -1 means to use
    the rest of the Tensor after start. (See the sketch after this argument
    list.)
  emit_self_as_target: bool or scalar Tensor specifying whether to emit
    each token as a label for itself.
  vocab_freq_table: (Optional) A lookup table (subclass of
    lookup.InitializableLookupTableBase) that maps tokens to their raw
    frequency counts. If specified, any token in input_tensor that is not
    found in vocab_freq_table will be filtered out before generating
    skip-gram candidates. While this will typically map to integer raw
    frequency counts, it could also map to float frequency proportions.
    vocab_min_count and corpus_size should be in the same units as this.
  vocab_min_count: (Optional) int, float, or scalar Tensor specifying the
    minimum frequency threshold (from vocab_freq_table) for a token to be
    kept in input_tensor. If this is specified, vocab_freq_table must also
    be specified, and they should both be in the same units.
  vocab_subsampling: (Optional) float specifying the frequency proportion
    threshold for tokens from input_tensor. Tokens that occur more
    frequently (based on the ratio of the token's vocab_freq_table value to
    the corpus_size) will be randomly down-sampled. Reasonable starting
    values may be around 1e-3 or 1e-5. If this is specified, both
    vocab_freq_table and corpus_size must also be specified. See Eq. 5 in
    http://arxiv.org/abs/1310.4546 for more details.
  corpus_size: (Optional) int, float, or scalar Tensor specifying the total
    number of tokens in the corpus (e.g., the sum of all the frequency
    counts of vocab_freq_table). Used with vocab_subsampling for
    down-sampling frequently occurring tokens. If this is specified,
    vocab_subsampling must also be specified.
  batch_size: (Optional) int specifying the batch size of the returned
    Tensors.
  batch_capacity: (Optional) int specifying the batch capacity for the
    queue used for batching the returned Tensors. Only has an effect if
    batch_size > 0. Defaults to 100 * batch_size if not specified.
  seed: (Optional) int used to create a random seed for window size and
    subsampling. See set_random_seed docs for behavior.
  name: (Optional) A string name or a name scope for the operations.
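A hedged illustration of start and limit (reusing the corpus tensor from
the first sketch above):

    # Generate candidates only from a 3-token slice beginning at index 1.
    tokens, labels = tfa.text.skip_gram_sample(
        corpus, min_skips=1, max_skips=1, start=1, limit=3, seed=7)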
Returns:
  A tuple containing (token, label) Tensors. Each output Tensor is of
  rank-1 and has the same type as input_tensor. The Tensors will be of
  length batch_size; if batch_size is not specified, they will be of
  random length, though they will be in sync with each other as long as
  they are evaluated together.
Raises:
  ValueError: If vocab_freq_table is not provided, but vocab_min_count,
    vocab_subsampling, or corpus_size is specified. If vocab_subsampling
    and corpus_size are not both present or both absent.