tfa.text.skip_gram_sample_with_text_vocab

Skip-gram sampling with a text vocabulary file.

Wrapper around skip_gram_sample() for use with a text vocabulary file. The vocabulary file is expected to be a plain-text file, with lines of vocab_delimiter-separated columns. The vocab_token_index column should contain the vocabulary term, while the vocab_freq_index column should contain the number of times that term occurs in the corpus. For example, with a text vocabulary file of:

  bonjour,fr,42
  hello,en,777
  hola,es,99

you should set vocab_delimiter=",", vocab_token_index=0, and vocab_freq_index=2.
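To make the column settings concrete, here is a minimal pure-Python sketch of how the example file maps to a token-to-frequency table. It is an illustration only and does not reproduce the library's internal lookup table. The helper name read_vocab and the in-memory VOCAB_TEXT string are hypothetical, and the frequency column is assumed to hold integers.

```python
import csv
import io

# The example vocabulary file from above, held in memory for illustration.
VOCAB_TEXT = "bonjour,fr,42\nhello,en,777\nhola,es,99\n"


def read_vocab(lines, delimiter=",", token_index=0, freq_index=2):
    """Map each token to its frequency count (hypothetical helper)."""
    reader = csv.reader(lines, delimiter=delimiter)
    # Skip blank lines; pick the token and frequency columns by index.
    return {row[token_index]: int(row[freq_index]) for row in reader if row}


vocab = read_vocab(io.StringIO(VOCAB_TEXT))
print(vocab)  # {'bonjour': 42, 'hello': 777, 'hola': 99}
```

The middle column (the language code) is ignored because neither index points at it.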

See skip_gram_sample() documentation for more details about the skip-gram sampling process.
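As a rough picture of that sampling process, the sketch below generates (token, label) pairs in plain Python. It is not the library's implementation: the function name skip_gram_pairs is hypothetical, and it assumes that the window size is drawn uniformly from [min_skips, max_skips] inclusive, that windows are symmetric and truncated at the sequence boundaries, and that start and limit slice the input before any windows are drawn.

```python
import random


def skip_gram_pairs(tokens, min_skips=1, max_skips=5, start=0, limit=-1,
                    emit_self_as_target=False, seed=None):
    """Return (tokens, labels) lists of skip-gram pairs (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: start/limit select a slice, and context never leaves it.
    window_input = tokens[start:] if limit < 0 else tokens[start:start + limit]
    out_tokens, out_labels = [], []
    for i, token in enumerate(window_input):
        # Assumption: window size is uniform over [min_skips, max_skips].
        skips = rng.randint(min_skips, max_skips)
        first = max(0, i - skips)
        last = min(len(window_input), i + skips + 1)
        for j in range(first, last):
            if j == i and not emit_self_as_target:
                continue  # do not pair a token with itself
            out_tokens.append(token)
            out_labels.append(window_input[j])
    return out_tokens, out_labels


sentence = ["the", "quick", "brown", "fox", "jumps"]
# With min_skips == max_skips the window is fixed, so the result is deterministic.
toks, labels = skip_gram_pairs(sentence, min_skips=1, max_skips=1)
print(list(zip(toks, labels)))
```

Both returned lists have the same length, one entry per generated pair, which mirrors the pair of rank-1 outputs described for this function.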

input_tensor A rank-1 Tensor from which to generate skip-gram candidates.
vocab_freq_file string specifying the full file path to the text vocab file.
vocab_token_index int specifying which column in the text vocab file contains the tokens.
vocab_token_dtype DType specifying the format of the tokens in the text vocab file.
vocab_freq_index int specifying which column in the text vocab file contains the frequency counts of the tokens.
vocab_freq_dtype DType specifying the format of the frequency counts in the text vocab file.
vocab_delimiter string specifying the delimiter used in the text vocab file.
vocab_min_count int, float, or scalar Tensor specifying minimum frequency threshold (from vocab_freq_file) for a token to be kept in input_tensor. This should correspond with vocab_freq_dtype.
vocab_subsampling (Optional) float specifying frequency proportion threshold for tokens from input_tensor. Tokens whose proportion of the corpus exceeds this threshold will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details.
corpus_size (Optional) int, float, or scalar Tensor specifying the total number of tokens in the corpus (e.g., sum of all the frequency counts of vocab_freq_file). Used with vocab_subsampling for down-sampling frequently occurring tokens. If this is specified, vocab_freq_file and vocab_subsampling must also be specified. If corpus_size is needed but not supplied, it will be calculated from vocab_freq_file. You might want to supply your own value if you have already removed infrequent tokens (where frequency < vocab_min_count) from your vocabulary file: removing them saves memory in the internal token lookup table, which would otherwise hold entries for tokens that are never used, but it also means the file's counts no longer add up to the full corpus size. The user-supplied corpus_size value must be greater than or equal to the sum of all the frequency counts of vocab_freq_file.
min_skips int or scalar Tensor specifying the minimum window size to randomly use for each token. Must be >= 0 and <= max_skips. If min_skips and max_skips are both 0, the only label emitted will be the token itself.
max_skips int or scalar Tensor specifying the maximum window size to randomly use for each token. Must be >= 0.
start int or scalar Tensor specifying the position in input_tensor from which to start generating skip-gram candidates.
limit int or scalar Tensor specifying the maximum number of elements in input_tensor to use in generating skip-gram candidates. -1 means to use the rest of the Tensor after start.
emit_self_as_target bool or scalar Tensor specifying whether to emit each token as a label for itself.
seed (Optional) int used to create a random seed for window size and subsampling. See set_random_seed for behavior.
name (Optional) A string name or a name scope for the operations.
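The interaction between vocab_min_count, vocab_subsampling, and corpus_size can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, out-of-vocabulary tokens are assumed to have frequency 0, and the keep probability is taken directly from Eq. 5 of the cited paper (discard with probability 1 - sqrt(t / f)), whereas the library may use a variant of that formula.

```python
import math
import random


def keep_probability(freq, corpus_size, threshold):
    """Probability of keeping a token under Eq. 5 of arXiv:1310.4546."""
    proportion = freq / corpus_size  # f(w): the token's share of the corpus
    # Discard probability is 1 - sqrt(t / f), so keep = sqrt(t / f), capped at 1.
    return min(1.0, math.sqrt(threshold / proportion))


def filter_and_subsample(tokens, vocab_freq, vocab_min_count=None,
                         vocab_subsampling=None, corpus_size=None, seed=None):
    """Drop rare tokens, then randomly down-sample frequent ones (sketch)."""
    rng = random.Random(seed)
    if corpus_size is None:
        # Fall back to the sum of all frequency counts in the vocabulary.
        corpus_size = sum(vocab_freq.values())
    kept = []
    for token in tokens:
        freq = vocab_freq.get(token, 0)  # assumption: unknown tokens count as 0
        if vocab_min_count is not None and freq < vocab_min_count:
            continue
        if vocab_subsampling is not None and freq > 0:
            if rng.random() >= keep_probability(freq, corpus_size, vocab_subsampling):
                continue
        kept.append(token)
    return kept


vocab_freq = {"bonjour": 42, "hello": 777, "hola": 99}
corpus = ["hello", "bonjour", "hola", "hello"]
print(filter_and_subsample(corpus, vocab_freq, vocab_min_count=50))
print(keep_probability(777, sum(vocab_freq.values()), 0.1))
```

With this toy vocabulary, "bonjour" falls below a vocab_min_count of 50 and is dropped, while "hello", which makes up most of the corpus, gets the lowest keep probability.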

Returns a tuple containing (token, label) Tensors. Each output Tensor is rank-1 and has the same type as input_tensor.

Raises ValueError if vocab_token_index or vocab_freq_index is less than 0 or exceeds the number of columns in vocab_freq_file, if vocab_token_index and vocab_freq_index are both set to the same column, or if any token in vocab_freq_file has a negative frequency.
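The three error conditions can be mirrored in a small pure-Python check. This is a sketch for reasoning about valid arguments, not the library's validation code: check_vocab_columns is a hypothetical helper, column indices are assumed to be 0-based, and the column count is assumed to come from the first row.

```python
def check_vocab_columns(rows, token_index, freq_index):
    """Raise ValueError for the conditions listed above (illustrative only)."""
    num_columns = len(rows[0]) if rows else 0
    for name, index in (("vocab_token_index", token_index),
                        ("vocab_freq_index", freq_index)):
        # Assumption: indices are 0-based, so valid values are 0..num_columns-1.
        if index < 0 or index >= num_columns:
            raise ValueError(f"{name}={index} is outside 0..{num_columns - 1}")
    if token_index == freq_index:
        raise ValueError("token and frequency indices refer to the same column")
    for row in rows:
        if float(row[freq_index]) < 0:
            raise ValueError(f"negative frequency for token {row[token_index]!r}")


rows = [["bonjour", "fr", "42"], ["hello", "en", "777"], ["hola", "es", "99"]]
check_vocab_columns(rows, token_index=0, freq_index=2)  # passes silently

for bad_args in ((0, 3), (-1, 2), (2, 2)):
    try:
        check_vocab_columns(rows, *bad_args)
    except ValueError as err:
        print("rejected:", err)
```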