tfa.text.skip_gram_sample_with_text_vocab

Skip-gram sampling with a text vocabulary file.

tfa.text.skip_gram_sample_with_text_vocab(
    input_tensor,
    vocab_freq_file,
    vocab_token_index=0,
    vocab_token_dtype=tf.dtypes.string,
    vocab_freq_index=1,
    vocab_freq_dtype=tf.dtypes.float64,
    vocab_delimiter=',',
    vocab_min_count=0,
    vocab_subsampling=None,
    corpus_size=None,
    min_skips=1,
    max_skips=5,
    start=0,
    limit=-1,
    emit_self_as_target=False,
    batch_size=None,
    batch_capacity=None,
    seed=None,
    name=None
)

Wrapper around skip_gram_sample() for use with a text vocabulary file. The vocabulary file is expected to be a plain-text file, with lines of vocab_delimiter-separated columns. The vocab_token_index column should contain the vocabulary term, while the vocab_freq_index column should contain the number of times that term occurs in the corpus. For example, with a text vocabulary file of:

bonjour,fr,42
hello,en,777
hola,es,99

You should set vocab_delimiter=",", vocab_token_index=0, and vocab_freq_index=2.

See skip_gram_sample() documentation for more details about the skip-gram sampling process.
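
For example, a minimal usage sketch against the three-line vocabulary file above (the file name vocab_freq.csv and the input sentence are illustrative assumptions, not part of the API):

import tensorflow as tf
import tensorflow_addons as tfa

# Assumed vocabulary file "vocab_freq.csv" containing the three lines above:
#   bonjour,fr,42
#   hello,en,777
#   hola,es,99
input_tensor = tf.constant(["hello", "hola", "bonjour", "hello", "hola"])

tokens, labels = tfa.text.skip_gram_sample_with_text_vocab(
    input_tensor,
    vocab_freq_file="vocab_freq.csv",
    vocab_token_index=0,
    vocab_freq_index=2,
    vocab_delimiter=",",
    min_skips=1,
    max_skips=2,
    seed=42)
# tokens and labels are rank-1 string Tensors of equal length, pairing each
# token with one of its neighbors within a randomly chosen window.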

Args:

  • input_tensor: A rank-1 Tensor from which to generate skip-gram candidates.
  • vocab_freq_file: string specifying full file path to the text vocab file.
  • vocab_token_index: int specifying which column in the text vocab file contains the tokens.
  • vocab_token_dtype: DType specifying the format of the tokens in the text vocab file.
  • vocab_freq_index: int specifying which column in the text vocab file contains the frequency counts of the tokens.
  • vocab_freq_dtype: DType specifying the format of the frequency counts in the text vocab file.
  • vocab_delimiter: string specifying the delimiter used in the text vocab file.
  • vocab_min_count: int, float, or scalar Tensor specifying minimum frequency threshold (from vocab_freq_file) for a token to be kept in input_tensor. This should correspond with vocab_freq_dtype.
  • vocab_subsampling: (Optional) float specifying frequency proportion threshold for tokens from input_tensor. Tokens that occur more frequently will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details.
  • corpus_size: (Optional) int, float, or scalar Tensor specifying the total number of tokens in the corpus (e.g., the sum of all the frequency counts in vocab_freq_file). Used with vocab_subsampling for down-sampling frequently occurring tokens. If specified, vocab_freq_file and vocab_subsampling must also be specified; if it is needed but not supplied, it will be calculated from vocab_freq_file. You might want to supply your own value if you have already eliminated infrequent tokens (where frequency < vocab_min_count) from your vocabulary file to save memory in the internal token lookup table; otherwise, those unused tokens will still take up space in the table. A user-supplied corpus_size must be greater than or equal to the sum of all the frequency counts in vocab_freq_file. See the sketch after this list for example values.
  • min_skips: int or scalar Tensor specifying the minimum window size to randomly use for each token. Must be >= 0 and <= max_skips. If min_skips and max_skips are both 0, the only label emitted will be the token itself.
  • max_skips: int or scalar Tensor specifying the maximum window size to randomly use for each token. Must be >= 0.
  • start: int or scalar Tensor specifying the position in input_tensor from which to start generating skip-gram candidates.
  • limit: int or scalar Tensor specifying the maximum number of elements in input_tensor to use in generating skip-gram candidates. -1 means to use the rest of the Tensor after start.
  • emit_self_as_target: bool or scalar Tensor specifying whether to emit each token as a label for itself.
  • batch_size: (Optional) int specifying batch size of returned Tensors.
  • batch_capacity: (Optional) int specifying batch capacity for the queue used for batching returned Tensors. Only has an effect if batch_size > 0. Defaults to 100 * batch_size if not specified.
  • seed: (Optional) int used to create a random seed for window size and subsampling. See tf.random.set_seed for behavior.
  • name: (Optional) A string name or a name scope for the operations.
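
As a hedged illustration of vocab_min_count, vocab_subsampling, and corpus_size (the file and threshold values are assumptions for the vocabulary above; 918 is simply 42 + 777 + 99, the sum of the frequency column):

# Keep only tokens with frequency >= 50 (drops "bonjour", freq 42), and
# randomly down-sample tokens whose frequency proportion exceeds 1e-3.
tokens, labels = tfa.text.skip_gram_sample_with_text_vocab(
    input_tensor,
    vocab_freq_file="vocab_freq.csv",
    vocab_token_index=0,
    vocab_freq_index=2,
    vocab_min_count=50,
    vocab_subsampling=1e-3,
    corpus_size=918,  # must be >= the sum of the frequency counts
    min_skips=1,
    max_skips=5)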

Returns:

A tuple containing (token, label) Tensors. Each output Tensor has rank 1 and the same type as input_tensor. If batch_size is specified, the Tensors will have length batch_size; otherwise, they will have non-deterministic length, though they will stay in sync with each other as long as they are evaluated together.
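
For example, assuming eager execution, the paired outputs can be consumed directly (a sketch, not part of the API):

pairs = tf.stack([tokens, labels], axis=1)  # shape [N, 2]: (token, label) rows
for token, label in pairs.numpy():
    print(token.decode(), "->", label.decode())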

Raises:

  • ValueError: If vocab_token_index or vocab_freq_index is less than 0 or exceeds the number of columns in vocab_freq_file; if vocab_token_index and vocab_freq_index are both set to the same column; or if any token in vocab_freq_file has a negative frequency.