tfa.text.skip_gram_sample_with_text_vocab

Skip-gram sampling with a text vocabulary file.

tfa.text.skip_gram_sample_with_text_vocab(
    input_tensor,
    vocab_freq_file,
    vocab_token_index=0,
    vocab_token_dtype=tf.dtypes.string,
    vocab_freq_index=1,
    vocab_freq_dtype=tf.dtypes.float64,
    vocab_delimiter=',',
    vocab_min_count=0,
    vocab_subsampling=None,
    corpus_size=None,
    min_skips=1,
    max_skips=5,
    start=0,
    limit=-1,
    emit_self_as_target=False,
    batch_size=None,
    batch_capacity=None,
    seed=None,
    name=None
)

Wrapper around skip_gram_sample() for use with a text vocabulary file. The vocabulary file is expected to be a plain-text file, with lines of vocab_delimiter-separated columns. The vocab_token_index column should contain the vocabulary term, while the vocab_freq_index column should contain the number of times that term occurs in the corpus. For example, with a text vocabulary file of:

bonjour,fr,42
hello,en,777
hola,es,99

You should set vocab_delimiter=",", vocab_token_index=0, and vocab_freq_index=2.

See skip_gram_sample() documentation for more details about the skip-gram sampling process.
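
For example, a minimal usage sketch against the three-line vocabulary file above (the file name vocab_freq.csv and the input sentence are illustrative assumptions, not part of the API):

import tensorflow as tf
import tensorflow_addons as tfa

# Assumed vocabulary file "vocab_freq.csv" containing the three lines above:
#   bonjour,fr,42
#   hello,en,777
#   hola,es,99
input_tensor = tf.constant(["hello", "hola", "bonjour", "hello", "hola"])

tokens, labels = tfa.text.skip_gram_sample_with_text_vocab(
    input_tensor,
    vocab_freq_file="vocab_freq.csv",
    vocab_token_index=0,
    vocab_freq_index=2,
    vocab_delimiter=",",
    min_skips=1,
    max_skips=2,
    seed=42)
# tokens and labels are rank-1 string Tensors of equal length, pairing each
# token with one of its neighbors within a randomly chosen window.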

Args:

  • input_tensor: A rank-1 Tensor from which to generate skip-gram candidates.
  • vocab_freq_file: string specifying full file path to the text vocab file.
  • vocab_token_index: int specifying which column in the text vocab file contains the tokens.
  • vocab_token_dtype: DType specifying the format of the tokens in the text vocab file.
  • vocab_freq_index: int specifying which column in the text vocab file contains the frequency counts of the tokens.
  • vocab_freq_dtype: DType specifying the format of the frequency counts in the text vocab file.
  • vocab_delimiter: string specifying the delimiter used in the text vocab file.
  • vocab_min_count: int, float, or scalar Tensor specifying minimum frequency threshold (from vocab_freq_file) for a token to be kept in input_tensor. This should correspond with vocab_freq_dtype.
  • vocab_subsampling: (Optional) float specifying frequency proportion threshold for tokens from input_tensor. Tokens that occur more frequently will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details.
  • corpus_size: (Optional) int, float, or scalar Tensor specifying the total number of tokens in the corpus (e.g., the sum of all the frequency counts in vocab_freq_file). Used with vocab_subsampling for down-sampling frequently occurring tokens. If specified, vocab_freq_file and vocab_subsampling must also be specified; if it is needed but not supplied, it will be calculated from vocab_freq_file. You might want to supply your own value if you have already eliminated infrequent tokens (where frequency < vocab_min_count) from your vocabulary file to save memory in the internal token lookup table; otherwise, those unused tokens will still take up space in the table. A user-supplied corpus_size must be greater than or equal to the sum of all the frequency counts in vocab_freq_file. See the sketch after this list for example values.
  • min_skips: int or scalar Tensor specifying the minimum window size to randomly use for each token. Must be >= 0 and <= max_skips. If min_skips and max_skips are both 0, the only label emitted will be the token itself.
  • max_skips: int or scalar Tensor specifying the maximum window size to randomly use for each token. Must be >= 0.
  • start: int or scalar Tensor specifying the position in input_tensor from which to start generating skip-gram candidates.
  • limit: int or scalar Tensor specifying the maximum number of elements in input_tensor to use in generating skip-gram candidates. -1 means to use the rest of the Tensor after start.
  • emit_self_as_target: bool or scalar Tensor specifying whether to emit each token as a label for itself.
  • batch_size: (Optional) int specifying batch size of returned Tensors.
  • batch_capacity: (Optional) int specifying batch capacity for the queue used for batching returned Tensors. Only has an effect if batch_size > 0. Defaults to 100 * batch_size if not specified.
  • seed: (Optional) int used to create a random seed for window size and subsampling. See tf.random.set_seed for behavior.
  • name: (Optional) A string name or a name scope for the operations.
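
As a hedged illustration of vocab_min_count, vocab_subsampling, and corpus_size (the file and threshold values are assumptions for the vocabulary above; 918 is simply 42 + 777 + 99, the sum of the frequency column):

# Keep only tokens with frequency >= 50 (drops "bonjour", freq 42), and
# randomly down-sample tokens whose frequency proportion exceeds 1e-3.
tokens, labels = tfa.text.skip_gram_sample_with_text_vocab(
    input_tensor,
    vocab_freq_file="vocab_freq.csv",
    vocab_token_index=0,
    vocab_freq_index=2,
    vocab_min_count=50,
    vocab_subsampling=1e-3,
    corpus_size=918,  # must be >= the sum of the frequency counts
    min_skips=1,
    max_skips=5)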

Returns:

A tuple containing (token, label) Tensors. Each output Tensor has rank 1 and the same type as input_tensor. If batch_size is specified, the Tensors will have length batch_size; otherwise, they will have non-deterministic length, though they will stay in sync with each other as long as they are evaluated together.
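
For example, assuming eager execution, the paired outputs can be consumed directly (a sketch, not part of the API):

pairs = tf.stack([tokens, labels], axis=1)  # shape [N, 2]: (token, label) rows
for token, label in pairs.numpy():
    print(token.decode(), "->", label.decode())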

Raises:

  • ValueError: If vocab_token_index or vocab_freq_index is less than 0 or exceeds the number of columns in vocab_freq_file; if vocab_token_index and vocab_freq_index are both set to the same column; or if any token in vocab_freq_file has a negative frequency.