Skip-gram sampling with a text vocabulary file.
```python
tfa.text.skip_gram_sample_with_text_vocab(
    input_tensor,
    vocab_freq_file,
    vocab_token_index=0,
    vocab_token_dtype=tf.dtypes.string,
    vocab_freq_index=1,
    vocab_freq_dtype=tf.dtypes.float64,
    vocab_delimiter=',',
    vocab_min_count=0,
    vocab_subsampling=None,
    corpus_size=None,
    min_skips=1,
    max_skips=5,
    start=0,
    limit=-1,
    emit_self_as_target=False,
    batch_size=None,
    batch_capacity=None,
    seed=None,
    name=None
)
```
Wrapper around `skip_gram_sample()` for use with a text vocabulary file.
The vocabulary file is expected to be a plain-text file, with lines of
`vocab_delimiter`-separated columns. The `vocab_token_index` column should
contain the vocabulary term, while the `vocab_freq_index` column should
contain the number of times that term occurs in the corpus. For example,
with a text vocabulary file of:

```
bonjour,fr,42
hello,en,777
hola,es,99
```

You should set `vocab_delimiter=","`, `vocab_token_index=0`, and
`vocab_freq_index=2`.
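To make the column layout concrete, the file above can be parsed with plain Python using those same settings. This is only an illustration of the format, not part of the API; the `tokens`/`freqs` names are hypothetical:

```python
import csv
import io

# The example vocab file from above: delimiter ",", tokens in
# column 0, frequency counts in column 2.
vocab_text = "bonjour,fr,42\nhello,en,777\nhola,es,99\n"

vocab_delimiter = ","
vocab_token_index = 0
vocab_freq_index = 2

tokens, freqs = [], []
for row in csv.reader(io.StringIO(vocab_text), delimiter=vocab_delimiter):
    tokens.append(row[vocab_token_index])
    freqs.append(float(row[vocab_freq_index]))  # vocab_freq_dtype=tf.float64

print(tokens)  # ['bonjour', 'hello', 'hola']
print(freqs)   # [42.0, 777.0, 99.0]
```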
See the `skip_gram_sample()` documentation for more details about the
skip-gram sampling process.
Args:

- `input_tensor`: A rank-1 `Tensor` from which to generate skip-gram candidates.
- `vocab_freq_file`: `string` specifying the full file path to the text vocab file.
- `vocab_token_index`: `int` specifying which column in the text vocab file contains the tokens.
- `vocab_token_dtype`: `DType` specifying the format of the tokens in the text vocab file.
- `vocab_freq_index`: `int` specifying which column in the text vocab file contains the frequency counts of the tokens.
- `vocab_freq_dtype`: `DType` specifying the format of the frequency counts in the text vocab file.
- `vocab_delimiter`: `string` specifying the delimiter used in the text vocab file.
- `vocab_min_count`: `int`, `float`, or scalar `Tensor` specifying the minimum frequency threshold (from `vocab_freq_file`) for a token to be kept in `input_tensor`. This should correspond with `vocab_freq_dtype`.
- `vocab_subsampling`: (Optional) `float` specifying the frequency proportion threshold for tokens from `input_tensor`. Tokens that occur more frequently will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details.
- `corpus_size`: (Optional) `int`, `float`, or scalar `Tensor` specifying the total number of tokens in the corpus (e.g., the sum of all the frequency counts of `vocab_freq_file`). Used with `vocab_subsampling` for down-sampling frequently occurring tokens. If this is specified, `vocab_freq_file` and `vocab_subsampling` must also be specified. If `corpus_size` is needed but not supplied, it will be calculated from `vocab_freq_file`. You might want to supply your own value if you have already eliminated infrequent tokens from your vocabulary files (where frequency < `vocab_min_count`) to save memory in the internal token lookup table; otherwise, the unused tokens' variables will waste memory. The user-supplied `corpus_size` value must be greater than or equal to the sum of all the frequency counts of `vocab_freq_file`.
- `min_skips`: `int` or scalar `Tensor` specifying the minimum window size to randomly use for each token. Must be >= 0 and <= `max_skips`. If `min_skips` and `max_skips` are both 0, the only label outputted will be the token itself.
- `max_skips`: `int` or scalar `Tensor` specifying the maximum window size to randomly use for each token. Must be >= 0.
- `start`: `int` or scalar `Tensor` specifying the position in `input_tensor` from which to start generating skip-gram candidates.
- `limit`: `int` or scalar `Tensor` specifying the maximum number of elements in `input_tensor` to use in generating skip-gram candidates. -1 means to use the rest of the `Tensor` after `start`.
- `emit_self_as_target`: `bool` or scalar `Tensor` specifying whether to emit each token as a label for itself.
- `batch_size`: (Optional) `int` specifying the batch size of the returned `Tensors`.
- `batch_capacity`: (Optional) `int` specifying the batch capacity for the queue used for batching the returned `Tensors`. Only has an effect if `batch_size` > 0. Defaults to 100 * `batch_size` if not specified.
- `seed`: (Optional) `int` used to create a random seed for window size and subsampling. See `set_random_seed` for behavior.
- `name`: (Optional) A `string` name or a name scope for the operations.
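For intuition on how `vocab_subsampling` and `corpus_size` interact, the keep-probability from Eq. 5 of the paper linked above can be sketched in plain Python. This is a simplified illustration of the commonly implemented form, not the library's exact code, and `keep_probability` is a hypothetical helper name:

```python
import math

def keep_probability(freq, corpus_size, vocab_subsampling):
    """Probability of keeping one occurrence of a token, following
    Eq. 5 of http://arxiv.org/abs/1310.4546 (capped at 1.0)."""
    ratio = freq / (vocab_subsampling * corpus_size)
    return min(1.0, (math.sqrt(ratio) + 1.0) / ratio)

# With vocab_subsampling=1e-3 over a 1M-token corpus, a token seen
# 100,000 times is kept only ~11% of the time, while rare tokens
# are always kept.
p_common = keep_probability(100_000, 1_000_000, 1e-3)
p_rare = keep_probability(50, 1_000_000, 1e-3)
print(p_common)  # 0.11
print(p_rare)    # 1.0
```

Larger `vocab_subsampling` values keep more of the frequent tokens; smaller values down-sample them more aggressively.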
Returns:

A `tuple` containing (token, label) `Tensors`. Each output `Tensor` is of
rank-1 and has the same type as `input_tensor`. The `Tensors` will be of
length `batch_size`; if `batch_size` is not specified, they will be of
random length, though they will be in sync with each other as long as
they are evaluated together.
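The (token, label) pairing can be illustrated with a plain-Python sketch of skip-gram generation over a fixed window. This is a simplification: the real op draws a random window size in [`min_skips`, `max_skips`] for every token, and `skip_gram_pairs` here is a hypothetical helper, not part of the API:

```python
def skip_gram_pairs(tokens, num_skips, emit_self_as_target=False):
    """Emit (token, label) pairs using a fixed window of num_skips on
    each side of every token."""
    pairs = []
    for i, token in enumerate(tokens):
        if emit_self_as_target:
            pairs.append((token, token))  # token labels itself
        lo = max(0, i - num_skips)
        hi = min(len(tokens), i + num_skips + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((token, tokens[j]))
    return pairs

pairs = skip_gram_pairs(["the", "quick", "brown"], num_skips=1)
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
```

Unzipping `pairs` gives the two parallel rank-1 sequences that correspond to the returned token and label `Tensors`.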
Raises:

- `ValueError`: If `vocab_token_index` or `vocab_freq_index` is less than 0 or exceeds the number of columns in `vocab_freq_file`; if `vocab_token_index` and `vocab_freq_index` are both set to the same column; or if any token in `vocab_freq_file` has a negative frequency.