Estimates custom statistics in a non-streaming fashion.
Inherits From: TransformStatsGenerator
tfdv.NonStreamingCustomStatsGenerator(
stats_fn: PartitionedStatsFn,
num_partitions: int,
min_partitions_stat_presence: int,
seed: int,
max_examples_per_partition: int,
batch_size: int = 1000,
name: Text = 'NonStreamingCustomStatsGenerator'
) -> None
A TransformStatsGenerator that partitions the input data and calls the user-specified stats_fn on each partition. Meta-statistics are then calculated over the statistics returned by stats_fn to estimate the true value of the statistic. For invalid feature values, the worker computing PartitionedStatsFn over a partition may "gracefully fail" and not report that statistic (refer to PartitionedStatsFn for more information). Meta-statistics for a feature are only calculated if the statistic was successfully computed in enough partitions; this threshold is set by min_partitions_stat_presence.
A large number of examples in a partition may result in worker OOM errors. This can be prevented by setting max_examples_per_partition.
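The partition-then-aggregate idea described above can be illustrated with a small NumPy sketch. This is a conceptual illustration only, not TFDV's implementation: the function names `estimate_statistic` and `median_or_fail` are hypothetical, the random assignment of examples to partitions is an assumption, the choice of mean and standard deviation as meta-statistics is an assumption, and "graceful failure" is modelled here as a `ValueError`. The threshold is treated as "at least min_partitions_stat_presence successful partitions", which is also an assumption.

```python
import numpy as np


def median_or_fail(partition):
    """Hypothetical per-partition statistic that fails on invalid input."""
    if partition.size == 0 or np.isnan(partition).any():
        raise ValueError("invalid feature values")
    return float(np.median(partition))


def estimate_statistic(values, stats_fn, num_partitions,
                       min_partitions_stat_presence, seed):
    """Conceptual sketch: estimate a statistic from per-partition results."""
    rng = np.random.default_rng(seed)
    # Assumption: each example is assigned to a partition uniformly at random.
    assignments = rng.integers(0, num_partitions, size=len(values))

    results = []
    for p in range(num_partitions):
        partition = values[assignments == p]
        try:
            results.append(stats_fn(partition))
        except ValueError:
            # "Graceful failure": this partition reports no statistic.
            continue

    # Too few successful partitions: report nothing for this feature.
    if len(results) < min_partitions_stat_presence:
        return None

    results = np.asarray(results, dtype=float)
    # Assumption: mean and standard deviation stand in for the meta-statistics.
    return {
        "mean": float(results.mean()),
        "std_dev": float(results.std()),
        "count": int(results.size),
    }


values = np.arange(1000, dtype=float)
estimate = estimate_statistic(values, median_or_fail, num_partitions=10,
                              min_partitions_stat_presence=5, seed=0)
```

If the values contain a NaN, only the partition that receives it fails, and the estimate is built from the remaining partitions as long as enough of them succeed.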
Args | |
---|---|
stats_fn | The PartitionedStatsFn that will be run on each sample. |
num_partitions | The number of partitions the stat will be calculated on. |
min_partitions_stat_presence | The minimum number of partitions a stat computation must succeed in for the result to be returned. |
seed | An int used to seed the numpy random number generator. |
max_examples_per_partition | An integer used to specify the maximum number of examples per partition to limit memory usage in a worker. If the number of examples per partition exceeds this value, the examples are randomly selected. |
batch_size | Number of examples per input batch. |
name | An optional unique name associated with the statistics generator. |
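To make the interaction between seed and max_examples_per_partition concrete, here is a minimal sketch of capping a partition by seeded random selection. It is an illustration under stated assumptions, not the library's code: `cap_partition` is a hypothetical helper, and sampling without replacement via `numpy.random.default_rng` is an assumed strategy.

```python
import numpy as np


def cap_partition(examples, max_examples_per_partition, seed):
    """Hypothetical helper: keep at most max_examples_per_partition examples."""
    if len(examples) <= max_examples_per_partition:
        # Small partitions are passed through unchanged.
        return examples
    rng = np.random.default_rng(seed)
    # Assumption: pick a random subset of indices without replacement.
    keep = rng.choice(len(examples), size=max_examples_per_partition,
                      replace=False)
    # Sort the indices so the kept examples preserve their original order.
    return examples[np.sort(keep)]


partition = np.arange(10_000)
capped = cap_partition(partition, max_examples_per_partition=1_000, seed=42)
```

Using the same seed yields the same subset, which keeps repeated runs over the same data reproducible.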
Attributes | |
---|---|
name | |
ptransform | |
schema | |