tfdv.NonStreamingCustomStatsGenerator

Estimates custom statistics in a non-streaming fashion.

Inherits From: TransformStatsGenerator

A TransformStatsGenerator that partitions the input data and calls the user-specified stats_fn on each partition. Meta-statistics are then calculated over the statistics returned by stats_fn to estimate the true value of each statistic. For invalid feature values, the worker computing the PartitionedStatsFn over a partition may "gracefully fail" and not report that statistic (refer to PartitionedStatsFn for more information). Meta-statistics for a feature are only calculated if the statistic was successfully computed on enough partitions (see min_partitions_stat_presence).
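The partition-then-aggregate idea described above can be sketched in plain Python. This is a minimal illustration, not the TFDV implementation: the function name estimate_stat, the choice of mean and standard deviation as meta-statistics, the use of None to model a graceful failure, and treating min_partitions_stat_presence as an inclusive minimum are all assumptions made for the example.

```python
import random
import statistics
from typing import Callable, Dict, List, Optional, Sequence


def estimate_stat(
    values: Sequence[float],
    stats_fn: Callable[[Sequence[float]], Optional[float]],
    num_partitions: int,
    min_partitions_stat_presence: int,
    seed: int,
) -> Optional[Dict[str, float]]:
    """Estimates a statistic from per-partition results (illustrative only)."""
    rng = random.Random(seed)

    # Assign every example to one of num_partitions partitions at random.
    partitions: List[List[float]] = [[] for _ in range(num_partitions)]
    for value in values:
        partitions[rng.randrange(num_partitions)].append(value)

    # Run stats_fn on each partition. A None result models a partition
    # on which the computation "gracefully failed".
    results = []
    for partition in partitions:
        result = stats_fn(partition)
        if result is not None:
            results.append(result)

    # Assumption: the threshold is an inclusive minimum. Report nothing
    # if the statistic succeeded on too few partitions.
    if len(results) < min_partitions_stat_presence:
        return None

    # Meta-statistics over the per-partition results.
    return {
        "mean": statistics.mean(results),
        "std_dev": statistics.pstdev(results),
        "num_partitions": float(len(results)),
    }
```

A caller would pass a stats_fn that returns None when it cannot compute a value for a partition (for example, when the partition is empty), and would treat a None return from estimate_stat as "statistic not reported".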

A large number of examples in a partition may result in worker OOM errors. This can be prevented by setting max_examples_per_partition.
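One way to bound the number of examples held for a partition is reservoir sampling, which keeps a uniform random sample of fixed size while reading the examples one at a time. The sketch below uses that technique purely as an illustration; whether TFDV samples this way is not stated here, and the function name cap_partition is hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def cap_partition(
    examples: Iterable[T], max_examples_per_partition: int, seed: int
) -> List[T]:
    """Keeps at most max_examples_per_partition examples, chosen uniformly."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, example in enumerate(examples):
        if i < max_examples_per_partition:
            # Fill the reservoir with the first k examples.
            reservoir.append(example)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < max_examples_per_partition:
                reservoir[j] = example
    return reservoir
```

Memory use stays proportional to max_examples_per_partition regardless of how many examples the partition receives, and partitions smaller than the cap are returned unchanged.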

Args

stats_fn: The PartitionedStatsFn that will be run on each partition.
num_partitions: The number of partitions the stat will be calculated on.
min_partitions_stat_presence: The minimum number of partitions a stat computation must succeed in for the result to be returned.
seed: An int used to seed the numpy random number generator.
max_examples_per_partition: An integer specifying the maximum number of examples per partition, used to limit memory usage in a worker. If a partition contains more examples than this value, a random subset of this size is selected.
batch_size: Number of examples per input batch.
name: An optional unique name associated with the statistics generator.

Attributes

name

ptransform

schema