tfdv.CombinerStatsGenerator

A StatsGenerator which computes statistics using a combiner function.

It processes one batch of examples at a time to produce partial states, merges those partial states, and finally computes the statistics from the merged state.

This object mirrors a beam.CombineFn except for the add_input interface, which must be defined by its subclasses. Specifically, the generator must implement the following four methods:

create_accumulator() Initializes an accumulator to store the partial state and returns it.

add_input(accumulator, input_record_batch) Incorporates a batch of input examples (represented as an Arrow RecordBatch) into the current accumulator and returns the updated accumulator.

merge_accumulators(accumulators) Merges the partial states in the accumulators and returns the accumulator containing the merged state.

extract_output(accumulator) Computes statistics from the partial state in the accumulator and returns the result as a DatasetFeatureStatistics proto.
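The four-method lifecycle above can be sketched in plain Python. This is only an illustration under stated assumptions: MeanStatsSketch, MeanAccumulator, and run_sketch are hypothetical names, the class does not subclass tfdv.CombinerStatsGenerator, each batch is a plain dict standing in for an Arrow RecordBatch, and the output is a dict rather than a DatasetFeatureStatistics proto.

```python
from typing import Dict, Iterable, List, Optional

# Stand-in for an Arrow RecordBatch: feature name -> one value list per example.
Batch = Dict[str, List[Optional[List[float]]]]


class MeanAccumulator:
    """Partial state: running sum and count of values seen so far."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0


class MeanStatsSketch:
    """Hypothetical generator mirroring the four-method contract."""

    def __init__(self, name: str, feature: str) -> None:
        self.name = name
        self.feature = feature

    def create_accumulator(self) -> MeanAccumulator:
        return MeanAccumulator()

    def add_input(self, accumulator: MeanAccumulator, input_batch: Batch) -> MeanAccumulator:
        # Fold every non-missing value list of the feature into the partial state.
        for values in input_batch.get(self.feature, []):
            if values is not None:
                accumulator.total += sum(values)
                accumulator.count += len(values)
        return accumulator

    def merge_accumulators(self, accumulators: Iterable[MeanAccumulator]) -> MeanAccumulator:
        merged = MeanAccumulator()
        for acc in accumulators:
            merged.total += acc.total
            merged.count += acc.count
        return merged

    def extract_output(self, accumulator: MeanAccumulator) -> Dict[str, object]:
        # A real generator would build a DatasetFeatureStatistics proto here.
        mean = accumulator.total / accumulator.count if accumulator.count else float("nan")
        return {"feature": self.feature, "mean": mean, "num_values": accumulator.count}


def run_sketch() -> Dict[str, object]:
    """Simulates two workers that each build a partial state, then a merge."""
    gen = MeanStatsSketch(name="mean_sketch", feature="age")
    worker_a = [{"age": [[10.0], [20.0]]}, {"age": [None, [30.0]]}]
    worker_b = [{"age": [[40.0, 50.0]]}]
    partials = []
    for batches in (worker_a, worker_b):
        acc = gen.create_accumulator()
        for batch in batches:
            acc = gen.add_input(acc, batch)
        partials.append(acc)
    return gen.extract_output(gen.merge_accumulators(partials))
```

Keeping only a sum and a count in the accumulator is what makes the merge step cheap: partial states from different workers combine by addition, and the mean is computed once at the end.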

Args
name A unique name associated with the statistics generator.
schema An optional schema for the dataset.

Attributes
name

schema

Methods

add_input

Returns the result of folding a batch of inputs into the accumulator.

Args
accumulator The current accumulator, which may be modified and returned for efficiency.
input_record_batch An Arrow RecordBatch whose columns are features and rows are examples. Each column is of type List or Null; a column is of Null type when the feature's value is None across all the examples in the batch.

Returns
The accumulator after updating the statistics for the batch of inputs.

compact

Returns a compact representation of the accumulator.

This is optionally called before an accumulator is sent across the wire. The base-class implementation is a no-op; derived classes may override it.

Args
accumulator The accumulator to compact.

Returns
The compacted accumulator. By default, this is the identity: the accumulator is returned unchanged.
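As one possible use of compact, consider an accumulator that buffers raw values for speed and only folds them into a summary on demand. The sketch below is hypothetical (BufferedSumAccumulator and the free function compact are illustrative names, not part of TFDV); it shows how compacting could shrink the state before it is serialized.

```python
class BufferedSumAccumulator:
    """Hypothetical accumulator that buffers raw values before summarizing."""

    def __init__(self) -> None:
        self.pending = []   # raw values not yet folded into the summary
        self.total = 0.0
        self.count = 0


def compact(accumulator: BufferedSumAccumulator) -> BufferedSumAccumulator:
    # Fold the buffer into the summary so that only two numbers need to be
    # serialized, then clear the buffer.
    accumulator.total += sum(accumulator.pending)
    accumulator.count += len(accumulator.pending)
    accumulator.pending = []
    return accumulator
```

In a real generator this logic would live in an overridden compact method on the subclass; whether buffering is worthwhile depends on the statistic being computed.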

create_accumulator

Returns a fresh, empty accumulator.

Returns
An empty accumulator.

extract_output

Returns the result of converting the accumulator into the output value.

Args
accumulator The final accumulator value.

Returns
A proto representing the result of this stats generator.

merge_accumulators

Merges several accumulators into a single accumulator value.

Args
accumulators The accumulators to merge.

Returns
The merged accumulator.

setup

Prepares an instance for combining.

Subclasses should put costly initializations here instead of in __init__(), so that 1) the cost is properly recognized by Beam as setup cost (per worker) and 2) the cost is not paid at pipeline construction time.
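The split between construction and setup might look like the following minimal sketch. LazyVocabSketch and its lookup method are hypothetical, the dictionary build stands in for a genuinely expensive load, and setup is called by hand here, whereas in a pipeline Beam would be expected to invoke it on each worker.

```python
class LazyVocabSketch:
    """Hypothetical generator that defers a costly resource to setup()."""

    def __init__(self, vocab_size: int) -> None:
        # Cheap: only record configuration at pipeline construction time.
        self._vocab_size = vocab_size
        self._vocab = None

    def setup(self) -> None:
        # Stand-in for an expensive initialization (for example, loading a
        # large lookup table), paid once per worker rather than at construction.
        self._vocab = {f"token_{i}": i for i in range(self._vocab_size)}

    def lookup(self, token: str) -> int:
        if self._vocab is None:
            raise RuntimeError("setup() must be called before lookup()")
        return self._vocab.get(token, -1)
```

Keeping the constructor cheap also keeps the object small if it has to be serialized and shipped to workers before setup runs.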