tfdv.generate_statistics_from_tfrecord(
    data_location,
    output_path=None,
    stats_options=options.StatsOptions(),
    pipeline_options=None,
    compression_type=CompressionTypes.AUTO
)
Compute data statistics from TFRecord files containing TFExamples.
Runs a Beam pipeline to compute the data statistics and returns the resulting data statistics proto.
This is a convenience method for users with data in TFRecord format. Users with data in unsupported file/data formats, or users who wish to create their own Beam pipelines, need to use the 'GenerateStatistics' PTransform API directly instead.
Args:
data_location
: The location of the input data files.
output_path
: The file path to write the data statistics result to. If None, a temporary directory is used. The output is a TFRecord file containing a single data statistics proto, and can be read with the 'load_statistics' API. If you run this function on Google Cloud, you must specify an output_path; specifying None may cause an error.
stats_options
: tfdv.StatsOptions for generating data statistics.
pipeline_options
: Optional Beam pipeline options. This allows users to specify various Beam pipeline execution parameters such as the pipeline runner (DirectRunner or DataflowRunner), the Cloud Dataflow service project id, etc. See https://cloud.google.com/dataflow/pipelines/specifying-exec-params for more details.
compression_type
: Used to handle compressed input files. Default value is CompressionTypes.AUTO, in which case the file path's extension is used to detect the compression.
Returns:
A DatasetFeatureStatisticsList proto.
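Example:
The following is a minimal sketch of how the arguments above might fit together, not an official usage pattern. The helper names `require_output_path` and `compute_tfrecord_stats` are hypothetical and exist only for illustration. The guard encodes the note that an output_path must be given when running on Google Cloud, here assumed to mean the DataflowRunner. The TFDV and Beam imports are done lazily inside the function, so the module loads even where those packages are not installed.

```python
from typing import Optional


def require_output_path(output_path: Optional[str],
                        runner_name: str = "DirectRunner") -> Optional[str]:
    """Hypothetical guard: insist on an explicit output_path for Dataflow runs."""
    if runner_name == "DataflowRunner" and output_path is None:
        raise ValueError(
            "output_path must be specified when running on Google Cloud")
    return output_path


def compute_tfrecord_stats(data_location: str,
                           output_path: Optional[str] = None,
                           runner_name: str = "DirectRunner"):
    """Hypothetical wrapper around tfdv.generate_statistics_from_tfrecord."""
    # Lazy imports: only needed when the pipeline is actually run.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    output_path = require_output_path(output_path, runner_name)

    # A real Dataflow run needs further flags (project id, staging
    # locations, ...); only the runner is set in this sketch.
    pipeline_options = PipelineOptions(flags=[f"--runner={runner_name}"])

    # Returns a DatasetFeatureStatisticsList proto.
    return tfdv.generate_statistics_from_tfrecord(
        data_location=data_location,
        output_path=output_path,
        stats_options=tfdv.StatsOptions(),
        pipeline_options=pipeline_options,
    )
```

If an explicit output_path was supplied, the written file can later be read back with tfdv.load_statistics(output_path) instead of recomputing the statistics.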