Compute data statistics from TFRecord files containing TFExamples.
tfdv.generate_statistics_from_tfrecord(
    data_location: Text,
    output_path: Optional[bytes] = None,
    stats_options: tfdv.StatsOptions = options.StatsOptions(),
    pipeline_options: Optional[PipelineOptions] = None,
    compression_type: Text = CompressionTypes.AUTO
) -> statistics_pb2.DatasetFeatureStatisticsList
Runs a Beam pipeline to compute the data statistics and returns the resulting
data statistics proto.

This is a convenience method for users with data in TFRecord format. Users with
data in an unsupported file or data format, or who wish to create their own
Beam pipelines, should instead use the 'GenerateStatistics' PTransform API
directly.
Args:
  data_location: The location of the input data files.
  output_path: The file path to which to write the data statistics result. If
    None, a temporary directory is used. The output is a TFRecord file
    containing a single data statistics proto, and can be read with the
    'load_statistics' API. If you run this function on Google Cloud, you must
    specify an output_path; specifying None may cause an error.
  stats_options: tfdv.StatsOptions for generating data statistics.
  pipeline_options: Optional Beam pipeline options. This allows users to
    specify various Beam pipeline execution parameters, such as the pipeline
    runner (DirectRunner or DataflowRunner) and the Cloud Dataflow service
    project id. See
    https://cloud.google.com/dataflow/pipelines/specifying-exec-params for
    more details.
  compression_type: Used to handle compressed input files. The default value
    is CompressionTypes.AUTO, in which case the file path's extension is used
    to detect the compression.

Returns:
  A DatasetFeatureStatisticsList proto.
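A minimal usage sketch, assuming tensorflow_data_validation and its Apache Beam
dependency are installed; the data and output paths below are illustrative
placeholders, not paths from this documentation:

```python
import tensorflow_data_validation as tfdv

# Illustrative paths: point data_location at your own TFRecord files.
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/data/train-*.tfrecord',      # hypothetical input glob
    output_path='/tmp/train_stats.tfrecord',     # hypothetical output file
    stats_options=tfdv.StatsOptions(sample_rate=0.1),  # sample 10% of examples
)

# The return value is a DatasetFeatureStatisticsList proto; the same proto
# can be reloaded later from output_path with tfdv.load_statistics.
print(stats.datasets[0].num_examples)
```

Because the pipeline runs with DirectRunner by default, this executes locally;
pass a configured PipelineOptions (e.g. with DataflowRunner) to run the same
computation on Cloud Dataflow.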