Validates the input statistics against the provided input schema.
tfdv.validate_statistics(statistics, schema, environment=None, previous_statistics=None, serving_statistics=None)
This method validates the statistics against the schema. If an optional environment is specified, the schema is filtered using the environment and the statistics are validated against the filtered schema. The optional previous_statistics and serving_statistics are the statistics computed over the control data for drift and skew detection, respectively.
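As a rough illustration of the basic call, the sketch below wraps validate_statistics in a small helper that returns one description per anomalous feature. The helper name summarize_anomalies is hypothetical and not part of TFDV; the sketch assumes the returned Anomalies proto exposes an anomaly_info map whose values carry a description field.

```python
def summarize_anomalies(statistics, schema, environment=None):
    """Validate `statistics` against `schema` and return {feature: description}.

    `summarize_anomalies` is a hypothetical helper, not a TFDV function.
    """
    # Imported inside the function because TFDV is a heavy dependency.
    import tensorflow_data_validation as tfdv

    anomalies = tfdv.validate_statistics(
        statistics=statistics,
        schema=schema,
        environment=environment,  # None validates against the unfiltered schema
    )
    # Assumption: `anomaly_info` maps each anomalous feature name to a message
    # that has a human-readable `description` field.
    return {
        name: info.description
        for name, info in anomalies.anomaly_info.items()
    }
```

In practice the statistics argument might come from tfdv.generate_statistics_from_dataframe and the schema from tfdv.infer_schema. An empty dictionary would mean that no anomalies were reported.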
statistics: A DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over the current data. Validation is currently supported only for lists with a single DatasetFeatureStatistics proto or lists with multiple DatasetFeatureStatistics protos corresponding to data slices that include the default slice (i.e., the slice with all examples). If a list with multiple DatasetFeatureStatistics protos is used, this function will validate the statistics corresponding to the default slice.
schema: A Schema protocol buffer. Note that TFDV does not currently support validation of the following messages/fields in the Schema protocol buffer:
- Schema-level FloatDomain and IntDomain (validation is supported for Feature-level FloatDomain and IntDomain)
environment: An optional string denoting the validation environment. Must be one of the default environments specified in the schema. By default, validation assumes that all Examples in a pipeline adhere to a single schema. In some cases introducing slight schema variations is necessary; for instance, features used as labels are required during training (and should be validated) but are missing during serving. Environments can be used to express such requirements. For example, assume a feature named 'LABEL' is required for training but is expected to be missing from serving. This can be expressed by defining two distinct environments in the schema, ["SERVING", "TRAINING"], and associating 'LABEL' only with environment "TRAINING".
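The 'LABEL' scenario above can be sketched as follows. The helper name restrict_label_to_training is hypothetical; the sketch assumes the schema proto has a repeated default_environment field and that each feature has a repeated not_in_environment field, accessed here through tfdv.get_feature.

```python
def restrict_label_to_training(schema, label_name="LABEL"):
    """Declare TRAINING and SERVING, and drop the label from SERVING.

    `restrict_label_to_training` is a hypothetical helper, not a TFDV function.
    The schema is modified in place and also returned for convenience.
    """
    # Imported inside the function because TFDV is a heavy dependency.
    import tensorflow_data_validation as tfdv

    # By default, every feature belongs to both environments.
    schema.default_environment.append("TRAINING")
    schema.default_environment.append("SERVING")

    # Exclude the label from SERVING, so it stays associated with TRAINING only.
    tfdv.get_feature(schema, label_name).not_in_environment.append("SERVING")
    return schema
```

With a schema prepared this way, calling tfdv.validate_statistics(serving_statistics, schema, environment="SERVING") should not report the label as missing, while validation with environment="TRAINING" still requires it.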
previous_statistics: An optional DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over earlier data (for example, the previous day's data). If provided, the validate_statistics method will detect whether drift exists between the current data and the previous data. Drift detection can be configured by specifying a drift_comparator in the schema. For now, drift detection is supported only for categorical features.
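A minimal sketch of a drift check follows. The helper name detect_drift and the default threshold of 0.01 are illustrative assumptions; the sketch also assumes the comparator is configured through its infinity_norm.threshold field, which is one of the comparator options and not the only possible configuration.

```python
def detect_drift(statistics, previous_statistics, schema, feature_name,
                 threshold=0.01):
    """Set a drift threshold on one feature, then validate against earlier data.

    `detect_drift` is a hypothetical helper, not a TFDV function. The threshold
    value is arbitrary and would need tuning for real data.
    """
    # Imported inside the function because TFDV is a heavy dependency.
    import tensorflow_data_validation as tfdv

    # Configure the drift comparator on the (categorical) feature in place.
    feature = tfdv.get_feature(schema, feature_name)
    feature.drift_comparator.infinity_norm.threshold = threshold

    # Passing previous_statistics enables the drift comparison.
    return tfdv.validate_statistics(
        statistics=statistics,
        schema=schema,
        previous_statistics=previous_statistics,
    )
```

Note that the helper mutates the schema it receives, so the threshold persists for later validation calls that reuse the same schema object.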
serving_statistics: An optional DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over the serving data. If provided, the validate_statistics method will identify whether distribution skew exists between the current data and the serving data. Skew detection can be configured by specifying a skew_comparator in the schema. For now, skew detection is supported only for categorical features.
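To give some intuition for what a comparator threshold might be applied to, the sketch below computes an L-infinity distance between two categorical value distributions in plain Python. It is not TFDV's implementation; it assumes the skew_comparator is configured with its infinity_norm option, and the function name, the example counts, and the 0.25 threshold are all made up for illustration.

```python
def linf_distance(train_counts, serving_counts):
    """Largest absolute gap between two normalized categorical distributions.

    Both arguments map a category value to its number of occurrences.
    """
    train_total = sum(train_counts.values())
    serving_total = sum(serving_counts.values())
    if train_total == 0 or serving_total == 0:
        raise ValueError("Both distributions need at least one observation.")

    # A value missing from one side counts as frequency 0 on that side.
    values = set(train_counts) | set(serving_counts)
    return max(
        abs(train_counts.get(v, 0) / train_total
            - serving_counts.get(v, 0) / serving_total)
        for v in values
    )


# Hypothetical value counts for a categorical feature.
train_counts = {"cash": 60, "card": 40}
serving_counts = {"cash": 30, "card": 60, "voucher": 10}

distance = linf_distance(train_counts, serving_counts)
is_skewed = distance > 0.25  # hypothetical threshold
```

Here the largest gap is on "cash" (0.6 of the training data versus 0.3 of the serving data), so the distance is roughly 0.3 and the feature would be flagged under the example threshold.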
Returns: An Anomalies protocol buffer.
TypeError: If any of the input arguments is not of the expected type.
ValueError: If the input statistics proto contains multiple datasets, none of which corresponds to the default slice.