tfdv.validate_statistics

Validates the input statistics against the provided input schema.

This method validates the statistics against the schema. If an optional environment is specified, the schema is filtered using the environment and the statistics are validated against the filtered schema. The optional previous_statistics and serving_statistics are the statistics computed over the control data for drift- and skew-detection, respectively.

If drift- or skew-detection is conducted, then the raw skew/drift measurements for each feature that is compared will be recorded in the drift_skew_info field in the returned Anomalies proto.

statistics A DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over the current data. Validation is currently supported only for lists with a single DatasetFeatureStatistics proto or lists with multiple DatasetFeatureStatistics protos corresponding to data slices that include the default slice (i.e., the slice with all examples). If a list with multiple DatasetFeatureStatistics protos is used, this function will validate the statistics corresponding to the default slice.
schema A Schema protocol buffer. Note that TFDV does not currently support validation of the following messages/fields in the Schema protocol buffer:

  • FeaturePresenceWithinGroup
  • Schema-level FloatDomain and IntDomain (validation is supported for Feature-level FloatDomain and IntDomain)
environment An optional string denoting the validation environment. Must be one of the default environments specified in the schema. By default, validation assumes that all Examples in a pipeline adhere to a single schema. In some cases slight schema variations are necessary; for instance, features used as labels are required during training (and should be validated) but are missing during serving. Environments can be used to express such requirements. For example, assume a feature named 'LABEL' is required for training but is expected to be missing from serving. This can be expressed by defining two distinct environments in the schema, ["SERVING", "TRAINING"], and associating 'LABEL' only with environment "TRAINING".
previous_statistics An optional DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over earlier data (for example, the previous day's data). If provided, the validate_statistics method will detect whether drift exists between the current data and the previous data. Drift detection is configured by specifying a drift_comparator in the schema.
serving_statistics An optional DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over the serving data. If provided, the validate_statistics method will identify whether distribution skew exists between the current data and the serving data. Skew detection is configured by specifying a skew_comparator in the schema.
custom_validation_config An optional config that can be used to specify custom validations to perform. If doing single-feature validations, the test feature will come from statistics and will be mapped to feature in the SQL query. If doing feature pair validations, the test feature will come from statistics and will be mapped to feature_test in the SQL query, and the base feature will come from previous_statistics and will be mapped to feature_base in the SQL query. Custom validations are not supported on Windows.

Returns an Anomalies protocol buffer.

TypeError If any of the input arguments is not of the expected type.
ValueError If the input statistics proto contains multiple datasets, none of which corresponds to the default slice.