tfdv.validate_statistics

tfdv.validate_statistics(
    statistics,
    schema,
    environment=None,
    previous_statistics=None,
    serving_statistics=None
)

Validate the input statistics against the provided input schema.

This method validates the statistics against the schema. If an optional environment is specified, the schema is first filtered by that environment and the statistics are then validated against the filtered schema. The optional previous_statistics and serving_statistics are statistics computed over the comparison datasets used for drift detection and skew detection, respectively.

Args:

  • statistics: A DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over the current data. Validation is currently only supported for lists with a single DatasetFeatureStatistics proto.
  • schema: A Schema protocol buffer.
  • environment: An optional string denoting the validation environment. Must be one of the default environments specified in the schema. By default, validation assumes that all Examples in a pipeline adhere to a single schema. In some cases slight schema variations are necessary. For instance, features used as labels are required during training (and should be validated) but are missing during serving. Environments can be used to express such requirements. For example, assume a feature named 'LABEL' is required for training but is expected to be missing from serving. This can be expressed by defining two distinct environments in the schema, ["SERVING", "TRAINING"], and associating 'LABEL' only with the environment "TRAINING".
  • previous_statistics: An optional DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over earlier data (for example, the previous day's data). If provided, validate_statistics detects whether drift exists between the current data and the previous data. Drift detection is configured by specifying a drift_comparator in the schema. Currently, drift detection is only supported for categorical features.
  • serving_statistics: An optional DatasetFeatureStatisticsList protocol buffer denoting the statistics computed over the serving data. If provided, validate_statistics detects whether distribution skew exists between the current data and the serving data. Skew detection is configured by specifying a skew_comparator in the schema. Currently, skew detection is only supported for categorical features.
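
The arguments above can be combined as in the following minimal sketch. It assumes the schema contains features named 'LABEL' and 'payment_type' (both hypothetical), that the three statistics arguments were computed elsewhere, and that 0.01 is only a placeholder threshold. The helper name validate_with_environments is illustrative and not part of TFDV.

```python
def validate_with_environments(train_stats, serving_stats, previous_stats, schema):
    """Validate training and serving statistics under separate environments.

    Note: this sketch modifies `schema` in place.
    """
    import tensorflow_data_validation as tfdv

    # Declare both environments as defaults, so every feature belongs to
    # both unless stated otherwise.
    schema.default_environment.extend(["TRAINING", "SERVING"])
    # 'LABEL' (hypothetical feature name) is not expected in serving data.
    tfdv.get_feature(schema, "LABEL").not_in_environment.append("SERVING")

    # Configure drift and skew detection for a categorical feature.
    # 'payment_type' and the 0.01 thresholds are assumptions for illustration.
    feature = tfdv.get_feature(schema, "payment_type")
    feature.drift_comparator.infinity_norm.threshold = 0.01
    feature.skew_comparator.infinity_norm.threshold = 0.01

    # Training data: 'LABEL' is required; also compare against earlier data
    # (drift) and serving data (skew).
    training_anomalies = tfdv.validate_statistics(
        statistics=train_stats,
        schema=schema,
        environment="TRAINING",
        previous_statistics=previous_stats,
        serving_statistics=serving_stats,
    )
    # Serving data: 'LABEL' may be missing without raising an anomaly.
    serving_anomalies = tfdv.validate_statistics(
        statistics=serving_stats,
        schema=schema,
        environment="SERVING",
    )
    return training_anomalies, serving_anomalies
```

Validating the serving statistics without environment="SERVING" would be expected to report the missing 'LABEL' feature as an anomaly, which is the case environments are meant to avoid.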

Returns:

An Anomalies protocol buffer.
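
As a rough illustration of reading the result, the helper below collects one line per flagged feature. It relies on the anomaly_info map of the Anomalies proto and the short_description field of each entry. The function name summarize_anomalies is hypothetical and not part of TFDV.

```python
def summarize_anomalies(anomalies):
    """Return (feature_name, short_description) pairs sorted by feature name.

    `anomalies` is expected to expose an `anomaly_info` mapping from feature
    name to an entry with a `short_description` attribute, as the Anomalies
    proto does.
    """
    return sorted(
        (name, info.short_description)
        for name, info in anomalies.anomaly_info.items()
    )
```

An empty list here means that no per-feature anomalies were reported.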

Raises:

  • TypeError: If any of the input arguments is not of the expected type.
  • ValueError: If the input statistics proto does not contain exactly one dataset.
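
Callers who want to detect the single-dataset condition before validating could use a check like the sketch below. It only mirrors the documented requirement using the datasets field of DatasetFeatureStatisticsList. It is not the library's internal check, and the name require_single_dataset is hypothetical.

```python
def require_single_dataset(statistics):
    """Return the only dataset in `statistics`, or raise ValueError.

    `statistics` is expected to expose a `datasets` sequence, as a
    DatasetFeatureStatisticsList proto does.
    """
    count = len(statistics.datasets)
    if count != 1:
        raise ValueError(
            f"expected exactly one dataset in statistics, found {count}"
        )
    return statistics.datasets[0]
```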