TFDV supports custom data validation using SQL. You can run custom data
validation using validate_statistics or custom_validate_statistics. Use
validate_statistics to run standard, schema-based data validation along with
custom validation. Use custom_validate_statistics to run only custom
validation.
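The two entry points can be sketched as thin wrappers. This is a minimal sketch, not a definitive usage: the keyword names `custom_validation_config` and `validations`, and the module path `tensorflow_data_validation.api.validation_api`, are assumptions to check against the installed TFDV version. The wrapper function names are hypothetical.

```python
def run_schema_and_custom_validation(statistics, schema, config):
    """Schema-based validation plus custom SQL validations in one call."""
    # Imported inside the function so this sketch can be loaded even where
    # TFDV is not installed.
    import tensorflow_data_validation as tfdv

    # Assumption: validate_statistics accepts the CustomValidationConfig
    # through a `custom_validation_config` keyword argument.
    return tfdv.validate_statistics(
        statistics=statistics,
        schema=schema,
        custom_validation_config=config,
    )


def run_custom_validation_only(statistics, config):
    """Custom SQL validations only, with no schema involved."""
    # Assumption: custom_validate_statistics lives in the validation_api
    # module and takes the config through a `validations` argument.
    from tensorflow_data_validation.api import validation_api

    return validation_api.custom_validate_statistics(
        statistics=statistics,
        validations=config,
    )
```

Both wrappers return whatever the underlying TFDV call returns, which the text describes as an Anomalies proto.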
Configuring Custom Data Validation
Use the CustomValidationConfig to define custom validations to run. For each validation, provide a SQL expression that returns a boolean value. Each SQL expression is run against the summary statistics for the specified feature. If the expression returns false, TFDV generates a custom anomaly using the provided severity and anomaly description.
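The evaluation rule described above (a boolean SQL expression is run against a feature's summary statistics, and a false result produces an anomaly) can be illustrated with a small stand-in. This sketch uses Python's built-in sqlite3 rather than TFDV's own SQL engine, and the flat statistics columns (`min_num_values`, `max_num_values`), the `Validation` class, and `run_validations` are hypothetical names chosen for illustration. TFDV's real statistics are nested protos, so the expressions here are not valid TFDV queries.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Validation:
    sql_expression: str  # must evaluate to a boolean
    severity: str        # e.g. "ERROR" or "WARNING"
    description: str     # user-provided anomaly description


def run_validations(stats, validations):
    """Return (severity, description) for every expression that is false."""
    conn = sqlite3.connect(":memory:")
    # Expose the statistics as a one-row table named `feature`, so that
    # expressions can refer to `feature.<statistic>`.
    columns = ", ".join(stats)
    placeholders = ", ".join("?" for _ in stats)
    conn.execute(f"CREATE TABLE feature ({columns})")
    conn.execute(f"INSERT INTO feature VALUES ({placeholders})", list(stats.values()))

    anomalies = []
    for validation in validations:
        query = f"SELECT {validation.sql_expression} FROM feature"
        (passed,) = conn.execute(query).fetchone()
        if not passed:  # false (or NULL) result: report an anomaly
            anomalies.append((validation.severity, validation.description))
    conn.close()
    return anomalies


stats = {"min_num_values": 1, "max_num_values": 3}
checks = [
    Validation("feature.min_num_values > 5", "ERROR", "Feature has too few values."),
    Validation("feature.max_num_values < 10", "WARNING", "Feature has too many values."),
]
print(run_validations(stats, checks))
```

Only the first check fails here, so only its severity and description are reported; the second expression is true and produces nothing.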
You may configure custom validations that run against individual features or
feature pairs. For each feature, specify both the dataset (i.e., slice) and the
feature path to use, though you may leave the dataset name blank if you want to
validate the default slice (i.e., all examples). For single-feature validations,
the feature statistics are bound to the name feature. For feature-pair
validations, the test feature statistics are bound to feature_test and the base
feature statistics are bound to feature_base. See the section below for example
queries.
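As a rough illustration of the bindings described above, a configuration in protobuf text format might look like the following, held here as a Python string. The field names (`feature_validations`, `feature_pair_validations`, `feature_path`, `feature_test_path`, `feature_base_path`, `validations`, `sql_expression`, `severity`, `description`), the statistics paths inside the expressions, and the feature names `company` and `employer` are assumptions for illustration. Check them against the CustomValidationConfig proto before use.

```python
# A sketch of a CustomValidationConfig in protobuf text format.
# The dataset name is omitted, so each validation targets the default slice.
CUSTOM_VALIDATION_CONFIG = """
feature_validations {
  feature_path { step: "company" }
  validations {
    # Single-feature validation: statistics are bound to `feature`.
    sql_expression: "feature.string_stats.common_stats.min_num_values > 5"
    severity: ERROR
    description: "Feature has too few values."
  }
}
feature_pair_validations {
  feature_test_path { step: "company" }
  feature_base_path { step: "employer" }
  validations {
    # Feature-pair validation: statistics are bound to `feature_test`
    # and `feature_base`.
    sql_expression: "feature_test.string_stats.unique = feature_base.string_stats.unique"
    severity: ERROR
    description: "Test and base features differ in their number of unique values."
  }
}
"""

# With protobuf and TFDV installed, a string like this would typically be
# parsed into a CustomValidationConfig message using
# google.protobuf.text_format.Parse before being passed to TFDV.
```

Keeping the configuration in text format makes each SQL expression easy to review next to its severity and description.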
If a custom validation triggers an anomaly, TFDV will return an Anomalies proto with the reason(s) for the anomaly. Each reason will have a short description, which is user configured, and a description with the query that caused the anomaly, the dataset names on which the query was run, and the base feature path (if running a feature-pair validation). See the section below for example results of custom validation.
See the documentation in the CustomValidationConfig proto for example
configurations.