TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).
TF Data Validation includes:
- Scalable calculation of summary statistics of training and test data.
- Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
- Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
- A schema viewer to help you inspect the schema.
- Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
- An anomalies viewer so that you can see which features have anomalies and learn enough about them to correct them.
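The pieces listed above fit together roughly as follows. This is a minimal sketch, assuming TFDV's Python API (`generate_statistics_from_csv`, `infer_schema`, `validate_statistics`). The function name `find_anomalies`, the CSV paths, and the optional `tfdv` argument (which lets a stand-in module be passed in) are illustrative and not part of the library.

```python
def find_anomalies(train_csv, eval_csv, tfdv=None):
    """Validate eval_csv against a schema inferred from train_csv.

    Returns a dict mapping each anomalous feature name to its description.
    """
    if tfdv is None:
        # Imported lazily so the sketch can be loaded before TFDV is installed.
        import tensorflow_data_validation as tfdv

    # 1. Compute summary statistics over the training data.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    # 2. Infer a schema (types, domains, presence) from those statistics.
    schema = tfdv.infer_schema(statistics=train_stats)
    # 3. Compute statistics for the data to check and compare with the schema.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    # anomaly_info maps each anomalous feature to details about the problem.
    return {name: info.description
            for name, info in anomalies.anomaly_info.items()}
```

In a notebook, `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies` render the statistics, schema, and anomalies viewers for the same objects.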
Installing from PyPI
The recommended way to install TFDV is using the PyPI package:
pip install tensorflow-data-validation
Installing from source
1. Prerequisites
To compile and use TFDV, you need to set up some prerequisites.
If NumPy is not installed on your system, install it now by following the NumPy installation instructions.
If Bazel is not installed on your system, install it now by following the Bazel installation instructions.
2. Clone the TFDV repository
git clone https://github.com/tensorflow/data-validation
cd data-validation
Note that these instructions install the latest master branch of TensorFlow
Data Validation. To install a specific branch (such as a release branch),
pass -b <branchname> to the git clone command.
3. Build the pip package
TFDV uses Bazel to build the pip package from source:
bazel run -c opt tensorflow_data_validation:build_pip_package
You can find the generated
.whl file in the
4. Install the pip package
pip install dist/*.whl
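After either installation route, it can be useful to confirm which version ended up in the active environment. The following is a small sketch that uses only the Python standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is an illustrative choice, not something TFDV provides.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("tensorflow-data-validation")
    if version is None:
        print("tensorflow-data-validation is not installed in this environment")
    else:
        print("tensorflow-data-validation", version)
```

The reported version can then be compared against the compatibility information for TensorFlow and Apache Beam.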
TFDV is built and tested on the following 64-bit operating systems:
- macOS 10.12.6 (Sierra) or later.
- Ubuntu 14.04 or later.
Apache Beam is required; TFDV uses it to run computations efficiently in a distributed fashion. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible to other Apache Beam runners.
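As a rough illustration of the difference between local and distributed mode, the sketch below builds the command-line style flags that Apache Beam pipeline options are commonly created from. `DirectRunner` and `DataflowRunner` are Beam's runner names; the helper `beam_flags`, the project ID, and the bucket path are placeholders invented for this example.

```python
def beam_flags(runner="DirectRunner", project=None, temp_location=None):
    """Build command-line style flags for Apache Beam pipeline options."""
    flags = ["--runner=" + runner]
    if project is not None:
        flags.append("--project=" + project)
    if temp_location is not None:
        flags.append("--temp_location=" + temp_location)
    return flags


# Local mode: the default, runs in the current process.
local_flags = beam_flags()

# Distributed mode on Google Cloud Dataflow (placeholder project and bucket).
dataflow_flags = beam_flags(
    "DataflowRunner",
    project="my-gcp-project",
    temp_location="gs://my-bucket/tmp",
)
```

Such a list can be passed to `apache_beam.options.pipeline_options.PipelineOptions(flags)`. Assuming the installed TFDV version accepts a `pipeline_options` argument on `generate_statistics_from_csv`, the resulting object can be handed to it. A real Dataflow job typically needs further options (for example a region), which are omitted here.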
The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.
|tensorflow-data-validation|tensorflow|apache-beam[gcp]|
|---|---|---|
|GitHub master|nightly (1.x)|2.8.0|