Some TFX components use a description of your input data called a schema. The schema is an instance of schema.proto. It can specify the data type of each feature's values, whether a feature must be present in all examples, allowed value ranges, and other properties. The SchemaGen pipeline component automatically generates a schema by inferring types, categories, and ranges from the training data.
- Consumes: statistics from a StatisticsGen component
- Emits: a data schema proto
Here's an excerpt from a schema proto:
...
feature {
  name: "age"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1
    min_count: 1
  }
}
feature {
  name: "capital-gain"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1
    min_count: 1
  }
}
...
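To make the constraints in the excerpt concrete, here is a minimal pure-Python sketch of what they mean for a single example: exactly one value per feature (`value_count`), float values (`type: FLOAT`), and the feature present in every example (`presence.min_fraction: 1`). The `FeatureSpec` class and `find_violations` function are hypothetical names invented for this illustration. They are not part of TFX or TensorFlow Data Validation, and the real validation logic is considerably richer.

```python
from dataclasses import dataclass


@dataclass
class FeatureSpec:
    """Hypothetical stand-in for one `feature` entry in the excerpt."""
    name: str
    value_type: type   # float corresponds to `type: FLOAT`
    min_values: int    # value_count.min
    max_values: int    # value_count.max
    required: bool     # True when presence.min_fraction is 1


def find_violations(example, specs):
    """Return a list of messages describing how `example` violates `specs`.

    `example` maps feature names to lists of values, loosely mirroring how
    each feature of an example holds a list of values.
    """
    violations = []
    for spec in specs:
        values = example.get(spec.name)
        if values is None:
            if spec.required:
                violations.append(f"{spec.name}: missing required feature")
            continue
        if not spec.min_values <= len(values) <= spec.max_values:
            violations.append(f"{spec.name}: got {len(values)} values")
        if not all(isinstance(v, spec.value_type) for v in values):
            violations.append(f"{spec.name}: wrong value type")
    return violations


# The two features from the excerpt, expressed with the toy class above.
SPECS = [
    FeatureSpec("age", float, 1, 1, True),
    FeatureSpec("capital-gain", float, 1, 1, True),
]

good = {"age": [39.0], "capital-gain": [2174.0]}
bad = {"age": [39.0, 40.0]}  # two values for age, capital-gain absent

print(find_violations(good, SPECS))
print(find_violations(bad, SPECS))
```

The first call reports no violations. The second reports that `age` carries two values and that `capital-gain` is missing.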
The following TFX libraries use the schema:
- TensorFlow Data Validation
- TensorFlow Transform
- TensorFlow Model Analysis
In a typical TFX pipeline, SchemaGen generates a schema, which is then consumed by the other pipeline components.
SchemaGen and TensorFlow Data Validation
SchemaGen relies on TensorFlow Data Validation to infer the schema.
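As a rough intuition for what inferring a schema involves, the following toy sketch derives a type, a value-count range, and a presence fraction for each feature from a handful of examples. It is an assumption-laden illustration only: `infer_feature_specs` is a hypothetical function, it reads raw examples rather than the statistics that SchemaGen consumes, and it does not reproduce the inference rules TensorFlow Data Validation actually applies.

```python
def infer_feature_specs(examples):
    """Infer a toy per-feature summary from a list of examples.

    `examples` is a list of dicts mapping feature names to lists of values.
    Returns a dict keyed by feature name.
    """
    names = sorted({name for example in examples for name in example})
    specs = {}
    for name in names:
        # Value lists from the examples that actually contain this feature.
        present = [example[name] for example in examples if name in example]
        counts = [len(values) for values in present]
        flat = [v for values in present for v in values]

        # Pick the narrowest type label that covers every observed value.
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in flat
        )
        if numeric:
            kind = "INT" if all(isinstance(v, int) for v in flat) else "FLOAT"
        else:
            kind = "BYTES"

        specs[name] = {
            "type": kind,
            "value_count": (min(counts), max(counts)),
            # Fraction of examples in which the feature appears.
            "min_fraction": len(present) / len(examples),
        }
    return specs


examples = [
    {"age": [39.0], "capital-gain": [2174.0], "education": ["Bachelors"]},
    {"age": [50.0], "capital-gain": [0.0]},
]

specs = infer_feature_specs(examples)
for name, spec in specs.items():
    print(name, spec)
```

Here `age` and `capital-gain` come out as single-valued FLOAT features present in every example, matching the excerpt shown earlier, while `education` appears in only half of the examples.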
Using the SchemaGen Component
A SchemaGen pipeline component is typically very easy to deploy and requires little customization. Typical code looks like this:
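from tfx import components
...
# compute_training_stats is the upstream StatisticsGen component; its
# 'statistics' output is the input that SchemaGen consumes.
infer_schema = components.SchemaGen(
    statistics=compute_training_stats.outputs['statistics'])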