
The SchemaGen TFX Pipeline Component

Some TFX components use a description of your input data called a schema. The schema is an instance of schema.proto. It can specify data types for feature values, whether a feature has to be present in all examples, allowed value ranges, and other properties. A SchemaGen pipeline component will automatically generate a schema by inferring types, categories, and ranges from the training data.

  • Consumes: statistics from a StatisticsGen component
  • Emits: Data schema proto
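
To give a feel for what "inferring types, categories, and ranges" means, here is a toy sketch in plain Python. It is not how SchemaGen or TensorFlow Data Validation is implemented: the helper name infer_toy_schema, the dict-based output, and the sample rows are all made up for illustration, and the real component works from precomputed statistics rather than raw rows.

```python
def infer_toy_schema(examples):
    """Infer a type, a required flag, and a range or domain for each feature.

    Hypothetical helper for illustration only; not part of TFX or TFDV.
    """
    schema = {}
    names = sorted({name for example in examples for name in example})
    for name in names:
        # Collect the values of this feature from the examples that have it.
        values = [ex[name] for ex in examples if ex.get(name) is not None]
        if not values:
            continue
        if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
            feature_type = "INT"
        elif all(isinstance(v, (int, float)) and not isinstance(v, bool)
                 for v in values):
            feature_type = "FLOAT"
        else:
            feature_type = "BYTES"
        entry = {
            "type": feature_type,
            # Required only if every example carried a value for the feature.
            "required": len(values) == len(examples),
        }
        if feature_type in ("INT", "FLOAT"):
            entry["min"], entry["max"] = min(values), max(values)
        else:
            # Treat non-numeric features as categorical with a closed domain.
            entry["domain"] = sorted({str(v) for v in values})
        schema[name] = entry
    return schema


# Made-up sample rows, loosely shaped like the features in this guide.
sample_rows = [
    {"age": 39.0, "capital-gain": 2174.0, "workclass": "State-gov"},
    {"age": 50.0, "capital-gain": 0.0, "workclass": "Private"},
    {"age": 38.0, "capital-gain": 0.0},
]
toy_schema = infer_toy_schema(sample_rows)
for feature_name, properties in toy_schema.items():
    print(feature_name, properties)
```

In this sketch "workclass" comes out as optional with a two-value domain because one row omits it, while "age" and "capital-gain" come out as required FLOAT features with numeric ranges.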

Here's an excerpt from a schema proto:

...
feature {
  name: "age"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1
    min_count: 1
  }
}
feature {
  name: "capital-gain"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1
    min_count: 1
  }
}
...
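
To make the fields in the excerpt concrete, the sketch below checks a few examples against hand-written constraints that mirror it. It assumes the usual reading of these fields: value_count bounds the number of values a feature has in one example, presence.min_fraction is the minimum fraction of examples that must contain the feature, and presence.min_count is the minimum number of examples that must contain it. The dict layout, the find_anomalies helper, and the sample data are hypothetical and are not a TFX or TensorFlow Data Validation API.

```python
# Hand-written mirror of the two features in the excerpt (illustrative only).
CONSTRAINTS = {
    "age": {"min_values": 1, "max_values": 1,
            "min_fraction": 1.0, "min_count": 1},
    "capital-gain": {"min_values": 1, "max_values": 1,
                     "min_fraction": 1.0, "min_count": 1},
}


def find_anomalies(examples, constraints):
    """Return (feature, violated_rule) pairs; hypothetical helper."""
    anomalies = []
    total = len(examples)
    for name, rule in constraints.items():
        # Each example maps a feature name to a list of values.
        present = [ex[name] for ex in examples if name in ex]
        if len(present) < rule["min_count"]:
            anomalies.append((name, "min_count"))
        if total and len(present) / total < rule["min_fraction"]:
            anomalies.append((name, "min_fraction"))
        for values in present:
            if not rule["min_values"] <= len(values) <= rule["max_values"]:
                anomalies.append((name, "value_count"))
            if not all(isinstance(v, float) for v in values):
                # Both features in the excerpt are declared as FLOAT.
                anomalies.append((name, "type"))
    return anomalies


# Made-up examples: the second has two "age" values, the third has none.
sample_examples = [
    {"age": [39.0], "capital-gain": [2174.0]},
    {"age": [50.0, 51.0], "capital-gain": [0.0]},
    {"capital-gain": [0.0]},
]
anomalies = find_anomalies(sample_examples, CONSTRAINTS)
print(anomalies)
```

Here "age" violates both the presence fraction (it appears in only two of three examples) and the value count (one example carries two values), while "capital-gain" satisfies every constraint.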

The following TFX libraries use the schema:

  • TensorFlow Data Validation
  • TensorFlow Transform
  • TensorFlow Model Analysis

In a typical TFX pipeline, SchemaGen generates a schema, which is then consumed by the other pipeline components.

SchemaGen and TensorFlow Data Validation

SchemaGen makes extensive use of TensorFlow Data Validation for inferring a schema.

Using the SchemaGen Component

A SchemaGen pipeline component is typically easy to deploy and requires little customization: its input is the statistics emitted by a StatisticsGen component. Typical code looks like this:

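from tfx import components

...

# compute_training_stats is the upstream StatisticsGen component, defined
# elsewhere in the pipeline; SchemaGen reads its 'statistics' output.
infer_schema = components.SchemaGen(
    statistics=compute_training_stats.outputs['statistics'])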