tfdf.keras.FeatureSemantic

Semantic (e.g. numerical, categorical) of an input feature.

Inherits From: Enum

Determines how a feature is interpreted by the model. Similar to the "column type" of Yggdrasil Decision Forests.
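
As a rough illustration of how a semantic is attached to a feature, the sketch below builds a list of `tfdf.keras.FeatureUsage` objects. The feature names `age` and `color` are hypothetical, and the import is guarded so the snippet does nothing when TF-DF is not installed.

```python
try:
    import tensorflow_decision_forests as tfdf
except ImportError:
    # TF-DF is not installed; the rest of the sketch is skipped.
    tfdf = None

feature_usages = []
if tfdf is not None:
    feature_usages = [
        # "age" and "color" are hypothetical column names.
        tfdf.keras.FeatureUsage(
            name="age", semantic=tfdf.keras.FeatureSemantic.NUMERICAL),
        tfdf.keras.FeatureUsage(
            name="color", semantic=tfdf.keras.FeatureSemantic.CATEGORICAL),
    ]
```

A list like this is typically passed through the `features` argument of a TF-DF Keras model constructor. When no semantic is given for a feature, it is determined automatically.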

NUMERICAL Numerical value. Generally for quantities or counts with full ordering. For example, the age of a person, or the number of items in a bag. Can be a float or an integer. Missing values are represented by math.nan or with an empty sparse tensor. If a numerical tensor contains multiple values, its size should be constant, and each dimension is treated independently (and each dimension should always have the same "meaning").
CATEGORICAL A categorical value. Generally for a type/class in a finite set of possible values without ordering. For example, the color RED in the set {RED, BLUE, GREEN}. Can be a string or an integer. Missing values are represented by "" (empty string), the value -2, or with an empty sparse tensor. An out-of-vocabulary value (i.e. a value that was never seen in training) is represented by any new string value or the value -1. If a categorical tensor contains multiple values, its size should be constant, and each value is treated independently (each value in the tensor should always have the same meaning). Integer categorical values: (1) The training logic and model representation are optimized with the assumption that values are dense. (2) Internally, the value is stored as int32. The values should be less than about 2 billion (<~2B). (3) The number of possible values is computed automatically from the training dataset. During inference, integer values greater than any value seen during training will be treated as out-of-vocabulary. (4) Minimum frequency and maximum vocabulary size constraints don't apply.
HASH The hash of a string value. Used when only the equality between values is important (not the value itself). Currently only used for groups in ranking problems, e.g. the query in a query/document problem. The hash is computed with Google's FarmHash and stored as a uint64.
CATEGORICAL_SET Set of categorical values. Great to represent tokenized texts. Can be a string or an integer in a sparse tensor or a ragged tensor (recommended). Unlike CATEGORICAL, the number of items in a CATEGORICAL_SET can change and the order/index of each item doesn't matter.
BOOLEAN Boolean value. WARNING: Boolean values are not yet supported for training. Can be a float or an integer. Missing values are represented by math.nan or with an empty sparse tensor. If a numerical tensor contains multiple values, its size should be constant, and each dimension is treated independently (and each dimension should always have the same "meaning").
DISCRETIZED_NUMERICAL Numerical values automatically discretized into bins. Discretized numerical features are faster to train than (non-discretized) numerical features. If the number of unique values of these features is lower than the number of bins, the discretization is lossless from the point of view of the model. If the number of unique values of these features is greater than the number of bins, the discretization is lossy from the point of view of the model. Lossy discretization can reduce, and sometimes increase (due to regularization), the quality of the model.
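
The lossless versus lossy distinction in the DISCRETIZED_NUMERICAL entry can be sketched in plain Python. This is only an illustration under stated assumptions: the helpers `fit_boundaries`, `discretize` and `is_lossless` are hypothetical, and the boundary selection shown here is a simple stand-in, not the algorithm the library actually uses.

```python
import bisect


def fit_boundaries(values, num_bins):
    """Picks at most num_bins - 1 bin boundaries (illustrative rule only)."""
    unique = sorted(set(values))
    if len(unique) <= num_bins:
        # Enough bins for every unique value: cut at the midpoints.
        return [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    # Too many unique values: cut at evenly spaced unique values.
    step = len(unique) / num_bins
    return [unique[int(i * step)] for i in range(1, num_bins)]


def discretize(value, boundaries):
    """Returns the index of the bin that contains value."""
    return bisect.bisect_right(boundaries, value)


def is_lossless(values, boundaries):
    """True if distinct values always land in distinct bins."""
    bins = {discretize(v, boundaries) for v in values}
    return len(bins) == len(set(values))


values = [1.0, 2.0, 2.0, 5.0, 9.0]  # 4 unique values
lossless = is_lossless(values, fit_boundaries(values, num_bins=4))
lossy = not is_lossless(values, fit_boundaries(values, num_bins=2))
```

With four bins each of the four unique values gets its own bin, so the model sees the same information as with the raw values. With two bins several values share a bin, and the model can no longer tell them apart.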

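The rules for integer CATEGORICAL values described above (-2 for missing, -1 for out-of-vocabulary, and values larger than anything seen in training treated as out-of-vocabulary) can be mimicked with a small helper. This is a sketch of the documented behavior, not the library's code; the function name and constants are hypothetical, and other negative values are passed through unchanged because the text does not say how they are handled.

```python
MISSING = -2            # documented marker for a missing integer value
OUT_OF_VOCABULARY = -1  # documented marker for an unseen value


def normalize_integer_category(value, max_training_value):
    """Maps an inference-time integer to itself, MISSING, or OUT_OF_VOCABULARY."""
    if value == MISSING:
        return MISSING
    if value == OUT_OF_VOCABULARY or value > max_training_value:
        return OUT_OF_VOCABULARY
    return value


training_values = [0, 1, 2, 2, 1, 0]
max_seen = max(training_values)  # the vocabulary size is derived from training data

inference_values = [1, 5, -2, -1, 2]
normalized = [normalize_integer_category(v, max_seen) for v in inference_values]
```

Here the value 5 is larger than any training value, so it is mapped to the out-of-vocabulary marker, while 1 and 2 are kept as they are.
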
BOOLEAN <Semantic.BOOLEAN: 5>
CATEGORICAL <Semantic.CATEGORICAL: 2>
CATEGORICAL_SET <Semantic.CATEGORICAL_SET: 4>
DISCRETIZED_NUMERICAL <Semantic.DISCRETIZED_NUMERICAL: 6>
HASH <Semantic.HASH: 3>
NUMERICAL <Semantic.NUMERICAL: 1>
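
The idea behind the HASH semantic, where only equality between values matters, can be sketched as grouping ranking examples by the hash of their query. The snippet uses `hashlib.blake2b` with an 8-byte digest as a stand-in 64-bit hash; it is not FarmHash and will not produce the same values as the library. The function name and the example rows are hypothetical.

```python
import hashlib
from collections import defaultdict


def hash_group(value: str) -> int:
    """Returns an unsigned 64-bit hash of value (stand-in, not FarmHash)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


# Hypothetical (query, document) pairs from a ranking dataset.
rows = [
    ("cheap flights", "doc_a"),
    ("weather", "doc_b"),
    ("cheap flights", "doc_c"),
]

# Documents that share a query end up in the same group.
groups = defaultdict(list)
for query, document in rows:
    groups[hash_group(query)].append(document)
```

The original query string is never needed after hashing: two rows belong to the same group exactly when their hashes are equal.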