tfds.features.Text

FeatureConnector for text, encoding to integers with a TextEncoder.

Inherits From: Tensor

Args
encoder tfds.deprecated.text.TextEncoder, an encoder that can convert text to integers. If None, the text will be UTF-8 byte-encoded.
encoder_config tfds.deprecated.text.TextEncoderConfig, needed if restoring from a file with load_metadata.
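
As a rough illustration of the default case (encoder is None), the plain-Python sketch below shows what UTF-8 byte encoding of a string looks like. It does not call TFDS, and the variable names are illustrative only.

```python
# Plain-Python illustration of UTF-8 byte encoding (no TFDS involved).
text = "héllo"

# Encoding yields raw bytes; non-ASCII characters take more than one byte.
encoded = text.encode("utf-8")

# Decoding the same bytes restores the original string.
decoded = encoded.decode("utf-8")

print(len(text), len(encoded))  # 5 characters, 6 bytes
```

Note that the byte length can exceed the character count, which matters when reasoning about the size of byte-encoded text.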

Attributes
dtype Returns the dtype (or dict of dtype) of this FeatureConnector.
encoder
shape Returns the shape (or dict of shape) of this FeatureConnector.
vocab_size

Methods

decode_batch_example

See base class for details.

decode_example

Decode the feature dict to TF compatible input.

Args
tfexample_data Data or dictionary of data, as read by the tf-example reader. It corresponds to the tf.Tensor() (or dict of tf.Tensor()) extracted from the tf.train.Example, matching the info defined in get_serialized_info().

Returns
tensor_data Tensor or dictionary of tensors, as output by the tf.data.Dataset object.

decode_ragged_example

See base class for details.

encode_example

See base class for details.

from_config

Reconstructs the FeatureConnector from the config file.

Usage:

features = FeatureConnector.from_config('path/to/dir')  # directory containing features.json

Args
root_dir Directory containing the features.json file.

Returns
The reconstructed feature instance.

from_json

FeatureConnector factory.

This function should be called from the tfds.features.FeatureConnector base class. Subclasses should implement from_json_content.

Example:

feature = tfds.features.FeatureConnector.from_json(
    {'type': 'Image', 'content': {'shape': [32, 32, 3], 'dtype': 'uint8'} }
)
assert isinstance(feature, tfds.features.Image)

Args
value dict(type=, content=) containing the feature to restore. Matches the dict returned by to_json.

Returns
The reconstructed FeatureConnector.

from_json_content

FeatureConnector factory (to override).

Subclasses should override this method to allow importing the feature connector from the config.

This function should not be called directly. FeatureConnector.from_json should be called instead.

See existing FeatureConnector subclasses for examples of implementation.

Args
value FeatureConnector information. Matches the dict returned by to_json_content.

Returns
The reconstructed FeatureConnector.
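
The relationship between from_json (called on the base class) and from_json_content (implemented by each subclass) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the TFDS implementation: FeatureSketch, ImageSketch, and the name-based registry are all hypothetical stand-ins.

```python
class FeatureSketch:
    """Hypothetical stand-in for a FeatureConnector-like base class."""

    _registry = {}  # maps a type name to the subclass that handles it

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        FeatureSketch._registry[cls.__name__] = cls

    @classmethod
    def from_json(cls, value):
        # Factory: pick the subclass named by 'type', then delegate to it.
        subclass = FeatureSketch._registry[value["type"]]
        return subclass.from_json_content(value["content"])

    @classmethod
    def from_json_content(cls, value):
        raise NotImplementedError  # subclasses override this

    def to_json(self):
        return {"type": type(self).__name__, "content": self.to_json_content()}

    def to_json_content(self):
        raise NotImplementedError  # subclasses override this


class ImageSketch(FeatureSketch):
    """Hypothetical subclass that knows how to restore itself."""

    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = dtype

    @classmethod
    def from_json_content(cls, value):
        return cls(shape=value["shape"], dtype=value["dtype"])

    def to_json_content(self):
        return {"shape": list(self.shape), "dtype": self.dtype}


spec = {"type": "ImageSketch", "content": {"shape": [32, 32, 3], "dtype": "uint8"}}
feature = FeatureSketch.from_json(spec)
```

The caller only ever touches the base-class factory; the subclass hook stays an implementation detail, which is why calling from_json_content directly is discouraged.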

get_serialized_info

Return the shape/dtype of features after encoding (for the adapter).

The FileAdapter then uses this information to write data to disk.

This function indicates how this feature is encoded on file internally. The DatasetBuilder data is written to disk as tf.train.Example protos.

Ex:

return {
    'image': tfds.features.TensorInfo(shape=(None,), dtype=tf.uint8),
    'height': tfds.features.TensorInfo(shape=(), dtype=tf.int32),
    'width': tfds.features.TensorInfo(shape=(), dtype=tf.int32),
}

FeatureConnectors that are not containers should return the feature proto directly:

return tfds.features.TensorInfo(shape=(64, 64), dtype=tf.uint8)

If not defined, the returned values are automatically deduced from the get_tensor_info function.

Returns
features Either a dict of feature proto objects, or a single feature proto object.
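
The fallback from get_serialized_info to get_tensor_info can be sketched as follows. This is a plain-Python illustration under stated assumptions: InfoSketch is a hypothetical stand-in for a shape/dtype record, and both feature classes are invented for the example rather than taken from TFDS.

```python
from typing import NamedTuple, Optional, Tuple


class InfoSketch(NamedTuple):
    """Hypothetical shape/dtype record used only in this sketch."""
    shape: Tuple[Optional[int], ...]
    dtype: str


class LeafFeatureSketch:
    """Hypothetical non-container feature."""

    def get_tensor_info(self):
        # Shape/dtype of the feature as seen by the user.
        return InfoSketch(shape=(64, 64), dtype="uint8")

    def get_serialized_info(self):
        # Default: the on-disk layout is deduced from get_tensor_info.
        return self.get_tensor_info()


class EncodedFeatureSketch(LeafFeatureSketch):
    """Hypothetical feature whose on-disk form differs from its decoded form."""

    def get_tensor_info(self):
        return InfoSketch(shape=(None, None, 3), dtype="uint8")

    def get_serialized_info(self):
        # Overridden: stored as a single scalar of encoded bytes.
        return InfoSketch(shape=(), dtype="string")


default_info = LeafFeatureSketch().get_serialized_info()
custom_info = EncodedFeatureSketch().get_serialized_info()
```

A feature only needs to override get_serialized_info when what is written to disk differs from what is handed back to the user.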

get_tensor_info

See base class for details.

ints2str

Conversion list[int] => decoded string.

load_metadata

Restore the feature metadata from disk.

If a dataset is re-loaded and generated files exist on disk, this function will restore the feature metadata from the saved file.

Args
data_dir str, path to the dataset folder from which to restore the info (ex: ~/datasets/cifar10/1.2.0/)
feature_name str, the name of the feature (from the FeaturesDict key)

maybe_build_from_corpus

Calls SubwordTextEncoder.build_from_corpus if encoder_cls is SubwordTextEncoder.

If self.encoder is None and self._encoder_cls is of type SubwordTextEncoder, the method instantiates self.encoder as returned by SubwordTextEncoder.build_from_corpus().

Args
corpus_generator generator yielding str, from which subwords will be constructed.
**kwargs kwargs forwarded to SubwordTextEncoder.build_from_corpus()
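
The conditional behavior described above can be sketched in plain Python. This is a minimal sketch, not the TFDS code: ToySubwordEncoder, TextFeatureSketch, and the max_vocab_size keyword are hypothetical, and the toy encoder splits on whitespace rather than building real subwords.

```python
class ToySubwordEncoder:
    """Hypothetical encoder; splits on whitespace instead of real subwords."""

    def __init__(self, vocab):
        self.vocab = list(vocab)

    @classmethod
    def build_from_corpus(cls, corpus_generator, max_vocab_size=None):
        # Collect tokens in first-seen order, then truncate the vocabulary.
        seen = []
        for line in corpus_generator:
            for token in line.split():
                if token not in seen:
                    seen.append(token)
        return cls(seen[:max_vocab_size])


class TextFeatureSketch:
    """Hypothetical feature holding an optional encoder and an encoder class."""

    def __init__(self, encoder=None, encoder_cls=None):
        self.encoder = encoder
        self._encoder_cls = encoder_cls

    def maybe_build_from_corpus(self, corpus_generator, **kwargs):
        # Build only when no encoder exists and the class supports it.
        if self.encoder is None and self._encoder_cls is ToySubwordEncoder:
            self.encoder = ToySubwordEncoder.build_from_corpus(
                corpus_generator, **kwargs)


feature = TextFeatureSketch(encoder_cls=ToySubwordEncoder)
feature.maybe_build_from_corpus(iter(["the cat", "the dog"]), max_vocab_size=2)
vocab_after_build = list(feature.encoder.vocab)

# A second call leaves the existing encoder untouched.
feature.maybe_build_from_corpus(iter(["something else"]))

# Without a matching encoder class, nothing is built.
plain = TextFeatureSketch()
plain.maybe_build_from_corpus(iter(["ignored"]))
```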

maybe_set_encoder

Set encoder, but no-op if encoder is already set.
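
The set-once semantics can be illustrated with a few lines of plain Python. SetOnceSketch is a hypothetical class written for this example, and strings stand in for encoder objects.

```python
class SetOnceSketch:
    """Hypothetical feature that accepts an encoder only if none is set."""

    def __init__(self, encoder=None):
        self.encoder = encoder

    def maybe_set_encoder(self, new_encoder):
        # No-op when an encoder is already present.
        if self.encoder is None:
            self.encoder = new_encoder


holder = SetOnceSketch()
holder.maybe_set_encoder("first")   # stored, since nothing was set
holder.maybe_set_encoder("second")  # ignored, an encoder already exists
```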

repr_html

Text is decoded.

repr_html_batch

Returns the HTML str representation of the object (Sequence).

repr_html_ragged

Returns the HTML str representation of the object (Nested sequence).

save_config

Exports the FeatureConnector to a file.

Args
root_dir path/to/dir containing the features.json
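
A save_config and from_config round trip through a features.json file can be sketched in plain Python. This is an assumption-laden illustration, not the TFDS implementation: ConfigSketch and its settings dict are hypothetical, and only the features.json file name comes from the text above.

```python
import json
import os
import tempfile


class ConfigSketch:
    """Hypothetical feature with JSON-serializable settings."""

    FILENAME = "features.json"

    def __init__(self, settings):
        self.settings = dict(settings)

    def save_config(self, root_dir):
        # Write the settings into <root_dir>/features.json.
        path = os.path.join(root_dir, self.FILENAME)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.settings, f)

    @classmethod
    def from_config(cls, root_dir):
        # Rebuild the feature from <root_dir>/features.json.
        path = os.path.join(root_dir, cls.FILENAME)
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))


with tempfile.TemporaryDirectory() as root_dir:
    ConfigSketch({"dtype": "string"}).save_config(root_dir)
    restored = ConfigSketch.from_config(root_dir)
```

Both methods take the directory rather than the file path, so the file name stays an internal detail of the feature.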

save_metadata

Save the feature metadata on disk.

This function is called after the data has been generated (by _download_and_prepare) to save the feature connector info with the generated dataset.

Some datasets/features dynamically compute info during _download_and_prepare. For instance:

  • Labels are loaded from the downloaded data
  • Vocabulary is created from the downloaded data
  • ImageLabelFolder computes the image dtypes/shapes from the manual_dir

After the info has been added to the feature, this function saves that additional info so it can be restored the next time the data is loaded.

By default, this function does not save anything, but subclasses can override it.

Args
data_dir str, path to the dataset folder in which to save the info (ex: ~/datasets/cifar10/1.2.0/)
feature_name str, the name of the feature (from the FeaturesDict key)
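
How save_metadata and load_metadata pair up can be sketched with a vocabulary that is only known after data generation. This is a minimal plain-Python sketch: VocabFeatureSketch and the "<feature_name>.vocab.txt" file naming are hypothetical choices made for the example, not the TFDS on-disk format.

```python
import os
import tempfile


class VocabFeatureSketch:
    """Hypothetical feature whose vocabulary is computed during generation."""

    def __init__(self, vocab=None):
        self.vocab = vocab

    def _vocab_path(self, data_dir, feature_name):
        # Hypothetical file naming, chosen only for this sketch.
        return os.path.join(data_dir, f"{feature_name}.vocab.txt")

    def save_metadata(self, data_dir, feature_name):
        if self.vocab is None:
            return  # nothing was computed, so nothing is saved
        path = self._vocab_path(data_dir, feature_name)
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(self.vocab))

    def load_metadata(self, data_dir, feature_name):
        path = self._vocab_path(data_dir, feature_name)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.vocab = f.read().splitlines()


with tempfile.TemporaryDirectory() as data_dir:
    # After generation: the vocabulary is known and gets saved.
    VocabFeatureSketch(["hello", "world"]).save_metadata(data_dir, "text")

    # On a later load: a fresh feature restores it from disk.
    reloaded = VocabFeatureSketch()
    reloaded.load_metadata(data_dir, "text")
```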

str2ints

Conversion string => encoded list[int].
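
The str2ints and ints2str conversions are inverses of each other for a given encoder. The plain-Python sketch below illustrates that round trip, assuming both methods simply delegate to the configured encoder. ToyWordEncoder and TextSketch are hypothetical, and reserving id 0 for padding is a convention chosen for this sketch.

```python
class ToyWordEncoder:
    """Hypothetical word-level encoder; id 0 is kept free for padding."""

    def __init__(self, vocab):
        self._vocab = list(vocab)
        self._token_to_id = {tok: i + 1 for i, tok in enumerate(self._vocab)}

    def encode(self, text):
        return [self._token_to_id[tok] for tok in text.split()]

    def decode(self, ids):
        # Skip padding ids and map the rest back to tokens.
        return " ".join(self._vocab[i - 1] for i in ids if i != 0)


class TextSketch:
    """Hypothetical feature that delegates conversions to its encoder."""

    def __init__(self, encoder):
        self.encoder = encoder

    def str2ints(self, str_value):
        return self.encoder.encode(str_value)

    def ints2str(self, int_values):
        return self.encoder.decode(int_values)


feature = TextSketch(ToyWordEncoder(["hello", "world"]))
ids = feature.str2ints("world hello world")
text = feature.ints2str(ids)
```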

to_json

Exports the FeatureConnector to JSON.

Each feature is serialized as a dict(type=..., content=...).

  • type: The canonical name of the feature (module.FeatureName).
  • content: Specific to each feature connector and defined in to_json_content. Can contain nested sub-features (like for tfds.features.FeaturesDict and tfds.features.Sequence).

For example:

tfds.features.FeaturesDict({
    'input': tfds.features.Image(),
    'target': tfds.features.ClassLabel(num_classes=10),
})

Is serialized as:

{
    "type": "tensorflow_datasets.core.features.features_dict.FeaturesDict",
    "content": {
        "input": {
            "type": "tensorflow_datasets.core.features.image_feature.Image",
            "content": {
                "shape": [null, null, 3],
                "dtype": "uint8",
                "encoding_format": "png"
            }
        },
        "target": {
            "type": "tensorflow_datasets.core.features.class_label_feature.ClassLabel",
            "content": {
                "num_classes": 10
            }
        }
    }
}

Returns
A dict(type=, content=). Will be forwarded to from_json when reconstructing the feature.

to_json_content

Exports the FeatureConnector metadata to a dict (to override).

This function should be overridden by the subclass to allow re-importing the feature connector from the config. See existing FeatureConnector subclasses for examples of implementation.

Returns
Dict containing the FeatureConnector metadata. Will be forwarded to from_json_content when reconstructing the feature.