tfio.IOTensor

IOTensor

tfio.IOTensor(
    spec, internal=False
)

An IOTensor is a tensor with data backed by IO operations. For example, an AudioIOTensor is a tensor with data from an audio file, and a KafkaIOTensor is a tensor with data read from the messages of a Kafka stream server.

The IOTensor is indexable, supporting __getitem__() and __len__() methods in Python. In other words, it is a subclass of collections.abc.Sequence.

Example:

import tensorflow_io as tfio

samples = tfio.IOTensor.from_audio("sample.wav")
print(samples[1000:1005])
... tf.Tensor(
... [[-3]
...  [-7]
...  [-6]
...  [-6]
...  [-5]], shape=(5, 1), dtype=int16)

Indexable vs. Iterable

While many IO formats are naturally considered iterable only, in most situations they can still be accessed by index through some workaround. For example, a Kafka stream is not directly indexable, yet the stream could be saved in memory or on disk to allow indexing. Another example is the packet capture (PCAP) file used in networking. The packets inside a PCAP file are concatenated sequentially. Since each packet can have a variable length, the only way to access the packets is to read them one at a time. If the PCAP file is huge (e.g., hundreds of GBs or even TBs), it may not be realistic (or necessary) to keep the index of every packet in memory. We could consider the PCAP format as iterable only.
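
The sequential access pattern described above can be sketched in plain Python. This is a minimal illustration, assuming a simplified record layout (a 4-byte length followed by the payload) rather than the real PCAP layout; the helper names `iter_records` and `build_stream` are hypothetical and not part of tensorflow_io.

```python
import io
import struct


def iter_records(stream):
    """Yield each payload from a stream of length-prefixed records.

    Assumed layout (not real PCAP): a 4-byte little-endian length,
    followed by that many payload bytes.
    """
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of stream, or a truncated header
        (length,) = struct.unpack("<I", header)
        payload = stream.read(length)
        if len(payload) < length:
            return  # truncated record
        yield payload


def build_stream(payloads):
    """Pack payloads of different lengths into one in-memory stream."""
    data = b"".join(struct.pack("<I", len(p)) + p for p in payloads)
    return io.BytesIO(data)


stream = build_stream([b"a", b"bcd", b"ef"])
# The position of record k is only known after reading records 0..k-1,
# so the records are naturally consumed as an iterator.
records = list(iter_records(stream))
print(records)
```

Because the offset of each record depends on the lengths of all records before it, random access would require building an index of offsets first, which is the memory cost discussed above.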

As we can see, the amount of available memory can be a factor in deciding whether a format is indexable. However, this distinction is blurred in distributed computing. One common case is a splittable file format, where a file can be split into multiple chunks (without reading the whole file) with no data overlapping between those chunks. For example, a text file can be reliably split into multiple chunks with line feed (LF) as the boundary. Processing of the chunks can then be distributed across a group of compute nodes to speed it up (by reading small chunks into memory). From that standpoint, we could still consider splittable formats as indexable.
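
As a rough sketch of the splitting idea for text files, the function below seeks to evenly spaced byte offsets and moves each one forward to just past the next LF, so only a few bytes around each boundary are read. The function name `split_offsets` and its parameters are hypothetical and chosen for illustration; this is not how tensorflow_io splits files.

```python
import io


def split_offsets(stream, size, num_chunks):
    """Return (start, end) byte ranges whose boundaries fall right after an LF.

    `stream` is a seekable binary stream holding `size` bytes of text.
    The ranges do not overlap and together cover the whole stream.
    """
    boundaries = [0]
    for i in range(1, num_chunks):
        # Start from an evenly spaced guess, never before the last boundary.
        target = max(size * i // num_chunks, boundaries[-1])
        stream.seek(target)
        stream.readline()  # advance to just past the next LF (or to EOF)
        boundaries.append(min(stream.tell(), size))
    boundaries.append(size)
    # Drop empty ranges, which appear when there are fewer lines than chunks.
    return [(s, e) for s, e in zip(boundaries, boundaries[1:]) if s < e]


data = b"alpha\nbeta\ngamma\ndelta\n"
ranges = split_offsets(io.BytesIO(data), len(data), 3)
chunks = [data[start:end] for start, end in ranges]
print(ranges)
print(chunks)
```

Each range can then be handed to a different worker, which reads only its own bytes.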

For that reason, our focus is an IOTensor with convenient indexing and slicing through the __getitem__() method.

Lazy Read

One useful feature of IOTensor is lazy read. Data inside a file is not read into memory until needed. This is convenient when only a small segment of the data is needed. For example, a WAV file can be gigabytes in size, but in many cases only several seconds of samples are used for training or inference.

While CPU memory is cheap nowadays, GPU memory is still considered an expensive resource. Fitting data in GPU memory is also imperative for speed. From that perspective, lazy read can be very helpful.
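
To make the idea concrete, here is a minimal sketch of a lazily read, indexable sequence built on Python's standard `wave` module. It is an illustration only and not the tensorflow_io implementation: the class name `LazyWavSamples` is hypothetical, and the sketch assumes a mono, 16-bit WAV file and slices with step 1.

```python
import collections.abc
import os
import struct
import tempfile
import wave


class LazyWavSamples(collections.abc.Sequence):
    """Index into mono 16-bit WAV samples, reading only what is requested."""

    def __init__(self, filename):
        self._filename = filename
        with wave.open(filename, "rb") as wav:
            # Only the header is parsed here; no sample data is loaded.
            self.rate = wav.getframerate()
            self._length = wav.getnframes()

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(self._length)
            if step != 1:
                raise ValueError("this sketch supports only step 1")
            count = max(stop - start, 0)
        else:
            if index < 0:
                index += self._length
            if not 0 <= index < self._length:
                raise IndexError(index)
            start, count = index, 1
        with wave.open(self._filename, "rb") as wav:
            wav.setpos(start)  # seek to the first requested frame
            raw = wav.readframes(count)  # read only `count` frames
        # The wave module returns frames in native byte order.
        values = list(struct.unpack("=%dh" % count, raw))
        return values if isinstance(index, slice) else values[0]


# Write a tiny WAV file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    wav.writeframes(struct.pack("=10h", *range(10)))

samples = LazyWavSamples(path)
print(len(samples), samples.rate, samples[3:6])
```

Only the requested frames are read on each access, and the sample rate travels with the object as an attribute.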

Association of Metadata

While a file format could consist mostly of numeric data, in many situations the metadata is important as well. For example, in an audio file format the sample rate is a number that is necessary for almost everything. Associating the sample rate with the int16 Tensor of samples is more helpful, especially in eager mode.

Example:

import tensorflow_io as tfio

samples = tfio.IOTensor.from_audio("sample.wav")
print(samples.rate)
... 44100

Nested Element Structure

The concept of IOTensor is not limited to a Tensor of a single data type. It supports a nested element structure, which could consist of many components and complex structures. Exposed APIs such as shape() or dtype() will display the shape and data type of an individual Tensor, or a nested structure of shapes and data types for the components of a composite Tensor.

Example:

import tensorflow_io as tfio

samples = tfio.IOTensor.from_audio("sample.wav")
print(samples.shape)
... (22050, 2)
print(samples.dtype)
...

features = tfio.IOTensor.from_json("feature.json")
print(features.shape)
... (TensorShape([Dimension(2)]), TensorShape([Dimension(2)]))
print(features.dtype)
... (tf.float64, tf.int64)

Access Columns of Tabular Data Formats

Many file formats such as Parquet or JSON are considered tabular because they consist of columns in a table. With IOTensor it is possible to access individual columns through __call__().

Example:

import tensorflow_io as tfio

features = tfio.IOTensor.from_json("feature.json")
print(features.shape("floatfeature"))
... (2,)
print(features.dtype("floatfeature"))
...

print(features("floatfeature").shape)
... (2,)
print(features("floatfeature").dtype)
...

Conversion to Tensor and Dataset

When needed, an IOTensor can be converted into a Tensor (through to_tensor()) or a tf.data.Dataset (through to_dataset()), to support operations that are only available through Tensor or tf.data.Dataset.

Example:

import tensorflow as tf
import tensorflow_io as tfio

features = tfio.IOTensor.from_json("feature.json")

features_tensor = features.to_tensor()
print(features_tensor)
... (,
...  )

features_dataset = features.to_dataset()
print(features_dataset)
... <_IOTensorDataset shapes: ((), ()), types: (tf.float64, tf.int64)>

dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))

Attributes
`spec`	The `TensorSpec` of values in this tensor.

Methods

`from_arrow`

@classmethod
from_arrow(
    table, spec=None, **kwargs
)

Creates an IOTensor from a pyarrow.Table.

Args
`table`	An instance of a `pyarrow.Table`.
`spec`	A dict of `dataset:tf.TensorSpec` or `dataset:dtype` pairs that specify the dataset selected and the tf.TensorSpec or dtype of the dataset. In eager mode the spec is probed automatically. In graph mode `spec` is required and columns in the `pyarrow.Table` can be keyed by column name or index.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_audio`

@classmethod
from_audio(
    filename, **kwargs
)

Creates an IOTensor from an audio file.

The following audio file formats are supported:

WAV
Flac
Vorbis
MP3

Args
`filename`	A string, the filename of an audio file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_avro`

@classmethod
from_avro(
    filename, schema, **kwargs
)

Creates an IOTensor from an Avro file.

Args
`filename`	A string, the filename of an Avro file.
`schema`	A string, the schema of an Avro file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_csv`

@classmethod
from_csv(
    filename, **kwargs
)

Creates an IOTensor from a CSV file.

Args
`filename`	A string, the filename of a CSV file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_feather`

@classmethod
from_feather(
    filename, **kwargs
)

Creates an IOTensor from a Feather file.

Args
`filename`	A string, the filename of a Feather file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_ffmpeg`

@classmethod
from_ffmpeg(
    filename, **kwargs
)

Creates an IOTensor from an audio/video file.

Args
`filename`	A string, the filename of an audio/video file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_hdf5`

@classmethod
from_hdf5(
    filename, spec=None, **kwargs
)

Creates an IOTensor from an HDF5 file.

Args
`filename`	A string, the filename of an HDF5 file.
`spec`	A dict of `dataset:tf.TensorSpec` or `dataset:dtype` pairs that specify the dataset selected and the tf.TensorSpec or dtype of the dataset. In eager mode the spec is probed automatically. In graph mode `spec` has to be specified.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_json`

@classmethod
from_json(
    filename, **kwargs
)

Creates an IOTensor from a JSON file.

Args
`filename`	A string, the filename of a JSON file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_kafka`

@classmethod
from_kafka(
    topic, partition=0, servers=None, configuration=None, **kwargs
)

Creates an IOTensor from a Kafka stream.

Args
`topic`	A `tf.string` tensor containing topic subscription.
`partition`	A `tf.int64` tensor containing the partition, by default 0.
`servers`	An optional list of bootstrap servers, by default `localhost:9092`.
`configuration`	An optional `tf.string` tensor containing configurations in [Key=Value] format. Configurations are of the following types. Global configuration: please refer to 'Global configuration properties' in the librdkafka documentation; examples include ["enable.auto.commit=false", "heartbeat.interval.ms=2000"]. Topic configuration: please refer to 'Topic configuration properties' in the librdkafka documentation. Note that all topic configurations should be prefixed with `configuration.topic.`; examples include ["conf.topic.auto.offset.reset=earliest"].
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_lmdb`

@classmethod
from_lmdb(
    filename, **kwargs
)

Creates an IOTensor from an LMDB key/value store.

Args
`filename`	A string, the filename of an LMDB file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_parquet`

@classmethod
from_parquet(
    filename, **kwargs
)

Creates an IOTensor from a Parquet file.

Args
`filename`	A string, the filename of a Parquet file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`from_tensor`

@classmethod
from_tensor(
    tensor, **kwargs
)

Converts a tf.Tensor into an IOTensor.

Args
`tensor`	The `Tensor` to convert.

Returns
An `IOTensor`.

Raises
`ValueError`	If tensor is not a `Tensor`.

`from_tiff`

@classmethod
from_tiff(
    filename, **kwargs
)

Creates an IOTensor from a TIFF file.

Note that a TIFF file may consist of multiple images with different shapes.

Args
`filename`	A string, the filename of a TIFF file.
`name`	A name prefix for the IOTensor (optional).

Returns
An `IOTensor`.

`graph`

@classmethod
graph(
    dtype
)

Obtain a GraphIOTensor to be used in graph mode.

Args
`dtype`	Data type of the GraphIOTensor.

Returns
A class of `GraphIOTensor`.
