Help protect the Great Barrier Reef with TensorFlow on Kaggle Join Challenge

Apache ORC Reader

View on TensorFlow.org Run in Google Colab View on GitHub Download notebook

Overview

Apache ORC is a popular columnar storage format. tensorflow-io package provides a default implementation of reading Apache ORC files.

Setup

Install required packages, and restart runtime

pip install tensorflow-io
import tensorflow as tf
import tensorflow_io as tfio
2021-07-30 12:26:35.624072: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0

Download a sample dataset file in ORC

The dataset you will use here is the Iris Data Set from UCI. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, (4) petal width, and the last column contains the class label.

curl -OL https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc
ls -l iris.orc
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   144  100   144    0     0   1180      0 --:--:-- --:--:-- --:--:--  1180
100  3328  100  3328    0     0  13419      0 --:--:-- --:--:-- --:--:--     0
-rw-rw-r-- 1 kbuilder kokoro 3328 Jul 30 12:26 iris.orc

Create a dataset from the file

dataset = tfio.IODataset.from_orc("iris.orc", capacity=15).batch(1)
2021-07-30 12:26:37.779732: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 AVX512F FMA
2021-07-30 12:26:37.887808: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-30 12:26:37.979733: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-07-30 12:26:37.979781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (kokoro-gcp-ubuntu-prod-1874323723): /proc/driver/nvidia/version does not exist
2021-07-30 12:26:37.980766: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-30 12:26:37.984832: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>

Examine the dataset:

for item in dataset.take(1):
    print(item)
(<tf.Tensor: shape=(1,), dtype=float32, numpy=array([5.1], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([3.5], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.4], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.2], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'setosa'], dtype=object)>)
2021-07-30 12:26:38.167628: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-30 12:26:38.168103: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2000170000 Hz

Let's walk through an end-to-end example of tf.keras model training with ORC dataset based on iris dataset.

Data preprocessing

Configure which columns are features, and which column is label:

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
label_cols = ["species"]

# select feature columns
feature_dataset = tfio.IODataset.from_orc("iris.orc", columns=feature_cols)
# select label columns
label_dataset = tfio.IODataset.from_orc("iris.orc", columns=label_cols)
2021-07-30 12:26:38.222712: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>
2021-07-30 12:26:38.286470: I tensorflow_io/core/kernels/orc/orc_kernels.cc:49] ORC file schema:struct<sepal_length:float,sepal_width:float,petal_length:float,petal_width:float,species:string>

A util function to map species to float numbers for model training:

vocab_init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["virginica", "versicolor", "setosa"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
vocab_table = tf.lookup.StaticVocabularyTable(
    vocab_init,
    num_oov_buckets=4)
label_dataset = label_dataset.map(vocab_table.lookup)
dataset = tf.data.Dataset.zip((feature_dataset, label_dataset))
dataset = dataset.batch(1)

def pack_features_vector(features, labels):
    """Pack the features into a single array."""
    features = tf.stack(list(features), axis=1)
    return features, labels

dataset = dataset.map(pack_features_vector)

Build, compile and train the model

Finally, you are ready to build the model and train it! You will build a 3 layer keras model to predict the class of the iris plant from the dataset you just processed.

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(
            10, activation=tf.nn.relu, input_shape=(4,)
        ),
        tf.keras.layers.Dense(10, activation=tf.nn.relu),
        tf.keras.layers.Dense(3),
    ]
)

model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=["accuracy"])
model.fit(dataset, epochs=5)
Epoch 1/5
150/150 [==============================] - 0s 1ms/step - loss: 1.3479 - accuracy: 0.4800
Epoch 2/5
150/150 [==============================] - 0s 920us/step - loss: 0.8355 - accuracy: 0.6000
Epoch 3/5
150/150 [==============================] - 0s 951us/step - loss: 0.6370 - accuracy: 0.7733
Epoch 4/5
150/150 [==============================] - 0s 954us/step - loss: 0.5276 - accuracy: 0.7933
Epoch 5/5
150/150 [==============================] - 0s 940us/step - loss: 0.4766 - accuracy: 0.7933
<tensorflow.python.keras.callbacks.History at 0x7f263b830850>