Overview

This tutorial demonstrates the tfio.genome package, which provides commonly used genomics IO functionality: reading several genomics file formats and performing common data preparation operations (for example, one-hot encoding sequences or converting Phred quality scores into probabilities).

This package uses the Google Nucleus library to provide some of the core functionality.

Setup

try:
  %tensorflow_version 2.x
except Exception:
  pass
!pip install -q tensorflow-io
import tensorflow_io as tfio
import tensorflow as tf

FASTQ Data

FASTQ is a common genomics file format that stores both sequence information and base quality information.
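To make the file layout concrete, here is a minimal pure-Python sketch of a FASTQ reader. It assumes the common four-line record layout (an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence). The `FastqRecord` type, the `parse_fastq` helper, and the read names in the sample are hypothetical and exist only for illustration; this is not how tfio.genome reads files.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    # Drop blank lines so a trailing newline does not break the grouping.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-record input; the read names are made up for illustration.
sample = """@read_1
GATTACA
+
BB>B@FA
@read_2
CGG
+
FAD
"""

records = list(parse_fastq(sample))
for record in records:
    print(record.name, record.sequence, record.quality)
```

Each base in the sequence line has exactly one quality character at the same position in the quality line, which is why the two strings must be the same length.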

First, let's download a sample fastq file.

# Download some sample data:
!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   407  100   407    0     0   1118      0 --:--:-- --:--:-- --:--:--  1121

Read FASTQ Data

Now, let's use tfio.genome.read_fastq to read this file (note that a tf.data API is coming soon).

fastq_data = tfio.genome.read_fastq(filename="test.fastq")
print(fastq_data.sequences)
print(fastq_data.raw_quality)
tf.Tensor(
[b'GATTACA'
 b'CGTTAGCGCAGGGGGCATCTTCACACTGGTGACAGGTAACCGCCGTAGTAAAGGTTCCGCCTTTCACT'
 b'CGGCTGGTCAGGCTGACATCGCCGCCGGCCTGCAGCGAGCCGCTGC' b'CGG'], shape=(4,), dtype=string)
tf.Tensor(
[b'BB>B@FA'
 b'AAAAABF@BBBDGGGG?FFGFGHBFBFBFABBBHGGGFHHCEFGGGGG?FGFFHEDG3EFGGGHEGHG'
 b'FAFAF;F/9;.:/;999B/9A.DFFF;-->.AAB/FC;9-@-=;=.' b'FAD'], shape=(4,), dtype=string)

As you can see, the returned fastq_data has two fields: fastq_data.sequences, a string tensor of all sequences in the FASTQ file (each of which can be a different length), and fastq_data.raw_quality, which holds the Phred-encoded quality of each base read in the corresponding sequence.

Quality

If needed, you can use a helper op to convert this quality information into probabilities.

quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)
print(quality.shape)
print(quality.row_lengths().numpy())
print(quality)
(4, None, 1)
[ 7 68 46  3]
<tf.RaggedTensor [[[0.0005011872854083776], [0.0005011872854083776], [0.0012589250691235065], [0.0005011872854083776], [0.0007943279924802482], [0.00019952621369156986], [0.0006309573072940111]], [[0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0005011872854083776], [0.00019952621369156986], [0.0007943279924802482], [0.0005011872854083776], [0.0005011872854083776], [0.0005011872854083776], [0.0003162277571391314], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0010000000474974513], [0.00019952621369156986], [0.00019952621369156986], [0.0001584893325343728], [0.00019952621369156986], [0.0001584893325343728], [0.00012589251855388284], [0.0005011872854083776], [0.00019952621369156986], [0.0005011872854083776], [0.00019952621369156986], [0.0005011872854083776], [0.00019952621369156986], [0.0006309573072940111], [0.0005011872854083776], [0.0005011872854083776], [0.0005011872854083776], [0.00012589251855388284], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.00019952621369156986], [0.00012589251855388284], [0.00012589251855388284], [0.0003981070767622441], [0.0002511885541025549], [0.00019952621369156986], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0010000000474974513], [0.00019952621369156986], [0.0001584893325343728], [0.00019952621369156986], [0.00019952621369156986], [0.00012589251855388284], [0.0002511885541025549], [0.0003162277571391314], [0.0001584893325343728], [0.015848929062485695], [0.0002511885541025549], [0.00019952621369156986], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.00012589251855388284], [0.0002511885541025549], [0.0001584893325343728], [0.00012589251855388284], [0.0001584893325343728]], [[0.00019952621369156986], [0.0006309573072940111], [0.00019952621369156986], 
[0.0006309573072940111], [0.00019952621369156986], [0.0025118854828178883], [0.00019952621369156986], [0.03981071710586548], [0.003981070592999458], [0.0025118854828178883], [0.050118714570999146], [0.003162277629598975], [0.03981071710586548], [0.0025118854828178883], [0.003981070592999458], [0.003981070592999458], [0.003981070592999458], [0.0005011872854083776], [0.03981071710586548], [0.003981070592999458], [0.0006309573072940111], [0.050118714570999146], [0.0003162277571391314], [0.00019952621369156986], [0.00019952621369156986], [0.00019952621369156986], [0.0025118854828178883], [0.06309572607278824], [0.06309572607278824], [0.0012589250691235065], [0.050118714570999146], [0.0006309573072940111], [0.0006309573072940111], [0.0005011872854083776], [0.03981071710586548], [0.00019952621369156986], [0.0003981070767622441], [0.0025118854828178883], [0.003981070592999458], [0.06309572607278824], [0.0007943279924802482], [0.06309572607278824], [0.001584893325343728], [0.0025118854828178883], [0.001584893325343728], [0.050118714570999146]], [[0.00019952621369156986], [0.0006309573072940111], [0.0003162277571391314]]]>
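The values above are consistent with the standard Phred relationship, where a quality score Q corresponds to a probability of 10^(-Q/10) and each quality character encodes Q as its ASCII code minus 33. The sketch below reproduces that arithmetic in plain Python for the first read. The offset of 33 is an assumption inferred from the printed output rather than taken from the library's documentation, and `phred_to_probabilities` is a hypothetical helper, not part of tfio.genome.

```python
from typing import List

PHRED_OFFSET = 33  # Assumption: Phred+33 ASCII encoding.


def phred_to_probabilities(quality: str, offset: int = PHRED_OFFSET) -> List[float]:
    """Convert a Phred-encoded quality string to per-base probabilities."""
    probabilities = []
    for char in quality:
        score = ord(char) - offset  # Phred quality score Q
        probabilities.append(10 ** (-score / 10))  # P = 10^(-Q/10)
    return probabilities


# Quality string of the first read in test.fastq, as printed earlier.
probs = phred_to_probabilities("BB>B@FA")
print(probs)
```

For example, `B` has ASCII code 66, so Q = 33 and the probability is about 0.0005, matching the first entry of the ragged tensor. Small differences in the trailing digits are expected because the op returns 32-bit floats while Python uses 64-bit floats.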

One hot encodings

You may also want to encode the genome sequence data (which consists of the bases A, T, C, and G) using a one-hot encoder. There is a built-in operation that can help with this.

one_hot = tfio.genome.sequences_to_onehot(fastq_data.sequences)
print(one_hot)
print(one_hot.shape)
<tf.RaggedTensor [[[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]]]>
(4, None, 4)

print(tfio.genome.sequences_to_onehot.__doc__)
Convert DNA sequences into a one hot nucleotide encoding.

  Each nucleotide in each sequence is mapped as follows:
  A -> [1, 0, 0, 0]
  C -> [0, 1, 0, 0]
  G -> [0 ,0 ,1, 0]
  T -> [0, 0, 0, 1]

  If for some reason a non (A, T, C, G) character exists in the string, it is
  currently mapped to a error one hot encoding [1, 1, 1, 1].

  Args:
    sequences: A tf.string tensor where each string represents a DNA sequence

  Returns:
    tf.RaggedTensor: The output sequences with nucleotides one hot encoded.
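The mapping in the docstring can be sketched in plain Python as a lookup table, which may help when checking the op's output by hand. `sequence_to_onehot` is a hypothetical helper, not the library implementation, and treating every character other than uppercase A, C, G, or T (including lowercase bases) as an error is an assumption of this sketch.

```python
from typing import Dict, List

# Mapping copied from the docstring of tfio.genome.sequences_to_onehot.
ONE_HOT: Dict[str, List[int]] = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
ERROR_ENCODING: List[int] = [1, 1, 1, 1]  # Used for any other character.


def sequence_to_onehot(sequence: str) -> List[List[int]]:
    """One-hot encode a single DNA sequence, one row per nucleotide."""
    # Copy each row so callers can modify the result without touching the table.
    return [list(ONE_HOT.get(base, ERROR_ENCODING)) for base in sequence]


encoded = sequence_to_onehot("GATTACA")
for base, row in zip("GATTACA", encoded):
    print(base, row)
```

The rows for GATTACA agree with the first entry of the ragged tensor printed earlier. Because each sequence yields a different number of rows, the op returns a tf.RaggedTensor of shape (4, None, 4) rather than a dense tensor.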