• Description:

Bacteria identification based on genomic sequences holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on out-of-distribution (OOD) genomic sequences from new bacteria that were not present in the training data.

We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. Because new bacterial classes are discovered gradually over the years, grouping classes by discovery year is a natural way to mimic in-distribution and OOD examples.

The dataset contains genomic sequences sampled from 10 bacteria classes discovered before 2011 (the in-distribution classes), 60 bacteria classes discovered between 2011 and 2016 (OOD for validation), and another 60 different bacteria classes discovered after 2016 (OOD for test), for a total of 130 bacteria classes. Training, validation, and test data are provided for the in-distribution classes; only validation and test data are provided for the OOD classes, because OOD data is by its nature not available at training time.

Each genomic sequence is 250 characters long and composed of the characters {A, C, G, T}. Each class has 100,000 examples in the training set and 10,000 examples in each of the validation and test sets.
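As an illustration of the sequence format described above, the following minimal sketch converts a 250-character string over {A, C, G, T} into integer ids and one-hot rows. The helper names (`encode_sequence`, `one_hot`) and the id assignment A=0, C=1, G=2, T=3 are assumptions made for this example and are not part of the dataset; the input string is synthetic.

```python
# Hypothetical helpers for illustration; not part of the dataset or its loader.
ALPHABET = "ACGT"
SEQ_LEN = 250  # sequence length stated in the description
CHAR_TO_ID = {ch: i for i, ch in enumerate(ALPHABET)}  # assumed id assignment


def encode_sequence(seq):
    """Return integer ids (A=0, C=1, G=2, T=3) for a DNA string."""
    if len(seq) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} characters, got {len(seq)}")
    unknown = set(seq) - set(ALPHABET)
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return [CHAR_TO_ID[ch] for ch in seq]


def one_hot(ids):
    """Expand integer ids into one-hot rows of width len(ALPHABET)."""
    width = len(ALPHABET)
    return [[1 if j == i else 0 for j in range(width)] for i in ids]


# Synthetic 250-character sequence, not real data.
example = "ACGT" * 62 + "AC"
ids = encode_sequence(example)
rows = one_hot(ids)
print(len(ids), ids[:4], rows[1])
```

Integer ids suit embedding layers, while one-hot rows suit convolutional inputs; which encoding to use depends on the model.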

For each example, the features include:
  • seq: the input DNA sequence, composed of {A, C, G, T}.
  • label: the name of the bacteria class.
  • seq_info: the source of the DNA sequence, i.e., the genome name, NCBI accession number, and the position where it was sampled from.
  • domain: whether the bacteria class is in-distribution (in) or OOD (ood).
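To make the per-example fields concrete, here is a minimal sketch that checks a single example, represented as a plain Python dict, against the description above. The function `check_record` and the sample records are hypothetical; in particular, the `label` and `seq_info` values are placeholders and do not reflect real entries or the real formatting of those fields.

```python
# Hypothetical validation helper; field names follow the description above.
REQUIRED_FIELDS = ("seq", "label", "seq_info", "domain")
VALID_BASES = set("ACGT")
VALID_DOMAINS = {"in", "ood"}


def check_record(record):
    """Return a list of problems found in one example dict (empty if none)."""
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in record]
    if "seq" in record and set(record["seq"]) - VALID_BASES:
        problems.append("seq contains characters outside {A, C, G, T}")
    if "domain" in record and record["domain"] not in VALID_DOMAINS:
        problems.append("domain must be 'in' or 'ood'")
    return problems


# Synthetic records with placeholder values, not real data.
good = {
    "seq": "ACGT" * 62 + "AC",
    "label": "placeholder_class_name",
    "seq_info": "placeholder source description",
    "domain": "in",
}
bad = {"seq": "ACGN", "label": "placeholder_class_name", "domain": "unknown"}

print(check_record(good))
print(check_record(bad))
```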

Further details of the dataset can be found in the supplemental material of the paper.

Split              Examples
'test'             100,000
'test_ood'         600,000
'train'            1,000,000
'validation'       100,000
'validation_ood'   600,000
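The split sizes follow from the class counts and per-class sample sizes given in the description. The sketch below recomputes them and compares the result with the listed counts; the variable names are chosen for this example only.

```python
# Class counts and per-class sample sizes as stated in the description.
IN_DIST_CLASSES = 10
OOD_VALIDATION_CLASSES = 60
OOD_TEST_CLASSES = 60
TRAIN_PER_CLASS = 100_000
EVAL_PER_CLASS = 10_000  # applies to each validation and test split

expected = {
    "train": IN_DIST_CLASSES * TRAIN_PER_CLASS,
    "validation": IN_DIST_CLASSES * EVAL_PER_CLASS,
    "test": IN_DIST_CLASSES * EVAL_PER_CLASS,
    "validation_ood": OOD_VALIDATION_CLASSES * EVAL_PER_CLASS,
    "test_ood": OOD_TEST_CLASSES * EVAL_PER_CLASS,
}

# Example counts copied from the split listing.
listed = {
    "test": 100_000,
    "test_ood": 600_000,
    "train": 1_000_000,
    "validation": 100_000,
    "validation_ood": 600_000,
}

for name in sorted(listed):
    status = "ok" if expected[name] == listed[name] else "mismatch"
    print(f"{name:<15}{listed[name]:>10,}  {status}")

total_classes = IN_DIST_CLASSES + OOD_VALIDATION_CLASSES + OOD_TEST_CLASSES
print("total classes:", total_classes)
```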
  • Feature structure:
    'domain': Text(shape=(), dtype=string),
    'label': ClassLabel(shape=(), dtype=int64, num_classes=130),
    'seq': Text(shape=(), dtype=string),
    'seq_info': Text(shape=(), dtype=string),
  • Feature documentation:
Feature    Class        Shape   Dtype    Description
domain     Text                 string
label      ClassLabel           int64
seq        Text                 string
seq_info   Text                 string
  • Citation:
  title={Likelihood ratios for out-of-distribution detection},
  author={Ren, Jie and
  Liu, Peter J and
  Fertig, Emily and
  Snoek, Jasper and
  Poplin, Ryan and
  Depristo, Mark and
  Dillon, Joshua and
  Lakshminarayanan, Balaji},
  booktitle={Advances in Neural Information Processing Systems},