
Transfer Learning with YAMNet for environmental sound classification


YAMNet is an audio event classifier that can predict audio events from 521 classes, like laughter, barking, or a siren.

In this tutorial you will learn how to:

  • Load and use the YAMNet model for inference.
  • Build a new model using the YAMNet embeddings to classify cat and dog sounds.
  • Evaluate and export your model.

Import TensorFlow and other libraries

Start by installing TensorFlow I/O, which will make it easier for you to load audio files off disk.

pip install -q tensorflow_io
import os

from IPython import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub
import tensorflow_io as tfio

About YAMNet

YAMNet is an audio event classifier that takes audio waveform as input and makes independent predictions for each of 521 audio events from the AudioSet ontology.

Internally, the model extracts "frames" from the audio signal and processes batches of these frames. This version of the model uses frames that are 0.96s long and extracts one frame every 0.48s.
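
As a rough illustration of the framing described above, the sketch below lists the start times of 0.96 s windows taken every 0.48 s. The helper name `frame_starts` is hypothetical, and the sketch assumes no padding at the end of the clip; the real model may treat the clip boundary differently and so return a slightly different number of frames.

```python
def frame_starts(duration_s, window_s=0.96, hop_s=0.48):
    """Start times (in seconds) of every full window that fits in the clip.

    Assumption: no padding at the end of the clip. Integer milliseconds are
    used internally to avoid floating-point drift.
    """
    duration_ms = round(duration_s * 1000)
    window_ms = round(window_s * 1000)
    hop_ms = round(hop_s * 1000)
    last_start_ms = duration_ms - window_ms
    return [ms / 1000 for ms in range(0, last_start_ms + 1, hop_ms)]


# A 2-second clip holds three full windows; consecutive windows overlap by half.
print(frame_starts(2.0))  # [0.0, 0.48, 0.96]
```

Because the hop is half the window length, each sample (away from the clip edges) contributes to two frames.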

The model accepts a 1-D float32 Tensor or NumPy array containing a waveform of arbitrary length, represented as mono 16 kHz samples in the range [-1.0, +1.0]. This tutorial contains code to help you convert a .wav file into the correct format.
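
To make the expected input format concrete, here is a minimal NumPy sketch that turns 16-bit PCM samples into a mono float32 waveform in the range [-1.0, +1.0]. The helper name `to_float_mono` and the sample values are hypothetical, and the sketch does not resample, so it assumes the audio is already at 16 kHz.

```python
import numpy as np


def to_float_mono(pcm):
    """Convert int16 PCM, shaped (samples,) or (samples, channels),
    to a 1-D float32 waveform scaled into [-1.0, 1.0)."""
    audio = np.asarray(pcm, dtype=np.float32) / 32768.0
    if audio.ndim == 2:
        # Downmix by averaging the channels.
        audio = audio.mean(axis=1)
    return audio.astype(np.float32)


# Three hypothetical stereo samples.
stereo = np.array([[0, 0], [16384, 16384], [-32768, -32768]], dtype=np.int16)
mono = to_float_mono(stereo)
print(mono.dtype, mono.shape, mono)
```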

The model returns three outputs: the class scores, the embeddings (which you will use for transfer learning), and the log mel spectrogram. This tutorial walks you through using them in practice.

One specific use of YAMNet is as a high-level feature extractor: the 1024-D embedding output of YAMNet can be used as the input features of another shallow model which can then be trained on a small amount of data for a particular task. This allows the quick creation of specialized audio classifiers without requiring a lot of labeled data and without having to train a large model end-to-end.

You will use YAMNet's embeddings output for transfer learning and train one or more Dense layers on top of this.

First, you will try the model and see the results of classifying audio. You will then construct the data pre-processing pipeline.

Loading YAMNet from TensorFlow Hub

You are going to use YAMNet from TensorFlow Hub to extract the embeddings from the sound files.

Loading a model from TensorFlow Hub is straightforward: choose the model, copy its URL and use the load function.

yamnet_model_handle = ''
yamnet_model = hub.load(yamnet_model_handle)

With the model loaded, and following the model's basic usage tutorial, you'll download a sample WAV file and run inference.

testing_wav_file_name = tf.keras.utils.get_file('miaow_16k.wav',

Downloading data from
221184/215546 [==============================] - 0s 0us/step

You will need a function to load the audio files. It will also be used later when working with the training data.

# Utility function for loading audio files and ensuring the correct sample rate

def load_wav_16k_mono(filename):
    """Load a WAV file, convert it to a float tensor, and resample to 16 kHz mono."""
    file_contents = tf.io.read_file(filename)
    wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav
testing_wav_data = load_wav_16k_mono(testing_wav_file_name)

_ = plt.plot(testing_wav_data)

# Play the audio file.
display.Audio(testing_wav_data, rate=16000)

Load the class mapping

It's important to load the class names that YAMNet is able to recognize. The mapping file is located at yamnet_model.class_map_path(), in CSV format.

class_map_path = yamnet_model.class_map_path().numpy().decode('utf-8')
class_names = list(pd.read_csv(class_map_path)['display_name'])

for name in class_names[:20]:
  print(name)
Child speech, kid speaking
Narration, monologue
Speech synthesizer
Children shouting
Baby laughter
Belly laugh
Chuckle, chortle
Crying, sobbing

Run inference

YAMNet provides frame-level class scores (i.e., 521 scores for every frame). To determine clip-level predictions, the scores can be aggregated per class across frames (e.g., using mean or max aggregation). This is done below with tf.reduce_mean(scores, axis=0). Finally, to find the top-scored class at the clip level, take the index of the maximum of the 521 aggregated scores.

scores, embeddings, spectrogram = yamnet_model(testing_wav_data)
class_scores = tf.reduce_mean(scores, axis=0)
top_class = tf.argmax(class_scores)
infered_class = class_names[top_class]

print(f'The main sound is: {infered_class}')
print(f'The embeddings shape: {embeddings.shape}')
The main sound is: Animal
The embeddings shape: (13, 1024)
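
The choice between mean and max aggregation can matter. The NumPy sketch below uses made-up scores for 3 frames and 4 classes (the real model produces 521 classes) to show a case where the two strategies pick different clip-level classes.

```python
import numpy as np

# Hypothetical frame-level scores: 3 frames x 4 classes.
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.2, 0.6, 0.1, 0.1],
                   [0.1, 0.1, 0.9, 0.0]], dtype=np.float32)

# Aggregate per class across frames (axis 0), then pick the best class.
mean_scores = scores.mean(axis=0)
max_scores = scores.max(axis=0)

top_by_mean = int(mean_scores.argmax())  # class that is consistently present
top_by_max = int(max_scores.argmax())    # class with the single strongest frame

print(top_by_mean, top_by_max)  # 1 2
```

Mean aggregation favors a class that scores well throughout the clip, while max aggregation favors a class with one very confident frame.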

ESC-50 dataset

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings (each 5 seconds long). The data consists of 50 classes, with 40 examples per class.

Next, you will download and extract it.

_ = tf.keras.utils.get_file('',
Downloading data from
645693440/Unknown - 44s 0us/step

Explore the data

The metadata for each file is specified in the CSV file at ./datasets/ESC-50-master/meta/esc50.csv, and all the audio files are in ./datasets/ESC-50-master/audio/

You will create a pandas dataframe with the mapping and use that to have a clearer view of the data.

esc50_csv = './datasets/ESC-50-master/meta/esc50.csv'
base_data_path = './datasets/ESC-50-master/audio/'

pd_data = pd.read_csv(esc50_csv)

Filter the data

Given the data in the dataframe, you will apply some transformations:

  • filter out rows and use only the selected classes (dog and cat). If you want to use any other classes, this is where you can choose them.
  • change the filename to have the full path. This will make loading easier later.
  • change targets to be within a specific range. In this example, dog will remain 0, but cat will become 1 instead of its original value of 5.
my_classes = ['dog', 'cat']
map_class_to_id = {'dog':0, 'cat':1}

filtered_pd = pd_data[pd_data.category.isin(my_classes)]

class_id = filtered_pd['category'].apply(lambda name: map_class_to_id[name])
filtered_pd = filtered_pd.assign(target=class_id)

full_path = filtered_pd['filename'].apply(lambda row: os.path.join(base_data_path, row))
filtered_pd = filtered_pd.assign(filename=full_path)


Load the audio files and retrieve embeddings

Here you'll apply the load_wav_16k_mono and prepare the wav data for the model.

When extracting embeddings from the wav data, you get an array of shape (N, 1024) where N is the number of frames that YAMNet found (one for every 0.48 seconds of audio).

Your model will use each frame as one input, so you need to create a new column that has one frame per row. You also need to expand the labels and the fold column to properly reflect these new rows.

The expanded fold column keeps the original value. You cannot mix frames because, when doing the splits, you might end up with parts of the same audio in different splits, which would make your validation and test steps less effective.

filenames = filtered_pd['filename']
targets = filtered_pd['target']
folds = filtered_pd['fold']

main_ds = tf.data.Dataset.from_tensor_slices((filenames, targets, folds))
main_ds.element_spec
(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))
def load_wav_for_map(filename, label, fold):
  return load_wav_16k_mono(filename), label, fold

main_ds = main_ds.map(load_wav_for_map)
main_ds.element_spec
(TensorSpec(shape=<unknown>, dtype=tf.float32, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))
# applies the embedding extraction model to a wav data
def extract_embedding(wav_data, label, fold):
  ''' run YAMNet to extract embedding from the wav data '''
  scores, embeddings, spectrogram = yamnet_model(wav_data)
  num_embeddings = tf.shape(embeddings)[0]
  return (embeddings,
            tf.repeat(label, num_embeddings),
            tf.repeat(fold, num_embeddings))

# extract embedding
main_ds = main_ds.map(extract_embedding).unbatch()
main_ds.element_spec
(TensorSpec(shape=(1024,), dtype=tf.float32, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))
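
The per-frame expansion of labels and folds can be illustrated with plain NumPy. This is only a sketch with made-up data: two hypothetical clips with 2 and 3 frames, and 4-D embeddings instead of the real 1024-D ones.

```python
import numpy as np

# Hypothetical per-clip embeddings with shape (frames, features).
clip_embeddings = [np.zeros((2, 4), np.float32), np.ones((3, 4), np.float32)]
clip_labels = [0, 1]
clip_folds = [1, 5]

# One row per frame: stack the frames and repeat each clip's label and fold
# once for every frame of that clip.
frames = np.concatenate(clip_embeddings, axis=0)
labels = np.concatenate(
    [np.repeat(label, len(emb)) for label, emb in zip(clip_labels, clip_embeddings)])
folds = np.concatenate(
    [np.repeat(fold, len(emb)) for fold, emb in zip(clip_folds, clip_embeddings)])

print(frames.shape, labels, folds)  # (5, 4) [0 0 1 1 1] [1 1 5 5 5]
```

Every frame inherits the fold of its source clip, so frames from one recording always stay together.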

Split the data

You will use the fold column to split the dataset into train, validation and test.

The fold values are assigned so that files derived from the same original WAV file are kept in the same split. You can find more information in the paper describing the dataset.

The last step is to remove the fold column from the dataset, since you won't use it during training.

cached_ds = main_ds.cache()
train_ds = cached_ds.filter(lambda embedding, label, fold: fold < 4)
val_ds = cached_ds.filter(lambda embedding, label, fold: fold == 4)
test_ds = cached_ds.filter(lambda embedding, label, fold: fold == 5)

# remove the folds column now that it's not needed anymore
remove_fold_column = lambda embedding, label, fold: (embedding, label)

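train_ds = train_ds.map(remove_fold_column)
val_ds = val_ds.map(remove_fold_column)
test_ds = test_ds.map(remove_fold_column)

train_ds = train_ds.cache().shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)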
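
The fold-based filtering above can be checked for leakage with a small NumPy sketch. The clip ids and folds below are made up for illustration; the point is that splitting on the fold value keeps all frames of one clip in the same split.

```python
import numpy as np

# Hypothetical per-frame metadata: 5 clips with 2 frames each.
clip_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
folds = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

# Same rule as the tutorial: folds 1-3 train, fold 4 validation, fold 5 test.
train_mask = folds < 4
val_mask = folds == 4
test_mask = folds == 5

train_clips = set(clip_ids[train_mask].tolist())
val_clips = set(clip_ids[val_mask].tolist())
test_clips = set(clip_ids[test_mask].tolist())

# No clip may contribute frames to more than one split.
no_leak = (train_clips.isdisjoint(val_clips)
           and train_clips.isdisjoint(test_clips)
           and val_clips.isdisjoint(test_clips))
print(train_clips, val_clips, test_clips, no_leak)
```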

Create your model

You did most of the work! Next, define a very simple Sequential model to start with: one hidden layer and two outputs to recognize cats and dogs.

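my_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,), dtype=tf.float32),
    tf.keras.layers.Dense(512, activation='relu'),
    # One logit per class; the summary below shows 2 outputs.
    tf.keras.layers.Dense(len(my_classes))
], name='my_model')

my_model.summary()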

Model: "my_model"
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 512)               524800    
dense_1 (Dense)              (None, 2)                 1026      
Total params: 525,826
Trainable params: 525,826
Non-trainable params: 0
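
The parameter counts in the summary can be verified by hand: a Dense layer has one weight per input-output pair plus one bias per output. The helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus biases (n_outputs) of a Dense layer."""
    return n_inputs * n_outputs + n_outputs


hidden = dense_params(1024, 512)  # 1024-D embedding -> 512 hidden units
output = dense_params(512, 2)     # 512 hidden units -> 2 classes
total = hidden + output

print(hidden, output, total)  # 524800 1026 525826
```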

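# Assumption: the optimizer and the patience value below are illustrative choices.
# The loss uses from_logits=True because the last Dense layer has no activation.
my_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 optimizer='adam', metrics=['accuracy'])

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
history = my_model.fit(train_ds, epochs=20, validation_data=val_ds,
                       callbacks=[callback])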
Epoch 1/20
15/15 [==============================] - 8s 40ms/step - loss: 0.6974 - accuracy: 0.7843 - val_loss: 0.2084 - val_accuracy: 0.9125
Epoch 2/20
15/15 [==============================] - 0s 14ms/step - loss: 0.2580 - accuracy: 0.8928 - val_loss: 0.2362 - val_accuracy: 0.8813
Epoch 3/20
15/15 [==============================] - 0s 14ms/step - loss: 0.2606 - accuracy: 0.8948 - val_loss: 0.4986 - val_accuracy: 0.8750
Epoch 4/20
15/15 [==============================] - 0s 14ms/step - loss: 0.2383 - accuracy: 0.9164 - val_loss: 0.2165 - val_accuracy: 0.8750
Epoch 5/20
15/15 [==============================] - 0s 14ms/step - loss: 0.2330 - accuracy: 0.9074 - val_loss: 0.2299 - val_accuracy: 0.8875
Epoch 6/20
15/15 [==============================] - 0s 14ms/step - loss: 0.2113 - accuracy: 0.9430 - val_loss: 0.7519 - val_accuracy: 0.8625
Epoch 7/20
15/15 [==============================] - 0s 14ms/step - loss: 0.4895 - accuracy: 0.9407 - val_loss: 1.0282 - val_accuracy: 0.8562

Let's run the evaluate method on the test data to check that the model is not overfitting.

loss, accuracy = my_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)
5/5 [==============================] - 0s 3ms/step - loss: 0.2846 - accuracy: 0.8438
Loss:  0.28461596369743347
Accuracy:  0.84375

You did it!

Test your model

Next, try your model on the embeddings of the sample file that you classified earlier using YAMNet only.

scores, embeddings, spectrogram = yamnet_model(testing_wav_data)
result = my_model(embeddings).numpy()

infered_class = my_classes[result.mean(axis=0).argmax()]
print(f'The main sound is: {infered_class}')
The main sound is: cat

Save a model that can directly take a wav file as input

Your model works when you give it the embeddings as input.

In a real situation you'll want to give it the sound data directly.

To do that you will combine YAMNet with your model into one single model that you can export for other applications.

To make it easier to use the model's result, the final layer will be a reduce_mean operation. When using this model for serving, as you will see below, you will need the name of the final layer. If you don't define one, TensorFlow will auto-generate an incremental name, which makes testing hard because the name changes every time you train the model. A raw TensorFlow operation cannot be given a name, so to address this issue you'll create a custom layer that applies reduce_mean and name it 'classifier'.

class ReduceMeanLayer(tf.keras.layers.Layer):
  def __init__(self, axis=0, **kwargs):
    super(ReduceMeanLayer, self).__init__(**kwargs)
    self.axis = axis

  def call(self, input):
    return tf.math.reduce_mean(input, axis=self.axis)
saved_model_path = './dogs_and_cats_yamnet'

input_segment = tf.keras.layers.Input(shape=(), dtype=tf.float32, name='audio')
embedding_extraction_layer = hub.KerasLayer(yamnet_model_handle,
                                            trainable=False, name='yamnet')
_, embeddings_output, _ = embedding_extraction_layer(input_segment)
serving_outputs = my_model(embeddings_output)
serving_outputs = ReduceMeanLayer(axis=0, name='classifier')(serving_outputs)
serving_model = tf.keras.Model(input_segment, serving_outputs)
serving_model.save(saved_model_path, include_optimizer=False)
INFO:tensorflow:Assets written to: ./dogs_and_cats_yamnet/assets
INFO:tensorflow:Assets written to: ./dogs_and_cats_yamnet/assets


Load your saved model to verify that it works as expected.

reloaded_model = tf.saved_model.load(saved_model_path)

And for the final test: given some sound data, does your model return the correct result?

reloaded_results = reloaded_model(testing_wav_data)
cat_or_dog = my_classes[tf.argmax(reloaded_results)]
print(f'The main sound is: {cat_or_dog}')
The main sound is: cat

If you want to try your new model on a serving setup, you can use the 'serving_default' signature.

serving_results = reloaded_model.signatures['serving_default'](testing_wav_data)
cat_or_dog = my_classes[tf.argmax(serving_results['classifier'])]
print(f'The main sound is: {cat_or_dog}')
The main sound is: cat

(Optional) Some more testing

The model is ready.

Let's compare it to YAMNet on the test dataset.

test_pd = filtered_pd.loc[filtered_pd['fold'] == 5]
row = test_pd.sample(1)
filename = row['filename'].item()
waveform = load_wav_16k_mono(filename)
print(f'Waveform values: {waveform}')
_ = plt.plot(waveform)

display.Audio(waveform, rate=16000)
Waveform values: [ 0.0000000e+00  3.3992906e-10 -3.0525005e-10 ...  0.0000000e+00
  0.0000000e+00  0.0000000e+00]


# Run the model, check the output.
scores, embeddings, spectrogram = yamnet_model(waveform)
class_scores = tf.reduce_mean(scores, axis=0)
top_class = tf.argmax(class_scores)
infered_class = class_names[top_class]
top_score = class_scores[top_class]
print(f'[YAMNet] The main sound is: {infered_class} ({top_score})')

reloaded_results = reloaded_model(waveform)
your_top_class = tf.argmax(reloaded_results)
your_infered_class = my_classes[your_top_class]
class_probabilities = tf.nn.softmax(reloaded_results, axis=-1)
your_top_score = class_probabilities[your_top_class]
print(f'[Your model] The main sound is: {your_infered_class} ({your_top_score})')
[YAMNet] The main sound is: Silence (0.7000001668930054)
[Your model] The main sound is: dog (0.9658258557319641)
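
The score printed for your model comes from averaging the per-frame logits and then applying softmax. The NumPy sketch below reproduces those two steps on made-up logits for 3 frames and 2 classes; the `softmax` helper is written out here for illustration and assumes the classifier outputs raw logits.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


# Hypothetical per-frame logits: 3 frames x 2 classes (dog, cat).
frame_logits = np.array([[2.0, 0.0],
                         [1.0, 1.0],
                         [3.0, -1.0]])

clip_logits = frame_logits.mean(axis=0)  # average over frames -> [2.0, 0.0]
probs = softmax(clip_logits)             # probabilities that sum to 1
top_class = int(probs.argmax())

print(clip_logits, probs, top_class)
```

Note that the YAMNet score and the softmax probability are not directly comparable: YAMNet makes independent predictions for each of its 521 classes, while softmax forces the two classes of your model to share a total probability of 1.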

Next steps

You just created a model that can classify sounds from dogs or cats. With the same idea and proper data you could, for example, build a bird recognizer based on their singing.

Let us know what you come up with! Share your project with us on social media.