Arduino RP2040 Sound classification using Machine Learning: Animal Sounds

This tutorial covers how to use the Arduino RP2040 to classify sounds using Machine Learning. We will use the board's built-in microphone to capture animal sounds and classify them. To implement the machine learning model, this project uses Edge Impulse. By the end, you will have a working Arduino RP2040 sound classification system for animal sounds.

There are several aspects that make this project interesting:

  • finding a good dataset so that we can train our model
  • choosing the right feature extraction so that we can classify the animal sounds

Thanks to Edge Impulse everything gets easier. Let's see how to build the model.

Finding the dataset

It was not easy to find a dataset containing animal sounds that we could use to train our model for the Arduino RP2040. Eventually I found Environmental Sound Classification 50 (ESC-50). This dataset has 50 different classes, and for each class there are 40 samples of 5 seconds. The audio is sampled at 16 kHz. It looks perfect for this project; we only have to manipulate the files so that they fit our needs. The classes and their files are described in a file named esc50.csv. We will use it to extract the data and to upload the files to Edge Impulse, where we will train the model to use with the Arduino RP2040.
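As a side note, each ESC-50 file name encodes the same metadata that appears in esc50.csv, in the form {FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav. A minimal sketch of recovering it from the name alone (the helper name is mine, not part of the dataset):

```python
# Hypothetical helper: parse an ESC-50 file name such as "1-100032-A-0.wav"
# into its fold, source clip id, take letter and target class number.
def parse_esc50_filename(filename):
    stem = filename.rsplit(".", 1)[0]          # drop the ".wav" extension
    fold, clip_id, take, target = stem.split("-")
    return {
        "fold": int(fold),
        "src_file": clip_id,
        "take": take,
        "target": int(target),
    }

print(parse_esc50_filename("1-100032-A-0.wav"))
# {'fold': 1, 'src_file': '100032', 'take': 'A', 'target': 0}
```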

How to create the dataset for Edge Impulse

To build the dataset for Edge Impulse and the Arduino RP2040, we will use Colab and Python. In this step, we create the dataset on which we will train the machine learning model to classify animal sounds. Let us create a new Colab notebook and download the dataset from Kaggle:

!pip install -q kaggle
import os
os.environ['KAGGLE_USERNAME']='your_user_name'
os.environ['KAGGLE_KEY']='your_kaggle_key'
# Audio dataset
!kaggle datasets download -d mmoreaux/environmental-sound-classification-50
!unzip environmental-sound-classification-50.zip

Before using the audio files in this dataset, we have to read esc50.csv. To do this, we will use the pandas library:

import pandas as pd
csv_labels = pd.read_csv("./esc50.csv")
csv_labels.head()

The result is shown below:

   filename           fold  target  category        esc10  src_file  take
0  1-100032-A-0.wav      1       0  dog             True     100032     A
1  1-100038-A-14.wav     1      14  chirping_birds  False    100038     A
2  1-100210-A-36.wav     1      36  vacuum_cleaner  False    100210     A
3  1-100210-B-36.wav     1      36  vacuum_cleaner  False    100210     B
4  1-101296-A-19.wav     1      19  thunderstorm    False    101296     A

Suppose we want to classify three different animal sounds:

  • cow
  • sheep
  • dog

Therefore, let us use the CSV file to copy only the WAV files we need:

# labels
labels=['cow', 'dog', 'sheep']
# path
base_path = './audio/audio/16000'
# output dir
output_base_path = './edge_dataset'
import shutil
if not os.path.exists(output_base_path):
    print ("Create dir " + output_base_path)
    os.mkdir(output_base_path)
filtered_labels = csv_labels[csv_labels.category.isin(labels)]
for x in range(len(filtered_labels)):
  label = filtered_labels.iloc[x]['category']
  filename = filtered_labels.iloc[x]['filename']
  print("Label: " + label)
  print("File name: "  + filename)
  output_path = output_base_path + "/" + label
  if not os.path.exists(output_path):
    print ("Create dir " + output_path)
    os.mkdir(output_path)
  src = base_path + "/" + filename
  dest = output_path + "/" + filename
  shutil.copyfile(src, dest)

Now it is necessary to add a noise class to make the model more robust:

!wget https://cdn.edgeimpulse.com/datasets/keywords2.zip
# unzip the file
!unzip keywords2.zip -d ./other_dataset
import librosa
import soundfile as sf
import numpy as np
# number of noise samples to keep
noise_samples = 120
counter = 1
noise_input_path = "other_dataset/noise"
if not os.path.exists(output_base_path + "/noise/"):
    print ("Create dir " + output_base_path + "/noise/")
    os.mkdir(output_base_path + "/noise/")
for filename in os.listdir(noise_input_path):
  if counter > noise_samples:
    break
  counter = counter + 1
  input_file_path = noise_input_path + "/" + filename
  output_file_path = output_base_path + "/noise/" + filename
  # resample to 16 kHz mono so the noise matches the ESC-50 clips
  s, sr = librosa.load(input_file_path, sr=16000, mono=True)
  sf.write(output_file_path, s, 16000)

That’s all. Now we can simply upload these files to Edge Impulse.

How to use Edge Impulse to build an Arduino RP2040 sound classification model

In this step, we will build the machine learning model to use with the Arduino RP2040 to classify animal sounds. Let us go to the Edge Impulse web interface: under the dataset section we will find the WAV samples uploaded in the previous step. Now it is necessary to split each file into shorter segments, as shown below:

Arduino RP2040 sound classification using Machine Learning and Edge Impulse
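Conceptually, this splitting step cuts each 5-second clip into shorter windows. The same operation could be sketched offline in a few lines of Python (the 1-second window length here is an assumption, chosen to match a 1000 ms model window):

```python
import numpy as np

def split_clip(samples, sample_rate=16000, window_s=1.0):
    """Split a clip into non-overlapping fixed-length windows,
    dropping any trailing partial window."""
    window = int(sample_rate * window_s)
    n = len(samples) // window
    return [samples[i * window:(i + 1) * window] for i in range(n)]

clip = np.zeros(5 * 16000)   # a silent 5-second, 16 kHz clip
segments = split_clip(clip)
print(len(segments))         # 5 one-second segments
```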

You have to repeat this process for all the uploaded files across the 4 different categories: the three animals chosen before, plus noise. After you have rebalanced your dataset you should have:

How to build the machine learning model for Arduino RP2040

Now it is possible to build the machine learning model. This project has a big difference compared to the "classic" keyword-spotting projects using the Arduino Nano 33 BLE or similar boards. In this project, we do not have to recognize human voice, so the MFCC processing block does not give good results. It is necessary to change the processing block and use the Spectrogram instead. The final model is shown below:

Build a machine learning model using Edge Impulse with Arduino RP2040

You can play with the spectrogram parameters to find the best solution. Below is my result:

Sound feature extraction using Machine Learning and Arduino RP2040
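To get an intuition for what the Spectrogram block computes, here is a rough stand-in built from a short-time FFT. The frame and hop sizes are arbitrary illustrative values, not the Edge Impulse defaults:

```python
import numpy as np

def spectrogram(samples, frame_len=256, hop=128):
    """Magnitude spectrogram: slide a Hann-windowed frame over the
    signal and take the magnitude of the real FFT of each frame."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)   # shape: (n_frames, frame_len // 2 + 1)

# 1 second of a 440 Hz tone at 16 kHz: the energy concentrates in the
# frequency bin nearest 440 Hz (bin width is 16000 / 256 = 62.5 Hz)
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = spectrogram(sig)
print(spec.shape)
```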

Training the machine learning model for the Arduino RP2040 inference process

In this step, we can train the model. I used these parameters:

Training cycles: 300
Learning rate: 0.005
Minimum confidence: 0.60
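The minimum confidence value is used when scoring results: a window only counts as a given class if its top score reaches 0.60, otherwise it is treated as uncertain. A small sketch of that thresholding logic (the label order assumes the four classes sorted alphabetically):

```python
import numpy as np

MIN_CONFIDENCE = 0.60
LABELS = ["cow", "dog", "noise", "sheep"]

def classify(scores, labels=LABELS, threshold=MIN_CONFIDENCE):
    """Return the top label, or 'uncertain' if its score is too low."""
    idx = int(np.argmax(scores))
    return labels[idx] if scores[idx] >= threshold else "uncertain"

print(classify([0.05, 0.82, 0.08, 0.05]))  # dog
print(classify([0.30, 0.28, 0.22, 0.20]))  # uncertain
```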

Start the training process:

How to train a machine learning model with Arduino RP2040

Not bad, considering we used only about 5 minutes of samples. We can now deploy the model to the Arduino RP2040 and use it to classify animal sounds.

Running the machine learning model on the Arduino RP2040

In this last step, we will run the machine learning model on the Arduino RP2040. You can simply import the project into the Arduino IDE and run it. However, we prefer to display the result on a simple SSD1306 OLED display. Therefore, it is necessary to modify the source code by adding a few lines:

/* Edge Impulse Arduino examples
 * Copyright (c) 2021 EdgeImpulse Inc.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */
#include <Adafruit_GFX.h>
#include <Adafruit_SSD1306.h>
#define SCREEN_WIDTH 128 // OLED display width, in pixels
#define SCREEN_HEIGHT 64 // OLED display height, in pixels
#define SCREEN_ADDRESS 0x3C ///< See datasheet for Address; 0x3D for 128x64, 0x3C for 128x32
Adafruit_SSD1306 display(SCREEN_WIDTH, SCREEN_HEIGHT, &Wire, -1);

// If your target is limited in memory remove this macro to save 10K RAM
#define EIDSP_QUANTIZE_FILTERBANK   0
/**
 * Define the number of slices per model window. E.g. a model window of 1000 ms
 * with slices per model window set to 4. Results in a slice size of 250 ms.
 * For more info: https://docs.edgeimpulse.com/docs/continuous-audio-sampling
 */
#define EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW 3
/* Includes ---------------------------------------------------------------- */
#include <PDM.h>
#include <Animal_sound_inferencing.h>
/** Audio buffers, pointers and selectors */
typedef struct {
    signed short *buffers[2];
    unsigned char buf_select;
    unsigned char buf_ready;
    unsigned int buf_count;
    unsigned int n_samples;
} inference_t;
static inference_t inference;
static bool record_ready = false;
static signed short *sampleBuffer;
static bool debug_nn = false; // Set this to true to see e.g. features generated from the raw signal
static int print_results = -(EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW);
/**
 * @brief      Arduino setup function
 */
void setup()
{
    // put your setup code here, to run once:
    Serial.begin(115200);
    if(!display.begin(SSD1306_SWITCHCAPVCC, SCREEN_ADDRESS)) {
      Serial.println("SSD1306 allocation failed");
      for(;;); // Don't proceed, loop forever
    }
    Serial.println("Edge Impulse Inferencing Demo");
    display.setTextSize(1);      // Normal 1:1 pixel scale
    display.setTextColor(SSD1306_WHITE); // Draw white text
    display.setCursor(0, 0);     // Start at top-left corner
    display.clearDisplay();
    display.println("ML Animal sound");
    display.display();
   
    
    // summary of inferencing settings (from model_metadata.h)
    ei_printf("Inferencing settings:\n");
    ei_printf("\tInterval: %.2f ms.\n", (float)EI_CLASSIFIER_INTERVAL_MS);
    ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
    ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
    ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) /
                                            sizeof(ei_classifier_inferencing_categories[0]));
    run_classifier_init();
    if (microphone_inference_start(EI_CLASSIFIER_SLICE_SIZE) == false) {
        ei_printf("ERR: Failed to setup audio sampling\r\n");
        return;
    }
}
/**
 * @brief      Arduino main function. Runs the inferencing loop.
 */
void loop()
{
    bool m = microphone_inference_record();
    if (!m) {
        ei_printf("ERR: Failed to record audio...\n");
        return;
    }
    signal_t signal;
    signal.total_length = EI_CLASSIFIER_SLICE_SIZE;
    signal.get_data = &microphone_audio_signal_get_data;
    ei_impulse_result_t result = {0};
    EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);
    if (r != EI_IMPULSE_OK) {
        ei_printf("ERR: Failed to run classifier (%d)\n", r);
        return;
    }
    if (++print_results >= (EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW)) {
        // print the predictions
        ei_printf("Predictions ");
        ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
            result.timing.dsp, result.timing.classification, result.timing.anomaly);
        ei_printf(": \n");
       
        float p = 0;
        int idx = 0;
        
        for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
            ei_printf("    %s: %.5f\n", result.classification[ix].label,
                      result.classification[ix].value);
            if (result.classification[ix].value > p) {
              p = result.classification[ix].value ;
              idx = ix;
            }
        }
#if EI_CLASSIFIER_HAS_ANOMALY == 1
        ei_printf("    anomaly score: %.3f\n", result.anomaly);
#endif
        int dtime = 0;
        
        print_results = 0;
       Serial.println(idx);
         display.setCursor(1,20);
        switch(idx) {
          case 0:
            display.clearDisplay();
            display.print("Muuuuu");
            display.display();
            dtime = 1000;
            break;
          case 1:
             display.clearDisplay();
             display.println("Bauuu");
             display.display();
             dtime = 1000;
             break;
           case 2:
              display.clearDisplay();
              display.println("Listening.....");
              display.display();
              Serial.println("Noise");
              dtime = 0;
              break;
          case 3:
              display.clearDisplay();
              display.println("Beeeee");
              display.display();
              Serial.println("Sheep");
              dtime = 1000;
              break;
        }
        if (dtime > 0)
            delay(dtime);
    }
}
/**
 * @brief      Printf function uses vsnprintf and output using Arduino Serial
 *
 * @param[in]  format     Variable argument list
 */
void ei_printf(const char *format, ...) {
    static char print_buf[1024] = { 0 };
    va_list args;
    va_start(args, format);
    int r = vsnprintf(print_buf, sizeof(print_buf), format, args);
    va_end(args);
    if (r > 0) {
        Serial.write(print_buf);
    }
}
/**
 * @brief      PDM buffer full callback
 *             Get data and call audio thread callback
 */
static void pdm_data_ready_inference_callback(void)
{
    int bytesAvailable = PDM.available();
    // read into the sample buffer
    int bytesRead = PDM.read((char *)&sampleBuffer[0], bytesAvailable);
    if (record_ready == true) {
        for (int i = 0; i < bytesRead >> 1; i++) {
            inference.buffers[inference.buf_select][inference.buf_count++] = sampleBuffer[i];
            if (inference.buf_count >= inference.n_samples) {
                inference.buf_select ^= 1;
                inference.buf_count = 0;
                inference.buf_ready = 1;
            }
        }
    }
}
/**
 * @brief      Init inferencing struct and setup/start PDM
 *
 * @param[in]  n_samples  The n samples
 *
 * @return     { description_of_the_return_value }
 */
static bool microphone_inference_start(uint32_t n_samples)
{
    inference.buffers[0] = (signed short *)malloc(n_samples * sizeof(signed short));
    if (inference.buffers[0] == NULL) {
        return false;
    }
    inference.buffers[1] = (signed short *)malloc(n_samples * sizeof(signed short));
    if (inference.buffers[1] == NULL) {
        free(inference.buffers[0]);
        return false;
    }
    sampleBuffer = (signed short *)malloc((n_samples >> 1) * sizeof(signed short));
    if (sampleBuffer == NULL) {
        free(inference.buffers[0]);
        free(inference.buffers[1]);
        return false;
    }
    inference.buf_select = 0;
    inference.buf_count = 0;
    inference.n_samples = n_samples;
    inference.buf_ready = 0;
    // configure the data receive callback
    PDM.onReceive(&pdm_data_ready_inference_callback);
    // optionally set the gain, defaults to 20
    PDM.setGain(80);
    PDM.setBufferSize((n_samples >> 1) * sizeof(int16_t));
    // initialize PDM with:
    // - one channel (mono mode)
    // - a 16 kHz sample rate
    if (!PDM.begin(1, EI_CLASSIFIER_FREQUENCY)) {
        ei_printf("Failed to start PDM!");
    }
    record_ready = true;
    return true;
}
/**
 * @brief      Wait on new data
 *
 * @return     True when finished
 */
static bool microphone_inference_record(void)
{
    bool ret = true;
    if (inference.buf_ready == 1) {
        ei_printf(
            "Error sample buffer overrun. Decrease the number of slices per model window "
            "(EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW)\n");
        ret = false;
    }
    while (inference.buf_ready == 0) {
        delay(1);
    }
    inference.buf_ready = 0;
    return ret;
}
/**
 * Get raw audio signal data
 */
static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr)
{
    numpy::int16_to_float(&inference.buffers[inference.buf_select ^ 1][offset], out_ptr, length);
    return 0;
}
/**
 * @brief      Stop PDM and release buffers
 */
static void microphone_inference_end(void)
{
    PDM.end();
    free(inference.buffers[0]);
    free(inference.buffers[1]);
    free(sampleBuffer);
}
#if !defined(EI_CLASSIFIER_SENSOR) || EI_CLASSIFIER_SENSOR != EI_CLASSIFIER_SENSOR_MICROPHONE
#error "Invalid model for current sensor."
#endif
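For clarity, the winner-takes-all loop in the sketch above (the `for` over `EI_CLASSIFIER_LABEL_COUNT` that drives the `switch`) boils down to an argmax over the per-class scores. In Python form:

```python
def pick_winner(classification):
    """Mirror of the sketch's selection loop: scan the (label, score)
    pairs and keep the index of the highest score."""
    p, idx = 0.0, 0
    for ix, (label, value) in enumerate(classification):
        if value > p:
            p, idx = value, ix
    return idx, classification[idx][0]

result = [("cow", 0.02), ("dog", 0.91), ("noise", 0.04), ("sheep", 0.03)]
print(pick_winner(result))  # (1, 'dog')
```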

That’s all. We can now test it.

Testing Arduino RP2040 animal sound classification

Now you can test the code. To do it, we played some animal sounds from YouTube; this is the result:

Wrapping up

At the end of this Arduino RP2040 Machine Learning tutorial, we have discovered how to use machine learning to classify animal sounds. Moreover, we have covered how to build a machine learning model using Edge Impulse and how to use it with the Arduino RP2040. Now you can experiment by yourself, building your own model or improving this one by adding new animal sounds.