TinyML ESP32-CAM: Edge Image classification with Edge Impulse

This tutorial covers how to use TinyML with the ESP32-CAM. It describes how to classify images on the ESP32-CAM using deep learning, with the machine learning model running directly on the device. To do this, we need to create a machine learning model with TensorFlow Lite and shrink it so that it fits on the device. There are several ways to do it; this tutorial uses Edge Impulse, which simplifies all the steps. We will explore the power of TinyML with the ESP32-CAM to recognize and classify images.

How to use TinyML with ESP32-CAM

To use deep learning with the ESP32-CAM so that it can classify images, there are several steps to follow:

  1. Find a dataset to train the model on
  2. Manipulate the dataset if necessary
  3. Define the model architecture
  4. Train the model
  5. Develop the ESP32-CAM code to run the model

Edge Impulse helps us speed up the model definition and the training phase, producing a ready-to-use TinyML model that we can run on the ESP32-CAM. This model is based on TensorFlow Lite.

In this ESP32-CAM tutorial, we will use a dataset to recognize flowers. This project is still experimental and needs to be improved in several respects. Still, it provides a guide if you want to experiment with running a machine learning / deep learning model directly on your device.

Define the dataset to train the model to use with ESP32-CAM

There are several datasets we can use to train our TinyML model, and you can pick one you like. As said, we want to classify flowers using the ESP32-CAM and deep learning. Therefore, we will look for a dataset that contains several flowers grouped in classes. Kaggle is a good starting point when you are looking for a dataset repository or want information about machine learning.

This is the dataset we will use to train the machine learning model for the ESP32-CAM. It contains 5 different flower classes:

  1. Daisy
  2. Dandelion
  3. Rose
  4. Sunflower
  5. Tulip

Before going on, it is necessary to create a Kaggle account.

Preparing the dataset to use with Edge Impulse

Before training the model, it is necessary to upload the data to Edge Impulse. To do this, you can either install everything you need locally or use Google Colab. This tutorial uses the second option. This is the link where you can download the code.

First of all, we have to download the dataset from Kaggle:

!pip install -q kaggle
import os
# Flower dataset
!kaggle datasets download -d alxmamaev/flowers-recognition
!unzip flowers-recognition.zip
# Fruit dataset
!kaggle datasets download -d moltean/fruits
!unzip fruits.zip

Notice that we download two different datasets: one that contains flowers and another that contains fruits. When we train a model, it is useful to include a class that differs from the classes we want to recognize, so that the model also sees examples of what is not a flower. Together, these datasets provide the classes we will use on the ESP32-CAM to classify images using TinyML.

I won’t cover all the details about creating and uploading the dataset to Edge Impulse because it is very simple. You can refer to the Colab code. Generally speaking, we define the number of samples we want to use to train our deep learning model, and the code uploads them to Edge Impulse.

At the end, depending on the number of samples you have configured, on the Edge Impulse side you will have:

tinyml dataset to use with ESP32-CAM to run a machine learning model on the device

Defining the model and training it to classify images

Once the data is ready, we can define the model we will use to recognize images with the ESP32-CAM. Below is the model in Edge Impulse:

TinyML machine learning model to use with ESP32-CAM. Deep learning model with transfer learning

There are some aspects to notice:

  1. As model input, we use a 48×48 pixel image. This is an important aspect. Keep in mind that the ESP32-CAM (like ESP32 devices in general) has a limited amount of memory. Therefore, we have to reduce the image size. If we use an RGB image, the number of features that the ESP32-CAM has to handle is 48x48x3. If we increase the image size much further, the model won’t fit into the ESP32-CAM.
  2. We use transfer learning to train the model. We will cover it later.
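To make the memory argument concrete, here is a rough sketch of the arithmetic behind the 48x48x3 figure. The function names are made up for this example, and storing every feature as a 32-bit float is an assumption for illustration only; the real footprint depends on how the exported library stores its input.

```cpp
#include <cstddef>
#include <cstdio>

// Number of input features for a width x height image with the given channel count.
constexpr std::size_t feature_count(std::size_t width, std::size_t height, std::size_t channels) {
    return width * height * channels;
}

// Bytes needed to hold those features if each one is stored as a 32-bit float
// (an assumption made for this illustration).
constexpr std::size_t float_buffer_bytes(std::size_t width, std::size_t height, std::size_t channels) {
    return feature_count(width, height, channels) * sizeof(float);
}

// Print the footprint of a few candidate RGB input sizes.
void print_footprints() {
    const std::size_t sizes[] = {48, 96, 240};
    for (std::size_t s : sizes) {
        std::printf("%zux%zu RGB: %zu features, %zu bytes as float\n",
                    s, s, feature_count(s, s, 3), float_buffer_bytes(s, s, 3));
    }
}
```

Doubling the side of the image from 48 to 96 pixels multiplies the number of features by four, which is why the input size has to stay small on this class of device.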

Training the machine learning model using Transfer learning

After the features are extracted, we can train the model. These are the parameters used to train the model:

ESP32 cam with machine learning model

Notice that we have used MobileNetV2 0.05 because the model must fit into the device memory. The confusion matrix is shown below:

The model accuracy is 77%. Of course, we should improve it, but it is good enough for this project.

The last step is the model quantization. After that, we can download the library to use with the ESP32-CAM. The library contains everything we need to run the image classification on the ESP32-CAM.
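Edge Impulse performs the quantization for us, so no code is needed for this step. As background only, the sketch below shows the general idea of 8-bit affine quantization, where a real value is approximated as scale * (q - zero_point). The struct, the function names and the parameter values are hypothetical and are not taken from the exported library.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Parameters of an affine quantization: real_value ~= scale * (q - zero_point).
struct QuantParams {
    float scale;
    int zero_point;
};

// Map a float to int8, rounding to the nearest step and clamping to the int8 range.
std::int8_t quantize(float x, QuantParams p) {
    int q = static_cast<int>(std::lround(x / p.scale)) + p.zero_point;
    q = std::min(127, std::max(-128, q));
    return static_cast<std::int8_t>(q);
}

// Map an int8 value back to its approximate float value.
float dequantize(std::int8_t q, QuantParams p) {
    return p.scale * static_cast<float>(static_cast<int>(q) - p.zero_point);
}
```

Storing weights and activations as 8-bit integers instead of 32-bit floats is what lets the model shrink to roughly a quarter of its size, at the cost of a small rounding error per value.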

How to run image classification on the ESP32-CAM using deep learning

Now it is time to implement the code that runs the classification model on the ESP32-CAM. To do it, we can start from the static buffer example shipped with the library. It is necessary to modify the sample code so that we can:

  • acquire the image
  • adapt the image size to the dataset
  • run the classification process

The code is shown below:

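#include <Arduino.h>
#include <WiFi.h>
#include "esp_http_server.h"
#include "esp_timer.h"
#include "img_converters.h"
#include "esp_camera.h"
// Select the board before including camera_pins.h. AI Thinker is assumed here
// because it is a very common ESP32-CAM module; change it if your board differs.
#define CAMERA_MODEL_AI_THINKER
#include "camera_pins.h"
// Header of the library exported from Edge Impulse (its name depends on your project name)
#include <-image_inference.h>

// raw frame buffer from the camera
#define FRAME_BUFFER_COLS           240
#define FRAME_BUFFER_ROWS           240
// size of the cutout passed to the model (the model input size)
#define CUTOUT_COLS                 EI_CLASSIFIER_INPUT_WIDTH
#define CUTOUT_ROWS                 EI_CLASSIFIER_INPUT_HEIGHT
const int cutout_row_start = (FRAME_BUFFER_ROWS - CUTOUT_ROWS) / 2;
const int cutout_col_start = (FRAME_BUFFER_COLS - CUTOUT_COLS) / 2;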

#define PART_BOUNDARY "123456789000000000000987654321"

static const char* _STREAM_CONTENT_TYPE = "multipart/x-mixed-replace;boundary=" PART_BOUNDARY;
static const char* _STREAM_BOUNDARY = "\r\n--" PART_BOUNDARY "\r\n";
static const char* _STREAM_PART = "Content-Type: image/jpeg\r\nContent-Length: %u\r\n\r\n";
httpd_handle_t camera_httpd = NULL;
httpd_handle_t stream_httpd = NULL;

const char* ssid = "your_ssid";
const char* password = "your_wifi_pwd";
camera_fb_t * fb = NULL;
uint8_t * _jpg_buf = NULL;

// Expand a 16-bit RGB565 pixel into three 8-bit channels
void r565_to_rgb(uint16_t color, uint8_t *r, uint8_t *g, uint8_t *b) {
    *r = (color & 0xF800) >> 8;
    *g = (color & 0x07E0) >> 3;
    *b = (color & 0x1F) << 3;
}

int cutout_get_data(size_t offset, size_t length, float *out_ptr) {
    // so offset and length naturally operate on the *cutout*, so we need to cut it out from the real framebuffer
    size_t bytes_left = length;
    size_t out_ptr_ix = 0;
    // NOTE: the frame is read as raw RGB565 (two bytes per pixel), so the camera
    // must deliver RGB565 frames; a JPEG frame would have to be decoded first
    const uint16_t *frame_buffer = (const uint16_t *)fb->buf;
    // read pixel by pixel
    while (bytes_left != 0) {
        // find location of the pixel in the cutout
        size_t cutout_row = floor(offset / CUTOUT_COLS);
        size_t cutout_col = offset - (cutout_row * CUTOUT_COLS);
        // then read the value from the real frame buffer
        size_t frame_buffer_row = cutout_row + cutout_row_start;
        size_t frame_buffer_col = cutout_col + cutout_col_start;
        uint16_t pixelTemp = frame_buffer[(frame_buffer_row * FRAME_BUFFER_COLS) + frame_buffer_col];
        // swap the two bytes of the pixel
        uint16_t pixel = (pixelTemp >> 8) | (pixelTemp << 8);
        uint8_t r, g, b;
        r565_to_rgb(pixel, &r, &g, &b);
        // pack the three channels into a single value (0xRRGGBB)
        float pixel_f = (r << 16) + (g << 8) + b;
        out_ptr[out_ptr_ix] = pixel_f;
        // and go to the next pixel
        out_ptr_ix++;
        offset++;
        bytes_left--;
    }
    // and done!
    return 0;
}

void classify() {
    ei_printf("Edge Impulse standalone inferencing (Arduino)\n");
    ei_impulse_result_t result = { 0 };
    // Convert to RGB888
    //fmt2rgb888(fb->buf, fb->len, PIXFORMAT_RGB888, _jpg_buf);

    // Set up pointer to look after data, crop it and convert it to RGB888
    signal_t signal;
    signal.total_length = CUTOUT_COLS * CUTOUT_ROWS;
    signal.get_data = &cutout_get_data;
    // Feed signal to the classifier
    EI_IMPULSE_ERROR res = run_classifier(&signal, &result, false /* debug */);
    // "res" holds the error code, "result" holds the predictions
    ei_printf("run_classifier returned: %d\n", res);
    if (res != 0) return;
    // print the predictions
    ei_printf("Predictions ");
    ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
        result.timing.dsp, result.timing.classification, result.timing.anomaly);
    ei_printf(": \n");
    // Print short form result data
    for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        ei_printf("%.5f", result.classification[ix].value);
#if EI_CLASSIFIER_HAS_ANOMALY == 1
        ei_printf(", ");
#else
        if (ix != EI_CLASSIFIER_LABEL_COUNT - 1) {
            ei_printf(", ");
        }
#endif
    }
#if EI_CLASSIFIER_HAS_ANOMALY == 1
    ei_printf("%.3f", result.anomaly);
#endif
    ei_printf("\n");
    // human-readable predictions
    for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        ei_printf("    %s: %.5f\n", result.classification[ix].label, result.classification[ix].value);
    }
#if EI_CLASSIFIER_HAS_ANOMALY == 1
    ei_printf("    anomaly score: %.3f\n", result.anomaly);
#endif
}

static esp_err_t capture_handler(httpd_req_t *req){
    Serial.println("Capture image");
    esp_err_t res = ESP_OK;
    fb = esp_camera_fb_get();
    if (!fb) {
        Serial.println("Camera capture failed");
        return ESP_FAIL;
    }
    // run the model on the frame that was just captured
    classify();
    httpd_resp_set_type(req, "image/jpeg");
    httpd_resp_set_hdr(req, "Content-Disposition", "inline; filename=capture.jpg");
    httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");
    res = httpd_resp_send(req, (const char *)fb->buf, fb->len);
    // give the frame buffer back to the camera driver
    esp_camera_fb_return(fb);
    fb = NULL;
    return res;
}

static esp_err_t page_handler(httpd_req_t *req) {
    httpd_resp_set_type(req, "text/html");
    httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");
    // the HTML page is not included in this listing
    // httpd_resp_send(req, page, sizeof(page));
    return ESP_OK;
}

static esp_err_t stream_handler(httpd_req_t *req){
    camera_fb_t * fb = NULL;
    esp_err_t res = ESP_OK;
    size_t _jpg_buf_len = 0;
    uint8_t * _jpg_buf = NULL;
    char part_buf[64];
    res = httpd_resp_set_type(req, _STREAM_CONTENT_TYPE);
    if(res != ESP_OK){
        return res;
    }
    httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");
    // send frames until the client disconnects or an error occurs
    while(true){
        fb = esp_camera_fb_get();
        if (!fb) {
            Serial.println("Camera capture failed");
            res = ESP_FAIL;
        } else {
            if(fb->format != PIXFORMAT_JPEG){
                // convert the frame to JPEG and release the original buffer
                bool jpeg_converted = frame2jpg(fb, 80, &_jpg_buf, &_jpg_buf_len);
                esp_camera_fb_return(fb);
                fb = NULL;
                if(!jpeg_converted){
                    Serial.println("JPEG compression failed");
                    res = ESP_FAIL;
                }
            } else {
                _jpg_buf_len = fb->len;
                _jpg_buf = fb->buf;
            }
        }
        if(res == ESP_OK){
            res = httpd_resp_send_chunk(req, _STREAM_BOUNDARY, strlen(_STREAM_BOUNDARY));
        }
        if(res == ESP_OK){
            size_t hlen = snprintf(part_buf, sizeof(part_buf), _STREAM_PART, _jpg_buf_len);
            res = httpd_resp_send_chunk(req, part_buf, hlen);
        }
        if(res == ESP_OK){
            res = httpd_resp_send_chunk(req, (const char *)_jpg_buf, _jpg_buf_len);
        }
        if(fb){
            // the JPEG data belongs to the frame buffer: return it to the driver
            esp_camera_fb_return(fb);
            fb = NULL;
            _jpg_buf = NULL;
        } else if(_jpg_buf){
            // the JPEG data was allocated by frame2jpg: free it
            free(_jpg_buf);
            _jpg_buf = NULL;
        }
        if(res != ESP_OK){
            break;
        }
    }
    return res;
}

void startCameraServer(){
    httpd_config_t config = HTTPD_DEFAULT_CONFIG();
    httpd_uri_t index_uri = {
        .uri       = "/",
        .method    = HTTP_GET,
        .handler   = stream_handler,
        .user_ctx  = NULL
    };
    httpd_uri_t capture_uri = {
        .uri       = "/capture",
        .method    = HTTP_GET,
        .handler   = capture_handler,
        .user_ctx  = NULL
    };

    Serial.printf("Starting web server on port: '%d'\n", config.server_port);
    if (httpd_start(&camera_httpd, &config) == ESP_OK) {
        httpd_register_uri_handler(camera_httpd, &capture_uri);
        //httpd_register_uri_handler(camera_httpd, &page_uri);
    }
    // start stream using another webserver
    config.server_port += 1;
    config.ctrl_port += 1;
    Serial.printf("Starting stream server on port: '%d'\n", config.server_port);
    if (httpd_start(&stream_httpd, &config) == ESP_OK) {
        httpd_register_uri_handler(stream_httpd, &index_uri);
    }
}

void setup() {
  Serial.begin(115200);
  camera_config_t config;
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sscb_sda = SIOD_GPIO_NUM;
  config.pin_sscb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;
  // if PSRAM IC present, use higher JPEG quality and two frame buffers
  if (psramFound()) {
    config.frame_size = FRAMESIZE_240X240;
    config.jpeg_quality = 10;
    config.fb_count = 2;
  } else {
    config.frame_size = FRAMESIZE_240X240;
    config.jpeg_quality = 12;
    config.fb_count = 1;
  }
  pinMode(13, INPUT_PULLUP);
  pinMode(14, INPUT_PULLUP);
  // camera init
  esp_err_t err = esp_camera_init(&config);
  if (err != ESP_OK) {
    Serial.printf("Camera init failed with error 0x%x", err);
    return;
  }
  sensor_t * s = esp_camera_sensor_get();
  // initial sensors are flipped vertically and colors are a bit saturated
  if (s->id.PID == OV3660_PID) {
    s->set_vflip(s, 1); // flip it back
    s->set_brightness(s, 1); // up the brightness just a bit
    s->set_saturation(s, 0); // lower the saturation
  }
  // use the 240x240 frame size expected by the cutout code
  s->set_framesize(s, FRAMESIZE_240X240);
  s->set_vflip(s, 1);
  s->set_hmirror(s, 1);
  WiFi.begin(ssid, password);
  // wait for the WiFi connection
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("");
  Serial.println("WiFi connected");
  startCameraServer();
  Serial.print("Camera Ready! Use 'http://");
  Serial.print(WiFi.localIP());
  Serial.println("' to connect");
}

void loop() {
  // put your main code here, to run repeatedly:
}

The code includes a Web server so that you can run the classification from a Web interface (the HTML source code is not included). Moreover, it implements a stream video server that sends video to the Web page.
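The classify() function prints a score for every label. If you want to act on the result, you usually need only the most likely class. The following is a minimal sketch of that selection step. The Prediction struct is a hypothetical stand-in for the label and value pairs found in result.classification, and top_prediction and min_confidence are names invented for this example.

```cpp
#include <cstddef>

// Hypothetical stand-in for one classifier output entry (label plus score).
// It is not the Edge Impulse SDK type.
struct Prediction {
    const char *label;
    float value;
};

// Return the index of the highest-scoring prediction whose score is at least
// min_confidence, or -1 when no prediction reaches that threshold.
int top_prediction(const Prediction *preds, std::size_t count, float min_confidence) {
    int best = -1;
    for (std::size_t ix = 0; ix < count; ix++) {
        if (preds[ix].value < min_confidence) {
            continue;  // too uncertain, ignore this label
        }
        if (best < 0 || preds[ix].value > preds[best].value) {
            best = static_cast<int>(ix);
        }
    }
    return best;
}
```

With a model accuracy of 77%, a confidence threshold is a cheap way to reject frames where no class clearly wins instead of always reporting a flower.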


How to feed the model with ESP32-CAM pictures

The first thing to do is adapt the image captured by the ESP32-CAM so that we can pass it to the model we have trained before. There are two aspects to consider:

  • the image size
  • the color codification

The images in the dataset are 48×48. Even though the ESP32-CAM can take 96×96 pictures, I had some problems streaming the video to the Web interface at this resolution. After some trials, the resolution that worked best was 240×240. It is a waste of resources, but it works for now. The function cutout_get_data takes a 48×48 cutout from the center of the 240×240 picture and converts each pixel to RGB888. This method is described in the Edge Impulse forum; I’ve adapted it to the ESP32-CAM.
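The color conversion can be tried on its own, outside the sketch. The code below is a self-contained illustration of the same bit manipulation used by r565_to_rgb and of the way cutout_get_data packs the three channels into one float. The names rgb565_to_rgb888 and pack_rgb888 are chosen for this example only.

```cpp
#include <cstdint>

// Expand a 16-bit RGB565 pixel (5 bits red, 6 bits green, 5 bits blue)
// into three 8-bit channels. The low bits of each channel stay zero.
void rgb565_to_rgb888(std::uint16_t color, std::uint8_t *r, std::uint8_t *g, std::uint8_t *b) {
    *r = static_cast<std::uint8_t>((color & 0xF800) >> 8);
    *g = static_cast<std::uint8_t>((color & 0x07E0) >> 3);
    *b = static_cast<std::uint8_t>((color & 0x001F) << 3);
}

// Pack the three channels into one float holding the value 0xRRGGBB.
// Every 24-bit integer is exactly representable as a 32-bit float.
float pack_rgb888(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const std::uint32_t packed = (static_cast<std::uint32_t>(r) << 16) |
                                 (static_cast<std::uint32_t>(g) << 8) |
                                 static_cast<std::uint32_t>(b);
    return static_cast<float>(packed);
}
```

Because the expansion only shifts bits, pure red in RGB565 (0xF800) becomes 248 rather than 255. For this experiment the small loss in range is acceptable.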

Implement the classification model on the ESP32-CAM

The last step is running the classification process on the ESP32-CAM. In the example above, the classification is triggered from the Web interface, but you can easily modify the code if you don’t want to use it. It is important to notice this piece of code:

signal_t signal;
signal.total_length = CUTOUT_COLS * CUTOUT_ROWS;
signal.get_data = &cutout_get_data;

This is where we invoke the method that adapts the image size and the color.
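The classifier asks get_data for pixels by their offset inside the 48×48 cutout, so the callback has to translate that offset into a position in the full frame. The sketch below isolates that index arithmetic, using the sizes from this tutorial. The constant and function names are made up for the example.

```cpp
#include <cstddef>

// Sizes used in this tutorial: a 240x240 frame and a 48x48 model input.
const std::size_t kFrameCols = 240;
const std::size_t kFrameRows = 240;
const std::size_t kCutoutCols = 48;
const std::size_t kCutoutRows = 48;

// Map a pixel offset inside the centered cutout to the index of the same
// pixel in the full frame (both counted in pixels, row by row).
std::size_t cutout_to_frame_index(std::size_t offset) {
    const std::size_t row_start = (kFrameRows - kCutoutRows) / 2;  // 96
    const std::size_t col_start = (kFrameCols - kCutoutCols) / 2;  // 96
    const std::size_t cutout_row = offset / kCutoutCols;
    const std::size_t cutout_col = offset % kCutoutCols;
    return (cutout_row + row_start) * kFrameCols + (cutout_col + col_start);
}
```

Only the central 48×48 pixels of the frame reach the model, so the flower has to be roughly centered in the picture for the classification to see it.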

The simple web interface and the final classification result are shown below:

Result of running machine learning image classification on ESP32-CAM. Deep learning result

Wrapping up

At the end of this post, we have built a simple machine learning model that recognizes flowers, and we are able to run it directly on the ESP32-CAM. The ESP32-CAM captures the image and then classifies it by running a TinyML model on the device. This project is experimental and there are several aspects to improve, such as the model accuracy and the way the image is resized and adapted. Still, it can be a starting point if you want to explore how to classify images using the ESP32-CAM with the inference process running on the device.
