ESP32-CAM Object Detection with TensorFlow.js

This tutorial covers how to implement ESP32-CAM object detection using TensorFlow.js. Although the ESP32-CAM can run some machine learning models directly on the device (for example, it can detect faces), it doesn't have the power to run a complex model. Therefore, we will use TensorFlow.js together with the video coming from the ESP32-CAM to detect objects. Be aware that in this tutorial TensorFlow.js runs in the computer browser, so the machine learning model runs inside your browser. We have already covered how to use the ESP32-CAM to classify images using TensorFlow.js. In this tutorial, we will use the COCO-SSD model to detect objects in a video stream coming from the ESP32-CAM.

As you may already know, TensorFlow has several pre-trained models that make it easy to get started with machine learning. COCO-SSD is an ML model used to localize and identify objects in an image. In this tutorial, we will start from the TensorFlow tutorial and modify it to adapt it to the ESP32-CAM.
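To make the model's output concrete: each detection that COCO-SSD produces is a plain object carrying a class label, a confidence score, and a bounding box. The sketch below uses made-up sample values purely for illustration (the `confidentDetections` helper is ours, not part of the library):

```javascript
// Shape of the array that COCO-SSD's detect() resolves with:
// each prediction has a class name, a score in [0, 1], and a
// bbox expressed as [x, y, width, height] in pixels.
// The values below are made-up sample data for illustration.
const predictions = [
  { class: 'person', score: 0.92, bbox: [24, 10, 180, 320] },
  { class: 'cup',    score: 0.41, bbox: [300, 220, 40, 55] }
];

// Keep only confident detections, as the tutorial page does later
// with a 0.55 threshold.
function confidentDetections(preds, threshold) {
  return preds.filter(function (p) { return p.score > threshold; });
}

console.log(confidentDetections(predictions, 0.55).map(function (p) { return p.class; }));
// logs: [ 'person' ]
```

Filtering by score is what keeps the overlay from being cluttered with low-confidence guesses.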

If you would like to explore how to detect objects using machine learning directly on the device, you can read this tutorial on how to use TensorFlow Lite with a Raspberry Pi.

How to use the TensorFlow.js JavaScript library in a web page

The first step to use the ESP32-CAM with TensorFlow.js to detect objects is building the web page where the inference will happen. To use the TensorFlow JavaScript library we have to follow these steps:

  1. Importing the TensorFlow JavaScript libraries
  2. Loading the model, in this project the COCO-SSD pre-trained ML model
  3. Applying the COCO-SSD model to the incoming video and drawing rectangles around identified objects

Later we will see how to integrate the ESP32-CAM with TensorFlow.js using the page we are creating in this step. This is the full HTML page that holds the TensorFlow.js code:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>
      Multiple object detection using pre trained model in TensorFlow.js
    </title>
    <meta charset="utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style>
      img {
        display: block;
      }

      .camView {
        position: relative;
        float: left;
        width: calc(100% - 20px);
        margin: 10px;
        cursor: pointer;
      }

      .camView p {
        position: absolute;
        padding: 5px;
        background-color: rgba(255, 111, 0, 0.85);
        color: #FFF;
        border: 1px dashed rgba(255, 255, 255, 0.7);
        z-index: 2;
        font-size: 12px;
      }

      .highlighter {
        background: rgba(0, 255, 0, 0.25);
        border: 1px dashed #fff;
        z-index: 1;
        position: absolute;
      }
    </style>
  </head>
  <body>
    <h1>ESP32 TensorFlow.js - Object Detection</h1>

    <p>
      Wait for the model to load before clicking the button to enable the webcam
      - at which point it will become visible to use.
    </p>

    <section id="camSection">
      <div id="liveView" class="camView">
        <button id="esp32camButton">Start ESP32 Webcam</button>
        <img id="esp32cam_video" width="640" height="480" crossorigin=" " />
      </div>
    </section>

    <!-- Import TensorFlow.js library -->
    <script
      src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js"
      type="text/javascript"
    ></script>
    <!-- Load the coco-ssd model to use to recognize things in images -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script>

    <script>
      const video = document.getElementById('esp32cam_video');
      const liveView = document.getElementById('liveView');
      const camSection = document.getElementById('camSection');
      const enableWebcamButton = document.getElementById('esp32camButton');

      esp32camButton.addEventListener('click', enableCam);

      // Store the resulting model in the global scope of our app.
      var model = undefined;

      esp32camButton.disabled = true;
      cocoSsd.load().then(function (loadedModel) {
        model = loadedModel;
        esp32camButton.disabled = false;
      });

      function enableCam(event) {
        video.src = 'http://' + window.location.hostname + ":81/";
        if (!model) {
          return;
        }
        predictWebcam();
      }

      var children = [];

      function predictWebcam() {
        // Now let's start classifying a frame in the stream.
        video.crossOrigin = 'anonymous';

        model.detect(video).then(function (predictions) {
          // Remove any highlighting we did previous frame.
          for (let i = 0; i < children.length; i++) {
            liveView.removeChild(children[i]);
          }
          children.splice(0);

          // Now let's loop through predictions and draw them to the live view if
          // they have a high confidence score.
          for (let n = 0; n < predictions.length; n++) {

            console.log(predictions[n].class + ' ' + predictions[n].score);
            if (predictions[n].score > 0.55) {
              const p = document.createElement('p');
              p.innerText = predictions[n].class  + ' - with '
                  + Math.round(parseFloat(predictions[n].score) * 100)
                  + '% confidence.';
              p.style = 'margin-left: ' + predictions[n].bbox[0] + 'px; margin-top: '
                  + (predictions[n].bbox[1] - 10) + 'px; width: '
                  + (predictions[n].bbox[2] - 10) + 'px; top: 0; left: 0;';

              const highlighter = document.createElement('div');
              highlighter.setAttribute('class', 'highlighter');
              highlighter.style = 'left: ' + predictions[n].bbox[0] + 'px; top: '
                  + predictions[n].bbox[1] + 'px; width: '
                  + predictions[n].bbox[2] + 'px; height: '
                  + predictions[n].bbox[3] + 'px;';

              liveView.appendChild(highlighter);
              liveView.appendChild(p);
              children.push(highlighter);
              children.push(p);
            }
          }

          // Call this function again to keep predicting when the browser is ready.
          window.requestAnimationFrame(predictWebcam);
        });
      }
    </script>
  </body>
</html>

More useful resources:
How to classify objects using KNN and ESP32
How to classify objects using ESP32-CAM and a cloud machine learning model

Importing the TensorFlow JavaScript library

The first step is importing the JavaScript library:

 <!-- Import TensorFlow.js library -->
    <script
      src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js"
      type="text/javascript"
    ></script>
    <!-- Load the coco-ssd model to use to recognize things in images -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script>

As you can notice, the code above also imports the COCO-SSD model.

Loading the COCO-SSD model

Next, we have to load the model and wait until the model is fully loaded:

esp32camButton.disabled = true;
cocoSsd.load().then(function (loadedModel) {
    model = loadedModel;
    esp32camButton.disabled = false;
});

While the model is loading, the code disables the button that acquires video from the ESP32-CAM. When loading completes, the button is enabled again.

Applying the COCO-SSD object detection model to the ESP32-CAM video

In this last step, we will use the COCO-SSD model to identify and classify objects in the video coming from the ESP32-CAM:

function enableCam(event) {
    video.src = 'http://' + window.location.hostname + ":81/";
    if (!model) {
        return;
    }
    predictWebcam();
}

var children = [];

function predictWebcam() {
  // Now let's start classifying a frame in the stream.
  video.crossOrigin = 'anonymous';

  model.detect(video).then(function (predictions) {
    // Remove any highlighting we did previous frame.
    for (let i = 0; i < children.length; i++) {
      liveView.removeChild(children[i]);
    }
    children.splice(0);

    // Now let's loop through predictions and draw them to the live view if
    // they have a high confidence score.
    for (let n = 0; n < predictions.length; n++) {
      console.log(predictions[n].class + ' ' + predictions[n].score);
      if (predictions[n].score > 0.55) {
        const p = document.createElement('p');
        p.innerText = predictions[n].class + ' - with '
            + Math.round(parseFloat(predictions[n].score) * 100)
            + '% confidence.';
        p.style = 'margin-left: ' + predictions[n].bbox[0] + 'px; margin-top: '
            + (predictions[n].bbox[1] - 10) + 'px; width: '
            + (predictions[n].bbox[2] - 10) + 'px; top: 0; left: 0;';

        const highlighter = document.createElement('div');
        highlighter.setAttribute('class', 'highlighter');
        highlighter.style = 'left: ' + predictions[n].bbox[0] + 'px; top: '
            + predictions[n].bbox[1] + 'px; width: '
            + predictions[n].bbox[2] + 'px; height: '
            + predictions[n].bbox[3] + 'px;';

        liveView.appendChild(highlighter);
        liveView.appendChild(p);
        children.push(highlighter);
        children.push(p);
      }
    }

    // Call this function again to keep predicting when the browser is ready.
    window.requestAnimationFrame(predictWebcam);
  });
}

All this code draws rectangles around the identified objects. Notice that the detection itself happens in this single line of code:

model.detect(video).then(function (predictions) 

Do not forget that the whole process happens in the user's browser.
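The drawing code converts each bounding box into absolute CSS positioning for the `.highlighter` div. As a standalone sketch of that conversion (the `highlighterStyle` helper name is ours, not part of the tutorial page), COCO-SSD reports `bbox` as `[x, y, width, height]` in pixels relative to the video element:

```javascript
// Hypothetical helper mirroring how the page positions each
// '.highlighter' <div>: bbox is [x, y, width, height] in pixels,
// measured from the top-left corner of the video element.
function highlighterStyle(bbox) {
  return 'left: ' + bbox[0] + 'px; top: ' + bbox[1] + 'px; width: '
      + bbox[2] + 'px; height: ' + bbox[3] + 'px;';
}

console.log(highlighterStyle([24, 10, 180, 320]));
// logs: left: 24px; top: 10px; width: 180px; height: 320px;
```

Because `.camView` has `position: relative`, these absolute coordinates place the rectangle exactly over the detected object in the stream.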

ESP32-CAM streaming video for object detection

In this step, we will develop the code that streams the video from the ESP32-CAM and implements a web server that serves the HTML page developed previously.

The full source code is shown below:

#include <Arduino.h>
#include <WiFi.h>
#include "esp_http_server.h"
#include "esp_timer.h"
#include "esp_camera.h"
#include "img_converters.h"
#include "camera_pins.h"

// HTML Page
const char index_html[] PROGMEM = R"rawliteral(
<!DOCTYPE html><html lang="en"><head><title>Multiple object detection using pre trained model in TensorFlow.js</title><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width, initial-scale=1"><style>img{display:block}.camView{position:relative;float:left;width:calc(100% - 20px);margin:10px;cursor:pointer}.camView p{position:absolute;padding:5px;background-color:rgba(255, 111, 0, 0.85);color:#FFF;border:1px dashed rgba(255,255,255,0.7);z-index:2;font-size:12px}.highlighter{background:rgba(0, 255, 0, 0.25);border:1px dashed #fff;z-index:1;position:absolute}</style></head><body><h1>ESP32 TensorFlow.js - Object Detection</h1><p>Wait for the model to load before clicking the button to enable the webcam - at which point it will become visible to use.</p> <section id="camSection" ><div id="liveView" class="camView" > <button id="esp32camButton">Start ESP32 Webcam</button> <img id="esp32cam_video" width="640" height="480" crossorigin=' '/></div> </section> <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js" type="text/javascript"></script> <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script> <script>const video=document.getElementById('esp32cam_video');const liveView=document.getElementById('liveView');const camSection=document.getElementById('camSection');const enableWebcamButton=document.getElementById('esp32camButton');esp32camButton.addEventListener('click',enableCam);var model=undefined;esp32camButton.disabled=true;cocoSsd.load().then(function(loadedModel){model=loadedModel;esp32camButton.disabled=false;});function enableCam(event){video.src='http://'+window.location.hostname+":81/";if(!model){return;} predictWebcam();} var children=[];function predictWebcam(){video.crossorigin=' ';model.detect(video).then(function(predictions){for(let i=0;i<children.length;i++){liveView.removeChild(children[i]);} children.splice(0);for(let 
n=0;n<predictions.length;n++){console.log(predictions[n].class+' '+predictions[n].score);if(predictions[n].score>0.55){const p=document.createElement('p');p.innerText=predictions[n].class+' - with ' +Math.round(parseFloat(predictions[n].score)*100) +'% confidence.';p.style='margin-left: '+predictions[n].bbox[0]+'px; margin-top: ' +(predictions[n].bbox[1]-10)+'px; width: ' +(predictions[n].bbox[2]-10)+'px; top: 0; left: 0;';const highlighter=document.createElement('div');highlighter.setAttribute('class','highlighter');highlighter.style='left: '+predictions[n].bbox[0]+'px; top: ' +predictions[n].bbox[1]+'px; width: ' +predictions[n].bbox[2]+'px; height: ' +predictions[n].bbox[3]+'px;';liveView.appendChild(highlighter);liveView.appendChild(p);children.push(highlighter);children.push(p);}} window.requestAnimationFrame(predictWebcam);});}</script> </body></html>
 )rawliteral";    

#define PART_BOUNDARY "123456789000000000000987654321"


static const char* _STREAM_CONTENT_TYPE = "multipart/x-mixed-replace;boundary=" PART_BOUNDARY;
static const char* _STREAM_BOUNDARY = "\r\n--" PART_BOUNDARY "\r\n";
static const char* _STREAM_PART = "Content-Type: image/jpeg\r\nContent-Length: %u\r\n\r\n";

httpd_handle_t camera_httpd = NULL;
httpd_handle_t stream_httpd = NULL;


const char* ssid = "your_wifi_ssid";
const char* password = "your_wifi_password";

static esp_err_t page_handler(httpd_req_t *req) {
    httpd_resp_set_type(req, "text/html");
    httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");
    return httpd_resp_send(req, index_html, strlen(index_html));
}

static esp_err_t stream_handler(httpd_req_t *req){
    camera_fb_t * fb = NULL;
    esp_err_t res = ESP_OK;
    size_t _jpg_buf_len = 0;
    uint8_t * _jpg_buf = NULL;
    char part_buf[64];


    res = httpd_resp_set_type(req, _STREAM_CONTENT_TYPE);
    if(res != ESP_OK){
        return res;
    }

    httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");

    while(true){
        fb = esp_camera_fb_get();
        if (!fb) {
            Serial.println("Camera capture failed");
            res = ESP_FAIL;
        } else {
            if(fb->format != PIXFORMAT_JPEG){
                bool jpeg_converted = frame2jpg(fb, 80, &_jpg_buf, &_jpg_buf_len);
                esp_camera_fb_return(fb);
                fb = NULL;
                if(!jpeg_converted){
                    Serial.println("JPEG compression failed");
                    res = ESP_FAIL;
                }
            } else {
                _jpg_buf_len = fb->len;
                _jpg_buf = fb->buf;
            }
        }
        if(res == ESP_OK){
            res = httpd_resp_send_chunk(req, _STREAM_BOUNDARY, strlen(_STREAM_BOUNDARY));
        }
        if(res == ESP_OK){
            size_t hlen = snprintf((char *)part_buf, 64, _STREAM_PART, _jpg_buf_len);
            res = httpd_resp_send_chunk(req, (const char *)part_buf, hlen);
        }
        if(res == ESP_OK){
            res = httpd_resp_send_chunk(req, (const char *)_jpg_buf, _jpg_buf_len);
        }
        if(fb){
            esp_camera_fb_return(fb);
            fb = NULL;
            _jpg_buf = NULL;
        } else if(_jpg_buf){
            free(_jpg_buf);
            _jpg_buf = NULL;
        }
        if(res != ESP_OK){
            break;
        }
    }
    return res;
}

void startCameraServer(){
    httpd_config_t config = HTTPD_DEFAULT_CONFIG();

    httpd_uri_t index_uri = {
        .uri       = "/",
        .method    = HTTP_GET,
        .handler   = stream_handler,
        .user_ctx  = NULL
    };

    httpd_uri_t page_uri = {
        .uri       = "/ts",
        .method    = HTTP_GET,
        .handler   = page_handler,
        .user_ctx  = NULL
    };
 


    Serial.printf("Starting web server on port: '%d'\n", config.server_port);
    if (httpd_start(&camera_httpd, &config) == ESP_OK) {
      httpd_register_uri_handler(camera_httpd, &page_uri);
    }

    // start stream using another webserver
    config.server_port += 1;
    config.ctrl_port += 1;
    Serial.printf("Starting stream server on port: '%d'\n", config.server_port);
    if (httpd_start(&stream_httpd, &config) == ESP_OK) {
        httpd_register_uri_handler(stream_httpd, &index_uri);
    }

}

void setup() {
  Serial.begin(9600);
  Serial.setDebugOutput(true);
  Serial.println();

  camera_config_t config;
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sscb_sda = SIOD_GPIO_NUM;
  config.pin_sscb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;
  
  // if PSRAM IC present, init with UXGA resolution and higher JPEG quality
  //                      for larger pre-allocated frame buffer.
  if(psramFound()){
    config.frame_size = FRAMESIZE_UXGA;
    config.jpeg_quality = 10;
    config.fb_count = 2;
  } else {
    // without PSRAM there is not enough RAM for a UXGA frame buffer
    config.frame_size = FRAMESIZE_SVGA;
    config.jpeg_quality = 12;
    config.fb_count = 1;
  }

#if defined(CAMERA_MODEL_ESP_EYE)
  pinMode(13, INPUT_PULLUP);
  pinMode(14, INPUT_PULLUP);
#endif

  // camera init
  esp_err_t err = esp_camera_init(&config);
  if (err != ESP_OK) {
    Serial.printf("Camera init failed with error 0x%x", err);
    return;
  }

  sensor_t * s = esp_camera_sensor_get();
  // initial sensors are flipped vertically and colors are a bit saturated
  if (s->id.PID == OV3660_PID) {
    s->set_vflip(s, 1); // flip it back
    s->set_brightness(s, 1); // up the brightness just a bit
    s->set_saturation(s, 0); // lower the saturation
  }
  // drop down frame size for higher initial frame rate
  s->set_framesize(s, FRAMESIZE_QVGA);

#if defined(CAMERA_MODEL_M5STACK_WIDE)
  s->set_vflip(s, 1);
  s->set_hmirror(s, 1);
#endif

  WiFi.begin(ssid, password);

  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("");
  Serial.println("WiFi connected");

  startCameraServer();

  Serial.print("Camera Ready! Use 'http://");
  Serial.print(WiFi.localIP());
  Serial.println("' to connect");
}

void loop() {
  // put your main code here, to run repeatedly:
  delay(10);
}

while camera_pins.h is:


#if defined(CAMERA_MODEL_WROVER_KIT)
#define PWDN_GPIO_NUM    -1
#define RESET_GPIO_NUM   -1
#define XCLK_GPIO_NUM    21
#define SIOD_GPIO_NUM    26
#define SIOC_GPIO_NUM    27

#define Y9_GPIO_NUM      35
#define Y8_GPIO_NUM      34
#define Y7_GPIO_NUM      39
#define Y6_GPIO_NUM      36
#define Y5_GPIO_NUM      19
#define Y4_GPIO_NUM      18
#define Y3_GPIO_NUM       5
#define Y2_GPIO_NUM       4
#define VSYNC_GPIO_NUM   25
#define HREF_GPIO_NUM    23
#define PCLK_GPIO_NUM    22

#elif defined(CAMERA_MODEL_ESP_EYE)
#define PWDN_GPIO_NUM    -1
#define RESET_GPIO_NUM   -1
#define XCLK_GPIO_NUM    4
#define SIOD_GPIO_NUM    18
#define SIOC_GPIO_NUM    23

#define Y9_GPIO_NUM      36
#define Y8_GPIO_NUM      37
#define Y7_GPIO_NUM      38
#define Y6_GPIO_NUM      39
#define Y5_GPIO_NUM      35
#define Y4_GPIO_NUM      14
#define Y3_GPIO_NUM      13
#define Y2_GPIO_NUM      34
#define VSYNC_GPIO_NUM   5
#define HREF_GPIO_NUM    27
#define PCLK_GPIO_NUM    25

#elif defined(CAMERA_MODEL_M5STACK_PSRAM)
#define PWDN_GPIO_NUM     -1
#define RESET_GPIO_NUM    15
#define XCLK_GPIO_NUM     27
#define SIOD_GPIO_NUM     25
#define SIOC_GPIO_NUM     23

#define Y9_GPIO_NUM       19
#define Y8_GPIO_NUM       36
#define Y7_GPIO_NUM       18
#define Y6_GPIO_NUM       39
#define Y5_GPIO_NUM        5
#define Y4_GPIO_NUM       34
#define Y3_GPIO_NUM       35
#define Y2_GPIO_NUM       32
#define VSYNC_GPIO_NUM    22
#define HREF_GPIO_NUM     26
#define PCLK_GPIO_NUM     21

#elif defined(CAMERA_MODEL_M5STACK_WIDE)
#define PWDN_GPIO_NUM     -1
#define RESET_GPIO_NUM    15
#define XCLK_GPIO_NUM     27
#define SIOD_GPIO_NUM     22
#define SIOC_GPIO_NUM     23

#define Y9_GPIO_NUM       19
#define Y8_GPIO_NUM       36
#define Y7_GPIO_NUM       18
#define Y6_GPIO_NUM       39
#define Y5_GPIO_NUM        5
#define Y4_GPIO_NUM       34
#define Y3_GPIO_NUM       35
#define Y2_GPIO_NUM       32
#define VSYNC_GPIO_NUM    25
#define HREF_GPIO_NUM     26
#define PCLK_GPIO_NUM     21

#elif defined(CAMERA_MODEL_AI_THINKER)
#define PWDN_GPIO_NUM     32
#define RESET_GPIO_NUM    -1
#define XCLK_GPIO_NUM      0
#define SIOD_GPIO_NUM     26
#define SIOC_GPIO_NUM     27

#define Y9_GPIO_NUM       35
#define Y8_GPIO_NUM       34
#define Y7_GPIO_NUM       39
#define Y6_GPIO_NUM       36
#define Y5_GPIO_NUM       21
#define Y4_GPIO_NUM       19
#define Y3_GPIO_NUM       18
#define Y2_GPIO_NUM        5
#define VSYNC_GPIO_NUM    25
#define HREF_GPIO_NUM     23
#define PCLK_GPIO_NUM     22

#elif defined(CAMERA_MODEL_TTGO_T_JOURNAL)
#define PWDN_GPIO_NUM      0
#define RESET_GPIO_NUM    15
#define XCLK_GPIO_NUM     27
#define SIOD_GPIO_NUM     25
#define SIOC_GPIO_NUM     23

#define Y9_GPIO_NUM       19
#define Y8_GPIO_NUM       36
#define Y7_GPIO_NUM       18
#define Y6_GPIO_NUM       39
#define Y5_GPIO_NUM        5
#define Y4_GPIO_NUM       34
#define Y3_GPIO_NUM       35
#define Y2_GPIO_NUM       17
#define VSYNC_GPIO_NUM    22
#define HREF_GPIO_NUM     26
#define PCLK_GPIO_NUM     21

#else
#error "Camera model not selected"
#endif

We have already covered several times how to stream video from the ESP32-CAM.

The code above opens two ports:

  • 80, where the web server serving the HTML page runs
  • 81, used to stream the video
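The page itself is served on port 80, while enableCam() in the browser points the <img> element at the MJPEG stream on port 81, building the URL from the hostname the page was loaded from. A minimal sketch of that URL construction (the helper name is ours; it assumes the stream server listens on port 81, as configured in startCameraServer()):

```javascript
// Build the ESP32-CAM stream URL from the hostname the page was
// served from; the stream server is assumed to listen on port 81,
// one above the web server port, as set up in startCameraServer().
function streamUrl(hostname) {
  return 'http://' + hostname + ':81/';
}

console.log(streamUrl('192.168.1.50'));
// logs: http://192.168.1.50:81/
```

Using window.location.hostname this way means the page works whatever IP address your router assigns to the ESP32-CAM.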

Notice that the HTML page we built before is minified in the ESP32-CAM code.

Testing the ESP32-CAM object detection

Now we can test the ESP32-CAM object detection project. Before uploading the code to your device, set these two parameters according to your Wi-Fi settings:

const char* ssid = "your_wifi_ssid";
const char* password = "your_wifi_password";

Next, run the code. If everything works, you should see something like this in the serial monitor:

ESP32-CAM with TensorFlow.js detecting objects

Now, open your browser at the following URL:

http://<esp32-cam-ip>/ts

and the loaded page is shown below:

How to detect objects using ESP32-CAM and Tensorflow.js

Wrapping up

At the end of this post, we have covered how to implement ESP32-CAM object detection using TensorFlow.js. This tutorial described how to capture the video stream coming from the ESP32-CAM and use the TensorFlow JavaScript library to identify and classify objects. Do not forget that the inference process runs inside the browser, not on the ESP32-CAM. By integrating the ESP32-CAM with pre-trained TensorFlow models such as COCO-SSD, we are able to identify objects.