ESP32-CAM Object detection with Tensorflow.js

This tutorial covers how to implement ESP32-CAM object detection using Tensorflow.js. Although the ESP32-CAM can run a simple machine learning model directly (for example, it can detect faces), it doesn't have the power to run a complex model. Therefore, we will apply Tensorflow.js to the video coming from the ESP32-CAM to detect objects. Be aware that in this tutorial Tensorflow.js runs in the computer's browser, so the machine learning model runs inside your browser and not on the device. We have already covered how to use ESP32-CAM to classify images using Tensorflow.js. In this tutorial, we will use the COCO-SSD model to detect objects in a video stream from the ESP32-CAM.

As you may already know, Tensorflow has several pre-trained models that make it easy to get started with machine learning. COCO-SSD is an ML model used to localize and identify objects in an image. In this tutorial, we will take the Tensorflow tutorial and adapt it to the ESP32-CAM.

If you would like to explore how to detect objects using machine learning directly on the device, you can read this tutorial on how to use Tensorflow Lite with Raspberry Pi.

How to use the Tensorflow.js JavaScript library in a web page

The first step to use ESP32-CAM with Tensorflow.js to detect objects is building the web page where the inference will happen. To use the Tensorflow JavaScript library we have to follow these steps:

  1. Importing the Tensorflow JavaScript libraries
  2. Loading the model, in this project the COCO-SSD pretrained ML model
  3. Applying the COCO-SSD model to the incoming video and drawing rectangles around the identified objects

Later we will see how to integrate ESP32-CAM with Tensorflow.JS using the page we are creating in this step. This is the full HTML page that holds the Tensorflow.JS:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Multiple object detection using pre trained model in TensorFlow.js</title>
  <meta charset="utf-8" />
  <meta http-equiv="X-UA-Compatible" content="IE=edge" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <style>
    img { display: block; }
    .camView { position: relative; float: left; width: calc(100% - 20px); margin: 10px; cursor: pointer; }
    .camView p { position: absolute; padding: 5px; background-color: rgba(255, 111, 0, 0.85); color: #FFF; border: 1px dashed rgba(255, 255, 255, 0.7); z-index: 2; font-size: 12px; }
    .highlighter { background: rgba(0, 255, 0, 0.25); border: 1px dashed #fff; z-index: 1; position: absolute; }
  </style>
</head>
<body>
  <h1>ESP32 TensorFlow.js - Object Detection</h1>
  <p>Wait for the model to load before clicking the button to enable the webcam - at which point it will become visible to use.</p>
  <section id="camSection">
    <div id="liveView" class="camView">
      <button id="esp32camButton">Start ESP32 Webcam</button>
      <img id="esp32cam_video" width="640" height="480" crossorigin="anonymous" />
    </div>
  </section>
  <!-- Import TensorFlow.js library -->
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js" type="text/javascript"></script>
  <!-- Load the coco-ssd model to use to recognize things in images -->
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script>
  <script>
    const video = document.getElementById('esp32cam_video');
    const liveView = document.getElementById('liveView');
    const camSection = document.getElementById('camSection');
    const esp32camButton = document.getElementById('esp32camButton');
    esp32camButton.addEventListener('click', enableCam);

    // Store the resulting model in the global scope of our app.
    var model = undefined;
    esp32camButton.disabled = true;
    cocoSsd.load().then(function (loadedModel) {
      model = loadedModel;
      esp32camButton.disabled = false;
    });

    function enableCam(event) {
      // Request the stream with CORS so TensorFlow.js can read its pixels.
      // The DOM property is crossOrigin (capital O).
      video.crossOrigin = 'anonymous';
      video.src = 'http://' + window.location.hostname + ':81/';
      if (!model) {
        return;
      }
      predictWebcam();
    }

    var children = [];

    function predictWebcam() {
      // Now let's start classifying a frame in the stream.
      model.detect(video).then(function (predictions) {
        // Remove any highlighting we did previous frame.
        for (let i = 0; i < children.length; i++) {
          liveView.removeChild(children[i]);
        }
        children.splice(0);
        // Now let's loop through predictions and draw them to the live view if
        // they have a high confidence score.
        for (let n = 0; n < predictions.length; n++) {
          console.log(predictions[n].class + ' ' + predictions[n].score);
          if (predictions[n].score > 0.55) {
            const p = document.createElement('p');
            p.innerText = predictions[n].class + ' - with '
              + Math.round(parseFloat(predictions[n].score) * 100)
              + '% confidence.';
            p.style = 'margin-left: ' + predictions[n].bbox[0] + 'px; margin-top: '
              + (predictions[n].bbox[1] - 10) + 'px; width: '
              + (predictions[n].bbox[2] - 10) + 'px; top: 0; left: 0;';
            const highlighter = document.createElement('div');
            highlighter.setAttribute('class', 'highlighter');
            highlighter.style = 'left: ' + predictions[n].bbox[0] + 'px; top: '
              + predictions[n].bbox[1] + 'px; width: '
              + predictions[n].bbox[2] + 'px; height: '
              + predictions[n].bbox[3] + 'px;';
            liveView.appendChild(highlighter);
            liveView.appendChild(p);
            children.push(highlighter);
            children.push(p);
          }
        }
        // Call this function again to keep predicting when the browser is ready.
        window.requestAnimationFrame(predictWebcam);
      });
    }
  </script>
</body>
</html>
Code language: HTML, XML (xml)

More useful resources:
How to classify objects using KNN and ESP32
How to classify object using ESP32-CAM and cloud machine learning model
ESP32-CAM image classification using Tensorflow lite micro

Importing the Tensorflow JavaScript library

The first step is importing the JavaScript library:

<!-- Import TensorFlow.js library -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js" type="text/javascript"></script>
<!-- Load the coco-ssd model to use to recognize things in images -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script>
Code language: HTML, XML (xml)

Notice that the code above also imports the COCO-SSD model.

Loading the COCO-SSD model

Next, we have to load the model and wait until the model is fully loaded:

esp32camButton.disabled = true;
cocoSsd.load().then(function (loadedModel) {
  model = loadedModel;
  esp32camButton.disabled = false;
});
Code language: JavaScript (javascript)

While the model is loading, the code disables the button to acquire video from the ESP32-CAM. When the load is complete, the button is enabled.
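Note that if the model fails to load (for example, because the CDN cannot be reached), the snippet above leaves the button disabled without any feedback. The following is a minimal sketch of one way to handle that case, not part of the original page. The helper name `guardButtonWhileLoading` and the `onReady` and `onError` callbacks are hypothetical; `loadModel` stands for any function that returns a promise, such as `() => cocoSsd.load()`.

```javascript
// Hypothetical helper: keeps a button disabled until a model-loading
// promise settles. `loadModel` is any function that returns a promise
// (in the page above it would be `() => cocoSsd.load()`).
function guardButtonWhileLoading(button, loadModel, onReady, onError) {
  button.disabled = true;
  return loadModel().then(
    function (loadedModel) {
      button.disabled = false; // model ready: let the user start the video
      onReady(loadedModel);
    },
    function (err) {
      // Leave the button disabled and report the failure instead of
      // failing silently.
      onError(err);
    }
  );
}

// Possible use in the browser (assumes the globals from the page above):
// guardButtonWhileLoading(
//   esp32camButton,
//   () => cocoSsd.load(),
//   (m) => { model = m; },
//   (e) => console.error('Model load failed', e)
// );
```

Passing the loader in as a parameter keeps the helper independent of TensorFlow.js, so it can be exercised without a browser.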

Applying the COCO-SSD object detection model to ESP32-CAM video

In this last step, we will use the COCO-SSD model to identify and classify objects in the video coming from the ESP32-CAM:

function enableCam(event) {
  // Request the stream with CORS so TensorFlow.js can read its pixels.
  // The DOM property is crossOrigin (capital O).
  video.crossOrigin = 'anonymous';
  video.src = 'http://' + window.location.hostname + ':81/';
  if (!model) {
    return;
  }
  predictWebcam();
}

var children = [];

function predictWebcam() {
  // Now let's start classifying a frame in the stream.
  model.detect(video).then(function (predictions) {
    // Remove any highlighting we did previous frame.
    for (let i = 0; i < children.length; i++) {
      liveView.removeChild(children[i]);
    }
    children.splice(0);
    // Now let's loop through predictions and draw them to the live view if
    // they have a high confidence score.
    for (let n = 0; n < predictions.length; n++) {
      console.log(predictions[n].class + ' ' + predictions[n].score);
      if (predictions[n].score > 0.55) {
        const p = document.createElement('p');
        p.innerText = predictions[n].class + ' - with '
          + Math.round(parseFloat(predictions[n].score) * 100)
          + '% confidence.';
        p.style = 'margin-left: ' + predictions[n].bbox[0] + 'px; margin-top: '
          + (predictions[n].bbox[1] - 10) + 'px; width: '
          + (predictions[n].bbox[2] - 10) + 'px; top: 0; left: 0;';
        const highlighter = document.createElement('div');
        highlighter.setAttribute('class', 'highlighter');
        highlighter.style = 'left: ' + predictions[n].bbox[0] + 'px; top: '
          + predictions[n].bbox[1] + 'px; width: '
          + predictions[n].bbox[2] + 'px; height: '
          + predictions[n].bbox[3] + 'px;';
        liveView.appendChild(highlighter);
        liveView.appendChild(p);
        children.push(highlighter);
        children.push(p);
      }
    }
    // Call this function again to keep predicting when the browser is ready.
    window.requestAnimationFrame(predictWebcam);
  });
}
Code language: JavaScript (javascript)

Most of this code draws rectangles around the identified objects. Notice that the actual detection happens in this single line of code:

model.detect(video).then(function (predictions)
Code language: JavaScript (javascript)

Do not forget that the whole process happens in the user's browser.
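To make the structure of the detection result easier to see, here is a small sketch that separates the filtering logic from the drawing code. It assumes each prediction has the shape the page code reads, `{ bbox: [x, y, width, height], class, score }`. The function name `toOverlays` and the sample values are hypothetical and only serve as an illustration.

```javascript
// Hypothetical helper: turns COCO-SSD style predictions into plain overlay
// descriptions, keeping only those above a confidence threshold.
// Each prediction is expected to look like
// { bbox: [x, y, width, height], class: 'person', score: 0.87 }.
function toOverlays(predictions, minScore) {
  return predictions
    .filter(function (p) { return p.score > minScore; })
    .map(function (p) {
      return {
        label: p.class + ' - with ' + Math.round(p.score * 100) + '% confidence.',
        left: p.bbox[0],   // x of the top-left corner, in pixels
        top: p.bbox[1],    // y of the top-left corner, in pixels
        width: p.bbox[2],
        height: p.bbox[3]
      };
    });
}

// Made-up sample data, only to show the call.
const sample = [
  { bbox: [10, 20, 100, 50], class: 'person', score: 0.87 },
  { bbox: [0, 0, 30, 30], class: 'cup', score: 0.4 }
];
console.log(toOverlays(sample, 0.55));
```

With the 0.55 threshold used in the page, only the first sample prediction is kept, and the second one is discarded because its score is too low.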

ESP32-CAM streaming video for object detection

Finally, we will develop the code that streams the video from the ESP32-CAM and implements a Web server that serves the HTML page developed previously.

The full source code is shown below:

#include <Arduino.h>
#include <WiFi.h>
#include "esp_http_server.h"
#include "esp_timer.h"
#include "esp_camera.h"
#include "img_converters.h"

// camera_pins.h needs exactly one CAMERA_MODEL_* macro to be defined.
// Assumption: an AI-Thinker ESP32-CAM board. Change this line to match your
// board, or remove it if the model is already defined in your build flags.
#define CAMERA_MODEL_AI_THINKER
#include "camera_pins.h"

// HTML Page
const char index_html[] PROGMEM = R"rawliteral(
<!DOCTYPE html><html lang="en"><head><title>Multiple object detection using pre trained model in TensorFlow.js</title><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width, initial-scale=1"><style>img{display:block}.camView{position:relative;float:left;width:calc(100% - 20px);margin:10px;cursor:pointer}.camView p{position:absolute;padding:5px;background-color:rgba(255, 111, 0, 0.85);color:#FFF;border:1px dashed rgba(255,255,255,0.7);z-index:2;font-size:12px}.highlighter{background:rgba(0, 255, 0, 0.25);border:1px dashed #fff;z-index:1;position:absolute}</style></head><body><h1>ESP32 TensorFlow.js - Object Detection</h1><p>Wait for the model to load before clicking the button to enable the webcam - at which point it will become visible to use.</p>
<section id="camSection"><div id="liveView" class="camView"><button id="esp32camButton">Start ESP32 Webcam</button><img id="esp32cam_video" width="640" height="480" crossorigin="anonymous"/></div></section>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js" type="text/javascript"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script>
<script>const video=document.getElementById('esp32cam_video');const liveView=document.getElementById('liveView');const camSection=document.getElementById('camSection');const esp32camButton=document.getElementById('esp32camButton');esp32camButton.addEventListener('click',enableCam);var model=undefined;esp32camButton.disabled=true;cocoSsd.load().then(function(loadedModel){model=loadedModel;esp32camButton.disabled=false;});
function enableCam(event){video.crossOrigin='anonymous';video.src='http://'+window.location.hostname+":81/";if(!model){return;} predictWebcam();}
var children=[];
function predictWebcam(){model.detect(video).then(function(predictions){for(let i=0;i<children.length;i++){liveView.removeChild(children[i]);} children.splice(0);for(let n=0;n<predictions.length;n++){console.log(predictions[n].class+' '+predictions[n].score);if(predictions[n].score>0.55){const p=document.createElement('p');p.innerText=predictions[n].class+' - with '+Math.round(parseFloat(predictions[n].score)*100)+'% confidence.';p.style='margin-left: '+predictions[n].bbox[0]+'px; margin-top: '+(predictions[n].bbox[1]-10)+'px; width: '+(predictions[n].bbox[2]-10)+'px; top: 0; left: 0;';const highlighter=document.createElement('div');highlighter.setAttribute('class','highlighter');highlighter.style='left: '+predictions[n].bbox[0]+'px; top: '+predictions[n].bbox[1]+'px; width: '+predictions[n].bbox[2]+'px; height: '+predictions[n].bbox[3]+'px;';liveView.appendChild(highlighter);liveView.appendChild(p);children.push(highlighter);children.push(p);}} window.requestAnimationFrame(predictWebcam);});}
</script>
</body></html>
)rawliteral";

#define PART_BOUNDARY "123456789000000000000987654321"
static const char* _STREAM_CONTENT_TYPE = "multipart/x-mixed-replace;boundary=" PART_BOUNDARY;
static const char* _STREAM_BOUNDARY = "\r\n--" PART_BOUNDARY "\r\n";
static const char* _STREAM_PART = "Content-Type: image/jpeg\r\nContent-Length: %u\r\n\r\n";

httpd_handle_t camera_httpd = NULL;
httpd_handle_t stream_httpd = NULL;

const char* ssid = "your_wifi_ssid";
const char* password = "your_wifi_password";

// Serves the HTML page that runs TensorFlow.js in the browser.
static esp_err_t page_handler(httpd_req_t *req) {
  httpd_resp_set_type(req, "text/html");
  httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");
  // strlen() leaves out the terminating NUL byte that sizeof() would send.
  return httpd_resp_send(req, index_html, strlen(index_html));
}

// Streams JPEG frames as a multipart/x-mixed-replace response.
static esp_err_t stream_handler(httpd_req_t *req) {
  camera_fb_t * fb = NULL;
  esp_err_t res = ESP_OK;
  size_t _jpg_buf_len = 0;
  uint8_t * _jpg_buf = NULL;
  char part_buf[64];  // buffer for the per-frame part header

  res = httpd_resp_set_type(req, _STREAM_CONTENT_TYPE);
  if (res != ESP_OK) {
    return res;
  }
  httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");

  while (true) {
    fb = esp_camera_fb_get();
    if (!fb) {
      Serial.println("Camera capture failed");
      res = ESP_FAIL;
    } else {
      if (fb->format != PIXFORMAT_JPEG) {
        bool jpeg_converted = frame2jpg(fb, 80, &_jpg_buf, &_jpg_buf_len);
        esp_camera_fb_return(fb);
        fb = NULL;
        if (!jpeg_converted) {
          Serial.println("JPEG compression failed");
          res = ESP_FAIL;
        }
      } else {
        _jpg_buf_len = fb->len;
        _jpg_buf = fb->buf;
      }
    }
    if (res == ESP_OK) {
      res = httpd_resp_send_chunk(req, _STREAM_BOUNDARY, strlen(_STREAM_BOUNDARY));
    }
    if (res == ESP_OK) {
      size_t hlen = snprintf(part_buf, sizeof(part_buf), _STREAM_PART, _jpg_buf_len);
      res = httpd_resp_send_chunk(req, part_buf, hlen);
    }
    if (res == ESP_OK) {
      res = httpd_resp_send_chunk(req, (const char *)_jpg_buf, _jpg_buf_len);
    }
    if (fb) {
      esp_camera_fb_return(fb);
      fb = NULL;
      _jpg_buf = NULL;
    } else if (_jpg_buf) {
      free(_jpg_buf);
      _jpg_buf = NULL;
    }
    if (res != ESP_OK) {
      break;
    }
  }
  return res;
}

void startCameraServer() {
  httpd_config_t config = HTTPD_DEFAULT_CONFIG();

  httpd_uri_t index_uri = {
    .uri = "/",
    .method = HTTP_GET,
    .handler = stream_handler,
    .user_ctx = NULL
  };

  httpd_uri_t page_uri = {
    .uri = "/ts",
    .method = HTTP_GET,
    .handler = page_handler,
    .user_ctx = NULL
  };

  Serial.printf("Starting web server on port: '%d'\n", config.server_port);
  if (httpd_start(&camera_httpd, &config) == ESP_OK) {
    httpd_register_uri_handler(camera_httpd, &page_uri);
  }

  // start stream using another webserver
  config.server_port += 1;
  config.ctrl_port += 1;
  Serial.printf("Starting stream server on port: '%d'\n", config.server_port);
  if (httpd_start(&stream_httpd, &config) == ESP_OK) {
    httpd_register_uri_handler(stream_httpd, &index_uri);
  }
}

void setup() {
  Serial.begin(9600);
  Serial.setDebugOutput(true);
  Serial.println();

  camera_config_t config;
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sscb_sda = SIOD_GPIO_NUM;
  config.pin_sscb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;

  // if PSRAM IC present, init with UXGA resolution and higher JPEG quality
  // for larger pre-allocated frame buffer.
  if (psramFound()) {
    config.frame_size = FRAMESIZE_UXGA;
    config.jpeg_quality = 10;
    config.fb_count = 2;
  } else {
    // Without PSRAM a smaller frame buffer is used.
    config.frame_size = FRAMESIZE_SVGA;
    config.jpeg_quality = 12;
    config.fb_count = 1;
  }

#if defined(CAMERA_MODEL_ESP_EYE)
  pinMode(13, INPUT_PULLUP);
  pinMode(14, INPUT_PULLUP);
#endif

  // camera init
  esp_err_t err = esp_camera_init(&config);
  if (err != ESP_OK) {
    Serial.printf("Camera init failed with error 0x%x", err);
    return;
  }

  sensor_t * s = esp_camera_sensor_get();
  // initial sensors are flipped vertically and colors are a bit saturated
  if (s->id.PID == OV3660_PID) {
    s->set_vflip(s, 1);       // flip it back
    s->set_brightness(s, 1);  // up the brightness just a bit
    s->set_saturation(s, 0);  // lower the saturation
  }
  // drop down frame size for higher initial frame rate
  s->set_framesize(s, FRAMESIZE_QVGA);

#if defined(CAMERA_MODEL_M5STACK_WIDE)
  s->set_vflip(s, 1);
  s->set_hmirror(s, 1);
#endif

  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("");
  Serial.println("WiFi connected");

  startCameraServer();

  Serial.print("Camera Ready! Use 'http://");
  Serial.print(WiFi.localIP());
  Serial.println("' to connect");
}

void loop() {
  // put your main code here, to run repeatedly:
  delay(10);
}
Code language: C++ (cpp)

The camera_pins.h file is:

#if defined(CAMERA_MODEL_WROVER_KIT)
#define PWDN_GPIO_NUM    -1
#define RESET_GPIO_NUM   -1
#define XCLK_GPIO_NUM    21
#define SIOD_GPIO_NUM    26
#define SIOC_GPIO_NUM    27
#define Y9_GPIO_NUM      35
#define Y8_GPIO_NUM      34
#define Y7_GPIO_NUM      39
#define Y6_GPIO_NUM      36
#define Y5_GPIO_NUM      19
#define Y4_GPIO_NUM      18
#define Y3_GPIO_NUM       5
#define Y2_GPIO_NUM       4
#define VSYNC_GPIO_NUM   25
#define HREF_GPIO_NUM    23
#define PCLK_GPIO_NUM    22

#elif defined(CAMERA_MODEL_ESP_EYE)
#define PWDN_GPIO_NUM    -1
#define RESET_GPIO_NUM   -1
#define XCLK_GPIO_NUM     4
#define SIOD_GPIO_NUM    18
#define SIOC_GPIO_NUM    23
#define Y9_GPIO_NUM      36
#define Y8_GPIO_NUM      37
#define Y7_GPIO_NUM      38
#define Y6_GPIO_NUM      39
#define Y5_GPIO_NUM      35
#define Y4_GPIO_NUM      14
#define Y3_GPIO_NUM      13
#define Y2_GPIO_NUM      34
#define VSYNC_GPIO_NUM    5
#define HREF_GPIO_NUM    27
#define PCLK_GPIO_NUM    25

#elif defined(CAMERA_MODEL_M5STACK_PSRAM)
#define PWDN_GPIO_NUM    -1
#define RESET_GPIO_NUM   15
#define XCLK_GPIO_NUM    27
#define SIOD_GPIO_NUM    25
#define SIOC_GPIO_NUM    23
#define Y9_GPIO_NUM      19
#define Y8_GPIO_NUM      36
#define Y7_GPIO_NUM      18
#define Y6_GPIO_NUM      39
#define Y5_GPIO_NUM       5
#define Y4_GPIO_NUM      34
#define Y3_GPIO_NUM      35
#define Y2_GPIO_NUM      32
#define VSYNC_GPIO_NUM   22
#define HREF_GPIO_NUM    26
#define PCLK_GPIO_NUM    21

#elif defined(CAMERA_MODEL_M5STACK_WIDE)
#define PWDN_GPIO_NUM    -1
#define RESET_GPIO_NUM   15
#define XCLK_GPIO_NUM    27
#define SIOD_GPIO_NUM    22
#define SIOC_GPIO_NUM    23
#define Y9_GPIO_NUM      19
#define Y8_GPIO_NUM      36
#define Y7_GPIO_NUM      18
#define Y6_GPIO_NUM      39
#define Y5_GPIO_NUM       5
#define Y4_GPIO_NUM      34
#define Y3_GPIO_NUM      35
#define Y2_GPIO_NUM      32
#define VSYNC_GPIO_NUM   25
#define HREF_GPIO_NUM    26
#define PCLK_GPIO_NUM    21

#elif defined(CAMERA_MODEL_AI_THINKER)
#define PWDN_GPIO_NUM    32
#define RESET_GPIO_NUM   -1
#define XCLK_GPIO_NUM     0
#define SIOD_GPIO_NUM    26
#define SIOC_GPIO_NUM    27
#define Y9_GPIO_NUM      35
#define Y8_GPIO_NUM      34
#define Y7_GPIO_NUM      39
#define Y6_GPIO_NUM      36
#define Y5_GPIO_NUM      21
#define Y4_GPIO_NUM      19
#define Y3_GPIO_NUM      18
#define Y2_GPIO_NUM       5
#define VSYNC_GPIO_NUM   25
#define HREF_GPIO_NUM    23
#define PCLK_GPIO_NUM    22

#elif defined(CAMERA_MODEL_TTGO_T_JOURNAL)
#define PWDN_GPIO_NUM     0
#define RESET_GPIO_NUM   15
#define XCLK_GPIO_NUM    27
#define SIOD_GPIO_NUM    25
#define SIOC_GPIO_NUM    23
#define Y9_GPIO_NUM      19
#define Y8_GPIO_NUM      36
#define Y7_GPIO_NUM      18
#define Y6_GPIO_NUM      39
#define Y5_GPIO_NUM       5
#define Y4_GPIO_NUM      34
#define Y3_GPIO_NUM      35
#define Y2_GPIO_NUM      17
#define VSYNC_GPIO_NUM   22
#define HREF_GPIO_NUM    26
#define PCLK_GPIO_NUM    21

#else
#error "Camera model not selected"
#endif
Code language: C++ (cpp)

We have already covered several times how to stream video from the ESP32-CAM.

The code above opens two ports:

  • 80, where the web server serves the HTML page
  • 81, used to stream the video
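As a quick illustration of how the two ports map to URLs, the sketch below builds both addresses from the device host name or IP address. The ports and the "/ts" path come from the ESP32-CAM code above; the function name `esp32CamUrls` and the sample IP address are hypothetical.

```javascript
// Hypothetical helper: derives the two URLs used in this project from the
// ESP32-CAM host name or IP address.
function esp32CamUrls(host) {
  return {
    page: 'http://' + host + '/ts',    // HTML page, web server on port 80
    stream: 'http://' + host + ':81/'  // video stream, server on port 81
  };
}

// The IP address below is only an example value.
console.log(esp32CamUrls('192.168.1.50'));
```

The `stream` value has the same form as the address the page assigns to `video.src`, where the host is taken from `window.location.hostname`.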

Notice that the HTML page we built before is embedded in minified form in the ESP32-CAM code.

Testing the ESP32-CAM object detection

Now we can test the ESP32-CAM object detection project. Before uploading the code to your device, set these two parameters according to your Wi-Fi settings:

const char* ssid = "your_wifi_ssid";
const char* password = "your_wifi_password";
Code language: C++ (cpp)

Next, upload and run the code. If everything works, you should see something like this in the serial monitor:

[Figure: ESP32-CAM with Tensorflow JS to detect objects]

Now, open your browser and go to the following URL:

http://<esp32-cam-ip>/ts
Code language: HTTP (http)

The loaded page is shown below:

[Figure: How to detect objects using ESP32-CAM and Tensorflow.js]

Wrapping up

In this post, we have covered how to implement ESP32-CAM object detection using Tensorflow.js. The tutorial describes how to capture the video stream from the ESP32-CAM and use the Tensorflow JavaScript library to identify and classify objects. Do not forget that the inference process runs inside the browser and not on the ESP32-CAM. By integrating the ESP32-CAM with pretrained Tensorflow models such as COCO-SSD, we are able to identify objects.
