Gesture Recognition Using TensorFlow.js

Stefan B.
Published February 12, 2025
10 min read

Hand gesture recognition has become increasingly important in computer vision and human-computer interaction. With the rise of video conferencing and virtual interactions, there's a growing need for intuitive ways to control our digital environments. In this tutorial, we'll explore building a hand gesture detection system using TensorFlow.js that can recognize various hand poses in real-time through our webcam.

We'll start with a basic web project and gradually build up to a sophisticated application that detects hand poses and triggers actions based on specific gestures. By the end of this tutorial, we’ll have a working prototype that can recognize gestures like thumbs up, victory signs, and even custom poses. More importantly, we’ll learn how to integrate this technology into real-world applications, such as video conferencing platforms (Stream’s Video SDK in this example), where gestures could control features like mute/unmute or trigger interactive effects.

This project is particularly relevant for web developers interested in machine learning applications or anyone looking to enhance their web applications with gesture-based controls. While we'll use TensorFlow.js and the fingerpose package for our implementation, the concepts you'll learn can be applied to various other gesture recognition scenarios.

Setting Up a Basic Web Project

To keep things clean and straightforward, we'll work with a basic web project without any dependencies on frameworks. If you want to learn how to integrate with other frameworks, let me know in the comments.

The basic building blocks of a web project are an index.html file, an index.js file for our logic, and a package.json to tie it all together. Let's create all of them in a root directory.

We can create a new folder for the project and inside, run the following command:

bash
npm init -y

The -y parameter automatically answers all questions during the setup process and generates a package.json file.

Then we create an empty index.js file and an index.html file that we fill with the following skeleton content:

html
<!DOCTYPE html>
<html lang="en">
  <head>
  </head>
  <body>
    <h1>Pose Detection</h1>
    <script src="./index.js"></script>
  </body>
</html>

To finish setting up the project, we want to serve our app over HTTP. The http-server package makes this easy, so we install it first:

bash
npm install http-server

Then, we add a script to the package.json to start the server on port 1234:

json
"scripts": {
  "watch": "npm run build && node_modules/http-server/bin/http-server dist -p 1234",
  "build": "mkdir -p dist/ && cp index.html dist/ && cp index.js dist/"
},

When we now run npm run watch (or yarn watch), we can see our project by visiting localhost:1234 in the browser of our choice.

Setting Up the HTML

Before implementing the logic, we create the basic HTML structure that holds our components and add some dependencies that we’ll use later.

First, let's start with the basic HTML structure:

html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
  </head>
  <body>
    <h1>Pose Detection</h1>
  </body>
</html>

Next, we add some CSS to properly position our video elements inside the head element:

html
<style>
  #video-container {
    position: relative;
  }
  #webcam {
    position: absolute;
    visibility: hidden;
  }
</style>

The webcam is hidden because we'll draw the video feed onto a canvas instead of directly showing it.

We need to include several TensorFlow.js dependencies and the fingerpose library for hand gesture detection:

html
<script src="https://cdn.jsdelivr.net/npm/fingerpose@0.1.0/dist/fingerpose.min.js"></script>
<script src="https://unpkg.com/@tensorflow/tfjs-core@3.7.0/dist/tf-core.js"></script>
<script src="https://unpkg.com/@tensorflow/tfjs-converter@3.7.0/dist/tf-converter.js"></script>
<script src="https://unpkg.com/@tensorflow/tfjs-backend-webgl@3.7.0/dist/tf-backend-webgl.js"></script>
<script src="https://unpkg.com/@tensorflow-models/handpose@0.0.7/dist/handpose.js"></script>

Finally, we add the video container with both a canvas (for drawing) and a video element (for the webcam feed):

html
<div id="video-container">
  <canvas id="canvas"></canvas>
  <video id="webcam"></video>
</div>

This setup allows us to capture webcam input and process it with TensorFlow.js for hand pose detection. It will later enable features like muting/unmuting with hand gestures and triggering special effects, such as confetti, with specific poses.
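
For reference, here is one way the assembled index.html could look once all the pieces from this section are in place. The exact placement of the script tags is our own choice and isn't prescribed by the snippets above; they only need to load before index.js:

html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <style>
      #video-container {
        position: relative;
      }
      #webcam {
        position: absolute;
        visibility: hidden;
      }
    </style>
    <!-- TensorFlow.js, the handpose model, and fingerpose -->
    <script src="https://cdn.jsdelivr.net/npm/fingerpose@0.1.0/dist/fingerpose.min.js"></script>
    <script src="https://unpkg.com/@tensorflow/tfjs-core@3.7.0/dist/tf-core.js"></script>
    <script src="https://unpkg.com/@tensorflow/tfjs-converter@3.7.0/dist/tf-converter.js"></script>
    <script src="https://unpkg.com/@tensorflow/tfjs-backend-webgl@3.7.0/dist/tf-backend-webgl.js"></script>
    <script src="https://unpkg.com/@tensorflow-models/handpose@0.0.7/dist/handpose.js"></script>
  </head>
  <body>
    <h1>Pose Detection</h1>
    <div id="video-container">
      <canvas id="canvas"></canvas>
      <video id="webcam"></video>
    </div>
    <!-- Our own logic, loaded last so the DOM elements and libraries are available -->
    <script src="./index.js"></script>
  </body>
</html>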

Retrieve and Display the Webcam Stream

Let's implement the webcam functionality in JavaScript, going over the code step by step.

First, we'll create a function to handle webcam access:

javascript
async function loadWebcam(width, height, fps) {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      'Browser API navigator.mediaDevices.getUserMedia is not available'
    );
  }

  let video = document.getElementById('webcam');
  video.muted = true;
  video.width = width;
  video.height = height;
}

This function first checks if the browser supports webcam access and sets up basic video properties.

Next, inside the loadWebcam function, we add the media configuration and initialize the video stream:

javascript
const mediaConfig = {
  audio: false,
  video: {
    facingMode: 'user',
    width: width,
    height: height,
    frameRate: { max: fps },
  },
};

const stream = await navigator.mediaDevices.getUserMedia(mediaConfig);
video.srcObject = stream;

// Resolve with the video element once its metadata (dimensions) is loaded,
// so callers like loadVideo below receive a ready-to-play element.
return new Promise((resolve) => {
  video.onloadedmetadata = () => resolve(video);
});

The configuration specifies that we want video only (no audio), using the front-facing camera with the given dimensions and frame rate. Returning a promise that resolves on loadedmetadata ensures the caller only gets the video element once its dimensions are known.

We then create a helper function to load the video with predefined settings that we define as a global config object:

javascript
const config = {
  video: { width: 640, height: 480, fps: 30 },
};

async function loadVideo() {
  const video = await loadWebcam(
    config.video.width,
    config.video.height,
    config.video.fps
  );
  video.play();
  return video;
}

Finally, we set up the main function that ties everything together and prepares the canvas for drawing:

javascript
// Shared with the detection loop we add later
let drawingContext;
let canvas;
let videoWidth, videoHeight;

async function main() {
  let video = await loadVideo();

  videoWidth = video.videoWidth;
  videoHeight = video.videoHeight;

  canvas = document.getElementById('canvas');
  canvas.width = videoWidth;
  canvas.height = videoHeight;

  drawingContext = canvas.getContext('2d');
  drawingContext.clearRect(0, 0, videoWidth, videoHeight);

  // Set up drawing style
  drawingContext.fillStyle = 'white';

  // Mirror the video horizontally
  drawingContext.translate(canvas.width, 0);
  drawingContext.scale(-1, 1);
}

The main function initializes the video feed, sets up the canvas dimensions, and applies a horizontal mirror effect to make the webcam feed more intuitive. It doesn't draw anything onto the canvas yet, so let's add that next.

Determine and Draw Hand Landmarks

Let's break down how to implement hand landmark detection using TensorFlow.js with the power of the fingerpose package.

First, we'll create two helper functions for drawing on our canvas:

javascript
function drawPoint(y, x, r) {
  drawingContext.beginPath();
  drawingContext.arc(x, y, r, 0, 2 * Math.PI);
  drawingContext.fill();
}

function drawPath(points, closePath, color) {
  drawingContext.strokeStyle = color;
  const region = new Path2D();
  region.moveTo(points[0][0], points[0][1]);
  for (let i = 1; i < points.length; i++) {
    const point = points[i];
    region.lineTo(point[0], point[1]);
  }
  if (closePath) {
    region.closePath();
  }
  drawingContext.stroke(region);
}

These functions handle two tasks (a quick usage example follows the list):

  • drawPoint: Creates circular points for hand landmarks
  • drawPath: Draws lines connecting the landmarks to form hand outlines
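
As a quick illustration of the argument shapes (this snippet is not part of the final app and assumes drawingContext has already been set up by main), each point is an [x, y] pair in canvas coordinates:

javascript
// Draw a single dot of radius 3 at x = 80, y = 40.
// Note the (y, x, r) parameter order of drawPoint.
drawPoint(40, 80, 3);

// Draw an open red polyline through three points; passing true instead of
// false would close the shape back to the first point.
drawPath(
  [
    [10, 10],
    [60, 10],
    [60, 60],
  ],
  false,
  'red'
);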

Next, we implement the function that draws all hand keypoints. These keypoints roughly correspond to the finger joints that the model uses as landmarks. We first define two objects: fingerLookupIndices (used to render a polyline for each finger) and landmarkColors (to draw each finger in a different color):

javascript
const fingerLookupIndices = {
  thumb: [0, 1, 2, 3, 4],
  indexFinger: [0, 5, 6, 7, 8],
  middleFinger: [0, 9, 10, 11, 12],
  ringFinger: [0, 13, 14, 15, 16],
  pinky: [0, 17, 18, 19, 20],
};

const landmarkColors = {
  thumb: 'red',
  indexFinger: 'blue',
  middleFinger: 'yellow',
  ringFinger: 'green',
  pinky: 'pink',
  palmBase: 'white',
};

function drawKeypoints(keypoints) {
  for (let i = 0; i < keypoints.length; i++) {
    const y = keypoints[i][0];
    const x = keypoints[i][1];
    drawPoint(x - 2, y - 2, 3);
  }

  const fingers = Object.keys(fingerLookupIndices);
  for (let i = 0; i < fingers.length; i++) {
    const finger = fingers[i];
    const points = fingerLookupIndices[finger].map((idx) => keypoints[idx]);
    drawPath(points, false, landmarkColors[finger]);
  }
}

Finally, we set up the main detection loop that processes the video feed:

javascript
let model;

async function continuouslyDetectLandmarks(video) {
  async function runDetection() {
    drawingContext.drawImage(
      video,
      0,
      0,
      videoWidth,
      videoHeight,
      0,
      0,
      canvas.width,
      canvas.height
    );

    // Draw hand landmarks
    const predictions = await model.estimateHands(video);
    if (predictions.length > 0) {
      const result = predictions[0].landmarks;
      drawKeypoints(result, predictions[0].annotations);
    }

    requestAnimationFrame(runDetection);
  }

  model = await handpose.load();
  runDetection();
}

This function does the following things:

  • Loads the TensorFlow.js hand pose model and assigns it to the global model variable for later use.
  • Continuously captures frames from the video feed and draws them onto our canvas object.
  • Processes each frame to detect hand positions using the estimateHands function of our previously instantiated model (see the example prediction object after this list).
  • Draws the detected landmarks on the canvas using the drawKeypoints function.
  • Uses requestAnimationFrame to create a smooth animation loop that re-executes our function.
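
For reference, each entry in the predictions array returned by estimateHands roughly follows the shape documented for the handpose model; the values below are purely illustrative:

javascript
// Example of what a single prediction from model.estimateHands(video) can
// look like (numbers are made up; coordinates are pixels in the video frame):
const examplePrediction = {
  handInViewConfidence: 0.98,
  boundingBox: {
    topLeft: [230, 110],
    bottomRight: [410, 320],
  },
  // 21 [x, y, z] keypoints; only the first two are shown here
  landmarks: [
    [265, 300, 0.1],
    [290, 285, -2.3],
    // ...
  ],
  // The same keypoints grouped per finger (thumb, indexFinger, middleFinger, ...)
  annotations: {
    thumb: [
      [290, 285, -2.3],
      // ...
    ],
    // ...
  },
};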

Finally, we need to incorporate continuouslyDetectLandmarks into our main function and call it at the end:

javascript
async function main() {
  // rest of the code

  continuouslyDetectLandmarks(video);
}

// Kick everything off once the script loads
main();

With that, we have a rendering loop that continuously draws the detected hand (if there is any) onto the screen.

Detect Gestures and Visualize Them

Let's walk through how to implement gesture detection using TensorFlow.js and the fingerpose package, step by step.

First, we'll create a function that continuously detects landmarks and gestures from our video stream. Here's the basic structure:

javascript
let gestureEstimator;

async function continuouslyDetectLandmarks(video) {
  async function runDetection() {
    // Detection logic will go here
  }

  // Initialize gesture detection
  const knownGestures = [
    fp.Gestures.VictoryGesture,
    fp.Gestures.ThumbsUpGesture,
  ];
  gestureEstimator = new fp.GestureEstimator(knownGestures);
}

The function sets up our initial gesture recognition by defining which gestures we want to detect: in this case, the victory sign and thumbs-up gestures that ship with the fingerpose package.

Next, we'll add the core detection logic that processes each frame:

javascript
if (
  predictions.length > 0 &&
  Object.keys(predictions[0]).includes('landmarks')
) {
  const est = gestureEstimator.estimate(predictions[0].landmarks, 9);
  if (est.gestures.length > 0) {
    // Find gesture with highest match score
    let result = est.gestures.reduce((p, c) => {
      return p.score > c.score ? p : c;
    });
    if (result.score > 9.9) {
      document.getElementById('gesture-text').textContent =
        gestureStrings[result.name];
    }
  }
}

This code does several important things:

  • Checks whether any hand predictions are available at all.
  • Estimates the gesture using a minimum confidence of 9 (an arbitrary choice; the maximum score is 10, so this threshold is worth experimenting with).
  • Finds the gesture with the highest confidence score.
  • Updates the UI when a gesture is detected with high confidence (score > 9.9); the gestureStrings map and gesture-text element this relies on are sketched right after this list.
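
The snippet above references a gestureStrings lookup and a #gesture-text element that we haven't defined yet. A minimal version could look like the following; the emoji mapping and the element ID are our own choices, and the element can simply be added to index.html, for example as <p id="gesture-text"></p> inside the body:

javascript
// Maps the gesture names reported by fingerpose to what we show in the UI.
// 'victory' and 'thumbs_up' are the names used by the built-in gestures;
// 'thumbs_down' matches the custom gesture we define in the next section.
const gestureStrings = {
  victory: '✌️',
  thumbs_up: '👍',
  thumbs_down: '👎',
};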

Creating a Custom Thumbs Down Gesture Detection

Let's walk through how to create a custom gesture detection for a thumbs-down sign using the fingerpose package. We'll break this down into clear, manageable steps.

First, we create a new gesture description with a unique identifier:

javascript
function createThumbsDownGesture() {
  const thumbsDown = new fp.GestureDescription('thumbs_down');

Next, we define how the thumb should be positioned. For a thumbs-down gesture, the thumb needs to be straight (no curl) and pointing downward (again, we can attach a confidence value of 1.0 here):

javascript
thumbsDown.addCurl(fp.Finger.Thumb, fp.FingerCurl.NoCurl);
thumbsDown.addDirection(
  fp.Finger.Thumb,
  fp.FingerDirection.VerticalDown,
  1.0
);

We also add diagonal directions to make the gesture detection more flexible:

javascript
thumbsDown.addDirection(
  fp.Finger.Thumb,
  fp.FingerDirection.DiagonalDownLeft,
  0.9
);
thumbsDown.addDirection(
  fp.Finger.Thumb,
  fp.FingerDirection.DiagonalDownRight,
  0.9
);

For the remaining fingers, we want them to be curled into the palm. We use a loop to configure all four fingers at once:

javascript
for (let finger of [
  fp.Finger.Index,
  fp.Finger.Middle,
  fp.Finger.Ring,
  fp.Finger.Pinky,
]) {
  thumbsDown.addCurl(finger, fp.FingerCurl.FullCurl, 0.9);
  thumbsDown.addCurl(finger, fp.FingerCurl.HalfCurl, 0.9);
}

return thumbsDown;
}

Finally, we add our custom gesture to the list of known gestures that our application will detect:

javascript
const knownGestures = [
  fp.Gestures.VictoryGesture,
  fp.Gestures.ThumbsUpGesture,
  createThumbsDownGesture()
];

This implementation allows for some natural variation in how users might perform the gesture, making it more robust in real-world usage.

Integration Into a Video-Calling Application

The project we have built so far demonstrates the general capabilities of TensorFlow.js in a basic setup. However, we can leverage the same approach to add features to real applications. To give a quick demo of how this could look, let's build gesture control into a video calling application.

We use a web application built on the Stream React SDK (here's a tutorial on how to set it up). We can tap into the raw video stream of the call participant in the current browser using the useCameraState hook and feed it into the handpose model and fingerpose gesture estimator.

Here’s what the (slightly simplified) code for this looks like:

tsx
const videoRef = useRef<HTMLVideoElement>(null);
const { mediaStream } = useCameraState();

useEffect(() => {
  videoRef.current.srcObject = mediaStream;

  const knownGestures = [
    fp.Gestures.VictoryGesture,
    fp.Gestures.ThumbsUpGesture,
  ];
  const gestureEstimator = new fp.GestureEstimator(knownGestures);
  handpose.load().then((m) => {
    setGestureEstimator(gestureEstimator);
    setModel(m);
  });
}, [mediaStream]);

const runDetectionCallback = useCallback(async () => {
  const predictions = await model.estimateHands(videoRef.current);
  if (predictions && predictions.length > 0) {
    const gesture = gestureEstimator?.estimate(predictions[0].landmarks, 9);
    if (gesture?.gestures.length > 0) {
      setGestureName(gesture.gestures[0].name as GestureName);
    }
  }
  setTimeout(() => {
    runDetectionCallback();
  }, 1000);
}, [gestureEstimator, model, videoRef]);

With this short snippet, we can get gesture detection running on a live video-calling application. We can also easily react to the gestures by, e.g., unmuting ourselves with a thumbs-up gesture or playing a fun effect on the victory sign.
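
As a rough sketch of the reaction side, we can watch the gestureName state that runDetectionCallback sets. The toggleMicrophone and playConfetti helpers below are placeholders for whatever your app exposes, not part of Stream's SDK:

tsx
// Hypothetical reaction to detected gestures; gestureName comes from the
// state set in runDetectionCallback, and toggleMicrophone/playConfetti
// stand in for app-specific logic.
useEffect(() => {
  if (gestureName === 'thumbs_up') {
    toggleMicrophone();
  } else if (gestureName === 'victory') {
    playConfetti();
  }
}, [gestureName]);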

Here’s a demo of that running in a live video call:

Conclusion

This comprehensive guide explored TensorFlow.js's powerful capabilities for implementing hand gesture recognition in web applications. We've covered everything from setting up a basic project to implementing advanced features like custom gesture detection, including detailed code examples for hand landmark detection, gesture recognition, and creating custom gestures like the thumbs-down sign.

This technology opens up exciting possibilities for creating more intuitive and interactive web experiences, particularly in areas like accessibility, gaming, and virtual communication.

Ready to take your web applications to the next level with gesture recognition? Check out our other resources and products that can help you build engaging, interactive experiences.

Start building amazing gesture-controlled experiences today!
