Hand gesture recognition has become increasingly important in computer vision and human-computer interaction. With the rise of video conferencing and virtual interactions, there's a growing need for intuitive ways to control our digital environments. In this tutorial, we'll explore building a hand gesture detection system using TensorFlow.js that can recognize various hand poses in real-time through our webcam.
We'll start with a basic web project and gradually build up to a sophisticated application that detects hand poses and triggers actions based on specific gestures. By the end of this tutorial, we’ll have a working prototype that can recognize gestures like thumbs up, victory signs, and even custom poses. More importantly, we’ll learn how to integrate this technology into real-world applications, such as video conferencing platforms (Stream’s Video SDK in this example), where gestures could control features like mute/unmute or trigger interactive effects.
This project is particularly relevant for web developers interested in machine learning applications or anyone looking to enhance their web applications with gesture-based controls. While we'll use TensorFlow.js and the fingerpose package for our implementation, the concepts you'll learn can be applied to various other gesture recognition scenarios.
Setting Up a Basic Web Project
To keep things clean and straightforward, we'll work with a basic web project without any dependencies on frameworks. If you want to learn how to integrate with other frameworks, let me know in the comments.
Of course, the basic building blocks of a web project are a basic `index.html` file, an `index.js` logic file, and a `package.json` to tie it all together. Let's create all of them in a root directory.
We can create a new folder for the project and inside, run the following command:
```bash
npm init -y
```
The `-y` parameter automatically answers all questions during the setup process and generates a `package.json` file.
Then we create an empty `index.js` file and an `index.html` file that we fill with the following skeleton content:
```html
<!DOCTYPE html>
<html lang="en">
  <head>
  </head>
  <body>
    <h1>Pose Detection</h1>
    <script src="./index.js"></script>
  </body>
</html>
```
To finish setting up the project, we want to serve our app in the browser, so we use the http-server package, which makes this easy. We first install it:
```bash
npm install http-server
```
Then, we add a script to the `package.json` to start the server on port `1234`:
```json
"scripts": {
  "watch": "npm run build && node_modules/http-server/bin/http-server dist -p 1234",
  "build": "mkdir -p dist/ && cp index.html dist/ && cp index.js dist/"
},
```
When we now execute `npm run watch`, we can see our project by visiting `localhost:1234` in the browser of our choice.
Setting up the HTML
Before implementing the logic, we create the basic HTML structure that holds our components and add some dependencies that we’ll use later.
First, let's start with the basic HTML structure:
```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
  </head>
  <body>
    <h1>Pose Detection</h1>
  </body>
</html>
```
Next, we add some CSS inside the `head` element to properly position our video elements:
```html
<style>
  #video-container {
    position: relative;
  }
  #webcam {
    position: absolute;
    visibility: hidden;
  }
</style>
```
The webcam is hidden because we'll draw the video feed onto a canvas instead of directly showing it.
We need to include several TensorFlow.js dependencies and the fingerpose library for hand gesture detection:
```html
<script src="https://cdn.jsdelivr.net/npm/fingerpose@0.1.0/dist/fingerpose.min.js"></script>
<script src="https://unpkg.com/@tensorflow/tfjs-core@3.7.0/dist/tf-core.js"></script>
<script src="https://unpkg.com/@tensorflow/tfjs-converter@3.7.0/dist/tf-converter.js"></script>
<script src="https://unpkg.com/@tensorflow/tfjs-backend-webgl@3.7.0/dist/tf-backend-webgl.js"></script>
<script src="https://unpkg.com/@tensorflow-models/handpose@0.0.7/dist/handpose.js"></script>
```
Finally, we add the video container with both a canvas (for drawing) and a video element (for the webcam feed):
```html
<div id="video-container">
  <canvas id="canvas"></canvas>
  <video id="webcam"></video>
</div>
```
This setup allows us to capture webcam input and process it with TensorFlow.js for hand pose detection. Gesture recognition will later build on it to enable features like muting and unmuting with a hand gesture or triggering special effects like confetti with specific poses.
Retrieve and Display the Webcam Stream
Let's implement the webcam functionality in JavaScript and go over the code step by step.
First, we'll create a function to handle webcam access:
```js
async function loadWebcam(width, height, fps) {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      'Browser API navigator.mediaDevices.getUserMedia is not available'
    );
  }

  let video = document.getElementById('webcam');
  video.muted = true;
  video.width = width;
  video.height = height;
}
```
This function first checks if the browser supports webcam access and sets up basic video properties.
Next, inside the `loadWebcam` function, we add the media configuration and initialize the video stream:
```js
const mediaConfig = {
  audio: false,
  video: {
    facingMode: 'user',
    width: width,
    height: height,
    frameRate: { max: fps },
  },
};

const stream = await navigator.mediaDevices.getUserMedia(mediaConfig);
video.srcObject = stream;

// Resolve with the video element once its metadata has loaded, so that
// callers can await a webcam that is ready to play.
return new Promise((resolve) => {
  video.onloadedmetadata = () => resolve(video);
});
```
The configuration specifies we want video only (no audio), using the front-facing camera with the specified dimensions and frame rate. The function then resolves with the video element once the stream's metadata has loaded, so callers can simply await a ready webcam.
We then create a helper function to load the video with predefined settings that we define in a global `config` object:
```js
const config = {
  video: { width: 640, height: 480, fps: 30 },
};

async function loadVideo() {
  const video = await loadWebcam(
    config.video.width,
    config.video.height,
    config.video.fps
  );
  video.play();
  return video;
}
```
Finally, we set up the `main` function that ties everything together and prepares the canvas for drawing:
```js
let drawingContext;
// Shared with the detection loop we add later on
let videoWidth, videoHeight, canvas;

async function main() {
  let video = await loadVideo();

  videoWidth = video.videoWidth;
  videoHeight = video.videoHeight;

  canvas = document.getElementById('canvas');
  canvas.width = videoWidth;
  canvas.height = videoHeight;

  drawingContext = canvas.getContext('2d');
  drawingContext.clearRect(0, 0, videoWidth, videoHeight);

  // Set up drawing style
  drawingContext.fillStyle = 'white';

  // Mirror the video horizontally
  drawingContext.translate(canvas.width, 0);
  drawingContext.scale(-1, 1);
}
```
The `main` function initializes the video feed, sets up the canvas dimensions, and applies a horizontal mirror effect to make the webcam feed feel more intuitive. It doesn't draw anything onto the canvas yet, so let's add that next.
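One detail the snippets above don't show is actually calling `main`. Where exactly you trigger it is up to you; one simple option is to run it once the page has loaded:

```js
// Kick off video loading and canvas setup once the page (including the
// script tags for TensorFlow.js and fingerpose) has finished loading.
// You could equally call main() at the bottom of index.js or behind a button.
window.addEventListener('load', () => {
  main();
});
```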
Determine and Draw Hand Landmarks
Let's break down how to implement hand landmark detection using TensorFlow.js; later, we'll combine it with the fingerpose package for gesture recognition.
First, we'll create two helper functions for drawing on our canvas:
```js
function drawPoint(y, x, r) {
  drawingContext.beginPath();
  drawingContext.arc(x, y, r, 0, 2 * Math.PI);
  drawingContext.fill();
}

function drawPath(points, closePath, color) {
  drawingContext.strokeStyle = color;
  const region = new Path2D();
  region.moveTo(points[0][0], points[0][1]);
  for (let i = 1; i < points.length; i++) {
    const point = points[i];
    region.lineTo(point[0], point[1]);
  }
  if (closePath) {
    region.closePath();
  }
  drawingContext.stroke(region);
}
```
These functions handle the following two functionalities:
- `drawPoint`: Creates circular points for hand landmarks
- `drawPath`: Draws lines connecting the landmarks to form hand outlines
Next, we implement the function that draws all hand key points. These key points roughly correspond to the finger joints that serve as landmarks for finger detection. We first define two objects: `fingerLookupIndices` (used for polyline rendering of each finger) and `landmarkColors` (to draw each finger in a different color):
```js
const fingerLookupIndices = {
  thumb: [0, 1, 2, 3, 4],
  indexFinger: [0, 5, 6, 7, 8],
  middleFinger: [0, 9, 10, 11, 12],
  ringFinger: [0, 13, 14, 15, 16],
  pinky: [0, 17, 18, 19, 20],
};

const landmarkColors = {
  thumb: 'red',
  indexFinger: 'blue',
  middleFinger: 'yellow',
  ringFinger: 'green',
  pinky: 'pink',
  palmBase: 'white',
};

function drawKeypoints(keypoints) {
  for (let i = 0; i < keypoints.length; i++) {
    const y = keypoints[i][0];
    const x = keypoints[i][1];
    drawPoint(x - 2, y - 2, 3);
  }

  const fingers = Object.keys(fingerLookupIndices);
  for (let i = 0; i < fingers.length; i++) {
    const finger = fingers[i];
    const points = fingerLookupIndices[finger].map((idx) => keypoints[idx]);
    drawPath(points, false, landmarkColors[finger]);
  }
}
```
Finally, we set up the main detection loop that processes the video feed:
```js
let model;

async function continuouslyDetectLandmarks(video) {
  async function runDetection() {
    drawingContext.drawImage(
      video,
      0,
      0,
      videoWidth,
      videoHeight,
      0,
      0,
      canvas.width,
      canvas.height
    );

    // Draw hand landmarks
    const predictions = await model.estimateHands(video);
    if (predictions.length > 0) {
      const result = predictions[0].landmarks;
      drawKeypoints(result, predictions[0].annotations);
    }

    requestAnimationFrame(runDetection);
  }

  model = await handpose.load();
  runDetection();
}
```
This function does the following things:
- It loads the TensorFlow.js hand pose model and assigns it to the global `model` variable for later use.
- It continuously captures frames from the video feed and draws them onto our `canvas` object.
- It processes each frame to detect hand positions using the `estimateHands` function of our previously instantiated model.
- It draws the detected landmarks on the canvas using the `drawKeypoints` function.
- It uses `requestAnimationFrame` to create a smooth animation loop that re-executes our function.
Finally, we need to incorporate `continuouslyDetectLandmarks` into our `main` function and call it at the end:
```js
async function main() {
  // rest of the code

  continuouslyDetectLandmarks(video);
}
```
With that, we have a rendering loop that continuously draws the detected hand (if there is any) onto the screen.
Detect Gestures and Visualize Them
Let's implement gesture detection using TensorFlow.js and the fingerpose package, step by step.
First, we'll extend the function that continuously detects landmarks so that it also recognizes gestures from our video stream. Here's the basic structure:
```js
let gestureEstimator;

async function continuouslyDetectLandmarks(video) {
  async function runDetection() {
    // Detection logic will go here
  }

  // Initialize gesture detection
  const knownGestures = [
    fp.Gestures.VictoryGesture,
    fp.Gestures.ThumbsUpGesture,
  ];
  gestureEstimator = new fp.GestureEstimator(knownGestures);
}
```
The function sets up our initial gesture recognition by defining which gestures we want to detect: in this case, the victory sign and thumbs-up gestures that ship with the `fingerpose` package.
Next, we'll add the core gesture detection logic inside `runDetection`, right after the hand predictions come back from `estimateHands`:
```js
if (
  predictions.length > 0 &&
  Object.keys(predictions[0]).includes('landmarks')
) {
  const est = gestureEstimator.estimate(predictions[0].landmarks, 9);
  if (est.gestures.length > 0) {
    // Find gesture with highest match score
    let result = est.gestures.reduce((p, c) => {
      return p.score > c.score ? p : c;
    });
    if (result.score > 9.9) {
      // gestureStrings maps gesture names to display text and #gesture-text
      // is an element on the page (both are sketched a bit further below)
      document.getElementById('gesture-text').textContent =
        gestureStrings[result.name];
    }
  }
}
```
This code does several important things:
- It first checks whether any hand `predictions` are available at all.
- It estimates the gesture using a confidence threshold of `9` (an arbitrary value; the maximum is `10`, so this is a value to play around with).
- It finds the gesture with the highest confidence score.
- It updates the UI when a gesture is detected with high confidence (score > `9.9`), using the `gestureStrings` lookup sketched right after this list.
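Note that this snippet relies on a `gestureStrings` lookup and a `gesture-text` element that we haven't created yet. Here is one minimal way to define them (the emoji mappings and the element placement are just an example):

```js
// Maps the gesture names reported by fingerpose to whatever we want to show
// on screen. The names match the built-in gestures; the emoji are arbitrary.
const gestureStrings = {
  thumbs_up: '👍',
  victory: '✌️',
};

// Create the text element and add it below the canvas inside our
// video container. Alternatively, add <p id="gesture-text"></p>
// directly to index.html.
const gestureText = document.createElement('p');
gestureText.id = 'gesture-text';
document.getElementById('video-container').appendChild(gestureText);
```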
Creating a Custom Thumbs Down Gesture Detection
Let's walk through how to create a custom gesture detection for a thumbs-down sign using the fingerpose package. We'll break this down into clear, manageable steps.
First, we create a new gesture description with a unique identifier:
```js
function createThumbsDownGesture() {
  const thumbsDown = new fp.GestureDescription('thumbs_down');
```
Next, we define how the thumb should be positioned. For a thumbs-down gesture, the thumb needs to be straight (no curl) and pointing downward (again, we can attach a confidence value of `1.0` here):
```js
  thumbsDown.addCurl(fp.Finger.Thumb, fp.FingerCurl.NoCurl, 1.0);
  thumbsDown.addDirection(
    fp.Finger.Thumb,
    fp.FingerDirection.VerticalDown,
    1.0
  );
```
We also add diagonal directions to make the gesture detection more flexible:
```js
  thumbsDown.addDirection(
    fp.Finger.Thumb,
    fp.FingerDirection.DiagonalDownLeft,
    0.9
  );
  thumbsDown.addDirection(
    fp.Finger.Thumb,
    fp.FingerDirection.DiagonalDownRight,
    0.9
  );
```
For the remaining fingers, we want them to be curled into the palm. We use a loop to configure all four fingers at once:
```js
  for (let finger of [
    fp.Finger.Index,
    fp.Finger.Middle,
    fp.Finger.Ring,
    fp.Finger.Pinky,
  ]) {
    thumbsDown.addCurl(finger, fp.FingerCurl.FullCurl, 0.9);
    thumbsDown.addCurl(finger, fp.FingerCurl.HalfCurl, 0.9);
  }

  return thumbsDown;
}
```
Finally, we add our custom gesture to the list of known gestures that our application will detect:
```js
const knownGestures = [
  fp.Gestures.VictoryGesture,
  fp.Gestures.ThumbsUpGesture,
  createThumbsDownGesture(),
];
```
This implementation allows for some natural variation in how users might perform the gesture, making it more robust in real-world usage.
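If you display the detected gesture with the `gestureStrings` lookup sketched earlier, remember to add an entry for the new `thumbs_down` identifier as well. The extended map could look like this (the emoji is again just an example):

```js
const gestureStrings = {
  thumbs_up: '👍',
  victory: '✌️',
  thumbs_down: '👎', // matches the identifier passed to GestureDescription
};
```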
Integration Into a Video-Calling Application
The project we have built so far demonstrates the general capabilities of TensorFlow.js in a basic setting. However, we can also leverage it to build features into real applications. To give a quick demo of how this could look, let's build gesture control into a video-calling application.
We use a web application built on the Stream React SDK (here's a tutorial on how to set it up). We can hijack the raw video stream that we're getting from the call participant in the current browser using the `useCameraState` hook and feed it into the handpose model and fingerpose estimator.
Here’s what the (slightly simplified) code for this looks like:
```tsx
const videoRef = useRef<HTMLVideoElement>(null);
const { mediaStream } = useCameraState();

useEffect(() => {
  videoRef.current.srcObject = mediaStream;

  const knownGestures = [
    fp.Gestures.VictoryGesture,
    fp.Gestures.ThumbsUpGesture,
  ];
  const gestureEstimator = new fp.GestureEstimator(knownGestures);

  handpose.load().then((m) => {
    setGestureEstimator(gestureEstimator);
    setModel(m);
  });
}, [mediaStream]);

const runDetectionCallback = useCallback(async () => {
  const predictions = await model.estimateHands(videoRef.current);
  if (predictions && predictions.length > 0) {
    const gesture = gestureEstimator?.estimate(predictions[0].landmarks, 9);
    if (gesture?.gestures.length > 0) {
      setGestureName(gesture.gestures[0].name as GestureName);
    }
  }

  setTimeout(() => {
    runDetectionCallback();
  }, 1000);
}, [gestureEstimator, model, videoRef]);
```
With this short snippet, we can get gesture detection running on a live video-calling application. We can also easily react to the gestures by, e.g., unmuting ourselves with a thumbs-up gesture or playing a fun effect on the victory sign.
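How we react to a detected gesture depends on the app. One possible sketch is a small hook that watches the `gestureName` state from the snippet above; the `onThumbsUp` and `onVictory` callbacks here are hypothetical placeholders for whatever SDK calls or effects we want to trigger, not part of the Stream SDK:

```js
import { useEffect } from 'react';

// Hypothetical hook: reacts whenever the detected gesture changes.
// onThumbsUp / onVictory are placeholders for app-specific handlers,
// e.g. unmuting the microphone or firing a confetti animation.
export function useGestureReactions(gestureName, { onThumbsUp, onVictory }) {
  useEffect(() => {
    if (gestureName === 'thumbs_up') {
      onThumbsUp?.();
    } else if (gestureName === 'victory') {
      onVictory?.();
    }
  }, [gestureName, onThumbsUp, onVictory]);
}
```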
Here’s a demo of that running in a live video call:
Conclusion
This comprehensive guide explored TensorFlow.js's powerful capabilities for implementing hand gesture recognition in web applications. We've covered everything from setting up a basic project to implementing advanced features like custom gesture detection, including detailed code examples for hand landmark detection, gesture recognition, and creating custom gestures like the thumbs-down sign.
This technology opens up exciting possibilities for creating more intuitive and interactive web experiences, particularly in areas like accessibility, gaming, and virtual communication.
Ready to take your web applications to the next level with gesture recognition? Check out our other resources and products that can help you build engaging, interactive experiences:
- Explore our comprehensive documentation on machine learning implementations
- Join our developer community to share ideas and get support
- Try our enterprise solutions for scalable AI-powered applications
Start building amazing gesture-controlled experiences today!