Object detectors such as YOLO and EfficientDet treat each video frame independently. This works fine for static images, but in real-time video streams, it causes detections to flicker. Bounding boxes jitter, confidence scores oscillate near thresholds, and objects "blink" in and out of existence.
In a display overlay, this is merely annoying. In a closed-loop control system where detections trigger actuators, it can be catastrophic. A flickering detection might cause a robotic gripper to spasm, a security gate to oscillate, or a safety system to flood operators with false alarms.
The solution is building a temporal consistency layer between your detector and your actuation logic.
What Causes Detections to Flicker in the First Place?
Three distinct types of instability compound into the flickering you observe:
- Position jitter occurs when the bounding box coordinates fluctuate rapidly, even for stationary objects. This stems from regression uncertainty in the detection head, sensor noise, and the discrete nature of anchor box scales.
- Confidence fluctuation happens when the class probability oscillates around your threshold. If your threshold is 0.5 and the detector outputs [0.51, 0.49, 0.52, 0.48] across four frames, you get a binary presence signal that toggles every frame.
- Existence flicker is the complete loss and re-acquisition of detections due to occlusion, motion blur, or extreme poses. This is especially problematic on edge devices running quantized INT8 models, where the reduced precision amplifies decision boundary noise.
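To make the confidence-fluctuation case concrete, here is a minimal sketch that applies a single fixed threshold to the four example scores from the list above. The helper name `presence_signal` is made up for illustration and is not part of any detector API.

```python
def presence_signal(scores, threshold=0.5):
    """Return the per-frame present/absent decision for one fixed threshold."""
    return [score >= threshold for score in scores]


scores = [0.51, 0.49, 0.52, 0.48]
signal = presence_signal(scores)
print(signal)  # [True, False, True, False]

# Count how many times the binary signal changes state between frames.
toggles = sum(a != b for a, b in zip(signal, signal[1:]))
print(toggles)  # 3
```

A total confidence swing of only 0.04 produces a state change on every frame, which is exactly the signal you do not want driving an actuator.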
How Do You Stabilize Bounding Box Coordinates?
For smoothing spatial coordinates, the One Euro Filter is usually a better fit on edge devices than fixed-parameter filters such as a moving average or a plain exponential low-pass. Fixed filters force you to choose between low jitter (high smoothing, high latency) and fast response (low smoothing, noisy output). The One Euro Filter sidesteps this trade-off by adapting its cutoff frequency to the speed of the signal.
When the tracked object is stationary, the filter applies heavy smoothing to eliminate micro-jitter. When the object moves quickly, the filter opens up to track motion with minimal lag. This adaptive behavior is ideal for robotics and human-computer interaction, where both stability and responsiveness matter.
The filter is also computationally trivial, requiring only scalar arithmetic with no matrix operations. Because its smoothing factor is computed from the actual elapsed time between samples, it handles variable frame rates gracefully, which is critical on thermally throttled edge hardware where inference times fluctuate.
```python
import math
import time
from typing import Optional


class LowPassFilter:
    """First-order exponential low-pass filter."""

    def __init__(self):
        self.previous = None

    def __call__(self, x: float, alpha: float) -> float:
        if self.previous is None:
            self.previous = x
        else:
            self.previous = alpha * x + (1 - alpha) * self.previous
        return self.previous


class OneEuroFilter:
    def __init__(self, min_cutoff: float = 1.0, beta: float = 0.007, d_cutoff: float = 1.0):
        """
        min_cutoff: minimum cutoff frequency (Hz). Lower = more smoothing when stationary.
        beta: speed coefficient. Higher = less lag during fast motion.
        d_cutoff: cutoff frequency for derivative estimation.
        """
        self.min_cutoff = min_cutoff
        self.beta = beta
        self.d_cutoff = d_cutoff
        self.x_filter = LowPassFilter()
        self.dx_filter = LowPassFilter()
        self.last_time = None

    def _smoothing_factor(self, t_e: float, cutoff: float) -> float:
        tau = 1.0 / (2 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / t_e)

    def __call__(self, x: float, t: Optional[float] = None) -> float:
        if t is None:
            t = time.time()

        if self.last_time is None:
            self.last_time = t
            return self.x_filter(x, 1.0)  # no smoothing on first sample

        t_e = t - self.last_time
        if t_e <= 0:
            t_e = 1e-6  # guard against duplicate or out-of-order timestamps
        self.last_time = t

        # estimate derivative from the previous filtered value; it is always
        # set after the first sample, so no fallback is needed (a truthiness
        # fallback such as `previous or x` would misfire when previous == 0.0)
        dx = (x - self.x_filter.previous) / t_e
        alpha_d = self._smoothing_factor(t_e, self.d_cutoff)
        dx_smooth = self.dx_filter(dx, alpha_d)

        # adapt cutoff based on speed
        cutoff = self.min_cutoff + self.beta * abs(dx_smooth)
        alpha = self._smoothing_factor(t_e, cutoff)

        return self.x_filter(x, alpha)


class BoundingBoxSmoother:
    def __init__(self, min_cutoff: float = 1.0, beta: float = 0.007):
        """Applies One Euro filtering independently to each bounding box coordinate."""
        self.filters = {
            'x1': OneEuroFilter(min_cutoff, beta),
            'y1': OneEuroFilter(min_cutoff, beta),
            'x2': OneEuroFilter(min_cutoff, beta),
            'y2': OneEuroFilter(min_cutoff, beta),
        }

    def __call__(self, bbox: tuple, t: Optional[float] = None) -> tuple:
        """
        bbox: (x1, y1, x2, y2) raw detection coordinates
        returns: smoothed (x1, y1, x2, y2)
        """
        if t is None:
            t = time.time()  # one timestamp shared by all four filters
        x1, y1, x2, y2 = bbox
        return (
            self.filters['x1'](x1, t),
            self.filters['y1'](y1, t),
            self.filters['x2'](x2, t),
            self.filters['y2'](y2, t),
        )
```
To use this in a detection loop:
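```python
# One smoother per tracked object: sharing a single smoother across all
# detections would blend the coordinates of different objects.
smoothers = {}

for frame in video_stream:
    detections = detector(frame)
    for det in detections:
        # det.track_id is assumed to be assigned by an upstream tracker
        if det.track_id not in smoothers:
            smoothers[det.track_id] = BoundingBoxSmoother(min_cutoff=0.5, beta=0.01)
        det.bbox = smoothers[det.track_id](det.bbox)
```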
Tune min_cutoff lower for more aggressive smoothing when stationary. Increase beta if the smoothed box lags during fast motion.
For simpler scenarios with predictable linear motion (e.g., conveyor belts, vehicle tracking), Double Exponential Smoothing yields good results. It explicitly models velocity, compensating for the lag inherent in basic exponential smoothing.
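As a rough illustration of the idea, the sketch below implements Holt's linear (double exponential) smoothing for a single coordinate. The class name and the default `alpha` and `beta` values are illustrative assumptions, and the trend is expressed per frame, so this version assumes a roughly constant frame interval.

```python
class DoubleExponentialSmoother:
    """Holt's linear (double exponential) smoothing for one scalar coordinate."""

    def __init__(self, alpha: float = 0.5, beta: float = 0.3):
        # alpha: weight of the new measurement in the level estimate (0..1)
        # beta: weight of the newest slope in the trend estimate (0..1)
        self.alpha = alpha
        self.beta = beta
        self.level = None  # smoothed position
        self.trend = 0.0   # smoothed per-frame velocity

    def __call__(self, x: float) -> float:
        if self.level is None:
            self.level = x
            return x
        prev_level = self.level
        # Predict forward using the trend, then correct with the measurement.
        self.level = self.alpha * x + (1 - self.alpha) * (prev_level + self.trend)
        self.trend = self.beta * (self.level - prev_level) + (1 - self.beta) * self.trend
        return self.level


# Noise-free ramp: an object moving 10 px per frame.
smoother = DoubleExponentialSmoother()
simple = None  # plain exponential smoothing (alpha = 0.5) for comparison
for k in range(60):
    x = 10.0 * k
    smoothed = smoother(x)
    simple = x if simple is None else 0.5 * x + 0.5 * simple

print(abs(x - smoothed))  # lag of the double exponential smoother
print(abs(x - simple))    # lag of plain exponential smoothing
```

On this ramp the double exponential smoother's lag decays toward zero once the trend estimate converges, while plain exponential smoothing settles at a constant lag behind the object.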
How Do You Prevent the Detection State From Toggling?
Smoothing coordinates does not solve the binary present/absent flickering. For that, you need logical debouncing.
- Hysteresis thresholding introduces two thresholds instead of one. Set an upper threshold (e.g., 0.6) for activation and a lower threshold (e.g., 0.4) for deactivation. An object must exceed 0.6 confidence to trigger the system, but once triggered, the system holds even if confidence dips to 0.5 and releases only when confidence falls below 0.4. This prevents rapid toggling when confidence hovers near a single threshold.
- N-out-of-M confirmation requires an object to be detected in at least N of the last M frames before it is considered valid. A setting of 3-out-of-5 filters transient false positives (ghosts appearing for a single frame) while remaining responsive enough for real-time applications.
- Time-to-Live (TTL) prevents premature track termination. When a detection disappears, maintain its existence for a grace period (e.g., 500 ms). If it reappears within that window, preserve the identity and suppress the disappearance event entirely. This is essential for people counting, where someone walking behind a pillar should not be registered as two separate individuals.
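The three debouncing techniques above can each be sketched in a few lines. The class names (`HysteresisGate`, `NOutOfM`, `TimeToLive`) and their `update` methods are hypothetical names chosen for this illustration, and the defaults simply mirror the example values in the text. Treat this as a starting point under those assumptions rather than a production implementation.

```python
from collections import deque


class HysteresisGate:
    """Activate above `high`, deactivate only below `low`."""

    def __init__(self, high: float = 0.6, low: float = 0.4):
        self.high = high
        self.low = low
        self.active = False

    def update(self, confidence: float) -> bool:
        if not self.active and confidence > self.high:
            self.active = True
        elif self.active and confidence < self.low:
            self.active = False
        return self.active


class NOutOfM:
    """Confirm only when at least n of the last m frames had a detection."""

    def __init__(self, n: int = 3, m: int = 5):
        self.n = n
        self.history = deque(maxlen=m)  # oldest frame drops out automatically

    def update(self, detected: bool) -> bool:
        self.history.append(detected)
        return sum(self.history) >= self.n


class TimeToLive:
    """Keep reporting presence for `ttl` seconds after the last detection."""

    def __init__(self, ttl: float = 0.5):
        self.ttl = ttl
        self.last_seen = None

    def update(self, detected: bool, now: float) -> bool:
        if detected:
            self.last_seen = now
        return self.last_seen is not None and (now - self.last_seen) <= self.ttl


gate = HysteresisGate()
gate_out = [gate.update(c) for c in [0.45, 0.62, 0.5, 0.45, 0.38, 0.55]]
print(gate_out)  # [False, True, True, True, False, False]

confirm = NOutOfM(n=3, m=5)
confirm_out = [confirm.update(d) for d in [True, False, False, True, True, True, False]]
print(confirm_out)  # [False, False, False, False, True, True, True]

ttl = TimeToLive(ttl=0.5)
samples = [(True, 0.0), (False, 0.2), (False, 0.4), (True, 0.6), (False, 1.0), (False, 1.2)]
ttl_out = [ttl.update(d, t) for d, t in samples]
print(ttl_out)  # [True, True, True, True, True, False]
```

Note the different failure modes each one covers: the gate ignores dips between the two thresholds, the N-out-of-M window rejects the isolated first detection, and the TTL bridges the gap between 0.0 s and 0.6 s without reporting a disappearance.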
Which Tracker Should You Use on Resource-Constrained Hardware?
ByteTrack is a strong default for edge deployments. Its key innovation is how it handles low-confidence detections.
Standard trackers discard detections below a confidence threshold (say, 0.5) before association. If an object becomes partially occluded and its confidence drops to 0.4, the track dies and restarts later with a new ID. ByteTrack instead retains all detections and divides them into high- and low-confidence groups. It first matches high-confidence detections to existing tracks, then matches remaining unmatched tracks to low-confidence detections.
This "rescue" mechanism dramatically reduces ID switches during occlusions. Because ByteTrack relies solely on motion and geometry (without a heavy Re-ID neural network), it runs significantly faster than DeepSORT while delivering superior tracking accuracy.
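The two-stage association described above can be sketched as follows. This is a deliberately simplified illustration and not ByteTrack's actual code: it uses greedy IoU matching where a real tracker would typically use an optimal assignment solver, it compares against the last known box instead of a motion-predicted one, and it omits track creation and deletion. All function names and the `high_thresh` and `min_iou` defaults are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Greedily pair tracks with detections by descending IoU.

    tracks: {track_id: box}; detections: list of (box, score).
    Returns ({track_id: detection_index}, [unmatched track ids]).
    """
    pairs = sorted(
        ((iou(box, det[0]), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used = {}, set()
    for overlap, tid, j in pairs:
        if overlap < min_iou:
            break  # remaining pairs overlap even less
        if tid not in matches and j not in used:
            matches[tid] = j
            used.add(j)
    unmatched = [tid for tid in tracks if tid not in matches]
    return matches, unmatched


def associate(tracks, detections, high_thresh=0.5, min_iou=0.3):
    """Two-stage association: high-confidence first, then low-confidence rescue."""
    high = [d for d in detections if d[1] >= high_thresh]
    low = [d for d in detections if d[1] < high_thresh]

    # Stage 1: match all tracks against high-confidence detections.
    first, leftover = greedy_match(tracks, high, min_iou)
    result = {tid: high[j] for tid, j in first.items()}

    # Stage 2: try to rescue still-unmatched tracks with low-confidence detections.
    remaining = {tid: tracks[tid] for tid in leftover}
    second, lost = greedy_match(remaining, low, min_iou)
    result.update({tid: low[j] for tid, j in second.items()})
    return result, lost


tracks = {1: (10, 10, 50, 50), 2: (100, 100, 140, 140)}
detections = [
    ((12, 11, 52, 51), 0.9),      # clear view of object 1
    ((101, 102, 141, 142), 0.4),  # object 2, partially occluded
]
matched, lost = associate(tracks, detections)
print(sorted(matched))  # [1, 2]
print(lost)             # []
```

If the 0.4 detection were discarded before association, track 2 would end up in `lost` and would later restart under a new ID; the second stage is what keeps its identity alive.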
What Does a Complete Edge Pipeline Look Like?
| Layer | Recommended Approach | Why |
|---|---|---|
| Tracker | ByteTrack | Best speed/accuracy balance; handles confidence dips |
| Coordinate Smoothing | One Euro Filter | Adaptive latency; O(1) complexity |
| State Debouncing | N-out-of-M + TTL | Filters transient noise; preserves identity through occlusions |
| Threshold Logic | Hysteresis | Prevents toggling at decision boundaries |
| Inference Runtime | TensorRT / DeepStream | Maximizes FPS; offloads post-processing from CPU |
The layers work together. ByteTrack maintains object identity through temporary detection failures. The One Euro Filter smooths each track's coordinates without introducing lag during motion. N-out-of-M and TTL logic ensure transient glitches never reach your actuation layer. Hysteresis prevents the final binary signal from chattering.
Temporal consistency requires treating the entire pipeline as a system, not bolting filters onto a frame-by-frame detector as an afterthought.