
Air Drawing Recognition

Real-time sketch recognition system achieving 99.79% accuracy with sub-100ms inference latency, built on a hybrid CNN-LSTM architecture.

Duration Jan 2025 – May 2025
Role ML Engineer
Team 2-person team
Status Production Ready
Stack TensorFlow · MediaPipe · Python · Real-Time Systems · Computer Vision

The Challenge

Real-time sketch recognition demands both low latency (an interactive feel) and high accuracy (correct predictions). The constraints: inference must run on CPU with no GPU acceleration, and hand tracking must stay smooth and robust despite occlusion and lighting changes.

Standard approaches (e.g., fine-tuning a ResNet) yield ~95% accuracy but 500ms+ latency on CPU. We needed architectural innovation to hit 99%+ accuracy while staying under 100ms per frame.

My Role

Model Architecture Design

Designed hybrid CNN-LSTM architecture: CNN extracts spatial features from sketch frames, LSTM captures temporal dependencies across the drawing sequence.

Hand Tracking & Preprocessing

Implemented MediaPipe for real-time hand detection. Applied Kalman filtering for smooth hand position tracking and pinch gesture detection for start/stop signals.

Performance Optimization

Reduced model size through quantization (4× smaller). Profiled inference bottlenecks. Achieved sub-100ms latency on CPU through careful architecture choices.

Pattern Completion Algorithms

Built a predictive suggestion system: as the user draws, the model suggests what they're likely drawing. Useful for fast input and user guidance.
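The suggestion layer sits on top of the classifier's softmax output. A minimal sketch of the idea (the class names and thresholds here are hypothetical, not the production values):

```python
import numpy as np

CLASSES = ["cat", "house", "star", "tree"]  # hypothetical label set

def suggest(probs, k=3, min_conf=0.15):
    """Return the top-k classes whose softmax probability clears min_conf."""
    order = np.argsort(probs)[::-1][:k]
    return [(CLASSES[i], float(probs[i])) for i in order if probs[i] >= min_conf]

# Example: a partial sketch where the model already leans toward "star"
probs = np.array([0.05, 0.10, 0.70, 0.15])
print(suggest(probs))  # [('star', 0.7), ('tree', 0.15)]
```

Suggestions refresh on every new prediction, so the list narrows as the drawing takes shape.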

Technical Approach

Three-stage pipeline: hand tracking → sketch extraction → neural classification.

Input Stage

Hand Tracking

  • MediaPipe hand detection
  • Kalman filtering (smoothing)
  • Pinch detection (start/stop)
  • 30 FPS real-time processing

Processing Stage

Sketch Extraction

  • Trajectory reconstruction
  • Stroke normalization
  • Context windowing
  • Temporal augmentation

Output Stage

Classification

  • CNN: spatial features
  • LSTM: temporal patterns
  • Dense: class prediction
  • Confidence scoring
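The three stages above can be wired together as a simple pipeline. This is an illustrative skeleton, not the production code: the stage bodies are stubs, and only the normalization step is fleshed out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class Prediction:
    label: str
    confidence: float

def track_hand(frame) -> Point:
    """Stage 1 stub: MediaPipe detection + Kalman smoothing would go here."""
    return (0.5, 0.5)

def normalize_sketch(points: List[Point]) -> List[Point]:
    """Stage 2: scale the raw trajectory into the unit square [0, 1]^2."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(max(xs) - min(xs), 1e-6)
    h = max(max(ys) - min(ys), 1e-6)
    return [((x - min(xs)) / w, (y - min(ys)) / h) for x, y in points]

def classify(sketch: List[Point]) -> Prediction:
    """Stage 3 stub: CNN-LSTM inference would go here."""
    return Prediction(label="circle", confidence=0.99)

# One pass over a buffered drawing (the frames are placeholders here)
trajectory = [track_hand(f) for f in range(30)]
prediction = classify(normalize_sketch(trajectory))
```

Keeping the stages behind narrow interfaces like this lets each one be profiled and swapped independently.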

Key Technical Decisions

1. Hybrid CNN-LSTM Over Pure CNN

Pure CNNs (ResNet, VGG) don't naturally capture temporal sequence. Drawing is inherently sequential—pen direction, stroke order, timing all matter.

  • Why hybrid: CNN handles spatial features, LSTM captures how features evolve over time
  • Result: 99.79% accuracy vs. 95% for CNN alone
  • Trade-off: Slightly more complex, but worth accuracy gain
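The hybrid can be sketched in a few lines of Keras. Layer widths and sequence dimensions below are illustrative, not the production configuration: a small CNN runs on every rasterized frame via `TimeDistributed`, and an LSTM consumes the resulting per-frame feature vectors.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 345           # Quick Draw category count
SEQ_LEN, H, W = 16, 64, 64  # illustrative: 16 rasterized frames per drawing

# Spatial half: per-frame feature extractor (the CNN)
frame_cnn = models.Sequential([
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

# Temporal half: run the CNN on every frame, feed features to an LSTM
inputs = layers.Input(shape=(SEQ_LEN, H, W, 1))
x = layers.TimeDistributed(frame_cnn)(inputs)  # (batch, SEQ_LEN, 32)
x = layers.LSTM(64)(x)                         # summarizes the stroke over time
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs, outputs)
```

The split mirrors the intuition in the text: convolutions see *what* each frame looks like, the LSTM sees *how* it got there.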

2. Kalman Filtering for Hand Position

MediaPipe occasionally jitters or briefly loses the hand, so raw coordinates are noisy. A Kalman filter smooths them without adding perceptible lag.

  • Why Kalman: Optimal for linear systems with Gaussian noise. Mathematically principled smoothing.
  • Benefit: Smooth trajectories, better model input
  • Latency: Negligible overhead (<1ms per frame)
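A constant-velocity Kalman filter over the fingertip position is enough for this use case. The sketch below is a minimal NumPy version; the noise parameters `q` and `r` are illustrative, not the tuned production values.

```python
import numpy as np

class Kalman2D:
    """Constant-velocity Kalman filter over state (x, y, vx, vy).

    q: process noise (trust in the motion model); r: measurement noise
    (how jittery the detector is). Both values here are illustrative.
    """

    def __init__(self, q=1e-3, r=1e-2):
        dt = 1.0 / 30.0  # 30 FPS frame interval
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)  # state transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # observe position only
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def update(self, z):
        # Predict forward one frame
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct with the new measurement z = (x, y)
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]  # smoothed (x, y)
```

Each frame costs two small matrix multiplies and a 2×2 inverse, which is why the overhead stays well under a millisecond.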

3. Quantization for Inference Speed

The model had 18M parameters. At full precision (float32) that's ~72MB, too large for mobile. Quantized to int8: 18MB (4× smaller) and over 3× faster.

  • Accuracy loss: <0.1% (99.79% → 99.69%)
  • Speed gain: 250ms → 78ms latency
  • Deployment: Now viable on-device (mobile, edge)
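The production path used a converter toolchain, but the core idea of symmetric int8 weight quantization fits in a few lines. This sketch quantizes a stand-in weight tensor (sizes are illustrative) and checks the reconstruction error bound:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = rng.standard_normal(180_000).astype(np.float32)  # stand-in for real weights
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)  # 4: float32 -> int8
max_err = np.abs(dequantize(q, scale) - w).max()  # bounded by roughly scale / 2
```

The 4× size reduction falls straight out of the dtype change; the speedup comes from int8 kernels and a smaller memory footprint.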

4. Quick Draw Dataset + Custom Augmentation

Google's Quick Draw dataset (50M drawings) is massive, but it was drawn on desktop/tablet screens, not in the air. We applied aggressive augmentation to simulate real-world variation.

  • Augmentation: Rotation, scaling, speed variation, occlusion simulation
  • Result: Better generalization to hand-drawn sketches
  • Impact: Reduced overfit, improved real-world performance
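Working directly on the (x, y) point sequences makes the geometric augmentations cheap. A minimal sketch of rotation, scaling, and speed variation (parameter ranges here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(stroke, max_deg=15.0, scale_range=(0.8, 1.2)):
    """Randomly rotate and scale a stroke of (x, y) points about its centroid,
    then resample along its length to simulate drawing-speed variation."""
    pts = np.asarray(stroke, dtype=np.float64)
    c = pts.mean(axis=0)
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    s = rng.uniform(*scale_range)
    pts = (pts - c) @ R.T * s + c
    # Speed variation: resample to a randomly shorter or longer sequence
    n = int(rng.integers(int(0.7 * len(pts)), int(1.3 * len(pts)) + 1))
    t_old = np.linspace(0.0, 1.0, len(pts))
    t_new = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(t_new, t_old, pts[:, 0]),
                     np.interp(t_new, t_old, pts[:, 1])], axis=1)
```

Occlusion simulation (dropping short runs of points) follows the same pattern and is omitted here for brevity.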

Results

99.79%
Test Accuracy

On Quick Draw dataset (1M test samples)

78ms
Inference Latency

Per prediction, CPU-only (int8 quantized)

30 FPS
Real-Time Tracking

Hand detection via MediaPipe

18MB
Model Size

After quantization (viable for mobile)

What I Learned

1. Real-Time Systems Have Different Constraints

Accuracy isn't everything when latency matters. 99% accuracy in 500ms is worse than 98% in 100ms. Learned to balance metrics based on user experience.

2. Temporal Models Require Careful Data Handling

LSTM expects sequential data. Ordering matters. Can't just shuffle training data; must respect drawing sequences. Spent time getting data pipeline right.
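Concretely, shuffling has to happen at the drawing level, never inside a sequence. A minimal sketch of the idea (names are illustrative):

```python
import random

def batches(drawings, batch_size):
    """Shuffle whole drawings between epochs; the frame order inside each
    drawing is never touched, so the LSTM still sees valid sequences."""
    order = list(range(len(drawings)))
    random.shuffle(order)
    for i in range(0, len(order), batch_size):
        yield [drawings[j] for j in order[i:i + batch_size]]
```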

3. CPU Optimization is a Skill

Quantization, pruning, and model distillation make models production-viable. GPU-first training habits neglect practical deployment constraints.

4. Users Benefit from Immediate Feedback

Pattern completion suggestions as they draw are delightful. Real-time feedback loops improve engagement. Important for interactive ML.

Let's talk

Interested in real-time ML, model optimization, or computer vision? Let's connect.