
Air Drawing Recognition

Real-time sketch recognition system achieving 99.79% accuracy with sub-100ms inference latency, built on a hybrid CNN-LSTM architecture.

Duration Jan 2025 – May 2025
Role ML Engineer
Team 2-person team
Status Production Ready
Stack TensorFlow · MediaPipe · Python · Real-Time Systems · Computer Vision

The Challenge

Real-time sketch recognition demands both low latency (an interactive feel) and high accuracy (correct predictions). The constraints: inference must run on CPU with no GPU acceleration, and hand tracking must stay smooth and robust despite occlusion and lighting changes.

Standard approaches (e.g., fine-tuning a ResNet) yield ~95% accuracy but 500ms+ latency on CPU. We needed architectural innovation to hit 99%+ accuracy while staying under 100ms per frame.

My Role

Model Architecture Design

Designed hybrid CNN-LSTM architecture: CNN extracts spatial features from sketch frames, LSTM captures temporal dependencies across the drawing sequence.

Hand Tracking & Preprocessing

Implemented MediaPipe for real-time hand detection. Applied Kalman filtering for smooth hand position tracking and pinch gesture detection for start/stop signals.

Performance Optimization

Reduced model size through quantization (4× smaller). Profiled inference bottlenecks. Achieved sub-100ms latency on CPU through careful architecture choices.

Pattern Completion Algorithms

Built a predictive suggestion system: as the user draws, the model suggests what they're likely drawing. Useful for fast input and user guidance.
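The suggestion layer sits on top of the classifier's softmax output. A minimal sketch of the idea (the class names and thresholds here are hypothetical, not the production values):

```python
import numpy as np

CLASSES = ["cat", "house", "star", "tree"]  # hypothetical label set

def suggest(probs, k=3, min_conf=0.15):
    """Return the top-k classes whose softmax probability clears min_conf."""
    order = np.argsort(probs)[::-1][:k]
    return [(CLASSES[i], float(probs[i])) for i in order if probs[i] >= min_conf]

# Example: a partial sketch where the model already leans toward "star"
probs = np.array([0.05, 0.10, 0.70, 0.15])
print(suggest(probs))  # [('star', 0.7), ('tree', 0.15)]
```

Suggestions refresh on every new prediction, so the list narrows as the drawing takes shape.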

Technical Approach

Three-stage pipeline: hand tracking → sketch extraction → neural classification.

Input Stage

Hand Tracking

  • MediaPipe hand detection
  • Kalman filtering (smoothing)
  • Pinch detection (start/stop)
  • 30 FPS real-time processing

Processing Stage

Sketch Extraction

  • Trajectory reconstruction
  • Stroke normalization
  • Context windowing
  • Temporal augmentation

Output Stage

Classification

  • CNN: spatial features
  • LSTM: temporal patterns
  • Dense: class prediction
  • Confidence scoring
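The three stages above can be wired together as a simple pipeline. This is an illustrative skeleton, not the production code: the stage bodies are stubs, and only the normalization step is fleshed out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class Prediction:
    label: str
    confidence: float

def track_hand(frame) -> Point:
    """Stage 1 stub: MediaPipe detection + Kalman smoothing would go here."""
    return (0.5, 0.5)

def normalize_sketch(points: List[Point]) -> List[Point]:
    """Stage 2: scale the raw trajectory into the unit square [0, 1]^2."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(max(xs) - min(xs), 1e-6)
    h = max(max(ys) - min(ys), 1e-6)
    return [((x - min(xs)) / w, (y - min(ys)) / h) for x, y in points]

def classify(sketch: List[Point]) -> Prediction:
    """Stage 3 stub: CNN-LSTM inference would go here."""
    return Prediction(label="circle", confidence=0.99)

# One pass over a buffered drawing (the frames are placeholders here)
trajectory = [track_hand(f) for f in range(30)]
prediction = classify(normalize_sketch(trajectory))
```

Keeping the stages behind narrow interfaces like this lets each one be profiled and swapped independently.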

Key Technical Decisions

1. Hybrid CNN-LSTM Over Pure CNN

Pure CNNs (ResNet, VGG) don't naturally capture temporal sequence. Drawing is inherently sequential—pen direction, stroke order, timing all matter.

  • Why hybrid: CNN handles spatial features, LSTM captures how features evolve over time
  • Result: 99.79% accuracy vs. 95% for CNN alone
  • Trade-off: Slightly more complex, but worth accuracy gain
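The hybrid can be sketched in a few lines of Keras. Layer widths and sequence dimensions below are illustrative, not the production configuration: a small CNN runs on every rasterized frame via `TimeDistributed`, and an LSTM consumes the resulting per-frame feature vectors.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 345           # Quick Draw category count
SEQ_LEN, H, W = 16, 64, 64  # illustrative: 16 rasterized frames per drawing

# Spatial half: per-frame feature extractor (the CNN)
frame_cnn = models.Sequential([
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

# Temporal half: run the CNN on every frame, feed features to an LSTM
inputs = layers.Input(shape=(SEQ_LEN, H, W, 1))
x = layers.TimeDistributed(frame_cnn)(inputs)  # (batch, SEQ_LEN, 32)
x = layers.LSTM(64)(x)                         # summarizes the stroke over time
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs, outputs)
```

The split mirrors the intuition in the text: convolutions see *what* each frame looks like, the LSTM sees *how* it got there.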

2. Kalman Filtering for Hand Position

MediaPipe occasionally jitters or briefly loses the hand, so raw coordinates are noisy. A Kalman filter smooths them without adding perceptible lag.

  • Why Kalman: Optimal for linear systems with Gaussian noise. Mathematically principled smoothing.
  • Benefit: Smooth trajectories, better model input
  • Latency: Negligible overhead (<1ms per frame)
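A constant-velocity Kalman filter over the fingertip position is enough for this use case. The sketch below is a minimal NumPy version; the noise parameters `q` and `r` are illustrative, not the tuned production values.

```python
import numpy as np

class Kalman2D:
    """Constant-velocity Kalman filter over state (x, y, vx, vy).

    q: process noise (trust in the motion model); r: measurement noise
    (how jittery the detector is). Both values here are illustrative.
    """

    def __init__(self, q=1e-3, r=1e-2):
        dt = 1.0 / 30.0  # 30 FPS frame interval
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)  # state transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # observe position only
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def update(self, z):
        # Predict forward one frame
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct with the new measurement z = (x, y)
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]  # smoothed (x, y)
```

Each frame costs two small matrix multiplies and a 2×2 inverse, which is why the overhead stays well under a millisecond.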

3. Quantization for Inference Speed

The model had 18M parameters. At full precision (float32) that's ~72MB, too large for mobile. Quantized to int8: 18MB (4× smaller) and over 3× faster.

  • Accuracy loss: <0.1% (99.79% → 99.69%)
  • Speed gain: 250ms → 78ms latency
  • Deployment: Now viable on-device (mobile, edge)
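The production path used a converter toolchain, but the core idea of symmetric int8 weight quantization fits in a few lines. This sketch quantizes a stand-in weight tensor (sizes are illustrative) and checks the reconstruction error bound:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = rng.standard_normal(180_000).astype(np.float32)  # stand-in for real weights
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)  # 4: float32 -> int8
max_err = np.abs(dequantize(q, scale) - w).max()  # bounded by roughly scale / 2
```

The 4× size reduction falls straight out of the dtype change; the speedup comes from int8 kernels and a smaller memory footprint.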

4. Quick Draw Dataset + Custom Augmentation

Google's Quick Draw dataset (50M drawings) is massive, but it was drawn on desktop/tablet screens, not in the air. We applied aggressive augmentation to simulate real-world variation.

  • Augmentation: Rotation, scaling, speed variation, occlusion simulation
  • Result: Better generalization to hand-drawn sketches
  • Impact: Reduced overfit, improved real-world performance
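Working directly on the (x, y) point sequences makes the geometric augmentations cheap. A minimal sketch of rotation, scaling, and speed variation (parameter ranges here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(stroke, max_deg=15.0, scale_range=(0.8, 1.2)):
    """Randomly rotate and scale a stroke of (x, y) points about its centroid,
    then resample along its length to simulate drawing-speed variation."""
    pts = np.asarray(stroke, dtype=np.float64)
    c = pts.mean(axis=0)
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    s = rng.uniform(*scale_range)
    pts = (pts - c) @ R.T * s + c
    # Speed variation: resample to a randomly shorter or longer sequence
    n = int(rng.integers(int(0.7 * len(pts)), int(1.3 * len(pts)) + 1))
    t_old = np.linspace(0.0, 1.0, len(pts))
    t_new = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(t_new, t_old, pts[:, 0]),
                     np.interp(t_new, t_old, pts[:, 1])], axis=1)
```

Occlusion simulation (dropping short runs of points) follows the same pattern and is omitted here for brevity.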

Results

99.79%
Test Accuracy

On Quick Draw dataset (1M test samples)

78ms
Inference Latency

Per prediction, CPU-only (int8 quantized)

30 FPS
Real-Time Tracking

Hand detection via MediaPipe

18MB
Model Size

After quantization (viable for mobile)

What I Learned

1. Real-Time Systems Have Different Constraints

Accuracy isn't everything when latency matters. 99% accuracy in 500ms is worse than 98% in 100ms. Learned to balance metrics based on user experience.

2. Temporal Models Require Careful Data Handling

LSTM expects sequential data. Ordering matters. Can't just shuffle training data; must respect drawing sequences. Spent time getting data pipeline right.
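Concretely, shuffling has to happen at the drawing level, never inside a sequence. A minimal sketch of the idea (names are illustrative):

```python
import random

def batches(drawings, batch_size):
    """Shuffle whole drawings between epochs; the frame order inside each
    drawing is never touched, so the LSTM still sees valid sequences."""
    order = list(range(len(drawings)))
    random.shuffle(order)
    for i in range(0, len(order), batch_size):
        yield [drawings[j] for j in order[i:i + batch_size]]
```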

3. CPU Optimization is a Skill

Quantization, pruning, and model distillation make models production-viable. GPU-first training habits neglect practical deployment constraints.

4. Users Benefit from Immediate Feedback

Pattern completion suggestions as they draw are delightful. Real-time feedback loops improve engagement. Important for interactive ML.

Let's talk

Interested in real-time ML, model optimization, or computer vision? Let's connect.