Air Drawing Recognition
Real-time sketch recognition system achieving 99.79% accuracy with sub-100ms inference latency using hybrid CNN-LSTM architecture.
The Challenge
Real-time sketch recognition demands low latency (interactive feel) and high accuracy (correct predictions). The constraint: inference must happen on CPU, no GPU acceleration. Hand tracking must be smooth and robust despite occlusion and lighting changes.
Standard approaches (fine-tuning ResNet) yield 95% accuracy but 500ms+ latency. We needed architectural innovation to hit 99%+ accuracy while staying under 100ms per frame.
My Role
Model Architecture Design
Designed hybrid CNN-LSTM architecture: CNN extracts spatial features from sketch frames, LSTM captures temporal dependencies across the drawing sequence.
Hand Tracking & Preprocessing
Implemented MediaPipe for real-time hand detection. Applied Kalman filtering for smooth hand position tracking and pinch gesture detection for start/stop signals.
Performance Optimization
Reduced model size through quantization (4× smaller). Profiled inference bottlenecks. Achieved sub-100ms latency on CPU through careful architecture choices.
Pattern Completion Algorithms
Built predictive suggestion system: as user draws, model suggests what they're likely drawing. Useful for fast input and user guidance.
Technical Approach
Three-stage pipeline: hand tracking → sketch extraction → neural classification.
Input Stage
Hand Tracking
- MediaPipe hand detection
- Kalman filtering (smoothing)
- Pinch detection (start/stop)
- 30 FPS real-time processing
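Pinch detection from MediaPipe landmarks can be sketched as a distance check between thumb tip and index tip (landmark indices 4 and 8 in MediaPipe's fixed 21-point hand layout). This is a minimal illustration, not the project's exact implementation; the `threshold` value is an assumption.

```python
import math

# MediaPipe hand landmark indices (fixed in the MediaPipe spec)
THUMB_TIP = 4
INDEX_TIP = 8

def is_pinching(landmarks, threshold=0.05):
    """Detect a pinch from normalized (x, y) hand landmarks.

    `landmarks` is a sequence of 21 (x, y) tuples in [0, 1] image
    coordinates, as produced by MediaPipe Hands. A pinch is declared
    when thumb tip and index tip are closer than `threshold`
    (an assumed value; the real system would tune this).
    """
    tx, ty = landmarks[THUMB_TIP]
    ix, iy = landmarks[INDEX_TIP]
    return math.hypot(tx - ix, ty - iy) < threshold

# Example: fingers touching vs. apart
touching = [(0.5, 0.5)] * 21
apart = [(0.5, 0.5)] * 21
apart[INDEX_TIP] = (0.8, 0.5)
```

Because the landmarks are normalized to the image frame, the same threshold works across camera resolutions.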
Processing Stage
Sketch Extraction
- Trajectory reconstruction
- Stroke normalization
- Context windowing
- Temporal augmentation
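Stroke normalization can be sketched as centering, scaling to a unit box, and resampling uniformly along arc length so drawing speed does not change the point density the model sees. A minimal version (the `n_samples=64` choice is an assumption, not the project's actual setting):

```python
import numpy as np

def normalize_stroke(points, n_samples=64):
    """Center a stroke, scale it into [-1, 1]^2, and resample it to a
    fixed number of points spaced evenly along the path.

    `points` is an (N, 2) array of raw fingertip positions.
    """
    pts = np.asarray(points, dtype=np.float64)
    pts = pts - pts.mean(axis=0)            # translate to origin
    scale = np.abs(pts).max()
    if scale > 0:
        pts /= scale                        # fit inside [-1, 1]^2
    # Resample uniformly along arc length so stroke speed does not
    # affect point spacing.
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])
    t /= t[-1] if t[-1] > 0 else 1.0
    u = np.linspace(0.0, 1.0, n_samples)
    return np.column_stack(
        [np.interp(u, t, pts[:, i]) for i in (0, 1)]
    )

out = normalize_stroke([(0, 0), (10, 0), (10, 10)])
```

Fixed-length, scale-invariant strokes mean the downstream network never has to compensate for camera distance or drawing size.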
Output Stage
Classification
- CNN: spatial features
- LSTM: temporal patterns
- Dense: class prediction
- Confidence scoring
Key Technical Decisions
1. Hybrid CNN-LSTM Over Pure CNN
Pure CNNs (ResNet, VGG) don't naturally capture temporal sequence. Drawing is inherently sequential—pen direction, stroke order, timing all matter.
- Why hybrid: CNN handles spatial features, LSTM captures how features evolve over time
- Result: 99.79% accuracy vs. 95% for CNN alone
- Trade-off: Slightly more complex, but worth accuracy gain
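The hybrid idea can be sketched in PyTorch: a small CNN encodes each sketch frame, and an LSTM integrates the per-frame features across the drawing sequence. Layer sizes here are illustrative assumptions, not the trained model's actual configuration; 345 is Quick Draw's category count.

```python
import torch
import torch.nn as nn

class SketchCNNLSTM(nn.Module):
    """Hybrid CNN-LSTM sketch classifier (illustrative sizes only).

    Input: (batch, time, 1, 28, 28) — one grayscale sketch frame
    per timestep.
    """
    def __init__(self, n_classes=345, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),                    # -> 32 * 7 * 7 = 1568
        )
        self.lstm = nn.LSTM(1568, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))    # (b*t, 1568): per-frame
        feats = feats.view(b, t, -1)         # regroup into sequences
        out, _ = self.lstm(feats)            # temporal integration
        return self.head(out[:, -1])         # classify final state

model = SketchCNNLSTM()
logits = model(torch.zeros(2, 16, 1, 28, 28))  # 2 sketches, 16 frames
```

The CNN weights are shared across timesteps, so sequence length adds compute but no parameters.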
2. Kalman Filtering for Hand Position
MediaPipe occasionally jitters or drops hands briefly. Raw coordinates are noisy. Kalman filter smooths without adding lag.
- Why Kalman: Optimal estimator for linear systems with Gaussian noise. Mathematically principled smoothing.
- Benefit: Smooth trajectories, better model input
- Latency: Negligible overhead (<1ms per frame)
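A constant-velocity Kalman filter over 2D fingertip positions captures the idea. This is a sketch of the technique, not the project's code; the process and measurement noise values `q` and `r` are illustrative, untuned assumptions.

```python
import numpy as np

def kalman_smooth(positions, q=1e-3, r=1e-2):
    """Constant-velocity Kalman filter for 2D fingertip tracks.

    `positions` is an (N, 2) array of raw coordinates; returns an
    (N, 2) array of smoothed positions. State is [x, y, vx, vy].
    """
    F = np.array([[1., 0., 1., 0.],     # state transition
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.],
                  [0., 0., 0., 1.]])
    H = np.array([[1., 0., 0., 0.],     # we observe position only
                  [0., 1., 0., 0.]])
    Q = q * np.eye(4)                   # process noise
    R = r * np.eye(2)                   # measurement noise
    x = np.array([positions[0][0], positions[0][1], 0.0, 0.0])
    P = np.eye(4)
    out = []
    for z in np.asarray(positions, dtype=float):
        x = F @ x                       # predict
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R             # update
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        out.append(x[:2].copy())
    return np.array(out)

rng = np.random.default_rng(0)
true = np.column_stack([np.linspace(0, 1, 100)] * 2)
noisy = true + rng.normal(0, 0.05, true.shape)
smoothed = kalman_smooth(noisy)
```

The filter runs one small matrix update per frame, which is why its per-frame overhead is negligible.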
3. Quantization for Inference Speed
Model had 18M parameters. Full precision (float32) → ~72MB, too large for mobile. Quantized to int8 → 18MB, 4× smaller.
- Accuracy loss: <0.1% (99.79% → 99.69%)
- Speed gain: 250ms → 78ms latency
- Deployment: Now viable on-device (mobile, edge)
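The core mechanic behind the 4× size reduction is mapping float32 weights to int8 with a per-tensor scale. The project used a framework quantizer; the snippet below only illustrates the principle.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (illustration only).

    Returns the int8 tensor and the scale factor needed to
    dequantize: w ≈ q * scale. Each value costs 1 byte instead
    of 4, hence the 4× size reduction.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale   # dequantized approximation
```

The rounding error is bounded by half a quantization step, which is why the accuracy loss stays tiny (<0.1% here).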
4. Quick Draw Dataset + Custom Augmentation
Google's Quick Draw dataset (70M drawings) is massive but was drawn on desktop/tablet, not in air. Applied aggressive augmentation to simulate real-world variation.
- Augmentation: Rotation, scaling, speed variation, occlusion simulation
- Result: Better generalization to hand-drawn sketches
- Impact: Reduced overfit, improved real-world performance
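One augmentation pass can be sketched as a random rotation, scale, and arc-length resampling to simulate drawing-speed variation. The ranges below are illustrative assumptions, not the tuned values from the project.

```python
import numpy as np

def augment_stroke(points, rng):
    """Apply random rotation, scaling, and speed variation to one
    stroke (illustrative ranges; occlusion simulation omitted).
    """
    pts = np.asarray(points, dtype=np.float64)
    # Rotation: up to ±15 degrees
    theta = rng.uniform(-np.pi / 12, np.pi / 12)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pts = pts @ rot.T
    # Scaling: ±20%
    pts = pts * rng.uniform(0.8, 1.2)
    # Speed variation: resample to a jittered number of points,
    # simulating faster or slower drawing
    n = max(2, int(len(pts) * rng.uniform(0.7, 1.3)))
    t = np.linspace(0.0, 1.0, len(pts))
    u = np.linspace(0.0, 1.0, n)
    return np.column_stack(
        [np.interp(u, t, pts[:, i]) for i in (0, 1)]
    )

rng = np.random.default_rng(0)
aug = augment_stroke([(0, 0), (1, 0), (1, 1), (0, 1)], rng)
```

Applying transforms like these on the fly effectively multiplies the dataset without storing extra copies, which is what narrows the gap between tablet-drawn training data and air-drawn inputs.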
Results
- 99.79% accuracy on the Quick Draw dataset (1M test samples)
- 78ms per prediction, CPU-only (int8 quantized)
- 30 FPS hand detection via MediaPipe
- 18MB model size after quantization (viable for mobile)
What I Learned
1. Real-Time Systems Have Different Constraints
Accuracy isn't everything when latency matters. 99% accuracy in 500ms is worse than 98% in 100ms. Learned to balance metrics based on user experience.
2. Temporal Models Require Careful Data Handling
LSTM expects sequential data. Ordering matters. Can't just shuffle training data; must respect drawing sequences. Spent time getting data pipeline right.
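One concrete pitfall: splitting frame-level data randomly leaks frames from the same drawing into both train and test sets. A sketch of a drawing-level split (names like `drawing_id` and the `(drawing_id, frame)` pairing are assumptions for illustration):

```python
from collections import defaultdict

def split_by_drawing(frames, test_ratio=0.2):
    """Split frame-level samples into train/test at the drawing
    level, so no drawing's frames leak across the split.

    `frames` is a list of (drawing_id, frame_data) pairs.
    """
    by_drawing = defaultdict(list)
    for drawing_id, frame in frames:
        by_drawing[drawing_id].append(frame)
    ids = sorted(by_drawing)
    n_test = max(1, int(len(ids) * test_ratio))
    test_ids = set(ids[:n_test])
    train = [(d, f) for d in ids if d not in test_ids
             for f in by_drawing[d]]
    test = [(d, f) for d in sorted(test_ids)
            for f in by_drawing[d]]
    return train, test

# 10 drawings, 5 frames each
frames = [(d, (d, k)) for d in range(10) for k in range(5)]
train, test = split_by_drawing(frames)
```

Within each drawing, frame order is preserved, which is what the LSTM needs.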
3. CPU Optimization is a Skill
Quantization, pruning, model distillation—these make models production-viable. GPU-first training neglects practical deployment constraints.
4. Users Benefit from Immediate Feedback
Pattern completion suggestions as they draw are delightful. Real-time feedback loops improve engagement. Important for interactive ML.
Let's talk
Interested in real-time ML, model optimization, or computer vision? Let's connect.