
Here are two declassified case studies showing how classic computer vision delivered real‑time results on modest hardware in the early 2000s. To the best of my knowledge, they were the first publicly deployed real‑time, multi‑camera tracking systems of their kind on commodity CPUs:
- Speedway (2003): real‑time multi‑motorcycle tracking in a stadium.
- Greyhound (2004-2005): multi‑camera dog (and mechanical hare) tracking at a racetrack.
Both ran live, streamed data to dial‑up users, and had to succeed on tiny FLOPS budgets; that scarcity forged habits that still matter on modern edge devices. For additional detail, the articles above are the best place to start (the Greyhound piece in particular includes plenty of photos); a brief summary and the lessons learned are outlined below.
2003–2005: What Shipped Under Tight Constraints
Context
When these systems went live, OpenCV was a fledgling v0.x, AlexNet (2012) was years away, and a single Pentium 4 could push on the order of 12 GFLOPS (≈0.012 TOPS) of peak compute. There was no practical GPGPU, 1 GB of RAM counted as high‑end, and the video was interlaced PAL at 25 fps.
Systems at a glance
System | Cameras | Compute | Objects | End‑to‑End Latency | Notable Tricks |
---|---|---|---|---|---|
Speedway (2003) | 9 × PAL CCTV | 3 × Pentium 4 | 4 motorcycles | <200 ms | SSE2 color kernels, helmet‑cam identity hints |
Greyhound (2004-2005) | 52 × PAL CCTV | 9 × Pentium 4 | 6 dogs + hare | <220 ms | 64×16 analog video matrix; 1‑D “track‑unwrapped” EKF |
Why the “primitive” hardware helped
Constraint | Counter‑measure |
---|---|
25 fps interlaced PAL | Use single‑field processing to halve motion blur; regain detail via multi‑view geometry |
Zero GPUs, 1 GB RAM | Hand‑rolled SIMD, LUT color classifiers, early‑exit motion masks, ROI pyramids |
100 Mb LAN; many dial‑up users | Stream state vectors (<1 kB per frame) instead of video |
Dust, glare, dropouts | Per‑pixel variance masks; auto‑recovery and camera failover |
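As a taste of how those counter‑measures look in code, here is a minimal sketch of a LUT colour classifier in NumPy; the 5‑bit quantisation, the 32 KB table size, and the colour ranges are illustrative assumptions, not the original SSE2 implementation:

```python
import numpy as np

# 32x32x32 lookup table (5 bits per RGB channel, 32 KB total), small
# enough to stay resident in a Pentium 4-era cache. 0 = background.
LUT = np.zeros((32, 32, 32), dtype=np.uint8)

def mark_colour(class_id, lo, hi):
    """Label every quantised RGB cell inside the box [lo, hi] with class_id."""
    l = np.asarray(lo) >> 3
    h = np.asarray(hi) >> 3
    LUT[l[0]:h[0] + 1, l[1]:h[1] + 1, l[2]:h[2] + 1] = class_id

mark_colour(1, (180, 0, 0), (255, 80, 80))   # e.g. a red bib or helmet
mark_colour(2, (0, 0, 150), (90, 90, 255))   # e.g. a blue one

def classify(frame):
    """H x W x 3 uint8 frame -> H x W class map, one table fetch per pixel."""
    q = frame >> 3                            # quantise to 5 bits per channel
    return LUT[q[..., 0], q[..., 1], q[..., 2]]
```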
Engineering highlights
- Geometry‑first pipelines. In both systems the oval track was “unwrapped” to a 1‑D arclength coordinate s. Each camera produced observations z = s + noise; a single EKF fused them into smooth trajectories, so occlusions became gaps along s rather than hard 2‑D re‑identification problems (a sketch of this fusion follows the list).
- Deterministic latency. Fixed time budgets per stage (capture → mask → blob → association → fuse), with watchdogs that degraded gracefully (smaller ROIs, shorter association windows) under load; a toy watchdog is sketched below.
- Robust association. Simple gating (Mahalanobis distance) plus nearest‑neighbour matching across cameras outperformed heavier global solvers on the commodity CPUs of the era; the fusion sketch below includes such a gate.
- Operational pragmatism. Camera‑by‑camera health scores; automatic de‑weighting in the filter when variance spiked (rain, floodlights, spectators).
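To make the geometry‑first idea concrete, here is a minimal sketch of the track‑unwrapped fusion, assuming a constant‑velocity motion model; the track length, noise values, and gate threshold are invented for illustration, and with this linear measurement model the EKF reduces to a plain Kalman filter:

```python
import numpy as np

TRACK_LEN = 480.0  # metres of unwrapped oval; an assumed figure

class UnwrappedTrackFilter:
    """Constant-velocity filter on the 1-D arclength coordinate s."""

    def __init__(self, s0, v0=15.0):
        self.x = np.array([s0, v0])    # state: [position s, speed]
        self.P = np.diag([4.0, 4.0])   # state covariance (illustrative)
        self.Q = np.diag([0.05, 0.5])  # process noise per second (illustrative)

    def predict(self, dt):
        F = np.array([[1.0, dt], [0.0, 1.0]])
        self.x = F @ self.x
        self.x[0] %= TRACK_LEN                 # stay on the oval
        self.P = F @ self.P @ F.T + self.Q * dt

    def update(self, z, r, gate=9.0):
        """Fuse one camera's observation z = s + noise (variance r)."""
        # Signed innovation with wrap-around at the start/finish line:
        y = (z - self.x[0] + TRACK_LEN / 2) % TRACK_LEN - TRACK_LEN / 2
        S = self.P[0, 0] + r                   # innovation variance (H = [1, 0])
        if y * y / S > gate:                   # squared Mahalanobis distance, ~3-sigma gate
            return False                       # gated out: a gap along s, not a re-ID crisis
        K = self.P[:, 0] / S                   # Kalman gain
        self.x = self.x + K * y
        self.P = self.P - np.outer(K, self.P[0, :])
        return True

# One PAL field step (50 Hz), then fuse two overlapping cameras:
f = UnwrappedTrackFilter(s0=12.0)
f.predict(dt=0.02)
f.update(z=12.7, r=0.25)
f.update(z=12.4, r=0.5)
```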
Core idea: Use the world’s structure (track layout, motion priors, order constraints) so that simple algorithms win in real time.
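The deterministic‑latency point deserves a sketch too. The per‑stage budgets below are illustrative (they happen to sum to 200 ms, in the spirit of the latency figures above, but the real budgets were never published), and the stage functions are hypothetical no‑ops standing in for the real pipeline:

```python
import time

# Illustrative per-stage budgets in milliseconds:
BUDGET_MS = {"capture": 40, "mask": 30, "blob": 40, "associate": 60, "fuse": 30}

def run_stage(name, fn, load):
    """Run one pipeline stage; shed work if it blows its budget."""
    t0 = time.monotonic()
    fn(load)
    elapsed_ms = (time.monotonic() - t0) * 1000.0
    if elapsed_ms > BUDGET_MS[name]:
        # Degrade gracefully instead of missing the frame deadline:
        load["roi_scale"] = max(0.5, load["roi_scale"] * 0.8)    # smaller ROIs
        load["assoc_window"] = max(2, load["assoc_window"] - 1)  # shorter windows
    return elapsed_ms

# Hypothetical no-op stages standing in for the real ones:
stages = [(name, lambda load: None) for name in BUDGET_MS]
load = {"roi_scale": 1.0, "assoc_window": 5}
for name, fn in stages:
    run_stage(name, fn, load)
```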
Lessons that are still relevant today
2003 Approach | 2025 Equivalent |
---|---|
Hand‑optimised kernels and cache awareness | Better quantisation strategies, compiler pragmas, and memory layouts for edge TPUs / NPUs |
Geometry before deep nets | Smaller models, fewer labels; homographies and EKFs reduce training burden |
Bandwidth‑first design | On‑device inference + lightweight uplinks (telemetry, not video) lower cost and improve privacy |
Designed‑for‑failure | Self‑healing nodes, health telemetry, and graceful degradation are as important as your models' mean average precision (mAP) |
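To ground the first row, here is a hedged sketch of post‑training int8 quantisation with the TensorFlow Lite converter; the model path, input shape, and calibration stream are placeholders, not a recipe from the original systems:

```python
import tensorflow as tf

def representative_data():
    # Placeholder calibration stream; in practice, a few hundred real frames.
    for _ in range(100):
        yield [tf.random.uniform([1, 320, 320, 3], dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("detector_savedmodel/")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("detector_int8.tflite", "wb") as f:
    f.write(converter.convert())
```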
These early deployments showed that real‑time performance is achievable on modest hardware by leaning on geometry, priors, and simplification of the problem space.
Evergreen lesson: Scarcity clarifies vision, whether it’s squeezing handcrafted kernels onto a Pentium 4 in 2003 or quantising modern detectors onto a 3–5 W edge accelerator in 2025.