Kinetic: A Cognitive Layer for Real-Time Physical Movement Intelligence
Rajashekar Vennavelli
TreeHacks 2026, Stanford University, Stanford, 94305, CA, USA
Abstract
We present Kinetic, the first real-time physical movement intelligence system that perceives, reasons about, and enhances human physical capability across multiple domains: skill coaching, physical therapy rehabilitation, autonomous fall detection, and spatial monitoring. Kinetic introduces a 4-tier expert generation pipeline that synthesizes biomechanically accurate reference poses from natural language descriptions, leveraging canonical biomechanical templates, large language model semantic mapping (Claude Sonnet 4), and state-of-the-art text-to-3D motion generation (HY-Motion 1.0-Lite on NVIDIA A100). Our system evaluates movement using a novel triple-metric scoring engine combining Gaussian joint angle comparison, cosine spatial similarity, and COCO Object Keypoint Similarity (OKS). Real-time feedback is delivered through conversational voice coaching (GPT-4o Realtime API) with 3-layer interruption handling. In autonomous monitoring mode, Kinetic watches physical spaces for falls, inactivity, and safety events — sending alerts via Telegram with photos. Edge AI inference on NVIDIA DGX Spark's GB10 Superchip provides sub-50ms pose estimation latency. The system orchestrates 46 MCP tools across 12 categories through the Anthropic Claude Agent SDK with 3 sub-agents. Kinetic demonstrates that AI-driven physical intelligence can augment human physical agency — helping people move, recover, express, and perform at their full potential.
Keywords: Physical Movement Intelligence, Pose Estimation, Motion Generation, Fall Detection, Autonomous Monitoring, Edge AI, Multi-Agent Systems, Real-Time Voice, Computer Vision, DGX Spark, Modal GPU, Claude Agent SDK
1. Introduction
Today, AI augments cognitive intelligence — helping us write, code, and reason. Yet human physical capability remains fundamentally unaugmented. Movement literacy, motor skill development, rehabilitation, injury prevention, and safety monitoring still depend on expensive human experts ($80–200/hour for personal trainers, $100+/hour for physical therapists) or crude apps that count reps without understanding biomechanics. An estimated 1.7 billion people worldwide want to learn physical skills, roughly 55 million older Americans could benefit from fall monitoring, and millions more need accessible physical therapy rehabilitation.
Recent advances in computer vision (MediaPipe [1], YOLOv8 [2]), large language models (Claude [3], GPT-4o [4]), and motion generation (HY-Motion 1.0 [5]) create an unprecedented opportunity to build AI coaching systems that see, understand, and communicate corrections in real time. However, existing approaches suffer from three key limitations: (1) they require pre-recorded expert demonstrations, limiting skill coverage; (2) they use simplistic single-metric scoring that fails to capture the nuances of human movement; and (3) they lack natural communication interfaces, relying on visual overlays that users cannot observe during active movement.
Kinetic addresses all three limitations and goes further — introducing autonomous spatial monitoring for fall detection, goal-based reasoning for safety-critical environments, and multi-modal intelligence that spans coaching, rehabilitation, and patient monitoring. Our 4-tier expert generation pipeline eliminates the need for demonstration videos. Our triple-metric scoring engine provides robust, multi-dimensional form evaluation. And our voice-first interface delivers corrections naturally during active movement.
2. System Architecture
Kinetic's architecture spans six infrastructure pillars designed for real-time, multi-modal coaching:
- NVIDIA DGX Spark (Edge AI) — YOLOv8n-pose on the GB10 Superchip for 17-keypoint pose estimation at 15+ FPS with <50ms latency.
- Modal + NVIDIA A100 (Cloud GPU) — HY-Motion 1.0-Lite (0.46B params) deployed for text-to-3D motion generation, producing 30-frame skeleton sequences in ~26 seconds.
- Anthropic Claude Agent SDK — Multi-agent orchestration with 46 MCP tools across 12 categories, 3 sub-agents (Perception, Coach, Communicator), and 3 hooks (PreToolUse, PostToolUse, Stop).
- OpenAI GPT-4o Realtime API — Bidirectional voice coaching with 3-layer interruption handling and real-time pose score injection.
- Google MediaPipe + Ultralytics YOLO — 33 body + 21 hand landmarks tracked at 30 FPS with phase detection (preparation → execution → peak → recovery).
- Triple-Metric Scoring Engine — Gaussian joint angles (σ=15°) + Cosine spatial similarity + COCO OKS, weighted 40/30/30.
Data flows from camera frames through the CV pipeline to the scoring engine, with Claude orchestrating the coaching logic and GPT-4o delivering voice corrections. The architecture supports both edge-first inference (DGX Spark) for latency-critical pose estimation and cloud GPU inference (Modal A100) for computationally intensive motion generation.
3. AI Expert Generation Pipeline
The core innovation of Kinetic is generating expert references from text alone, eliminating the requirement for demonstration videos. Our 4-tier pipeline provides graceful degradation with increasing latency and sophistication:
| Tier | Method | Latency | Coverage |
|---|---|---|---|
| 1 | Canonical Templates (10+ exercises) | <1ms | Common exercises |
| 2 | Claude Semantic Mapping | ~500ms | Aliases & variations |
| 3 | Claude Angle Generation | ~1s | Any describable skill |
| 4 | HY-Motion 1.0 on A100 | ~26s | SOTA 3D motion sequences |
Each tier is attempted in order, falling through to the next only when necessary. Tier 1 handles ~70% of coaching requests instantly. Tier 2 resolves natural language variations ("barbell back squat" → squat). Tier 3 uses Claude's biomechanical reasoning to generate per-phase joint angles for truly novel skills. Tier 4 leverages Tencent's HY-Motion 1.0-Lite, a 0.46B parameter text-to-3D motion diffusion model, to generate full skeleton sequences with temporal dynamics.
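The fall-through behavior above can be sketched as a simple ordered dispatcher. This is a minimal illustration, not Kinetic's actual implementation: the tier functions, template contents, and return shapes are placeholder assumptions standing in for the real template store, Claude calls, and HY-Motion service.

```python
# Hypothetical sketch of the 4-tier fallback pipeline. Tier bodies are
# stand-ins: Tier 2 would call Claude for alias resolution, Tier 3 would
# call Claude for per-phase joint angles, Tier 4 would call HY-Motion.
CANONICAL_TEMPLATES = {"squat": {"knee": 90.0, "hip": 80.0}}  # Tier 1 store

def tier1_template(skill: str):
    # Instant (<1ms) lookup against canonical biomechanical templates.
    return CANONICAL_TEMPLATES.get(skill)

def tier2_semantic_map(skill: str):
    # Stand-in for LLM alias resolution ("barbell back squat" -> "squat").
    aliases = {"barbell back squat": "squat", "air squat": "squat"}
    canonical = aliases.get(skill)
    return CANONICAL_TEMPLATES.get(canonical) if canonical else None

def tier3_angle_generation(skill: str):
    # Stand-in for LLM-generated per-phase joint angles for novel skills.
    return {"knee": 100.0, "hip": 90.0}

def tier4_motion_generation(skill: str):
    # Stand-in for text-to-3D motion generation (~26s on an A100).
    return {"frames": 30}

def generate_expert_reference(skill: str):
    """Try each tier in order, falling through only when a tier misses."""
    tiers = [tier1_template, tier2_semantic_map,
             tier3_angle_generation, tier4_motion_generation]
    for tier_num, fn in enumerate(tiers, start=1):
        result = fn(skill)
        if result is not None:
            return tier_num, result
    raise ValueError(f"no tier could handle {skill!r}")
```

A common exercise like "squat" resolves at Tier 1 instantly, an alias like "barbell back squat" falls through to Tier 2, and a novel skill falls through to Tier 3 or 4.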
4. Triple-Metric Scoring Engine
Traditional pose scoring relies on a single metric (typically joint angle difference), which fails to capture spatial relationships and overall pose shape. We propose a triple-metric approach:
- Gaussian Joint Angles (weight: 0.4) — Each of 16 key joint angles is scored using a Gaussian function with σ=15°. This provides smooth, interpretable per-joint scores that naturally penalize large deviations while being tolerant of minor variations within the acceptable range.
- Cosine Spatial Similarity (weight: 0.3) — Normalized skeleton vectors are compared using cosine similarity, capturing overall pose shape independent of body proportions. This metric detects global pose errors (e.g., leaning too far forward) that individual joint angles might miss.
- COCO OKS (weight: 0.3) — Object Keypoint Similarity, the standard metric in academic pose estimation research (used in the COCO benchmark [6]), provides a weighted evaluation based on joint importance and localization accuracy. Each keypoint has a per-type standard deviation (σk) reflecting its annotation variance.
The final score is a weighted combination: S = 0.4 × S_gaussian + 0.3 × S_cosine + 0.3 × S_oks. This multi-metric approach provides a more robust and nuanced evaluation than any single metric alone.
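The three metrics and their weighted combination can be expressed compactly. The sketch below is a simplified single-pose version under stated assumptions: the Gaussian term averages over paired angle lists, the OKS term follows the COCO form exp(-d²/(2·s²·k²)) with per-keypoint constants k and object scale s² passed as `area`, and all keypoints are assumed visible.

```python
import math

def gaussian_angle_score(user_angles, ref_angles, sigma=15.0):
    """Mean Gaussian score over joint-angle deviations (degrees), sigma=15."""
    scores = [math.exp(-((u - r) ** 2) / (2 * sigma ** 2))
              for u, r in zip(user_angles, ref_angles)]
    return sum(scores) / len(scores)

def cosine_similarity(user_vec, ref_vec):
    """Cosine similarity between normalized flattened skeleton vectors."""
    dot = sum(u * r for u, r in zip(user_vec, ref_vec))
    nu = math.sqrt(sum(u * u for u in user_vec))
    nr = math.sqrt(sum(r * r for r in ref_vec))
    return dot / (nu * nr)

def oks(user_kpts, ref_kpts, k_consts, area):
    """COCO OKS: mean of exp(-d_i^2 / (2 * s^2 * k_i^2)) over keypoints,
    where area plays the role of the object scale s^2."""
    total = 0.0
    for (ux, uy), (rx, ry), k in zip(user_kpts, ref_kpts, k_consts):
        d2 = (ux - rx) ** 2 + (uy - ry) ** 2
        total += math.exp(-d2 / (2 * area * k ** 2))
    return total / len(user_kpts)

def triple_metric_score(s_gaussian, s_cosine, s_oks):
    """Weighted 40/30/30 combination from Section 4."""
    return 0.4 * s_gaussian + 0.3 * s_cosine + 0.3 * s_oks
```

For a perfect match all three components evaluate to 1.0, so the combined score is 1.0; each component degrades smoothly and independently as the user's pose deviates.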
5. Autonomous Monitoring Mode
Beyond coaching, Kinetic operates as an autonomous spatial intelligence agent. In monitoring mode, the system accepts a goal (fall detection, desk security, posture watch, study focus) and runs a continuous perception→reasoning→action loop:
- Perception — MediaPipe pose + YOLO detection at 2 FPS (efficiency-optimized for 24/7 monitoring).
- Activity Classification — Multi-signal approach with temporal smoothing: standing, walking, sitting, fallen, exercising, lying_down.
- Goal-Based Reasoning — Claude Agent SDK evaluates whether detected activity matches the configured goal's alert triggers.
- Autonomous Action — Sends Telegram alerts with snapshot photos, voice alerts via OpenAI Realtime, and logs to the tool call panel.
The monitoring loop is fully bidirectional: caregivers can send Telegram commands (/status, /goals, /photo, /start, /stop) to control the system remotely. For hospital deployment, the elderly_care goal prioritizes fall detection with immediate photo alerts — designed for patient safety scenarios where response time is critical.
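One step of the perception→reasoning→action loop can be sketched as follows, with temporal smoothing implemented as a majority vote over a sliding window of recent activity labels. The function names (`monitoring_step`, `send_alert`) and trigger table are illustrative assumptions, not Kinetic's real MCP tools.

```python
from collections import Counter, deque

# Hypothetical goal -> alert-trigger table; the real system's goals and
# triggers are configured through the agent, not hard-coded like this.
ALERT_TRIGGERS = {"elderly_care": {"fallen"}, "desk_security": {"standing", "walking"}}

def smooth_activity(history: deque) -> str:
    """Majority vote over recent classifications to suppress frame-level flicker."""
    return Counter(history).most_common(1)[0][0]

def monitoring_step(goal, raw_label, history, send_alert):
    """One perception -> reasoning -> action iteration."""
    history.append(raw_label)               # Perception: this frame's raw label
    activity = smooth_activity(history)     # Classification: temporally smoothed
    if activity in ALERT_TRIGGERS.get(goal, set()):  # Goal-based reasoning
        send_alert(f"{goal}: detected '{activity}'")  # Autonomous action
    return activity
```

With the `elderly_care` goal, a single misclassified "fallen" frame is outvoted by the window, while a sustained fall triggers the alert path.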
6. Edge AI Inference on DGX Spark
Real-time coaching demands sub-100ms feedback latency. Cloud-based pose estimation introduces 200-500ms of network latency, which creates a perceptible delay between movement and feedback. We deploy YOLOv8n-pose on the NVIDIA DGX Spark's GB10 Superchip, achieving 17-keypoint skeleton extraction at 15+ FPS with <50ms end-to-end latency — a 4x improvement over cloud alternatives.
The DGX Spark runs our custom inference server that handles concurrent pose estimation requests while maintaining real-time frame rates. The pipeline processes each frame in a single pass: detection → keypoint extraction → skeleton normalization → joint angle computation. This tight feedback loop is critical for movement correction during active exercise.
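The joint angle computation at the end of that per-frame pass reduces to vector geometry over keypoint triples. A minimal 2D version (the production pipeline may operate on normalized or 3D coordinates) looks like this:

```python
import math

def joint_angle(a, b, c):
    """Angle at keypoint b, in degrees, formed by segments b->a and b->c,
    e.g. the knee angle from (hip, knee, ankle) keypoints."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_t = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_t))
```

A fully extended leg (hip, knee, ankle collinear) yields 180°, a right-angle bend yields 90°, giving the scoring engine interpretable per-joint inputs.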
7. Voice-First Coaching Interface
When users are actively performing physical movements, they cannot look at a screen. Voice is the only interface modality that works during active exercise. Kinetic uses OpenAI's GPT-4o Realtime API for bidirectional audio streaming with three key innovations:
- 3-Layer Interruption Handling — Users can interrupt mid-sentence; the AI pauses gracefully and adjusts its response based on the interruption context.
- Context Injection — Real-time pose scores are injected into the voice model's context window, ensuring corrections reference the user's current form rather than stale data.
- Fallback TTS — If the Realtime API is unavailable, the system falls back to standard TTS to maintain coaching continuity.
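The fallback-TTS behavior is a straightforward try-then-degrade pattern. The sketch below uses placeholder callables (`speak_realtime`, `speak_tts`) and a placeholder exception type rather than real OpenAI client calls:

```python
# Illustrative fallback pattern only; RealtimeVoiceError, speak_realtime,
# and speak_tts are hypothetical stand-ins for the real voice clients.
class RealtimeVoiceError(Exception):
    """Raised when the Realtime voice channel is unavailable."""

def deliver_correction(text, speak_realtime, speak_tts):
    """Prefer the Realtime voice channel; fall back to plain TTS on failure
    so coaching continuity is never broken mid-session."""
    try:
        return speak_realtime(text)
    except RealtimeVoiceError:
        return speak_tts(text)
```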
8. Multi-Agent Orchestration with Claude Agent SDK
Kinetic uses the Anthropic Claude Agent SDK for deep multi-agent orchestration. The physical intelligence brain consists of Claude Sonnet 4 with 46 MCP tools organized across 12 categories: spatial analysis, pose comparison, skill coaching, expert generation, recording, reference management, phase detection, rep counting, training data, document parsing, skill intelligence, and system configuration.
Three sub-agents specialize in different aspects: the Perception Agent handles computer vision and pose analysis; the Coach Agent manages form evaluation, corrections, and skill pedagogy; and the Communicator Agent orchestrates voice delivery and user interaction. Pre-tool and post-tool hooks enforce safety guardrails and maintain an audit log of all tool invocations.
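The pre/post hook pattern can be sketched as a guarded dispatcher that blocks disallowed tools and appends every successful invocation to an audit log. This mirrors the PreToolUse/PostToolUse concept but is a hand-rolled stand-in, not the Claude Agent SDK's actual hook API; the blocked-tool set and registry are hypothetical.

```python
import time

AUDIT_LOG = []                         # audit trail of tool invocations
BLOCKED_TOOLS = {"system_shutdown"}    # hypothetical safety guardrail

def pre_tool_use(tool_name, args):
    """PreToolUse-style guardrail: veto a call before it runs."""
    if tool_name in BLOCKED_TOOLS:
        raise PermissionError(f"blocked by guardrail: {tool_name}")

def post_tool_use(tool_name, args, result):
    """PostToolUse-style hook: record the call in the audit log."""
    AUDIT_LOG.append({"tool": tool_name, "args": args,
                      "result": result, "ts": time.time()})

def invoke_tool(tool_name, args, registry):
    """Run a registered tool with pre/post hooks around it."""
    pre_tool_use(tool_name, args)
    result = registry[tool_name](**args)
    post_tool_use(tool_name, args, result)
    return result
```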
9. Implementation & Results
| Component | Metric | Value |
|---|---|---|
| DGX Pose Estimation | Latency | <50ms |
| DGX Pose Estimation | Frame Rate | 15+ FPS |
| Modal HY-Motion | Generation Time | ~26s (30 frames) |
| Modal HY-Motion | Output Shape | [30, 52, 3] |
| Expert Pipeline Tier 1 | Latency | <1ms |
| Expert Pipeline Tier 3 | Latency | ~1s |
| Claude Agent | MCP Tools | 46 (HTTP-exposed) |
| Intelligence Modes | Count | 4 (coach, PT, monitor, hospital) |
| Telegram Bot | Commands | 8 bidirectional |
| Codebase | Lines of Code | 17,000+ |
| Build Time | Duration | 20 hours (solo) |
10. Conclusion
Kinetic demonstrates that real-time physical movement intelligence is achievable through the combination of edge AI inference, multi-agent LLM orchestration, and state-of-the-art motion generation. By unifying skill coaching, physical therapy, autonomous monitoring, and hospital safety into a single platform, Kinetic shows that the same CV + AI stack that coaches a squat can detect a fall, guide PT rehab, and monitor a hospital room. The triple-metric scoring engine provides robust form evaluation, the voice-first interface enables natural coaching during active movement, and the Telegram integration enables autonomous monitoring that alerts caregivers without requiring anyone to watch a live feed.
AI has transformed cognitive work. Kinetic brings that transformation to physical capability. Future work will explore wearable AR integration for projected skeleton overlays, multi-camera 3D reconstruction for depth-aware fall detection, clinical PT dashboards for remote rehabilitation monitoring, and hospital deployment with ceiling-mounted cameras for 24/7 patient safety.
References
[1] C. Lugaresi et al., "MediaPipe: A Framework for Building Perception Pipelines," arXiv:1906.08172, 2019.
[2] G. Jocher et al., "Ultralytics YOLOv8," https://github.com/ultralytics/ultralytics, 2023.
[3] Anthropic, "Claude Agent SDK Documentation," https://docs.anthropic.com/agent-sdk, 2025.
[4] OpenAI, "GPT-4o Realtime API," https://platform.openai.com/docs/guides/realtime, 2025.
[5] Tencent, "HY-Motion 1.0: Text-to-3D Motion Generation," https://huggingface.co/tencent/HY-Motion-1.0, 2025.
[6] T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," ECCV, 2014.
[7] NVIDIA, "DGX Spark with GB10 Superchip," https://www.nvidia.com/dgx-spark, 2025.
[8] Modal Labs, "Modal: Serverless GPU Infrastructure," https://modal.com, 2025.