Kinetic

Real-Time Physical Movement Intelligence

A cognitive layer for understanding, improving, and preserving human physical capability. From skill coaching to fall detection to PT rehab — Kinetic perceives, reasons about, and enhances human movement in real time.

17,000+
Lines of Code
10
AI Models
6
Infrastructure Pillars
20hrs
Solo-Built

4 Intelligence Modes

One platform, four ways to enhance human physical capability.

🎯

AI Skill Coach

Learn any movement from video, text, or AI-generated expert motion with real-time voice feedback

🏥

Physical Therapy

Guided rehab exercises with safety boundaries, ROM tracking, and progress monitoring

👁️

Autonomous Monitor

Goal-based spatial AI with fall detection, desk watch, and focus tracking → Telegram alerts

🏨

Hospital Monitor

Patient fall detection, inactivity alerts, and autonomous caregiver notifications with photos

Demo Video


Loom recording coming soon

6 Infrastructure Pillars

Each pillar represents a major technology powering the real-time coaching pipeline.

NVIDIA DGX Spark

Edge AI — GB10 Superchip

17-keypoint real-time pose estimation at 15+ FPS with <50ms latency. YOLOv8n-pose runs directly on the GB10 Superchip for instant skeleton extraction.

15+ FPS · <50ms · 17 keypoints

Modal + NVIDIA A100

Cloud GPU — HY-Motion 1.0-Lite

Tencent's SOTA text-to-3D motion model (Dec 2025, 0.46B params) deployed on A100 GPU. Generates 30-frame motion sequences from text in ~26 seconds.

0.46B params · ~26s gen · A100 80GB

Anthropic Claude Agent SDK

Multi-Agent Orchestration

Claude Sonnet 4 orchestrates 44 MCP tools across 12 categories. Three sub-agents (Perception, Coach, Communicator) run with hooks for safety and auditing.

44 tools · 3 agents · 12 categories

OpenAI Realtime API

Bidirectional Voice Coaching

GPT-4o Realtime API delivers natural voice coaching with 3-layer interruption handling. Context injection feeds live pose scores into the voice conversation.

3-layer interruption · GPT-4o · Real-time

Computer Vision Pipeline

MediaPipe + YOLOv8n-pose

33 body landmarks + 21 hand landmarks tracked in real time. Phase detection automatically identifies preparation → execution → peak → recovery.

33+21 landmarks · 4 phases · 30 FPS

Triple-Metric Scoring

Gaussian + Cosine + COCO OKS

Three complementary metrics for robust pose evaluation: Gaussian joint angles (σ=15°), Cosine spatial similarity, and COCO OKS — the academic standard.

3 metrics · 16 angles · σ=15°

4-Tier AI Expert Generation

No expert video needed. Kinetic generates references through 4 tiers of increasing sophistication.

1

Canonical Templates

Instant

10+ pre-built exercises with biomechanically accurate joint angles per phase

2

Claude Semantic Mapping

~0.5s

Natural language → canonical exercise mapping via Claude

3

Claude Angle Generation

~1s

Novel skill description → per-phase joint angles generated by Claude

4

HY-Motion on Modal A100

~26s

SOTA text-to-3D motion diffusion generates full skeleton sequences

How It Works

You say: "teach me a roundhouse kick"

→ Tier 1: Alias lookup... miss
→ Tier 2: Claude semantic mapping... no canonical match
→ Tier 3: Claude generates 4-phase joint angles (1.2s)
✓ Expert reference loaded (16 joints × 4 phases)

→ Camera: MediaPipe detects 33 landmarks at 30 FPS
→ Scoring: Triple-metric comparison every frame
→ Voice: "Great hip rotation! Extend your kicking leg more — aim for 160°"

→ Score: 87/100 (angles: 82, cosine: 91, OKS: 88)
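The final score in the trace is a weighted blend of the three sub-scores. A minimal sketch, using the 40/30/30 weights described in the scoring engine section (the function name is ours, not Kinetic's actual API):

```python
def combine_scores(angles: float, cosine: float, oks: float) -> float:
    """Weighted blend: Gaussian joint angles 40%, cosine 30%, COCO OKS 30%."""
    return 0.40 * angles + 0.30 * cosine + 0.30 * oks

# Sub-scores from the trace: angles 82, cosine 91, OKS 88.
final = combine_scores(82, 91, 88)  # 86.5, displayed as 87/100 after rounding
```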

10 AI Models

YOLOv8n-pose · MediaPipe Pose · MediaPipe Hands · HY-Motion 1.0-Lite · CLIP-ViT-Large · Qwen3-8B · Claude Sonnet 4 · GPT-4o Realtime · OpenAI TTS · OpenAI Whisper

Live Modal A100 Endpoint

HY-Motion 1.0-Lite generating real 3D motion on NVIDIA A100 GPU.

$ curl -X POST https://rajashekarvennavelli--aegis-motion-generate-endpoint.modal.run \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a person doing a squat", "num_frames": 30}'
{
  "model": "HY-Motion 1.0-Lite (Tencent, SOTA Dec 2025)",
  "device": "NVIDIA A100 (Modal)",
  "generation_ms": 26525.3,
  "num_frames": 30,
  "raw_shape": [30, 52, 3],
  "keypoints": [... 30 frames of MediaPipe 33-point keypoints ...]
}
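The same request can be issued from Python with only the standard library. The endpoint URL and field names are copied from the sample above; the client itself is a sketch:

```python
import json
import urllib.request

ENDPOINT = "https://rajashekarvennavelli--aegis-motion-generate-endpoint.modal.run"

def generate_motion(prompt: str, num_frames: int = 30) -> dict:
    """POST a text prompt to the Modal A100 endpoint and return the JSON reply."""
    body = json.dumps({"prompt": prompt, "num_frames": num_frames}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def keypoint_shape(reply: dict) -> tuple:
    """(frames, joints, coords) from the response's raw_shape field."""
    return tuple(reply["raw_shape"])
```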

Technical Paper

Our approach to AI-driven physical skill coaching.

Kinetic: A Cognitive Layer for Real-Time Physical Movement Intelligence

TreeHacks 2026, Stanford University, Stanford, CA

Abstract

We present Kinetic, the first real-time physical movement intelligence system that perceives, reasons about, and enhances human physical capability across multiple domains: skill coaching, physical therapy rehabilitation, autonomous fall detection, and spatial monitoring. Kinetic introduces a 4-tier expert generation pipeline, a triple-metric scoring engine (Gaussian joint angles + cosine spatial similarity + COCO OKS), voice-first coaching via GPT-4o Realtime, and autonomous monitoring with Telegram alerts. Edge AI on NVIDIA DGX Spark provides sub-50ms latency, and the system orchestrates 44 MCP tools through the Anthropic Claude Agent SDK across 3 sub-agents.

1. Introduction & Motivation

AI has transformed cognitive work, yet human physical capability remains largely unaugmented. Movement literacy, rehabilitation, injury prevention, and safety monitoring still depend on expensive human experts or crude apps. An estimated 1.7 billion people want to learn physical skills, more than 55 million older Americans are at elevated fall risk, and millions more need accessible PT rehab.

Kinetic is the first system to unify skill coaching, physical therapy, autonomous monitoring, and hospital safety into a single Physical Movement Intelligence platform — powered by the same CV + AI stack across all four modes.

2. System Architecture

Kinetic's architecture spans 6 infrastructure pillars: (1) NVIDIA DGX Spark for edge AI pose estimation, (2) Modal + NVIDIA A100 for cloud-based motion generation, (3) Anthropic Claude Agent SDK for multi-agent orchestration, (4) OpenAI GPT-4o Realtime API for voice coaching, (5) Google MediaPipe + Ultralytics YOLO for computer vision, and (6) a custom triple-metric scoring engine. Data flows from camera frames through the CV pipeline to the scoring engine, with Claude orchestrating the coaching logic and voice delivering corrections.

3. Triple-Metric Scoring Engine

Traditional pose scoring relies on a single metric (typically joint angle difference), which fails to capture spatial relationships and overall pose shape. We propose a triple-metric approach:

  • Gaussian Joint Angles (40%): Each of 16 key joint angles is scored using a Gaussian function with σ=15°, providing smooth, interpretable per-joint scores.
  • Cosine Spatial Similarity (30%): Normalized skeleton vectors are compared using cosine similarity, capturing overall pose shape independent of body proportions.
  • COCO OKS (30%): Object Keypoint Similarity, the standard metric in academic pose estimation (used in the COCO benchmark), provides a weighted evaluation based on joint importance and localization accuracy.
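The first two metrics can be sketched in a few lines, assuming the σ=15° and normalized-skeleton formulation described above (helper names are ours, not Kinetic's API):

```python
import math

def gaussian_angle_score(actual_deg: float, expert_deg: float, sigma: float = 15.0) -> float:
    """Per-joint score in (0, 1]: 1.0 at an exact match, Gaussian falloff with sigma=15 deg."""
    diff = actual_deg - expert_deg
    return math.exp(-(diff ** 2) / (2.0 * sigma ** 2))

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two flattened, normalized skeleton vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

gaussian_angle_score(160.0, 160.0)  # 1.0: perfect match
gaussian_angle_score(145.0, 160.0)  # exp(-0.5) ≈ 0.61: 15° off is one sigma
```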

4. AI Expert Generation Pipeline

The key innovation is generating expert references from text alone. Our 4-tier pipeline provides graceful degradation: Tier 1 (canonical templates) serves instant responses for common exercises; Tier 2 (Claude semantic mapping) resolves aliases and variations; Tier 3 (Claude angle generation) handles novel skills through biomechanical reasoning; and Tier 4 (HY-Motion 1.0-Lite on A100) provides state-of-the-art 3D motion diffusion for the most complex cases. Each tier is attempted in order, falling through to the next only when necessary.
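The fallthrough above can be sketched as a simple chain. The tier functions below are stubs standing in for the real template store, the Claude calls, and the HY-Motion endpoint:

```python
from typing import Optional

def lookup_canonical(skill: str) -> Optional[dict]:
    """Tier 1: instant template lookup (stubbed with a single entry)."""
    templates = {"squat": {"source": "canonical", "phases": 4}}
    return templates.get(skill)

def claude_semantic_map(skill: str) -> Optional[dict]:
    """Tier 2 stub: the real tier asks Claude to map aliases to canonical names."""
    aliases = {"air squat": "squat"}
    mapped = aliases.get(skill)
    return lookup_canonical(mapped) if mapped else None

def claude_angle_gen(skill: str) -> Optional[dict]:
    """Tier 3 stub: the real tier has Claude generate per-phase joint angles."""
    return {"source": "claude-angles", "phases": 4}

def get_expert_reference(skill: str) -> dict:
    # Try each tier in order; fall through only when the previous one misses.
    for tier in (lookup_canonical, claude_semantic_map, claude_angle_gen):
        ref = tier(skill)
        if ref is not None:
            return ref
    # Tier 4 (not stubbed here) would call HY-Motion 1.0-Lite on the Modal A100.
    raise RuntimeError("all tiers failed")
```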

5. Autonomous Monitoring

In monitoring mode, Kinetic accepts a goal (fall detection, desk security, posture watch) and runs a continuous perception→reasoning→action loop. Falls are detected via activity classification with temporal smoothing, Claude evaluates alert triggers, and Telegram delivers photo alerts to caregivers instantly. The bot is bidirectional: caregivers can send /status, /goals, and /photo commands.
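The fall-detection core of that loop can be sketched as follows. The window size and labels are assumptions for illustration; the real loop also runs Claude's reasoning step and the Telegram side:

```python
def run_monitor(frames, classify, window: int = 5) -> list:
    """Return indices of frames that should trigger a fall alert.

    `classify` maps a frame to an activity label; an alert fires only when
    the last `window` frames all read "fallen" (temporal smoothing), so a
    single misclassified frame cannot page a caregiver.
    """
    recent, alerts = [], []
    for i, frame in enumerate(frames):
        recent.append(classify(frame))
        recent = recent[-window:]
        if len(recent) == window and all(label == "fallen" for label in recent):
            alerts.append(i)   # here the real system sends a Telegram photo alert
            recent.clear()     # reset so one fall produces one alert
    return alerts
```

With an identity classifier, three "standing" frames followed by seven "fallen" frames produce exactly one alert, at the fifth consecutive "fallen" frame.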

6. Edge AI & Latency

Real-time coaching demands sub-100ms feedback latency. We achieve this through edge AI inference on the NVIDIA DGX Spark's GB10 Superchip, running YOLOv8n-pose for 17-keypoint estimation at 15+ FPS with <50ms end-to-end latency. This is roughly 4x faster than cloud-based alternatives and enables the tight feedback loop required for movement correction during active exercise.
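For reference, 17-keypoint estimation with YOLOv8n-pose follows the standard Ultralytics API as sketched below (how the model is deployed on the GB10 is Kinetic-specific). The frame-budget helper makes the latency arithmetic explicit: 15 FPS leaves about 66.7 ms per frame, so <50 ms inference keeps headroom for scoring and voice.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame: ~66.7 ms at 15 FPS."""
    return 1000.0 / fps

def estimate_pose(frame):
    """17-keypoint COCO pose via YOLOv8n-pose (requires the `ultralytics` package).

    Import is deferred so the latency helper above stays dependency-free.
    """
    from ultralytics import YOLO
    model = YOLO("yolov8n-pose.pt")           # nano pose checkpoint
    result = model(frame, verbose=False)[0]   # one Results object per image
    return result.keypoints.xy                # (num_people, 17, 2) pixel coords
```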

7. Conclusion

Kinetic demonstrates that the same CV + AI stack that coaches a squat can detect a fall, guide PT rehab, and monitor a hospital room. By unifying four intelligence modes into one platform, Kinetic brings to physical capability what AI has already brought to cognitive work. AI has transformed how we think. Kinetic transforms how we move.

AI has transformed cognitive work. Kinetic brings that transformation to physical capability.

Solo-built in 20 hours at TreeHacks 2026 🌲 Stanford University · 17,000+ lines · 10 AI models

Anthropic · OpenAI · NVIDIA · Modal · Tencent · Google