Vision and Video Dynamics Lab · Research
A single research agenda at increasing levels of abstraction.
Our research follows one progression — from pixel-level motion at the bottom, to world-scale simulation at the top. Each pillar builds on the one below it. The unifying object of study, across all four, is temporal dynamics — how the visual world changes over time.
A scroll-through tour of the four pillars, with live demos.
Video Processing
Low-level video enhancement
We make video look better at the pixel level. By exploiting temporal structure across frames, we synthesize smoother motion, restore detail beyond a single frame's limits, stabilize shaky footage, and minimize quality loss under bandwidth constraints. Each task reduces to one core problem — modeling motion and correspondence between frames precisely enough to act on it.
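To make that shared problem concrete, here is a minimal sketch of flow-based frame interpolation: warp the two neighboring frames toward an intermediate time t with linearly scaled optical flow, then blend. This is an illustrative baseline, not our AdaCoF method; the function names are hypothetical, the linear-motion approximation is an assumption, and the flow fields are assumed to come from an off-the-shelf estimator.

```python
# Toy flow-based frame interpolation (illustrative baseline, not AdaCoF).
# Flows are assumed given, e.g. from an off-the-shelf estimator.
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Sample `frame` (B, C, H, W) at locations displaced by `flow` (B, 2, H, W)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).to(frame)      # (2, H, W) pixel coords, x first
    coords = base.unsqueeze(0) + flow                  # where each output pixel samples from
    # Normalize coordinates to [-1, 1], as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)               # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def interpolate(frame0, frame1, flow_0to1, flow_1to0, t=0.5):
    """Synthesize the frame at time t in (0, 1), assuming linear motion."""
    warped0 = backward_warp(frame0, t * flow_1to0)        # F_{t->0} ~ t * F_{1->0}
    warped1 = backward_warp(frame1, (1 - t) * flow_0to1)  # F_{t->1} ~ (1-t) * F_{0->1}
    return (1 - t) * warped0 + t * warped1
```

Methods like AdaCoF replace the fixed bilinear sampling above with learned, spatially adaptive kernels; the toy version only shows where motion modeling enters the pipeline.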
Topics
- Video Frame Interpolation
- Video Compression
- Video Stabilization
- Video Super-Resolution & Enhancement
- Motion Estimation & Compensation
Representative Work
- AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation · CVPR 2020
- Exploring Discontinuity for Video Frame Interpolation · CVPR 2023 Highlight (Top 10%)
Connecting thread: All tasks reduce to precise modeling of inter-frame motion and correspondence.
Video Understanding
Temporal semantics & representation learning
We extract semantic meaning from temporal visual data — understanding what changes in a video and why. From action recognition to large-scale video foundation models, we study representations that capture meaningful temporal structure rather than treating video as a sequence of independent images.
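The contrast with treating video as independent images can be made concrete in a few lines: mean-pooling per-frame features discards temporal order entirely, while even a small temporal encoder with positional embeddings can represent it. The sketch below is illustrative only; the module names and dimensions are assumptions, not one of our models.

```python
# Bag-of-frames pooling vs. an order-aware temporal encoder (illustrative).
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Attend across time; positional embeddings make the result order-aware."""
    def __init__(self, dim=512, heads=8, layers=2, max_len=64):
        super().__init__()
        self.pos = nn.Parameter(0.02 * torch.randn(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats):                         # (B, T, dim)
        x = frame_feats + self.pos[:, : frame_feats.size(1)]
        return self.encoder(x).mean(dim=1)                  # (B, dim) clip embedding

frame_feats = torch.randn(4, 16, 512)        # 4 clips, 16 per-frame features each
bag_of_frames = frame_feats.mean(dim=1)      # permutation-invariant: order is lost
clip_embed = TemporalEncoder()(frame_feats)  # order-sensitive representation
```

Without the positional embeddings, the transformer itself would be permutation-equivariant and the pooled embedding order-invariant — which is exactly the failure mode temporal representation learning works to avoid.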
Topics
- Video Foundation Models
- Action Recognition
- Video Retrieval & Question Answering
- Temporal Representation Learning
- Video–Language Alignment
Representative Work
- Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition · ECCVW 2020 Oral
- TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models · Twelve Labs Technical Report, 2024
Connecting thread: Temporal semantics is the bridge from pixels to meaning — understanding what changes, and why.
3D / 4D Vision
3D scene reconstruction & rendering
We reconstruct 3D scenes from 2D observations and render them from new viewpoints. Neural representations — radiance fields, Gaussian splatting — capture geometry and appearance with high fidelity. From this foundation, we extend naturally into 4D, modeling how scenes evolve over time.
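Both radiance fields and Gaussian splatting ultimately form images by alpha-compositing samples along rays. Below is a minimal sketch of NeRF-style volume rendering for a single ray; in practice the densities and colors come from a trained field, and here they are placeholder tensors.

```python
# NeRF-style volume rendering for one ray (illustrative; densities and
# colors would come from a trained field, not random tensors).
import torch

def render_ray(density, color, delta):
    """Composite N samples: C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    alpha = 1.0 - torch.exp(-density * delta)                   # (N,) per-sample opacity
    trans = torch.cumprod(                                      # T_i: light surviving
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), dim=0
    )
    weights = trans * alpha                                     # (N,) compositing weights
    return (weights.unsqueeze(-1) * color).sum(dim=0)           # (3,) rendered pixel RGB

n = 128  # samples along the ray
rgb = render_ray(torch.rand(n), torch.rand(n, 3), torch.full((n,), 0.01))
```

Gaussian splatting uses the same front-to-back compositing, with opacities coming from projected 3D Gaussians rather than ray samples; extending either representation to 4D means letting the primitives evolve over time.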
Topics
- 3D Scene Reconstruction
- Neural Radiance Fields (NeRF)
- 3D & 4D Gaussian Splatting
- Novel View Synthesis
- Dynamic Scene Reconstruction
Representative Work
- Temporal Smoothness-Aware Rate-Distortion Optimized 4D Gaussian Splatting · NeurIPS 2025
Connecting thread: Recovering 3D structure from 2D observations — naturally extending into time when scenes move.
World Models
Generative simulation of the visual world
We build generative systems for the visual world — from single-image synthesis with diffusion, through action-conditioned and physics-aware video generation, and into world foundation models that let robots and embodied agents anticipate the consequences of their actions. Across these directions, the aim is the same: visual intelligence — perception, prediction, and simulation — for systems that act in the world.
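The core interface these directions share is a model that rolls a latent state forward under a sequence of actions, so an agent can evaluate imagined futures before acting. The sketch below uses a deliberately tiny MLP as the dynamics model; all names and dimensions are hypothetical stand-ins for the action-conditioned generative models described above.

```python
# A minimal latent world-model interface (hypothetical; a tiny MLP stands
# in for an action-conditioned video/dynamics model).
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Predict the next latent state from the current state and an action."""
    def __init__(self, state_dim=256, action_dim=8, hidden=512):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):                  # (B, S), (B, A)
        return self.dynamics(torch.cat([state, action], dim=-1))

def rollout(model, state, actions):
    """Imagine a trajectory by applying the model autoregressively."""
    trajectory = []
    for action in actions:            # iterate over a (T, B, A) action tensor
        state = model(state, action)
        trajectory.append(state)
    return torch.stack(trajectory)    # (T, B, S) imagined latent futures

model = LatentWorldModel()
futures = rollout(model, torch.randn(2, 256), torch.randn(10, 2, 8))
```

An embodied agent would score such imagined rollouts against a goal and pick the action sequence whose predicted consequences look best — prediction in service of action.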
Topics
- World Foundation Models
- Action-Conditioned Video Generation
- Latent Action Models
- Physics-Aware Generation
- Foundation Models for Embodied AI
- Image/Video Generation
Current Direction
This is an active research direction building on recent advances in action-conditioned video diffusion and latent action pretraining; publications are expected as the lab matures.
Connecting thread: Generation, prediction, and action-conditioning share one foundation — and one purpose: equipping agents to anticipate the visual consequences of action.