ViViD·Lab

Vision and Video Dynamics Lab · Research

A single research agenda at increasing levels of abstraction.

Our research follows one progression — from pixel-level motion at the bottom, to world-scale simulation at the top. Each pillar builds on the one below it. The unifying object of study, across all four, is temporal dynamics — how the visual world changes over time.

Visual showcase

A scroll-through tour of the four pillars, with live demos.

Pillar 01

Video Processing

Low-level video enhancement

We make video look better at the pixel level. By exploiting temporal structure across frames, we synthesize smoother motion, restore detail beyond a single frame's limits, stabilize shaky perspective, and minimize quality loss under bandwidth constraints. Each task reduces to one core problem — modeling motion and correspondence between frames precisely enough to act on it.
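As a toy illustration of that core problem (a minimal sketch, not lab code), the snippet below synthesizes a middle frame by warping two endpoint frames along a precomputed optical flow field. The function names are hypothetical, and it assumes linear motion and nearest-neighbor sampling; learned methods such as AdaCoF replace this fixed warp with far richer, adaptive operators.

```python
# Illustrative sketch of flow-based frame interpolation (not lab code).
# Assumes linear motion between frames and a precomputed flow field.
import numpy as np

def backward_warp(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Sample `frame` at positions displaced by `flow` (shape H, W, 2),
    using nearest-neighbor lookup with border clamping."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.round(ys + flow[..., 1]), 0, h - 1).astype(int)
    return frame[src_y, src_x]

def interpolate_midframe(f0, f1, flow_0to1):
    """Synthesize the frame at t = 0.5: pull f0 forward and f1 backward
    along the (half-scaled) flow, then blend the two warped estimates."""
    from_f0 = backward_warp(f0, -0.5 * flow_0to1)
    from_f1 = backward_warp(f1, 0.5 * flow_0to1)
    return 0.5 * from_f0 + 0.5 * from_f1
```

Everything that makes this hard in practice (occlusion, nonlinear and discontinuous motion, imperfect flow) is precisely what this pillar studies.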

Topics

  • Video Frame Interpolation
  • Video Compression
  • Video Stabilization
  • Video Super-Resolution & Enhancement
  • Motion Estimation & Compensation

Representative Work

  • AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation
    CVPR 2020
  • Exploring Discontinuity for Video Frame Interpolation
    CVPR 2023 Highlight · Top 10%

Connecting thread: All tasks reduce to precise modeling of inter-frame motion and correspondence.

Pillar 02

Video Understanding

Temporal semantics & representation learning

We extract semantic meaning from temporal visual data — understanding what changes in a video and why. From action recognition to large-scale video foundation models, we study representations that capture meaningful temporal structure rather than treating video as a sequence of independent images.
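A toy demonstration of why this matters, with random feature arrays standing in for per-frame embeddings: mean pooling over time is order-invariant, so it cannot tell a clip from the same clip played backwards, while even a crude frame-difference feature can.

```python
# Toy illustration (hypothetical features, not a lab model).
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 128))   # 16 per-frame feature vectors

# Mean pooling over time discards temporal order entirely:
# the reversed clip yields an identical representation.
assert np.allclose(frames.mean(axis=0), frames[::-1].mean(axis=0))

# A feature that encodes frame-to-frame change does distinguish order.
def temporal_diff_feature(x: np.ndarray) -> np.ndarray:
    return np.diff(x, axis=0).flatten()

assert not np.allclose(temporal_diff_feature(frames),
                       temporal_diff_feature(frames[::-1]))
```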

Topics

  • Video Foundation Models
  • Action Recognition
  • Video Retrieval & Question Answering
  • Temporal Representation Learning
  • Video–Language Alignment

Representative Work

  • Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition
    ECCVW 2020 Oral
  • TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
    Twelve Labs Technical Report, 2024

Connecting thread: Temporal semantics is the bridge from pixels to meaning, understanding what changes and why.

Pillar 03

3D / 4D Vision

3D scene reconstruction & rendering

We reconstruct 3D scenes from 2D observations and render them from new viewpoints. Neural representations — radiance fields, Gaussian splatting — capture geometry and appearance with high fidelity. From this foundation, we extend naturally into 4D, modeling how scenes evolve over time.
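As one concrete anchor (the standard quadrature from the NeRF paper, not lab-specific code), the sketch below composites per-sample colors along a single ray: each sample contributes its color weighted by its opacity times the transmittance of everything in front of it. Gaussian splatting alpha-composites with weights of the same form.

```python
# Discrete volume rendering along one ray (standard NeRF quadrature).
import numpy as np

def composite_ray(colors, densities, deltas):
    """colors: (N, 3) per-sample RGB; densities: (N,) volume density;
    deltas: (N,) distances between adjacent samples along the ray."""
    alphas = 1.0 - np.exp(-densities * deltas)            # per-sample opacity
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))  # T_i
    weights = trans * alphas                              # compositing weights
    return weights @ colors                               # (3,) pixel color
```

Extending to 4D keeps this rendering rule and makes the underlying representation a function of time as well as space.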

Topics

  • 3D Scene Reconstruction
  • Neural Radiance Fields (NeRF)
  • 3D & 4D Gaussian Splatting
  • Novel View Synthesis
  • Dynamic Scene Reconstruction

Representative Work

  • Temporal Smoothness-Aware Rate-Distortion Optimized 4D Gaussian Splatting
    NeurIPS 2025

Connecting thread: Recovering 3D structure from 2D observations, extending naturally into time when scenes move.

Pillar 04

World Models

Generative simulation of the visual world

We build generative systems for the visual world — from single-image synthesis with diffusion, through action-conditioned and physics-aware video generation, and into world foundation models that let robots and embodied agents anticipate the consequences of their actions. Across these directions, the aim is the same: visual intelligence — perception, prediction, and simulation — for systems that act in the world.
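To make "anticipating the consequences of actions" concrete, here is a deliberately toy sketch of the planning pattern a world model enables: roll candidate action sequences through a learned dynamics model in imagination and keep the best one. The class, its linear dynamics, and all names are hypothetical placeholders for a learned video/world model.

```python
# Toy random-shooting planner over an imagined rollout (illustrative only).
import numpy as np

class ToyWorldModel:
    """Hypothetical stand-in for a learned dynamics model: maps a latent
    state and an action to the predicted next latent state."""
    def __init__(self, dim=8, action_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(dim, dim))         # state transition
        self.B = rng.normal(scale=0.1, size=(dim, action_dim))  # action coupling

    def step(self, z, a):
        return z + self.A @ z + self.B @ a   # residual linear dynamics

def imagined_return(model, z0, actions, goal):
    """Roll the model forward over a candidate action sequence and score
    how close the imagined trajectory ends to a goal latent."""
    z = z0
    for a in actions:
        z = model.step(z, a)
    return -np.linalg.norm(z - goal)         # higher is better

model = ToyWorldModel()
z0, goal = np.zeros(8), np.ones(8)
rng = np.random.default_rng(1)
candidates = [rng.normal(size=(5, 2)) for _ in range(64)]
best = max(candidates, key=lambda acts: imagined_return(model, z0, acts, goal))
```

Real world models replace the linear toy with learned video dynamics, but the interface (predict, imagine, act) is the same.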

Topics

  • World Foundation Models
  • Action-Conditioned Video Generation
  • Latent Action Models
  • Physics-Aware Generation
  • Foundation Models for Embodied AI
  • Image/Video Generation

Current Direction

This is an active research direction, building on recent advances in action-conditioned video diffusion and latent action pretraining; publications are expected as the lab matures.

Connecting thread: Generation, prediction, and action-conditioning share one foundation and one purpose: equipping agents to anticipate the visual consequences of their actions.