ViViD·Lab

Vision and Video Dynamics Lab · Research

A single research agenda at increasing levels of abstraction.

Our research follows one progression — from pixel-level motion at the bottom, to world-scale simulation at the top. Each pillar builds on the one below it. The unifying object of study, across all four, is temporal dynamics — how the visual world changes over time.

Visual showcase

A scroll-through tour of the four pillars, with live demos.

Pillar 01

Video Processing

Low-level video enhancement

We make video look better at the pixel level. By exploiting temporal structure across frames, we synthesize smoother motion, restore detail beyond a single frame's limits, stabilize shaky perspective, and minimize quality loss under bandwidth constraints. Each task reduces to one core problem — modeling motion and correspondence between frames precisely enough to act on it.
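As a toy illustration of that core problem (a minimal sketch, not lab code), the snippet below synthesizes a middle frame by warping two endpoint frames along a precomputed optical flow field. The function names are hypothetical, and it assumes linear motion and nearest-neighbor sampling; learned methods such as AdaCoF replace this fixed warp with far richer, adaptive operators.

```python
# Illustrative sketch of flow-based frame interpolation (not lab code).
# Assumes linear motion between frames and a precomputed flow field.
import numpy as np

def backward_warp(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Sample `frame` at positions displaced by `flow` (shape H, W, 2),
    using nearest-neighbor lookup with border clamping."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.round(ys + flow[..., 1]), 0, h - 1).astype(int)
    return frame[src_y, src_x]

def interpolate_midframe(f0, f1, flow_0to1):
    """Synthesize the frame at t = 0.5: pull f0 forward and f1 backward
    along the (half-scaled) flow, then blend the two warped estimates."""
    from_f0 = backward_warp(f0, -0.5 * flow_0to1)
    from_f1 = backward_warp(f1, 0.5 * flow_0to1)
    return 0.5 * from_f0 + 0.5 * from_f1
```

Everything that makes this hard in practice (occlusion, nonlinear and discontinuous motion, imperfect flow) is precisely what this pillar studies.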

Topics

  • Video Frame Interpolation
  • Video Compression
  • Video Stabilization
  • Video Super-Resolution & Enhancement
  • Motion Estimation & Compensation

Representative Work

  • AdaCoF: Adaptive Collaboration of Flows for Video Frame Interpolation
    CVPR 2020
  • Exploring Discontinuity for Video Frame Interpolation
    CVPR 2023 Highlight · Top 10%

Connecting thread: All tasks reduce to precise modeling of inter-frame motion and correspondence.

Pillar 02

Video Understanding

Temporal semantics & representation learning

We extract semantic meaning from temporal visual data — understanding what changes in a video and why. From action recognition to large-scale video foundation models, we study representations that capture meaningful temporal structure rather than treating video as a sequence of independent images.
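A toy demonstration of why this matters, with random feature arrays standing in for per-frame embeddings: mean pooling over time is order-invariant, so it cannot tell a clip from the same clip played backwards, while even a crude frame-difference feature can.

```python
# Toy illustration (hypothetical features, not a lab model).
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 128))   # 16 per-frame feature vectors

# Mean pooling over time discards temporal order entirely:
# the reversed clip yields an identical representation.
assert np.allclose(frames.mean(axis=0), frames[::-1].mean(axis=0))

# A feature that encodes frame-to-frame change does distinguish order.
def temporal_diff_feature(x: np.ndarray) -> np.ndarray:
    return np.diff(x, axis=0).flatten()

assert not np.allclose(temporal_diff_feature(frames),
                       temporal_diff_feature(frames[::-1]))
```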

Topics

  • Video Foundation Models
  • Action Recognition
  • Video Retrieval & Question Answering
  • Temporal Representation Learning
  • Video–Language Alignment

Representative Work

  • Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition
    ECCVW 2020 Oral
  • TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
    Twelve Labs Technical Report, 2024

Connecting thread: Temporal semantics is the bridge from pixels to meaning, understanding what changes and why.

Pillar 03

3D / 4D Vision

3D scene reconstruction & rendering

We reconstruct 3D scenes from 2D observations and render them from new viewpoints. Neural representations — radiance fields, Gaussian splatting — capture geometry and appearance with high fidelity. From this foundation, we extend naturally into 4D, modeling how scenes evolve over time.
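As one concrete anchor (the standard quadrature from the NeRF paper, not lab-specific code), the sketch below composites per-sample colors along a single ray: each sample contributes its color weighted by its opacity times the transmittance of everything in front of it. Gaussian splatting alpha-composites with weights of the same form.

```python
# Discrete volume rendering along one ray (standard NeRF quadrature).
import numpy as np

def composite_ray(colors, densities, deltas):
    """colors: (N, 3) per-sample RGB; densities: (N,) volume density;
    deltas: (N,) distances between adjacent samples along the ray."""
    alphas = 1.0 - np.exp(-densities * deltas)            # per-sample opacity
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))  # T_i
    weights = trans * alphas                              # compositing weights
    return weights @ colors                               # (3,) pixel color
```

Extending to 4D keeps this rendering rule and makes the underlying representation a function of time as well as space.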

Topics

  • 3D Scene Reconstruction
  • Neural Radiance Fields (NeRF)
  • 3D & 4D Gaussian Splatting
  • Novel View Synthesis
  • Dynamic Scene Reconstruction

Representative Work

  • Temporal Smoothness-Aware Rate-Distortion Optimized 4D Gaussian Splatting
    NeurIPS 2025

Connecting thread: Recovering 3D structure from 2D observations, extending naturally into time when scenes move.

Pillar 04

World Models

Generative simulation of the visual world

We build generative systems for the visual world — from single-image synthesis with diffusion, through action-conditioned and physics-aware video generation, and into world foundation models that let robots and embodied agents anticipate the consequences of their actions. Across these directions, the aim is the same: visual intelligence — perception, prediction, and simulation — for systems that act in the world.
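To make "anticipating the consequences of actions" concrete, here is a deliberately toy sketch of the planning pattern a world model enables: roll candidate action sequences through a learned dynamics model in imagination and keep the best one. The class, its linear dynamics, and all names are hypothetical placeholders for a learned video/world model.

```python
# Toy random-shooting planner over an imagined rollout (illustrative only).
import numpy as np

class ToyWorldModel:
    """Hypothetical stand-in for a learned dynamics model: maps a latent
    state and an action to the predicted next latent state."""
    def __init__(self, dim=8, action_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.1, size=(dim, dim))         # state transition
        self.B = rng.normal(scale=0.1, size=(dim, action_dim))  # action coupling

    def step(self, z, a):
        return z + self.A @ z + self.B @ a   # residual linear dynamics

def imagined_return(model, z0, actions, goal):
    """Roll the model forward over a candidate action sequence and score
    how close the imagined trajectory ends to a goal latent."""
    z = z0
    for a in actions:
        z = model.step(z, a)
    return -np.linalg.norm(z - goal)         # higher is better

model = ToyWorldModel()
z0, goal = np.zeros(8), np.ones(8)
rng = np.random.default_rng(1)
candidates = [rng.normal(size=(5, 2)) for _ in range(64)]
best = max(candidates, key=lambda acts: imagined_return(model, z0, acts, goal))
```

Real world models replace the linear toy with learned video dynamics, but the interface (predict, imagine, act) is the same.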

Topics

  • World Foundation Models
  • Action-Conditioned Video Generation
  • Latent Action Models
  • Physics-Aware Generation
  • Foundation Models for Embodied AI
  • Image/Video Generation

Current Direction

This is an active research direction, building on recent advances in action-conditioned video diffusion and latent action pretraining; publications are expected as the lab matures.

Connecting thread: Generation, prediction, and action-conditioning share one foundation and one purpose: equipping agents to anticipate the visual consequences of their actions.