Research
My research focuses on video understanding and generation for social interaction:
how models can understand, simulate, and generate people, relationships, goals,
and behavior. I view social intelligence as a two-sided problem, where better
models of human behavior can inform more coherent human-centric generation, and
generation can reveal what our models truly understand. To that end, I study
both human-centric methods and general methods for multimodal understanding and
generation.
-
Human-centric video generation with Professor Huaizu Jiang
- January 2026 – Present
- Addressing the weakness of video generation models in producing socially coherent scenes.
- Building diffusion and flow models that address this weakness by explicitly modeling each agent in the video.
- Representing intent, belief, and knowledge in a latent space so that the model can perform higher-order reasoning about goals and actions.
-
Addressing social degradation in pre-trained vision-language models with Professors Weiyan Shi, Gang Hua, and Huaizu Jiang
- February 2025 – December 2025
- Published in TMLR. [arxiv] [openreview]
- Led a project to unify different visual social interaction understanding tasks under one model, leveraging the synergies between diverse tasks to achieve positive transfer and competitive performance overall.
- Revealed that popular VLMs of the same scale suffer a degradation that impairs their social understanding and leads to negative transfer, and traced it to reduced social decodability of the visual representations after VLM training.
- Extending this work to handle complex compositional social tasks.
-
Egocentric Werewolf strategy classification and utterance prediction with Harrison Kim and Professors Weiyan Shi and Huaizu Jiang
- January 2024 – January 2025
- Led a project to understand subtle social cues from an egocentric perspective.
- Significantly improved performance in strategy prediction over prior methods.
- Worked on producing a strategic game-playing agent, which eventually motivated a pivot to more general social interaction understanding.
-
Modeling nuclei segmentation with Evan Liu and Harrison Kim @ Genentech gRED
- October 2023 – December 2023
- Contributed to novel approaches and implemented state-of-the-art methods for nuclei semantic segmentation as part of the Genentech Computer Vision R&D team.
-
Medical QA fine-tuning with Dr. Michael Wu, Chloe Kim, and Ayush Zenith @ Genentech gRED
- July 2023 – December 2023
- Trained ensembles of language models and NER/RE models on large-scale in-house medical datasets.
- Designed and conducted extensive experiments to evaluate the performance of different models and techniques.
-
Long-form audio-visual understanding with Professor Huaizu Jiang
-
Visual common sense understanding with Alberto Mario Ceballos Arroyo and Professors Byron Wallace and Huaizu Jiang
- August 2022 – August 2023
- Focused first on commonsense visual question answering datasets and explored various approaches to solving the tasks.
- Pivoted to early reasoning techniques such as chain-of-thought prompting, and found that prompting with intermediate reasoning harmed the performance of smaller language models, contrary to popular belief at the time. We documented our findings in a preprint.
Based on Jon Barron's website.