Multimodal AI - Weekly

Multimodal Monday #29: Sampling Smarts, Composable Control

Week of October 13-19, 2025

October 20, 2025 · 17 Resources

Reasoning with Sampling: Your Base Model is Smarter Than You Think

Harvard researchers developed power sampling, an MCMC-based method that unlocks latent reasoning in base models without any training. Their approach matches or beats RL-finetuned models on MATH500, HumanEval, and GPQA while maintaining generation diversity.
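
The paper's exact algorithm is more involved, but the core idea, sampling from a sharpened power of the base distribution with Metropolis-Hastings, can be sketched on a toy discrete "model" (all names and probabilities below are illustrative, not from the paper):

```python
import math
import random

# Toy "base model": log-probs over a tiny vocabulary of complete answers.
# In practice these would be an LLM's sequence log-likelihoods.
base_logp = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

def power_sample_mh(alpha=4.0, steps=5000, seed=0):
    """Metropolis-Hastings sampling from p(x)^alpha / Z.

    Sharpening the base distribution (alpha > 1) concentrates mass on
    high-likelihood outputs without any retraining, which is the general
    idea behind MCMC-style power sampling.
    """
    rng = random.Random(seed)
    states = list(base_logp)
    x = rng.choice(states)
    counts = {s: 0 for s in states}
    for _ in range(steps):
        y = rng.choice(states)  # symmetric uniform proposal
        # Accept with min(1, (p(y)/p(x))^alpha), computed in log space
        log_accept = alpha * (base_logp[y] - base_logp[x])
        if math.log(rng.random()) < log_accept:
            x = y
        counts[x] += 1
    return {s: c / steps for s, c in counts.items()}

freqs = power_sample_mh()
```

With alpha = 4, the empirical frequencies concentrate sharply on the highest-likelihood answer while still occasionally visiting the others, preserving some diversity.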

Ctrl-VI: Controllable Video Synthesis via Variational Inference

Stanford and MIT built Ctrl-VI, a video synthesis system that handles everything from text prompts to precise 4D object trajectories and camera paths. The framework uses variational inference with step-wise KL divergence minimization to produce controllable, diverse videos with 3D consistency.
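
Ctrl-VI's step-wise objective is beyond a snippet, but the KL term at its heart is the standard variational-inference building block. A minimal sketch of the closed-form KL between diagonal Gaussians (generic VI math, not the paper's exact formulation):

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians, summed over dimensions.

    Minimizing this step-wise pulls the approximate distribution q
    toward the target p, the basic mechanism of variational inference.
    """
    return sum(
        0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

# KL is zero iff q and p match exactly.
same = kl_diag_gauss([0.0], [1.0], [0.0], [1.0])   # 0.0
kl = kl_diag_gauss([0.0, 1.0], [1.0, 2.0], [0.2, 0.9], [1.5, 2.0])
```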

FlashWorld: High-quality 3D Scene Generation within Seconds

Tencent, Xiamen University, and Fudan created FlashWorld, which generates high-quality 3D scenes from text or image prompts in 5-10 seconds. The model produces 3D Gaussian representations directly instead of going through multi-view intermediates, combining 2D diffusion quality with 3D geometric consistency.

Trace Anything: Representing Any Video in 4D via Trajectory Fields

ByteDance SEED released Trace Anything, which maps every pixel in a video to a continuous 3D trajectory using B-splines in a single forward pass. The model achieves state-of-the-art performance on trajectory estimation and point-tracking while being significantly faster than iterative methods.
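
A continuous per-pixel trajectory can be represented with only a handful of numbers. A minimal sketch of evaluating one uniform cubic B-spline segment (control-point values are made up; the paper's exact parameterization may differ):

```python
def cubic_bspline_point(ctrl, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1].

    ctrl: four 3D control points, the kind of per-pixel trajectory
    parameters a model like Trace Anything would regress (values here
    are illustrative). Because the trajectory is continuous, it can be
    queried at any time, not just at frame times.
    """
    p0, p1, p2, p3 = ctrl
    # Standard uniform cubic B-spline basis functions (sum to 1)
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6.0
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0
    b3 = t**3 / 6.0
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

# Hypothetical control points for one pixel's 3D trajectory.
ctrl = [(0, 0, 1), (1, 0, 1), (2, 1, 1), (3, 1, 1)]
mid = cubic_bspline_point(ctrl, 0.5)
```

The basis functions form a partition of unity, so a constant set of control points reproduces a stationary point exactly.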

VIST3A: Text-to-3D by Stitching a Multi-View Reconstruction Network to a Video Generator

ETH Zurich and Google unified a video generator with a 3D reconstruction model through model stitching, connecting a pretrained 3D foundation model into the video VAE’s latent space via lightweight linear mapping. The system generates 3D representations directly from text without needing 3D training labels.
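
Model stitching of this kind reduces to fitting a single linear layer between latent spaces. A toy sketch with synthetic "paired latents" and a least-squares fit (dimensions and data are invented for illustration):

```python
import numpy as np

# Toy latent dimensions; the real VIST3A latents are far larger, and the
# mapping is fit between a video VAE and a pretrained 3D foundation model.
rng = np.random.default_rng(0)
d_video, d_3d, n_pairs = 8, 6, 200

# Pretend paired latents: the same scenes encoded by both models,
# assuming an (approximately) linear relation between the two spaces.
z_video = rng.normal(size=(n_pairs, d_video))
true_map = rng.normal(size=(d_video, d_3d))
z_3d = z_video @ true_map

# "Stitching": fit a lightweight linear map from the video VAE's latent
# space into the 3D model's input space by least squares.
W, *_ = np.linalg.lstsq(z_video, z_3d, rcond=None)

# Any video latent can now be pushed through W and decoded by the 3D model.
residual = np.linalg.norm(z_video @ W - z_3d)
```

Because the map is linear and tiny relative to either network, it can be trained cheaply without 3D labels, which is the appeal of the stitching approach.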

Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

Eyeline Labs enables multi-view character consistency and 3D camera control in video diffusion using 4D Gaussian Splatting and video relighting.

LabOS: The AI-XR Co-Scientist That Sees and Works With Humans

LabOS combines computational reasoning with physical experimentation through multimodal perception and XR-enabled human-AI collaboration.

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

VAGEN enhances multi-turn VLM agents by integrating explicit visual state reasoning into the model’s thinking process.

LAKAN: Landmark-Assisted Adaptive Kolmogorov-Arnold Network for Face Forgery Detection

LAKAN introduces a landmark-assisted adaptive Kolmogorov-Arnold Network (KAN) for detecting face forgeries. Paper: https://arxiv.org/pdf/2510.00634. The LAKAN module leverages facial landmarks to generate adaptive parameters for the KAN and applies them, through gating mechanisms, to downsampled features from four stages of the image encoder.

Simple Projection Variants Improve ColBERT Performance

Mixedbread AI investigated architectural improvements to ColBERT’s projection layer for late-interaction models. Paper: https://arxiv.org/abs/2510.12327

Tools & Techniques

Google Veo 3.1

Google released Veo 3.1 and Veo 3.1 Fast through the Gemini API with richer native audio, better cinematic style understanding, and enhanced image-to-video. New features include ingredient-based generation with up to 3 reference images, scene extension for longer videos, and first-and-last frame interpolation.

Anthropic Claude Haiku 4.5

Anthropic released Claude Haiku 4.5, delivering near-frontier performance at one-third the cost and twice the speed of models from five months ago. The model handles real-time, low-latency tasks and is available on Claude API, Amazon Bedrock, and Google Cloud’s Vertex AI.

Baidu PaddleOCR VL 0.9B

Baidu released PaddleOCR VL, a 0.9B-parameter multilingual vision-language model for OCR tasks.

Alibaba Qwen3-VL-4B/8B

Alibaba released Qwen3-VL models in Instruct and Thinking variants, adding more options in the 4B-8B parameter range for vision-language tasks.

ImagenWorld

Google released ImagenWorld, a large-scale benchmark for image generation and editing that makes model failures more visible. Better benchmarks expose where generation models actually break.

Trends & Predictions

1. Continuous Representations Replace Discrete Ones

Trace Anything doesn’t compute frame-to-frame correspondences. It learns continuous 3D trajectories. FlashWorld doesn’t generate multiple views. It produces 3D Gaussians directly. Veo 3.1 doesn’t concatenate clips. It interpolates smooth transitions. The shift from discrete to continuous representations runs through every major paper this week. This matters because continuous representations are more compact, more queryable, and more composable. You can sample at any resolution. You can query based on derivatives and integrals. You can blend and interpolate smoothly. Discrete representations lock you into fixed sampling rates and make interpolation messy. Continuous ones give you infinite resolution and natural interpolation. The move to continuous isn’t just cleaner math. It’s fundamentally more powerful.
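
The practical difference is easy to see in one dimension: a discrete array of frame samples versus a continuous function you can query and differentiate anywhere (toy example, not drawn from any of the papers):

```python
import math

# Discrete representation: positions only at fixed frame times.
frames = [math.sin(0.5 * k) for k in range(10)]

def traj(t):
    """Continuous representation of the same motion (illustrative form)."""
    return math.sin(0.5 * t)

def velocity(t, h=1e-6):
    """Derivative queries come almost for free with a continuous form."""
    return (traj(t + h) - traj(t - h)) / (2 * h)

# Query between frames - the raw discrete array cannot answer this
# without first committing to an interpolation scheme.
x = traj(3.25)
v = velocity(3.25)
```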

2. Composable Control Wins

Ctrl-VI lets you combine text prompts with 4D trajectories and camera paths. Veo 3.1 adds reference images and scene extension. VIST3A stitches models together with linear mappings. The pattern is clear: systems that combine multiple control signals beat single-mode approaches. You don’t choose between high-level creativity and low-level precision anymore. You get both. This matters because creative workflows are messy. You need broad strokes and fine details at different stages. Composable control means you can start with a text prompt, refine with reference images, adjust specific object paths, and modify camera movement without switching tools.

Community + Shoutouts

Builder of the week: real-time head pose estimation for perspective correction. Clean implementation, practical use case.
