Reasoning with Sampling: Your Base Model is Smarter Than You Think
Harvard researchers developed power sampling, an MCMC-based method that unlocks latent reasoning in base models without any training. Their approach matches or beats RL-finetuned models on MATH500, HumanEval, and GPQA while maintaining generation diversity.
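The core idea is to sample from the tempered distribution p(x)^α of the base model, which concentrates probability mass on high-likelihood outputs. As a loose illustration only (a toy discrete target, not the authors' autoregressive implementation; all names below are hypothetical), a minimal Metropolis-Hastings sketch:

```python
import math
import random

def power_sample(logp, propose, x0, alpha=4.0, steps=200, seed=0):
    """Metropolis-Hastings chain targeting p(x)^alpha.

    Sharpening the base distribution (alpha > 1) concentrates samples
    on high-likelihood outputs without any finetuning.
    """
    rng = random.Random(seed)
    x, lp = x0, logp(x0)
    for _ in range(steps):
        y = propose(x, rng)
        lq = logp(y)
        # Symmetric proposal, so the acceptance ratio is (p(y)/p(x))^alpha.
        if lq >= lp or rng.random() < math.exp(alpha * (lq - lp)):
            x, lp = y, lq
    return x

# Toy "base model": a distribution over tokens 0..9, peaked at 7.
weights = [1, 1, 2, 3, 5, 8, 13, 21, 13, 8]
total = sum(weights)
logp = lambda x: math.log(weights[x] / total)
propose = lambda x, rng: rng.randrange(10)  # symmetric uniform proposal

samples = [power_sample(logp, propose, x0=0, seed=s) for s in range(100)]
```

With α = 4 the chains concentrate on the mode of the toy distribution while still visiting other tokens, which is the diversity-preserving behavior the paper emphasizes.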
Ctrl-VI: Controllable Video Synthesis via Variational Inference
Stanford and MIT built Ctrl-VI, a video synthesis system that handles everything from text prompts to precise 4D object trajectories and camera paths. The framework uses variational inference with step-wise KL divergence minimization to produce controllable, diverse videos with 3D consistency.
FlashWorld: High-quality 3D Scene Generation within Seconds
Tencent, Xiamen University, and Fudan created FlashWorld, which generates high-quality 3D scenes from text or image prompts in 5-10 seconds. The model produces 3D Gaussian representations directly instead of going through multi-view intermediates, combining 2D diffusion quality with 3D geometric consistency.
Trace Anything: Representing Any Video in 4D via Trajectory Fields
ByteDance SEED released Trace Anything, which maps every pixel in a video to a continuous 3D trajectory using B-splines in a single forward pass. The model achieves state-of-the-art performance on trajectory estimation and point-tracking while being significantly faster than iterative methods.
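Representing motion as a B-spline means a handful of control points define a continuous 3D path that can be queried at any timestamp. A sketch of evaluating one uniform cubic B-spline segment (the function and variable names are illustrative; the paper's spline degree and parameterization may differ):

```python
def cubic_bspline(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1].

    p0..p3 are 3D control points; the four basis weights sum to 1,
    so the result is a convex combination of the control points.
    """
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
    b3 = t**3 / 6
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

# A trajectory for one pixel: four control points in (x, y, z).
ctrl = [(0.0, 0.0, 1.0), (1.0, 0.5, 1.2), (2.0, 0.4, 1.1), (3.0, 0.0, 0.9)]
# Query the continuous path at arbitrary times -- no per-frame matching.
path = [cubic_bspline(*ctrl, t=k / 10) for k in range(11)]
```

Because the trajectory is a closed-form curve rather than a list of per-frame correspondences, querying it at a new timestamp is a function evaluation, not an interpolation between tracked frames.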
VIST3A: Text-to-3D by Stitching a Multi-View Reconstruction Network to a Video Generator

ETH Zurich and Google unified a video generator with a 3D reconstruction model through model stitching, connecting a pretrained 3D foundation model to the video VAE’s latent space via a lightweight linear mapping. The system generates 3D representations directly from text without needing 3D training labels.
Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures
Eyeline Labs enables multi-view character consistency and 3D camera control in video diffusion using 4D Gaussian Splatting and video relighting.
LabOS: The AI-XR Co-Scientist That Sees and Works With Humans
LabOS combines computational reasoning with physical experimentation through multimodal perception and XR-enabled human-AI collaboration.
VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents
VAGEN enhances multi-turn VLM agents by integrating explicit visual state reasoning into the model’s thinking process.
LAKAN: Landmark-Assisted Adaptive Kolmogorov-Arnold Network for Face Forgery Detection
LAKAN introduces a new network architecture for detecting face forgeries. Paper [https://arxiv.org/pdf/2510.00634] The LAKAN module leverages facial landmarks to generate adaptive parameters for the Kolmogorov-Arnold Network (KAN) and applies them, through gating mechanisms, to downsampled features from four stages of the image encoder.
Simple Projection Variants Improve ColBERT Performance
Mixedbread AI investigated architectural improvements to ColBERT’s projection layer for late-interaction models. Paper [https://arxiv.org/abs/2510.12327]
TOOLS & TECHNIQUES
Google Veo 3.1
Google released Veo 3.1 and Veo 3.1 Fast through the Gemini API with richer native audio, better cinematic style understanding, and enhanced image-to-video. New features include ingredient-based generation with up to 3 reference images, scene extension for longer videos, and first-and-last frame interpolation.
Anthropic Claude Haiku 4.5
Anthropic released Claude Haiku 4.5, delivering near-frontier performance at one-third the cost and twice the speed of models from five months ago. The model handles real-time, low-latency tasks and is available on Claude API, Amazon Bedrock, and Google Cloud’s Vertex AI.
Baidu PaddleOCR VL 0.9B
Baidu released a 0.9B parameter multilingual VLM for OCR tasks.
Alibaba Qwen3-VL-4B/8B
Alibaba released Qwen3-VL models in Instruct and Thinking variants, adding options in the 4B-8B parameter range for vision-language tasks.
ImagenWorld
Google released ImagenWorld, a large-scale benchmark for image generation and editing that makes model failures more visible. Better benchmarks expose where generation models actually break.
TRENDS & PREDICTIONS
1. Continuous Representations Replace Discrete Ones
Trace Anything doesn’t compute frame-to-frame correspondences. It learns continuous 3D trajectories. FlashWorld doesn’t generate multiple views. It produces 3D Gaussians directly. Veo 3.1 doesn’t concatenate clips. It interpolates smooth transitions. The shift from discrete to continuous representations runs through every major paper this week. This matters because continuous representations are more compact, more queryable, and more composable. You can sample at any resolution. You can query based on derivatives and integrals. You can blend and interpolate smoothly. Discrete representations lock you into fixed sampling rates and make interpolation messy. Continuous ones give you infinite resolution and natural interpolation. The move to continuous isn’t just cleaner math. It’s fundamentally more powerful.
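To make the derivative point concrete: a trajectory stored as a continuous function (a polynomial here, as a simple stand-in for the splines and Gaussians above, not any particular paper's format) can be resampled at any rate and differentiated analytically:

```python
def poly_eval(coeffs, t):
    """Horner evaluation of c0 + c1*t + c2*t^2 + ..."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc

def poly_derivative(coeffs):
    """Coefficients of the analytic derivative -- a velocity query for free."""
    return [i * c for i, c in enumerate(coeffs)][1:]

traj = [0.0, 2.0, -1.0]          # x(t) = 2t - t^2, a continuous trajectory
vel = poly_derivative(traj)      # x'(t) = 2 - 2t

# Sample at any resolution without re-fitting or interpolating frames.
coarse = [poly_eval(traj, k / 4) for k in range(5)]
fine = [poly_eval(traj, k / 100) for k in range(101)]
```

A discrete representation would have to store one of those sampling rates and interpolate to recover the other; the continuous one supports both, plus velocity, from the same few coefficients.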
2. Composable Control Wins
Ctrl-VI lets you combine text prompts with 4D trajectories and camera paths. Veo 3.1 adds reference images and scene extension. VIST3A stitches models together with linear mappings. The pattern is clear: systems that combine multiple control signals beat single-mode approaches. You don’t choose between high-level creativity and low-level precision anymore. You get both. This matters because creative workflows are messy. You need broad strokes and fine details at different stages. Composable control means you can start with a text prompt, refine with reference images, adjust specific object paths, and modify camera movement without switching tools.
COMMUNITY + SHOUTOUTS
Builder of the week: Real-time head pose estimation for perspective correction. Clean implementation, practical use case.
