T5Gemma 2
Google released T5Gemma 2, the next generation of encoder-decoder models. The architecture combines bidirectional understanding with flexible text generation.
Perception Encoder Audiovisual (PE-AV)
Meta released PE-AV, the technical engine behind SAM Audio’s audio separation capabilities. The model processes both visual and audio information to isolate individual sound sources.
MiMo-V2-Flash
Xiaomi released MiMo-V2-Flash, optimized for speed in real-time applications. The model sacrifices some accuracy for dramatic latency reductions.
TurboDiffusion
TurboDiffusion accelerates video diffusion models by 100-205 times through architectural optimizations. The speedup comes from reducing redundant computations without quality loss.
Qwen-Image-Layered
Qwen-Image-Layered decomposes images into multiple RGBA layers that can be independently edited. Each layer isolates specific semantic or structural components while maintaining visual coherence.
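For readers who want the mechanics: once an image is split into RGBA layers, recombining them is plain alpha-over compositing, which is why editing one layer leaves the rest untouched. The sketch below is not Qwen’s code; the layer count, sizes, and colors are made up.

```python
import numpy as np

def composite_rgba_layers(layers):
    """Alpha-over composite a back-to-front list of RGBA layers (H, W, 4), values in [0, 1]."""
    h, w, _ = layers[0].shape
    out_rgb = np.zeros((h, w, 3))   # accumulates premultiplied color
    out_a = np.zeros((h, w, 1))
    for layer in layers:            # back to front
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb = rgb * a + out_rgb * (1.0 - a)
        out_a = a + out_a * (1.0 - a)
    # With an opaque bottom layer the final alpha is 1, so this is a standard image.
    return np.concatenate([out_rgb, out_a], axis=-1)

# Edit one layer (e.g. recolor the subject) without touching the others, then recomposite.
background = np.ones((64, 64, 4))                               # opaque white canvas
subject = np.zeros((64, 64, 4)); subject[16:48, 16:48] = [1.0, 0.2, 0.2, 1.0]
image = composite_rgba_layers([background, subject])
print(image.shape)
```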
N3D-VLM
N3D-VLM grounds spatial reasoning in native 3D representations rather than 2D projections. The model understands depth, distance, and spatial relationships directly.
MemFlow
MemFlow maintains adaptive memory for long streaming videos, deciding which frames to remember and which to discard. The system balances memory efficiency with video understanding quality.
WorldPlay
Tencent’s WorldPlay generates interactive 3D worlds with long-term geometric consistency. The model maintains spatial relationships across extended video sequences, allowing persistent interaction with generated environments.
LongVie 2
LongVie 2 generates 5-minute continuous videos with controllable elements and consistent geometry. The model handles multiple modalities and maintains coherence across thousands of frames.
FoundationMotion
FoundationMotion labels and analyzes spatial movement in videos automatically. The system identifies motion patterns and spatial trajectories without manual annotation.
Generative Refocusing
Generative Refocusing controls depth of field in existing images, simulating camera focus changes after capture. The model infers 3D scene structure to generate realistic blur patterns.
StereoPilot
StereoPilot converts 2D videos to stereo 3D through learned generative priors. The system produces depth-aware conversions suitable for VR headsets.
KV-Tracker: Real-Time Pose Tracking with Transformers
KV-Tracker achieves real-time tracking at 30 FPS without any training. The approach uses transformer key-value pairs to track objects and scenes across frames.
DeContext: Protecting Images from Unwanted In-Context Edits
DeContext adds imperceptible perturbations that prevent DiT models like FLUX and Qwen-Image from making unwanted edits. The protection preserves visual quality while blocking manipulation attempts.
EgoX: Generate Immersive First-Person Video from Any Third-Person Clip
EgoX transforms third-person videos into realistic first-person perspectives using video diffusion. The framework from KAIST AI and Seoul National University maintains spatial and temporal coherence during the transformation.
MMGR: Multi-Modal Generative Reasoning
MMGR benchmarks reveal systematic reasoning failures in GPT-4o and other leading multimodal models. The evaluation exposes gaps between perception and logical inference.
Image: MMGR Overview
KlingAvatar 2.0 and Kling-Omni Technical Report
KlingAvatar 2.0 generates high-fidelity avatar videos through a spatio-temporal cascade framework. Kling-Omni provides a generalist framework for multimodal video generation with a Co-Reasoning Director that fuses instructions across modalities.
Step-GUI Technical Report
Step-GUI introduces a self-evolving pipeline for GUI automation. The system reaches state-of-the-art on AndroidWorld and OSWorld benchmarks through iterative improvement.
ReFusion
ReFusion combines diffusion models with parallel autoregressive decoding. The architecture bridges autoregressive models and diffusion models for faster text generation.
DEER
DEER uses diffusion models to draft content and autoregressive models to verify it. The two-stage approach balances generation quality with computational efficiency.
IC-Effect
IC-Effect applies video effects through in-context learning without fine-tuning. The system learns effect patterns from examples and applies them to new videos.
Flow Map Trajectory Tilting
Flow Map Trajectory Tilting improves diffusion model outputs at test time using flow maps. The technique adjusts generation trajectories without retraining.
UniVA: Universal Video Agent
UniVA works like LEGO for video AI: you plug in whatever tools you need. The demo shows it tracking objects, editing footage, and understanding complex scenes all in one system.
Phys2Real: Sim-to-Real Transfer
This method trains robots in simulation then transfers that knowledge to the real world by accounting for real-world messiness. The robot learns what it doesn’t know and adapts accordingly.
Pelican-VL 1.0: The Embodied Intelligence Brain
Beijing’s Pelican-VL converts what robots see directly into 3D movement commands. Their DPPO training method works like human practice: make mistakes, reflect, improve.
OmniVinci: Omni-Modal Understanding LLM
NVIDIA’s OmniVinci processes vision, audio, and language in one unified space. It beats Qwen2.5-Omni by 19% while using 6x less training data.
Teaching AI to See the World More Like We Do
DeepMind used an “odd-one-out” test to show how differently AI sees things compared to humans. Their three-step alignment method fixes this, making AI group concepts the way you naturally would.
Image: Diagram of their three-step model-alignment method.
SIMA 2
Google’s SIMA 2 plays games with you, learns through trial and error, and actually reasons about what to do. Talk to it through text, voice, or images; it understands high-level goals and figures out how to achieve them.
Depth Anything 3 (DA3)
DA3 generates depth maps from regular images with unprecedented accuracy. The demo shows it working on everything from selfies to satellite imagery.
Marble
World Labs’ Marble creates persistent 3D worlds from a single image, video, or text prompt. Upload a photo of your living room, get a walkable 3D space.
Holo2
H-Company’s Holo2 leads all computer-use benchmarks across web, desktop, and mobile. Drop it into your existing Holo setup and it works immediately on Ubuntu, Android, or Chrome.
Image: Web Surfing with Holo2
Music Flamingo
NVIDIA’s Music Flamingo understands full songs, not just clips. It analyzes music structure, identifies instruments, and reasons about compositions.
The Perception-to-Action Gap Closes
This week shows three distinct approaches to the same problem: how do you get AI to actually do things, not just understand them?

Pelican-VL tackles this for robotics with its DPPO training method: the model practices tasks, fails, analyzes what went wrong, then adjusts. Think of it like teaching a robot to play piano: it doesn’t just memorize finger positions, it learns the relationship between what it sees and how to move. The Beijing team tested this on real humanoid robots doing manipulation tasks, and the results show genuine spatial reasoning emerging from visual input alone.

SIMA 2 solves this in virtual environments. Google’s agent doesn’t just execute commands; it maintains persistent goals across gaming sessions, reasons about cause and effect, and learns new skills without being explicitly programmed. When you tell it “build a house,” it figures out it needs to gather materials first, find a good location, and plan the structure. This kind of multi-step reasoning with environment feedback is where the perception-to-action gap starts to close.
dLLM
Zhanhui Zhou turned BERT into a chatbot using diffusion. Yes, you read that right—BERT can now chat.
Next Scene LoRA
OdinLovis built a LoRA that adds camera movement to image generation. Type “Next Scene” and watch your static image become a cinematic sequence.
Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
Single-vector retrieval breaks when queries have multiple distinct answers. The Autoregressive Multi-Embedding Retriever (AMER) generates a sequence of query embeddings instead of one, capturing diverse relevant documents for ambiguous or list-based queries.
Image: The model takes as input the target document embedding (order decided randomly) or the embedding predicted at the previous step, and outputs the next embedding. During inference, AMER predicts the first embedding after seeing the query text, then outputs multiple query embeddings autoregressively.
FractalForensics: Proactive Deepfake Detection and Localization
This detector embeds fractal watermarks into images before they’re shared online. The watermarks survive normal edits but break under AI manipulation, showing you exactly where an image was altered.
Image: Workflow of the proposed FractalForensics.
Cambrian-S: Advancing Spatial Supersensing in Video
NYU and Stanford researchers built models that anticipate and organize complex visual experiences in long videos. The system selects relevant information and reasons about relationships between objects and events over time.
The Underappreciated Power of Vision Models for Graph Structural Understanding
Vision models outperform graph neural networks at understanding global graph properties. The GraphAbstract benchmark shows that vision models intuitively grasp overall structure better than specialized GNNs.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Models improve reasoning on both vision and text tasks by generating video sequences. The Video Thinking Benchmark shows that video generation helps models explore possibilities and think dynamically.
OlmoEarth-v1-Large
AllenAI released a foundation model for remote sensing trained on Sentinel and Landsat satellite data. OlmoEarth turns Earth data into insights within hours using ready-made infrastructure for both image and time series tasks.
BindWeave
ByteDance’s model for subject-consistent video generation uses cross-modal integration to keep subjects consistent across multiple shots. BindWeave already works in ComfyUI.
GEN-0
GeneralistAI built a 10B+ foundation model for robots with Harmonic Reasoning architecture. GEN-0 trains on 270,000+ hours of dexterous data to think and act simultaneously.
Step-Audio-EditX
StepFun open-sourced the first LLM-grade audio editing model. Control emotion, speaking style, breaths, laughs, and sighs through text prompts in a 3B-parameter model that runs on a single GPU.
Image: An overview of the architecture of Step-Audio-EditX
Rolling Forcing
This technique generates multi-minute streaming videos in real-time on a single GPU. Rolling Forcing denoises multiple frames jointly and anchors context with attention sinks for temporal consistency.
Retrieval Gets Smarter
Search broke when you started asking it to do two things at once. AMER fixes this by generating multiple query embeddings instead of forcing everything through a single vector.

Here’s what that means. When you search for “climate change impacts and economic solutions,” a single-vector system picks one interpretation and misses the other. AMER showed 4x better performance than single-embedding models on synthetic data where queries had multiple distinct answers (arXiv [https://arxiv.org/html/2511.02770]). The gains get bigger when your target documents are conceptually distant from each other.

The technique works by predicting query embeddings autoregressively. Each embedding captures a different facet of what you want. Think of it as asking the question from multiple angles simultaneously rather than hoping one angle catches everything. On real-world datasets, AMER showed 4-21% gains on average, but the improvements jumped to 5-144% on queries where target documents formed distinct clusters.
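A toy sketch of the idea, not the authors’ implementation: the query encoder and the autoregressive step below are random stand-ins, but the control flow shows how emitting several query vectors and taking the union of their nearest neighbors covers multiple distinct answers.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 64))                       # toy document index
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def encode_query(text: str) -> np.ndarray:
    """Stand-in for a real query encoder (hypothetical)."""
    vec = rng.normal(size=64)
    return vec / np.linalg.norm(vec)

def next_embedding(prev: np.ndarray) -> np.ndarray:
    """Stand-in for the autoregressive step: predict another facet of the query
    conditioned on the embedding emitted so far (a hypothetical learned head)."""
    vec = prev + 0.5 * rng.normal(size=prev.shape)
    return vec / np.linalg.norm(vec)

def multi_query_retrieve(text: str, n_embeddings: int = 4, k: int = 10) -> set[int]:
    emb = encode_query(text)
    hits: set[int] = set()
    for _ in range(n_embeddings):                 # emit several query vectors, one per facet
        scores = doc_embs @ emb
        hits.update(np.argsort(-scores)[:k].tolist())
        emb = next_embedding(emb)                 # autoregressive: next vector depends on the last
    return hits                                   # the union covers multiple distinct answers

print(len(multi_query_retrieve("climate change impacts and economic solutions")))
```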
Replicate Mouse Tracker
Shoutout to fofr and kylancodes for putting together a dedicated Replicate model that generates HTML with a face that follows the cursor.
VideoSwarm 0.5
Shoutout to Cerzi for releasing VideoSwarm 0.5, a mass video player for easy browsing of large video datasets.
WALT: Web Agents that Learn Tools
Salesforce built WALT to make browser agents stop clicking around like lost tourists. Instead, agents now reverse-engineer website features into structured APIs through a demonstrate-generate-validate loop, turning messy UI interactions into clean function calls like search(query).
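A structural sketch of that loop with hypothetical names and a fake browser trace; the real system drives an actual browser and uses a model to write the tool body.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A website capability distilled into a callable function, e.g. search(query)."""
    name: str
    signature: str
    run: Callable[..., str]

def demonstrate(site: str, task: str) -> list[str]:
    """Record one UI trajectory for the task (hypothetical browser driver)."""
    return [f"open {site}", "click #search-box", f"type '{task}'", "press Enter"]

def generate(trace: list[str]) -> Tool:
    """Abstract the recorded trajectory into a parameterized tool."""
    def search(query: str) -> str:
        return f"results page for '{query}'"      # would replay the trace with `query` substituted
    return Tool("search", "search(query: str) -> results", search)

def validate(tool: Tool) -> bool:
    """Re-run the tool on a held-out input and check it still returns results."""
    return "results" in tool.run("test query")

trace = demonstrate("example-shop.com", "wireless headphones")
tool = generate(trace)
if validate(tool):                                # only validated tools enter the agent's toolbox
    print(tool.run("usb-c hub"))
```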
AGILE: Agentic Jigsaw Interaction Learning
Researchers trained a VLM by making it solve jigsaw puzzles through trial and error. The model observes the puzzle, generates code to swap pieces, sees the result, and tries again. This simple interactive loop took accuracy from 9.5% to 82.8% and improved performance on nine other vision tasks by an average of 3.1%.
Image: Overview of AGILE.
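The interaction loop is simple enough to sketch. Here the VLM is replaced by a trivial heuristic, so this is only a stand-in for the observe-act-observe cycle, not the trained model.

```python
import random

def shuffled_puzzle(n: int = 9) -> list[int]:
    state = list(range(n))
    random.shuffle(state)
    return state

def propose_swap(state: list[int]) -> tuple[int, int]:
    """Stand-in for the VLM: look at the board and emit code like swap(i, j).
    A trivial heuristic picks the first misplaced piece and its target slot."""
    for i, piece in enumerate(state):
        if piece != i:
            return i, state.index(i)
    return 0, 0

state = shuffled_puzzle()
for step in range(50):                       # observe -> act -> observe the result -> try again
    if state == sorted(state):
        break
    i, j = propose_swap(state)
    state[i], state[j] = state[j], state[i]  # execute the generated swap and look at the new board
print(f"solved in {step} steps" if state == sorted(state) else "not solved")
```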
Sa2VA: Dense Grounded Understanding of Images and Videos
ByteDance combined SAM-2’s segmentation with LLaVA’s vision-language understanding into one unified model. Sa2VA handles both images and videos, producing pixel-precise masks for any object you ask about through conversational prompts.
UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Apple’s UltraCUA mixes low-level GUI actions with high-level API calls in one model. Train it with supervised learning, then online RL on hybrid action trajectories. The result beats baselines by 22% while running 11% faster.
Image: An overview of UltraCUA’s design. The agent adaptively switches between visual grounding and programmatic tool call, establishing the hybrid action mechanism.
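A minimal sketch of what a hybrid action space can look like, assuming hypothetical environment bindings; the point is that one policy can emit either kind of step inside the same trajectory.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class GuiAction:
    """Low-level visual grounding: act on screen coordinates."""
    kind: str          # "click", "type", "scroll"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class ApiCall:
    """High-level programmatic tool call exposed by the environment."""
    tool: str          # e.g. "open_app"
    kwargs: dict

Action = Union[GuiAction, ApiCall]

def execute(action: Action) -> str:
    """Single dispatcher so the policy can interleave both action types in one trajectory."""
    if isinstance(action, ApiCall):
        return f"called {action.tool}({action.kwargs})"
    return f"{action.kind} at ({action.x}, {action.y}) '{action.text}'"

trajectory: list[Action] = [
    ApiCall("open_app", {"name": "Calendar"}),    # fast, robust programmatic step
    GuiAction("click", x=412, y=180),             # fall back to pixels where no API exists
    GuiAction("type", text="Team sync"),
]
print([execute(a) for a in trajectory])
```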
Grasp Any Region (GAR): Precise Pixel-Level Understanding for MLLMs
GAR lets you ask detailed questions about any specific region of an image. It uses global context plus a region-of-interest replay mechanism to beat a 78B-parameter baseline with a much smaller model, and works zero-shot on video tasks.
DeepSeek OCR
DeepSeek’s OCR reads text in 100 languages and parses complex structures like charts and tables into HTML. It combines CLIP and SAM features for better grounding and a more efficient performance-to-vision-token ratio.
Tencent Hunyuan World 1.1 (WorldMirror)
Tencent open-sourced WorldMirror, a feed-forward 3D reconstruction model that now handles video-to-3D and multi-view-to-3D. It runs on a single GPU and delivers complete 3D attributes in one forward pass within seconds.
ByteDance Seed3D 1.0
ByteDance released Seed3D 1.0, which generates high-fidelity, simulation-ready 3D assets from a single image. The output works directly in physics simulations without additional processing.
HoloCine by Ant Group
HoloCine generates complete cinematic narratives from text prompts. The model maintains global consistency across multiple shots, creating coherent stories instead of disconnected clips.
Krea Realtime by Krea AI
Krea AI released a 14B autoregressive model that generates video at 11 fps on a single B200 GPU. It’s 10x larger than any open-source alternative and handles long-form video generation in real time.
Web Agents Learn to Think in Functions, Not Pixels
Salesforce and Apple both shipped the same insight this week: stop teaching agents to click buttons and start teaching them to extract functionality. WALT and UltraCUA both move from pixel-level automation to API-level understanding.

Here’s why pixel-clicking fails. You train an agent to navigate a website by clicking specific coordinates or finding specific UI elements. Then the site updates its design. Or it loads slower than expected. Or a popup appears. Your agent breaks. Every edge case becomes a new failure mode. You’re essentially teaching the agent to memorize a choreographed dance routine on a stage that keeps changing.

WALT and UltraCUA flip this. Instead of “click the search button at these coordinates,” the agent learns “this website has a search function that takes a query parameter.” It reverse-engineers the underlying capabilities. What can this site do? Search. Filter. Sort. Post. Each capability becomes a callable function. The agent demonstrates an action through the browser once, generates a candidate function from that trace, and validates it before adding the tool to its library.
Document AI Understands Structure, Not Just Text
DeepSeek OCR marks a shift from text extraction to document understanding. Reading characters in 100 languages is table stakes. The real advance is parsing charts and tables into HTML, understanding layout and structure, preserving semantic relationships.

Traditional OCR gives you a wall of text. You know what words are on the page but you’ve lost everything else. Which numbers belong to which row in a table? What’s a header versus a data point? How do the chart labels connect to the values? That context disappears.

DeepSeek OCR preserves it. The model doesn’t just read text; it understands document semantics. A financial table stays a table with intact relationships between columns and rows. A chart becomes structured data with labels mapped to values. A multi-column layout maintains its hierarchy.

This matters because most business-critical information lives in complex documents. Financial reports with nested tables. Scientific papers with methodology charts. Legal documents with clauses that reference each other. Structure-aware output keeps those relationships usable instead of flattening them into plain text.
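To make the difference concrete, here is a toy contrast between flat OCR output and structure-preserving output rendered as HTML. The table contents and markup are illustrative, not DeepSeek OCR’s actual output.

```python
# Flat OCR: the numbers survive, the relationships do not.
flat_text = "Revenue 2023 2024 Hardware 1.2 1.4 Services 0.8 1.1"

# Structure-aware parsing keeps rows, columns, and headers attached to their values.
parsed_table = {
    "headers": ["Revenue ($B)", "2023", "2024"],
    "rows": [["Hardware", "1.2", "1.4"], ["Services", "0.8", "1.1"]],
}

def to_html(table: dict) -> str:
    head = "".join(f"<th>{h}</th>" for h in table["headers"])
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in table["rows"]
    )
    return f"<table><tr>{head}</tr>{body}</table>"

print(to_html(parsed_table))   # downstream code can query columns instead of regexing a text blob
```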
Reasoning with Sampling: Your Base Model is Smarter Than You Think
Harvard researchers developed power sampling, an MCMC-based method that unlocks latent reasoning in base models without any training. Their approach matches or beats RL-finetuned models on MATH500, HumanEval, and GPQA while maintaining generation diversity.
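A simplified reading of the method as a sketch: treat the target distribution as p(x)^α, propose by resampling a suffix from the base model, and accept with a Metropolis-Hastings ratio. The `base_lm` helpers are hypothetical wrappers around an ordinary autoregressive model, and the proposal correction here is cruder than the paper’s.

```python
import math
import random

def power_sample(base_lm, prompt, alpha=4.0, steps=200, max_new_tokens=128):
    """Toy MCMC sketch targeting p(x)^alpha with no gradient updates.
    `base_lm.sample(prefix, n)` and `base_lm.logprob(prompt, continuation)` are
    hypothetical helpers; sequences are token lists."""
    current = base_lm.sample(prompt, max_new_tokens)          # initial draw from the base model
    current_lp = base_lm.logprob(prompt, current)
    for _ in range(steps):
        cut = random.randrange(len(current))                  # keep a prefix, resample the rest
        prefix = current[:cut]
        proposal = prefix + base_lm.sample(prompt + prefix, max_new_tokens - cut)
        proposal_lp = base_lm.logprob(prompt, proposal)
        # With a resample-from-p proposal and a shared prefix, the MH acceptance reduces to
        # min(1, (p(proposal)/p(current))^(alpha - 1)).
        accept_logp = (alpha - 1.0) * (proposal_lp - current_lp)
        if random.random() < math.exp(min(0.0, accept_logp)):
            current, current_lp = proposal, proposal_lp
    return current
```

Higher α sharpens the distribution toward high-likelihood (often more carefully reasoned) completions while still sampling, which is how the approach keeps diversity that greedy decoding or RL fine-tuning tends to lose.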
Ctrl-VI: Controllable Video Synthesis via Variational Inference
Stanford and MIT built Ctrl-VI, a video synthesis system that handles everything from text prompts to precise 4D object trajectories and camera paths. The framework uses variational inference with step-wise KL divergence minimization to produce controllable, diverse videos with 3D consistency.
FlashWorld: High-quality 3D Scene Generation within Seconds
Tencent, Xiamen University, and Fudan created FlashWorld, which generates high-quality 3D scenes from text or image prompts in 5-10 seconds. The model produces 3D Gaussian representations directly instead of going through multi-view intermediates, combining 2D diffusion quality with 3D geometric consistency.
Trace Anything: Representing Any Video in 4D via Trajectory Fields
ByteDance SEED released Trace Anything, which maps every pixel in a video to a continuous 3D trajectory using B-splines in a single forward pass. The model achieves state-of-the-art performance on trajectory estimation and point-tracking while being significantly faster than iterative methods.
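A quick sketch of why a B-spline trajectory field is attractive: a handful of control points per pixel replaces a per-frame position list and can be queried at any continuous time. The control points and counts below are invented for illustration, not the model’s output.

```python
import numpy as np
from scipy.interpolate import BSpline

# One pixel's 3D trajectory as a cubic B-spline over normalized time t in [0, 1].
degree = 3
control_points = np.array([[0.0,  0.0, 2.0],
                           [0.2,  0.1, 2.1],
                           [0.5,  0.1, 2.3],
                           [0.7,  0.0, 2.6],
                           [1.0, -0.1, 2.8]])
n_ctrl = len(control_points)
# Clamped uniform knot vector so the curve starts and ends at the first/last control point.
knots = np.concatenate([np.zeros(degree),
                        np.linspace(0, 1, n_ctrl - degree + 1),
                        np.ones(degree)])
trajectory = BSpline(knots, control_points, degree)

print(trajectory(0.5))                              # 3D position at any continuous time
print(trajectory(np.linspace(0, 1, 240)).shape)     # or densely resample, e.g. 240 steps
```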
VIST3A: Text-to-3D by Stitching a Multi-View Reconstruction Network to a Video Generator
ETH Zurich and Google unified a video generator with a 3D reconstruction model through model stitching, connecting a pretrained 3D foundation model into the video VAE’s latent space via lightweight linear mapping. The system generates 3D representations directly from text without needing 3D training labels.
Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures
Eyeline Labs enables multi-view character consistency and 3D camera control in video diffusion using 4D Gaussian Splatting and video relighting.
LabOS: The AI-XR Co-Scientist That Sees and Works With Humans
LabOS combines computational reasoning with physical experimentation through multimodal perception and XR-enabled human-AI collaboration.
VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents
VAGEN enhances multi-turn VLM agents by integrating explicit visual state reasoning into the model’s thinking process.
LAKAN: Landmark-Assisted Adaptive Kolmogorov-Arnold Network for Face Forgery Detection
LAKAN introduces a new network architecture for detecting face forgeries. Paper [https://arxiv.org/pdf/2510.00634]
Image: The LAKAN module leverages facial landmarks to generate adaptive parameters for KAN and is applied to downsampled features from four different stages of the image encoder through gating mechanisms.
Simple Projection Variants Improve ColBERT Performance
Mixedbread AI investigated architectural improvements to ColBERT’s projection layer for late-interaction models. Paper [https://arxiv.org/abs/2510.12327]
TOOLS & TECHNIQUES
Google Veo 3.1
Google released Veo 3.1 and Veo 3.1 Fast through the Gemini API with richer native audio, better cinematic style understanding, and enhanced image-to-video. New features include ingredient-based generation with up to 3 reference images, scene extension for longer videos, and first-and-last frame interpolation.
Anthropic Claude Haiku 4.5
Anthropic released Claude Haiku 4.5, delivering near-frontier performance at one-third the cost and twice the speed of models from five months ago. The model handles real-time, low-latency tasks and is available on Claude API, Amazon Bedrock, and Google Cloud’s Vertex AI.
Baidu PaddleOCR VL 0.9B
Baidu released a 0.9B parameter multilingual VLM for OCR tasks.
Alibaba Qwen3-VL-4B/8B
Alibaba released Qwen3-VL models in Instruct and Thinking variants. More options in the 4B-8B parameter range for vision-language tasks.
ImagenWorld
Google released ImagenWorld, a large-scale benchmark for image generation and editing that makes model failures more visible. Better benchmarks expose where generation models actually break.
TRENDS & PREDICTIONS
1. Continuous Representations Replace Discrete Ones
Trace Anything doesn’t compute frame-to-frame correspondences. It learns continuous 3D trajectories. FlashWorld doesn’t generate multiple views. It produces 3D Gaussians directly. Veo 3.1 doesn’t concatenate clips. It interpolates smooth transitions. The shift from discrete to continuous representations runs through every major paper this week. This matters because continuous representations are more compact, more queryable, and more composable. You can sample at any resolution. You can query based on derivatives and integrals. You can blend and interpolate smoothly. Discrete representations lock you into fixed sampling rates and make interpolation messy. Continuous ones give you infinite resolution and natural interpolation. The move to continuous isn’t just cleaner math. It’s fundamentally more powerful.
2. Composable Control Wins
Ctrl-VI lets you combine text prompts with 4D trajectories and camera paths. Veo 3.1 adds reference images and scene extension. VIST3A stitches models together with linear mappings. The pattern is clear: systems that combine multiple control signals beat single-mode approaches. You don’t choose between high-level creativity and low-level precision anymore. You get both. This matters because creative workflows are messy. You need broad strokes and fine details at different stages. Composable control means you can start with a text prompt, refine with reference images, adjust specific object paths, and modify camera movement without switching tools.
COMMUNITY + SHOUTOUTS
Builder of the week: Real-time head pose estimation for perspective correction. Clean implementation, practical use case.
ModernVBERT: Towards Smaller Visual Document Retrievers
EPFL researchers built a 250M parameter model that matches systems 10x larger on document retrieval. They discovered bidirectional attention beats causal attention by +10.6 nDCG@5 for retrieval, and that mixing text-only pairs with image-text during training fixes data scarcity through cross-modal transfer.
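Late-interaction document retrievers in this family score with ColBERT-style MaxSim between query tokens and document patch embeddings. A minimal version, with illustrative shapes:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: each query token finds its best-matching
    document token, and the per-token maxima are summed.
    query_tokens: (Q, d), doc_tokens: (D, d), both L2-normalized."""
    sims = query_tokens @ doc_tokens.T        # (Q, D) cosine similarities
    return float(sims.max(axis=1).sum())      # max over doc tokens, sum over query tokens

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 128));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(900, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)  # ~900 page patches
print(maxsim_score(q, d))
```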
DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval
DocPruner slashes storage for visual document retrieval by 50-60% without hurting performance. The system analyzes attention scores to identify which patches matter, then adapts pruning intensity per document: aggressively cutting sparse pages while preserving dense ones.
Image: Comparison of the OCR-based (a) and LVLM-based (b) paradigms for visual document retrieval with DocPruner (c), a framework that adaptively prunes patch-level embeddings for diverse document types.
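Not the paper’s exact criterion, but a sketch of the shape of the idea: score patches by attention mass, adapt the keep ratio to how concentrated that mass is, and store only the surviving vectors. All thresholds here are illustrative.

```python
import numpy as np

def prune_patches(patch_embs: np.ndarray, attn_scores: np.ndarray,
                  base_keep: float = 0.45) -> np.ndarray:
    """Keep the patches that carry the most attention mass.
    The keep ratio adapts per document: sparse pages (attention concentrated on a
    few patches) are pruned harder than dense ones."""
    probs = attn_scores / attn_scores.sum()
    # Normalized entropy as a density proxy: low entropy -> sparse page -> keep fewer patches.
    entropy = -(probs * np.log(probs + 1e-12)).sum() / np.log(len(probs))
    keep_ratio = np.clip(base_keep * (0.5 + entropy), 0.1, 1.0)
    k = max(1, int(len(patch_embs) * keep_ratio))
    keep_idx = np.argsort(-attn_scores)[:k]
    return patch_embs[np.sort(keep_idx)]

rng = np.random.default_rng(0)
embs = rng.normal(size=(1024, 128))      # multi-vector page representation
attn = rng.gamma(shape=0.3, size=1024)   # skewed scores: most mass on a few patches
print(prune_patches(embs, attn).shape)   # far fewer vectors to store per page
```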
LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks
LEAML adapts multimodal models to specialized domains like medical imaging using minimal labeled data plus unlabeled samples. The framework turns abundant unlabeled visual content into useful training signal when expert annotations are expensive or impossible to obtain.
Image: Overview of the proposed two-stage LEAML framework for OOD VQA adaptation. In Pseudo QA Generation, the QA Generator is trained using a small set of labeled question-answer pairs and then used to generate pseudo QA pairs for a large collection of unlabeled images. In OOD VQA Finetuning, the VQA model is fine-tuned with both the original labeled data and the produced pseudo QA pairs, enabling label-efficient adaptation to out-of-distribution visual question answering.
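A bare-bones sketch of the two-stage data flow. Everything here is a trivial, hypothetical stand-in for the QA generator and the VQA model; only the stage ordering follows the description above.

```python
def train_qa_generator(labeled):
    """Stand-in: a real system would fine-tune a generator on the few labeled QA pairs."""
    template_q = labeled[0]["question"]
    def generate(image_id):
        return {"image": image_id, "question": template_q, "answer": "pseudo-" + image_id}
    return generate

def finetune_vqa(pairs):
    """Stand-in for supervised fine-tuning on the combined set."""
    return f"model fine-tuned on {len(pairs)} QA pairs"

labeled = [{"image": "xray_001", "question": "Is there a fracture?", "answer": "no"}]
unlabeled = [f"xray_{i:03d}" for i in range(2, 100)]      # abundant images, no annotations

generate = train_qa_generator(labeled)                    # stage 1: pseudo QA generation
pseudo = [generate(img) for img in unlabeled]
print(finetune_vqa(labeled + pseudo))                     # stage 2: OOD VQA fine-tuning
```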
Coevolutionary Continuous Discrete Diffusion
CCDD enables joint generation across continuous (images, audio) and discrete (text) modalities in one unified process. The coevolutionary approach lets models reason across different representation types simultaneously rather than processing them separately.
GraphSearch: An Agentic Deep Searching Workflow
DataArc’s GraphSearch fixes GraphRAG’s shallow retrieval problem through six-stage deep searching: decomposition, refinement, grounding, drafting, verification, and expansion. The dual-channel approach queries both text chunks and graph structure simultaneously, beating single-round GraphRAG on all benchmarks.
Image: Comparison of using graph data only, text data only, or all data as commonly adopted in GraphRAG approaches. The metric is SubEM. The contribution of retrieved graph data is marginal.
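A sketch of the six stages as one function. The stage prompts and the `text_index`, `graph_index`, and `llm` handles are hypothetical, but the ordering and the dual-channel retrieval follow the description above.

```python
def graph_search(question, text_index, graph_index, llm):
    """Sketch of the six-stage workflow; stage bodies are placeholders."""
    sub_questions = llm(f"Decompose into sub-questions: {question}")       # 1. decomposition
    sub_questions = llm(f"Refine and deduplicate: {sub_questions}")        # 2. refinement
    evidence = []
    for sq in sub_questions:            # assume the llm handle returns a list of sub-questions
        evidence += text_index.search(sq)                                  # 3. grounding,
        evidence += graph_index.neighbors(sq)                              #    dual-channel
    draft = llm(f"Draft an answer to '{question}' from: {evidence}")       # 4. drafting
    issues = llm(f"Verify the draft against the evidence: {draft}")        # 5. verification
    if issues:
        evidence += text_index.search(issues)                              # 6. expansion
        draft = llm(f"Revise the draft with the new evidence: {evidence}")
    return draft
```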
Other New Notable Research:
Fathom-DeepResearch delivers evidence-based web investigation with two 4B models achieving SOTA among open-weights through DuetQA dataset and RAPO optimization.
OpenAI Sora 2
Sora 2 ships with rightsholder controls and revenue sharing. Sam Altman says users are generating way more content than expected, so they’re building opt-in controls where creators get paid when their characters appear in user-generated content—basically “interactive fan fiction” that pays the original creators.
Anthropic Claude Sonnet 4.5
Claude Sonnet 4.5 breaks records: 77.2% on SWE-bench, 61.4% on OSWorld, and can code for 30+ hours straight. Ships with checkpoints in Claude Code, VS Code extension, memory tools for longer agent runs, and the Claude Agent SDK powering it all.
Alibaba Qwen3-VL-30B-A3B-Instruct
Alibaba’s Qwen3-VL uses just 3B active parameters to match GPT-5-Mini and Claude4-Sonnet on STEM, VQA, OCR, video, and agent tasks. Available in standard and FP8 versions, plus a massive 235B-A22B variant for maximum capability.
Tencent HunyuanImage-3.0
HunyuanImage-3.0 improves text-to-image generation across the board: better prompt understanding, higher quality, more consistent styles. Handles complex scenes, detailed characters, and maintains coherence across artistic styles.
Ovi: Twin Backbone Cross-Modal Fusion
Ovi generates synchronized audio and video simultaneously using a twin-backbone architecture. It creates 5-second 720×720 videos at 24 FPS with matched audio, supporting 9:16, 16:9, and 1:1 aspect ratios from text or text+image inputs.
Other Notable New Tools:
Code2Video generates educational videos from code for automated programming tutorials.