Multimodal AI - Weekly

Last Week In Multimodal AI #39: MLLMs Are Not Taking Over... Yet

Your Weekly Multimodal AI Roundup - Dec 22, 2025 - Jan 4, 2026

January 5, 2026 · 33 Resources
Tools & Techniques

Tencent HY-Motion 1.0

Tencent released a billion-parameter text-to-motion model using a Diffusion Transformer architecture and flow matching. The model generates 3D character animations from text prompts.


Diffusion Knows Transparency (DKT)

DKT repurposes video diffusion models for transparent object depth and normal estimation. It achieves zero-shot SOTA on ClearPose/DREDS benchmarks, runs at 0.17s per frame, and maintains temporal consistency across videos.


LongVideoAgent

A multi-agent framework where a master LLM coordinates a grounding agent for segment localization and a vision agent for observation extraction. The system uses reinforcement learning to optimize multi-agent cooperation with step limits.
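
The paper doesn't ship code here, but the coordination pattern is easy to picture. A minimal sketch, assuming hypothetical stand-ins for all three components (none of this is the paper's API):

```python
# Hypothetical sketch of the coordination loop; every component is an
# illustrative stand-in, not LongVideoAgent's actual prompts or models.

def ground_segment(query: str, video: str) -> tuple[float, float]:
    """Grounding agent: localize the video span relevant to a query."""
    return (0.0, 30.0)  # stand-in for a temporal-grounding model call

def observe(video: str, span: tuple[float, float]) -> str:
    """Vision agent: turn a localized span into a textual observation."""
    return f"observation for {span}"  # stand-in for a VLM call

def master_step(question: str, notes: list[str]) -> str:
    """Master LLM: pick the next sub-query or emit a final answer."""
    if len(notes) >= 2:                      # stand-in stopping heuristic
        return "ANSWER: " + "; ".join(notes)
    return f"sub-query {len(notes) + 1} for: {question}"

def answer(question: str, video: str, max_steps: int = 8) -> str:
    notes: list[str] = []
    for _ in range(max_steps):        # RL tunes cooperation under this budget
        action = master_step(question, notes)
        if action.startswith("ANSWER:"):
            return action.removeprefix("ANSWER: ")
        span = ground_segment(action, video)   # master drives the grounding agent
        notes.append(observe(video, span))     # vision agent reports back
    return master_step(question, notes)

print(answer("Why does the chef stop cooking?", "video.mp4"))
```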


Qwen-Image-2512

Qwen’s new text-to-image model delivers more realistic humans, finer natural textures, and stronger text rendering. It sets a new state of the art for image generation quality.
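
If the release follows the earlier Qwen-Image checkpoints, usage through diffusers should look roughly like this; the model id below is a guess based on prior releases, so check the official card:

```python
# Sketch of text-to-image inference via diffusers; the model id is an
# assumption based on earlier Qwen-Image releases, not confirmed for 2512.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",            # assumed id; the 2512 release may differ
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="a street market at dusk, legible neon signage reading 'OPEN'",
    num_inference_steps=50,
).images[0]
image.save("qwen_image_sample.png")
```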


Yume-1.5

A text-controlled interactive world generation model that creates explorable 3D environments. Users can navigate and interact with generated spaces in real time.


TwinFlow

Enables one-step generation on large models using self-adversarial flows. The approach eliminates iterative sampling while maintaining output quality.
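
For context on what one-step buys you: a standard flow-matching sampler integrates a learned velocity field over dozens of Euler steps, while a one-step method collapses the whole solve into a single network call. A toy contrast, with a placeholder velocity field standing in for the learned model:

```python
# Toy contrast between iterative flow-matching sampling and the one-step
# shortcut TwinFlow-style methods target; `v` is a stand-in velocity field.
import torch

v = lambda x, t: -x  # placeholder for a learned velocity net v(x_t, t)

def sample_euler(x0: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Standard multi-step sampling: integrate dx/dt = v(x, t) with Euler."""
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)     # one network call per step
    return x

def sample_one_step(x0: torch.Tensor) -> torch.Tensor:
    """One-step sampling: a single call replaces the whole ODE solve."""
    return x0 + v(x0, 0.0)

x0 = torch.randn(4, 8)
print(sample_euler(x0).shape, sample_one_step(x0).shape)
```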


Stable Video Infinite 2.0 Pro

The new version launched with immediate ComfyUI wrapper support from Kijai. Models are already available for download and integration.


Soprano

An ultra-lightweight TTS model that generates 10 hours of 32kHz audio in under 20 seconds. It streams with sub-15ms latency using only 80M parameters and less than 1GB VRAM.
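
A quick sanity check on what those numbers imply:

```python
# Back-of-envelope check on the claimed throughput.
audio_seconds = 10 * 3600            # 10 hours of generated audio
wall_seconds = 20                    # claimed generation time
print(audio_seconds / wall_seconds)  # ≈ 1800x faster than real time
```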


Wan-NVFP4

A fast video model claiming 28x faster render speeds than previous versions. Released by lightx2v on Hugging Face.


JavisGPT

A unified multi-modal LLM for sounding-video comprehension and generation. The model handles video analysis and audio-visual synthesis in one framework.


Dream-VL & Dream-VLA

Open vision-language and vision-language-action models using a diffusion language model backbone. Both models integrate visual understanding with either language or robotic action outputs.


HyperCLOVA X SEED Omni 8B

A unified multimodal model handling text, vision, audio, and video inputs with text, image, and audio outputs.


AMD ROCm

AMD published a guide for accelerating multimodal inference in vLLM using batch-level dynamic programming switches.
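
For reference, the multimodal vLLM workload the guide targets looks roughly like the sketch below; the model id is illustrative, the image-placeholder token is model-specific, and the ROCm-specific switches from the guide are not reproduced here:

```python
# Minimal shape of multimodal inference in vLLM; model id is illustrative
# and the prompt's image placeholder token varies by model. ROCm-specific
# tuning from AMD's guide is omitted.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")  # any vLLM-supported VLM
prompt = "<image>\nDescribe the image."        # placeholder token varies by model

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.open("photo.jpg")}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```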

Research Highlights

Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

MLLMs like Qwen2-VL and PaliGemma 2 fail at zero-shot multimodal retrieval despite excelling at generation. Using sparse autoencoders, researchers identified three limitations: training objectives optimize for generation, not retrieval; evaluation focuses on generative tasks; and autoregressive architectures compute poor similarity scores.
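
The mismatch is easiest to see in how zero-shot retrieval with an autoregressive model is typically scored: pool hidden states into one vector and rank by cosine similarity, a quantity next-token training never optimizes. A schematic of that setup (the model and pooling choice are illustrative, not the paper's exact protocol):

```python
# Schematic of zero-shot retrieval with an autoregressive LM: mean-pool
# hidden states into an embedding and rank by cosine similarity. Nothing
# in next-token training shapes this score, which is the mismatch the
# paper points to. Model id and pooling are illustrative choices.
import torch
from transformers import AutoModel, AutoTokenizer

name = "Qwen/Qwen2-0.5B"  # small stand-in; the paper studies Qwen2-VL etc.
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt")
    hidden = model(**batch).last_hidden_state  # (1, seq, dim)
    return hidden.mean(dim=1).squeeze(0)       # mean pooling over tokens

query, doc = embed("a dog on a beach"), embed("puppy playing in the sand")
print(float(torch.cosine_similarity(query, doc, dim=0)))
```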


ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling

ReaSeq solves two problems in recommender systems: knowledge poverty in ID-based representations and systemic blindness to off-platform user interests. The framework uses reasoning to incorporate world knowledge into sequential modeling.


MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation

MegaRAG automatically constructs multimodal knowledge graphs integrating text, visual, and spatial information from documents. It uses a two-round, page-based approach where LLMs extract entities in parallel, then refine the graph by retrieving relevant subgraphs.
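
A rough outline of the two-round pipeline as described, with trivial stand-ins for the LLM calls involved:

```python
# Hypothetical outline of MegaRAG-style two-round graph construction;
# function bodies are trivial stand-ins for the LLM calls the paper uses.
from concurrent.futures import ThreadPoolExecutor

def extract_entities(page: str) -> dict:
    """Round 1: an LLM extracts entities/relations from a single page."""
    return {page[:12]: ["entity-stub"]}           # stand-in extraction

def retrieve_subgraph(graph: dict, page_graph: dict) -> dict:
    """Fetch subgraphs relevant to one page's entities."""
    return {k: graph[k] for k in list(graph)[:2]}  # stand-in retrieval

def refine(page_graph: dict, context: dict) -> dict:
    """Round 2: an LLM reconciles a page graph with retrieved context."""
    return {**context, **page_graph}              # stand-in refinement

def build_graph(pages: list[str]) -> dict:
    with ThreadPoolExecutor() as pool:    # round 1 runs per page, in parallel
        page_graphs = list(pool.map(extract_entities, pages))
    graph: dict = {}
    for pg in page_graphs:                # merge the first-round output
        graph.update(pg)
    for pg in page_graphs:                # round 2: targeted refinement
        graph.update(refine(pg, retrieve_subgraph(graph, pg)))
    return graph

print(build_graph(["Page one text...", "Page two text..."]))
```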


Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

This framework enhances prompt learning by decoupling knowledge from memorization. The approach addresses instability in low-resource settings where parametric models overfit to shallow patterns.
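
The general recipe the framework builds on, sketched with a toy lexical retriever (this shows the family of methods, not the paper's formulation):

```python
# Generic shape of retrieval-augmented prompting: knowledge lives in an
# external store and is retrieved per query, rather than being memorized
# by the tuned prompt. Toy lexical retriever; not the paper's method.
import string

docs = [
    "The capybara is the largest living rodent.",
    "Mount Fuji is the highest peak in Japan.",
]

def tokens(text: str) -> set[str]:
    return {w.strip(string.punctuation).lower() for w in text.split()}

def retrieve(query: str, k: int = 1) -> list[str]:
    q = tokens(query)
    return sorted(docs, key=lambda d: -len(q & tokens(d)))[:k]

query = "How large can a capybara grow?"
context = "\n".join(retrieve(query))          # retrieved knowledge
print(f"Context:\n{context}\n\nQ: {query}\nA:")
```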


Latent Implicit Visual Reasoning

LIVR discovers visual reasoning tokens without explicit supervision. The method outperforms approaches requiring costly annotations.


Geometric Memory in Sequence Models

Google DeepMind identified geometric memory in deep sequence models, where embeddings encode global relationships between all entities including those never co-occurring in training. This contrasts with associative memory’s brute-force lookup approach.
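
A toy way to see the distinction, with positions on a line standing in for learned embeddings:

```python
# Toy contrast: associative memory stores seen pairs; geometric memory
# places every entity in a shared space, so unseen pairs are still ordered.
positions = {"A": 0.0, "B": 1.0, "C": 2.0}   # stand-in learned embeddings
seen = {("A", "B"), ("B", "C")}              # co-occurrences in training

def assoc_left_of(x: str, y: str) -> bool:   # brute-force lookup
    return (x, y) in seen

def geom_left_of(x: str, y: str) -> bool:    # read off the global geometry
    return positions[x] < positions[y]

print(assoc_left_of("A", "C"))  # False: the pair never co-occurred
print(geom_left_of("A", "C"))   # True: the relation follows from geometry
```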


Step-DeepResearch

A 32B-parameter research agent that matches OpenAI's and Gemini's Deep Research agents through atomic capability training. It decomposes research into planning, information gathering, verification, and writing, achieving 61.42 on ResearchRubrics.
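
The four-stage decomposition maps onto a simple pipeline; the stage functions below are illustrative stubs, not the trained capabilities:

```python
# Hypothetical outline of the plan → gather → verify → write decomposition;
# each stage stands in for a separately trained "atomic" capability.
def plan(question: str) -> list[str]:
    return [f"sub-question 1 of: {question}", f"sub-question 2 of: {question}"]

def gather(sub_questions: list[str]) -> list[str]:
    return [f"evidence for ({q})" for q in sub_questions]  # search/browse stub

def verify(evidence: list[str]) -> list[str]:
    return [e for e in evidence if e]                      # claim-checking stub

def write(question: str, facts: list[str]) -> str:
    return f"Report on '{question}':\n- " + "\n- ".join(facts)

def deep_research(question: str) -> str:
    return write(question, verify(gather(plan(question))))

print(deep_research("How do MLLMs handle long videos?"))
```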


SpatialTree

A 4-level cognitive hierarchy mapping spatial abilities in MLLMs from perception to agentic competence. Benchmarks 27 sub-abilities across 16 models and reveals transfer patterns.


FlowBlending

Stage-aware multi-model sampling for fast and high-fidelity video generation. The approach optimizes different models for different stages of the generation process.
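
In its simplest form, stage-aware sampling is a timestep-dependent model switch; a toy version with placeholder denoisers (the actual schedule and model assignment are the paper's contribution, not shown here):

```python
# Toy stage-aware sampler: route early (structure-heavy) steps to one
# model and late (detail-heavy) steps to another. Models are placeholders.
import torch

heavy = lambda x, t: -x          # stand-in: high-fidelity model
light = lambda x, t: -0.5 * x    # stand-in: fast model

def sample(x: torch.Tensor, steps: int = 50, switch: float = 0.6) -> torch.Tensor:
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        model = heavy if t < switch else light   # stage-aware routing
        x = x + dt * model(x, t)
    return x

print(sample(torch.randn(4, 8)).shape)
```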


SpaceTimePilot

Adobe’s video diffusion model disentangles space and time for controllable rendering. From a single input video, it enables independent control of camera viewpoint and motion for bullet-time, slow motion, and mixed trajectories.


HiStream

Meta’s autoregressive framework for 1080p video generation eliminates spatial, temporal, and timestep redundancy. HiStream achieves SOTA quality with up to 107.5x speedup.


InsertAnywhere

Bridges 4D scene geometry and diffusion models for realistic video object insertion. The method maintains spatial and temporal consistency across frames.


Robust-R1

A framework making multimodal models robust to visual degradations through explicit degradation-aware reasoning chains. Achieves SOTA robustness on R-Bench while maintaining interpretability.


StoryMem

ByteDance’s multi-shot long video storytelling framework with memory. The system maintains narrative consistency across extended video sequences.


Spatia

Microsoft’s video generation system maintains a 3D scene point cloud as persistent spatial memory. Enables long-horizon, spatially consistent video generation with explicit camera control and 3D-aware editing.


DiffThinker

Enables generative multimodal reasoning with diffusion models. The approach integrates reasoning capabilities directly into the diffusion generation process.

Community

ComfyUI Segmentation Agent

Adam Barbato released an LLM-based character segmentation agent for ComfyUI using SAM 3.


CosyVoice 3 ComfyUI

Machine Delusion released a voice cloning node pack featuring CosyVoice 3 for ComfyUI.


SAM3 Video Tracking in X-AnyLabeling

Important_Priority76 integrated SAM3 video object tracking into X-AnyLabeling for easy annotation workflows.


AI Reskinning

Ege Berkina demonstrated how the barriers to AI-based reskinning have nearly disappeared with current tools.
