T5Gemma 2
Google released T5Gemma 2, the next generation of encoder-decoder models. The architecture combines bidirectional understanding with flexible text generation.
Perception Encoder Audiovisual (PE-AV)
Meta released PE-AV, the technical engine behind SAM Audio’s audio separation capabilities. The model processes both visual and audio information to isolate individual sound sources.
MiMo-V2-Flash
Xiaomi released MiMo-V2-Flash, optimized for speed in real-time applications. The model sacrifices some accuracy for dramatic latency reductions.
TurboDiffusion
TurboDiffusion accelerates video diffusion models by 100-205 times through architectural optimizations. The speedup comes from reducing redundant computations without quality loss.
Qwen-Image-Layered
Qwen-Image-Layered decomposes images into multiple RGBA layers that can be independently edited. Each layer isolates specific semantic or structural components while maintaining visual coherence.
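For readers who want the mechanics: once an image is split into RGBA layers, recombining them is plain alpha-over compositing, which is why editing one layer leaves the rest untouched. The sketch below is not Qwen’s code; the layer count, sizes, and colors are made up.

```python
import numpy as np

def composite_rgba_layers(layers):
    """Alpha-over composite a back-to-front list of RGBA layers (H, W, 4), values in [0, 1]."""
    h, w, _ = layers[0].shape
    out_rgb = np.zeros((h, w, 3))   # accumulates premultiplied color
    out_a = np.zeros((h, w, 1))
    for layer in layers:            # back to front
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb = rgb * a + out_rgb * (1.0 - a)
        out_a = a + out_a * (1.0 - a)
    # With an opaque bottom layer the final alpha is 1, so this is a standard image.
    return np.concatenate([out_rgb, out_a], axis=-1)

# Edit one layer (e.g. recolor the subject) without touching the others, then recomposite.
background = np.ones((64, 64, 4))                               # opaque white canvas
subject = np.zeros((64, 64, 4)); subject[16:48, 16:48] = [1.0, 0.2, 0.2, 1.0]
image = composite_rgba_layers([background, subject])
print(image.shape)
```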
N3D-VLM
N3D-VLM grounds spatial reasoning in native 3D representations rather than 2D projections. The model understands depth, distance, and spatial relationships directly.
MemFlow
MemFlow maintains adaptive memory for long streaming videos, deciding which frames to remember and which to discard. The system balances memory efficiency with video understanding quality.
WorldPlay
Tencent’s WorldPlay generates interactive 3D worlds with long-term geometric consistency. The model maintains spatial relationships across extended video sequences, allowing persistent interaction with generated environments.
LongVie 2
LongVie 2 generates 5-minute continuous videos with controllable elements and consistent geometry. The model handles multiple modalities and maintains coherence across thousands of frames.
FoundationMotion
FoundationMotion labels and analyzes spatial movement in videos automatically. The system identifies motion patterns and spatial trajectories without manual annotation.
Generative Refocusing
Generative Refocusing controls depth of field in existing images, simulating camera focus changes after capture. The model infers 3D scene structure to generate realistic blur patterns.
StereoPilot
StereoPilot converts 2D videos to stereo 3D through learned generative priors. The system produces depth-aware conversions suitable for VR headsets.
KV-Tracker: Real-Time Pose Tracking with Transformers
KV-Tracker achieves real-time tracking at 30 FPS without any training. The approach uses transformer key-value pairs to track objects and scenes across frames.
DeContext: Protecting Images from Unwanted In-Context Edits
DeContext adds imperceptible perturbations that prevent DiT models like FLUX and Qwen-Image from making unwanted edits. The protection preserves visual quality while blocking manipulation attempts.
EgoX: Generate Immersive First-Person Video from Any Third-Person Clip
EgoX transforms third-person videos into realistic first-person perspectives using video diffusion. The framework from KAIST AI and Seoul National University maintains spatial and temporal coherence during the transformation.
MMGR: Multi-Modal Generative Reasoning
MMGR benchmarks reveal systematic reasoning failures in GPT-4o and other leading multimodal models. The evaluation exposes gaps between perception and logical inference.
Image: MMGR Overview
KlingAvatar 2.0 and Kling-Omni Technical Report
KlingAvatar 2.0 generates high-fidelity avatar videos through a spatio-temporal cascade framework. Kling-Omni provides a generalist framework for multimodal video generation with a Co-Reasoning Director that fuses instructions across modalities.
Step-GUI Technical Report
Step-GUI introduces a self-evolving pipeline for GUI automation. The system reaches state-of-the-art on AndroidWorld and OSWorld benchmarks through iterative improvement.
ReFusion
ReFusion combines diffusion models with parallel autoregressive decoding. The architecture bridges autoregressive models and diffusion models for faster text generation.
DEER
DEER uses diffusion models to draft content and autoregressive models to verify it. The two-stage approach balances generation quality with computational efficiency.
IC-Effect
IC-Effect applies video effects through in-context learning without fine-tuning. The system learns effect patterns from examples and applies them to new videos.
Flow Map Trajectory Tilting
Flow Map Trajectory Tilting improves diffusion model outputs at test time using flow maps. The technique adjusts generation trajectories without retraining.
UniVA: Universal Video Agent
UniVA works like LEGO for video AI: you plug in whatever tools you need. The demo shows it tracking objects, editing footage, and understanding complex scenes all in one system.
Phys2Real: Sim-to-Real Transfer
This method trains robots in simulation then transfers that knowledge to the real world by accounting for real-world messiness. The robot learns what it doesn’t know and adapts accordingly.
Pelican-VL 1.0: The Embodied Intelligence Brain
Beijing’s Pelican-VL converts what robots see directly into 3D movement commands. Their DPPO training method works like human practice: make mistakes, reflect, improve.
OmniVinci: Omni-Modal Understanding LLM
NVIDIA’s OmniVinci processes vision, audio, and language in one unified space. It beats Qwen2.5-Omni by 19% while using 6x less training data.
Teaching AI to See the World More Like We Do
DeepMind used an “odd-one-out” test to show how differently AI sees things compared to humans. Their three-step alignment method fixes this, making AI group concepts the way you naturally would.
Image: Diagram of their three-step model-alignment method.
SIMA 2
Google’s SIMA 2 plays games with you, learns through trial and error, and actually reasons about what to do. Talk to it through text, voice, or images; it understands high-level goals and figures out how to achieve them.
Depth Anything 3 (DA3)
DA3 generates depth maps from regular images with unprecedented accuracy. The demo shows it working on everything from selfies to satellite imagery.
Marble
World Labs’ Marble creates persistent 3D worlds from a single image, video, or text prompt. Upload a photo of your living room, get a walkable 3D space.
Holo2
H-Company’s Holo2 leads all computer-use benchmarks across web, desktop, and mobile. Drop it into your existing Holo setup and it works immediately on Ubuntu, Android, or Chrome.
Image: Web Surfing with Holo2
Music Flamingo
NVIDIA’s Music Flamingo understands full songs, not just clips. It analyzes music structure, identifies instruments, and reasons about compositions.
The Perception-to-Action Gap Closes
This week shows three distinct approaches to the same problem: how do you get AI to actually do things, not just understand them?

Pelican-VL tackles this for robotics with its DPPO training method: the model practices tasks, fails, analyzes what went wrong, then adjusts. Think of it like teaching a robot to play piano: it doesn’t just memorize finger positions, it learns the relationship between what it sees and how to move. The Beijing team tested this on real humanoid robots doing manipulation tasks, and the results show genuine spatial reasoning emerging from visual input alone.

SIMA 2 solves this in virtual environments. Google’s agent doesn’t just execute commands; it maintains persistent goals across gaming sessions, reasons about cause and effect, and learns new skills without being explicitly programmed. When you tell it “build a house,” it figures out it needs to gather materials first, find a good location, and plan the structure. This kind of multi-step reasoning with environment feedback is where the perception-to-action gap starts to close.
dLLM
Zhanhui Zhou turned BERT into a chatbot using diffusion. Yes, you read that right—BERT can now chat.
Next Scene LoRA
OdinLovis built a LoRA that adds camera movement to image generation. Type “Next Scene” and watch your static image become a cinematic sequence.
Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
Single-vector retrieval breaks when queries have multiple distinct answers. The Autoregressive Multi-Embedding Retriever (AMER) generates a sequence of query embeddings instead of one, capturing diverse relevant documents for ambiguous or list-based queries.
Image: The model takes as input the target document embedding (order decided randomly) or the embedding predicted at the previous step, and outputs the next embedding. During inference, AMER predicts the first embedding after seeing the query text, then outputs multiple query embeddings autoregressively.
FractalForensics: Proactive Deepfake Detection and Localization
This detector embeds fractal watermarks into images before they’re shared online. The watermarks survive normal edits but break under AI manipulation, showing you exactly where an image was altered.
Image: Workflow of the proposed FractalForensics.
Cambrian-S: Advancing Spatial Supersensing in Video
NYU and Stanford researchers built models that anticipate and organize complex visual experiences in long videos. The system selects relevant information and reasons about relationships between objects and events over time.
The Underappreciated Power of Vision Models for Graph Structural Understanding
Vision models outperform graph neural networks at understanding global graph properties. The GraphAbstract benchmark shows that vision models intuitively grasp overall structure better than specialized GNNs.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Models improve reasoning on both vision and text tasks by generating video sequences. The Video Thinking Benchmark shows that video generation helps models explore possibilities and think dynamically.
OlmoEarth-v1-Large
AllenAI released a foundation model for remote sensing trained on Sentinel and Landsat satellite data. OlmoEarth turns Earth data into insights within hours using ready-made infrastructure for both image and time series tasks.
BindWeave
ByteDance’s model for subject-consistent video generation uses cross-modal integration to keep subjects consistent across multiple shots. BindWeave already works in ComfyUI.
GEN-0
GeneralistAI built a 10B+ foundation model for robots with Harmonic Reasoning architecture. GEN-0 trains on 270,000+ hours of dexterous data to think and act simultaneously.
Step-Audio-EditX
StepFun open-sourced the first LLM-grade audio editing model. Control emotion, speaking style, breaths, laughs, and sighs through text prompts in a 3B-parameter model that runs on a single GPU.
Image: An overview of the architecture of Step-Audio-EditX
Rolling Forcing
This technique generates multi-minute streaming videos in real-time on a single GPU. Rolling Forcing denoises multiple frames jointly and anchors context with attention sinks for temporal consistency.
Retrieval Gets Smarter
Search broke when you started asking it to do two things at once. AMER fixes this by generating multiple query embeddings instead of forcing everything through a single vector.

Here’s what that means. When you search for “climate change impacts and economic solutions,” a single-vector system picks one interpretation and misses the other. AMER showed 4x better performance than single-embedding models on synthetic data where queries had multiple distinct answers (arXiv [https://arxiv.org/html/2511.02770]). The gains get bigger when your target documents are conceptually distant from each other.

The technique works by predicting query embeddings autoregressively. Each embedding captures a different facet of what you want. Think of it as asking the question from multiple angles simultaneously rather than hoping one angle catches everything. On real-world datasets, AMER showed 4-21% gains on average, but the improvements jumped to 5-144% on queries where target documents formed distinct clusters.
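A toy sketch of the idea, not the authors’ implementation: the query encoder and the autoregressive step below are random stand-ins, but the control flow shows how emitting several query vectors and taking the union of their nearest neighbors covers multiple distinct answers.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 64))                       # toy document index
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def encode_query(text: str) -> np.ndarray:
    """Stand-in for a real query encoder (hypothetical)."""
    vec = rng.normal(size=64)
    return vec / np.linalg.norm(vec)

def next_embedding(prev: np.ndarray) -> np.ndarray:
    """Stand-in for the autoregressive step: predict another facet of the query
    conditioned on the embedding emitted so far (a hypothetical learned head)."""
    vec = prev + 0.5 * rng.normal(size=prev.shape)
    return vec / np.linalg.norm(vec)

def multi_query_retrieve(text: str, n_embeddings: int = 4, k: int = 10) -> set[int]:
    emb = encode_query(text)
    hits: set[int] = set()
    for _ in range(n_embeddings):                 # emit several query vectors, one per facet
        scores = doc_embs @ emb
        hits.update(np.argsort(-scores)[:k].tolist())
        emb = next_embedding(emb)                 # autoregressive: next vector depends on the last
    return hits                                   # the union covers multiple distinct answers

print(len(multi_query_retrieve("climate change impacts and economic solutions")))
```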
Replicate Mouse Tracker
Shoutout to fofr and kylancodes for putting together a dedicated Replicate model that generates HTML with a face that follows the cursor.
VideoSwarm 0.5
Shoutout to Cerzi for releasing VideoSwarm 0.5, a mass video player for easy browsing of large video datasets.
WALT: Web Agents that Learn Tools
Salesforce built WALT to make browser agents stop clicking around like lost tourists. Instead, agents now reverse-engineer website features into structured APIs through a demonstrate-generate-validate loop, turning messy UI interactions into clean function calls like search(query).
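A structural sketch of that loop with hypothetical names and a fake browser trace; the real system drives an actual browser and uses a model to write the tool body.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A website capability distilled into a callable function, e.g. search(query)."""
    name: str
    signature: str
    run: Callable[..., str]

def demonstrate(site: str, task: str) -> list[str]:
    """Record one UI trajectory for the task (hypothetical browser driver)."""
    return [f"open {site}", "click #search-box", f"type '{task}'", "press Enter"]

def generate(trace: list[str]) -> Tool:
    """Abstract the recorded trajectory into a parameterized tool."""
    def search(query: str) -> str:
        return f"results page for '{query}'"      # would replay the trace with `query` substituted
    return Tool("search", "search(query: str) -> results", search)

def validate(tool: Tool) -> bool:
    """Re-run the tool on a held-out input and check it still returns results."""
    return "results" in tool.run("test query")

trace = demonstrate("example-shop.com", "wireless headphones")
tool = generate(trace)
if validate(tool):                                # only validated tools enter the agent's toolbox
    print(tool.run("usb-c hub"))
```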
AGILE: Agentic Jigsaw Interaction Learning
Researchers trained a VLM by making it solve jigsaw puzzles through trial and error. The model observes the puzzle, generates code to swap pieces, sees the result, and tries again. This simple interactive loop took accuracy from 9.5% to 82.8% and improved performance on nine other vision tasks by an average of 3.1%.
Image: Overview of AGILE.
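The interaction loop is simple enough to sketch. Here the VLM is replaced by a trivial heuristic, so this is only a stand-in for the observe-act-observe cycle, not the trained model.

```python
import random

def shuffled_puzzle(n: int = 9) -> list[int]:
    state = list(range(n))
    random.shuffle(state)
    return state

def propose_swap(state: list[int]) -> tuple[int, int]:
    """Stand-in for the VLM: look at the board and emit code like swap(i, j).
    A trivial heuristic picks the first misplaced piece and its target slot."""
    for i, piece in enumerate(state):
        if piece != i:
            return i, state.index(i)
    return 0, 0

state = shuffled_puzzle()
for step in range(50):                       # observe -> act -> observe the result -> try again
    if state == sorted(state):
        break
    i, j = propose_swap(state)
    state[i], state[j] = state[j], state[i]  # execute the generated swap and look at the new board
print(f"solved in {step} steps" if state == sorted(state) else "not solved")
```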
Sa2VA: Dense Grounded Understanding of Images and Videos
ByteDance combined SAM-2’s segmentation with LLaVA’s vision-language understanding into one unified model. Sa2VA handles both images and videos, producing pixel-precise masks for any object you ask about through conversational prompts.
UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Apple’s UltraCUA mixes low-level GUI actions with high-level API calls in one model. Train it with supervised learning, then online RL on hybrid action trajectories. The result beats baselines by 22% while running 11% faster.
Image: An overview of UltraCUA’s design. The agent adaptively switches between visual grounding and programmatic tool call, establishing the hybrid action mechanism.
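A minimal sketch of what a hybrid action space can look like, assuming hypothetical environment bindings; the point is that one policy can emit either kind of step inside the same trajectory.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class GuiAction:
    """Low-level visual grounding: act on screen coordinates."""
    kind: str          # "click", "type", "scroll"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class ApiCall:
    """High-level programmatic tool call exposed by the environment."""
    tool: str          # e.g. "open_app"
    kwargs: dict

Action = Union[GuiAction, ApiCall]

def execute(action: Action) -> str:
    """Single dispatcher so the policy can interleave both action types in one trajectory."""
    if isinstance(action, ApiCall):
        return f"called {action.tool}({action.kwargs})"
    return f"{action.kind} at ({action.x}, {action.y}) '{action.text}'"

trajectory: list[Action] = [
    ApiCall("open_app", {"name": "Calendar"}),    # fast, robust programmatic step
    GuiAction("click", x=412, y=180),             # fall back to pixels where no API exists
    GuiAction("type", text="Team sync"),
]
print([execute(a) for a in trajectory])
```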
Grasp Any Region (GAR): Precise Pixel-Level Understanding for MLLMs
GAR lets you ask detailed questions about any specific region of an image. It uses global context plus a region-of-interest replay mechanism to beat a 78B-parameter baseline with a much smaller model, and works zero-shot on video tasks.
DeepSeek OCR
DeepSeek’s OCR reads text in 100 languages and parses complex structures like charts and tables into HTML. It combines CLIP and SAM features for better grounding and a more efficient performance-to-vision-token ratio.
Tencent Hunyuan World 1.1 (WorldMirror)
Tencent open-sourced WorldMirror, a feed-forward 3D reconstruction model that now handles video-to-3D and multi-view-to-3D. It runs on a single GPU and delivers complete 3D attributes in one forward pass within seconds.
ByteDance Seed3D 1.0
ByteDance released Seed3D 1.0, which generates high-fidelity, simulation-ready 3D assets from a single image. The output works directly in physics simulations without additional processing.
HoloCine by Ant Group
HoloCine generates complete cinematic narratives from text prompts. The model maintains global consistency across multiple shots, creating coherent stories instead of disconnected clips.
Krea Realtime by Krea AI
Krea AI released a 14B autoregressive model that generates video at 11 fps on a single B200 GPU. It’s 10x larger than any open-source alternative and handles long-form video generation in real time.
Web Agents Learn to Think in Functions, Not Pixels
Salesforce and Apple both shipped the same insight this week: stop teaching agents to click buttons and start teaching them to extract functionality. WALT and UltraCUA both move from pixel-level automation to API-level understanding.

Here’s why pixel-clicking fails. You train an agent to navigate a website by clicking specific coordinates or finding specific UI elements. Then the site updates its design. Or it loads slower than expected. Or a popup appears. Your agent breaks. Every edge case becomes a new failure mode. You’re essentially teaching the agent to memorize a choreographed dance routine on a stage that keeps changing.

WALT and UltraCUA flip this. Instead of “click the search button at these coordinates,” the agent learns “this website has a search function that takes a query parameter.” It reverse-engineers the underlying capabilities. What can this site do? Search. Filter. Sort. Post. Each capability becomes a callable function. The agent demonstrates an action through the browser once, generates a candidate function from that trace, and validates it before adding the tool to its library.
Document AI Understands Structure, Not Just Text
DeepSeek OCR marks a shift from text extraction to document understanding. Reading characters in 100 languages is table stakes. The real advance is parsing charts and tables into HTML, understanding layout and structure, preserving semantic relationships.

Traditional OCR gives you a wall of text. You know what words are on the page but you’ve lost everything else. Which numbers belong to which row in a table? What’s a header versus a data point? How do the chart labels connect to the values? That context disappears.

DeepSeek OCR preserves it. The model doesn’t just read text; it understands document semantics. A financial table stays a table with intact relationships between columns and rows. A chart becomes structured data with labels mapped to values. A multi-column layout maintains its hierarchy.

This matters because most business-critical information lives in complex documents. Financial reports with nested tables. Scientific papers with methodology charts. Legal documents with clauses that reference each other. Structure-aware output keeps those relationships usable instead of flattening them into plain text.
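To make the difference concrete, here is a toy contrast between flat OCR output and structure-preserving output rendered as HTML. The table contents and markup are illustrative, not DeepSeek OCR’s actual output.

```python
# Flat OCR: the numbers survive, the relationships do not.
flat_text = "Revenue 2023 2024 Hardware 1.2 1.4 Services 0.8 1.1"

# Structure-aware parsing keeps rows, columns, and headers attached to their values.
parsed_table = {
    "headers": ["Revenue ($B)", "2023", "2024"],
    "rows": [["Hardware", "1.2", "1.4"], ["Services", "0.8", "1.1"]],
}

def to_html(table: dict) -> str:
    head = "".join(f"<th>{h}</th>" for h in table["headers"])
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in table["rows"]
    )
    return f"<table><tr>{head}</tr>{body}</table>"

print(to_html(parsed_table))   # downstream code can query columns instead of regexing a text blob
```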
Reasoning with Sampling: Your Base Model is Smarter Than You Think
Harvard researchers developed power sampling, an MCMC-based method that unlocks latent reasoning in base models without any training. Their approach matches or beats RL-finetuned models on MATH500, HumanEval, and GPQA while maintaining generation diversity.
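A simplified reading of the method as a sketch: treat the target distribution as p(x)^α, propose by resampling a suffix from the base model, and accept with a Metropolis-Hastings ratio. The `base_lm` helpers are hypothetical wrappers around an ordinary autoregressive model, and the proposal correction here is cruder than the paper’s.

```python
import math
import random

def power_sample(base_lm, prompt, alpha=4.0, steps=200, max_new_tokens=128):
    """Toy MCMC sketch targeting p(x)^alpha with no gradient updates.
    `base_lm.sample(prefix, n)` and `base_lm.logprob(prompt, continuation)` are
    hypothetical helpers; sequences are token lists."""
    current = base_lm.sample(prompt, max_new_tokens)          # initial draw from the base model
    current_lp = base_lm.logprob(prompt, current)
    for _ in range(steps):
        cut = random.randrange(len(current))                  # keep a prefix, resample the rest
        prefix = current[:cut]
        proposal = prefix + base_lm.sample(prompt + prefix, max_new_tokens - cut)
        proposal_lp = base_lm.logprob(prompt, proposal)
        # With a resample-from-p proposal and a shared prefix, the MH acceptance reduces to
        # min(1, (p(proposal)/p(current))^(alpha - 1)).
        accept_logp = (alpha - 1.0) * (proposal_lp - current_lp)
        if random.random() < math.exp(min(0.0, accept_logp)):
            current, current_lp = proposal, proposal_lp
    return current
```

Higher α sharpens the distribution toward high-likelihood (often more carefully reasoned) completions while still sampling, which is how the approach keeps diversity that greedy decoding or RL fine-tuning tends to lose.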
Ctrl-VI: Controllable Video Synthesis via Variational Inference
Stanford and MIT built Ctrl-VI, a video synthesis system that handles everything from text prompts to precise 4D object trajectories and camera paths. The framework uses variational inference with step-wise KL divergence minimization to produce controllable, diverse videos with 3D consistency.
FlashWorld: High-quality 3D Scene Generation within Seconds
Tencent, Xiamen University, and Fudan created FlashWorld, which generates high-quality 3D scenes from text or image prompts in 5-10 seconds. The model produces 3D Gaussian representations directly instead of going through multi-view intermediates, combining 2D diffusion quality with 3D geometric consistency.
Trace Anything: Representing Any Video in 4D via Trajectory Fields
ByteDance SEED released Trace Anything, which maps every pixel in a video to a continuous 3D trajectory using B-splines in a single forward pass. The model achieves state-of-the-art performance on trajectory estimation and point-tracking while being significantly faster than iterative methods.
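A quick sketch of why a B-spline trajectory field is attractive: a handful of control points per pixel replaces a per-frame position list and can be queried at any continuous time. The control points and counts below are invented for illustration, not the model’s output.

```python
import numpy as np
from scipy.interpolate import BSpline

# One pixel's 3D trajectory as a cubic B-spline over normalized time t in [0, 1].
degree = 3
control_points = np.array([[0.0,  0.0, 2.0],
                           [0.2,  0.1, 2.1],
                           [0.5,  0.1, 2.3],
                           [0.7,  0.0, 2.6],
                           [1.0, -0.1, 2.8]])
n_ctrl = len(control_points)
# Clamped uniform knot vector so the curve starts and ends at the first/last control point.
knots = np.concatenate([np.zeros(degree),
                        np.linspace(0, 1, n_ctrl - degree + 1),
                        np.ones(degree)])
trajectory = BSpline(knots, control_points, degree)

print(trajectory(0.5))                              # 3D position at any continuous time
print(trajectory(np.linspace(0, 1, 240)).shape)     # or densely resample, e.g. 240 steps
```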
VIST3A: Text-to-3D by Stitching a Multi-View Reconstruction Network to a Video Generator
ETH Zurich and Google unified a video generator with a 3D reconstruction model through model stitching, connecting a pretrained 3D foundation model into the video VAE’s latent space via lightweight linear mapping. The system generates 3D representations directly from text without needing 3D training labels.
Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures
Eyeline Labs enables multi-view character consistency and 3D camera control in video diffusion using 4D Gaussian Splatting and video relighting.
LabOS: The AI-XR Co-Scientist That Sees and Works With Humans
LabOS combines computational reasoning with physical experimentation through multimodal perception and XR-enabled human-AI collaboration.
VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents
VAGEN enhances multi-turn VLM agents by integrating explicit visual state reasoning into the model’s thinking process.
LAKAN: Landmark-Assisted Adaptive Kolmogorov-Arnold Network for Face Forgery Detection
LAKAN introduces a new network architecture for detecting face forgeries. Paper [https://arxiv.org/pdf/2510.00634]
Image: The LAKAN module leverages facial landmarks to generate adaptive parameters for KAN and is applied to downsampled features from four different stages of the image encoder through gating mechanisms.
Simple Projection Variants Improve ColBERT Performance
Mixedbread AI investigated architectural improvements to ColBERT’s projection layer for late-interaction models. Paper [https://arxiv.org/abs/2510.12327]
TOOLS & TECHNIQUES
Google Veo 3.1
Google released Veo 3.1 and Veo 3.1 Fast through the Gemini API with richer native audio, better cinematic style understanding, and enhanced image-to-video. New features include ingredient-based generation with up to 3 reference images, scene extension for longer videos, and first-and-last frame interpolation.
Anthropic Claude Haiku 4.5
Anthropic released Claude Haiku 4.5, delivering near-frontier performance at one-third the cost and twice the speed of models from five months ago. The model handles real-time, low-latency tasks and is available on Claude API, Amazon Bedrock, and Google Cloud’s Vertex AI.
Baidu PaddleOCR VL 0.9B
Baidu released a 0.9B parameter multilingual VLM for OCR tasks.
Alibaba Qwen3-VL-4B/8B
Alibaba released Qwen3-VL models in Instruct and Thinking variants. More options in the 4B-8B parameter range for vision-language tasks.
ImagenWorld
Google released ImagenWorld, a large-scale benchmark for image generation and editing that makes model failures more visible. Better benchmarks expose where generation models actually break.
TRENDS & PREDICTIONS
1. Continuous Representations Replace Discrete Ones
Trace Anything doesn’t compute frame-to-frame correspondences. It learns continuous 3D trajectories. FlashWorld doesn’t generate multiple views. It produces 3D Gaussians directly. Veo 3.1 doesn’t concatenate clips. It interpolates smooth transitions. The shift from discrete to continuous representations runs through every major paper this week. This matters because continuous representations are more compact, more queryable, and more composable. You can sample at any resolution. You can query based on derivatives and integrals. You can blend and interpolate smoothly. Discrete representations lock you into fixed sampling rates and make interpolation messy. Continuous ones give you infinite resolution and natural interpolation. The move to continuous isn’t just cleaner math. It’s fundamentally more powerful.
2. Composable Control Wins
Ctrl-VI lets you combine text prompts with 4D trajectories and camera paths. Veo 3.1 adds reference images and scene extension. VIST3A stitches models together with linear mappings. The pattern is clear: systems that combine multiple control signals beat single-mode approaches. You don’t choose between high-level creativity and low-level precision anymore. You get both. This matters because creative workflows are messy. You need broad strokes and fine details at different stages. Composable control means you can start with a text prompt, refine with reference images, adjust specific object paths, and modify camera movement without switching tools.
COMMUNITY + SHOUTOUTS
Builder of the week: Real-time head pose estimation for perspective correction. Clean implementation, practical use case.
ModernVBERT: Towards Smaller Visual Document Retrievers
EPFL researchers built a 250M parameter model that matches systems 10x larger on document retrieval. They discovered bidirectional attention beats causal attention by +10.6 nDCG@5 for retrieval, and that mixing text-only pairs with image-text during training fixes data scarcity through cross-modal transfer.
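Late-interaction document retrievers in this family score with ColBERT-style MaxSim between query tokens and document patch embeddings. A minimal version, with illustrative shapes:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: each query token finds its best-matching
    document token, and the per-token maxima are summed.
    query_tokens: (Q, d), doc_tokens: (D, d), both L2-normalized."""
    sims = query_tokens @ doc_tokens.T        # (Q, D) cosine similarities
    return float(sims.max(axis=1).sum())      # max over doc tokens, sum over query tokens

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 128));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(900, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)  # ~900 page patches
print(maxsim_score(q, d))
```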
DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval
DocPruner slashes storage for visual document retrieval by 50-60% without hurting performance. The system analyzes attention scores to identify which patches matter, then adapts pruning intensity per document: aggressively cutting sparse pages while preserving dense ones.
Image: Comparison of the OCR-based (a) and LVLM-based (b) paradigms for visual document retrieval with DocPruner (c), a framework that adaptively prunes patch-level embeddings for diverse document types.
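Not the paper’s exact criterion, but a sketch of the shape of the idea: score patches by attention mass, adapt the keep ratio to how concentrated that mass is, and store only the surviving vectors. All thresholds here are illustrative.

```python
import numpy as np

def prune_patches(patch_embs: np.ndarray, attn_scores: np.ndarray,
                  base_keep: float = 0.45) -> np.ndarray:
    """Keep the patches that carry the most attention mass.
    The keep ratio adapts per document: sparse pages (attention concentrated on a
    few patches) are pruned harder than dense ones."""
    probs = attn_scores / attn_scores.sum()
    # Normalized entropy as a density proxy: low entropy -> sparse page -> keep fewer patches.
    entropy = -(probs * np.log(probs + 1e-12)).sum() / np.log(len(probs))
    keep_ratio = np.clip(base_keep * (0.5 + entropy), 0.1, 1.0)
    k = max(1, int(len(patch_embs) * keep_ratio))
    keep_idx = np.argsort(-attn_scores)[:k]
    return patch_embs[np.sort(keep_idx)]

rng = np.random.default_rng(0)
embs = rng.normal(size=(1024, 128))      # multi-vector page representation
attn = rng.gamma(shape=0.3, size=1024)   # skewed scores: most mass on a few patches
print(prune_patches(embs, attn).shape)   # far fewer vectors to store per page
```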
LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks
LEAML adapts multimodal models to specialized domains like medical imaging using minimal labeled data plus unlabeled samples. The framework turns abundant unlabeled visual content into useful training signal when expert annotations are expensive or impossible to obtain.
Image: Overview of the proposed two-stage LEAML framework for OOD VQA adaptation. In Pseudo QA Generation, the QA Generator is trained using a small set of labeled question-answer pairs and then used to generate pseudo QA pairs for a large collection of unlabeled images. In OOD VQA Finetuning, the VQA model is fine-tuned with both the original labeled data and the produced pseudo QA pairs, enabling label-efficient adaptation to out-of-distribution visual question answering.
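A bare-bones sketch of the two-stage data flow. Everything here is a trivial, hypothetical stand-in for the QA generator and the VQA model; only the stage ordering follows the description above.

```python
def train_qa_generator(labeled):
    """Stand-in: a real system would fine-tune a generator on the few labeled QA pairs."""
    template_q = labeled[0]["question"]
    def generate(image_id):
        return {"image": image_id, "question": template_q, "answer": "pseudo-" + image_id}
    return generate

def finetune_vqa(pairs):
    """Stand-in for supervised fine-tuning on the combined set."""
    return f"model fine-tuned on {len(pairs)} QA pairs"

labeled = [{"image": "xray_001", "question": "Is there a fracture?", "answer": "no"}]
unlabeled = [f"xray_{i:03d}" for i in range(2, 100)]      # abundant images, no annotations

generate = train_qa_generator(labeled)                    # stage 1: pseudo QA generation
pseudo = [generate(img) for img in unlabeled]
print(finetune_vqa(labeled + pseudo))                     # stage 2: OOD VQA fine-tuning
```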
Coevolutionary Continuous Discrete Diffusion
CCDD enables joint generation across continuous (images, audio) and discrete (text) modalities in one unified process. The coevolutionary approach lets models reason across different representation types simultaneously rather than processing them separately.
GraphSearch: An Agentic Deep Searching Workflow
DataArc’s GraphSearch fixes GraphRAG’s shallow retrieval problem through six-stage deep searching: decomposition, refinement, grounding, drafting, verification, and expansion. The dual-channel approach queries both text chunks and graph structure simultaneously, beating single-round GraphRAG on all benchmarks.
Image: Comparison of using graph data only, text data only, or all data as commonly adopted in GraphRAG approaches. The metric is SubEM. The contribution of retrieved graph data is marginal.
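A sketch of the six stages as one function. The stage prompts and the `text_index`, `graph_index`, and `llm` handles are hypothetical, but the ordering and the dual-channel retrieval follow the description above.

```python
def graph_search(question, text_index, graph_index, llm):
    """Sketch of the six-stage workflow; stage bodies are placeholders."""
    sub_questions = llm(f"Decompose into sub-questions: {question}")       # 1. decomposition
    sub_questions = llm(f"Refine and deduplicate: {sub_questions}")        # 2. refinement
    evidence = []
    for sq in sub_questions:            # assume the llm handle returns a list of sub-questions
        evidence += text_index.search(sq)                                  # 3. grounding,
        evidence += graph_index.neighbors(sq)                              #    dual-channel
    draft = llm(f"Draft an answer to '{question}' from: {evidence}")       # 4. drafting
    issues = llm(f"Verify the draft against the evidence: {draft}")        # 5. verification
    if issues:
        evidence += text_index.search(issues)                              # 6. expansion
        draft = llm(f"Revise the draft with the new evidence: {evidence}")
    return draft
```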
Other New Notable Research:
Fathom-DeepResearch delivers evidence-based web investigation with two 4B models achieving SOTA among open-weights through DuetQA dataset and RAPO optimization.
OpenAI Sora 2
Sora 2 ships with rightsholder controls and revenue sharing. Sam Altman says users are generating way more content than expected, so they’re building opt-in controls where creators get paid when their characters appear in user-generated content—basically “interactive fan fiction” that pays the original creators.
Anthropic Claude Sonnet 4.5
Claude Sonnet 4.5 breaks records: 77.2% on SWE-bench, 61.4% on OSWorld, and can code for 30+ hours straight. Ships with checkpoints in Claude Code, VS Code extension, memory tools for longer agent runs, and the Claude Agent SDK powering it all.
Alibaba Qwen3-VL-30B-A3B-Instruct
Alibaba’s Qwen3-VL uses just 3B active parameters to match GPT-5-Mini and Claude4-Sonnet on STEM, VQA, OCR, video, and agent tasks. Available in standard and FP8 versions, plus a massive 235B-A22B variant for maximum capability.
Tencent HunyuanImage-3.0
HunyuanImage-3.0 improves text-to-image generation across the board: better prompt understanding, higher quality, more consistent styles. Handles complex scenes, detailed characters, and maintains coherence across artistic styles.
Ovi: Twin Backbone Cross-Modal Fusion
Ovi generates synchronized audio and video simultaneously using a twin-backbone architecture. It creates 5-second 720×720 videos at 24 FPS with matched audio, supporting 9:16, 16:9, and 1:1 aspect ratios from text or text+image inputs.
Other Notable New Tools:
Code2Video generates educational videos from code for automated programming tutorials.