
cs.CL

[1] PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins

Sihan Chen,John P. Lalor,Yi Yang,Ahmed Abbasi

Main category: cs.CL

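TL;DR: PersonaTwin is a multi-tier prompt conditioning framework for generating and evaluating personalized digital twins. By integrating demographic, behavioral, and psychometric data, it improves how well large language models (LLMs) capture the multidimensional nuances of individual users.

Motivation: When existing LLMs are used to simulate and model user behavior, they often fail to capture the multidimensional nuances of individuals, so a finer-grained approach to personalized digital twins is needed.

Contribution: Proposes the PersonaTwin framework, which integrates multidimensional user data through multi-tier prompt conditioning, improves the generation quality and fairness of digital twins, and validates its effectiveness on downstream tasks.

Method: A multi-tier prompt conditioning framework combines demographic, behavioral, and psychometric data to generate personalized digital twins. Evaluation pairs text similarity metrics with demographic parity assessments.

Result: Experiments show that the digital twins generated by PersonaTwin reach simulation fidelity on par with oracle settings, and that downstream models trained on them perform comparably to models trained directly on real individuals' data.

Insight: PersonaTwin demonstrates the potential of LLM-based digital twins for personalized user modeling and behavior analysis, offering a new tool for high-fidelity, emotionally nuanced user simulation.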

Abstract: While large language models (LLMs) afford new possibilities for user modeling and approximation of human behaviors, they often fail to capture the multidimensional nuances of individual users. In this work, we introduce PersonaTwin, a multi-tier prompt conditioning framework that builds adaptive digital twins by integrating demographic, behavioral, and psychometric data. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we systematically benchmark PersonaTwin against standard LLM outputs, and our rigorous evaluation unites state-of-the-art text similarity metrics with dedicated demographic parity assessments, ensuring that generated responses remain accurate and unbiased. Experimental results show that our framework produces simulation fidelity on par with oracle settings. Moreover, downstream models trained on persona-twins approximate models trained on individuals in terms of prediction and fairness metrics across both GPT-4o-based and Llama-based models. Together, these findings underscore the potential for LLM digital twin-based approaches in producing realistic and emotionally nuanced user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis.
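
The abstract mentions demographic parity assessments without giving a formula. One common definition is the gap in positive-prediction rate between groups. The following is a minimal sketch under that assumption; the function name, the binary predictions, and the toy data are hypothetical and are not taken from the PersonaTwin evaluation code.

```python
from collections import defaultdict


def demographic_parity_gap(predictions, groups):
    """Return (gap, rates): the largest difference in positive-prediction
    rate between any two groups, plus the per-group rates."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Toy data (hypothetical): binary predictions for two demographic groups.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, rates = demographic_parity_gap(preds, groups)
print(gap, rates)
```

A gap near zero means the groups receive positive predictions at similar rates; comparing the gap for a model trained on twins against one trained on real individuals is one way such a check could be set up.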

[2] gpt-oss-120b & gpt-oss-20b Model Card

OpenAI,:,Sandhini Agarwal,Lama Ahmad,Jason Ai,Sam Altman,Andy Applebaum,Edwin Arbus,Rahul K. Arora,Yu Bai,Bowen Baker,Haiming Bao,Boaz Barak,Ally Bennett,Tyler Bertao,Nivedita Brett,Eugene Brevdo,Greg Brockman,Sebastien Bubeck,Che Chang,Kai Chen,Mark Chen,Enoch Cheung,Aidan Clark,Dan Cook,Marat Dukhan,Casey Dvorak,Kevin Fives,Vlad Fomenko,Timur Garipov,Kristian Georgiev,Mia Glaese,Tarun Gogineni,Adam Goucher,Lukas Gross,Katia Gil Guzman,John Hallman,Jackie Hehir,Johannes Heidecke,Alec Helyar,Haitang Hu,Romain Huet,Jacob Huh,Saachi Jain,Zach Johnson,Chris Koch,Irina Kofman,Dominik Kundel,Jason Kwon,Volodymyr Kyrylov,Elaine Ya Le,Guillaume Leclerc,James Park Lennon,Scott Lessans,Mario Lezcano-Casado,Yuanzhi Li,Zhuohan Li,Ji Lin,Jordan Liss,Lily,Liu,Jiancheng Liu,Kevin Lu,Chris Lu,Zoran Martinovic,Lindsay McCallum,Josh McGrath,Scott McKinney,Aidan McLaughlin,Song Mei,Steve Mostovoy,Tong Mu,Gideon Myles,Alexander Neitz,Alex Nichol,Jakub Pachocki,Alex Paino,Dana Palmie,Ashley Pantuliano,Giambattista Parascandolo,Jongsoo Park,Leher Pathak,Carolina Paz,Ludovic Peran,Dmitry Pimenov,Michelle Pokrass,Elizabeth Proehl,Huida Qiu,Gaby Raila,Filippo Raso,Hongyu Ren,Kimmy Richardson,David Robinson,Bob Rotsted,Hadi Salman,Suvansh Sanjeev,Max Schwarzer,D. Sculley,Harshit Sikchi,Kendal Simon,Karan Singhal,Yang Song,Dane Stuckey,Zhiqing Sun,Philippe Tillet,Sam Toizer,Foivos Tsimpourlas,Nikhil Vyas,Eric Wallace,Xin Wang,Miles Wang,Olivia Watkins,Kevin Weil,Amy Wendling,Kevin Whinnery,Cedric Whitney,Hannah Wong,Lin Yang,Yu Yang,Michihiro Yasunaga,Kristen Ying,Wojciech Zaremba,Wenting Zhan,Cyril Zhang,Brian Zhang,Eddie Zhang,Shengjia Zhao

Main category: cs.CL

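TL;DR: The paper introduces gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models. Built on a mixture-of-experts Transformer architecture and trained with large-scale distillation and reinforcement learning, they target high accuracy at low inference cost and have strong agentic capabilities (deep research browsing, Python tool use, and more).

Motivation: To advance open-weight LLMs, strengthen model capabilities in reasoning and tool use, and enable broader research and application through open release.

Contribution: 1. Presents two efficient LLMs, gpt-oss-120b and gpt-oss-20b; 2. Uses a mixture-of-experts Transformer architecture together with distillation and reinforcement learning; 3. Releases the weights, inference implementations, tool environments, and tokenizers.

Method: 1. A Mixture-of-Experts Transformer architecture; 2. Training with large-scale distillation and reinforcement learning; 3. Optimization for agentic capabilities (tool use, instruction following, and so on).

Result: The models achieve strong results on mathematics, coding, and safety benchmarks and show strong agentic capabilities.

Insight: Releasing model weights and tooling can accelerate community research and applications, and the work illustrates the potential of mixture-of-experts architectures and large-scale training for high-performance language models.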

Abstract: We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks spanning mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

[3] Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules

Nasim Shirvani-Mahdavi,Chengkai Li

Main category: cs.CL

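TL;DR: Rule2Text is a framework that uses large language models (LLMs) to generate natural language explanations of knowledge graph rules; experiments and evaluation show that fine-tuning markedly improves explanation quality.

Motivation: Logical rules mined from knowledge graphs are often hard to understand. Rule2Text aims to make them more accessible and usable through natural language explanations.

Contribution: 1. Proposes the Rule2Text framework for generating rule explanations with LLMs; 2. Develops and validates an LLM-as-a-judge evaluation method; 3. Builds high-quality datasets for fine-tuning an open-source model.

Method: 1. Generate explanations with several LLMs and prompting strategies; 2. Assess explanation quality through human and LLM evaluation; 3. Fine-tune the Zephyr model and integrate a type inference module.

Result: The fine-tuned model shows significant improvements in explanation quality, with particularly strong gains on the domain-specific dataset.

Insight: Combining LLM generation with an LLM judge and human-in-the-loop feedback can substantially improve the readability and correctness of rule explanations.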

Abstract: Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models’ performance, we conduct a human evaluation of generated explanations on correctness and clarity. To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at https://github.com/idirlab/KGRule2NL.

[4] MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

Tomer Wolfson,Harsh Trivedi,Mor Geva,Yoav Goldberg,Dan Roth,Tushar Khot,Ashish Sabharwal,Reut Tsarfaty

Main category: cs.CL

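TL;DR: MoNaCo is a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve, designed to test how well LLMs reason across many documents.

Motivation: Current LLM benchmarks rarely contain natural questions that are both information-seeking and genuinely time-consuming for humans. MoNaCo fills this gap.

Contribution: MoNaCo's questions require far more intermediate reasoning steps than those of any existing QA benchmark.

Method: A decomposed annotation pipeline is used to elicit and manually answer natural, time-consuming questions at scale.

Result: Frontier LLMs reach at most 61.2% F1 on MoNaCo, held back by low recall and hallucinations.

Insight: MoNaCo highlights the limits of LLMs on complex real-world questions and provides an effective resource for tracking progress in reasoning models.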

Abstract: Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve – far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions – with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts and model predictions are publicly available at: https://tomerwolgithub.github.io/monaco
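
The abstract reports F1 and attributes the low scores to poor recall and hallucinations. To show how those two failure modes enter the metric, here is a minimal sketch of F1 over answer lists. It assumes answers are sets of strings matched exactly after lowercasing; MoNaCo's actual scoring procedure may differ, and the function name and example answers are hypothetical.

```python
def list_f1(predicted, gold):
    """Precision, recall, and F1 between two answer lists,
    compared as sets after trimming and lowercasing."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)  # lowered by hallucinated answers
    recall = overlap / len(ref)      # lowered by missing answers
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: one hallucinated answer, two missing answers.
precision, recall, f1 = list_f1(
    ["Paris", "Rome", "Oslo"],
    ["paris", "rome", "berlin", "madrid"],
)
print(precision, recall, f1)
```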

[5] MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering

Hikaru Asano,Hiroki Ouchi,Akira Kasuga,Ryo Yonetani

Main category: cs.CL

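TL;DR: The paper introduces MobQA, a dataset for evaluating how well large language models (LLMs) semantically understand human mobility data through natural language question answering. It finds LLMs strong at factual retrieval but limited in semantic reasoning and explanation.

Motivation: Existing models excel at predicting human movement patterns, but it is unclear how well they understand the reasons behind and the meaning of those patterns. MobQA fills this gap.

Contribution: Proposes MobQA, a benchmark of 5,800 question-answer pairs covering three complementary question types (factual retrieval, multiple-choice reasoning, and free-form explanation) to evaluate the semantic understanding of LLMs.

Method: The dataset contains diverse human GPS trajectories spanning daily to weekly granularities. Questions of the three types require spatial, temporal, and semantic reasoning.

Result: LLMs perform well on factual retrieval but show significant limitations on semantic reasoning and explanation questions, and trajectory length substantially affects model effectiveness.

Insight: MobQA exposes both the achievements and the shortcomings of state-of-the-art LLMs in semantic mobility understanding and points to room for improvement in interpretive reasoning.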

Abstract: This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering. While existing models excel at predicting human movement patterns, it remains unclear how much they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), which all require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate the achievements and limitations of state-of-the-art LLMs for semantic mobility understanding. The MobQA dataset is available at https://github.com/CyberAgentAILab/mobqa.

[6] Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction

Tao Wu,Jingyuan Chen,Wang Lin,Jian Zhan,Mengze Li,Kun Kuang,Fei Wu

Main category: cs.CL

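TL;DR: The paper proposes generating personalized distractors through reasoning reconstruction guided by Monte Carlo Tree Search (MCTS), addressing the inability of existing methods to capture the diverse reasoning errors of individual students.

Motivation: Existing distractor generation methods for multiple-choice questions (MCQs) use large language models (LLMs) to learn error patterns shared across student populations, but they cannot capture the diverse reasoning errors of individual students, which limits diagnostic effectiveness.

Contribution: Introduces the task of personalized distractor generation and a training-free two-stage framework that uses MCTS to reconstruct a student's reasoning trajectories and generate distractors aligned with that student's specific misconceptions.

Method: Two stages: 1) apply MCTS to a student's past incorrect answers to recover reasoning trajectories and build a student-specific misconception prototype; 2) use the prototype to simulate the student's reasoning on new questions and generate personalized distractors.

Result: The approach achieves the best performance when generating personalized distractors for 140 students and generalizes effectively to group-level settings, indicating robustness and adaptability.

Insight: When each student has too few records for training to be feasible, personalized distractor generation based on reasoning reconstruction can still capture individual misconceptions, offering a new direction for educational assessment.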

Abstract: Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this limitation, we introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student’s past question-answering (QA) records, ensuring every student receives options that effectively expose their specific reasoning errors. While promising, this task is challenging because each student typically has only a few QA records, which often lack the student’s underlying reasoning processes, making training-based group-level approaches infeasible. To overcome this, we propose a training-free two-stage framework. In the first stage, we construct a student-specific misconception prototype by applying Monte Carlo Tree Search (MCTS) to recover the student’s reasoning trajectories from past incorrect answers. In the second stage, this prototype guides the simulation of the student’s reasoning on new questions, enabling the generation of personalized distractors that align with the student’s recurring misconceptions. Experiments show that our approach achieves the best performance in generating plausible, personalized distractors for 140 students, and also effectively generalizes to group-level settings, highlighting its robustness and adaptability.

[7] E-CaTCH: Event-Centric Cross-Modal Attention with Temporal Consistency and Class-Imbalance Handling for Misinformation Detection

Ahmad Mousavi,Yeganeh Abdollahinejad,Roberto Corizzo,Nathalie Japkowicz,Zois Boukouvalas

Main category: cs.CL

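TL;DR: E-CaTCH is a framework for detecting multimodal misinformation. It combines event grouping, cross-modal attention, temporal modeling, and class-imbalance handling to improve detection performance.

Motivation: Detecting multimodal misinformation on social media is hampered by inconsistencies between modalities, shifting temporal patterns, and class imbalance, and existing methods fail to capture event-level structure and temporal dynamics.

Contribution: Proposes the E-CaTCH framework, which combines event clustering, cross-modal attention, temporal consistency modeling, and class-imbalance handling to improve both the performance and the interpretability of misinformation detection.

Method: Posts are clustered into events by textual similarity and temporal proximity; features are aligned with self-attention and cross-modal attention; temporal dynamics are modeled with a trend-aware LSTM; adaptive class weighting and hard-example mining are integrated into training.

Result: On several datasets (Fakeddit, IND, COVID-19 MISINFOGRAPH), E-CaTCH outperforms existing methods and shows robustness and generalizability.

Insight: Event-level modeling and cross-modal alignment are central to misinformation detection, while handling temporal dynamics and class imbalance further improves stability and performance.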

Abstract: Detecting multimodal misinformation on social media remains challenging due to inconsistencies between modalities, changes in temporal patterns, and substantial class imbalance. Many existing methods treat posts independently and fail to capture the event-level structure that connects them across time and modality. We propose E-CaTCH, an interpretable and scalable framework for robustly detecting misinformation. If needed, E-CaTCH clusters posts into pseudo-events based on textual similarity and temporal proximity, then processes each event independently. Within each event, textual and visual features are extracted using pre-trained BERT and ResNet encoders, refined via intra-modal self-attention, and aligned through bidirectional cross-modal attention. A soft gating mechanism fuses these representations to form contextualized, content-aware embeddings of each post. To model temporal evolution, E-CaTCH segments events into overlapping time windows and uses a trend-aware LSTM, enhanced with semantic shift and momentum signals, to encode narrative progression over time. Classification is performed at the event level, enabling better alignment with real-world misinformation dynamics. To address class imbalance and promote stable learning, the model integrates adaptive class weighting, temporal consistency regularization, and hard-example mining. The total loss is aggregated across all events. Extensive experiments on Fakeddit, IND, and COVID-19 MISINFOGRAPH demonstrate that E-CaTCH consistently outperforms state-of-the-art baselines. Cross-dataset evaluations further demonstrate its robustness, generalizability, and practical applicability across diverse misinformation scenarios.
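
The pseudo-event step (grouping posts by textual similarity and temporal proximity) can be sketched as a greedy single-pass clustering. This is an illustration only: Jaccard token overlap stands in for whatever similarity measure E-CaTCH uses, and the thresholds, function names, and sample posts are hypothetical.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cluster_pseudo_events(posts, sim_threshold=0.3, max_gap_hours=24.0):
    """Greedily group (text, timestamp_hours) posts into pseudo-events.

    A post joins the first event whose most recent post is both close
    in time and textually similar; otherwise it starts a new event.
    """
    events = []
    for text, ts in sorted(posts, key=lambda p: p[1]):
        for event in events:
            last_text, last_ts = event[-1]
            close_in_time = ts - last_ts <= max_gap_hours
            if close_in_time and jaccard(text, last_text) >= sim_threshold:
                event.append((text, ts))
                break
        else:
            events.append([(text, ts)])
    return events


posts = [
    ("flood hits city center", 0),
    ("city center flood worsens", 2),
    ("new vaccine rumor spreads", 3),
    ("flood hits city center", 100),  # same text, but days later
]
events = cluster_pseudo_events(posts)
print([len(e) for e in events])
```

The last post repeats the first one's text but falls outside the time window, so it starts its own pseudo-event; this is the role temporal proximity plays alongside similarity.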

[8] UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?

Mukund Choudhary,KV Aditya Srivatsa,Gaurja Aeron,Antara Raaghavi Bhattacharya,Dang Khoa Dang Dinh,Ikhlasul Akmal Hanif,Daria Kotova,Ekaterina Kochmar,Monojit Choudhury

Main category: cs.CL

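TL;DR: The study analyzes how large language models (LLMs) perform on Linguistics Olympiad (LO) puzzles and reveals their limitations on low-resource languages.

Motivation: Although LLMs do well on many reasoning tasks, they perform poorly on linguistics puzzles, which offer a low-contamination setting for assessing linguistic reasoning across low-resource languages.

Contribution: By analyzing 629 problems, the study shows that LLMs struggle with puzzles of higher morphological complexity, and it proposes splitting words into morphemes as a pre-processing step.

Method: Each problem is labeled with linguistically informed features; puzzles across 41 low-resource languages are analyzed; and the effect of morpheme segmentation on performance is tested.

Result: LLMs perform poorly on puzzles with high morphological complexity and better on puzzles involving linguistic features also found in English. Morpheme segmentation improves solvability.

Insight: The findings point to a need for more informed, language-specific tokenizers, and identify morphological complexity as a key factor in LLM performance.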

Abstract: Large language models (LLMs) have demonstrated potential in reasoning tasks, but their performance on linguistics puzzles remains consistently poor. These puzzles, often derived from Linguistics Olympiad (LO) contests, provide a minimal contamination environment to assess LLMs’ linguistic reasoning abilities across low-resource languages. This work analyses LLMs’ performance on 629 problems across 41 low-resource languages by labelling each with linguistically informed features to unveil weaknesses. Our analyses show that LLMs struggle with puzzles involving higher morphological complexity and perform better on puzzles involving linguistic features that are also found in English. We also show that splitting words into morphemes as a pre-processing step improves solvability, indicating a need for more informed and language-specific tokenisers. These findings thus offer insights into some challenges in linguistic reasoning and modelling of low-resource languages.

[9] LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Ruiyan Qi,Congding Wen,Weibo Zhou,Shangsong Liang,Lingbo Li

Main category: cs.CL

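TL;DR: The paper proposes LETToT, a framework that evaluates LLMs in the tourism domain without labels by relying on expert-derived Tree-of-Thought structures instead of annotated data. Experiments show the approach is effective and that smaller models with explicit reasoning architectures can narrow the gap to much larger models.

Motivation: Evaluating LLMs in specific domains such as tourism is hampered by high annotation costs and hallucinations, which calls for a label-free evaluation method.

Contribution: 1. Proposes the LETToT framework, which achieves label-free evaluation through expert Tree-of-Thought structures; 2. Reveals the potential of reasoning-enhanced smaller models in a specialized domain; 3. Establishes a scalable paradigm for domain-specific evaluation.

Method: 1. Iteratively refine the Tree-of-Thought structure using expert feedback; 2. Apply the optimized structure to evaluate models of different scales and compare their performance and reasoning ability.

Result: The optimized expert ToT yields 4.99-14.15% relative quality gains over baselines, and smaller models with explicit reasoning architectures are competitive.

Insight: 1. Scaling laws still hold in specialized domains, but reasoning ability can compensate for smaller model size; 2. Label-free evaluation offers a new approach to domain-specific LLM evaluation.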

Abstract: Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose Label-Free Evaluation of LLM on Tourism using Expert Tree-of-Thought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15% relative quality gains over baselines. Second, we apply LETToT’s optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness (p < 0.05). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.

[10] LLM Compression: How Far Can We Go in Balancing Size and Performance?

Sahil Sk,Debasish Dhal,Sonal Khosla,Sk Shahid,Sambit Shekhar,Akash Dhaka,Shantipriya Parida,Dilip K. Prasad,Ondřej Bojar

Main category: cs.CL

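TL;DR: The study examines how 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) compress three LLMs of different sizes, and evaluates the trade-off between performance and efficiency on several NLP tasks.

Motivation: Quantization can sharply reduce the memory footprint and computational cost of LLMs, but its effect on performance is not well studied. The study assesses how suitable different quantization methods are across tasks to inform real-world deployment.

Contribution: 1. A systematic evaluation of GSQ and GPTQ on three LLMs of different sizes; 2. An analysis of the performance-efficiency trade-off after quantization; 3. A benchmark for future experiments.

Method: LLaMA 1B, Qwen 0.5B, and PHI 1.5B are quantized with 4-bit GSQ and GPTQ and evaluated on MS MARCO, BoolQ, and GSM8K for accuracy, inference latency, and throughput.

Result: Quantization substantially reduces model size and computational cost, but the performance loss varies by task. GSQ and GPTQ show different strengths and weaknesses at different model sizes.

Insight: Low-bit quantization works well on some tasks, but performance must be weighed against efficiency; the results let practitioners choose according to their deployment requirements.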

Abstract: Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), providing insights into the suitability of low-bit quantization for real-world deployment. Using the results, users can then make suitable decisions based on the specifications that need to be met. We discuss the pros and cons of GSQ and GPTQ techniques on models of different sizes, which also serve as a benchmark for future experiments.
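
To make the idea of 4-bit quantization with per-group scales concrete, the sketch below implements a generic group-wise absmax scheme in plain Python. It is not necessarily the GSQ method evaluated in the paper, and it is not GPTQ (which additionally uses second-order information to correct quantization error); the group size, function names, and weights are hypothetical.

```python
def quantize_groupwise(weights, group_size=4, bits=4):
    """Symmetric quantization with one scale per group of weights.

    Returns integer codes in [-(2**(bits-1)), 2**(bits-1) - 1]
    and the list of per-group scales.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group to qmax; use 1.0 for all-zero groups.
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=4):
    """Reconstruct approximate weights from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


weights = [0.7, -0.35, 0.1, 0.0, 7.0, -3.5, 1.0, 0.0]
codes, scales = quantize_groupwise(weights)
restored = dequantize_groupwise(codes, scales)
print(codes, scales, restored)
```

Because each group has its own scale, the small-magnitude first group is not crushed by the large values in the second group; the reconstruction error of every weight is bounded by half of its group's scale.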

[11] Retrieval-augmented reasoning with lean language models

Ryan Sze-Yin Chan,Federico Nanni,Tomas Lazauskas,Rosie Wood,Penelope Yong,Lionel Tarassenko,Mark Girolami,James Geddes,Andrew Duncan

Main category: cs.CL

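TL;DR: Proposes a lean language model approach that combines reasoning with retrieval-augmented generation (RAG), suited to resource-constrained or privacy-sensitive environments.

Motivation: Existing RAG systems typically depend on large-scale models and external APIs. The authors aim for a performant, privacy-preserving alternative that can be deployed in resource-constrained or secure settings.

Contribution: Proposes a lightweight architecture that combines a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces, and approaches frontier-level performance.

Method: A dense retriever and Qwen2.5-Instruct models are combined with synthetic data generation, summarisation-based document compression, and reasoning-aware fine-tuning to improve performance in a specific domain.

Result: The approach yields substantial gains in answer accuracy and consistency over non-reasoning and general-purpose lean models, approaching frontier-level performance.

Insight: With domain-specific fine-tuning and careful synthetic data design, lean models can deliver strong reasoning and retrieval performance under resource constraints.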

Abstract: This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.

[12] Model Interpretability and Rationale Extraction by Input Mask Optimization

Marc Brinner,Sina Zarriess

Main category: cs.CL

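TL;DR: The paper proposes a new method based on input mask optimization that produces explanatory masks for neural network predictions, using gradient-based optimization and regularization to enforce sufficiency, comprehensiveness, and compactness.

Motivation: With rapid progress in neural networks for natural language processing and computer vision, the need to explain the predictions of black-box models keeps growing. The paper aims to provide extractive explanations for neural network predictions.

Contribution: Proposes generating explanations through input mask optimization, which enables rationale extraction without training a specialized model, and extends the method to image classification.

Method: Gradient-based optimization is combined with a new regularization scheme to mask the input parts that are not indicative of the predicted class, so that the resulting explanation is sufficient, comprehensive, and compact.

Result: The method yields high-quality explanations on both natural language processing and image classification tasks, indicating that the conditions proposed for rationale extraction apply to different input types.

Insight: The conditions for rationale extraction are not limited to natural language processing; they carry over to other input types such as images, opening a path to cross-domain model explanation.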

Abstract: Concurrent with the rapid progress in the development of neural-network based models in areas like natural language processing and computer vision, the need for creating explanations for the predictions of these black-box models has risen steadily. We propose a new method to generate extractive explanations for predictions made by neural networks, that is based on masking parts of the input which the model does not consider to be indicative of the respective class. The masking is done using gradient-based optimization combined with a new regularization scheme that enforces sufficiency, comprehensiveness and compactness of the generated explanation, three properties that are known to be desirable from the related field of rationale extraction in natural language processing. In this way, we bridge the gap between model interpretability and rationale extraction, thereby showing that the latter can be performed without training a specialized model, only on the basis of a trained classifier. We further apply the same method to image inputs and obtain high quality explanations for image classifications, which indicates that the conditions proposed for rationale extraction in natural language processing are more broadly applicable to different input types.

[13] Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training

Marc Brinner,Sina Zarrieß

Main category: cs.CL

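TL;DR: The paper proposes an end-to-end differentiable self-training method for stable training of a rationalized Transformer classifier, yielding a single model that both classifies a sample and scores the relevance of input tokens.

Motivation: Training rationalized models usually relies on three modules (a selector, a classifier, and a complement classifier), which makes the process complex and unstable. It needs to be simplified and made more effective.

Contribution: 1. A single model fulfills all three roles, simplifying training; 2. The paradigm is extended to class-wise rationales and combined with regularization techniques to improve alignment with human annotations.

Method: An end-to-end differentiable self-training framework merges the selector and classifier roles and adds regularization to improve rationale quality.

Result: The method substantially improves alignment with human annotations and reaches state-of-the-art levels.

Insight: Simplifying the module design improves training stability, while regularization and class-wise rationales further improve interpretability.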

Abstract: We propose an end-to-end differentiable training paradigm for stable training of a rationalized transformer classifier. Our approach results in a single model that simultaneously classifies a sample and scores input tokens based on their relevance to the classification. To this end, we build on the widely-used three-player-game for training rationalized models, which typically relies on training a rationale selector, a classifier and a complement classifier. We simplify this approach by making a single model fulfill all three roles, leading to a more efficient training paradigm that is not susceptible to the common training instabilities that plague existing approaches. Further, we extend this paradigm to produce class-wise rationales while incorporating recent advances in parameterizing and regularizing the resulting rationales, thus leading to substantially improved and state-of-the-art alignment with human annotations without any explicit supervision.

[14] Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions

Shangrui Nie,Florian Mai,David Kaczér,Charles Welch,Zhixue Zhao,Lucie Flek

Main category: cs.CL

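TL;DR: Fine-tuning on survey questions to adjust the human values of an LLM proves to be a simple yet effective approach to downstream alignment.

Motivation: Steering the values of large language models (LLMs) usually requires large amounts of data. The paper asks whether value alignment can be achieved by simply fine-tuning on survey questions.

Contribution: Proposes a downstream alignment method that changes an LLM's values by fine-tuning on value survey questions, and validates it on in-domain and out-of-domain tasks.

Method: Build baseline value profiles of LLMs, fine-tune the models to answer value survey questions in a target way, and then evaluate changes in in-domain answers and in out-of-domain behavior (such as moral judgments and text-based adventure games).

Result: Fine-tuning changes not only the LLM's answers to survey questions but also produces substantial shifts in its implicit behavior on downstream tasks (value alignment).

Insight: Simple fine-tuning on survey questions is an effective route to value alignment in LLMs and offers a starting point for further research on value interventions.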

Abstract: Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model’s value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of finetuning on the model’s behavior in two ways; first, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model’s behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model’s behavior in text-based adventure games. We demonstrate that our simple approach can not only change the model’s answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.

[15] HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor

Shivam Dubey

Main category: cs.CL

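TL;DR: HumorPlanSearch uses structured planning and HuCoT templates to improve the context sensitivity and comedic quality of AI humor generation; experiments show a significant improvement over a baseline.

Motivation: Current LLM-based humor generation often lacks context sensitivity, producing jokes that feel generic, repetitive, or tone-deaf.

Contribution: Proposes HumorPlanSearch, a modular pipeline comprising Plan-Search, HuCoT templates, Knowledge Graph retrieval, novelty filtering, and an iterative revision loop, which improves the contextual fit and quality of generated humor.

Method: Combines the Plan-Search strategy, HuCoT templates, Knowledge Graph retrieval, novelty filtering via semantic embeddings, and an iterative revision loop.

Result: In experiments, the full pipeline (Knowledge Graph + Revision) raises the mean Humor Generation Score (HGS) by 15.4% (p < 0.05) over the baseline.

Insight: Modeling context at every stage and evaluating with multiple signals can markedly improve the coherence and cultural fit of AI-generated humor.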

Abstract: Automated humor generation with Large Language Models (LLMs) often yields jokes that feel generic, repetitive, or tone-deaf because humor is deeply situated and hinges on the listener’s cultural background, mindset, and immediate context. We introduce HumorPlanSearch, a modular pipeline that explicitly models context through: (1) Plan-Search for diverse, topic-tailored strategies; (2) Humor Chain-of-Thought (HuCoT) templates capturing cultural and stylistic reasoning; (3) a Knowledge Graph to retrieve and adapt high-performing historical strategies; (4) novelty filtering via semantic embeddings; and (5) an iterative judge-driven revision loop. To evaluate context sensitivity and comedic quality, we propose the Humor Generation Score (HGS), which fuses direct ratings, multi-persona feedback, pairwise win-rates, and topic relevance. In experiments across nine topics with feedback from 13 human judges, our full pipeline (KG + Revision) boosts mean HGS by 15.4 percent (p < 0.05) over a strong baseline. By foregrounding context at every stage from strategy planning to multi-signal evaluation, HumorPlanSearch advances AI-driven humor toward more coherent, adaptive, and culturally attuned comedy.

[16] Reference Points in LLM Sentiment Analysis: The Role of Structured Context

Junichiro Niimi

Main category: cs.CL

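TL;DR: The paper studies the role of reference points in LLM sentiment analysis. Comparing natural language and JSON-formatted prompts, it finds that structured JSON prompts improve performance, making the approach suitable for resource-constrained edge devices.

Motivation: Most NLP studies classify sentiment from review text alone, whereas marketing theories hold that customer evaluations are also shaped by additional reference points. The paper investigates how the content and format of such supplementary information affect LLM sentiment analysis.

Contribution: Shows that a lightweight 3B-parameter model gains performance from structured JSON prompts without fine-tuning, so it can be deployed in resource-constrained environments as a practical alternative to large-scale models.

Method: Natural language and JSON-formatted prompts are compared using a 3B-parameter model on two Yelp categories (Restaurant and Nightlife), with Macro-F1 and RMSE as metrics.

Result: The JSON prompt outperforms all baselines: Macro-F1 rises by 1.6% and 4%, and RMSE falls by 16% and 9.1%, respectively. A follow-up analysis confirms that the gains come from genuine contextual reasoning.

Insight: Structured prompting lets smaller models reach competitive performance, showing potential for edge deployment while avoiding the high cost of large-scale models.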

Abstract: Large language models (LLMs) are now widely used across many fields, including marketing research. Sentiment analysis, in particular, helps firms understand consumer preferences. While most NLP studies classify sentiment from review text alone, marketing theories, such as prospect theory and expectation–disconfirmation theory, point out that customer evaluations are shaped not only by the actual experience but also by additional reference points. This study therefore investigates how the content and format of such supplementary information affect sentiment analysis using LLMs. We compare natural language (NL) and JSON-formatted prompts using a lightweight 3B parameter model suitable for practical marketing applications. Experiments on two Yelp categories (Restaurant and Nightlife) show that the JSON prompt with additional information outperforms all baselines without fine-tuning: Macro-F1 rises by 1.6% and 4% while RMSE falls by 16% and 9.1%, respectively, making it deployable in resource-constrained edge devices. Furthermore, a follow-up analysis confirms that performance gains stem from genuine contextual reasoning rather than label proxying. This work demonstrates that structured prompting can enable smaller models to achieve competitive performance, offering a practical alternative to large-scale model deployment.
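
The comparison between natural language (NL) and JSON-formatted prompts can be pictured with the sketch below, which serializes the same review and reference-point information both ways. The field names (`user_avg_stars`, `business_avg_stars`), the task wording, and the sample values are hypothetical; the abstract does not specify which supplementary fields or prompt templates the study used.

```python
import json

TASK = "Predict the star rating (1-5) for the review."


def build_nl_prompt(review, context):
    """Natural-language prompt: context written out as prose."""
    return (
        f"{TASK}\n"
        f"The reviewer's average rating is {context['user_avg_stars']} stars "
        f"and the business's average rating is {context['business_avg_stars']} stars.\n"
        f"Review: {review}"
    )


def build_json_prompt(review, context):
    """JSON prompt: the same information as a structured object."""
    payload = {"task": TASK, "review": review, "context": context}
    return json.dumps(payload, ensure_ascii=False, indent=2)


review = "Service was slow but the ramen was excellent."
context = {"user_avg_stars": 3.2, "business_avg_stars": 4.5}
nl_prompt = build_nl_prompt(review, context)
json_prompt = build_json_prompt(review, context)
print(json_prompt)
```

Both prompts carry identical information, so any difference in model output can be attributed to format rather than content, which is the kind of controlled comparison the abstract describes.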

[17] Language models align with brain regions that represent concepts across modalities

Maria Ryskina,Greta Tuckute,Alexander Fung,Ashley Malkin,Evelina Fedorenko

Main category: cs.CL

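TL;DR: This paper studies the relationship between language models (LMs) and brain representations. LMs predict the signal better in brain areas with higher meaning consistency, which suggests that LMs may internally represent cross-modal conceptual meaning.

Details Motivation: Cognitive science and neuroscience have long faced the challenge of disentangling representations of language from representations of conceptual meaning. The same problem arises in today's LMs, so the authors investigate how LM-brain alignment relates to two neural metrics.

Contribution: 1. Proposes a new measure of how consistently a brain region responds to the same concept across input modalities. 2. Shows that LMs predict the signal better in more meaning-consistent brain areas, even when those areas are not strongly sensitive to language processing.

Method: Using an fMRI dataset (Pereira et al., 2018), the authors analyze two neural metrics: 1) the level of brain activation during sentence processing, targeting linguistic processing; 2) meaning consistency across input modalities (sentence, word cloud, image). Both metrics are related to LM-brain alignment.

Result: Both language-only and language-vision models predict the signal better in more meaning-consistent brain areas, even when these areas are not strongly sensitive to language processing.

Insight: The results suggest that LMs may internally represent cross-modal conceptual meaning, offering a new perspective on how LMs capture the relationship between language and concepts.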

Abstract: Cognitive science and neuroscience have long faced the challenge of disentangling representations of language from representations of conceptual meaning. As the same problem arises in today’s language models (LMs), we investigate the relationship between LM–brain alignment and two neural metrics: (1) the level of brain activation during processing of sentences, targeting linguistic processing, and (2) a novel measure of meaning consistency across input modalities, which quantifies how consistently a brain region responds to the same concept across paradigms (sentence, word cloud, image) using an fMRI dataset (Pereira et al., 2018). Our experiments show that both language-only and language-vision models predict the signal better in more meaning-consistent areas of the brain, even when these areas are not strongly sensitive to language processing, suggesting that LMs might internally represent cross-modal conceptual meaning.

[18] Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models

Qiguang Chen,Dengyun Peng,Jinhao Liu,HuiKang Su,Jiannan Guan,Libo Qin,Wanxiang Che

Main category: cs.CL

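TL;DR: The paper proposes the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which lets a model dynamically assess and adjust its reasoning depth to improve efficiency on complex tasks, substantially cutting response tokens with minimal loss in accuracy.

Details Motivation: On complex reasoning tasks, Long Chain-of-Thought (CoT) often produces substantial redundancy, which hurts computational efficiency and adds latency. Current methods rely on human-defined difficulty priors that do not match the difficulty the model itself perceives, leading to inefficiency.

Contribution: Proposes DR. SAF, which dynamically adjusts reasoning depth through three core components (Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism) to balance efficiency and accuracy.

Method: DR. SAF integrates three components: (1) Boundary Self-Awareness Alignment, (2) Adaptive Reward Management, and (3) a Boundary Preservation Mechanism. Together they let the model adjust its reasoning depth to problem complexity.

Result: DR. SAF reduces total response tokens by 49.27% with minimal loss in accuracy, delivers a 6.59x gain in token efficiency, and cuts training time by 5x. Under extreme training, it surpasses traditional instruction-based models in token efficiency while improving accuracy by more than 16%.

Insight: A model's awareness of its own reasoning boundary can be used to optimize reasoning efficiency and reduce redundancy, which suits resource-limited settings.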

Abstract: Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-perceived difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.

[19] Representing Speech Through Autoregressive Prediction of Cochlear Tokens

Greta Tuckute,Klemen Kotar,Evelina Fedorenko,Daniel L. K. Yamins

Main category: cs.CL

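TL;DR: AuriStream is a biologically inspired two-stage model that mirrors the human auditory processing hierarchy: it encodes speech into discrete cochlear tokens and models them autoregressively, yielding strong speech representations and lexical semantics.

Details Motivation: Inspired by the human auditory system, the work aims to develop models that process speech in a more human-like way and efficiently handle a range of speech tasks.

Contribution: Proposes AuriStream, a two-stage framework that encodes speech as cochlear tokens and learns from them with an autoregressive model, producing meaningful phoneme and word representations and state-of-the-art lexical semantics.

Method: 1. Stage one: transform raw audio into a time-frequency representation based on the human cochlea, and extract discrete cochlear tokens from it.
2. Stage two: apply an autoregressive sequence model over the cochlear tokens.

Result: AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. It also generates audio continuations that can be visualized in spectrogram space and decoded back into audio, giving insight into the model's predictions.

Insight: The biologically inspired design gives AuriStream strong speech representation and generation capabilities and points toward more human-like speech models.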

Abstract: We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, and state-of-the-art lexical semantics. AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. Complementing AuriStream’s strong representational capabilities, it generates continuations of audio which can be visualized in a spectrogram space and decoded back into audio, providing insights into the model’s predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.

cs.CV [Back]

[20] Privacy Enhancement for Gaze Data Using a Noise-Infused Autoencoder

Samantha Aziz,Oleg Komogortsev

Main category: cs.CV

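TL;DR: This paper proposes a privacy-enhancing mechanism for gaze data based on a noise-infused autoencoder. It prevents users from being re-identified across sessions without their consent while keeping the data usable for benign tasks.

Details Motivation: Gaze data is useful in many applications but also carries sensitive biometric information that can be misused. Prior methods strike a poor balance between privacy protection and data usability, so a technique is needed that protects privacy while preserving practical utility.

Contribution: Proposes a privacy framework based on a noise-infused autoencoder that significantly reduces the biometric identifiability of gaze data with minimal impact on utility. Unlike prior methods, it retains physiologically plausible gaze patterns suitable for downstream use.

Method: Uses an autoencoder architecture and injects noise in the latent space to perturb the gaze signal. Balancing noise strength against signal fidelity tunes the privacy-utility trade-off.

Result: The approach significantly reduces biometric identifiability while degrading utility only minimally on gaze prediction, giving a more favorable privacy-utility trade-off than prior methods.

Insight: A noise-infused autoencoder is an efficient and practical privacy tool, particularly for biosignal data. The key is noise design that hides sensitive information without destroying the natural characteristics of the data.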

Abstract: We present a privacy-enhancing mechanism for gaze signals using a latent-noise autoencoder that prevents users from being re-identified across play sessions without their consent, while retaining the usability of the data for benign tasks. We evaluate privacy-utility trade-offs across biometric identification and gaze prediction tasks, showing that our approach significantly reduces biometric identifiability with minimal utility degradation. Unlike prior methods in this direction, our framework retains physiologically plausible gaze patterns suitable for downstream use, which produces a favorable privacy-utility trade-off. This work advances privacy in gaze-based systems by providing a usable and effective mechanism for protecting sensitive gaze data.

[21] A Survey on Video Temporal Grounding with Multimodal Large Language Model

Jianlong Wu,Wei Liu,Ye Liu,Meng Liu,Liqiang Nie,Zhouchen Lin,Chang Wen Chen

Main category: cs.CV

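TL;DR: This survey covers video temporal grounding (VTG) with multimodal large language models (MLLMs). It organizes current research along three dimensions (the functional roles of MLLMs, training paradigms, and video feature processing techniques) and discusses benchmark datasets and future research directions.

Details Motivation: Comprehensive reviews specifically addressing VTG-MLLMs remain scarce. The survey fills this gap by systematically reviewing progress in the area and proposing future research directions.

Contribution: 1. Proposes a three-dimensional taxonomy of VTG-MLLM research; 2. Discusses benchmark datasets and evaluation protocols; 3. Identifies existing limitations and future directions.

Method: Systematically analyzes current VTG-MLLM research along three dimensions: functional roles, training paradigms, and video feature processing.

Result: Summarizes empirical findings on the competitive performance and generalization ability of VTG-MLLMs and outlines what future research needs to address.

Insight: The multi-task and zero-shot abilities of MLLMs show great potential for video temporal grounding, but their architectural design and training still need improvement.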

Abstract: The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.

[22] VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip

Wenqi Guo,Shan Du

Main category: cs.CV

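TL;DR: The paper proposes Value Sign Flip (VSF), a simple and efficient method for negative prompt guidance in few-step image generation models. By dynamically flipping the sign of attention values from negative prompts, VSF suppresses undesired content and significantly improves negative prompt adherence.

Details Motivation: Existing approaches such as classifier-free guidance (CFG) struggle to provide effective negative prompt guidance in few-step generation models. VSF aims to solve this with only a small computational overhead.

Contribution: VSF dynamically suppresses undesired content by flipping the sign of attention values, giving a simple and efficient form of negative prompt guidance that outperforms prior methods in few-step models.

Method: VSF flips the sign of attention values from negative prompts to suppress undesired content. It applies to MMDiT-style and cross-attention-based architectures with small computational overhead.

Result: VSF significantly improves negative prompt adherence over prior methods in few-step models, and even over CFG in non-few-step models, while maintaining competitive image quality.

Insight: A simple adjustment to the attention mechanism can deliver efficient negative prompt guidance, suggesting a new direction for few-step generation models.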

Abstract: We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and a ComfyUI node are available at https://github.com/weathon/VSF/tree/main.
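
The core idea of flipping the sign of attention values from negative prompts can be sketched with a toy single-head attention. This is a simplified reading of the abstract, not the implementation in the linked repository: the function name, the `neg_scale` parameter, and the choice to concatenate positive and negative prompt tokens into one key/value set are all assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_value_sign_flip(query, pos_keys, pos_values,
                                   neg_keys, neg_values, neg_scale=1.0):
    """Toy scaled dot-product attention for one query vector.

    Negative-prompt tokens keep their keys (so they are still attended to),
    but their value vectors are multiplied by -neg_scale, so attending to
    them pushes the output away from the negative content.
    """
    keys = pos_keys + neg_keys
    flipped = [[-neg_scale * v for v in vec] for vec in neg_values]
    values = pos_values + flipped
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * vec[i] for w, vec in zip(weights, values))
            for i in range(dim)]


if __name__ == "__main__":
    out = attention_with_value_sign_flip(
        query=[1.0, 0.0],
        pos_keys=[[0.0, 1.0]], pos_values=[[1.0, 0.0]],
        neg_keys=[[1.0, 0.0]], neg_values=[[0.0, 1.0]],
    )
    print(out)  # second component is negative: pushed away from the negative value
```

Because the guidance happens inside a single attention pass, a sketch like this needs no second forward pass for the negative prompt, which is consistent with the small overhead the abstract claims.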

[23] Relative Pose Regression with Pose Auto-Encoders: Enhancing Accuracy and Data Efficiency for Retail Applications

Yoli Shavit,Yosi Keller

Main category: cs.CV

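TL;DR: The paper proposes a Relative Pose Regression (RPR) method based on Camera Pose Auto-Encoders (PAEs) to improve the accuracy and data efficiency of camera localization in retail settings, and combines it with Absolute Pose Regression (APR) into an efficient re-localization scheme.

Details Motivation: Accurate camera localization is crucial in retail environments for customer experience and operational efficiency. APR from a single image is simple, but approaches that incorporate visual and spatial scene priors tend to be more accurate. PAEs can embed such priors, and this work extends them to RPR to further improve accuracy and data efficiency.

Contribution: 1. Proposes PAE-based RPR; 2. Designs a re-localization scheme that combines APR with PAE-based RPR and requires no additional storage of images or pose data; 3. Validates the method on indoor benchmarks, where it remains competitive even when trained with only 30% of the data.

Method: 1. Extend PAEs to the RPR task; 2. Refine APR predictions using PAE-based RPR; 3. Establish the effectiveness of PAE-based RPR by comparing it with image-based RPR models of equivalent architecture.

Result: The PAE-based RPR refinement improves APR localization accuracy on indoor benchmarks and remains competitive with 70% less training data.

Insight: PAEs that embed scene priors can improve localization accuracy while reducing data dependence, offering a more efficient solution for retail and other practical deployments.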

Abstract: Accurate camera localization is crucial for modern retail environments, enabling enhanced customer experiences, streamlined inventory management, and autonomous operations. While Absolute Pose Regression (APR) from a single image offers a promising solution, approaches that incorporate visual and spatial scene priors tend to achieve higher accuracy. Camera Pose Auto-Encoders (PAEs) have recently been introduced to embed such priors into APR. In this work, we extend PAEs to the task of Relative Pose Regression (RPR) and propose a novel re-localization scheme that refines APR predictions using PAE-based RPR, without requiring additional storage of images or pose data. We first introduce PAE-based RPR and establish its effectiveness by comparing it with image-based RPR models of equivalent architectures. We then demonstrate that our refinement strategy, driven by a PAE-based RPR, enhances APR localization accuracy on indoor benchmarks. Notably, our method is shown to achieve competitive performance even when trained with only 30% of the data, substantially reducing the data collection burden for retail deployment. Our code and pre-trained models are available at: https://github.com/yolish/camera-pose-auto-encoders

[24] ViPE: Video Pose Engine for 3D Geometric Perception

Jiahui Huang,Qunjie Zhou,Hesam Rabeti,Aleksandr Korovko,Huan Ling,Xuanchi Ren,Tianchang Shen,Jun Gao,Dmitry Slepichev,Chen-Hsuan Lin,Jiawei Ren,Kevin Xie,Joydeep Biswas,Laura Leal-Taixe,Sanja Fidler

Main category: cs.CV

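TL;DR: ViPE is an efficient video processing engine that estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It supports diverse scenarios and camera models and outperforms existing baselines.

Details Motivation: 3D geometric perception is a key prerequisite for spatial AI systems, but state-of-the-art methods depend on large-scale training data, and consistent, precise 3D annotations are hard to obtain from in-the-wild videos. ViPE aims to bridge this gap.

Contribution: 1. Proposes ViPE, an efficient and versatile video processing engine that supports diverse scenarios and camera models; 2. Outperforms existing baselines on multiple benchmarks; 3. Open-sources a large annotated dataset of approximately 96M frames.

Method: ViPE estimates camera intrinsics, camera motion, and dense depth maps from raw videos. It handles dynamic selfie videos, cinematic shots, and dashcam footage, and supports pinhole, wide-angle, and 360° panoramic camera models.

Result: ViPE outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences and runs at 3-5 FPS on a single GPU. It was used to annotate a large video collection that includes around 100K real-world internet videos, 1M AI-generated videos, and 2K panoramic videos.

Insight: By efficiently processing unconstrained videos, ViPE addresses the annotation bottleneck in 3D geometric perception and provides tools and data for developing spatial AI systems.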

Abstract: Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames, all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.

[25] HQ-OV3D: A High Box Quality Open-World 3D Detection Framework based on Diffision Model

Qi Liu,Yabei Li,Hongsong Wang,Lei He

Main category: cs.CV

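TL;DR: HQ-OV3D is a diffusion-based open-vocabulary 3D detection framework focused on box quality. Using cross-modality geometric consistency and a class-assisted denoising mechanism, it substantially improves the geometric quality of pseudo-labels.

Details Motivation: Traditional closed-set 3D detection frameworks cannot meet the demands of open-world applications, and existing open-vocabulary 3D detection methods commonly neglect the geometric quality of pseudo-labels, particularly bounding box precision.

Contribution: Proposes the HQ-OV3D framework, which generates and refines high-quality pseudo-labels through two key components: the Intra-Modality Cross-Validated (IMCV) Proposal Generator and the Annotated-Class Assisted (ACA) Denoiser.

Method: 1. Generate high-quality initial 3D proposals using cross-modality geometric consistency; 2. Progressively refine the 3D proposals with a DDIM-based denoising mechanism that leverages geometric priors from annotated categories.

Result: Training with the generated pseudo-labels improves mAP on novel classes by 7.37% over the state-of-the-art method, demonstrating the quality of the pseudo-labels.

Insight: HQ-OV3D can serve both as a standalone open-vocabulary 3D detector and as a plug-in high-quality pseudo-label generator for existing open-vocabulary detection or annotation pipelines.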

Abstract: Traditional closed-set 3D detection frameworks fail to meet the demands of open-world applications like autonomous driving. Existing open-vocabulary 3D detection methods typically adopt a two-stage pipeline consisting of pseudo-label generation followed by semantic alignment. While vision-language models (VLMs) recently have dramatically improved the semantic accuracy of pseudo-labels, their geometric quality, particularly bounding box precision, remains commonly neglected. To address this issue, we propose a High Box Quality Open-Vocabulary 3D Detection (HQ-OV3D) framework, dedicated to generating and refining high-quality pseudo-labels for open-vocabulary classes. The framework comprises two key components: an Intra-Modality Cross-Validated (IMCV) Proposal Generator that utilizes cross-modality geometric consistency to generate high-quality initial 3D proposals, and an Annotated-Class Assisted (ACA) Denoiser that progressively refines 3D proposals by leveraging geometric priors from annotated categories through a DDIM-based denoising mechanism. Compared to the state-of-the-art method, training with pseudo-labels generated by our approach achieves a 7.37% improvement in mAP on novel classes, demonstrating the superior quality of the pseudo-labels produced by our framework. HQ-OV3D can serve not only as a strong standalone open-vocabulary 3D detector but also as a plug-in high-quality pseudo-label generator for existing open-vocabulary detection or annotation pipelines.

[26] Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction

Cheng Chen,Hao Huang,Saurabh Bagchi

Main category: cs.CV

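TL;DR: This paper proposes a collaborative 3D semantic occupancy prediction method based on sparse 3D semantic Gaussian splatting, addressing the communication cost, depth-estimation dependence, and extra supervision requirements of existing methods.

Details Motivation: Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision. Both limit their use in collaborative scenarios.

Contribution: Presents the first approach that uses sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction, sharing and fusing Gaussian primitives to achieve efficient communication and robust performance.

Method: Applies neighborhood-based cross-agent fusion to remove duplicate and noisy Gaussian primitives, and jointly encodes geometry and semantics in each primitive to reduce reliance on depth supervision.

Result: The approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. With communication volume cut to 34.6%, it still achieves a +1.9 improvement in mIoU.

Insight: Transmitting sparse, object-centric Gaussian primitives preserves structural information while sharply reducing communication volume, which suits collaborative scenarios with limited communication budgets.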

Abstract: Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When further reducing the number of transmitted Gaussians, our method still achieves a +1.9 improvement in mIoU, using only 34.6% communication volume, highlighting robust performance under limited communication budgets.
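
The neighborhood-based cross-agent fusion described above can be illustrated with a minimal sketch. It assumes each Gaussian primitive is reduced to a center, a semantic label, and a confidence score, and that received primitives have already been rigidly aligned to the ego frame. The data layout, the `radius` threshold, and the keep-the-more-confident rule are hypothetical simplifications, not the paper's actual fusion procedure.

```python
import math


def fuse_gaussians(ego, received, radius=0.5):
    """Merge primitives received from another agent into the ego set.

    A received primitive counts as a duplicate when an existing primitive
    with the same label lies within `radius`; in that case only the more
    confident of the two is kept. Otherwise the primitive is added as new.
    """
    fused = [dict(g) for g in ego]  # copy so the inputs are not mutated
    for g in received:
        match = None
        for f in fused:
            same_label = f["label"] == g["label"]
            if same_label and math.dist(f["center"], g["center"]) <= radius:
                match = f
                break
        if match is None:
            fused.append(dict(g))
        elif g["confidence"] > match["confidence"]:
            match.update(g)
    return fused


if __name__ == "__main__":
    ego = [{"center": (0.0, 0.0, 0.0), "label": "car", "confidence": 0.6}]
    received = [
        {"center": (0.1, 0.0, 0.0), "label": "car", "confidence": 0.9},
        {"center": (5.0, 5.0, 0.0), "label": "pedestrian", "confidence": 0.7},
    ]
    for g in fuse_gaussians(ego, received):
        print(g)
```

In this toy run the nearby "car" primitive is treated as a duplicate and replaced by the more confident copy, while the distant "pedestrian" primitive extends the ego agent's view.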

[27] Personalized Face Super-Resolution with Identity Decoupling and Fitting

Jiarui Yang,Hang Guo,Wen Huang,Tao Dai,Shutao Xia

Main category: cs.CV

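TL;DR: This paper proposes IDFSR, a face super-resolution method based on identity decoupling and fitting, which addresses poor identity consistency and hallucination when reconstructing faces under extreme degradation.

Details Motivation: Under extreme degradation (e.g., upscaling beyond 8×), conventional face super-resolution methods struggle to preserve identity and tend to hallucinate faces. IDFSR aims to improve reconstruction by decoupling identity from style information.

Contribution: 1) Achieves identity-consistent reconstruction through three key designs: masking, warping, and identity embeddings; 2) Proposes a diffusion-based pretraining scheme that decouples style and identity; 3) Uses lightweight fine-tuning to adapt to individual identities.

Method: 1) Mask the facial region to eliminate unreliable identity cues; 2) Warp a reference image to provide style guidance; 3) Use identity embeddings extracted from ground truth (GT) images for fine-grained modeling. These are combined with diffusion model pretraining and lightweight fine-tuning.

Result: Under extreme degradation, IDFSR substantially outperforms existing approaches, particularly on identity consistency.

Insight: Decoupling identity from style and combining it with lightweight fine-tuning is an effective route to high-quality personalized face super-resolution.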

Abstract: In recent years, face super-resolution (FSR) methods have achieved remarkable progress, generally maintaining high image fidelity and identity (ID) consistency under standard settings. However, in extreme degradation scenarios (e.g., scale > 8×), critical attributes and ID information are often severely lost in the input image, making it difficult for conventional models to reconstruct realistic and ID-consistent faces. Existing methods tend to generate hallucinated faces under such conditions, producing restored images lacking authentic ID constraints. To address this challenge, we propose a novel FSR method with Identity Decoupling and Fitting (IDFSR), designed to enhance ID restoration under large scaling factors while mitigating hallucination effects. Our approach involves three key designs: 1) Masking the facial region in the low-resolution (LR) image to eliminate unreliable ID cues; 2) Warping a reference image to align with the LR input, providing style guidance; 3) Leveraging ID embeddings extracted from ground truth (GT) images for fine-grained ID modeling and personalized adaptation. We first pretrain a diffusion-based model to explicitly decouple style and ID by forcing it to reconstruct masked LR face regions using both style and identity embeddings. Subsequently, we freeze most network parameters and perform lightweight fine-tuning of the ID embedding using a small set of target ID images. This embedding encodes fine-grained facial attributes and precise ID information, significantly improving both ID consistency and perceptual quality. Extensive quantitative evaluations and visual comparisons demonstrate that the proposed IDFSR substantially outperforms existing approaches under extreme degradation, particularly achieving superior performance on ID consistency.

[28] Deep Learning for Automated Identification of Vietnamese Timber Species: A Tool for Ecological Monitoring and Conservation

Tianyu Song,Van-Doan Duong,Thi-Phuong Le,Ton Viet Ta

Main category: cs.CV

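TL;DR: This paper explores deep learning for automatically classifying wood species commonly found in Vietnam. Among five evaluated architectures, ShuffleNetV2 offers the best balance of classification performance and computational efficiency, providing an efficient tool for ecological monitoring.

Details Motivation: Accurate identification of wood species is critical for ecological monitoring, biodiversity conservation, and sustainable forest management, but traditional methods are labor-intensive and require expert knowledge. The authors therefore turn to deep learning for efficient automated classification.

Contribution: Presents an automated wood species classification method based on a lightweight deep learning model (ShuffleNetV2), whose high accuracy and computational efficiency suit resource-constrained environments.

Method: Builds a custom image dataset from field-collected Vietnamese wood samples and evaluates five architectures: ResNet50, EfficientNet, MobileViT, MobileNetV3, and ShuffleNetV2. ShuffleNetV2 is selected as the best model.

Result: Over 20 independent runs, ShuffleNetV2 achieves an average accuracy of 99.29% and an F1-score of 99.35% while remaining computationally efficient.

Insight: Lightweight deep learning models such as ShuffleNetV2 show potential for real-time use in resource-constrained settings and offer scalable, image-based classification for ecological informatics.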

Abstract: Accurate identification of wood species plays a critical role in ecological monitoring, biodiversity conservation, and sustainable forest management. Traditional classification approaches relying on macroscopic and microscopic inspection are labor-intensive and require expert knowledge. In this study, we explore the application of deep learning to automate the classification of ten wood species commonly found in Vietnam. A custom image dataset was constructed from field-collected wood samples, and five state-of-the-art convolutional neural network architectures–ResNet50, EfficientNet, MobileViT, MobileNetV3, and ShuffleNetV2–were evaluated. Among these, ShuffleNetV2 achieved the best balance between classification performance and computational efficiency, with an average accuracy of 99.29% and F1-score of 99.35% over 20 independent runs. These results demonstrate the potential of lightweight deep learning models for real-time, high-accuracy species identification in resource-constrained environments. Our work contributes to the growing field of ecological informatics by providing scalable, image-based solutions for automated wood classification and forest biodiversity assessment.

[29] Topological Structure Description for Artcode Detection Using the Shape of Orientation Histogram

Liming Xu,Dave Towey,Andrew P. French,Steve Benford

Main category: cs.CV

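TL;DR: This paper proposes a new feature descriptor for Artcode detection, the shape of orientation histogram, which describes the generic topological structure of an Artcode, and validates it experimentally.

Details Motivation: As smartphones and VR/AR technologies spread, more decorative objects that link to virtual elements (such as Artcodes) are expected to appear in everyday environments. Recognizing the presence of these objects is the first step toward triggering further interaction.

Contribution: Proposes a new feature descriptor, the shape of orientation histogram, for describing the topological structure of Artcodes, and evaluates its detection performance experimentally.

Method: Formulates Artcode recognition as an Artcode proposal detection task, designs a feature descriptor based on the shape of orientation histogram, and evaluates performance on collected datasets.

Result: The experiments show that the proposed descriptor is feasible for representing topological structure and that the detection system built on it is effective at detecting Artcode proposals.

Insight: This work is an initial attempt at a feature-based system for detecting topological objects such as Artcodes, and it opens up new interaction opportunities and potential applications.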

Abstract: With the increasing ubiquity of smartphones and the resurgence of VR/AR techniques, it is expected that our everyday environment may soon be decorated with objects connected to virtual elements. Alerting users to the presence of these objects is therefore the first step in motivating further inspection and triggering the digital material attached to the objects. This work studies a special kind of these objects, Artcodes: human-meaningful and machine-readable decorative markers that camouflage themselves with freeform appearance by encoding information into their topology. We formulate the problem of recognising the presence of Artcodes as Artcode proposal detection, a distinct computer vision task that classifies topologically similar but geometrically and semantically different objects as the same class. To deal with this problem, we propose a new feature descriptor, called the shape of orientation histogram, to describe the generic topological structure of an Artcode. We collect datasets and conduct comprehensive experiments to evaluate the performance of the Artcode detection proposer built upon this new feature vector. Our experimental results show the feasibility of the proposed feature vector for representing topological structures and the effectiveness of the system for detecting Artcode proposals. Although this work is an initial attempt to develop a feature-based system for detecting topological objects like Artcodes, it would open up new interaction opportunities and spark potential applications of topological object detection.

[30] Analysis of the Compaction Behavior of Textile Reinforcements in Low-Resolution In-Situ CT Scans via Machine-Learning and Descriptor-Based Methods

Christian Düreth,Jan Condé-Wolter,Marek Danczak,Karsten Tittmann,Jörn Jaschinski,Andreas Hornig,Maik Gude

Main category: cs.CV

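TL;DR: This paper presents a machine-learning framework for analyzing the compaction behavior of textile reinforcements in low-resolution CT scans, using a 3D-UNet for semantic segmentation and the two-point correlation function to quantify nesting.

Details Motivation: The mechanical properties of textile-reinforced composites, such as stiffness and permeability, are affected by nesting, yet this structural feature is difficult to quantify in low-resolution CT data with conventional methods.

Contribution: Develops a 3D-UNet-based framework that semantically segments material phases in low-resolution CT data and extracts geometric features of nesting with the two-point correlation function.

Method: Applies a 3D-UNet for semantic segmentation of CT data, then analyzes spatial structure with the two-point correlation function (S2) to quantify average layer thickness and nesting degree.

Result: The segmentation model achieves a minimum mean IoU of 0.822 and an F1 score of 0.902. The results agree well with micrograph-based validation and lay a foundation for reverse modeling of composite preforms.

Insight: The method offers a robust way to extract geometric features from industrially relevant CT data and supports descriptor-based structural analysis of composites.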

Abstract: A detailed understanding of material structure across multiple scales is essential for predictive modeling of textile-reinforced composites. Nesting, characterized by the interlocking of adjacent fabric layers through local interpenetration and misalignment of yarns, plays a critical role in defining mechanical properties such as stiffness, permeability, and damage tolerance. This study presents a framework to quantify nesting behavior in dry textile reinforcements under compaction using low-resolution computed tomography (CT). In-situ compaction experiments were conducted on various stacking configurations, with CT scans acquired at 20.22 µm per voxel resolution. A tailored 3D-UNet enabled semantic segmentation of matrix, weft, and fill phases across compaction stages corresponding to fiber volume contents of 50–60%. The model achieved a minimum mean Intersection-over-Union of 0.822 and an F1 score of 0.902. Spatial structure was subsequently analyzed using the two-point correlation function $S_2$, allowing for probabilistic extraction of average layer thickness and nesting degree. The results show strong agreement with micrograph-based validation. This methodology provides a robust approach for extracting key geometrical features from industrially relevant CT data and establishes a foundation for reverse modeling and descriptor-based structural analysis of composite preforms.
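
For readers unfamiliar with the two-point correlation function, the sketch below computes it for a 1D periodic binary profile: S_2(r) is the probability that two points separated by lag r both fall in the phase of interest. The 1D periodic setting and the function name are simplifying assumptions; the paper applies the descriptor to segmented 3D CT volumes, and its exact estimator is not given in the abstract.

```python
def two_point_correlation(indicator, max_lag):
    """Two-point correlation S_2(r) of a 1D periodic binary indicator.

    indicator[i] is 1 where the phase of interest is present, else 0.
    Returns [S_2(0), S_2(1), ..., S_2(max_lag)].
    """
    n = len(indicator)
    return [
        sum(indicator[i] * indicator[(i + r) % n] for i in range(n)) / n
        for r in range(max_lag + 1)
    ]


if __name__ == "__main__":
    # Idealized through-thickness profile: layers 2 voxels thick,
    # separated by 2-voxel gaps, repeated three times.
    profile = [1, 1, 0, 0] * 3
    s2 = two_point_correlation(profile, max_lag=4)
    print(s2)  # [0.5, 0.25, 0.0, 0.25, 0.5]
```

In this idealized profile, S_2(0) equals the phase volume fraction (0.5), and the first lag at which S_2 returns to that maximum (lag 4) is the layer repeat distance, which is the kind of geometric information the abstract describes extracting.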

[31] iWatchRoad: Scalable Detection and Geospatial Visualization of Potholes for Smart Cities

Rishi Raj Sahoo,Surbhi Saswati Mohanty,Subhankar Mishra

Main category: cs.CV

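TL;DR: iWatchRoad is an end-to-end system for real-time pothole detection and geospatial visualization via OpenStreetMap, designed for India's diverse road conditions.

Details Motivation: India's roads are diverse and under-maintained, and potholes pose a serious threat to safety and vehicle longevity. The paper aims to build a cost-effective, scalable system for automated pothole detection and geotagging.

Contribution: 1. Presents the iWatchRoad system, covering detection, GPS tagging, and real-time mapping. 2. Curates a self-annotated dataset of over 7,000 frames. 3. Uses a YOLO model and an OCR module for efficient detection and timestamp extraction.

Method: 1. Collect and annotate dashcam footage. 2. Detect potholes with a fine-tuned YOLO model. 3. Extract timestamps with OCR and synchronize them with GPS logs. 4. Store the data and visualize it via OpenStreetMap.

Result: The system improves detection accuracy under challenging conditions and produces government-compatible outputs for road assessment and maintenance planning.

Insight: The system gives developing regions a low-cost, efficient tool for automated road management with broad application potential.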

Abstract: Potholes on the roads are a serious hazard and maintenance burden. They pose a significant threat to road safety and vehicle longevity, especially on the diverse and under-maintained roads of India. In this paper, we present a complete end-to-end system called iWatchRoad for automated pothole detection, Global Positioning System (GPS) tagging, and real-time mapping using OpenStreetMap (OSM). We curated a large, self-annotated dataset of over 7,000 frames captured across various road types, lighting conditions, and weather scenarios unique to Indian environments, leveraging dashcam footage. This dataset is used to fine-tune an Ultralytics You Only Look Once (YOLO) model to perform real-time pothole detection, while a custom Optical Character Recognition (OCR) module is employed to extract timestamps directly from video frames. The timestamps are synchronized with GPS logs to geotag each detected pothole accurately. The processed data, including the potholes’ details and frames as metadata, is stored in a database and visualized via a user-friendly web interface using OSM. iWatchRoad not only improves detection accuracy under challenging conditions but also provides government-compatible outputs for road assessment and maintenance planning through the metadata visible on the website. Our solution is cost-effective, hardware-efficient, and scalable, offering a practical, automated tool for urban and rural road management in developing regions. iWatchRoad is available at https://smlab.niser.ac.in/project/iwatchroad
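
The timestamp-to-GPS synchronization step can be sketched as a nearest-fix lookup. The sketch assumes a GPS log sorted by time, with each fix stored as `(timestamp_seconds, latitude, longitude)`, and rejects matches further apart than `max_gap` seconds. The function name, the log layout, and the threshold are hypothetical; the abstract does not describe how iWatchRoad performs the matching.

```python
from bisect import bisect_left


def geotag(detection_time, gps_log, max_gap=2.0):
    """Return the (lat, lon) of the GPS fix nearest in time to a detection.

    gps_log must be sorted by timestamp. Returns None when the log is empty
    or the nearest fix is more than `max_gap` seconds away.
    """
    times = [t for t, _, _ in gps_log]
    i = bisect_left(times, detection_time)
    # Only the fixes just before and just after the detection can be nearest.
    candidates = gps_log[max(i - 1, 0): i + 1]
    if not candidates:
        return None
    t, lat, lon = min(candidates, key=lambda fix: abs(fix[0] - detection_time))
    if abs(t - detection_time) > max_gap:
        return None
    return (lat, lon)


if __name__ == "__main__":
    log = [(0.0, 20.000, 85.000), (1.0, 20.001, 85.001), (2.0, 20.002, 85.002)]
    print(geotag(1.2, log))   # nearest fix is the one at t=1.0
    print(geotag(10.0, log))  # None: no fix within max_gap
```

Rejecting distant matches avoids attaching a stale position to a detection when the GPS log has gaps, at the cost of leaving some detections untagged.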

[32] IPG: Incremental Patch Generation for Generalized Adversarial Patch Training

Wonho Lee,Hyunsik Na,Jisu Lee,Daeseon Choi

Main category: cs.CV

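TL;DR: This paper proposes Incremental Patch Generation (IPG), a method for generating adversarial patches efficiently while covering a broader range of model vulnerabilities.

Details Motivation: Adversarial patches challenge the robustness of AI models, particularly in computer vision tasks such as object detection. The paper targets the inefficiency of existing adversarial patch generation methods.

Contribution: Proposes IPG, which generates adversarial patches up to 11.1 times more efficiently than existing approaches while maintaining comparable attack performance.

Method: Validates IPG through experiments and ablation studies, including visualization of YOLO's feature distribution and adversarial training, showing that it produces well-generalized patches that cover a broad range of vulnerabilities.

Result: IPG generates patches efficiently with attack performance comparable to existing methods, and the generated patches can be used to build more robust models.

Insight: Beyond adversarial patch defense, IPG could be applied in demanding domains such as autonomous vehicles, security systems, and medical imaging, where models must stay resilient in dynamic, high-stakes environments.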

Abstract: The advent of adversarial patches poses a significant challenge to the robustness of AI models, particularly in the domain of computer vision tasks such as object detection. In contradistinction to traditional adversarial examples, these patches target specific regions of an image, resulting in the malfunction of AI models. This paper proposes Incremental Patch Generation (IPG), a method that generates adversarial patches up to 11.1 times more efficiently than existing approaches while maintaining comparable attack performance. The efficacy of IPG is demonstrated by experiments and ablation studies including YOLO’s feature distribution visualization and adversarial training results, which show that it produces well-generalized patches that effectively cover a broader range of model vulnerabilities. Furthermore, IPG-generated datasets can serve as a robust knowledge foundation for constructing a robust model, enabling structured representation, advanced reasoning, and proactive defenses in AI security ecosystems. The findings of this study suggest that IPG has considerable potential for future utilization not only in adversarial patch defense but also in real-world applications such as autonomous vehicles, security systems, and medical imaging, where AI models must remain resilient to adversarial attacks in dynamic and high-stakes environments.

[33] MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text

Ronghao Xu,Zhen Huang,Yangbo Wei,Xiaoqian Zhou,Zikang Xu,Ting Liu,Zihang Jiang,S. Kevin Zhou

Main category: cs.CV

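TL;DR: MedAtlas is a novel multimodal medical benchmark framework for evaluating large language models on realistic medical reasoning tasks, featuring multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity.

Details Motivation: Existing multimodal medical benchmarks are typically limited to single-image, single-turn tasks and fail to capture the multi-modal, longitudinal interaction inherent to clinical practice. MedAtlas fills this gap.

Contribution: Proposes the MedAtlas framework, which supports tasks such as multi-turn question answering and multi-image joint reasoning, and introduces two new evaluation metrics: Round Chain Accuracy and Error Propagation Resistance.

Method: MedAtlas draws its cases from real diagnostic workflows, combining multiple imaging modalities (CT, MRI, and others) with clinical text and requiring models to reason jointly across them.

Result: Benchmark results with existing multimodal models reveal substantial performance gaps in multi-stage clinical reasoning.

Insight: MedAtlas provides a challenging evaluation platform for developing robust, trustworthy medical AI and advances research on multimodal medical reasoning.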

Abstract: Artificial intelligence has demonstrated significant potential in clinical decision-making; however, developing models capable of adapting to diverse real-world scenarios and performing complex diagnostic reasoning remains a major challenge. Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks, lacking multi-modal medical image integration and failing to capture the longitudinal and multi-modal interactive nature inherent to clinical practice. To address this gap, we introduce MedAtlas, a novel benchmark framework designed to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. It supports four core tasks: open-ended multi-turn question answering, closed-ended multi-turn question answering, multi-image joint reasoning, and comprehensive disease diagnosis. Each case is derived from real diagnostic workflows and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray, requiring models to perform deep integrative reasoning across images and clinical texts. MedAtlas provides expert-annotated gold standards for all tasks. Furthermore, we propose two novel evaluation metrics: Round Chain Accuracy and Error Propagation Resistance. Benchmark results with existing multi-modal models reveal substantial performance gaps in multi-stage clinical reasoning. MedAtlas establishes a challenging evaluation platform to advance the development of robust and trustworthy medical AI.

[34] From Promise to Practical Reality: Transforming Diffusion MRI Analysis with Fast Deep Learning Enhancement

Xinyi Wang,Michael Barnett,Frederique Boonstra,Yael Barnett,Mariano Cabezas,Arkiev D’Souza,Matthew C. Kiernan,Kain Kyle,Meng Law,Lynette Masters,Zihao Tang,Stephen Tisch,Sicong Tu,Anneke Van Der Walt,Dongang Wang,Fernando Calamante,Weidong Cai,Chenyu Wang

Main category: cs.CV

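TL;DR: The paper presents FastFOD-Net, a fast deep learning framework for enhancing fiber orientation distribution (FOD) analysis in diffusion MRI, and validates its clinical applicability in healthy subjects and six neurological disorders.

Details Motivation: The reliability and accuracy of FOD analysis in diffusion MRI depend on MRI acquisition quality and FOD estimation, yet existing methods have mainly been assessed on healthy subjects, which limits their clinical adoption.

Contribution: Proposes FastFOD-Net, an efficient end-to-end deep learning framework that markedly improves the performance and efficiency of FOD enhancement, and validates it on a broad range of clinical data.

Method: FastFOD-Net uses deep learning to enhance FODs, with training and inference 60 times faster than its predecessor.

Result: On clinical data, FastFOD-Net supports disease differentiation, improves interpretability in connectome applications, and reduces measurement errors.

Insight: The work promotes clinical adoption of deep learning for diffusion MRI analysis and strengthens the ability to analyze lower-quality clinical data.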

Abstract: Fiber orientation distribution (FOD) is an advanced diffusion MRI modeling technique that represents complex white matter fiber configurations, and a key step for subsequent brain tractography and connectome analysis. Its reliability and accuracy, however, heavily rely on the quality of the MRI acquisition and the subsequent estimation of the FODs at each voxel. Generating reliable FODs from widely available clinical protocols with single-shell and low-angular-resolution acquisitions remains challenging but could potentially be addressed with recent advances in deep learning-based enhancement techniques. Despite advancements, existing methods have predominantly been assessed on healthy subjects, which has proved to be a major hurdle for their clinical adoption. In this work, we validate a newly optimized enhancement framework, FastFOD-Net, across healthy controls and six neurological disorders. This accelerated end-to-end deep learning framework enhances FODs with superior performance and delivers training/inference efficiency for clinical use ($60\times$ faster compared to its predecessor). With the most comprehensive clinical evaluation to date, our work demonstrates the potential of FastFOD-Net in accelerating clinical neuroscience research, empowering diffusion MRI analysis for disease differentiation, improving interpretability in connectome applications, and reducing measurement errors to lower sample size requirements. Critically, this work will facilitate the more widespread adoption of, and build clinical trust in, deep learning based methods for diffusion MRI enhancement. Specifically, FastFOD-Net enables robust analysis of real-world, clinical diffusion MRI data, comparable to that achievable with high-quality research acquisitions.

[35] Empowering Multimodal LLMs with External Tools: A Comprehensive Survey

Wenbin An,Jiahao Nie,Yaqiang Wu,Feng Tian,Shijian Lu,Qinghua Zheng

Main category: cs.CV

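TL;DR: This paper comprehensively surveys how external tools can enhance multimodal large language models (MLLMs), covering data acquisition, task performance, evaluation methods, and future directions.

Details Motivation: MLLMs perform well on multimodal tasks but still face low data quality, weak performance on complex tasks, and inadequate evaluation protocols. Inspired by how humans use external tools to strengthen reasoning, the paper explores the potential of external tools for improving MLLMs.

Contribution: Systematically summarizes the use of external tools in MLLMs, including data acquisition and annotation, task performance improvement, and better evaluation, and identifies current limitations and future directions.

Method: Analyzes the role of external tools along four dimensions: (1) acquiring and annotating high-quality multimodal data; (2) improving performance on complex downstream tasks; (3) comprehensively evaluating MLLMs; (4) current limitations and future outlook.

Result: The paper highlights the transformative potential of external tools for advancing MLLM capabilities and offers a forward-looking perspective on tool-augmented MLLMs.

Insight: External tools can compensate for MLLM shortcomings in data, tasks, and evaluation; deeper integration of tools and models remains to be explored.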

Abstract: By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available at https://github.com/Lackel/Awesome-Tools-for-MLLMs.

[36] ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks

Abhishek Kolari,Mohammadhossein Khojasteh,Yifan Jiang,Floris den Hengst,Filip Ilievski

Main category: cs.CV

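TL;DR: ORBIT is a new visual reasoning benchmark focused on object property reasoning, designed to evaluate vision-language models (VLMs) across multiple reasoning levels. Experiments show that current VLMs perform poorly on complex reasoning tasks.

Details Motivation: Current visual question answering (VQA) benchmarks fall short on object property reasoning, lacking complexity and representativeness. ORBIT aims to fill this gap with a more systematic evaluation framework.

Contribution: Proposes the ORBIT benchmark, comprising 360 images and 1,080 count-based questions that cover three image types, three reasoning levels of increasing complexity, and four object property dimensions.

Method: Designs a hierarchical question generation procedure covering both perception and reasoning, and evaluates 12 VLMs in zero-shot settings.

Result: The best model reaches only 40% accuracy; VLMs perform poorly on realistic images, counterfactual reasoning, and higher counts.

Insight: ORBIT exposes the limitations of VLMs in complex reasoning; future work should develop more scalable benchmarking methods and stronger reasoning models.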

Abstract: While vision-language models (VLMs) have made remarkable progress on many popular visual question answering (VQA) benchmarks, it remains unclear whether they abstract and reason over depicted objects. Inspired by human object categorisation, object property reasoning involves identifying and recognising low-level details and higher-level abstractions. While current VQA benchmarks consider a limited set of object property attributes like size, they typically blend perception and reasoning, and lack representativeness in terms of reasoning and image categories. To this end, we introduce a systematic evaluation framework with images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions driven by prior work on commonsense reasoning. We develop a procedure to instantiate this benchmark into ORBIT, a multi-level reasoning VQA benchmark for object properties comprising 360 images paired with a total of 1,080 count-based questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations compared to humans, with the best-performing model only reaching 40% accuracy. VLMs struggle particularly with realistic (photographic) images, counterfactual reasoning about physical and functional properties, and higher counts. ORBIT points to the need to develop methods for scalable benchmarking, generalize annotation guidelines, and explore additional reasoning VLMs. We make the ORBIT benchmark and the experimental code available to support such endeavors.

[37] CSNR and JMIM Based Spectral Band Selection for Reducing Metamerism in Urban Driving

Jiarong Li,Imad Ali Shah,Diarmaid Geever,Fiachra Collins,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan

Main category: cs.CV

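TL;DR: The paper proposes a hyperspectral band selection method based on CSNR and JMIM to reduce metamerism in urban driving and improve perception of vulnerable road users (VRU).

Details Motivation: In urban driving, metamerism makes distinct materials look similar in RGB images, which makes VRUs harder to identify. Hyperspectral imaging (HSI) captures distinctive material signatures beyond the visible spectrum, but its high-dimensional data must be handled efficiently.

Contribution: Proposes a band selection method that combines information-theoretic techniques (joint mutual information maximization, JMIM; correlation analysis) with an image quality metric (contrast signal-to-noise ratio, CSNR), markedly reducing metamerism and improving VRU separability.

Method: Uses JMIM and CSNR to select the most informative bands from hyperspectral data (497 nm, 607 nm, and 895 nm, ±27 nm), then reconstructs pseudo-color images for comparison with RGB images.

Result: The selected bands improve the dissimilarity metrics (Euclidean, SAM, T²) and the perceptual metric (CIE ΔE) by 70.24%, 528.46%, 1206.83%, and 246.62%, respectively, clearly outperforming RGB.

Insight: Optimized hyperspectral band selection provides a more robust perception input for ADAS and autonomous driving, contributing to road safety.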

Abstract: Protecting Vulnerable Road Users (VRU) is a critical safety challenge for automotive perception systems, particularly under visual ambiguity caused by metamerism, a phenomenon where distinct materials appear similar in RGB imagery. This work investigates hyperspectral imaging (HSI) to overcome this limitation by capturing unique material signatures beyond the visible spectrum, especially in the Near-Infrared (NIR). To manage the inherent high-dimensionality of HSI data, we propose a band selection strategy that integrates information theory techniques (joint mutual information maximization, correlation analysis) with a novel application of an image quality metric (contrast signal-to-noise ratio) to identify the most spectrally informative bands. Using the Hyperspectral City V2 (H-City) dataset, we identify three informative bands (497 nm, 607 nm, and 895 nm, $\pm$27 nm) and reconstruct pseudo-color images for comparison with co-registered RGB. Quantitative results demonstrate increased dissimilarity and perceptual separability of VRU from the background. The selected HSI bands yield improvements of 70.24%, 528.46%, 1206.83%, and 246.62% for dissimilarity (Euclidean, SAM, $T^2$) and perception (CIE $\Delta E$) metrics, consistently outperforming RGB and confirming a marked reduction in metameric confusion. By providing a spectrally optimized input, our method enhances VRU separability, establishing a robust foundation for downstream perception tasks in Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD), ultimately contributing to improved road safety.
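Two of the dissimilarity metrics named in this abstract, Euclidean distance and the Spectral Angle Mapper (SAM), have standard definitions that can be sketched in a few lines. The reflectance values below are hypothetical and chosen only to illustrate how a near-infrared band can separate two materials that look alike in RGB; they are not taken from the H-City dataset or from the paper.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two spectra sampled at the same bands."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spectral_angle(a, b):
    """Spectral Angle Mapper (SAM): angle in radians between two spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("SAM is undefined for an all-zero spectrum")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)

# Hypothetical values, for illustration only.
# RGB-like triplets: two materials that look nearly identical (metamers).
pedestrian_rgb = [0.30, 0.31, 0.29]
background_rgb = [0.31, 0.30, 0.29]
# Three-band triplets whose last band stands in for a near-infrared band.
pedestrian_hsi = [0.30, 0.31, 0.70]
background_hsi = [0.31, 0.30, 0.20]

rgb_angle = spectral_angle(pedestrian_rgb, background_rgb)
hsi_angle = spectral_angle(pedestrian_hsi, background_hsi)
rgb_distance = euclidean_distance(pedestrian_rgb, background_rgb)
hsi_distance = euclidean_distance(pedestrian_hsi, background_hsi)

print(f"SAM: RGB {rgb_angle:.3f} rad vs selected bands {hsi_angle:.3f} rad")
print(f"Euclidean: RGB {rgb_distance:.3f} vs selected bands {hsi_distance:.3f}")
```

A larger angle or distance between the VRU spectrum and the background spectrum means the two are easier to tell apart, which is the sense in which the abstract reports increased dissimilarity.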

[38] EVCtrl: Efficient Control Adapter for Visual Generation

Zixiang Yang,Yue Ma,Yinhan Zhang,Shanhui Mo,Dongrui Liu,Linfeng Zhang

Main category: cs.CV

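TL;DR: EVCtrl is a lightweight, plug-and-play control adapter that uses a spatio-temporal dual caching strategy to cut redundant computation, substantially improving efficiency in video and image generation without retraining the model.

Details Motivation: Existing methods such as ControlNet introduce redundant computation in controllable generation, causing high latency, especially for video. EVCtrl aims to remove this redundancy.

Contribution: Proposes the EVCtrl adapter, which uses a spatio-temporal dual caching strategy, partitioning the network into global and local functional zones and selectively skipping denoising steps, to markedly reduce computational overhead.

Method: 1. Spatial redundancy: profile how network layers respond to control, partition them into global and local functional zones, and apply a locality-aware cache; 2. Temporal redundancy: selectively skip unnecessary denoising steps.

Result: Achieves 2.16 and 2.05 times speedups on CogVideo-Controlnet and Wan2.1-Controlnet, respectively, with almost no loss in generation quality.

Insight: Profiling how network layers respond to control, then localizing computation and dynamically skipping redundant steps, is an effective route to more efficient controllable generation.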

Abstract: Visual generation includes both image and video generation, training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. While ControlNet grants precise spatial-temporal control, its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that slashes overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained control, then partition the network into global and local functional zones. A locality-aware cache focuses computation on the local zones that truly need the control signal, skipping the bulk of redundant computation in global regions. For temporal redundancy, we selectively omit unnecessary denoising steps to improve efficiency. Extensive experiments on CogVideo-Controlnet, Wan2.1-Controlnet, and Flux demonstrate that our method is effective in image and video control generation without the need for training. For example, it achieves 2.16 and 2.05 times speedups on CogVideo-Controlnet and Wan2.1-Controlnet, respectively, with almost no degradation in generation quality. Code is available in the supplementary materials.
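The temporal half of the caching idea can be illustrated with a generic step cache: the expensive control branch is recomputed only on some denoising steps and its cached output is reused on the others. This is a minimal sketch of that general pattern, not EVCtrl's algorithm. The fixed `refresh_every` schedule and the names `run_denoising` and `control_branch` are assumptions made for illustration; the abstract does not say how steps are selected.

```python
def run_denoising(num_steps, control_branch, refresh_every):
    """Toy denoising loop that reuses a cached control signal.

    The control branch is evaluated only when `step % refresh_every == 0`;
    all other steps reuse the most recent cached value. The fixed interval
    is a hypothetical schedule chosen for illustration.
    """
    cached = None
    calls = 0
    signals = []
    for step in range(num_steps):
        if cached is None or step % refresh_every == 0:
            cached = control_branch(step)  # expensive path
            calls += 1
        signals.append(cached)  # cheap path: reuse the cached signal
    return signals, calls

def toy_control_branch(step):
    """Stand-in for an expensive control network (hypothetical)."""
    return step * 0.1

signals, calls = run_denoising(num_steps=10, control_branch=toy_control_branch,
                               refresh_every=3)
print(f"control branch evaluated {calls} times over {len(signals)} steps")
```

With ten steps and a refresh interval of three, the branch runs on steps 0, 3, 6, and 9, so four evaluations serve ten steps. The trade-off is that reused signals are slightly stale, which is why a real method needs a criterion for which steps are safe to skip.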

[39] Not There Yet: Evaluating Vision Language Models in Simulating the Visual Perception of People with Low Vision

Rosiana Natalie,Wenqian Xu,Ruei-Che Chang,Rada Mihalcea,Anhong Guo

Main category: cs.CV

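TL;DR: The paper evaluates how well vision language models (VLMs) can simulate the visual perception of people with low vision, building a benchmark dataset and testing how different prompting methods affect model performance.

Details Motivation: Prior research has not examined the simulation capabilities of VLMs in the accessibility domain, particularly for the visual perception of people with low vision. This evaluation fills that gap.

Contribution: 1. Builds a benchmark dataset from a survey of 40 participants with low vision; 2. Proposes a method for simulating the visual perception of individuals with low vision and shows experimentally that prompt design matters.

Method: 1. Collect participants' vision information and image perception responses; 2. Construct simulated agents with different prompts (vision information, example image responses, or both); 3. Evaluate agreement between model responses and participants' original responses.

Result: Agreement is low (0.59) when only vision information or only example image responses are provided, and rises significantly (0.70) when both are combined. A single example combining open-ended and multiple-choice responses outperforms either type alone, and additional examples bring minimal benefit.

Insight: Simulating low vision perception with VLMs requires combining specific vision information with example responses, but current models still tend to infer beyond the specified vision ability; prompt design and model capability need further improvement.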

Abstract: Advances in vision language models (VLMs) have enabled the simulation of general human behavior through their reasoning and problem solving capabilities. However, prior research has not investigated such simulation capabilities in the accessibility domain. In this paper, we evaluate the extent to which VLMs can simulate the visual perception of low vision individuals when interpreting images. We first compile a benchmark dataset through a survey study with 40 low vision participants, collecting their brief and detailed vision information and both open-ended and multiple-choice image perception and recognition responses to up to 25 images. Using these responses, we construct prompts for VLMs (GPT-4o) to create simulated agents of each participant, varying the included information on vision information and example image responses. We evaluate the agreement between VLM-generated responses and participants’ original answers. Our results indicate that VLMs tend to infer beyond the specified vision ability when given minimal prompts, resulting in low agreement (0.59). The agreement between the agents’ and participants’ responses remains low when only either the vision information (0.59) or example image responses (0.59) are provided, whereas a combination of both significantly increases the agreement (0.70, p < 0.0001). Notably, a single example combining both open-ended and multiple-choice responses offers significant performance improvements over either alone (p < 0.0001), while additional examples provided minimal benefits (p > 0.05).

[40] Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?

Xuezheng Chen,Zhengbo Zou

Main category: cs.CV

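TL;DR: The paper introduces ConstructionSite 10k, a dataset of 10,000 construction site images for evaluating and fine-tuning vision language models on construction safety inspection, and reports the generalization ability of existing VLMs in zero-shot and few-shot settings.

Details Motivation: Construction safety inspections usually rely on human inspectors, and the rise of vision language models (VLMs) has prompted interest in applying them to automatic detection of safety issues. However, the lack of open datasets limits comprehensive evaluation and application of VLMs in this area.

Contribution: Proposes ConstructionSite 10k, with 10,000 construction site images supporting three tasks (image captioning, safety rule violation VQA, and construction element visual grounding), providing a benchmark for construction safety inspection.

Method: Builds a large-scale dataset and evaluates existing pre-trained VLMs in zero-shot and few-shot settings to assess generalization, noting that further training is needed for real-world use.

Result: Current state-of-the-art VLMs show notable generalization in zero-shot and few-shot settings but need additional training before they can be applied on actual construction sites.

Insight: The dataset gives researchers a benchmark for training and evaluating VLMs with new architectures and techniques, supporting further development and practical use of VLMs in construction safety inspection.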

Abstract: Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.

[41] Can Multi-modal (reasoning) LLMs detect document manipulation?

Zisheng Liang,Kidus Zewde,Rudra Pratap Singh,Disha Patil,Zexi Chen,Jiayu Xue,Yao Yao,Yifei Chen,Qinzhe Liu,Simiao Ren

Main category: cs.CV

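TL;DR: The paper studies how well multi-modal large language models (LLMs) detect document fraud, finding that some advanced models surpass conventional methods in zero-shot generalization, while model size correlates only weakly with detection accuracy.

Details Motivation: Document fraud seriously threatens industries that rely on secure documentation, so effective detection mechanisms are needed. The paper evaluates the potential of multi-modal LLMs for this task.

Contribution: The main contributions are: 1) a comprehensive benchmark of several state-of-the-art multi-modal LLMs on document fraud detection; 2) an analysis of the relationship between model reasoning capability and detection accuracy; 3) a foundation for future interpretable and scalable fraud detection strategies.

Method: Through prompt optimization and analysis of the models' reasoning processes, the paper tests multi-modal LLMs on fraud indicators such as tampered text, misaligned formatting, and inconsistent transactional sums, using a standard dataset for comparative evaluation.

Result: Top-performing multi-modal LLMs outperform conventional methods in zero-shot generalization, but some models perform inconsistently. Model size correlates only weakly with detection accuracy, suggesting that task-specific fine-tuning matters more.

Insight: Although multi-modal LLMs show promise for document fraud detection, their performance does not depend mainly on model scale or advanced reasoning; task-specific optimization is more decisive. Future research should also address model interpretability.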

Abstract: Document fraud poses a significant threat to industries reliant on secure and verifiable documentation, necessitating robust detection mechanisms. This study investigates the efficacy of state-of-the-art multi-modal large language models (LLMs), including OpenAI O1, OpenAI 4o, Gemini Flash (thinking), Deepseek Janus, Grok, Llama 3.2 and 4, Qwen 2 and 2.5 VL, Mistral Pixtral, and Claude 3.5 and 3.7 Sonnet, in detecting fraudulent documents. We benchmark these models against each other and prior work on document fraud detection techniques using a standard dataset with real transactional documents. Through prompt optimization and detailed analysis of the models’ reasoning processes, we evaluate their ability to identify subtle indicators of fraud, such as tampered text, misaligned formatting, and inconsistent transactional sums. Our results reveal that top-performing multi-modal LLMs demonstrate superior zero-shot generalization, outperforming conventional methods on out-of-distribution datasets, while several vision LLMs exhibit inconsistent or subpar performance. Notably, model size and advanced reasoning capabilities show limited correlation with detection accuracy, suggesting task-specific fine-tuning is critical. This study underscores the potential of multi-modal LLMs in enhancing document fraud detection systems and provides a foundation for future research into interpretable and scalable fraud mitigation strategies.

[42] MedSAMix: A Training-Free Model Merging Approach for Medical Image Segmentation

Yanwu Yang,Guinan Su,Jiesi Hu,Francesco Sammarco,Jonas Geiping,Thomas Wolfers

Main category: cs.CV

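TL;DR: The paper proposes MedSAMix, a training-free model merging method that combines the strengths of generalist models (e.g., SAM) and specialist models (e.g., MedSAM) for medical image segmentation, using zero-order optimization to automatically discover optimal layer-wise merging solutions.

Details Motivation: The generalizability of medical image segmentation models is limited by data heterogeneity, scarce annotations, and distribution shifts, while traditional model merging relies on manual configuration and often yields suboptimal results.

Contribution: 1. Proposes MedSAMix, a training-free model merging method; 2. Develops a zero-order optimization method that automatically finds optimal layer-wise merging solutions; 3. Designs two regimes, single-task optimization and multi-objective optimization, to meet clinical needs.

Method: Zero-order optimization automatically discovers optimal layer-wise merging solutions, with single-task and multi-objective optimization regimes for different scenarios.

Result: Evaluated on 25 medical segmentation tasks, MedSAMix improves performance by 6.67% on specialized tasks and by 4.37% on multi-task evaluations.

Insight: Combining the strengths of generalist and specialist models effectively mitigates model bias and improves both domain specificity and generalizability in medical image segmentation.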

Abstract: Universal medical image segmentation models have emerged as a promising paradigm due to their strong generalizability across diverse tasks, showing great potential for a wide range of clinical applications. This potential has been partly driven by the success of general-purpose vision models such as the Segment Anything Model (SAM), which has inspired the development of various fine-tuned variants for medical segmentation tasks. However, fine-tuned variants like MedSAM are trained on comparatively limited medical imaging data that often suffers from heterogeneity, scarce annotations, and distributional shifts. These challenges limit their ability to generalize across a wide range of medical segmentation tasks. In this regard, we propose MedSAMix, a training-free model merging method that integrates the strengths of both generalist models (e.g., SAM) and specialist models (e.g., MedSAM) for medical image segmentation. In contrast to traditional model merging approaches that rely on manual configuration and often result in suboptimal outcomes, we propose a zero-order optimization method to automatically discover optimal layer-wise merging solutions. Furthermore, for clinical applications, we develop two regimes to meet the demand of domain-specificity and generalizability in different scenarios by single-task optimization and multi-objective optimization respectively. Extensive evaluations on 25 medical segmentation tasks demonstrate that MedSAMix effectively mitigates model bias and consistently improves performance in both domain-specific accuracy and generalization, achieving improvements of 6.67% on specialized tasks and 4.37% on multi-task evaluations.
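Layer-wise merging driven by a gradient-free search can be sketched on toy data. The example below is a sketch under stated assumptions, not MedSAMix itself: it assumes linear interpolation per layer and uses plain random search as a stand-in zero-order optimizer, since the abstract does not name the merging rule or the optimizer. The weights, the target, and the score function are all hypothetical.

```python
import random

def merge_layers(generalist, specialist, alphas):
    """Per-layer interpolation: alpha * specialist + (1 - alpha) * generalist."""
    return [
        [a * s + (1.0 - a) * g for g, s in zip(g_layer, s_layer)]
        for g_layer, s_layer, a in zip(generalist, specialist, alphas)
    ]

def random_search(generalist, specialist, score_fn, trials=200, seed=0):
    """Zero-order (gradient-free) search over per-layer coefficients.

    Only score evaluations are used; no gradients and no weight training.
    """
    rng = random.Random(seed)
    n_layers = len(generalist)
    best_alphas = [0.5] * n_layers  # start from uniform averaging
    best_score = score_fn(merge_layers(generalist, specialist, best_alphas))
    for _ in range(trials):
        candidate = [rng.random() for _ in range(n_layers)]
        score = score_fn(merge_layers(generalist, specialist, candidate))
        if score > best_score:
            best_alphas, best_score = candidate, score
    return best_alphas, best_score

# Hypothetical two-layer "models" and a toy validation score.
generalist = [[0.0, 0.0], [1.0, 1.0]]
specialist = [[1.0, 1.0], [0.0, 0.0]]
target = [[0.8, 0.8], [0.7, 0.7]]

def toy_score(merged):
    """Negative squared error to the toy target (higher is better)."""
    return -sum((m - t) ** 2
                for m_layer, t_layer in zip(merged, target)
                for m, t in zip(m_layer, t_layer))

baseline_score = toy_score(merge_layers(generalist, specialist, [0.5, 0.5]))
best_alphas, best_score = random_search(generalist, specialist, toy_score)
print(f"uniform averaging: {baseline_score:.3f}, searched: {best_score:.3f}")
```

In practice the score would come from a validation metric such as segmentation quality on held-out cases, and a multi-objective regime would replace the single score with several task scores.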

[43] Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset

Wentao Mo,Qingchao Chen,Yuxin Peng,Siyuan Huang,Yang Liu

Main category: cs.CV

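TL;DR: MV-ScanQA and TripAlign address the shortcomings of existing 3D VL datasets in multi-view reasoning and contextual alignment, advancing 3D scene understanding through multi-view questions and multi-object alignment data.

Details Motivation: Limitations of existing 3D VL datasets (such as single-view, single-object annotations) hinder the development of models for multi-view, contextually aligned 3D scene understanding, so richer data is needed.

Contribution: 1) The MV-ScanQA dataset, in which 68% of questions require multi-view reasoning; 2) the TripAlign pre-training dataset, which provides contextual multi-object 2D-3D-language alignment; 3) LEGO, a baseline method that transfers knowledge from 2D LVLMs to the 3D domain.

Method: 1) Build MV-ScanQA to test multi-view reasoning; 2) achieve multi-object alignment with the TripAlign pre-training dataset (1M triplets); 3) propose LEGO, which is pre-trained on TripAlign to improve performance.

Result: LEGO reaches state-of-the-art performance on MV-ScanQA and on existing 3D dense captioning and question answering benchmarks.

Insight: Multi-view reasoning and contextual multi-object alignment are key to improving 3D scene understanding.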

Abstract: The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly align groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.

[44] Data-Driven Abdominal Phenotypes of Type 2 Diabetes in Lean, Overweight, and Obese Cohorts

Lucas W. Remedios,Chloe Choe,Trent M. Schwartz,Dingjie Su,Gaurav Rudravaram,Chenyu Gao,Aravind R. Krishnan,Adam M. Saunders,Michael E. Kim,Shunxing Bao,Alvin C. Powers,Bennett A. Landman,John Virostko

Main category: cs.CV

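TL;DR: The paper uses AI to extract detailed measurements of abdominal structures and fat content from 3D clinical imaging, and uses random forests and SHAP analysis to reveal abdominal phenotypes of type 2 diabetes across weight classes.

Details Motivation: Although high BMI is a known risk factor for type 2 diabetes, the disease's presence in some lean adults and absence in some adults with obesity suggests that detailed body composition may reveal abdominal phenotypes.

Contribution: Using AI and large-scale clinical data, defines body composition signatures linked to type 2 diabetes risk and protection.

Method: Extracts measurements from clinical abdominal CT scans through segmentation, classifies type 2 diabetes with a cross-validated random forest, uses SHAP analysis to measure how features contribute to risk, and links model decision patterns back to anatomical differences through clustering and classification.

Result: The random forest models achieve mean AUCs of 0.72-0.74 and reveal type 2 diabetes signatures shared across weight classes, such as fatty skeletal muscle and greater visceral and subcutaneous fat.

Insight: Abdominal drivers of type 2 diabetes may be consistent across weight classes, suggesting new directions for risk prediction.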

Abstract: Purpose: Although elevated BMI is a well-known risk factor for type 2 diabetes, the disease’s presence in some lean adults and absence in others with obesity suggests that detailed body composition may uncover abdominal phenotypes of type 2 diabetes. With AI, we can now extract detailed measurements of size, shape, and fat content from abdominal structures in 3D clinical imaging at scale. This creates an opportunity to empirically define body composition signatures linked to type 2 diabetes risk and protection using large-scale clinical data. Approach: To uncover BMI-specific diabetic abdominal patterns from clinical CT, we applied our design four times: once on the full cohort (n = 1,728) and once each on the lean (n = 497), overweight (n = 611), and obese (n = 620) subgroups. Briefly, our experimental design transforms abdominal scans into collections of explainable measurements through segmentation, classifies type 2 diabetes through a cross-validated random forest, measures how features contribute to model-estimated risk or protection through SHAP analysis, groups scans by shared model decision patterns (clustering from SHAP), and links back to anatomical differences (classification). Results: The random forests achieved mean AUCs of 0.72-0.74. There were shared type 2 diabetes signatures in each group: fatty skeletal muscle, older age, greater visceral and subcutaneous fat, and a smaller or fat-laden pancreas. Univariate logistic regression confirmed the direction of 14-18 of the top 20 predictors within each subgroup (p < 0.05). Conclusions: Our findings suggest that abdominal drivers of type 2 diabetes may be consistent across weight classes.

[45] LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction

Maoquan Zhang,Bisser Raytchev,Xiujuan Sun

Main category: cs.CV

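TL;DR: LEARN is a diffusion-based layout-to-image generation framework designed for STEM education; it generates instructional illustrations from story-driven layouts, improving semantic alignment and cognitive outcomes.

Details Motivation: STEM education needs clear instructional illustrations to convey complex concepts, but traditional methods struggle to generate semantically aligned visual content dynamically. LEARN addresses this by generating coherent visual sequences that support higher-order cognitive learning.

Contribution: 1. Proposes LEARN, described as the first layout-aware diffusion framework for educational illustration generation. 2. Introduces the BookCover dataset, combining narrative layouts and visual cues. 3. Improves semantic alignment through layout-conditioned generation and contrastive learning, in line with Bloom's taxonomy.

Method: 1. Layout-conditioned diffusion model: generates images that follow the instructional narrative. 2. Contrastive visual-semantic training: strengthens alignment between images and text. 3. Prompt modulation: dynamically adjusts generated content to suit different cognitive needs.

Result: LEARN generates coherent instructional illustrations, reduces cognitive load, and supports higher-order reasoning. Experiments show that it outperforms traditional methods in semantic alignment and educational effectiveness.

Insight: 1. Story-driven layout generation can improve the coherence of educational content. 2. Combining diffusion models with cognitive theory opens a new direction for AI in education. 3. The approach could be extended into multimodal adaptive systems.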

Abstract: LEARN is a layout-aware diffusion framework designed to generate pedagogically aligned illustrations for STEM education. It leverages a curated BookCover dataset that provides narrative layouts and structured visual cues, enabling the model to depict abstract and sequential scientific concepts with strong semantic alignment. Through layout-conditioned generation, contrastive visual-semantic training, and prompt modulation, LEARN produces coherent visual sequences that support mid-to-high-level reasoning in line with Bloom’s taxonomy while reducing extraneous cognitive load as emphasized by Cognitive Load Theory. By fostering spatially organized and story-driven narratives, the framework counters fragmented attention often induced by short-form media and promotes sustained conceptual focus. Beyond static diagrams, LEARN demonstrates potential for integration with multimodal systems and curriculum-linked knowledge graphs to create adaptive, exploratory educational content. As the first generative approach to unify layout-based storytelling, semantic structure learning, and cognitive scaffolding, LEARN represents a novel direction for generative AI in education. The code and dataset will be released to facilitate future research and practical deployment.

[46] Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models

Bing Liu,Le Wang,Mingming Liu,Hao Liu,Rui Yao,Yong Zhou,Peng Liu,Tongqiang Xia

Main category: cs.CV

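TL;DR: Proposes a semi-supervised image dehazing method based on Expectation-Maximization (EM) and Bidirectional Brownian Bridge Diffusion Models (B3DM), using a two-stage learning scheme to address the lack of paired data for real-world hazy images.

Details Motivation: Existing dehazing methods perform poorly on real-world scenes with thick haze, mainly because paired data and robust priors are lacking. The paper aims to avoid costly data collection.

Contribution: 1. Proposes the two-stage EM-B3DM learning method. 2. Introduces a detail-enhanced RDC block to improve the model's representation capability. 3. Performs strongly on synthetic and real-world datasets.

Method: 1. The first stage uses the EM algorithm to decouple the joint distribution of paired images into two conditional distributions, which are modeled with a Brownian Bridge diffusion model. 2. The second stage uses the pre-trained model and large-scale unpaired data to improve performance further.

Result: On synthetic and real-world datasets, EM-B3DM performs better than or at least comparably to state-of-the-art methods.

Insight: Combining semi-supervised learning with diffusion models can substantially reduce reliance on paired data while improving the robustness of dehazing models in complex scenes.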

Abstract: Existing dehazing methods deal with real-world haze images with difficulty, especially scenes with thick haze. One of the main reasons is the lack of real-world paired data and robust priors. To avoid the costly collection of paired hazy and clear images, we propose an efficient semi-supervised image dehazing method via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models (EM-B3DM) with a two-stage learning scheme. In the first stage, we employ the EM algorithm to decouple the joint distribution of paired hazy and clear images into two conditional distributions, which are then modeled using a unified Brownian Bridge diffusion model to directly capture the structural and content-related correlations between hazy and clear images. In the second stage, we leverage the pre-trained model and large-scale unpaired hazy and clear images to further improve the performance of image dehazing. Additionally, we introduce a detail-enhanced Residual Difference Convolution block (RDC) to capture gradient-level information, significantly enhancing the model’s representation capability. Extensive experiments demonstrate that our EM-B3DM achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.
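A Brownian bridge differs from a standard diffusion process in that both endpoints are pinned, which is what lets it connect a clear image to its hazy counterpart directly. The sketch below samples the textbook bridge marginal, with a mean that interpolates linearly between the endpoints and a variance proportional to m(1 - m) for m = t/T. It assumes this generic form and a hypothetical `sigma` scale; the paper's exact schedule and parameterization may differ.

```python
import math
import random

def brownian_bridge_sample(x0, y, t, total_steps, sigma=1.0, rng=None):
    """Sample an intermediate state of a Brownian bridge from x0 to y.

    mean = (1 - m) * x0 + m * y, variance = sigma^2 * m * (1 - m), m = t / T.
    The noise vanishes at both endpoints, so t = 0 returns x0 and
    t = total_steps returns y exactly.
    """
    if rng is None:
        rng = random.Random(0)
    m = t / total_steps
    std = sigma * math.sqrt(m * (1.0 - m))
    return [(1.0 - m) * a + m * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, y)]

# Hypothetical two-pixel "images", for illustration only.
clear = [0.2, 0.8]
hazy = [0.6, 0.4]

for t in (0, 5, 10):
    state = brownian_bridge_sample(clear, hazy, t, total_steps=10, sigma=0.1)
    print(t, [round(v, 3) for v in state])
```

Because the process is symmetric in its endpoints, the same construction can be run from hazy to clear or from clear to hazy, which is consistent with the bidirectional modeling the abstract describes.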

[47] VFM-Guided Semi-Supervised Detection Transformer for Source-Free Object Detection in Remote Sensing Images

Jianhong Han,Yupei Wang,Liang Chen

Main category: cs.CV

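TL;DR: For source-free object detection in remote sensing images, VG-DETR combines a vision foundation model (VFM) with a semi-supervised framework, proposing pseudo-label mining and dual-level alignment that markedly improve detection performance.

Details Motivation: In real remote sensing scenarios, privacy and transmission constraints make source data inaccessible, so existing unsupervised domain adaptation methods do not apply. SFOD is an alternative, but noisy pseudo-labels often cause training collapse, especially with dense objects and complex backgrounds.

Contribution: Proposes VG-DETR, which integrates a VFM to improve pseudo-label quality and uses dual-level alignment to strengthen feature robustness, addressing the noise problem in SFOD.

Method: 1. A VFM-guided pseudo-label mining strategy that uses the VFM's semantic priors to assess pseudo-label reliability; 2. Dual-level VFM-guided alignment of features at the instance and image levels.

Result: Experiments show that VG-DETR achieves superior performance on source-free remote sensing detection tasks.

Insight: Introducing a VFM offers a "free lunch" for source-free domain adaptation: a small amount of labeled data plus semantic priors can markedly improve pseudo-label and feature quality.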

Abstract: Unsupervised domain adaptation methods have been widely explored to bridge domain gaps. However, in real-world remote-sensing scenarios, privacy and transmission constraints often preclude access to source domain data, which limits their practical applicability. Recently, Source-Free Object Detection (SFOD) has emerged as a promising alternative, aiming at cross-domain adaptation without relying on source data, primarily through a self-training paradigm. Despite its potential, SFOD frequently suffers from training collapse caused by noisy pseudo-labels, especially in remote sensing imagery with dense objects and complex backgrounds. Considering that limited target domain annotations are often feasible in practice, we propose a Vision foundation-Guided DEtection TRansformer (VG-DETR), built upon a semi-supervised framework for SFOD in remote sensing images. VG-DETR integrates a Vision Foundation Model (VFM) into the training pipeline in a “free lunch” manner, leveraging a small amount of labeled target data to mitigate pseudo-label noise while improving the detector’s feature-extraction capability. Specifically, we introduce a VFM-guided pseudo-label mining strategy that leverages the VFM’s semantic priors to further assess the reliability of the generated pseudo-labels. By recovering potentially correct predictions from low-confidence outputs, our strategy improves pseudo-label quality and quantity. In addition, a dual-level VFM-guided alignment method is proposed, which aligns detector features with VFM embeddings at both the instance and image levels. Through contrastive learning among fine-grained prototypes and similarity matching between feature maps, this dual-level alignment further enhances the robustness of feature representations against domain gaps. Extensive experiments demonstrate that VG-DETR achieves superior performance in source-free remote sensing detection tasks.
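The idea of recovering potentially correct predictions from low-confidence outputs can be sketched as a two-tier filter: confident detections are kept outright, and less confident ones are kept only when a second signal agrees. This is a hypothetical illustration, not VG-DETR's procedure. The `vfm_similarity` field (imagined here as a similarity between a region's VFM embedding and a class prototype) and all three thresholds are assumptions.

```python
def mine_pseudo_labels(predictions, high_conf=0.7, low_conf=0.3,
                       sim_threshold=0.8):
    """Select pseudo-labels using detector confidence plus a VFM check.

    Each prediction is a dict with hypothetical keys:
      "score"          - detector confidence in [0, 1]
      "vfm_similarity" - assumed agreement with the VFM's semantic prior
    """
    kept = []
    for pred in predictions:
        if pred["score"] >= high_conf:
            kept.append(pred)  # confident detections pass directly
        elif pred["score"] >= low_conf and pred["vfm_similarity"] >= sim_threshold:
            kept.append(pred)  # low-confidence box recovered by the VFM check
    return kept

# Hypothetical detector outputs, for illustration only.
predictions = [
    {"id": "a", "score": 0.92, "vfm_similarity": 0.40},
    {"id": "b", "score": 0.45, "vfm_similarity": 0.91},
    {"id": "c", "score": 0.45, "vfm_similarity": 0.30},
    {"id": "d", "score": 0.10, "vfm_similarity": 0.95},
]

kept_ids = [p["id"] for p in mine_pseudo_labels(predictions)]
print(kept_ids)
```

A plain confidence threshold would keep only the first box; the second tier is what raises pseudo-label quantity without admitting boxes that neither signal supports.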

[48] Better Supervised Fine-tuning for VQA: Integer-Only Loss

Baihong Qian,Haotian Fan,Wenjie Liao,Yunqiu Wang,Tao Li,Junhui Cui

Main category: cs.CV

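TL;DR: This paper proposes IOVQA, a fine-tuning method designed for vision language models (VLMs) that improves performance on video quality assessment through integer labels and a target-mask strategy.

Details Motivation: Existing methods for visual quality assessment suffer from imprecise results and inefficient loss computation, which limits the model’s ability to learn the key evaluation indicators.

Contribution: Proposes IOVQA, comprising integer label construction and a target-mask strategy, which significantly improves the model’s accuracy and consistency on the VQA task.

Method: The model’s output is constrained to integers in the range [10,50], and decimal Overall_MOS values are converted to integer labels; during loss computation, only the first two-digit integer of the label is unmasked.

Result: Experiments show that the method significantly improves the Qwen2.5-VL model, ranking 3rd in Track I of the VQualA 2025 challenge.

Insight: Fine-tuning with integer-only labels is enough to effectively optimize VLMs for quantitative evaluation tasks.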

Abstract: With the rapid advancement of vision language models (VLMs), their ability to assess visual content based on specific criteria and dimensions has become increasingly critical for applications such as video-theme consistency assessment and visual quality scoring. However, existing methods often suffer from imprecise results and inefficient loss calculation, which limit the focus of the model on key evaluation indicators. To address this, we propose IOVQA (Integer-only VQA), a novel fine-tuning approach tailored for VLMs to enhance their performance in video quality assessment tasks. The key innovation of IOVQA lies in its label construction and its targeted loss calculation mechanism. Specifically, during dataset curation, we constrain the model’s output to integers within the range of [10,50], ensuring numerical stability, and convert decimal Overall_MOS values to integers before using them as labels. We also introduce a target-mask strategy: when computing the loss, only the first two-digit-integer of the label is unmasked, forcing the model to learn the critical components of the numerical evaluation. After fine-tuning the Qwen2.5-VL model using the constructed dataset, experimental results demonstrate that the proposed method significantly improves the model’s accuracy and consistency in the VQA task, ranking 3rd in VQualA 2025 GenAI-Bench AIGC Video Quality Assessment Challenge – Track I. Our work highlights the effectiveness of merely leaving integer labels during fine-tuning, providing an effective idea for optimizing VLMs in quantitative evaluation scenarios.
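The two ideas in the abstract above, integer label construction and target masking, can be sketched as follows. This is only an illustration under stated assumptions: the ×10 scaling of a MOS assumed to lie in [1.0, 5.0], the helper names, and the toy token ids are hypothetical and not taken from the paper. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, used here as the mask marker.

```python
IGNORE_INDEX = -100  # positions with this label are assumed to be skipped by the loss


def mos_to_integer_label(mos, scale=10, low=10, high=50):
    """Convert a decimal MOS (assumed 1.0-5.0) to an integer label clipped to [low, high]."""
    label = int(mos * scale + 0.5)  # round half up for positive scores
    return max(low, min(high, label))


def build_loss_targets(prompt_ids, score_ids, trailing_ids):
    """Supervise only the tokens of the two-digit score; mask everything else."""
    return (
        [IGNORE_INDEX] * len(prompt_ids)
        + list(score_ids)
        + [IGNORE_INDEX] * len(trailing_ids)
    )


# Toy example: three prompt tokens, two score tokens, one trailing token.
targets = build_loss_targets([1, 2, 3], [7, 8], [9])
```

With this layout the loss only sees the score tokens, which is one plausible reading of "only the first two-digit-integer of the label is unmasked".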

[49] Exploring the Tradeoff Between Diversity and Discrimination for Continuous Category Discovery

Ruobing Jiang,Yang Liu,Haobing Liu,Yanwei Yu,Chunyang Wang

Main category: cs.CV

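TL;DR: This paper proposes IDOD, a new method that addresses the tradeoff between diversity and discrimination in continuous category discovery (CCD), using a modular design to reduce error accumulation and storage overhead.

Details Motivation: CCD faces a conflict between novel class discovery and classification, along with error accumulation and high storage cost. Existing methods struggle to balance diversity and discrimination and are prone to forgetting old knowledge.

Contribution: Proposes IDOD, which comprises three modules (independent enrichment of diversity, joint discovery of novelty, and continuous increment by orthogonality), resolving the conflict in CCD while reducing error accumulation and storage overhead.

Method: IDOD trains the backbone with a contrastive loss to enrich diversity, turns multi-stage novel class discovery into a single stage to reduce error accumulation, and prevents forgetting through orthogonal prototypes and representation replay.

Result: Experiments show that IDOD outperforms existing methods on fine-grained datasets.

Insight: A modular design can balance diversity and discrimination; orthogonality and representation replay effectively lower storage requirements and prevent forgetting.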

Abstract: Continuous category discovery (CCD) aims to automatically discover novel categories in continuously arriving unlabeled data. This is a challenging problem since neither the number of categories nor any labels are available for the newly arrived data, and catastrophic forgetting must also be mitigated. Most CCD methods cannot handle the contradiction between novel class discovery and classification well. They are also prone to accumulating errors in the process of gradually discovering novel classes. Moreover, most of them use knowledge distillation and data replay to prevent forgetting, occupying more storage space. To address these limitations, we propose Independence-based Diversity and Orthogonality-based Discrimination (IDOD). IDOD mainly includes an independent enrichment of diversity module, a joint discovery of novelty module, and a continuous increment by orthogonality module. In independent enrichment, the backbone is trained separately using contrastive loss to avoid it focusing only on features for classification. Joint discovery transforms multi-stage novel class discovery into single-stage, reducing error accumulation impact. The continuous increment by orthogonality module generates mutually orthogonal prototypes for classification and prevents forgetting with lower space overhead via representative representation replay. Experimental results show that on challenging fine-grained datasets, our method outperforms the state-of-the-art methods.
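To make "mutually orthogonal prototypes" concrete, here is a small sketch that builds such prototypes with modified Gram-Schmidt on random vectors. The abstract does not say how IDOD generates its prototypes, so this technique is swapped in for illustration, and the function name and parameters are hypothetical.

```python
import math
import random


def orthogonal_prototypes(num_classes, dim, seed=0):
    """Return num_classes unit vectors in R^dim that are pairwise orthogonal."""
    if num_classes > dim:
        raise ValueError("cannot fit more orthogonal prototypes than dimensions")
    rng = random.Random(seed)
    protos = []
    while len(protos) < num_classes:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the component along every prototype chosen so far.
        for p in protos:
            dot = sum(a * b for a, b in zip(v, p))
            v = [a - dot * b for a, b in zip(v, p)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-8:  # degenerate draw, try again
            continue
        protos.append([a / norm for a in v])
    return protos


prototypes = orthogonal_prototypes(num_classes=4, dim=8)
```

Because each new prototype is orthogonal to all earlier ones, classes added in later stages do not change the prototypes of earlier classes, which is the property that makes orthogonality attractive for incremental settings.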

[50] Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning

Yumiao Zhao,Bo Jiang,Yuhe Ding,Xiao Wang,Jin Tang,Bin Luo

Main category: cs.CV

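TL;DR: This paper proposes LatHAdapter, a new adapter method that improves the few-shot classification performance of vision-language models (VLMs) by exploiting the latent semantic hierarchy of downstream data.

Details Motivation: Existing adapter methods fall short in few-shot classification with VLMs: they fail to capture the one-to-many associations between categories and images and struggle to establish accurate associations between unknown categories and images. The paper aims to address these problems through latent semantic hierarchies and hyperbolic learning.

Contribution: Proposes LatHAdapter, which uses hyperbolic space and a semantic hierarchy to model the complex associations among categories, attributes, and images, significantly improving few-shot classification performance.

Method: Introduces learnable attribute prompts as a bridge, projects categories, attributes, and images into hyperbolic space, and learns their latent semantic hierarchy via hierarchical regularization.

Result: Experiments on four challenging few-shot tasks show that LatHAdapter outperforms other fine-tuning methods, particularly in adapting to known classes and generalizing to unknown classes.

Insight: A latent semantic hierarchy can effectively model complex associations in vision-language tasks, and hyperbolic learning offers a new perspective on few-shot classification.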

Abstract: Adapter-based approaches have garnered attention for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks. These methods strive to develop a lightweight module that better aligns visual and (category) textual representations, thereby enhancing performance on downstream few-shot learning tasks. However, existing adapters generally learn/align (category) textual-visual modalities via explicit spatial proximity in the underlying embedding space, which i) fails to capture the inherent one-to-many associations between categories and image samples and ii) struggles to establish accurate associations between the unknown categories and images. To address these issues, inspired by recent works on hyperbolic learning, we develop a novel Latent Hierarchical Adapter (LatHAdapter) for fine-tuning VLMs on downstream few-shot classification tasks. The core of LatHAdapter is to exploit the latent semantic hierarchy of downstream training data and employ it to provide richer, fine-grained guidance for the adapter learning process. Specifically, LatHAdapter first introduces some learnable ‘attribute’ prompts as the bridge to align categories and images. Then, it projects the categories, attribute prompts, and images within each batch in a hyperbolic space, and employs hierarchical regularization to learn the latent semantic hierarchy of them, thereby fully modeling the inherent one-to-many associations among categories, learnable attributes, and image samples. Extensive experiments on four challenging few-shot tasks show that the proposed LatHAdapter consistently outperforms many other fine-tuning approaches, particularly in adapting known classes and generalizing to unknown classes.
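As background for the projection into hyperbolic space mentioned above, the sketch below shows the standard exponential map at the origin and the geodesic distance of the unit Poincaré ball (curvature -1). The abstract does not state which hyperbolic model LatHAdapter uses, so the Poincaré ball and the function names here are assumptions made for illustration.

```python
import math


def exp_map_origin(v):
    """Map a Euclidean (tangent) vector at the origin into the unit Poincare ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * x for x in v]


def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = [a - b for a, b in zip(x, y)]
    arg = 1.0 + 2.0 * sq_norm(diff) / ((1.0 - sq_norm(x)) * (1.0 - sq_norm(y)))
    return math.acosh(arg)


# A Euclidean feature of norm 0.5 lands at hyperbolic distance 2 * 0.5 from the origin.
point = exp_map_origin([0.3, 0.4])
radius = poincare_distance([0.0, 0.0], point)
```

Distances in the ball grow rapidly toward the boundary, which is why hyperbolic embeddings are often used for tree-like structure: general concepts can sit near the origin and specific ones farther out.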

[51] Versatile Video Tokenization with Generative 2D Gaussian Splatting

Zhenghao Chen,Zicong Chen,Lei Liu,Yiming Wu,Dong Xu

Main category: cs.CV

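TL;DR: This paper proposes GVT, a versatile video tokenizer built on a generative 2D Gaussian Splatting (2DGS) strategy that improves video processing tasks through spatial adaptivity and temporal redundancy reduction.

Details Motivation: Existing video tokenization methods typically use fixed-grid, patch-wise tokens, which over-encode low-information regions and make it hard to distinguish static from dynamic content. GVT uses generative 2D Gaussian Splatting to address spatial adaptivity and temporal redundancy.

Contribution: Proposes GVT, which generates 2D Gaussian representations with the STGE mechanism and separates static and dynamic content with the GSP strategy, significantly improving the versatility and effectiveness of video tokenization.

Method: 1. Extract latent rigid features from the video and generate 2D Gaussian representations with the STGE mechanism; 2. use the GSP strategy to partition the Gaussians into static and dynamic sets, which model content shared across time-steps and content specific to each time-step, respectively.

Result: GVT achieves state-of-the-art video reconstruction quality, outperforms the MAGVIT-v2 baseline in action recognition, and delivers comparable compression performance.

Insight: Generative 2D Gaussian representations improve spatial adaptivity and also address temporal redundancy by separating static and dynamic content, offering a new approach to video tokenization.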

Abstract: The video tokenization procedure is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid and patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher (resp., lower) rendering weights to regions with higher (resp., lower) information content during rasterization, but also improve generalization by avoiding per-video optimization. To enhance the temporal versatility, we introduce a Gaussian Set Partitioning (GSP) strategy that separates the 2D Gaussians into static and dynamic sets, which explicitly model static content shared across different time-steps and dynamic content specific to each time-step, enabling a compact representation. We primarily evaluate GVT on video reconstruction, while also assessing its performance on action recognition and compression using the UCF101, Kinetics, and DAVIS datasets. Extensive experiments demonstrate that GVT achieves state-of-the-art video reconstruction quality, outperforms the baseline MAGVIT-v2 in action recognition, and delivers comparable compression performance.

[52] Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark

Lavisha Aggarwal,Vikas Bahirwani,Lin Li,Andrea Colaco

Main category: cs.CV

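TL;DR: This paper proposes a method for automatically converting single-person instructional videos into task-guidance dialogues and builds a large-scale dataset named HowToDIV. The dataset contains 507 conversations, 6636 question-answer pairs, and 24 hours of video clips, covering tasks in cooking, mechanics, and planting. The paper also provides baseline performance with the Gemma-3 model for future research on task-assistance dialogues.

Details Motivation: The task-assistance dialogue field lacks dialogue datasets grounded in real-world task videos, and collecting such data with human annotators is costly and time-consuming. The paper therefore proposes an automated approach that uses large language models to generate dialogue data from single-person instructional videos.

Contribution: 1) Proposes an automated approach that converts single-person instructional videos into task-guidance dialogues; 2) builds the HowToDIV dataset, containing dialogues, question-answer pairs, and video clips across diverse tasks; 3) provides baseline performance for future research.

Method: The paper uses large language models (LLMs) to automatically convert the steps and video clips of single-person instructional videos into multi-turn dialogues between an expert and a novice. The approach is fully automatic and requires no human intervention.

Result: The HowToDIV dataset contains 507 conversations, 6636 question-answer pairs, and 24 hours of video clips. Baseline experiments with the Gemma-3 model validate the feasibility of the task.

Insight: Automatically generating task dialogues can greatly reduce data collection costs while supporting the application of LLMs to task assistance. The HowToDIV dataset opens new directions for future research.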

Abstract: Many everyday tasks, ranging from fixing appliances and cooking recipes to car maintenance, require expert knowledge, especially when tasks are complex and multi-step. Despite growing interest in AI agents, there is a scarcity of dialogue-video datasets grounded in real-world task assistance. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into task-guidance two-person dialogues, aligned with fine-grained steps and video clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6636 question-answer pairs and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting. Each session includes a multi-turn conversation where an expert teaches a novice user how to perform a task step by step, while observing the user’s surroundings through a wearable device equipped with a camera and microphone. We establish the baseline benchmark performance on the HowToDIV dataset with the Gemma-3 model for future research on this new task of dialogues for procedural-task assistance.

[53] UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

Jiajin Guan,Haibo Mei,Bonan Zhang,Dan Liu,Yuanshuang Fu,Yue Zhang

Main category: cs.CV

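TL;DR: UAV-VL-R1 is a lightweight vision-language model optimized for UAV visual reasoning through supervised fine-tuning and multi-stage GRPO reinforcement learning; it significantly improves zero-shot accuracy and supports real-time deployment.

Details Motivation: General-purpose vision-language models degrade on UAV aerial imagery and cannot meet its demands of high resolution, complex spatial semantics, and real-time constraints. A purpose-built lightweight model is needed for UAV visual reasoning.

Contribution: 1. Proposes the UAV-VL-R1 model, combining supervised fine-tuning and multi-stage GRPO reinforcement learning; 2. releases the HRVQA-VL dataset, covering eight UAV-relevant tasks; 3. the model outperforms larger models on zero-shot tasks while supporting low-resource deployment.

Method: 1. Use supervised fine-tuning (SFT) to improve semantic alignment; 2. enhance logical flexibility and inference robustness through multi-stage GRPO reinforcement learning; 3. release the HRVQA-VL dataset to support training and evaluation.

Result: UAV-VL-R1 achieves 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms the 72B model, which is 36x larger. The model requires only 3.9GB of memory under FP16 inference and 2.5GB after INT8 quantization.

Insight: 1. SFT may reduce reasoning diversity in mathematical tasks, which reinforcement learning is needed to compensate for; 2. GRPO improves the structure and interpretability of reasoning through rule-guided rewards and intra-group policy alignment.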

Abstract: Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.
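The group-relative part of GRPO can be illustrated with a short sketch: several responses are sampled for the same prompt, each is scored by a rule, and each reward is normalized against its own group. The exact-match rule below is a hypothetical stand-in for the paper's rule-guided rewards, and the use of the population standard deviation is an implementation choice, not something the abstract specifies.

```python
import statistics


def exact_match_reward(prediction, reference):
    """Hypothetical rule-based reward: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one counting question whose reference answer is "4".
samples = ["4", "5", " 4 ", "three"]
rewards = [exact_match_reward(s, "4") for s in samples]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline. When every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.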

[54] A Coarse-to-Fine Human Pose Estimation Method based on Two-stage Distillation and Progressive Graph Neural Network

Zhangjian Ji,Wenjin Zhang,Shaotong Qiao,Kai Feng,Yuhua Qian

Main category: cs.CV

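TL;DR: The paper proposes a human pose estimation method based on two-stage knowledge distillation and a progressive graph neural network, which mines joint structure information and progressively refines the pose to achieve lightweight, robust pose estimation.

Details Motivation: Existing human pose estimation methods require heavy computational resources, and traditional knowledge distillation does not fully exploit the contextual information among joints. The paper therefore proposes a lightweight and efficient distillation method that combines structural information with a progressive graph network to improve performance.

Contribution: 1. Proposes a coarse-to-fine two-stage knowledge distillation framework; 2. designs a first-stage distillation based on a human joints structure loss; 3. introduces a progressive graph convolutional network (IGP-GCN) to refine the pose.

Method: 1. The first stage transfers high-level semantic knowledge through the joints structure loss; 2. the second stage uses IGP-GCN to progressively refine the initial pose, with training supervised by the final output of the teacher model.

Result: The method performs well on the COCO keypoint and CrowdPose datasets, with especially large gains in complex scenes (CrowdPose).

Insight: Combining structural information with progressive refinement effectively improves the pose estimation performance of lightweight models, particularly in crowded scenes.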

Abstract: Human pose estimation has been widely applied in human-centric understanding and generation, but most existing state-of-the-art human pose estimation methods require heavy computational resources for accurate predictions. In order to obtain an accurate, robust yet lightweight human pose estimator, one feasible way is to transfer pose knowledge from a powerful teacher model to a less-parameterized student model by knowledge distillation. However, the traditional knowledge distillation framework does not fully explore the contextual information among human joints. Thus, in this paper, we propose a novel coarse-to-fine two-stage knowledge distillation framework for human pose estimation. In the first-stage distillation, we introduce the human joints structure loss to mine the structural information among human joints so as to transfer high-level semantic knowledge from the teacher model to the student model. In the second-stage distillation, we utilize an Image-Guided Progressive Graph Convolutional Network (IGP-GCN) to refine the initial human pose obtained from the first-stage distillation and supervise the training of the IGP-GCN in a progressive way using the final output pose of the teacher model. Extensive experiments on the COCO keypoint and CrowdPose benchmark datasets show that our proposed method performs favorably against many existing state-of-the-art human pose estimation methods; the improvement is especially significant on the more complex CrowdPose dataset.

[55] A CLIP-based Uncertainty Modal Modeling (UMM) Framework for Pedestrian Re-Identification in Autonomous Driving

Jialin Li,Shuqi Wu,Ning Wang

Main category: cs.CV

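TL;DR: This paper proposes a lightweight CLIP-based Uncertainty Modal Modeling (UMM) framework for pedestrian re-identification (ReID) in autonomous driving. Through a multimodal token mapper, a synthetic modality augmentation strategy, and a cross-modal cue interactive learner, UMM unifies feature representation, mitigates the impact of missing modalities, and fuses multimodal inputs efficiently. Experiments show that UMM performs strongly in computational efficiency and robustness.

Details Motivation: Pedestrian ReID in autonomous driving must handle uncertain or missing modalities (such as RGB, infrared, sketches, or textual descriptions). Although pre-trained models offer strong multimodal semantic modeling, their computational overhead limits deployment in resource-constrained environments, so a lightweight solution is needed.

Contribution: Proposes the lightweight UMM framework, which combines a multimodal token mapper, synthetic modality augmentation, and a cross-modal cue interactive learner. By leveraging CLIP’s vision-language alignment, UMM fuses multimodal inputs efficiently without extensive fine-tuning.

Method: The UMM framework has three core components: a multimodal token mapper for unified feature representation, a synthetic modality augmentation strategy that mitigates the impact of missing modalities, and a cross-modal cue interactive learner that extracts complementary information across modalities. CLIP’s pre-trained capability is used to fuse multimodal features efficiently.

Result: Experimental results show that UMM achieves strong robustness, generalization, and computational efficiency under uncertain modality conditions, offering a scalable and practical solution for pedestrian ReID in autonomous driving.

Insight: Through a lightweight design and efficient use of pre-trained models, UMM achieves robust fusion of multimodal inputs in resource-constrained environments, offering a new approach to ReID in autonomous driving.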

Abstract: Re-Identification (ReID) is a critical technology in intelligent perception systems, especially within autonomous driving, where onboard cameras must identify pedestrians across views and time in real-time to support safe navigation and trajectory prediction. However, the presence of uncertain or missing input modalities (such as RGB, infrared, sketches, or textual descriptions) poses significant challenges to conventional ReID approaches. While large-scale pre-trained models offer strong multimodal semantic modeling capabilities, their computational overhead limits practical deployment in resource-constrained environments. To address these challenges, we propose a lightweight Uncertainty Modal Modeling (UMM) framework, which integrates a multimodal token mapper, synthetic modality augmentation strategy, and cross-modal cue interactive learner. Together, these components enable unified feature representation, mitigate the impact of missing modalities, and extract complementary information across different data types. Additionally, UMM leverages CLIP’s vision-language alignment ability to fuse multimodal inputs efficiently without extensive finetuning. Experimental results demonstrate that UMM achieves strong robustness, generalization, and computational efficiency under uncertain modality conditions, offering a scalable and practical solution for pedestrian re-identification in autonomous driving scenarios.

[56] FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

MengChao Wang,Qiang Wang,Fan Jiang,Mu Xu

Main category: cs.CV

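TL;DR: The paper proposes a multidimensional human-preference alignment approach that introduces the Talking-Critic reward model and the TLPO framework, significantly improving the quality and naturalness of audio-driven portrait animation.

Details Motivation: Existing methods struggle to satisfy human preferences across multiple dimensions at once (such as motion naturalness, lip-sync accuracy, and visual quality), and large-scale datasets with multidimensional preference annotations are scarce. The paper aims to address these problems.

Contribution: 1) Proposes Talking-Critic, a multimodal reward model; 2) builds Talking-NSQ, a large-scale multidimensional preference dataset; 3) proposes the TLPO framework for fine-grained optimization of diffusion models on multidimensional preferences.

Method: TLPO decouples multidimensional preferences into specialized expert modules and fuses them dynamically across timesteps and network layers, enabling comprehensive optimization without mutual interference. Talking-Critic is used to quantify preference scores for generated videos.

Result: Talking-Critic significantly outperforms existing methods in preference scoring; TLPO surpasses baseline models in lip-sync accuracy, motion naturalness, and visual quality.

Insight: Decoupling and dynamically fusing multidimensional preferences is key to improving generation quality, and a multimodal reward model and a large-scale annotated dataset are essential for aligning models with human preferences.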

Abstract: Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This is due to the difficulty of optimizing among competing preference objectives, which often conflict with one another, and the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these issues, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Our project page: https://fantasy-amap.github.io/fantasy-talking2/

[57] Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Junjie Wang,Keyu Chen,Yulin Li,Bin Chen,Hengshuang Zhao,Xiaojuan Qi,Zhuotao Tian

Main category: cs.CV

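TL;DR: This paper proposes DeCLIP, a framework that decouples CLIP’s self-attention module to obtain separate content and context features, enhancing performance on open-vocabulary dense perception tasks.

Details Motivation: Existing dense visual perception tasks rely on predefined categories, which limits their use in real-world scenarios. Although CLIP performs well on open-vocabulary tasks, applying it directly to dense perception yields poor results because of limitations in local feature representation.

Contribution: Proposes the DeCLIP framework, which decouples the self-attention module and optimizes content and context features separately, significantly improving open-vocabulary dense perception.

Method: DeCLIP decouples CLIP’s self-attention module into content and context features. Context features are enhanced by distilling semantic correlations from vision foundation models and object integrity cues from diffusion models to improve spatial consistency, while content features are aligned with image crop representations to improve local discriminability.

Result: Experiments show that DeCLIP achieves state-of-the-art performance across a range of tasks (such as 2D detection and segmentation and 3D instance segmentation).

Insight: Decoupling content and context features and optimizing them with multimodal information is an effective way to improve open-vocabulary dense perception.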

Abstract: Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP’s image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain “content” and “context” features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP

[58] Vision-Language Models display a strong gender bias

Aiswarya Konavoor,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat

Main category: cs.CV

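TL;DR: This study finds that vision-language models (VLMs) encode and amplify gender bias when aligning images and text, especially in describing occupations and activities. By analyzing gender associations in a contrastive embedding space, the authors propose a framework for evaluating gender bias.

Details Motivation: VLMs perform well on vision and text tasks, but their alignment mechanism may carry implicit social biases. Systematic methods for evaluating gender bias in such models are currently lacking.

Contribution: 1. Proposes a new evaluation framework for quantifying gender bias in VLMs; 2. builds a dataset of face photographs split by gender and a set of statements; 3. reveals patterns of association between gender and occupations/activities in a contrastive embedding space.

Method: 1. Build a gender-split face dataset and a statement set; 2. compute embeddings for images and statements; 3. define a gender association score (based on differences in cosine similarity); 4. assess bias with bootstrapping and a label-swap null model.

Result: The study finds significant gender associations in the VLM embedding space; for example, certain occupations and activities are more strongly tied to a particular gender.

Insight: Evaluation of VLMs needs to go beyond accuracy metrics and include analysis of social bias. Contrastive embedding spaces may inadvertently reinforce social stereotypes.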

Abstract: Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework.
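The association score, bootstrap interval, and label-swap null described in the abstract above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: the function names, the percentile-interval construction, and the toy 2-D vectors standing in for real image and text embeddings are all assumptions.

```python
import math
import random


def unit(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def mean_cosine(text, images):
    """Mean cosine similarity between a unit-norm text vector and unit-norm image vectors."""
    return sum(sum(a * b for a, b in zip(text, img)) for img in images) / len(images)


def association(text, male, female):
    """Positive: closer to the male set; negative: closer to the female set."""
    return mean_cosine(text, male) - mean_cosine(text, female)


def bootstrap_ci(text, male, female, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval from resampling images within each gender group."""
    rng = random.Random(seed)
    scores = sorted(
        association(text, rng.choices(male, k=len(male)), rng.choices(female, k=len(female)))
        for _ in range(n_boot)
    )
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def label_swap_null(texts, male, female, n_perm=200, seed=0):
    """Mean absolute association expected when gender labels are shuffled."""
    rng = random.Random(seed)
    pooled = male + female
    totals = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        m, f = pooled[:len(male)], pooled[len(male):]
        totals.append(sum(abs(association(t, m, f)) for t in texts) / len(texts))
    return sum(totals) / len(totals)


# Toy embeddings: the statement vector points toward the "male" cluster.
male = [unit([1.0, 0.0]), unit([0.9, 0.1])]
female = [unit([0.0, 1.0]), unit([0.1, 0.9])]
statement = unit([1.0, 0.0])
score = association(statement, male, female)
```

Comparing the observed mean absolute association with the value returned by `label_swap_null` indicates how much of the measured structure would appear even if the gender labels carried no information.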

[59] Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering

Jun Li,Kai Li,Shaoguo Liu,Tingting Gao

Main category: cs.CV

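TL;DR: This paper proposes PMTFR, a framework based on reasoning-augmented representation engineering that improves supervised composed image retrieval (CIR); with a Pyramid Matching Model and Training-Free Refinement, it significantly outperforms existing methods.

Details Motivation: In current CIR work, existing methods typically rely on two-stage ranking models or elaborate prompt designs, which limits their effectiveness in supervised settings. The paper aims to improve on this through reasoning and representation engineering.

Contribution: 1. Proposes the Pyramid Patcher module, which enhances the model’s understanding of visual information at different granularities; 2. uses representation engineering to extract representations from CoT data and inject them into LVLMs, refining retrieval scores without explicit textual reasoning; 3. surpasses state-of-the-art methods on supervised CIR.

Method: 1. Use the Pyramid Matching Model to capture visual information at multiple granularities; 2. extract representations from CoT data and inject them into LVLMs; 3. apply the Training-Free Refinement paradigm to refine retrieval results.

Result: On CIR benchmarks, PMTFR significantly outperforms existing methods, demonstrating its effectiveness for supervised CIR.

Insight: Reasoning-augmented representation engineering can significantly improve CIR performance without relying on elaborate prompt designs or additional training. The approach offers a new optimization direction for multimodal tasks.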

Abstract: Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a modified textual instruction to find relevant target images. Some existing methods attempt to use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited, as it involves compressing visual information into text or relying on elaborate prompt designs. Besides, existing works only utilize it for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhance the Pyramid Matching Model’s understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into the LVLMs. This approach allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.

[60] Probing the Representational Power of Sparse Autoencoders in Vision Models

Matthew Lyle Olson,Musashi Hinck,Neale Ratzlaff,Changbai Li,Phillip Howard,Vasudev Lal,Shao-Yen Tseng

Main category: cs.CV

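TL;DR: Through extensive experiments, this paper evaluates the representational power of sparse autoencoders (SAEs) in vision models, finding that their features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across multiple vision model architectures.

Details Motivation: SAEs are widely used as an interpretability tool for language models but remain understudied in the visual domain. The paper aims to probe the representational power of SAEs in vision models and their potential applications.

Contribution: 1. Shows that SAE features in vision models are semantically meaningful; 2. demonstrates that SAEs improve out-of-distribution generalization and controllable generation; 3. validates SAEs across several vision model architectures (embedding models, multimodal LLMs, and diffusion models).

Method: Designs a broad range of image-based experiments to evaluate SAEs in vision embedding models, multimodal LLMs, and diffusion models, including feature analysis, out-of-distribution detection, and controllable generation.

Result: Experimental results show that SAE features capture the semantic structure of vision models, improve generalization, and enable semantic steering through text encoder manipulation. In multimodal LLMs, SAEs reveal shared representations across vision and language.

Insight: SAEs have broad potential in vision models, particularly for interpretability, generalization, and steerability, offering new directions for making vision models more transparent and extending their capabilities.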

Abstract: Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation of the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multi-modal LLMs and diffusion models. In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multi-modal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential for improving interpretability, generalization, and steerability in the visual domain.

[61] Unifying Scale-Aware Depth Prediction and Perceptual Priors for Monocular Endoscope Pose Estimation and Tissue Reconstruction

Muzammil Khan,Enzo Kerkhof,Matteo Fusaglia,Koert Kuhlmann,Theo Ruers,Françoise J. Siepel

Main category: cs.CV

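TL;DR: The paper proposes a unified framework that combines scale-aware depth prediction with perceptual priors to improve monocular endoscope pose estimation and tissue reconstruction, addressing depth ambiguity, deformation, and related problems through a modular design and optimisation algorithms.

Details Motivation: Monocular endoscopy faces challenges in surgical navigation and spatial awareness, including depth ambiguity, tissue deformation, and limited texture, and calls for a method that integrates depth prediction with temporally constrained refinement.

Contribution: 1. Proposes the MAPIS-Depth module, which combines Depth Pro and Depth Anything to generate pseudo-metric depth; 2. introduces the WEMA-RTDL module to optimise pose estimation; 3. reduces artefacts through temporally aware optical flow and perceptual blending.

Method: 1. Use L-BFGS-B optimisation and RAFT optical flow for temporally constrained depth prediction; 2. apply LPIPS-based perceptual blending to reduce deformation artefacts; 3. use TSDF volumetric fusion and marching cubes to extract a 3D mesh.

Result: Experiments on the HEVD and SCARED datasets show that the framework outperforms existing methods in depth prediction and pose estimation.

Insight: By combining depth prediction with perceptual priors, temporally constrained refinement can effectively address the challenges of monocular endoscopy, and the modular design offers a general approach to similar problems.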

Abstract: Accurate endoscope pose estimation and 3D tissue surface reconstruction significantly enhance monocular minimally invasive surgical procedures by enabling accurate navigation and improved spatial awareness. However, monocular endoscope pose estimation and tissue reconstruction face persistent challenges, including depth ambiguity, physiological tissue deformation, inconsistent endoscope motion, limited texture fidelity, and a restricted field of view. To overcome these limitations, a unified framework for monocular endoscopic tissue reconstruction that integrates scale-aware depth prediction with temporally-constrained perceptual refinement is presented. This framework incorporates a novel MAPIS-Depth module, which leverages Depth Pro for robust initialisation and Depth Anything for efficient per-frame depth prediction, in conjunction with L-BFGS-B optimisation, to generate pseudo-metric depth estimates. These estimates are temporally refined by computing pixel correspondences using RAFT and adaptively blending flow-warped frames based on LPIPS perceptual similarity, thereby reducing artefacts arising from physiological tissue deformation and motion. To ensure accurate registration of the synthesised pseudo-RGBD frames from MAPIS-Depth, a novel WEMA-RTDL module is integrated, optimising both rotation and translation. Finally, truncated signed distance function-based volumetric fusion and marching cubes are applied to extract a comprehensive 3D surface mesh. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework’s robustness and superiority over state-of-the-art methods.

[62] Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval

Weijia Liu,Jiuxin Cao,Bo Miao,Zhiheng Fu,Xuelin Zhu,Jiawei Ge,Bo Liu,Mehwish Nasim,Ajmal Mian

Main category: cs.CV

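TL;DR: This paper proposes a new paradigm called ‘Denoise-then-Retrieve’, which uses text-conditioned video denoising to filter out irrelevant video clips and improve video moment retrieval.

Details Motivation: Current text-driven video moment retrieval methods encode all video clips, including irrelevant ones, which disrupts multimodal alignment and hinders optimization. The paper proposes the ‘denoise-then-retrieve’ paradigm to address this.

Contribution: The main contributions are: (1) the ‘Denoise-then-Retrieve’ paradigm, which improves retrieval by explicitly filtering irrelevant clips; (2) DRNet, a network comprising TCD and TRF modules, used respectively to dynamically identify noisy clips and to improve multimodal alignment.

Method: The method has two steps: first, the TCD module (combining cross-attention and structured state space blocks) dynamically identifies noisy clips and produces a noise mask; then, the TRF module distills a query embedding and aligns it with the text embedding. Finally, conditional retrieval is performed on the denoised video representations.

Result: Experiments on Charades-STA and QVHighlights show that the method outperforms existing methods on all metrics and that the paradigm can be seamlessly integrated into other advanced models.

Insight: The ‘denoise-then-retrieve’ paradigm offers a new optimization direction for video moment retrieval, improving multimodal alignment by explicitly filtering noisy clips.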

Abstract: Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To this end, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
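The filtering idea in this abstract can be illustrated with a minimal sketch. It is not DRNet: the learned TCD module (cross-attention plus structured state space blocks) is replaced here by a fixed cosine-similarity threshold, and the function names, the threshold value, and the toy features are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def noise_mask(clip_feats, text_feat, threshold=0.5):
    """Return 1 for clips kept as text-relevant and 0 for clips treated as noise.

    Assumption: a fixed cosine threshold stands in for the learned TCD module.
    """
    return [1 if cosine(c, text_feat) >= threshold else 0 for c in clip_feats]


def purify(clip_feats, mask):
    """Zero out the features of clips flagged as noise."""
    return [[x * m for x in clip] for clip, m in zip(clip_feats, mask)]


clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy clip features (hypothetical)
text = [1.0, 0.0]                              # toy text embedding (hypothetical)
mask = noise_mask(clips, text)
purified = purify(clips, mask)
```

In this toy case the second clip is orthogonal to the text embedding, so it is masked out before any retrieval step would run on the purified features.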

[63] Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

Yuchen Zhou,Jiayu Tang,Shuo Yang,Xiaoyan Xiao,Yuqin Dai,Wenhao Yang,Chao Gou,Xiaobo Xia,Tat-Seng Chua

Main category: cs.CV

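TL;DR: The paper presents LogicBench and LogicCLIP. The former is a comprehensive benchmark that exposes the logical blindspots of vision-language models (VLMs); the latter is a new training framework that aims to improve VLMs' logical understanding.

Details Motivation: Existing VLMs such as CLIP show marked deficiencies in logical understanding, which undermines their reliability in practical applications. This led the authors to build LogicBench to diagnose these problems systematically and to propose LogicCLIP to improve models' logical sensitivity.

Contribution: 1. LogicBench, with more than 50,000 vision-language pairs across 9 logical categories and 4 scenarios; 2. the LogicCLIP training framework, which improves logical understanding through logic-aware data generation and a contrastive learning strategy.

Method: 1. LogicBench is used to evaluate the logical blindspots of VLMs; 2. LogicCLIP combines coarse-grained alignment, a fine-grained multiple-choice objective, and a logical structure-aware objective during training.

Result: LogicCLIP improves substantially across all LogicBench domains and clearly outperforms baseline models in logical understanding, while matching or surpassing competitive performance on general vision-language tasks.

Insight: Improving the logical ability of VLMs does not harm their general alignment performance, which suggests that logical understanding and general tasks can complement each other.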

Abstract: Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical “logical blindspots” that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even the state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly in challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to boost VLMs’ logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP’s substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP retains, and often surpasses, competitive performance on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe that LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.

[64] Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking

Haonan Zhang,Xinyao Wang,Boxi Wu,Tu Zheng,Wang Yunhua,Zheng Yang

Main category: cs.CV

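TL;DR: This paper proposes DSC-Track, a 3D multi-object tracking method based on dynamic scene cue-consistency. A unified spatiotemporal encoder and a cue-consistency transformer module markedly improve tracking robustness in complex scenes, and the method's strong performance is validated on the nuScenes and Waymo datasets.

Details Motivation: Traditional 3D multi-object tracking methods (e.g., Kalman filters) often fail in complex or crowded scenes because they ignore the geometric relationships between objects. Existing geometry-based methods are easily disturbed by irrelevant objects, which leads to ambiguous features and incorrect associations, so a more robust method is needed to mine stable spatial patterns in dynamic scenes.

Contribution: 1. Proposes the dynamic scene cue-consistency principle, which improves tracking by identifying and matching stable spatial patterns. 2. Designs a unified spatiotemporal encoder (based on PPF) and a cue-consistency transformer module that suppress interference and extract discriminative trajectory embeddings. 3. Introduces a dynamic update mechanism that preserves salient spatiotemporal information for stable online tracking.

Method: 1. A unified spatiotemporal encoder built on Point Pair Features (PPF) learns discriminative trajectory embeddings. 2. The cue-consistency transformer module explicitly aligns the feature representations of historical tracks and current detections. 3. A dynamic update mechanism maintains robust spatiotemporal information.

Result: On nuScenes, DSC-Track reaches 73.2% AMOTA on the validation set and 70.3% on the test set, achieving state-of-the-art performance. Experiments on the Waymo dataset also confirm the method's effectiveness.

Insight: Stable spatial patterns in dynamic scenes (cue-consistency) are key to robust multi-object tracking. Explicitly modeling and matching these patterns can markedly reduce interference and false associations in complex environments.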

Abstract: 3D multi-object tracking is a critical and challenging task in the field of autonomous driving. A common paradigm relies on modeling individual object motion, e.g., Kalman filters, to predict trajectories. While effective in simple scenarios, this approach often struggles in crowded environments or with inaccurate detections, as it overlooks the rich geometric relationships between objects. This highlights the need to leverage spatial cues. However, existing geometry-aware methods can be susceptible to interference from irrelevant objects, leading to ambiguous features and incorrect associations. To address this, we propose focusing on cue-consistency: identifying and matching stable spatial patterns over time. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle. Firstly, we design a unified spatiotemporal encoder using Point Pair Features (PPF) to learn discriminative trajectory embeddings while suppressing interference. Secondly, our cue-consistency transformer module explicitly aligns consistent feature representations between historical tracks and current detections. Finally, a dynamic update mechanism preserves salient spatiotemporal information for stable online tracking. Extensive experiments on the nuScenes and Waymo Open Datasets validate the effectiveness and robustness of our approach. On the nuScenes benchmark, for instance, our method achieves state-of-the-art performance, reaching 73.2% and 70.3% AMOTA on the validation and test sets, respectively.
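For readers unfamiliar with Point Pair Features, the sketch below computes the classic four-component PPF for two oriented points: the distance between them and three angles. The abstract does not say how DSC-Track adapts PPF to tracked objects, so this is the textbook definition only, with hypothetical helper names.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def _angle(a, b):
    """Angle in radians between two vectors (0.0 if either has zero length)."""
    na, nb = _norm(a), _norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    cos_val = max(-1.0, min(1.0, _dot(a, b) / (na * nb)))  # clamp rounding error
    return math.acos(cos_val)


def point_pair_feature(p1, n1, p2, n2):
    """Classic PPF: (distance, angle(n1, d), angle(n2, d), angle(n1, n2)).

    p1, p2 are positions; n1, n2 are orientation vectors; d = p2 - p1.
    """
    d = [b - a for a, b in zip(p1, p2)]
    return (_norm(d), _angle(n1, d), _angle(n2, d), _angle(n1, n2))


feature = point_pair_feature((0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 1, 0))
shifted = point_pair_feature((5, 5, 5), (1, 0, 0), (7, 5, 5), (0, 1, 0))
```

Because the feature depends only on relative geometry, translating both points leaves it unchanged, which is one reason such descriptors suit matching spatial patterns over time.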

[65] Noise Matters: Optimizing Matching Noise for Diffusion Classifiers

Yanghao Wang,Long Chen

Main category: cs.CV

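TL;DR: This paper proposes NoOp, a noise optimization method that addresses noise instability in diffusion classifiers (DCs). It learns good noise according to the Frequency Matching and Spatial Matching principles, which improves classification performance.

Details Motivation: The performance of existing diffusion classifiers is unstable because it varies with the randomly sampled noise. They must ensemble over many noise samples, which markedly reduces classification speed. The authors therefore examine the role of noise in DCs and propose optimizing the noise to improve stability.

Contribution: Proposes NoOp, which optimizes noise according to the Frequency Matching and Spatial Matching principles, reducing DC instability while avoiding the need for many noise samples.

Method: NoOp first optimizes a dataset-specific noise (Frequency Matching), then trains a meta-network that generates an image-specific noise offset (Spatial Matching). The sum of the optimized noise and the offset is used for classification.

Result: Extensive experiments on a variety of datasets demonstrate the effectiveness of NoOp, with clear gains in classification performance.

Insight: The choice of noise is critical to diffusion classifier performance. Optimizing the noise can markedly reduce instability without relying on many noise samples.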

Abstract: Although today’s pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling a Gaussian noise, DC utilizes the differences of denoising effects with different category conditions to classify categories. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different randomly sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we first explore the role of noise in DC, and conclude that there are some “good noises” that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: Frequency Matching and Spatial Matching. Regarding both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For Frequency Matching, NoOp first optimizes a dataset-specific noise: given a dataset and a timestep t, optimize one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that adopts an image as input and outputs an image-specific noise offset. The sum of the optimized noise and the noise offset will be used in DC to replace random noise. Extensive ablations on various datasets demonstrate the effectiveness of NoOp.

[66] GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition

Md Asgor Hossain Reaj,Rajan Das Gupta,Md Yeasin Rahat,Nafiz Fahad,Md Jawadul Hasan,Tze Hui Liew

Main category: cs.CV

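TL;DR: GANDiff FR is a synthetic framework that combines StyleGAN3 with a diffusion model to control facial attributes (such as pose, illumination, and expression) precisely, in order to quantify bias in face recognition.

Details Motivation: The goal is to quantify and reduce bias in face recognition systems by generating controllable synthetic face data, providing a standard for fairness evaluation.

Contribution: 1. Proposes GANDiff FR, the first synthetic framework to combine StyleGAN3 with a diffusion model; 2. achieves fine-grained control over facial attributes; 3. quantifies and analyzes the sources of bias.

Method: StyleGAN3 generates identity-preserving faces, and a diffusion model controls pose, illumination, and expression, producing 10,000 demographically balanced faces.

Result: AdaFace reduces inter-group TPR disparity by 60%; illumination accounts for 42% of residual bias; the correlation between synthetic and real data reaches 0.85.

Insight: Illumination is a major factor in face recognition bias, and synthetic data can effectively support fairness evaluation.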

Abstract: We introduce GANDiff FR, the first synthetic framework that precisely controls demographic and environmental factors to measure, explain, and reduce bias with reproducible rigor. GANDiff FR unifies StyleGAN3-based identity-preserving generation with diffusion-based attribute control, enabling fine-grained manipulation of pose around 30 degrees, illumination (four directions), and expression (five levels) under ceteris paribus conditions. We synthesize 10,000 demographically balanced faces across five cohorts validated for realism via automated detection (98.2%) and human review (89%) to isolate and quantify bias drivers. Benchmarking ArcFace, CosFace, and AdaFace under matched operating points shows AdaFace reduces inter-group TPR disparity by 60% (2.5% vs. 6.3%), with illumination accounting for 42% of residual bias. Cross-dataset evaluation on RFW, BUPT, and CASIA WebFace confirms strong synthetic-to-real transfer (r 0.85). Despite around 20% computational overhead relative to pure GANs, GANDiff FR yields three times more attribute-conditioned variants, establishing a reproducible, regulation-aligned (EU AI Act) standard for fairness auditing. Code and data are released to support transparent, scalable bias evaluation.
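The 60% figure in this abstract follows from the two reported disparities (2.5% vs. 6.3%), which a few lines of arithmetic confirm. The sketch below assumes that inter-group TPR disparity means the gap between the highest and lowest per-group true positive rate; the abstract does not define the metric, so that definition and the function names are illustrative.

```python
def true_positive_rate(true_positives, false_negatives):
    """TPR = TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def tpr_disparity(tpr_by_group):
    """Assumed definition: gap between the best and worst per-group TPR."""
    values = list(tpr_by_group.values())
    return max(values) - min(values)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline


# Disparities reported in the abstract, in percentage points.
reduction = relative_reduction(6.3, 2.5)  # about 0.60
```

Here `reduction` evaluates to roughly 0.603, consistent with the reported 60% decrease.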

[67] Semantically Guided Adversarial Testing of Vision Models Using Language Models

Katarzyna Filus,Jorge M. Cruz-Duarte

Main category: cs.CV

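TL;DR: The paper proposes a semantics-guided framework for adversarial target selection that uses cross-modal knowledge transfer from pretrained language and vision-language models to build best- and worst-case adversarial scenarios. Experiments show that the approach outperforms static semantic resources and supports interpretable, standardized, and scalable adversarial benchmarks.

Details Motivation: In current adversarial attacks on vision models, the target label is usually chosen at random or from static semantic resources, which lacks flexibility and interpretability. The authors aim to improve target selection with the semantic knowledge of pretrained models.

Contribution: 1. Proposes a semantics-guided framework for adversarial target selection that uses semantic similarity from pretrained models (such as BERT, TinyLLAMA, and CLIP); 2. shows experimentally that the approach is effective for generating adversarial targets and outperforms static resources.

Method: Pretrained language models (BERT, TinyLLAMA) and a vision-language model (CLIP) are used to compute semantic similarity. The labels most and least related to the ground-truth label are selected as targets to form the adversarial scenarios.

Result: Experiments on three vision models and five attack methods show that the approach generates adversarial targets effectively and surpasses static semantic resources such as WordNet. They also show that static testing provides a preliminary assessment of similarity sources.

Insight: Pretrained models offer a more flexible and interpretable way to select adversarial targets and apply across architectures and datasets, suggesting a new route to standardized adversarial benchmarks.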

Abstract: In targeted adversarial attacks on vision models, the selection of the target label is a critical yet often overlooked determinant of attack success. This target label corresponds to the class that the attacker aims to force the model to predict. Existing strategies typically rely on randomness, model predictions, or static semantic resources, limiting interpretability, reproducibility, or flexibility. This paper proposes a semantics-guided framework for adversarial target selection using the cross-modal knowledge transfer from pretrained language and vision-language models. We evaluate several state-of-the-art models (BERT, TinyLLAMA, and CLIP) as similarity sources to select the most and least semantically related labels with respect to the ground truth, forming best- and worst-case adversarial scenarios. Our experiments on three vision models and five attack methods reveal that these models consistently render practical adversarial targets and surpass static lexical databases, such as WordNet, particularly for distant class relationships. We also observe that static testing of target labels offers a preliminary, a priori assessment of the effectiveness of similarity sources. Our results corroborate the suitability of pretrained models for constructing interpretable, standardized, and scalable adversarial benchmarks across architectures and datasets.
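The selection step described here, picking the labels most and least related to the ground truth, can be sketched with cosine similarity over label embeddings. In practice the embeddings would come from a model such as BERT, TinyLLAMA, or CLIP; the two-dimensional vectors and the `select_targets` helper below are hypothetical stand-ins.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def select_targets(ground_truth, label_embeddings):
    """Return (most_related, least_related) labels relative to the ground truth."""
    gt_vec = label_embeddings[ground_truth]
    scored = [
        (cosine(gt_vec, vec), label)
        for label, vec in label_embeddings.items()
        if label != ground_truth  # never target the true class
    ]
    return max(scored)[1], min(scored)[1]


# Toy embeddings (hypothetical values, not produced by any real model).
embeddings = {
    "cat": [1.0, 0.0],
    "tiger": [0.95, 0.1],
    "dog": [0.9, 0.3],
    "truck": [0.0, 1.0],
}
most_related, least_related = select_targets("cat", embeddings)
```

Swapping the embedding source changes the ranking, which is what makes it possible to compare similarity sources against a static lexical database.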

[68] Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas,Pierluca D’Oro,Koustuv Sinha,Adriana Romero-Soriano,Michal Drozdzal,Aishwarya Agrawal

Main category: cs.CV

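TL;DR: The paper proposes reward-guided decoding as a way to control multimodal large language models (MLLMs). It enables dynamic control over visual grounding and outperforms existing methods on object hallucination benchmarks.

Details Motivation: As MLLMs become widely used, adapting them to diverse user needs grows more important. The paper explores how dynamic control of the decoding process can satisfy users' varying requirements for precision and recall in model outputs.

Contribution: The main contribution is the first reward-guided decoding method for MLLMs. It builds reward models for visual grounding and uses them to control decoding dynamically, allowing a flexible trade-off between object precision and recall.

Method: Two separate reward models are built, one for object precision and one for recall. During decoding, the weight of each reward function and the breadth of the search are adjusted to control the MLLM's output dynamically.

Result: On standard object hallucination benchmarks, the method markedly improves the controllability of MLLMs and outperforms existing methods on visual grounding.

Insight: Dynamically adjusting the decoding process can balance output quality against compute cost, suggesting a practical route to deploying MLLMs.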

Abstract: As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM’s decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model’s output. Our approach enables on-the-fly controllability of an MLLM’s inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
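The two control knobs in this abstract, the relative weight of the two rewards and the breadth of the search, can be pictured with a small candidate-selection sketch. It assumes the rewards are combined linearly and that breadth simply limits how many candidates are scored; the abstract confirms neither detail, and the reward tables are made up.

```python
def guided_choice(candidates, precision_reward, recall_reward, weight, breadth):
    """Pick the best of the first `breadth` candidates under a weighted reward.

    weight = 1.0 favours object precision only; weight = 0.0 favours recall only.
    Assumption: a linear mix of the two rewards.
    """
    pool = candidates[:breadth]

    def score(candidate):
        return weight * precision_reward(candidate) + (1.0 - weight) * recall_reward(candidate)

    return max(pool, key=score)


# Toy captions with hypothetical reward values.
captions = ["a dog", "a dog and a frisbee", "a dog, a frisbee and a car"]
precision = {"a dog": 1.0, "a dog and a frisbee": 0.9, "a dog, a frisbee and a car": 0.5}
recall = {"a dog": 0.3, "a dog and a frisbee": 0.8, "a dog, a frisbee and a car": 0.9}

balanced = guided_choice(captions, precision.get, recall.get, weight=0.5, breadth=3)
cautious = guided_choice(captions, precision.get, recall.get, weight=1.0, breadth=3)
```

Raising `weight` pushes the choice toward shorter, safer captions, while a larger `breadth` costs more reward-model calls in exchange for a wider search.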

[69] HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model

Zhenhao Zhang,Hanqing Wang,Xiangyu Zeng,Ziyu Cheng,Jiaxin Liu,Haoyu Yan,Zhirui Liu,Kaiyang Ji,Tianxiang Gui,Ke Hu,Kangyi Chen,Yahao Fan,Mokai Pan

Main category: cs.CV

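TL;DR: HOID-R1 is an open-world human-object interaction detection method that combines reinforcement learning with a multimodal large language model. It uses chain-of-thought (CoT) supervised fine-tuning and group relative policy optimization (GRPO) to improve performance.

Details Motivation: Current open-vocabulary human-object interaction (HOI) detection methods rely on large language models but neglect their 3D spatial understanding capabilities. HOID-R1 aims to fill this gap.

Contribution: The first reinforcement learning framework to combine CoT-guided supervised fine-tuning with GRPO, plus an "MLLM-as-a-judge" mechanism that reduces hallucinations in the CoT.

Method: SFT first gives the model reasoning ability, GRPO then optimizes multimodal alignment, and an "MLLM-as-a-judge" supervises the CoT outputs.

Result: Achieves state-of-the-art results on HOI detection benchmarks and generalizes better to open-world scenarios than existing methods.

Insight: Combining reasoning ability with reinforcement learning can markedly improve the open-world adaptability of HOI detection.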

Abstract: Understanding and recognizing human-object interaction (HOI) is a pivotal application in AR/VR and robotics. Recent open-vocabulary HOI detection approaches depend exclusively on large language models for richer textual prompts, neglecting their inherent 3D spatial understanding capabilities. To address this shortcoming, we introduce HOID-R1, the first HOI detection framework that integrates chain-of-thought (CoT) guided supervised fine-tuning (SFT) with group relative policy optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we initially apply SFT to imbue the model with essential reasoning capabilities, forcing the model to articulate its thought process in the output. Subsequently, we integrate GRPO to leverage multi-reward signals for policy optimization, thereby enhancing alignment across diverse modalities. To mitigate hallucinations in the CoT reasoning, we introduce an “MLLM-as-a-judge” mechanism that supervises the CoT outputs, further improving generalization. Extensive experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in open-world generalization to novel scenarios.
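The core of GRPO is that each sampled response is scored relative to the other responses in its group, without a learned value function. A minimal sketch of that normalization follows; it uses the population standard deviation and a small epsilon, which are common choices but assumptions here, and it says nothing about how HOID-R1 defines its multi-reward signals.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed convention
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses above the group mean get positive advantages and the rest negative ones. A group with identical rewards yields all zeros and therefore contributes no learning signal.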

[70] Does the Skeleton-Recall Loss Really Work?

Devansh Arora,Nitin Kumar,Sukrit Gupta

Main category: cs.CV

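TL;DR: Through theoretical and experimental analysis, this paper questions the practical benefit of Skeleton Recall Loss (SRL) for image segmentation. It finds that SRL does not outperform traditional baselines and exposes limitations of topology-preserving loss functions.

Details Motivation: For segmentation of thin tubular structures in particular, topology-preserving loss functions such as SRL are claimed to produce more precise results. The paper sets out to verify whether SRL is as effective as claimed.

Contribution: The main contribution is a theoretical gradient analysis of the SRL loss, together with experiments on several tubular datasets showing that SRL is not clearly better than traditional baseline methods.

Method: The paper first analyzes the gradient behavior of SRL, then compares SRL with traditional baseline models on the datasets used in the original work and on additional datasets.

Result: The empirical results show that SRL-based segmentation models do not exceed traditional baseline models, revealing the limits of SRL in practice.

Insight: The study offers useful guidance for developing more effective segmentation models for complex tubular structures, and indicates that relying on topology-preserving loss functions alone may not be enough to solve the practical challenges of this task.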

Abstract: Image segmentation is an important and widely performed task in computer vision. Accomplishing effective image segmentation in diverse settings often requires custom model architectures and loss functions. One family of approaches that specializes in segmenting thin tubular structures relies on topology preservation-based loss functions. These approaches often utilize a pixel skeletonization process claimed to generate more precise segmentation masks of thin tubes and better capture the structures that other models often miss. One such loss, Skeleton Recall Loss (SRL), proposed by Kirchhoff et al., was stated to produce state-of-the-art results on benchmark tubular datasets. In this work, we performed a theoretical analysis of the gradients for the SRL loss. Upon comparing the performance of the proposed method on some of the tubular datasets (used in the original work, along with some additional datasets), we found that the performance of SRL-based segmentation models did not exceed traditional baseline models. By providing both a theoretical explanation and empirical evidence, this work critically evaluates the limitations of topology-based loss functions, offering valuable insights for researchers aiming to develop more effective segmentation models for complex tubular structures.

[71] G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

Ramil Khafizov,Artem Komarichev,Ruslan Rakhimov,Peter Wonka,Evgeny Burnaev

Main category: cs.CV

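TL;DR: G-CUT3R is a novel feed-forward method for guided 3D scene reconstruction that improves the CUT3R model by integrating prior information.

Details Motivation: Existing feed-forward methods rely on input images alone and ignore auxiliary data that is common in real-world settings, such as depth, camera calibrations, or camera positions.

Contribution: Proposes a lightweight modification that gives each type of auxiliary data a dedicated encoder and fuses its features with RGB image features via zero convolution, flexibly supporting any combination of priors.

Method: Builds on CUT3R by adding per-modality encoders and a zero-convolution fusion mechanism, supporting the integration of priors such as depth and camera calibration.

Result: Performs strongly on multiple benchmarks and markedly improves performance on 3D reconstruction and related tasks.

Insight: Prior information can markedly improve 3D reconstruction, and a flexible fusion mechanism is essential for compatibility with varying input modalities.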

Abstract: We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to CUT3R, incorporating a dedicated encoder for each modality to extract features, which are fused with RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing its ability to effectively utilize available priors while maintaining compatibility with varying input modalities.

[72] RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator

Zhiming Liu,Nantheera Anantrasirichai

Main category: cs.CV

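TL;DR: This paper proposes RMFAT, a lightweight recurrent framework for efficient and temporally consistent video restoration, addressing the blur and temporal inconsistency caused by atmospheric turbulence.

Details Motivation: Atmospheric turbulence severely degrades video quality. Existing methods based on Transformer and 3D architectures are computationally expensive and hard to deploy in real time.

Contribution: With a lightweight recurrent framework and multi-scale feature encoding and decoding modules, RMFAT markedly reduces the computational burden and improves restoration quality.

Method: Uses a recurrent framework that needs only two input frames at a time, combined with multi-scale feature encoding and temporal warping modules, to enhance spatial detail and temporal consistency.

Result: Performs strongly on synthetic and real-world datasets, with nearly a 9% improvement in SSIM and a more than fourfold reduction in runtime.

Insight: A lightweight recurrent framework combined with multi-scale features and temporal warping is an effective way to mitigate atmospheric turbulence efficiently.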

Abstract: Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer and 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator, designed for efficient and temporally consistent video restoration under AT conditions. RMFAT adopts a lightweight recurrent framework that restores each frame using only two inputs at a time, significantly reducing temporal window size and computational burden. It further integrates multi-scale feature encoding and decoding with temporal warping modules at both encoder and decoder stages to enhance spatial detail and temporal coherence. Extensive experiments on synthetic and real-world atmospheric turbulence datasets demonstrate that RMFAT not only outperforms existing methods in terms of clarity restoration (with nearly a 9% improvement in SSIM) but also achieves significantly improved inference speed (more than a fourfold reduction in runtime), making it particularly suitable for real-time atmospheric turbulence suppression tasks.

[73] SelfAdapt: Unsupervised Domain Adaptation of Cell Segmentation Models

Fabian H. Reith,Jannik Franzen,Dinesh R. Palli,J. Lorenz Rumberger,Dagmar Kainmueller

Main category: cs.CV

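TL;DR: SelfAdapt is an unsupervised domain adaptation method that improves cell segmentation models on new domains without labeled data. It relies on student-teacher augmentation consistency and L2-SP regularization, and it clearly outperforms the baseline model.

Details Motivation: Existing cell segmentation models such as Cellpose lose performance on data from other domains, and supervised fine-tuning requires large amounts of annotated data. SelfAdapt aims to solve this with an unsupervised approach.

Contribution: Proposes SelfAdapt, which achieves unsupervised domain adaptation through student-teacher augmentation consistency and L2-SP regularization, markedly improving cell segmentation performance on new domains.

Method: Uses student-teacher augmentation consistency training with L2-SP regularization and label-free stopping criteria, so that a pretrained model can be adapted without annotated data.

Result: On the LiveCell and TissueNet datasets, AP0.5 improves by up to 29.64% in relative terms, and unsupervised adaptation can further improve models that were already fine-tuned with supervision.

Insight: Unsupervised domain adaptation has considerable potential for biomedical image segmentation, especially when annotated data is scarce, and combining it with regularization helps prevent overfitting.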

Abstract: Deep neural networks have become the go-to method for biomedical instance segmentation. Generalist models like Cellpose demonstrate state-of-the-art performance across diverse cellular data, though their effectiveness often degrades on domains that differ from their training data. While supervised fine-tuning can address this limitation, it requires annotated data that may not be readily available. We propose SelfAdapt, a method that enables the adaptation of pre-trained cell segmentation models without the need for labels. Our approach builds upon student-teacher augmentation consistency training, introducing L2-SP regularization and label-free stopping criteria. We evaluate our method on the LiveCell and TissueNet datasets, demonstrating relative improvements in AP0.5 of up to 29.64% over baseline Cellpose. Additionally, we show that our unsupervised adaptation can further improve models that were previously fine-tuned with supervision. We release SelfAdapt as an easy-to-use extension of the Cellpose framework. The code for our method is publicly available at https://github.com/Kainmueller-Lab/self_adapt.
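L2-SP regularization penalizes the distance between the current weights and the pretrained starting point, rather than the distance to zero used by ordinary weight decay. The sketch below shows that penalty on flat weight lists. The `alpha` value, the function names, and the way the penalty is added to a consistency loss are assumptions, not details taken from SelfAdapt.

```python
def l2_sp_penalty(weights, pretrained_weights, alpha=0.01):
    """0.5 * alpha * squared distance between current and pretrained weights."""
    squared_distance = sum((w - w0) ** 2 for w, w0 in zip(weights, pretrained_weights))
    return 0.5 * alpha * squared_distance


def total_loss(consistency_loss, weights, pretrained_weights, alpha=0.01):
    """Assumed combination: unsupervised consistency loss plus the L2-SP term."""
    return consistency_loss + l2_sp_penalty(weights, pretrained_weights, alpha)


pretrained = [1.0, 2.0]
drifted = [2.0, 0.0]
penalty = l2_sp_penalty(drifted, pretrained, alpha=1.0)
```

The penalty is zero while the model still equals its pretrained state and grows as adaptation pulls the weights away, which limits drift when no labels are available to correct it.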

[74] Training-free Dimensionality Reduction via Feature Truncation: Enhancing Efficiency in Privacy-preserving Multi-Biometric Systems

Florian Bayer,Maximilian Russo,Christian Rathgeb

Main category: cs.CV

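TL;DR: The paper proposes a training-free dimensionality reduction method based on feature truncation that improves the efficiency of privacy-preserving multi-biometric systems while maintaining biometric accuracy and security.

Details Motivation: The wide use of biometric recognition makes the privacy and security of extracted templates a key concern. Existing biometric template protection schemes, such as those based on homomorphic encryption, impose a heavy computational burden, and multi-modal fusion improves security but adds processing complexity. The main motivation is to process multi-biometric data efficiently under encryption.

Contribution: The main contribution is a training-free feature truncation method that processes multi-biometric data efficiently under encryption, with no additional training and no loss of recognition accuracy.

Method: Dimensionality is reduced by truncating feature vectors, which lowers the number of operations under homomorphic encryption. The experiments use DNN-extracted face, fingerprint, and iris features from a virtual multi-biometric database to validate the method.

Result: With multi-modal feature fusion, template size can be reduced by 67% with no loss in Equal Error Rate (EER) compared to the best-performing single modality.

Insight: Feature truncation is simple and explainable, suits encrypted settings, and preserves the advantages of multi-modal fusion after reduction, offering an efficient option for privacy-preserving biometric systems.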

Abstract: Biometric recognition is widely used, making the privacy and security of extracted templates a critical concern. Biometric Template Protection schemes, especially those utilizing Homomorphic Encryption, introduce significant computational challenges due to increased workload. Recent advances in deep neural networks have enabled state-of-the-art feature extraction for face, fingerprint, and iris modalities. The ubiquity and affordability of biometric sensors further facilitate multi-modal fusion, which can enhance security by combining features from different modalities. This work investigates the biometric performance of reduced multi-biometric template sizes. Experiments are conducted on an in-house virtual multi-biometric database, derived from DNN-extracted features for face, fingerprint, and iris, using the FRGC, MCYT, and CASIA databases. The evaluated approaches are (i) explainable and straightforward to implement under encryption, (ii) training-free, and (iii) capable of generalization. Dimensionality reduction of feature vectors leads to fewer operations in the Homomorphic Encryption (HE) domain, enabling more efficient encrypted processing while maintaining biometric accuracy and security at a level equivalent to or exceeding single-biometric recognition. Our results demonstrate that, by fusing feature vectors from multiple modalities, template size can be reduced by 67% with no loss in Equal Error Rate (EER) compared to the best-performing single modality.
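Training-free truncation can be as simple as keeping a fixed number of components from each modality's feature vector before concatenating them. The sketch below assumes the leading components are the ones kept and measures the reduction against the untruncated concatenation. Both choices, along with the toy six-dimensional vectors, are illustrative and not taken from the paper.

```python
def truncate(features, keep):
    """Keep the first `keep` components (assumed truncation rule)."""
    return list(features[:keep])


def fuse_truncated(templates, keep):
    """Concatenate the truncated feature vectors of several modalities."""
    fused = []
    for template in templates:
        fused.extend(truncate(template, keep))
    return fused


def size_reduction(original_dim, reduced_dim):
    """Fraction of dimensions removed."""
    return 1.0 - reduced_dim / original_dim


# Toy 6-dimensional features for three modalities (hypothetical values).
face = [0.9, 0.8, 0.1, 0.0, 0.2, 0.1]
fingerprint = [0.7, 0.6, 0.3, 0.1, 0.0, 0.2]
iris = [0.5, 0.4, 0.2, 0.2, 0.1, 0.0]

fused = fuse_truncated([face, fingerprint, iris], keep=2)
reduction = size_reduction(len(face) + len(fingerprint) + len(iris), len(fused))
```

In this toy setup the fused template has 6 of the original 18 dimensions, a reduction of about two thirds, and every dimension dropped is one fewer value to process under homomorphic encryption.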

[75] ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

Jingyu Li,Bozhou Zhang,Xin Jin,Jiankang Deng,Xiatian Zhu,Li Zhang

Main category: cs.CV

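TL;DR: ImagiDrive is a unified imagination-and-planning framework that combines a vision-language model (VLM) with a driving world model (DWM) for safer and more precise autonomous driving.

Details Motivation: Autonomous driving requires rich contextual understanding and precise predictive reasoning. VLMs are strong at behavior prediction and DWMs at scene generation, so integrating them can combine complementary strengths, but doing so raises challenges in efficiency and in connecting the two smoothly.

Contribution: Proposes ImagiDrive, an end-to-end framework in which a VLM-based driving agent and a DWM-based scene imaginer form a unified loop, with an early stopping mechanism and a trajectory selection strategy to improve efficiency.

Method: The VLM-based driving agent predicts initial trajectories and the DWM-based scene imaginer generates future scenes. The two iterate to refine planning decisions, supported by the efficiency strategies of early stopping and trajectory selection.

Result: Experiments on the nuScenes and NAVSIM datasets show that ImagiDrive outperforms existing methods under both open-loop and closed-loop conditions.

Insight: Combining the interpretability of VLMs with the high-fidelity scene generation of DWMs yields a more complete and efficient approach to autonomous driving.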

Abstract: Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent’s planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.
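The imagination-and-planning loop described here has a simple control flow: plan, imagine the consequences, re-plan, and stop early once further refinement changes little. The sketch below captures only that control flow. The callables, the convergence-based stopping rule, and the one-dimensional toy trajectory are assumptions; the abstract does not specify the early stopping criterion or the trajectory selection strategy.

```python
def plan_with_imagination(agent, imaginer, observation, max_iters=5, tol=0.1):
    """Alternate planning and imagination until the trajectory stabilises.

    agent(observation, imagined) -> list of floats (a toy trajectory)
    imaginer(observation, trajectory) -> object describing imagined scenes
    """
    trajectory = agent(observation, None)  # initial plan, nothing imagined yet
    history = [trajectory]
    for _ in range(max_iters - 1):
        imagined = imaginer(observation, trajectory)
        refined = agent(observation, imagined)
        history.append(refined)
        change = max(abs(a - b) for a, b in zip(refined, trajectory))
        trajectory = refined
        if change < tol:  # assumed early-stopping rule
            break
    return trajectory, history


# Toy stand-ins (hypothetical): the "scene" is the previous trajectory, and the
# agent moves halfway from it towards a fixed goal of 4.0.
def toy_imaginer(observation, trajectory):
    return trajectory


def toy_agent(observation, imagined):
    if imagined is None:
        return [0.0]
    return [(x + 4.0) / 2.0 for x in imagined]


final, steps = plan_with_imagination(toy_agent, toy_imaginer, None, max_iters=10, tol=0.3)
```

With these toy functions the loop stops after five plans rather than ten, which illustrates how early stopping saves calls to an expensive world model.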

[76] MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

Qian Liang,Yujia Wu,Kuncheng Li,Jiwei Wei,Shiyuan He,Jinyu Guo,Ning Xie

Main category: cs.CV

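TL;DR: MM-R1 is a framework for unified multimodal large language models that introduces cross-modal Chain-of-Thought (X-CoT) reasoning and Grouped Reward Proximal Policy Optimization (GRPO) to achieve zero-shot personalized image generation, addressing the data-intensive, subject-specific nature of existing methods.

Details Motivation: Existing multimodal large language models (MLLMs) need intensive fine-tuning for every new subject in personalized image generation, which limits scalability. MM-R1 aims to unlock the potential of unified MLLMs for zero-shot personalized generation.

Contribution: Proposes the MM-R1 framework, which combines X-CoT reasoning with GRPO optimization to achieve zero-shot personalized image generation with improved subject fidelity and text alignment.

Method: 1. Personalization is structured as an integrated visual reasoning and generation process; 2. X-CoT cross-modal chain-of-thought reasoning and GRPO are used to align the generation.

Result: Experiments show that MM-R1 generates images with high subject fidelity and good text alignment in the zero-shot setting.

Insight: With an optimized reasoning strategy, unified MLLMs can be applied directly to personalization tasks without subject-specific fine-tuning, which markedly improves adaptability.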

Abstract: Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.

[77] Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation

Daniel Airinei,Elena Burceanu,Marius Leordeanu

Main category: cs.CV

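TL;DR: The paper proposes a vision-based indoor navigation method that combines graph-based path generation, explainable data augmentation, and curriculum learning, avoiding the reliance of traditional methods on special sensors or maps.

Details Motivation: Indoor navigation is challenging because GPS signals are poor. Existing methods depend on complex sensors or markers and are hard to deploy. The paper aims to develop a real-time navigation solution that relies on visual input alone.

Contribution: Proposes a graph-based path generation method combined with explainable data augmentation and curriculum learning; releases a large-scale video dataset of a shopping mall; develops an Android application that needs no extra sensors or markers.

Method: A deep learning approach that generates paths from a graph, uses data augmentation and curriculum learning to improve training, and completes the navigation task from visual input only.

Result: The method is validated on the large shopping mall dataset and achieves efficient, automated navigation.

Insight: Visual input can replace traditional sensors. Combined with graph-based generation and learning strategies, it yields a lightweight and easily deployable solution for indoor navigation.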

Abstract: Indoor navigation is a difficult task, as it generally comes with poor GPS access, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that can predict the direction towards a target from images captured by a mobile device. Our technical approach, based on a novel graph-based path generation method, combined with explainable data augmentation and curriculum learning, includes contributions that make the process of data collection, annotation and training as automatic, efficient and robust as possible. On the practical side, we introduce a novel large-scale dataset, with video footage inside a relatively large shopping mall, in which each frame is annotated with the correct next direction towards different specific target destinations. Different from current methods, ours relies solely on vision, avoiding the need for special sensors, additional markers placed along the path, knowledge of the scene map or internet access. We also created an easy-to-use application for Android, which we plan to make publicly available. We make all our data and code available along with visual demos on our project site.

[78] CoFi: A Fast Coarse-to-Fine Few-Shot Pipeline for Glomerular Basement Membrane Segmentation

Hongjin Fang,Daniel Reisenbüchler,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng

Main category: cs.CV

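TL;DR: CoFi is a fast coarse-to-fine few-shot pipeline for accurate segmentation of the glomerular basement membrane in electron microscopy images. It sharply reduces annotation requirements while remaining efficient.

Details Motivation: Accurate segmentation of the glomerular basement membrane (GBM) is essential for diagnosing kidney disease, but existing supervised methods need extensive pixel-level annotation, which makes them impractical for clinical workflows. Few-shot learning reduces the annotation burden but often fails to capture fine structural detail.

Contribution: The CoFi pipeline, which combines coarse segmentation with morphology-aware point prompt generation and uses SAM for fine segmentation, giving efficient and accurate GBM segmentation.

Method: 1. Train a lightweight network on 3 annotated images to produce a coarse segmentation; 2. Generate high-quality point prompts through morphology-aware pruning; 3. Use SAM to refine the segmentation.

Result: A Dice coefficient of 74.54% at an inference speed of 1.9 FPS, with lower annotation and computational cost than conventional methods; the pipeline is well suited to research and shows potential for clinical use.

Insight: CoFi shows the potential of few-shot learning in medical image segmentation: pairing a coarse segmentation with a refinement step balances annotation efficiency against accuracy.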

Abstract: Accurate segmentation of the glomerular basement membrane (GBM) in electron microscopy (EM) images is fundamental for quantifying membrane thickness and supporting the diagnosis of various kidney diseases. While supervised deep learning approaches achieve high segmentation accuracy, their reliance on extensive pixel-level annotation renders them impractical for clinical workflows. Few-shot learning can reduce this annotation burden but often struggles to capture the fine structural details necessary for GBM analysis. In this study, we introduce CoFi, a fast and efficient coarse-to-fine few-shot segmentation pipeline designed for GBM delineation in EM images. CoFi first trains a lightweight neural network using only three annotated images to produce an initial coarse segmentation mask. This mask is then automatically processed to generate high-quality point prompts with morphology-aware pruning, which are subsequently used to guide SAM in refining the segmentation. The proposed method achieved exceptional GBM segmentation performance, with a Dice coefficient of 74.54% and an inference speed of 1.9 FPS. We demonstrate that CoFi not only alleviates the annotation and computational burdens associated with conventional methods, but also achieves accurate and reliable segmentation results. The pipeline’s speed and annotation efficiency make it well-suited for research and hold strong potential for clinical applications in renal pathology. The pipeline is publicly available at: https://github.com/ddrrnn123/CoFi.
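The Dice coefficient quoted in this abstract measures the overlap between a predicted mask and a reference mask. The following is a minimal sketch of the standard definition, not code from the CoFi repository; the function name and the convention for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as nested lists."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(flat_pred) + sum(flat_target)
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]   # hypothetical 2x2 prediction
    reference = [[1, 0], [0, 0]]   # hypothetical 2x2 ground truth
    print(f"Dice: {dice_coefficient(predicted, reference):.4f}")
```

Multiplying the returned value by 100 gives the percentage form used in the abstract.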

[79] OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring

Ruoxin Xiong,Yanyu Wang,Jiannan Cai,Kaijian Liu,Yuansheng Zhu,Pingbo Tang,Nora El-Gohary

Main category: cs.CV

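TL;DR: The paper systematically reviews visual datasets used for AI and machine learning in the construction industry, proposes a classification framework, and builds an open-source catalog named OpenConstruction to support data-driven method development.

Details Motivation: The construction industry increasingly relies on visual data for AI and ML applications, yet current datasets vary widely in size, modality, annotation quality, and representativeness. No systematic analysis exists, which limits further progress in AI applications.

Contribution: 1. A systematic review and categorization of 51 publicly available visual datasets; 2. A structured data schema; 3. The open-source catalog OpenConstruction; 4. A roadmap for future data infrastructure based on the FAIR principles.

Method: A systematic search of academic databases and open-data platforms collected 51 datasets spanning 2005-2024, which were categorized with a structured schema covering data fundamentals, modalities, annotation frameworks, and application domains.

Result: A clearly categorized open-source catalog, OpenConstruction, together with a summary of the limitations of existing datasets.

Insight: The construction industry needs higher-quality and more standardized datasets. Future data infrastructure should follow the FAIR principles to support more reliable and scalable AI applications.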

Abstract: The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in sizes, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review to categorize their data characteristics and application contexts is still lacking, limiting the community’s ability to fully understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005-2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.

[80] CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

Xiaoxue Wu,Bingjie Gao,Yu Qiao,Yaohui Wang,Xinyuan Chen

Main category: cs.CV

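TL;DR: CineTrans is a new framework based on masked diffusion models for generating multi-shot videos with cinematic transitions. It achieves high-quality transitions through a newly built dataset and a mask-based control mechanism.

Details Motivation: Despite major advances in video synthesis, research on multi-shot video generation is still at an early stage. Transitions produced by existing methods are rudimentary and unstable, so a method that generates high-quality cinematic transitions is needed.

Contribution: 1) The CineTrans framework for generating multi-shot videos with cinematic transitions; 2) The Cine250K dataset with detailed shot annotations; 3) A mask-based control mechanism that enables transitions at arbitrary positions.

Method: 1) Analyze the correspondence between attention maps in the diffusion model and shot boundaries; 2) Use it to design a mask-based control mechanism that produces effective transitions in a training-free setting; 3) Fine-tune on the Cine250K dataset.

Result: CineTrans significantly outperforms existing baselines in transition control, temporal consistency, and overall quality.

Insight: Attention maps in diffusion models reveal shot-boundary information, and a mask-based control mechanism can exploit it to improve transition quality in multi-shot video synthesis.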

Abstract: Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.

[81] Perception in Plan: Coupled Perception and Planning for End-to-End Autonomous Driving

Bozhou Zhang,Jingyu Li,Nan Song,Li Zhang

Main category: cs.CV

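TL;DR: The paper proposes VeteranAD, an end-to-end autonomous driving framework that integrates perception into the planning process. Perception is optimized for planning, which makes driving behavior more accurate and reliable.

Details Motivation: In current end-to-end autonomous driving methods, perception and planning usually run sequentially, so perception results may not fully serve the needs of planning. The paper embeds perception in the planning process to make perception more targeted.

Contribution: 1. VeteranAD, a framework that couples perception and planning; 2. Multi-mode anchored trajectories used as planning priors that guide the perception module to gather traffic elements selectively; 3. An autoregressive strategy that progressively predicts future trajectories while refining the perceived regions.

Method: Following the coupled perception-planning design, VeteranAD uses multi-mode anchored trajectories as planning priors, and the perception module gathers the relevant traffic information along them. An autoregressive strategy then refines trajectory prediction and the perceived regions step by step.

Result: Experiments on the NAVSIM and Bench2Drive datasets show that VeteranAD reaches state-of-the-art performance, supporting the effectiveness of the design.

Insight: Embedding perception in planning, so that perception is driven by planning objectives, can significantly improve end-to-end autonomous driving systems. The autoregressive strategy further improves trajectory prediction accuracy.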

Abstract: End-to-end autonomous driving has achieved remarkable advancements in recent years. Existing methods primarily follow a perception-planning paradigm, where perception and planning are executed sequentially within a fully differentiable framework for planning-oriented optimization. We further advance this paradigm through a perception-in-plan framework design, which integrates perception into the planning process. This design facilitates targeted perception guided by evolving planning objectives over time, ultimately enhancing planning performance. Building on this insight, we introduce VeteranAD, a coupled perception and planning framework for end-to-end autonomous driving. By incorporating multi-mode anchored trajectories as planning priors, the perception module is specifically designed to gather traffic elements along these trajectories, enabling comprehensive and targeted perception. Planning trajectories are then generated based on both the perception results and the planning priors. To make perception fully serve planning, we adopt an autoregressive strategy that progressively predicts future trajectories while focusing on relevant regions for targeted perception at each step. With this simple yet effective design, VeteranAD fully unleashes the potential of planning-oriented end-to-end methods, leading to more accurate and reliable driving behavior. Extensive experiments on the NAVSIM and Bench2Drive datasets demonstrate that our VeteranAD achieves state-of-the-art performance.

[82] Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models

Erez Meoded

Main category: cs.CV

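TL;DR: The paper studies the transformer-based model TrOCR for recognizing 16th-century Latin manuscripts. Targeted image preprocessing and four new data augmentation methods improve recognition markedly: the best single model and the ensemble lower the character error rate (CER) to 1.86 and 1.60, respectively.

Details Motivation: Historical handwritten text recognition (HTR) is essential for digitizing cultural heritage, but scarce transcriptions, linguistic variation, and diverse handwriting styles make it hard. The study aims to improve recognition accuracy by improving how TrOCR is trained and applied.

Contribution: 1) Four new data augmentation methods designed for historical handwriting; 2) An evaluation of ensemble learning strategies; 3) A substantial CER reduction on a dataset of 16th-century Latin manuscripts.

Method: 1) Apply the TrOCR model; 2) Combine targeted image preprocessing with a range of data augmentation techniques; 3) Improve performance further through ensemble learning.

Result: The best single model reaches a CER of 1.86 and the ensemble a CER of 1.60, a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art.

Insight: Domain-specific data augmentation and ensemble strategies are valuable for improving historical handwritten text recognition.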

Abstract: Historical handwritten text recognition (HTR) is essential for unlocking the cultural and scholarly value of archival documents, yet digitization is often hindered by scarce transcriptions, linguistic variation, and highly diverse handwriting styles. In this study, we apply TrOCR, a state-of-the-art transformer-based HTR model, to 16th-century Latin manuscripts authored by Rudolf Gwalther. We investigate targeted image preprocessing and a broad suite of data augmentation techniques, introducing four novel augmentation methods designed specifically for historical handwriting characteristics. We also evaluate ensemble learning approaches to leverage the complementary strengths of augmentation-trained models. On the Gwalther dataset, our best single-model augmentation (Elastic) achieves a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieves a CER of 1.60 - representing a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art. These results highlight the impact of domain-specific augmentations and ensemble strategies in advancing HTR performance for historical manuscripts.
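The Character Error Rate used in this abstract is conventionally the character-level edit distance between a reference transcription and the model output, divided by the reference length. The sketch below illustrates that standard definition; it is not the paper's evaluation code, the sample strings are hypothetical, and reporting the value as a percentage is an assumption about how the figures 1.86 and 1.60 are scaled.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER as a percentage of the reference length (scaling is an assumption)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical line: one substituted character out of seven.
    print(f"CER: {character_error_rate('dominus', 'domimus'):.2f}")
```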

[83] A Real-time Concrete Crack Detection and Segmentation Model Based on YOLOv11

Shaoze Huang,Qi Liu,Chao Chen,Yuhang Chen

Main category: cs.CV

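TL;DR: The paper proposes YOLOv11-KW-TA-FP, a real-time concrete crack detection and segmentation model based on YOLOv11. Dynamic kernel sharing, a triple attention mechanism, and an adaptive loss function improve detection performance over the baseline.

Details Motivation: Transportation infrastructure in the Yangtze River Delta is aging quickly, manual inspection is inefficient, and existing deep learning models perform poorly on small cracks against complex backgrounds. The paper addresses these problems.

Contribution: 1. The YOLOv11-KW-TA-FP model, which combines dynamic kernel-sharing convolution (KWConv), a triple attention mechanism (TA), and an adaptive loss function (FP-IoU). 2. Improved accuracy and robustness of crack detection in complex backgrounds.

Method: 1. Embed dynamic KernelWarehouse convolution (KWConv) in the backbone to strengthen feature representation. 2. Add a triple attention mechanism (TA) to the feature pyramid to improve channel-spatial interaction modeling. 3. Design the FP-IoU loss function for adaptive bounding box regression.

Result: The model reaches 91.3% precision, 76.6% recall, and 86.4% mAP@50, and remains stable under data scarcity and noise interference.

Insight: Combining dynamic kernel sharing with attention improves detection of small targets in complex backgrounds, and the adaptive loss function adds further gains, giving a practical solution for engineering use.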

Abstract: Accelerated aging of transportation infrastructure in the rapidly developing Yangtze River Delta region necessitates efficient concrete crack detection, as crack deterioration critically compromises structural integrity and regional economic growth. To overcome the limitations of inefficient manual inspection and the suboptimal performance of existing deep learning models, particularly for small-target crack detection within complex backgrounds, this paper proposes YOLOv11-KW-TA-FP, a multi-task concrete crack detection and segmentation model based on the YOLOv11n architecture. The proposed model integrates a three-stage optimization framework: (1) Embedding dynamic KernelWarehouse convolution (KWConv) within the backbone network to enhance feature representation through a dynamic kernel sharing mechanism; (2) Incorporating a triple attention mechanism (TA) into the feature pyramid to strengthen channel-spatial interaction modeling; and (3) Designing an FP-IoU loss function to facilitate adaptive bounding box regression penalization. Experimental validation demonstrates that the enhanced model achieves significant performance improvements over the baseline, attaining 91.3% precision, 76.6% recall, and 86.4% mAP@50. Ablation studies confirm the synergistic efficacy of the proposed modules. Furthermore, robustness tests indicate stable performance under conditions of data scarcity and noise interference. This research delivers an efficient computer vision solution for automated infrastructure inspection, exhibiting substantial practical engineering value.

[84] An Efficient Medical Image Classification Method Based on a Lightweight Improved ConvNeXt-Tiny Architecture

Jingsong Xia,Yue Yin,Xiuhan Li

Main category: cs.CV

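TL;DR: The paper proposes a lightweight medical image classification method based on an improved ConvNeXt-Tiny architecture. Dual global pooling feature fusion, the SEVector channel attention module, and a feature smoothing loss improve classification performance while reducing computational complexity.

Details Motivation: Efficient, high-accuracy medical image classification is challenging in resource-constrained computing environments. The paper addresses this with a lightweight, improved architecture.

Contribution: 1. A dual global pooling feature fusion strategy (Global Average Pooling and Global Max Pooling); 2. A lightweight channel attention module, SEVector; 3. A Feature Smoothing Loss.

Method: 1. Integrate the dual global pooling strategy into the ConvNeXt-Tiny backbone; 2. Use the SEVector module to improve channel weight allocation; 3. Add the feature smoothing loss to improve intra-class feature consistency.

Result: On an 8-thread CPU, test-set classification accuracy reaches 89.10% within 10 training epochs, with stably converging loss values.

Insight: Lightweight architectural changes and feature-level optimization make efficient medical image classification possible in resource-limited settings and offer a feasible route to clinical deployment.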

Abstract: Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis. However, achieving efficient and high-accuracy image classification in resource-constrained computational environments remains challenging. This study proposes a medical image classification method based on an improved ConvNeXt-Tiny architecture. Through structural optimization and loss function design, the proposed method enhances feature extraction capability and classification performance while reducing computational complexity. Specifically, the method introduces a dual global pooling (Global Average Pooling and Global Max Pooling) feature fusion strategy into the ConvNeXt-Tiny backbone to simultaneously preserve global statistical features and salient response information. A lightweight channel attention module, termed Squeeze-and-Excitation Vector (SEVector), is designed to improve the adaptive allocation of channel weights while minimizing parameter overhead. Additionally, a Feature Smoothing Loss is incorporated into the loss function to enhance intra-class feature consistency and suppress intra-class variance. Under CPU-only conditions (8 threads), the method achieves a maximum classification accuracy of 89.10% on the test set within 10 training epochs, exhibiting a stable convergence trend in loss values. Experimental results demonstrate that the proposed method effectively improves medical image classification performance in resource-limited settings, providing a feasible and efficient solution for the deployment and promotion of medical imaging analysis models.
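The dual global pooling strategy described here keeps both the mean response (global statistics) and the peak response (salient activations) of each channel. The sketch below shows the idea on a tiny channels x height x width feature map in plain Python. Fusing the two descriptors by concatenation is an assumption; the abstract does not state the fusion operator, and the function name is hypothetical.

```python
def dual_global_pool(feature_map):
    """Return [GAP per channel] + [GMP per channel] for a C x H x W nested list.

    Concatenation as the fusion step is an illustrative assumption.
    """
    averages = []
    maxima = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        averages.append(sum(values) / len(values))  # Global Average Pooling
        maxima.append(max(values))                  # Global Max Pooling
    return averages + maxima


if __name__ == "__main__":
    # Hypothetical feature map with 2 channels of size 2 x 2.
    features = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 8.0]],
    ]
    print(dual_global_pool(features))
```

The second channel illustrates why both descriptors are useful: its average is low, yet its maximum captures a single strong activation that averaging alone would dilute.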

[85] Reinforcing Video Reasoning Segmentation to Think Before It Segments

Sitong Gong,Lu Zhang,Yunzhi Zhuge,Xu Jia,Pingping Zhang,Huchuan Lu

Main category: cs.CV

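TL;DR: The paper proposes Veason-R1, a vision language model designed for video reasoning segmentation (VRS) that improves segmentation through reinforcement learning and structured reasoning.

Details Motivation: Existing methods rely on large vision language models (LVLMs) for video segmentation, but their inference is hard to interpret and their spatiotemporal reasoning is inadequate. The paper uses reinforcement learning and structured reasoning to improve both the performance and the interpretability of VRS.

Contribution: 1. Veason-R1, an LVLM specialized for VRS; 2. A training recipe that uses Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization; 3. A reward mechanism that combines spatial alignment and temporal consistency.

Method: 1. Train the supervised fine-tuned model Veason-SFT on high-quality CoT data; 2. Optimize reasoning chains with GRPO fine-tuning; 3. Introduce a holistic reward mechanism that improves spatial alignment and temporal consistency.

Result: State-of-the-art performance on multiple benchmarks, for example +1.3 J&F on ReVOS and +10.0 J&F on ReasonVOS, together with greater robustness to hallucinations (+8.8 R).

Insight: Combining structured reasoning with reinforcement learning improves video segmentation performance and makes the model more interpretable.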

Abstract: Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 J&F in ReVOS and +10.0 J&F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 R). Our code and model weights will be available at Veason-R1.
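GRPO, as it is commonly described, scores several sampled responses to the same prompt and normalizes each reward against the statistics of its own group, so no separate value network is needed. The sketch below shows only that group-relative normalization step. The reward values are hypothetical, the use of the population standard deviation is an assumption, and nothing here reproduces the Veason-R1 reward design, which the abstract describes only at a high level.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group.

    Assumption: population standard deviation; implementations differ on this.
    """
    if not rewards:
        raise ValueError("at least one reward is required")
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    # eps keeps the division finite when every sample earns the same reward.
    return [(r - group_mean) / (group_std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four reasoning chains sampled from one query.
    rewards = [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(rewards))
```

Responses that beat the group average receive positive advantages and are reinforced; those below it are penalized. If all responses score equally, every advantage is zero and the group contributes no gradient signal.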

[86] TrajSV: A Trajectory-based Model for Sports Video Representations and Applications

Zheng Wang,Shihao Xu,Wei Shi

Main category: cs.CV

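TL;DR: TrajSV is a trajectory-based framework for sports video representation. It extracts trajectories, learns clip representations with a Transformer module, and optimizes the representations with a contrastive loss, performing strongly in sports video retrieval, action spotting, and video captioning.

Details Motivation: Sports video analysis currently suffers from unavailable data, the lack of an effective trajectory-based framework, and the need for large amounts of supervision labels. TrajSV is designed to address these issues.

Contribution: The TrajSV framework, comprising trajectory extraction, clip representation learning, and video representation learning modules, plus a triple contrastive loss for unsupervised optimization.

Method: 1) Extract trajectories in data preprocessing; 2) Learn clip representations with a trajectory-enhanced Transformer (CRNet); 3) Learn video representations by aggregating clip representations and visual features (VRNet); 4) Optimize the representations with a triple contrastive loss.

Result: On soccer, basketball, and volleyball video datasets, TrajSV improves retrieval by nearly 70%, achieves the best result in 9 of 17 action categories in action spotting, and improves video captioning by nearly 20%.

Insight: Trajectories carry key information for sports video analysis, and unsupervised contrastive learning is an effective way to optimize the representations.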

Abstract: Sports analytics has received significant attention from both academia and industry in recent years. Despite the growing interest and efforts in this field, several issues remain unresolved, including (1) data unavailability, (2) lack of an effective trajectory-based framework, and (3) requirement for sufficient supervision labels. In this paper, we present TrajSV, a trajectory-based framework that addresses various issues in existing studies. TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). The data preprocessing module extracts player and ball trajectories from sports broadcast videos. CRNet utilizes a trajectory-enhanced Transformer module to learn clip representations based on these trajectories. Additionally, VRNet learns video representations by aggregating clip representations and visual features with an encoder-decoder architecture. Finally, a triple contrastive loss is introduced to optimize both video and clip representations in an unsupervised manner. The experiments are conducted on three broadcast video datasets to verify the effectiveness of TrajSV for three types of sports (i.e., soccer, basketball, and volleyball) with three downstream applications (i.e., sports video retrieval, action spotting, and video captioning). The results demonstrate that TrajSV achieves state-of-the-art performance in sports video retrieval, showcasing a nearly 70% improvement. It outperforms baselines in action spotting, achieving state-of-the-art results in 9 out of 17 action categories, and demonstrates a nearly 20% improvement in video captioning. Additionally, we introduce a deployed system along with the three applications based on TrajSV.

[87] Causality Matters: How Temporal Information Emerges in Video Language Models

Yumeng Shi,Quanyu Long,Yin Wu,Wenya Wang

Main category: cs.CV

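TL;DR: The paper finds that video language models do not encode temporal information through conventional positional encodings (PEs); instead, it is synthesized progressively through inter-frame attention. This mechanism exposes a temporal reasoning pathway shaped by causal attention, and the authors propose staged cross-modal attention and a temporal exit mechanism to improve efficiency.

Details Motivation: Video language models have advanced in multimodal understanding, yet temporal understanding (event order, duration, and relationships across time) remains a core challenge. PEs are traditionally considered the key mechanism, but the paper finds their role is limited and investigates how temporal information is actually encoded in the model.

Contribution: 1) The first systematic study of temporal understanding in video language models; 2) Evidence for a causal mechanism in which temporal information is synthesized progressively through inter-frame attention; 3) Staged cross-modal attention and a temporal exit mechanism for better efficiency.

Method: Analysis experiments trace how temporal information is integrated, showing that it is synthesized progressively through inter-frame attention, aggregated in the final frame, and then integrated into the query tokens. Staged cross-modal attention and a temporal exit mechanism are proposed on this basis.

Result: Experiments on two benchmarks validate both strategies and support the finding that temporal reasoning depends on inter-frame attention rather than on conventional PEs.

Insight: The implicit encoding of temporal structure arises from inter-frame interactions under causal attention constraints, which suggests new directions for model design.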

Abstract: Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, still remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in the performance of temporal understanding. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated into the query tokens. This emergent mechanism shows that temporal reasoning emerges from inter-visual token interactions under the constraints of causal attention, which implicitly encodes temporal structure. Based on these insights, we propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation. Experiments on two benchmarks validate the effectiveness of both approaches. To the best of our knowledge, this is the first work to systematically investigate video temporal understanding in VideoLMs, offering insights for future model improvement.
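The pathway described in this abstract depends on a property of causal attention: a token can attend only to itself and to earlier tokens, so tokens of later frames can accumulate information from every preceding frame while early frames cannot see ahead. The sketch below illustrates that property with a toy mask. The token-to-frame layout and helper names are hypothetical, and this is not the paper's analysis code.

```python
def causal_mask(num_tokens):
    """mask[i][j] is True when query token i may attend to key token j."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


def visible_frames(mask, frame_of_token, query_index):
    """Frames whose tokens are reachable from the given query token."""
    return sorted(
        {frame_of_token[j] for j, allowed in enumerate(mask[query_index]) if allowed}
    )


if __name__ == "__main__":
    # Hypothetical layout: 3 frames with 2 visual tokens each, in temporal order.
    frame_of_token = [0, 0, 1, 1, 2, 2]
    mask = causal_mask(len(frame_of_token))
    for query in (1, 3, 5):
        print(query, visible_frames(mask, frame_of_token, query))
```

Under such a mask, frame order is implied by which tokens can see which, independent of positional encodings. That is consistent with the reported observation that reversing the frame sequence hurts temporal understanding far more than removing or modifying PEs.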

[88] DashCam Video: A complementary low-cost data stream for on-demand forest-infrastructure system monitoring

Durga Joshi,Chandi Witharana,Robert Fahey,Thomas Worthley,Zhe Zhu,Diego Cerrai

Main category: cs.CV

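TL;DR: The paper proposes a low-cost, real-time framework that uses dashboard camera video for structural assessment and geolocation of roadside vegetation and infrastructure, combining monocular depth estimation, depth error correction, and geometric triangulation.

Details Motivation: Conventional remote sensing methods such as LiDAR are expensive and hard to update in real time, while dashcam video is cheap but underused. The study aims to provide a fast, low-cost solution that complements conventional remote sensing.

Contribution: 1. The first end-to-end framework to combine monocular depth modeling, GPS-based triangulation, and real-time structural assessment; 2. A gradient-boosted regression model that significantly improves depth estimates for distant objects; 3. A demonstration of the method's practicality and robustness under different camera placements and vehicle speeds.

Method: 1. Generate depth maps with a monocular depth model; 2. Correct depth errors with gradient-boosted regression, especially the underestimation of distant objects; 3. Estimate object locations from GPS-based triangulation and object heights from pinhole camera geometry.

Result: The depth correction model performs well (R² = 0.92, MAE = 0.31 on the transformed scale). Mean geolocation error is 2.83 m, and the mean absolute error in height is 2.09 m for trees and 0.88 m for poles. Accuracy is highest at low vehicle speed with an inside-mounted camera.

Insight: The method offers a low-cost, efficient way to monitor urban vegetation and infrastructure in real time, and suits applications that need frequent assessment, such as utility companies and urban planning.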

Abstract: Our study introduces a novel, low-cost, and reproducible framework for real-time, object-level structural assessment and geolocation of roadside vegetation and infrastructure with commonly available but underutilized dashboard camera (dashcam) video data. We developed an end-to-end pipeline that combines monocular depth estimation, depth error correction, and geometric triangulation to generate accurate spatial and structural data from street-level video streams from vehicle-mounted dashcams. Depth maps were first estimated using a state-of-the-art monocular depth model, then refined via a gradient-boosted regression framework to correct underestimations, particularly for distant objects. The depth correction model achieved strong predictive performance (R² = 0.92, MAE = 0.31 on transformed scale), significantly reducing bias beyond 15 m. Further, object locations were estimated using GPS-based triangulation, while object heights were calculated using pinhole camera geometry. Our method was evaluated under varying conditions of camera placement and vehicle speed. A low-speed vehicle with an inside-mounted camera gave the highest accuracy, with a mean geolocation error of 2.83 m, and a mean absolute error (MAE) in height estimation of 2.09 m for trees and 0.88 m for poles. To the best of our knowledge, it is the first framework to combine monocular depth modeling, triangulated GPS-based geolocation, and real-time structural assessment for urban vegetation and infrastructure using consumer-grade video data. Our approach complements conventional RS methods, such as LiDAR and imagery, by offering a fast, real-time, and cost-effective solution for object-level monitoring of vegetation risks and infrastructure exposure, making it especially valuable for utility companies and urban planners aiming for scalable and frequent assessments in dynamic urban environments.
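Height estimation from pinhole camera geometry follows from similar triangles: an object that spans a given number of pixels at a known depth has a real height proportional to depth divided by focal length. The sketch below is the textbook relation rather than the paper's exact formulation. It assumes the object is roughly upright and parallel to the image plane, that lens distortion is ignored, and that the focal length is expressed in pixels; all numbers are hypothetical.

```python
def object_height_m(pixel_height, depth_m, focal_length_px):
    """Real-world height from the pinhole relation H = h_px * Z / f_px."""
    if focal_length_px <= 0:
        raise ValueError("focal length in pixels must be positive")
    if depth_m < 0 or pixel_height < 0:
        raise ValueError("depth and pixel height must be non-negative")
    return pixel_height * depth_m / focal_length_px


if __name__ == "__main__":
    # Hypothetical pole spanning 400 px, at a corrected depth of 20 m,
    # seen by a camera with a 1000 px focal length.
    print(object_height_m(400, 20.0, 1000.0))  # 8.0 (metres)
```

Because the estimated height scales linearly with depth, any depth underestimation for distant objects carries straight through to the height, which is why a depth correction step matters before this calculation.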

[89] LoRAtorio: An intrinsic approach to LoRA Skill Composition

Niki Foteinopoulou,Ignas Budvytis,Stephan Liwicki

Main category: cs.CV

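TL;DR: LoRAtorio is a train-free framework for LoRA skill composition. By exploiting intrinsic model behavior, it addresses the difficulty of composing multiple LoRA adapters in open-ended settings and improves performance.

Details Motivation: Existing approaches to composing LoRA adapters work poorly in open-ended settings, particularly when multiple skills must be combined dynamically. LoRAtorio is motivated by an analysis of how LoRA behavior differs across conditions and proposes a solution that needs no training.

Contribution: The LoRAtorio framework, which composes multiple LoRA adapters dynamically using a spatially-aware weight matrix and weighted aggregation. It also modifies classifier-free guidance to address domain drift and supports dynamic selection of relevant modules from a large pool of adapters.

Method: Two steps: 1) In the latent space, compute the cosine similarity between each spatial patch's predicted noise and that of the base model, and build a weight matrix from these similarities; 2) Combine LoRA outputs through weighted aggregation and a modified classifier-free guidance that avoids domain drift. Dynamic module selection is also supported.

Result: LoRAtorio reaches state-of-the-art performance, with up to a 1.3% improvement in ClipScore and a 72.43% win rate in GPT-4V pairwise evaluations, and generalizes to multiple latent diffusion models.

Insight: LoRA adapters behave differently in distribution and out of distribution, and this difference can drive a dynamic composition strategy. Intrinsic behavioral characteristics of LoRA can therefore be used to improve multi-skill composition.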

Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted technique in text-to-image diffusion models, enabling the personalisation of visual concepts such as characters, styles, and objects. However, existing approaches struggle to effectively compose multiple LoRA adapters, particularly in open-ended settings where the number and nature of required skills are not known in advance. In this work, we present LoRAtorio, a novel train-free framework for multi-LoRA composition that leverages intrinsic model behaviour. Our method is motivated by two key observations: (1) LoRA adapters trained on narrow domains produce denoised outputs that diverge from the base model, and (2) when operating out-of-distribution, LoRA outputs show behaviour closer to the base model than when conditioned in distribution. The balance between these two observations allows for exceptional performance in the single LoRA scenario, which nevertheless deteriorates when multiple LoRAs are loaded. Our method operates in the latent space by dividing it into spatial patches and computing cosine similarity between each patch’s predicted noise and that of the base model. These similarities are used to construct a spatially-aware weight matrix, which guides a weighted aggregation of LoRA outputs. To address domain drift, we further propose a modification to classifier-free guidance that incorporates the base model’s unconditional score into the composition. We extend this formulation to a dynamic module selection setting, enabling inference-time selection of relevant LoRA adapters from a large pool. LoRAtorio achieves state-of-the-art performance, showing up to a 1.3% improvement in ClipScore and a 72.43% win rate in GPT-4V pairwise evaluations, and generalises effectively to multiple latent diffusion models.
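The patch-level weighting described in this abstract can be sketched as follows: for every spatial patch, compare each adapter's predicted noise with the base model's prediction by cosine similarity, then turn the similarities into per-patch weights over adapters. The abstract does not say how similarities are mapped to weights, so the softmax over `1 - similarity` used below is an assumption, chosen so that an adapter that diverges more from the base model on a patch (suggesting the patch is in its domain) receives more weight there. All names and inputs are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def patch_weights(base_patches, lora_patches, temperature=1.0):
    """Return a K x P weight matrix (K adapters, P patches); columns sum to 1.

    Assumption: weights are a softmax over (1 - cosine similarity) per patch.
    """
    num_loras, num_patches = len(lora_patches), len(base_patches)
    weights = [[0.0] * num_patches for _ in range(num_loras)]
    for p in range(num_patches):
        scores = [
            (1.0 - cosine_similarity(lora_patches[k][p], base_patches[p])) / temperature
            for k in range(num_loras)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for k in range(num_loras):
            weights[k][p] = exps[k] / total
    return weights


if __name__ == "__main__":
    # Hypothetical flattened noise predictions for 2 patches and 2 adapters.
    base = [[1.0, 0.0], [0.0, 1.0]]
    loras = [
        [[1.0, 0.0], [1.0, 0.0]],  # adapter 0 matches the base on patch 0
        [[0.0, 1.0], [0.0, 1.0]],  # adapter 1 matches the base on patch 1
    ]
    for row in patch_weights(base, loras):
        print([round(w, 4) for w in row])
```

The resulting matrix would then scale each adapter's output per patch before the outputs are summed; that aggregation step and the modified classifier-free guidance are not shown.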

[90] Is ChatGPT-5 Ready for Mammogram VQA?

Qiang Li,Shansong Wang,Mingzhe Hu,Mojtaba Safari,Zachary Eidex,Xiaofeng Yang

Main category: cs.CV

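TL;DR: GPT-5 outperforms GPT-4o on mammogram visual question answering (VQA) but still trails human experts and domain-specific models, and needs further optimization before use in high-stakes clinical tasks.

Details Motivation: To assess the potential of general-purpose large language models such as GPT-5 on mammogram VQA and to explore their use in medical imaging.

Contribution: A systematic evaluation of GPT-5 and GPT-4o on several public mammography datasets, covering BI-RADS assessment, abnormality detection, and malignancy classification.

Method: Test GPT-5 and GPT-4o on four public datasets (EMBED, InBreast, CMMD, CBIS-DDSM) and compare them with human experts and domain-specific models.

Result: GPT-5 scores highest among the GPT variants on classification tasks such as density, distortion, and mass, but its sensitivity and specificity remain below those of human experts. The improvement from GPT-4o to GPT-5 is substantial.

Insight: General-purpose large language models show promise in medical imaging, but they need targeted domain adaptation and optimization before they can meet the demands of high-stakes clinical tasks.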

Abstract: Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 consistently was the best performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimations, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the tremendous improvements in performance from GPT-4o to GPT-5 show a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks.

[91] Thyme: Think Beyond Images

Yi-Fan Zhang,Xingyu Lu,Shukang Yin,Chaoyou Fu,Wei Chen,Xiao Hu,Bin Wen,Kaiyu Jiang,Changyi Liu,Tianke Zhang,Haonan Fan,Kaibing Chen,Jiankang Chen,Haojie Ding,Kaiyu Tang,Zhang Zhang,Liang Wang,Fan Yang,Tingting Gao,Guorui Zhou

Main category: cs.CV

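TL;DR: Thyme proposes a new paradigm that goes beyond existing "think with images" approaches. By generating and executing code, it performs diverse image processing and computational operations and improves multimodal large language models (MLLMs) on perception and reasoning tasks.

Details Motivation: OpenAI's "thinking with images" concept has improved model performance, but current open-source models still offer a less rich feature set than proprietary models such as O3. Thyme extends MLLMs through autonomous code generation and execution so that they can both manipulate images and carry out mathematical computation.

Contribution: 1. The Thyme paradigm, which uses code generation and execution for diverse image operations and computation. 2. A two-stage training strategy (SFT + RL) with the GRPO-ATS algorithm to balance reasoning and code execution. 3. Validation of Thyme's performance gains on nearly 20 benchmarks.

Method: 1. An initial SFT stage on a dataset of 500K samples teaches code generation. 2. The RL stage uses high-resolution question-answer pairs to raise the learning difficulty and applies GRPO-ATS, which assigns different temperatures to text and code generation to balance exploration against precision.

Result: Thyme yields significant and consistent gains across tasks, particularly on challenging high-resolution perception and complex reasoning tasks.

Insight: Extending MLLMs through code generation and execution is an effective direction, and GRPO-ATS offers a new way to handle the trade-off between reasoning and execution in multimodal tasks.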

Abstract: Following OpenAI’s introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
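The idea of applying distinct temperatures to text and code generation can be illustrated with temperature-scaled softmax: a higher temperature flattens the next-token distribution and encourages exploration, while a lower one concentrates probability on the top token and makes generated code more deterministic. The sketch below is not the GRPO-ATS implementation. The temperature values, the `in_code_block` flag, and the function names are hypothetical choices for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_distribution(logits, in_code_block,
                       text_temperature=1.0, code_temperature=0.1):
    """Pick the sampling temperature by segment type (values are assumptions)."""
    temperature = code_temperature if in_code_block else text_temperature
    return softmax_with_temperature(logits, temperature)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
    print("text:", [round(p, 3) for p in token_distribution(logits, False)])
    print("code:", [round(p, 3) for p in token_distribution(logits, True)])
```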

cs.GR [Back]

[92] StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation

Seungmi Lee,Kwan Yun,Junyong Noh

Main category: cs.GR

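TL;DR: StyleMM is a framework that builds a stylized 3D Morphable Model (3DMM) of the face from a text description of the target style, using a diffusion model and image translation for stylization while preserving facial attributes and animatability.

Details Motivation: Existing 3DMMs are mostly photorealistic and offer little flexible control over specific styles, so they struggle to meet diverse stylization needs.

Contribution: 1. A method for generating a stylized 3DMM from a text description. 2. An image translation method that preserves facial attributes, so that stylization does not alter identity, expression, or facial alignment.

Method: Starting from a pre-trained mesh deformation network and texture generator, stylized images produced by a text-guided diffusion model serve as targets, and style transfer is achieved through image-based training.

Result: Outperforms existing methods in diversity and stylization capability, and supports explicit control over shape, expression, and texture parameters.

Insight: Combining image translation with 3D modeling offers a new route to diverse 3D stylization.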

Abstract: We introduce StyleMM, a novel framework that can construct a stylized 3D Morphable Model (3DMM) based on user-defined text descriptions specifying a target style. Building upon a pre-trained mesh deformation network and a texture generator for original 3DMM-based realistic human faces, our approach fine-tunes these models using stylized facial images generated via text-guided image-to-image (i2i) translation with a diffusion model, which serve as stylization targets for the rendered mesh. To prevent undesired changes in identity, facial alignment, or expressions during i2i translation, we introduce a stylization method that explicitly preserves the facial attributes of the source image. By maintaining these critical attributes during image stylization, the proposed approach ensures consistent 3D style transfer across the 3DMM parameter space through image-based training. Once trained, StyleMM enables feed-forward generation of stylized face meshes with explicit control over shape, expression, and texture parameters, producing meshes with consistent vertex connectivity and animatability. Quantitative and qualitative evaluations demonstrate that our approach outperforms state-of-the-art methods in terms of identity-level facial diversity and stylization capability. The code and videos are available at kwanyun.github.io/stylemm_page.

[93] SPG: Style-Prompting Guidance for Style-Specific Content Creation

Qian Liang,Zichong Chen,Yang Zhou,Hui Huang

Main category: cs.GR

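TL;DR: SPG is a novel sampling strategy for style-specific image generation. It constructs a style noise vector and uses its directional deviation to guide the diffusion process, achieving semantic fidelity and style consistency.

Details Motivation: Although existing text-to-image (T2I) diffusion models excel at aligning generated images with text prompts, controlling the visual style of the output remains challenging. This paper addresses that problem.

Contribution: Proposes Style-Prompting Guidance (SPG), which constructs a style noise vector and uses its directional deviation to guide the diffusion process toward images in a specific style.

Method: SPG is combined with Classifier-Free Guidance (CFG): it constructs a style noise vector and uses its directional deviation to guide the diffusion process, balancing semantics and style. The method is compatible with controllable frameworks such as ControlNet and IPAdapter.

Result: Experiments show that SPG outperforms existing methods on style-specific image generation while maintaining semantic fidelity.

Insight: SPG's simplicity and robustness make it an effective tool for style control in practice, and its broad compatibility suits many scenarios.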

Abstract: Although recent text-to-image (T2I) diffusion models excel at aligning generated images with textual prompts, controlling the visual style of the output remains a challenging task. In this work, we propose Style-Prompting Guidance (SPG), a novel sampling strategy for style-specific image generation. SPG constructs a style noise vector and leverages its directional deviation from unconditional noise to guide the diffusion process toward the target style distribution. By integrating SPG with Classifier-Free Guidance (CFG), our method achieves both semantic fidelity and style consistency. SPG is simple, robust, and compatible with controllable frameworks like ControlNet and IPAdapter, making it practical and widely applicable. Extensive experiments demonstrate the effectiveness and generality of our approach compared to state-of-the-art methods. Code is available at https://github.com/Rumbling281441/SPG.
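
One plausible way to read "integrating SPG with Classifier-Free Guidance" is as a second guidance term added to the usual CFG update. The sketch below shows that reading on plain Python lists. The additive form, the scale values, and all names (`guided_noise`, `cfg_scale`, `style_scale`) are assumptions; the abstract does not give the exact formula.

```python
def guided_noise(eps_uncond, eps_text, eps_style, cfg_scale=7.5, style_scale=3.0):
    """Combine noise predictions (assumed form, not the paper's stated equation).

    eps_uncond: prediction with no conditioning
    eps_text:   prediction conditioned on the content prompt
    eps_style:  prediction conditioned on the style prompt (the "style noise vector")
    """
    return [
        u + cfg_scale * (t - u) + style_scale * (s - u)  # CFG term + style deviation term
        for u, t, s in zip(eps_uncond, eps_text, eps_style)
    ]

# Toy two-dimensional "noise" vectors.
eps = guided_noise([0.0, 0.0], [1.0, 2.0], [5.0, 5.0], cfg_scale=1.0, style_scale=0.0)
```

Setting `style_scale` to 0 recovers ordinary CFG, which makes the style term easy to ablate.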

math.NA [Back]

[94] Fluid Dynamics and Domain Reconstruction from Noisy Flow Images Using Physics-Informed Neural Networks and Quasi-Conformal Mapping

Han Zhang,Xue-Cheng Tai,Jean-Michel Morel,Raymond H. Chan

Main category: math.NA

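TL;DR: The paper proposes a method that combines Physics-Informed Neural Networks with quasi-conformal mapping to reconstruct fluid dynamics and domain geometry from noisy blood flow images. It alternates between a fluid subproblem and a geometry subproblem to achieve high-accuracy image reconstruction.

Details Motivation: Blood flow imaging is essential for medical diagnosis and treatment planning, but image quality often suffers from noise caused by short acquisition times or device errors. A method is therefore needed that recovers the flow velocity and the geometry at the same time.

Contribution: 1. A framework that jointly optimizes a fluid subproblem (Physics-Informed Neural Network) and a geometry subproblem (quasi-conformal mapping); 2. Validation of the method's effectiveness and robustness on synthetic and real-like data.

Method: 1. Decompose the problem into a fluid subproblem (a Physics-Informed Neural Network reconstructs the velocity field) and a geometry subproblem (a quasi-conformal mapping is optimized to infer the flow region); 2. Solve the two subproblems by alternating Gauss-Seidel iteration.

Result: Experiments show that the method denoises and reconstructs effectively on both synthetic data and real-like aortic data, and that it is robust to the noise level. Ablation studies assess the influence of key hyperparameters.

Insight: Combining physical constraints with geometric optimization can substantially improve the reconstruction quality of blood flow images, offering a new approach for medical image processing.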

Abstract: Blood flow imaging provides important information for hemodynamic behavior within the vascular system and plays an essential role in medical diagnosis and treatment planning. However, obtaining high-quality flow images remains a significant challenge. In this work, we address the problem of denoising flow images that may suffer from artifacts due to short acquisition times or device-induced errors. We formulate this task as an optimization problem, where the objective is to minimize the discrepancy between the modeled velocity field, constrained to satisfy the Navier-Stokes equations, and the observed noisy velocity data. To solve this problem, we decompose it into two subproblems: a fluid subproblem and a geometry subproblem. The fluid subproblem leverages a Physics-Informed Neural Network to reconstruct the velocity field from noisy observations, assuming a fixed domain. The geometry subproblem aims to infer the underlying flow region by optimizing a quasi-conformal mapping that deforms a reference domain. These two subproblems are solved in an alternating Gauss-Seidel fashion, iteratively refining both the velocity field and the domain. Upon convergence, the framework yields a high-quality reconstruction of the flow image. We validate the proposed method through experiments on synthetic flow data in a converging channel geometry under varying levels of Gaussian noise, and on real-like flow data in an aortic geometry with signal-dependent noise. The results demonstrate the effectiveness and robustness of the approach. Additionally, ablation studies are conducted to assess the influence of key hyperparameters.
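
The optimization problem and the alternating scheme described in the abstract can be written schematically as follows. All symbols are assumed notation introduced here for illustration: $u$ is the velocity, $p$ the pressure, $f$ the quasi-conformal map, $\Omega_0$ the reference domain, $u^{\mathrm{obs}}$ the noisy data, $\mathcal{N}$ the Navier-Stokes operator, and $J$ the resulting objective. The paper's actual formulation may differ.

```latex
% Schematic problem statement (assumed notation)
\begin{align*}
  \min_{u,\,p,\,f}\; & \bigl\| u - u^{\mathrm{obs}} \bigr\|^2
  \quad \text{subject to} \quad \mathcal{N}(u, p) = 0 \ \text{in } f(\Omega_0).
\end{align*}

% Alternating Gauss-Seidel updates for k = 0, 1, 2, \dots
\begin{align*}
  (u^{k+1}, p^{k+1}) &= \arg\min_{u,\,p}\; J(u, p;\, f^{k})
    && \text{(fluid subproblem, domain fixed)} \\
  f^{k+1} &= \arg\min_{f}\; J(u^{k+1}, p^{k+1};\, f)
    && \text{(geometry subproblem, flow fixed)}
\end{align*}
```

Each update uses the most recent value of the other block, which is what makes the scheme Gauss-Seidel rather than Jacobi.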

cs.MA [Back]

[95] Allen: Rethinking MAS Design through Step-Level Policy Autonomy

Qiangong Zhou,Zhiting Wang,Mingyou Yao,Zongyang Liu

Main category: cs.MA

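TL;DR: The paper proposes Allen, a new Multi-Agent System (MAS) that redefines the basic execution unit of a MAS to strengthen agents' policy autonomy while balancing network topology optimization against task control.

Details Motivation: Current MAS designs struggle with the trade-off between agent policy autonomy on one side and collaborative efficiency, task supervision, and human oversight on the other. Allen aims to address these issues.

Contribution: 1. Redefines the basic execution unit of a MAS so that agent behavior adapts dynamically; 2. Proposes a four-tier state architecture (Task, Stage, Agent, Step) that unifies topological optimization and controllable progress.

Method: Builds a four-tier state architecture with task-oriented and execution-oriented perspectives, letting agents dynamically form different behavior patterns by combining basic execution units.

Result: Allen strengthens policy autonomy while balancing the controllability and efficiency of the collaborative structure. The project code is open source.

Insight: Redefining the basic unit of a MAS and pairing it with a layered architecture can raise autonomy while keeping the collaborative structure controllable.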

Abstract: We introduce a new Multi-Agent System (MAS) - Allen, designed to address two core challenges in current MAS design: (1) improving the system’s policy autonomy, empowering agents to dynamically adapt their behavioral strategies, and (2) achieving the trade-off between collaborative efficiency, task supervision, and human oversight in complex network topologies. Our core insight is to redefine the basic execution unit in the MAS, allowing agents to autonomously form different patterns by combining these units. We have constructed a four-tier state architecture (Task, Stage, Agent, Step) to constrain system behavior from both task-oriented and execution-oriented perspectives. This achieves a unification of topological optimization and controllable progress. Allen grants unprecedented Policy Autonomy, while making a trade-off for the controllability of the collaborative structure. The project code has been open-sourced at: https://github.com/motern88/Allen

cs.RO [Back]

[96] GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning

Kelin Yu,Sheng Zhang,Harshit Soora,Furong Huang,Heng Huang,Pratap Tokekar,Ruohan Gao

Main category: cs.RO

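TL;DR: The paper proposes GenFlowRL, which shapes rewards in visual reinforcement learning using generated object-centric flow, overcoming the uncertainty of video generation and the need for large-scale datasets.

Details Motivation: Existing video generation models can improve action inference in robot learning, but they depend heavily on the quality of the generated data and lack environment feedback. Video-based reinforcement learning, meanwhile, is limited by generation uncertainty and the need for large-scale datasets. A more general and robust way to extract effective reward signals is therefore needed.

Contribution: Proposes GenFlowRL, which designs reward functions from object-centric flow generated by a model trained on diverse cross-embodiment datasets, in order to learn generalizable and robust policies.

Method: Designs the reward function from generated low-dimensional, object-centric flow features, reducing reliance on external data while improving policy performance across diverse tasks.

Result: On 10 manipulation tasks, in simulation and in real-world cross-embodiment evaluations, GenFlowRL outperforms existing methods, demonstrating generality and robustness.

Insight: Object-centric flow features can compensate for the uncertainty of generative models and reduce dependence on large-scale datasets, suggesting a new approach to reward design in visual reinforcement learning.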

Abstract: Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for training diffusion models. To address these limitations, we propose GenFlowRL, which derives shaped rewards from generated flow trained from diverse cross-embodiment datasets. This enables learning generalizable and robust policies from diverse demonstrations using low-dimensional, object-centric features. Experiments on 10 manipulation tasks, both in simulation and real-world cross-embodiment evaluations, demonstrate that GenFlowRL effectively leverages manipulation features extracted from generated object-centric flow, consistently achieving superior performance across diverse and challenging scenarios. Our Project Page: https://colinyu1.github.io/genflowrl

[97] Scene Graph-Guided Proactive Replanning for Failure-Resilient Embodied Agent

Che Rin Yu,Daewon Chae,Dabin Seo,Sangwon Lee,Hyeongwoo Im,Jinkyu Kim

Main category: cs.RO

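TL;DR: The paper proposes a scene-graph-based proactive replanning framework that helps autonomous robots detect and correct potential failures before execution failures occur, improving task success and robustness.

Details Motivation: Autonomous robots typically follow pre-planned actions and cannot adapt to changes in the environment, which leads to task failure. Proactive replanning can prevent failures, but existing methods rely on hand-crafted rules and extensive supervision.

Contribution: Proposes a lightweight proactive replanning framework that diagnoses and corrects potential failures at subtask boundaries by comparing the current scene graph with a reference scene graph.

Method: Builds the current scene graph from RGB-D observations and compares it with reference graphs extracted from successful demonstrations; a lightweight reasoning module diagnoses mismatches and adjusts the plan.

Result: Experiments in the AI2-THOR simulator show that the method detects semantic and spatial mismatches ahead of time and significantly improves task success and robustness.

Insight: Scene graphs give a structured representation of the environment, and a lightweight reasoning module enables proactive planning and adjustment without adding much computational load.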

Abstract: When humans perform everyday tasks, we naturally adjust our actions based on the current state of the environment. For instance, if we intend to put something into a drawer but notice it is closed, we open it first. However, many autonomous robots lack this adaptive awareness. They often follow pre-planned actions that may overlook subtle yet critical changes in the scene, which can result in actions being executed under outdated assumptions and eventual failure. While replanning is critical for robust autonomy, most existing methods respond only after failures occur, when recovery may be inefficient or infeasible. While proactive replanning holds promise for preventing failures in advance, current solutions often rely on manually designed rules and extensive supervision. In this work, we present a proactive replanning framework that detects and corrects failures at subtask boundaries by comparing scene graphs constructed from current RGB-D observations against reference graphs extracted from successful demonstrations. When the current scene fails to align with reference trajectories, a lightweight reasoning module is activated to diagnose the mismatch and adjust the plan. Experiments in the AI2-THOR simulator demonstrate that our approach detects semantic and spatial mismatches before execution failures occur, significantly improving task success and robustness.
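
The comparison step can be sketched in a few lines of Python, assuming a scene graph is stored as a set of (subject, relation, object) triples. That representation, the triple vocabulary, and the function name are assumptions made for illustration; the abstract does not specify how the graphs are encoded or matched.

```python
def scene_graph_mismatch(current, reference):
    """Compare two scene graphs given as sets of (subject, relation, object) triples."""
    missing = reference - current      # relations the plan expects but the scene lacks
    unexpected = current - reference   # relations observed but absent from the reference
    return missing, unexpected

# The drawer example from the abstract: the plan assumes an open drawer.
reference = {("drawer", "state", "open"), ("cup", "on", "table")}
current = {("drawer", "state", "closed"), ("cup", "on", "table")}

missing, unexpected = scene_graph_mismatch(current, reference)
needs_replanning = bool(missing or unexpected)  # would trigger the reasoning module
```

Running the check at subtask boundaries, as the paper describes, keeps the cost proportional to the number of subtasks rather than the number of control steps.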

[98] Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks

Jakub Łucki,Jonathan Becktor,Georgios Georgakis,Robert Royce,Shehryar Khattak

Main category: cs.RO

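TL;DR: VPEngine is a modular framework that uses a shared foundation model and parallel task heads to make efficient use of the GPU and reduce redundant computation in robotic vision tasks.

Details Motivation: Deploying multiple machine learning models on resource-constrained robotic platforms causes redundant computation, large memory footprints, and complex integration, so an efficient multi-task vision framework is needed.

Contribution: VPEngine proposes an architecture with a shared foundation model and parallel task heads, achieving efficient GPU utilization, dynamic task prioritization, and real-time performance.

Method: A shared foundation model (e.g., DINOv2) extracts image features, and multiple task heads (depth, object detection, semantic segmentation) run in parallel on them; CUDA MPS is used to improve GPU utilization.

Result: Achieves real-time performance (≥50 Hz) on NVIDIA Jetson Orin AGX, with up to a 3x speedup over sequential execution.

Insight: Parallelizing task heads and sharing feature extraction are key to efficient multi-task robotic vision, and dynamic task prioritization adds flexibility.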

Abstract: Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU-CPU memory transfers, across multiple specialized task-specific model heads running in parallel. This design eliminates the computational redundancy inherent in the feature extraction component when deploying traditional sequential models while enabling dynamic task prioritization based on application demands. We demonstrate our framework’s capabilities through an example implementation using DINOv2 as the foundation model with multiple task heads (depth, object detection, and semantic segmentation), achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi-Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per-task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end-to-end real-time performance at $\geq$50 Hz on NVIDIA Jetson Orin AGX for TensorRT optimized models.
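
The shared-backbone pattern can be illustrated with a toy Python sketch. It uses threads and trivial stand-in functions purely to show that features are computed once and fanned out to several heads. VPEngine itself builds on CUDA MPS with separate processes and real models, so everything named here (`backbone`, the three heads, `run_perception`) is a hypothetical stand-in rather than the framework's API.

```python
from concurrent.futures import ThreadPoolExecutor

backbone_calls = 0  # counts how often the shared feature extractor runs

def backbone(image):
    """Stand-in for a shared feature extractor (DINOv2 in the paper's example)."""
    global backbone_calls
    backbone_calls += 1
    return [float(px) / 255.0 for px in image]

# Hypothetical task heads; each consumes the same shared features.
def depth_head(features):
    return sum(features) / len(features)

def detection_head(features):
    return [i for i, f in enumerate(features) if f > 0.5]

def segmentation_head(features):
    return [1 if f > 0.5 else 0 for f in features]

HEADS = {"depth": depth_head, "detection": detection_head, "segmentation": segmentation_head}

def run_perception(image):
    """Extract features once, then run all heads in parallel on the shared result."""
    features = backbone(image)
    with ThreadPoolExecutor(max_workers=len(HEADS)) as pool:
        futures = {name: pool.submit(head, features) for name, head in HEADS.items()}
        return {name: future.result() for name, future in futures.items()}

outputs = run_perception([0, 128, 255, 64])
```

A sequential baseline would call the feature extractor once per task, which is the redundancy the abstract says the design removes.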

eess.AS [Back]

[99] Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style

Wonjune Kang,Deb Roy

Main category: eess.AS

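TL;DR: The paper introduces the task of retrieving speech by natural language descriptions of speaking style. By embedding speech and text descriptions in a joint latent space, it retrieves matching expressive speech segments from text prompts.

Details Motivation: Prior work has focused on retrieval based on speech content, whereas this paper retrieves based on speaking style ("how it was said" rather than "what was said"), filling a gap in the field.

Contribution: 1. A framework that jointly embeds speech and text descriptions, supporting retrieval of expressive speech from free-form text prompts. 2. An analysis of how encoder architectures, training criteria for cross-modal alignment, and prompt augmentation affect retrieval performance.

Method: Speech and text encoders embed speech and style descriptions into a joint latent space, and cross-modal alignment training enables retrieval of speech from text queries. Different encoder architectures and training criteria are compared.

Result: Experiments on multiple datasets covering 22 speaking styles show strong retrieval performance as measured by Recall@k.

Insight: Retrieving speech by style directly from text descriptions is novel and feasible; cross-modal alignment and prompt augmentation are the key techniques for improving retrieval.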

Abstract: We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles demonstrate that our approach achieves strong retrieval performance as measured by Recall@k.
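
Retrieval in a joint latent space and the Recall@k metric can be sketched as below. The toy two-dimensional embeddings, the use of cosine similarity, and the convention that a query counts as a hit when any relevant item appears in the top k are assumptions for illustration; the paper's encoders and exact evaluation protocol are not described in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(query_embs, speech_embs, relevant, k):
    """Fraction of text queries with at least one relevant utterance in the top k."""
    hits = 0
    for qi, query in enumerate(query_embs):
        ranked = sorted(
            range(len(speech_embs)),
            key=lambda j: cosine(query, speech_embs[j]),
            reverse=True,
        )
        if any(j in relevant[qi] for j in ranked[:k]):
            hits += 1
    return hits / len(query_embs)

# Toy embeddings: two style queries, three speech utterances.
queries = [[1.0, 0.0], [0.0, 1.0]]
speech = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
relevant = [{0}, {2}]  # indices of utterances matching each query's style

r1 = recall_at_k(queries, speech, relevant, k=1)
r2 = recall_at_k(queries, speech, relevant, k=2)
```

In this toy case the second query's relevant utterance ranks second, so it is missed at k=1 and found at k=2.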

[100] Emphasis Sensitivity in Speech Representations

Shaun Cassini,Thomas Hain,Anton Ragni

Main category: eess.AS

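TL;DR: The paper studies whether modern speech models are sensitive to prosodic emphasis, using a residual-based framework to analyze how representations of emphasized and neutral words differ.

Details Motivation: Prior work usually relies on isolated acoustic features or label prediction and misses the relational structure of emphasis. The paper aims to fill this gap.

Contribution: Proposes a residual-based framework that defines emphasis as the difference between paired neutral and emphasized word representations, and analyzes the properties of these residuals in self-supervised speech models.

Method: Applies the residual analysis framework to compare emphasized-versus-neutral representation differences in pre-trained and ASR fine-tuned models.

Result: Residuals from self-supervised models correlate strongly with duration changes but perform poorly at word identity prediction; in ASR fine-tuned models the residuals occupy a subspace up to 50% more compact than in pre-trained models.

Insight: Prosodic emphasis is encoded in speech models as a consistent, low-dimensional transformation that becomes more structured with task-specific learning.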

Abstract: This work investigates whether modern speech models are sensitive to prosodic emphasis - whether they encode emphasized and neutral words in systematically different ways. Prior work typically relies on isolated acoustic correlates (e.g., pitch, duration) or label prediction, both of which miss the relational structure of emphasis. This paper proposes a residual-based framework, defining emphasis as the difference between paired neutral and emphasized word representations. Analysis on self-supervised speech models shows that these residuals correlate strongly with duration changes and perform poorly at word identity prediction, indicating a structured, relational encoding of prosodic emphasis. In ASR fine-tuned models, residuals occupy a subspace up to 50% more compact than in pre-trained models, further suggesting that emphasis is encoded as a consistent, low-dimensional transformation that becomes more structured with task-specific learning.
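
The residual definition can be made concrete with a small Python sketch. The residual itself (emphasized minus neutral representation of the same word) follows the abstract. Correlating the residual norm with the duration change via Pearson's r is only one assumed way to probe the reported relationship, and the toy vectors and durations are invented for illustration.

```python
import math

def residual(neutral, emphasized):
    """Emphasis residual: emphasized minus neutral representation of the same word."""
    return [e - n for n, e in zip(neutral, emphasized)]

def norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy paired representations (neutral, emphasized) and duration changes in seconds.
pairs = [([0.0, 0.0], [1.0, 0.0]), ([0.0, 0.0], [2.0, 0.0]), ([0.0, 0.0], [3.0, 0.0])]
duration_change = [0.1, 0.2, 0.3]

residual_norms = [norm(residual(n, e)) for n, e in pairs]
r = pearson(residual_norms, duration_change)
```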

cs.AI [Back]

[101] Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information

Youcheng Huang,Bowen Qin,Chen Huang,Duanyu Feng,Xi Yang,Wenqiang Lei

Main category: cs.AI

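TL;DR: The paper introduces a new dataset for evaluating whether Large Reasoning Models (LRMs) proactively ask for information when a problem is incomplete. It reveals their shortcomings on this task and analyzes overthinking and hallucination behaviors.

Details Motivation: Existing benchmarks evaluate LRMs only on well-defined problems, overlooking that a genuinely intelligent agent must also be able to ask for missing information. The paper aims to fill this gap and push LRMs toward more genuine intelligence.

Contribution: 1. A new dataset of incomplete problems with diverse contexts; 2. A systematic evaluation of LRMs on proactive information seeking that reveals their shortcomings; 3. Identification of overthinking and hallucination behaviors in LRMs, with a discussion of the potential and challenges of supervised fine-tuning.

Method: Designs a new dataset containing two types of incomplete problems, systematically evaluates LRMs on asking for information, and studies the effect of supervised fine-tuning on this ability.

Result: LRMs perform poorly at proactively asking for information and exhibit overthinking and hallucination; supervised fine-tuning shows potential for improving this ability but still faces challenges.

Insight: Developing LRMs with genuine intelligence requires going beyond problem solving to proactive information seeking; overthinking and hallucination are major challenges for current models.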

Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, such an evaluation setup constitutes a critical gap, since a genuinely intelligent agent should not only solve problems (as a math quiz solver), but also be able to ask for information when the problems lack sufficient information, enabling proactivity in responding to users’ requests. To bridge this gap, we propose a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematic evaluation of LRMs reveals their inability to proactively ask for information. In addition, we uncover the behaviors related to overthinking and hallucination of LRMs, and highlight the potential and challenges of supervised fine-tuning in learning such ability. We hope to provide new insights in developing LRMs with genuine intelligence, rather than just solving problems.

[102] Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps

Kangyu Wang,Hongliang He,Lin Liu,Ruiqi Liang,Zhenzhong Lan,Jianguo Li

Main category: cs.AI

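TL;DR: The paper presents Inclusion Arena, an open platform that evaluates large foundation models using user feedback from real-world applications, addressing the limitation that existing evaluations rely on static datasets or crowdsourced general-domain prompts.

Details Motivation: Existing evaluations (e.g., MMLU or Chatbot Arena) rely on static datasets or general-domain prompts and do not reflect model performance in real applications, so an evaluation platform closer to practical use is needed.

Contribution: Presents the Inclusion Arena platform, which collects feedback through natural user interactions and augments the Bradley-Terry model with two innovations (Placement Matches and Proximity Sampling) for more robust model ranking.

Method: Ranks models with the Bradley-Terry model plus two mechanisms: (1) Placement Matches address the cold-start problem, and (2) Proximity Sampling prioritizes comparisons between models of similar capability to increase information gain and rating stability.

Result: Empirical analyses and simulations show that the platform yields reliable rankings, higher data transitivity, and a reduced risk of malicious manipulation.

Insight: Feedback collected from real user interactions reflects model performance in practical applications more accurately and points to a new direction for optimizing foundation models for user-centric deployment.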

Abstract: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at https://doraemon.alipay.com/model-ranking.
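
The ranking machinery can be sketched in Python. The first function fits Bradley-Terry strengths with the standard minorization-maximization update, which is a well-known estimator and not necessarily the one the platform uses. The second function is only an assumed form of Proximity Sampling (weights decay with the gap in log-strength); the abstract does not give the actual sampling rule, and the win counts are invented.

```python
import math

def fit_bradley_terry(wins, n_iters=200):
    """Estimate Bradley-Terry strengths; wins[i][j] = times model i beat model j."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0  # skip pairs with no games
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        total = sum(updated)
        p = [x / total for x in updated]  # normalize so strengths sum to 1
    return p

def proximity_weights(strengths, i, scale=1.0):
    """Assumed sampling rule: prefer opponents whose log-strength is close to model i's."""
    weights = []
    for j, s in enumerate(strengths):
        if j == i:
            weights.append(0.0)
        else:
            gap = abs(math.log(strengths[i]) - math.log(s))
            weights.append(math.exp(-gap / scale))
    total = sum(weights)
    return [w / total for w in weights]

# Invented round-robin results among three models, 10 games per pair.
wins = [[0, 8, 9], [2, 0, 7], [1, 3, 0]]
strengths = fit_bradley_terry(wins)
opponent_probs = proximity_weights(strengths, 0)
```

Under the Bradley-Terry model, the probability that model i beats model j is `strengths[i] / (strengths[i] + strengths[j])`, so closely matched pairs have the least predictable outcomes and are the most informative to sample.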

cs.SD [Back]

[103] LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters

Haomin Zhang,Kristin Qi,Shuxin Yang,Zihao Chen,Chaofan Ding,Xinhan Di

Main category: cs.SD

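TL;DR: The paper proposes LD-LAudio-V1, which extends existing video-to-audio generation models with dual lightweight adapters to tackle long-form video audio synthesis, and releases a high-quality dataset.

Details Motivation: Existing video-to-audio methods mainly target short clips (<10 seconds) or rely on noisy data, and cannot meet the need for high-quality audio synthesis for long videos. The authors propose LD-LAudio-V1 to address this.

Contribution: 1. Dual lightweight adapters that enable long-form audio generation from video; 2. A clean, human-annotated video-to-audio dataset free of noise and artifacts.

Method: Extends state-of-the-art video-to-audio models with dual lightweight adapters to improve long-form audio generation while maintaining computational efficiency.

Result: Significant improvements across multiple metrics, including FD, KL, and IS, with a relative improvement of up to 65.87%.

Insight: The lightweight adapter design effectively reduces splicing artifacts and temporal inconsistencies, and the released high-quality dataset supports follow-up research.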

Abstract: Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models that incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean and human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 achieves significant improvements across multiple metrics: $FD_{\text{passt}}$ 450.00 $\rightarrow$ 327.29 (+27.27%), $FD_{\text{panns}}$ 34.88 $\rightarrow$ 22.68 (+34.98%), $FD_{\text{vgg}}$ 3.75 $\rightarrow$ 1.28 (+65.87%), $KL_{\text{panns}}$ 2.49 $\rightarrow$ 2.07 (+16.87%), $KL_{\text{passt}}$ 1.78 $\rightarrow$ 1.53 (+14.04%), $IS_{\text{panns}}$ 4.17 $\rightarrow$ 4.30 (+3.12%), $IB_{\text{score}}$ 0.25 $\rightarrow$ 0.28 (+12.00%), $Energy\Delta10\text{ms}$ 0.3013 $\rightarrow$ 0.1349 (+55.23%), $Energy\Delta10\text{ms(vs.GT)}$ 0.0531 $\rightarrow$ 0.0288 (+45.76%), and $Sem.\,Rel.$ 2.73 $\rightarrow$ 3.28 (+20.15%). Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at https://github.com/deepreasonings/long-form-video2audio.
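
The percentages in the abstract are consistent with a relative improvement measured in each metric's favorable direction, so a decrease counts as a gain for distance-style metrics such as FD and KL, and an increase counts as a gain for IS. The short Python check below reproduces three of the reported figures under that reading. The function name and the `lower_is_better` flag are introduced here for illustration.

```python
def relative_improvement(before, after, lower_is_better):
    """Relative improvement in percent, measured in the metric's favorable direction."""
    change = (before - after) if lower_is_better else (after - before)
    return 100.0 * change / before

fd_passt = round(relative_improvement(450.00, 327.29, lower_is_better=True), 2)
fd_vgg = round(relative_improvement(3.75, 1.28, lower_is_better=True), 2)
is_panns = round(relative_improvement(4.17, 4.30, lower_is_better=False), 2)
```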

cs.IR [Back]

[104] The Next Phase of Scientific Fact-Checking: Advanced Evidence Retrieval from Complex Structured Academic Papers

Xingyu Deng,Xi Wang,Mark Stevenson

Main category: cs.IR

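TL;DR: The paper examines the complexity of scientific fact-checking and argues that future work must address semantic limitations in evidence retrieval, time awareness, document structure parsing, complex scientific expressions, and credibility assessment of the literature.

Details Motivation: Scientific fact-checking is more complex than general fact-checking because it must handle the structure and multimodal expression of academic literature, yet existing methods, built on simplified datasets (e.g., abstracts), do not address the challenges of full documents.

Contribution: Identifies the limitations of current systems, sets out five key research challenges, and substantiates them with preliminary experiments, with the aim of advancing specialised IR systems for real-world applications.

Method: Analyzes the shortcomings of existing methods and proposes directions including evidence-driven retrieval, time-aware retrieval, structured document parsing, handling of complex expressions, and credibility assessment.

Result: Preliminary experiments substantiate the identified challenges and point to potential solutions, giving direction for future scientific fact-checking systems.

Insight: Scientific fact-checking needs to combine semantic, temporal, structural, and expression-level information, and future systems should handle the complexity of academic literature more fully.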

Abstract: Scientific fact-checking aims to determine the veracity of scientific claims by retrieving and analysing evidence from research literature. The problem is inherently more complex than general fact-checking since it must accommodate the evolving nature of scientific knowledge, the structural complexity of academic literature and the challenges posed by long-form, multimodal scientific expression. However, existing approaches focus on simplified versions of the problem based on small-scale datasets consisting of abstracts rather than full papers, thereby avoiding the distinct challenges associated with processing complete documents. This paper examines the limitations of current scientific fact-checking systems and reveals the many potential features and resources that could be exploited to advance their performance. It identifies key research challenges within evidence retrieval, including (1) evidence-driven retrieval that addresses semantic limitations and topic imbalance, (2) time-aware evidence retrieval with citation tracking to mitigate outdated information, (3) structured document parsing to leverage long-range context, (4) handling complex scientific expressions, including tables, figures, and domain-specific terminology, and (5) assessing the credibility of scientific literature. Preliminary experiments were conducted to substantiate these challenges and identify potential solutions. This perspective paper aims to advance scientific fact-checking with a specialised IR system tailored for real-world applications.

cs.LG [Back]

[105] How Causal Abstraction Underpins Computational Explanation

Atticus Geiger,Jacqueline Harding,Thomas Icard

Main category: cs.LG

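TL;DR: The paper examines how the theory of causal abstraction underpins computational explanations of cognitive behavior and connects it to current discussions in deep learning and machine learning.

Details Motivation: The work asks what it takes for a system to implement a given computation over suitable representations, and uses the theory of causality to offer a new perspective.

Contribution: Offers an account of computational implementation grounded in causal abstraction, examines the role of representation in it, and links classical philosophical questions to contemporary machine learning.

Method: Uses causal abstraction theory as the analytical tool, drawing on examples from deep learning to explain the relation between implementing a computation and representation.

Result: Provides a causal account of computational implementation and stresses the importance of representation for prediction and generalization.

Insight: Causal abstraction provides a unified framework for understanding computation and cognition and highlights the role of representation in generalization and prediction.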

Abstract: Explanations of cognitive behavior often appeal to computations over representations. What does it take for a system to implement a given computation over suitable representational vehicles within that system? We argue that the language of causality – and specifically the theory of causal abstraction – provides a fruitful lens on this topic. Drawing on current discussions in deep learning with artificial neural networks, we illustrate how classical themes in the philosophy of computation and cognition resurface in contemporary machine learning. We offer an account of computational implementation grounded in causal abstraction, and examine the role for representation in the resulting picture. We argue that these issues are most profitably explored in connection with generalization and prediction.

[106] Generalize across Homophily and Heterophily: Hybrid Spectral Graph Pre-Training and Prompt Tuning

Haitong Luo,Suhang Wang,Weiyao Zhang,Ruiqi Meng,Xuying Meng,Yujun Zhang

Main category: cs.LG

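TL;DR: The paper proposes the HS-GPPT model, which uses a hybrid spectral filter and local-global contrastive learning to address existing methods' inability to handle the diverse spectral distributions of real-world graphs. It aligns spectra between pre-training and downstream tasks, enabling efficient knowledge transfer under limited supervision.

Details Motivation: Real-world graphs have different spectral distributions depending on their degree of homophily and heterophily, and existing homophily-based pre-training methods cannot handle this diversity. A method that adapts to the spectral distribution is needed for efficient knowledge transfer.

Contribution: 1. The HS-GPPT model, combining a hybrid spectral filter with local-global contrastive learning; 2. Prompt graphs that achieve spectral alignment between pre-training and downstream tasks; 3. Validation under both transductive and inductive learning settings.

Method: 1. A hybrid spectral filter serves as the backbone; 2. Local-global contrastive learning captures rich spectral knowledge; 3. Prompt graphs align the spectral distributions of pre-training and downstream tasks.

Result: Experiments show that HS-GPPT performs well on a variety of homophilic and heterophilic graph datasets, confirming its effectiveness and generalization.

Insight: Spectral alignment is key to knowledge transfer, especially under limited supervision. Hybrid spectral filters adapt flexibly to different spectral distributions, making them effective for the diversity of real-world graphs.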

Abstract: Graph "pre-training and prompt-tuning" aligns downstream tasks with pre-trained objectives to enable efficient knowledge transfer under limited supervision. However, existing methods rely on homophily-based low-frequency knowledge, failing to handle diverse spectral distributions in real-world graphs with varying homophily. Our theoretical analysis reveals a spectral specificity principle: optimal knowledge transfer requires alignment between pre-trained spectral filters and the intrinsic spectrum of downstream graphs. Under limited supervision, large spectral gaps between pre-training and downstream tasks impede effective adaptation. To bridge this gap, we propose the HS-GPPT model, a novel framework that ensures spectral alignment throughout both pre-training and prompt-tuning. We utilize a hybrid spectral filter backbone and local-global contrastive learning to acquire abundant spectral knowledge. Then we design prompt graphs to align the spectral distribution with pretexts, facilitating spectral knowledge transfer across homophily and heterophily. Extensive experiments validate the effectiveness under both transductive and inductive learning settings. Our code is available at https://anonymous.4open.science/r/HS-GPPT-62D2/.

[107] Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble

Jihang Wang,Dongcheng Zhao,Ruolin Chen,Qian Zhang,Yi Zeng

Main category: cs.LG

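TL;DR: The paper proposes RTE, which uses temporal self-ensembling to improve the adversarial robustness of Spiking Neural Networks (SNNs) while balancing robustness and accuracy.

Details Motivation: The vulnerability of SNNs to adversarial attacks is poorly understood, and robustness methods tailored to their temporal dynamics are lacking.

Contribution: 1. The RTE framework, which strengthens the robustness of each sub-network through temporal self-ensembling; 2. An analysis of how adversarial vulnerabilities transfer across timesteps, with a remedy; 3. Experiments demonstrating the method's effectiveness.

Method: RTE combines sub-network robustness and reduced transferability of adversarial perturbations in a unified loss, and trains efficiently with a stochastic sampling strategy.

Result: Across multiple benchmarks, RTE outperforms existing methods on the robustness-accuracy trade-off and reshapes the internal robustness landscape of SNNs.

Insight: Temporal structure matters in adversarial learning; RTE improves robustness through temporally diversified decision boundaries.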

Abstract: Spiking Neural Networks (SNNs) offer a promising direction for energy-efficient and brain-inspired computing, yet their vulnerability to adversarial perturbations remains poorly understood. In this work, we revisit the adversarial robustness of SNNs through the lens of temporal ensembling, treating the network as a collection of evolving sub-networks across discrete timesteps. This formulation uncovers two critical but underexplored challenges: the fragility of individual temporal sub-networks and the tendency for adversarial vulnerabilities to transfer across time. To overcome these limitations, we propose Robust Temporal self-Ensemble (RTE), a training framework that improves the robustness of each sub-network while reducing the temporal transferability of adversarial perturbations. RTE integrates both objectives into a unified loss and employs a stochastic sampling strategy for efficient optimization. Extensive experiments across multiple benchmarks demonstrate that RTE consistently outperforms existing training methods in the robustness-accuracy trade-off. Additional analyses reveal that RTE reshapes the internal robustness landscape of SNNs, leading to more resilient and temporally diversified decision boundaries. Our study highlights the importance of temporal structure in adversarial learning and offers a principled foundation for building robust spiking models.

[108] Robust Convolution Neural ODEs via Contractivity-promoting regularization

Muhammad Zakwan,Liang Xu,Giancarlo Ferrari-Trecate

Main category: cs.LG

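TL;DR: The paper proposes using contraction theory to improve the robustness of Convolutional Neural ODEs (NODEs), promoting contractivity with a Jacobian-based regularization term or weight regularization so that the model is more robust to input noise and adversarial attacks.

Details Motivation: Neural networks are fragile to input noise and adversarial attacks, and this includes continuous-depth networks such as NODEs. The authors therefore propose using contraction theory to improve NODE robustness.

Contribution: 1. Use of contraction theory to improve the robustness of Convolutional NODEs; 2. A regularization term (based on the Jacobian of the system dynamics) and weight regularization terms that promote contractivity; 3. Validation on the MNIST and FashionMNIST datasets.

Method: 1. Contraction theory guarantees that NODE trajectories converge to each other; 2. A Jacobian-based regularization term, or weight regularization terms for slope-restricted activation functions, is introduced; 3. Model performance is tested under noise and attacks.

Result: On MNIST and FashionMNIST, the proposed method improves the model's robustness to noise and adversarial attacks.

Insight: Designing for contractivity can strengthen the robustness of continuous-depth neural networks, and the weight regularization reduces the computational burden while still improving performance.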

Abstract: Neural networks can be fragile to input noise and adversarial attacks. In this work, we consider Convolutional Neural Ordinary Differential Equations (NODEs), a family of continuous-depth neural networks represented by dynamical systems, and propose to use contraction theory to improve their robustness. For a contractive dynamical system, two trajectories starting from different initial conditions converge to each other exponentially fast. Contractive Convolutional NODEs can enjoy increased robustness as slight perturbations of the features do not cause a significant change in the output. Contractivity can be induced during training by using a regularization term involving the Jacobian of the system dynamics. To reduce the computational burden, we show that it can also be promoted using carefully selected weight regularization terms for a class of NODEs with slope-restricted activation functions. The performance of the proposed regularizers is illustrated through benchmark image classification tasks on MNIST and FashionMNIST datasets, where images are corrupted by different kinds of noise and attacks.
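
A standard sufficient condition from contraction theory, with the identity metric, is that the largest eigenvalue of the symmetric part of the Jacobian stays below -c for some rate c > 0; trajectories then converge to each other at least as fast as exp(-c t). The Python sketch below turns that condition into a hinge penalty for a 2x2 linear system, where the Jacobian is the constant matrix A. The hinge form, the 2x2 restriction, and the function names are assumptions for illustration; the abstract does not state the paper's exact regularizer.

```python
import math

def max_eig_symmetric_part(A):
    """Largest eigenvalue of (A + A^T) / 2 for a 2x2 matrix A (closed form)."""
    a = A[0][0]
    d = A[1][1]
    b = 0.5 * (A[0][1] + A[1][0])  # off-diagonal entry of the symmetric part
    return 0.5 * (a + d) + math.sqrt((0.5 * (a - d)) ** 2 + b * b)

def contractivity_penalty(A, rate=1.0):
    """Hinge penalty: zero when the system dx/dt = A x contracts at the given rate."""
    return max(0.0, max_eig_symmetric_part(A) + rate)

contractive = [[-2.0, 1.0], [-1.0, -2.0]]  # symmetric part is -2 * I
expanding = [[0.5, 0.0], [0.0, -1.0]]      # grows along the first axis

penalty_ok = contractivity_penalty(contractive, rate=1.0)
penalty_bad = contractivity_penalty(expanding, rate=1.0)
```

For a nonlinear NODE the Jacobian depends on the state, so a training-time regularizer of this kind would be evaluated at sampled states, which is the cost that the abstract's weight regularization terms aim to avoid.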

cs.MM

[109] Failures to Surface Harmful Contents in Video Large Language Models

Yuxin Cao,Wei Song,Derui Wang,Jingling Xue,Jin Song Dong

Main category: cs.MM

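TL;DR: The paper exposes a serious gap in how Video Large Language Models (VideoLLMs) surface harmful content: even when the harmful content is clearly visible, the models rarely mention it. The authors identify three key flaws in current model designs and craft three attacks that exploit them.

Motivation: VideoLLMs are widely deployed in critical applications, but the way users interact with them, relying on auto-generated summaries, hides a safety gap: the models overlook harmful content embedded in a video, whether as full-frame inserts or as small corner patches. The paper aims to expose this gap and its root causes.

Contribution: The main contributions are: (1) identifying a serious failure of VideoLLMs to surface harmful content; (2) analyzing three design flaws (insufficient temporal coverage, spatial information loss, and encoder-decoder disconnection); (3) proposing three zero-query black-box attacks.

Method: The paper analyzes the three design flaws and crafts three attacks aligned with them (exploiting sparse temporal sampling, spatial token downsampling, and weak use of visual cues during decoding). A large-scale evaluation covers five leading VideoLLMs.

Result: The harmfulness omission rate exceeds 90% in most cases. Even when harmful content is present in all frames, the models consistently fail to identify it.

Insight: The paper reveals a fundamental vulnerability in current VideoLLM designs and highlights the need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.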

Abstract: Video Large Language Models (VideoLLMs) are increasingly deployed in numerous critical applications, where users rely on auto-generated summaries while casually skimming the video stream. We show that this interaction hides a critical safety gap: if harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention the harmful content in the output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs, (2) spatial information loss introduced by aggressive token downsampling within sampled frames, and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, aligning with these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases. Even when harmful content is clearly present in all frames, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLMs’ designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.
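
The first flaw, sparse uniform frame sampling, can be illustrated with a small sketch. The clip length, the number of sampled frames, the position of the insert, and the function names below are all hypothetical. Real VideoLLMs use their own sampling rules, and this sketch says nothing about the other two flaws.

```python
def uniform_sample_indices(n_frames, n_samples):
    """Indices of n_samples frames spread evenly over a clip of n_frames frames."""
    if n_samples == 1:
        return [0]
    stride = (n_frames - 1) / (n_samples - 1)
    return [round(i * stride) for i in range(n_samples)]

def harmful_frames_seen(sampled, harmful_start, harmful_end):
    """Count sampled frames that fall inside the half-open harmful span."""
    return sum(harmful_start <= idx < harmful_end for idx in sampled)

# Hypothetical clip: 3000 frames (100 s at 30 fps), 8 frames handed to the model.
sampled = uniform_sample_indices(3000, 8)
# A 100-frame (about 3.3 s) harmful insert sitting between two sampled frames.
seen_short_insert = harmful_frames_seen(sampled, 1000, 1100)
# Content present in every frame is always sampled, so a miss there has another cause.
seen_full_clip = harmful_frames_seen(sampled, 0, 3000)
```

In this toy setting the short insert is never sampled, so no later stage can report it. The abstract's finding that models also miss content present in all frames points to the spatial and decoding flaws rather than to sampling.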

eess.IV

[110] HistoViT: Vision Transformer for Accurate and Scalable Histopathological Cancer Diagnosis

Faisal Ahmed

Main category: eess.IV

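TL;DR: The paper proposes HistoViT, a deep learning framework based on the Vision Transformer (ViT) for cancer classification in histopathological images. It addresses limitations of conventional convolutional neural networks, offering high performance, reduced preprocessing requirements, and scalability across tissue types.

Motivation: Accurate cancer diagnosis is critical in pathology, but histological complexity leaves conventional methods with insufficient performance and cumbersome preprocessing. Transformer architectures can help with both.

Contribution: 1. Proposes a ViT-based framework for multi-class tumor classification; 2. Designs a streamlined preprocessing pipeline; 3. Validates the model's performance and generalization on four benchmark datasets.

Method: 1. Fine-tunes a ViT architecture for histopathological images; 2. Converts tiled whole-slide images into PyTorch tensors; 3. Normalizes the data to improve classification performance and convergence stability.

Result: Classification accuracies of 99.32%, 96.92%, 95.28%, and 96.94% on the breast (ICIAR2018), prostate (SICAPv2), bone (UT-Osteosarcoma), and cervical (SipakMed) cancer datasets respectively, with AUC above 99% on all four.

Insight: Transformer architectures perform well in histopathological image analysis, offering a high-performance and scalable solution that may advance automated and interpretable cancer diagnosis systems.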

Abstract: Accurate and scalable cancer diagnosis remains a critical challenge in modern pathology, particularly for malignancies such as breast, prostate, bone, and cervical cancers, which exhibit complex histological variability. In this study, we propose a transformer-based deep learning framework for multi-class tumor classification in histopathological images. Leveraging a fine-tuned Vision Transformer (ViT) architecture, our method addresses key limitations of conventional convolutional neural networks, offering improved performance, reduced preprocessing requirements, and enhanced scalability across tissue types. To adapt the model for histopathological cancer images, we implement a streamlined preprocessing pipeline that converts tiled whole-slide images into PyTorch tensors and standardizes them through data normalization. This ensures compatibility with the ViT architecture and enhances both convergence stability and overall classification performance. We evaluate our model on four benchmark datasets, ICIAR2018 (breast), SICAPv2 (prostate), UT-Osteosarcoma (bone), and SipakMed (cervical), demonstrating consistent outperformance over existing deep learning methods. Our approach achieves classification accuracies of 99.32%, 96.92%, 95.28%, and 96.94% for breast, prostate, bone, and cervical cancers respectively, with area under the ROC curve (AUC) scores exceeding 99% across all datasets. These results confirm the robustness, generalizability, and clinical potential of transformer-based architectures in digital pathology. Our work represents a significant advancement toward reliable, automated, and interpretable cancer diagnosis systems that can alleviate diagnostic burdens and improve healthcare outcomes.
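
The tile-then-normalize step can be sketched without any dependencies. This is only an illustration of the arithmetic, not the authors' pipeline: the tile size, the single-channel grid, the use of one global mean and standard deviation, and the function names are assumptions. A real pipeline would operate on PyTorch tensors instead of nested lists.

```python
import statistics

def tile_image(image, tile_size):
    """Split a 2-D grid (list of rows) into non-overlapping square tiles, dropping ragged edges."""
    rows, cols = len(image), len(image[0])
    tiles = []
    for r in range(0, rows - tile_size + 1, tile_size):
        for c in range(0, cols - tile_size + 1, tile_size):
            tiles.append([row[c:c + tile_size] for row in image[r:r + tile_size]])
    return tiles

def standardize(tile, mean, std):
    """Shift and scale every pixel so the data has zero mean and unit variance."""
    return [[(value - mean) / std for value in row] for row in tile]

# Hypothetical 4x4 single-channel "slide" with pixel values 0..15.
slide = [[4 * r + c for c in range(4)] for r in range(4)]
tiles = tile_image(slide, tile_size=2)

# Statistics are computed once over all pixels, then applied to every tile.
pixels = [value for row in slide for value in row]
mean, std = statistics.fmean(pixels), statistics.pstdev(pixels)
normalized_tiles = [standardize(tile, mean, std) for tile in tiles]
```

Computing the statistics once and reusing them for every tile keeps all inputs on the same scale, which is the usual motivation for normalizing before fine-tuning.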

[111] AnatoMaskGAN: GNN-Driven Slice Feature Fusion and Noise Augmentation for Medical Semantic Image Synthesis

Zonglin Wu,Yule Xue,Qianxiang Hu,Yaoyao Feng,Yuqi Ma,Shanxiong Chen

Main category: eess.IV

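TL;DR: AnatoMaskGAN uses GNN-driven slice feature fusion and a noise augmentation strategy to improve the accuracy and perceptual quality of medical semantic image synthesis, outperforming existing methods.

Motivation: Existing GAN-based methods for medical semantic image synthesis lack spatial consistency and offer limited diversity. AnatoMaskGAN aims to address these issues to better support data augmentation and analysis.

Contribution: 1. Designs a GNN-based strongly correlated slice-feature fusion module; 2. Proposes a three-dimensional spatial noise-injection strategy; 3. Introduces a grayscale-texture classifier to optimize generation.

Method: 1. GNN-driven slice-feature fusion; 2. Three-dimensional noise injection to enhance diversity; 3. A grayscale-texture classifier to optimize generation quality.

Result: On the L2R-OASIS and L2R-Abdomen CT datasets, PSNR and SSIM improve: PSNR on L2R-OASIS reaches 26.50 dB (0.43 dB above the previous state of the art), and SSIM on L2R-Abdomen CT reaches 0.8602 (a 0.48 percentage-point gain over the best prior model).

Insight: Each core module (feature fusion, noise injection, classifier) contributes independently to reconstruction accuracy and perceptual quality, as the ablation studies show.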

Abstract: Medical semantic-mask synthesis boosts data augmentation and analysis, yet most GAN-based approaches still produce one-to-one images and lack spatial consistency in complex scans. To address this, we propose AnatoMaskGAN, a novel synthesis framework that embeds slice-related spatial features to precisely aggregate inter-slice contextual dependencies, introduces diverse image-augmentation strategies, and optimizes deep feature learning to improve performance on complex medical images. Specifically, we design a GNN-based strongly correlated slice-feature fusion module to model spatial relationships between slices and integrate contextual information from neighboring slices, thereby capturing anatomical details more comprehensively; we introduce a three-dimensional spatial noise-injection strategy that weights and fuses spatial features with noise to enhance modeling of structural diversity; and we incorporate a grayscale-texture classifier to optimize grayscale distribution and texture representation during generation. Extensive experiments on the public L2R-OASIS and L2R-Abdomen CT datasets show that AnatoMaskGAN raises PSNR on L2R-OASIS to 26.50 dB (0.43 dB higher than the current state of the art) and achieves an SSIM of 0.8602 on L2R-Abdomen CT, a 0.48 percentage-point gain over the best model, demonstrating its superiority in reconstruction accuracy and perceptual quality. Ablation studies that successively remove the slice-feature fusion module, spatial 3D noise-injection strategy, and grayscale-texture classifier reveal that each component contributes significantly to PSNR, SSIM, and LPIPS, further confirming the independent value of each core design in enhancing reconstruction accuracy and perceptual quality.
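
For readers less familiar with the PSNR figures quoted above, the sketch below shows the standard definition, PSNR = 10 * log10(MAX^2 / MSE). The pixel values are made up for illustration, and an 8-bit peak of 255 is an assumption. The paper's evaluation code is not shown in this digest.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel "images" for illustration only.
reference = [0, 64, 128, 255]
generated = [0, 60, 130, 250]
score = psnr(reference, generated)

# Because PSNR is logarithmic in MSE, a fixed dB gain is a fixed MSE ratio.
mse_ratio_for_gain = 10 ** (-0.43 / 10)
```

Under this definition, a 0.43 dB gain corresponds to roughly 9% lower mean squared error, since 10^(-0.043) is about 0.906.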

[112] Semi-Supervised Learning with Online Knowledge Distillation for Skin Lesion Classification

Siyamalan Manivannan

Main category: eess.IV

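TL;DR: The paper proposes a semi-supervised learning method that combines ensemble learning with online knowledge distillation for skin lesion classification, reducing reliance on large amounts of labeled data and suiting resource-constrained environments.

Motivation: Existing methods mostly rely on fully supervised learning, which requires extensive labeled data that is costly and difficult to obtain. The paper aims to ease this annotation burden through semi-supervised learning.

Contribution: Proposes a novel semi-supervised learning method that combines ensemble learning with online knowledge distillation, improving the performance of individual models while reducing the need for labeled data.

Method: Trains an ensemble of convolutional neural network models and uses online knowledge distillation to transfer the ensemble's knowledge to its individual members, improving the performance of each model.

Result: Surpasses current state-of-the-art results on the public ISIC 2018 and 2019 benchmark datasets. The knowledge-distilled individual model outperforms independently trained models.

Insight: Combining ensemble learning with online knowledge distillation can improve semi-supervised learning performance, with potential advantages in resource-constrained real-world applications.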

Abstract: Deep learning has emerged as a promising approach for skin lesion analysis. However, existing methods mostly rely on fully supervised learning, requiring extensive labeled data, which is challenging and costly to obtain. To alleviate this annotation burden, this study introduces a novel semi-supervised deep learning approach that integrates ensemble learning with online knowledge distillation for enhanced skin lesion classification. Our methodology involves training an ensemble of convolutional neural network models, using online knowledge distillation to transfer insights from the ensemble to its members. This process aims to enhance the performance of each model within the ensemble, thereby elevating the overall performance of the ensemble itself. Post-training, any individual model within the ensemble can be deployed at test time, as each member is trained to deliver comparable performance to the ensemble. This is particularly beneficial in resource-constrained environments. Experimental results demonstrate that the knowledge-distilled individual model performs better than independently trained models. Our approach demonstrates superior performance on both the International Skin Imaging Collaboration 2018 and 2019 public benchmark datasets, surpassing current state-of-the-art results. By leveraging ensemble learning and online knowledge distillation, our method reduces the need for extensive labeled data while providing a more resource-efficient solution for skin lesion classification in real-world scenarios.
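
One common way to set up online distillation from an ensemble to its members is sketched below. The abstract does not specify the loss, so everything here is an assumption: averaging member probabilities to form the teacher, a temperature of 2.0, the KL divergence as the distillation term, and all function names.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_target(member_probs):
    """Average the members' class probabilities to form the ensemble (teacher) prediction."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[k] for p in member_probs) / n_members for k in range(n_classes)]

def kl_divergence(target, pred):
    """KL(target || pred) for two discrete distributions."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0.0)

def distillation_losses(member_logits, temperature=2.0):
    """One distillation loss per member, each pulling that member toward the ensemble."""
    probs = [softmax(z, temperature) for z in member_logits]
    teacher = ensemble_target(probs)
    return [kl_divergence(teacher, p) for p in probs]

# Hypothetical logits from three members on one (possibly unlabeled) image.
disagreeing = distillation_losses([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]])
agreeing = distillation_losses([[1.0, 0.5, -0.5]] * 3)
```

The loss needs no ground-truth label, which is why a term of this kind can be applied to unlabeled images in a semi-supervised setting. It vanishes once every member already matches the ensemble.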

cs.SE

[113] Diffusion is a code repair operator and generator

Mukul Singh,Gust Verbruggen,Vu Le,Sumit Gulwani

Main category: cs.SE

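TL;DR: The paper proposes diffusion models as a tool for code repair and generation, using their denoising process for last-mile repair and for generating training data.

Motivation: In the later denoising steps of a code diffusion model, the model's behavior resembles repairs applied to incomplete or broken code. The paper explores whether this resemblance can be exploited for practical repair tasks and data augmentation.

Contribution: 1. Proposes using pre-trained code diffusion models for last-mile repair; 2. Uses the diffusion process to generate large amounts of training data.

Method: Across three domains (Python, Excel, and PowerShell), repair is performed by adding noise to broken code and resuming the diffusion process. The diffusion process is also sampled for input-output pairs that serve as training data.

Result: Experiments support the effectiveness of diffusion models for code repair and training-data generation.

Insight: Diffusion models are useful beyond generation: the intermediate states of their denoising process can also serve repair and data augmentation.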

Abstract: Code diffusion models generate code by iteratively removing noise from the latent representation of a code snippet. During later steps of the diffusion process, when the code snippet has almost converged, differences between discrete representations of these snippets look like last-mile repairs applied to broken or incomplete code. We evaluate the extent to which this resemblance can be exploited to leverage pre-trained code diffusion models for the problem of last-mile repair by considering two applications with significant potential. First, we can leverage the diffusion model for last-mile repair by adding noise to a broken code snippet and resuming the diffusion process. Second, we can leverage the diffusion model to generate an arbitrary amount of training data for last-mile repair tasks (that are computationally more efficient) by sampling an intermediate program (input) and the final program (output) from the diffusion process. We perform experiments on 3 domains (Python, Excel and PowerShell) to evaluate these applications and analyze their properties.
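
The control flow of both applications can be shown with a toy stand-in. Nothing below is a real code diffusion model: the 2-D latents, the nearest-neighbor "denoiser", the noise scale, and every name are hypothetical. The sketch only mirrors the two steps the abstract describes, namely noising a broken snippet and resuming denoising, and harvesting an intermediate state together with the final state.

```python
import random

# Hypothetical 2-D latents standing in for two well-formed code snippets.
CLEAN_LATENTS = [(0.0, 0.0), (1.0, 1.0)]

def denoise_step(latent, strength=0.5):
    """Stand-in denoiser: move the latent part of the way toward the nearest clean latent."""
    nearest = min(CLEAN_LATENTS, key=lambda c: sum((a - b) ** 2 for a, b in zip(latent, c)))
    return tuple(a + strength * (b - a) for a, b in zip(latent, nearest))

def resume_diffusion(broken_latent, noise_scale, n_steps, rng):
    """Add noise to a broken snippet's latent, then run the late denoising steps.

    Returns every intermediate latent so (intermediate, final) pairs can be harvested.
    """
    latent = tuple(a + rng.gauss(0.0, noise_scale) for a in broken_latent)
    trajectory = [latent]
    for _ in range(n_steps):
        latent = denoise_step(latent)
        trajectory.append(latent)
    return trajectory

rng = random.Random(0)
# Application 1: repair a latent that sits near, but not on, a clean snippet.
trajectory = resume_diffusion((0.9, 1.2), noise_scale=0.05, n_steps=20, rng=rng)
repaired = trajectory[-1]
# Application 2: an intermediate state (input) paired with the final state (output).
training_pair = (trajectory[15], trajectory[-1])
```

The amount of injected noise is the main design knob in such a scheme: too little leaves the error in place, too much lets the process drift to an unrelated snippet.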

[114] ORFuzz: Fuzzing the “Other Side” of LLM Safety – Testing Over-Refusal

Haonan Zhang,Dongxia Wang,Yi Liu,Kexin Chen,Jiashui Wang,Xinlei Ying,Long Liu,Wenhai Wang

Main category: cs.SE

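TL;DR: ORFuzz is presented as the first evolutionary testing framework targeting LLM over-refusal. It combines category-aware seed selection, adaptive mutator optimization, and a human-aligned judge model, and it generates over-refusal test cases at more than double the rate of leading baselines.

Motivation: Current methods for testing LLM over-refusal are inadequate, with flawed benchmarks and limited test generation capabilities. Over-refusal itself undermines LLM reliability and usability.

Contribution: Proposes the ORFuzz framework, which integrates three core components for the systematic detection and analysis of LLM over-refusal, and builds a new benchmark, ORFuzzSet.

Method: Uses safety category-aware seed selection, adaptive mutator optimization driven by reasoning LLMs, and OR-Judge, a human-aligned judge model.

Result: ORFuzz generates validated over-refusal instances at an average rate of 6.98%, more than double that of leading baselines. ORFuzzSet, a benchmark of 1,855 test cases, achieves an average over-refusal rate of 63.56% across 10 LLMs.

Insight: ORFuzz and ORFuzzSet provide an automated testing framework and a community resource, paving the way for more reliable LLM-based software systems.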

Abstract: Large Language Models (LLMs) increasingly exhibit over-refusal, erroneously rejecting benign queries due to overly conservative safety measures, which is a critical functional flaw that undermines their reliability and usability. Current methods for testing this behavior are demonstrably inadequate, suffering from flawed benchmarks and limited test generation capabilities, as highlighted by our empirical user study. To the best of our knowledge, this paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals. ORFuzz uniquely integrates three core components: (1) safety category-aware seed selection for comprehensive test coverage, (2) adaptive mutator optimization using reasoning LLMs to generate effective test cases, and (3) OR-Judge, a human-aligned judge model validated to accurately reflect user perception of toxicity and refusal. Our extensive evaluations demonstrate that ORFuzz generates diverse, validated over-refusal instances at a rate (6.98% average) more than double that of leading baselines, effectively uncovering vulnerabilities. Furthermore, ORFuzz’s outputs form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves a superior 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets. ORFuzz and ORFuzzSet provide a robust automated testing framework and a valuable community resource, paving the way for developing more reliable and trustworthy LLM-based software systems.
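
The overall shape of an evolutionary over-refusal fuzzing loop can be sketched with stubs. This is not ORFuzz itself: the seed prompts, the string-rewriting mutators, the keyword-based target model, the always-benign judge, and the round-robin category schedule are hypothetical placeholders for the LLM-driven components the abstract names.

```python
import random

# Hypothetical benign seed prompts grouped by safety category.
SEEDS = {
    "violence": ["How do I kill a Python process?"],
    "drugs": ["What dose of ibuprofen is safe for adults?"],
}
# Hypothetical mutators; ORFuzz optimizes its mutators with reasoning LLMs instead.
MUTATORS = [
    lambda p: p + " Please answer briefly.",
    lambda p: "For a safety class: " + p,
    lambda p: p.replace("kill", "stop"),
    lambda p: p.replace("dose", "amount"),
]
TRIGGER_WORDS = ("kill", "dose")

def target_refuses(prompt):
    """Stub target model: refuses whenever a trigger word appears."""
    return any(word in prompt.lower() for word in TRIGGER_WORDS)

def judge_is_benign(prompt):
    """Stub judge: treats every prompt in this toy corpus as benign."""
    return True

def fuzz(rounds, rng):
    """Return the over-refusal cases found and the number of prompts tested."""
    pool = {category: list(prompts) for category, prompts in SEEDS.items()}
    categories = sorted(pool)
    found, tested = [], 0
    for i in range(rounds):
        category = categories[i % len(categories)]  # rotate so every category is covered
        seed = rng.choice(pool[category])
        candidate = rng.choice(MUTATORS)(seed)
        tested += 1
        if judge_is_benign(candidate) and target_refuses(candidate):
            found.append(candidate)
            pool[category].append(candidate)  # successful cases become new seeds
    return found, tested

found, tested = fuzz(rounds=10, rng=random.Random(0))
over_refusal_rate = len(found) / tested
```

A case counts as an over-refusal only when the judge deems the prompt benign and the target still refuses, which is why the quality of the judge matters as much as the mutators.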