cs.CL

[1] FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Gyubok Lee,Elea Bach,Eric Yang,Tom Pollard,Alistair Johnson,Edward Choi,Yugang Jia,Jong Ha Lee

Main category: cs.CL

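TL;DR: FHIR-AgentBench is a benchmark for evaluating the question-answering ability of large language model (LLM) agents on realistic interoperable healthcare data (the HL7 FHIR standard). It contains 2,931 clinical questions and compares different data retrieval, interaction, and reasoning strategies.

Details Motivation: As healthcare data standardization shifts toward HL7 FHIR, existing benchmarks lack the realism needed to evaluate models on this complex resource model, so a new evaluation tool is needed.

Contribution: Proposes FHIR-AgentBench, the first clinical question-answering benchmark built on the HL7 FHIR standard, intended to advance the development of LLM agents for healthcare.

Method: Systematically evaluates agentic frameworks by comparing data retrieval strategies (direct API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation).

Result: The experiments expose the practical challenges of retrieving data from FHIR resources and of reasoning over complex logic, both of which strongly affect question-answering performance.

Insight: The complexity of interoperable healthcare data and the difficulty of reasoning over it are the key bottlenecks; future work should target improvements in retrieval and reasoning specifically.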

Abstract: The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.
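
To make the "direct FHIR API calls" retrieval strategy concrete, the sketch below builds a FHIR RESTful search URL of the form `[base]/[type]?name=value` and unpacks the resources from a searchset Bundle. It is only an illustration, not the benchmark's agent code: the base URL, patient id, and code value are hypothetical placeholders, and the helper names are invented here.

```python
from urllib.parse import urlencode


def fhir_search_url(base_url: str, resource_type: str, **params: str) -> str:
    """Build a FHIR RESTful search URL: [base]/[type]?name=value&..."""
    # Sort parameters so the generated URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{base_url.rstrip('/')}/{resource_type}?{query}"


def extract_resources(bundle: dict) -> list[dict]:
    """Return the resources held in a FHIR Bundle (entry[].resource)."""
    return [e["resource"] for e in bundle.get("entry", []) if "resource" in e]


# Hypothetical server, patient id, and code value, for illustration only.
url = fhir_search_url(
    "https://example.org/fhir", "Observation", patient="123", code="718-7"
)

# A hand-written Bundle standing in for a server response.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [{"resource": {"resourceType": "Observation", "id": "obs-1"}}],
}
resources = extract_resources(sample_bundle)
```

An agent using this strategy has to pick the resource type and search parameters itself, which is where the retrieval difficulty described in the abstract arises.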

[2] Readme_AI: Dynamic Context Construction for Large Language Models

Millie Vyas,Timothy Blattner,Alden Dima

Main category: cs.CL

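TL;DR: The paper presents Readme_AI, a protocol for dynamically building context for large language models (LLMs). Using a metadata file supplied by the data source owner, it markedly improves the accuracy and reliability of LLM answers to dataset-specific queries.

Details Motivation: Although LLMs are trained on large amounts of data, they can still give inaccurate or unreliable information for a user's specific query. Dynamically building query-relevant context can substantially improve response quality.

Contribution: The main contribution is an extensible protocol, the Readme_AI Model Context Protocol (MCP), for dynamically grounding LLMs in specialized, owner-provided data, which reduces hallucinations and improves model responses.

Method: The paper proposes a specification for building context dynamically. A metadata file created by the data source owner supports LLM reasoning through extensible types such as crawling web pages, fetching data from repositories, and downloading and parsing publications. The context is grouped and formatted using user-specified tags.

Result: In the demonstration, Readme_AI changed LLM behavior on queries about the NIST Hedgehog library from irrelevant or hallucinated answers to reasoning about the library's use and generating code examples.

Insight: Dynamic context construction shows potential for improving LLM performance in specialized domains, particularly when the metadata comes directly from the data source owner and therefore matches user needs more closely.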

Abstract: Despite being trained on significant amounts of data, Large Language Models (LLMs) can provide inaccurate or unreliable information in the context of a user’s specific query. Providing query-specific context significantly improves the usefulness of their responses. In this paper, we present a specification that can be used to dynamically build context for data sources. The data source owner creates the file containing metadata for LLMs to use when reasoning about dataset-related queries. To demonstrate our proposed specification, we created a prototype Readme_AI Model Context Protocol (MCP) server that retrieves the metadata from the data source and uses it to dynamically build context. Some features that make this specification dynamic are the extensible types that represent crawling web-pages, fetching data from data repositories, downloading and parsing publications, and general text. The context is formatted and grouped using user-specified tags that provide clear contextual information for the LLM to reason about the content. We demonstrate the capabilities of this early prototype by asking the LLM about the NIST-developed Hedgehog library, for which common LLMs often provide inaccurate and irrelevant responses containing hallucinations. With Readme_AI, the LLM receives enough context that it is now able to reason about the library and its use, and even generate code interpolated from examples that were included in the Readme_AI file provided by Hedgehog’s developer. Our primary contribution is an extensible protocol for dynamically grounding LLMs in specialized, owner-provided data, enhancing responses from LLMs and reducing hallucinations. The source code for the Readme_AI tool is posted here: https://github.com/usnistgov/readme_ai .

[3] Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers

Ruochi Li,Haoxuan Zhang,Edward Gehringer,Ting Xiao,Junhua Ding,Haihua Chen

Main category: cs.CL

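TL;DR: The paper examines the strengths and weaknesses of large language models (LLMs) in automatically generating reviews of scientific papers. LLMs do well on descriptive and affirmational content but lack critical reasoning and contextual understanding. Using a large-scale benchmark dataset, the study finds that LLMs perform poorly at identifying a paper's weaknesses.

Details Motivation: As scientific submissions surge, traditional peer review is under growing strain, and researchers are exploring LLM-generated reviews to improve efficiency. However, the shortcomings of LLMs in critical thinking and contextual understanding had not been systematically evaluated.

Contribution: Proposes a comprehensive evaluation framework that combines semantic similarity analysis with structured knowledge graph metrics to assess LLM-generated reviews systematically. Builds a large-scale benchmark of 1,683 papers and 6,495 expert reviews spanning multiple conferences and years.

Method: Builds a review dataset from ICLR and NeurIPS, generates reviews with five LLMs, and quantifies the differences between LLM and human reviews using metrics such as semantic similarity and knowledge graph node counts.

Result: LLMs perform well when describing a paper's contributions and methods (e.g., GPT-4o produces 15.74% more entities than human reviewers in the strengths section) but fall clearly short in identifying weaknesses and adjusting feedback to paper quality (e.g., GPT-4o produces 59.42% fewer entities than human reviewers in the weaknesses section).

Insight: The study gives an empirical basis for developing LLM-assisted reviewing tools; future work needs to improve the critical reasoning and contextual adaptability of LLMs.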

Abstract: The surge in scientific submissions has placed increasing strain on the traditional peer-review process, prompting the exploration of large language models (LLMs) for automated review generation. While LLMs demonstrate competence in producing structured and coherent feedback, their capacity for critical reasoning, contextual grounding, and quality sensitivity remains limited. To systematically evaluate these aspects, we propose a comprehensive evaluation framework that integrates semantic similarity analysis and structured knowledge graph metrics to assess LLM-generated reviews against human-written counterparts. We construct a large-scale benchmark of 1,683 papers and 6,495 expert reviews from ICLR and NeurIPS in multiple years, and generate reviews using five LLMs. Our findings show that LLMs perform well in descriptive and affirmational content, capturing the main contributions and methodologies of the original work, with GPT-4o highlighted as an illustrative example, generating 15.74% more entities than human reviewers in the strengths section of good papers in ICLR 2025. However, they consistently underperform in identifying weaknesses, raising substantive questions, and adjusting feedback based on paper quality. GPT-4o produces 59.42% fewer entities than real reviewers in the weaknesses and increases node count by only 5.7% from good to weak papers, compared to 50% in human reviews. Similar trends are observed across all conferences, years, and models, providing empirical foundations for understanding the merits and defects of LLM-generated reviews and informing the development of future LLM-assisted reviewing tools. Data, code, and more detailed results are publicly available at https://github.com/RichardLRC/Peer-Review.

[4] How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment

Julie Jung,Max Lu,Sina Chole Benker,Dogus Darici

Main category: cs.CL

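TL;DR: Studies how model size, temperature, and prompt style affect the agreement between LLM and human scores, and finds that model size is the key factor.

Details Motivation: To examine how model size, temperature, and prompt style affect an LLM's agreement with itself, with other models, and with human raters when assessing clinical reasoning skills.

Contribution: Finds that model size is the main factor affecting LLM-human score agreement and stresses the importance of checking agreement at multiple levels.

Method: Varies model size, temperature, and prompt style and evaluates how well LLM scores align with human scores under each condition.

Result: Model size has a notable effect on LLM-human score agreement.

Insight: In practical use, model size and other factors need to be considered together to keep assessments reliable.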

Abstract: We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) with themselves, with other models, and with humans in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignment across multiple levels.

[5] Quantifying Compositionality of Classic and State-of-the-Art Embeddings

Zhijin Guo,Chenhao Xue,Zhaozhen Xu,Hongbo Bo,Yuxuan Ye,Janet B. Pierrehumbert,Martha Lewis

Main category: cs.CL

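TL;DR: The paper proposes a method for quantitatively evaluating the compositionality of static and modern language model embeddings, and shows how compositionality varies across training stages and layers.

Details Motivation: Static embeddings such as Word2vec made excessive claims about compositionality, while modern generative models such as transformers place no real limits on context-driven shifts in meaning. The paper aims to quantify compositionality so that a model's compositional ability can be assessed objectively.

Contribution: 1. Proposes a two-step evaluation that measures compositionality through canonical correlation analysis and embedding reconstruction. 2. Analyzes compositionality across models, training stages, and layers. 3. Releases the code as open source.

Method: 1. Uses canonical correlation analysis (CCA) to measure the linear relationship between entity attributes and embeddings. 2. Evaluates additive compositionality by reconstructing embeddings for unseen attribute combinations and checking metrics such as L2 loss, cosine similarity, and retrieval accuracy.

Result: Stronger compositional signals appear in later training stages and in deeper layers of the transformer-based model, with a decline at the top layer. The trend across training stages is also observed across data modalities.

Insight: Compositionality does not grow monotonically but depends on training stage and layer. Modern models may capture compositionality better in deeper layers, while the top layer may decline because of overfitting or other causes.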

Abstract: For language models to generalize correctly to novel expressions, it is critical that they exploit compositional meanings when this is justified. Even if we don’t know what a “pelp” is, we can use our knowledge of numbers to understand that “ten pelps” makes more pelps than “two pelps”. Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. SOTA generative transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. We evaluate sentence, knowledge graph, and word embeddings, and track compositionality across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code is available at https://github.com/Zhijin-Guo1/quantifying-compositionality.
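
Step (ii) of the evaluation, reconstructing embeddings for unseen attribute combinations, can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' released code: it assumes binary attribute vectors, fits a single linear map with least squares, and uses invented function names. The CCA step (i) is not shown.

```python
import itertools

import numpy as np


def fit_additive_map(attrs, embs):
    """Least-squares W such that attrs @ W approximates embs."""
    W, *_ = np.linalg.lstsq(attrs, embs, rcond=None)
    return W


def reconstruction_metrics(attrs, embs, W):
    """Mean L2 error, mean cosine similarity, and top-1 retrieval accuracy."""
    pred = attrs @ W
    l2 = np.linalg.norm(pred - embs, axis=1)
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    embs_n = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = pred_n @ embs_n.T  # cosine similarity of every prediction to every target
    hits = np.argmax(sims, axis=1) == np.arange(len(embs))
    return {
        "l2": float(l2.mean()),
        "cosine": float(np.diag(sims).mean()),
        "retrieval_acc": float(hits.mean()),
    }


# Synthetic, nearly additive embeddings (illustrative data only).
rng = np.random.default_rng(0)
n_attrs, dim = 4, 8
attr_vectors = rng.normal(size=(n_attrs, dim))
combos = np.array(
    [c for c in itertools.product([0, 1], repeat=n_attrs) if sum(c) > 0],
    dtype=float,
)
embs = combos @ attr_vectors + 0.01 * rng.normal(size=(len(combos), dim))

# Fit on combinations of one or two attributes, test on unseen larger ones.
seen = combos.sum(axis=1) <= 2
W = fit_additive_map(combos[seen], embs[seen])
metrics = reconstruction_metrics(combos[~seen], embs[~seen], W)
```

When the embeddings are not additive in their attributes, the same metrics degrade, which is the failure case the abstract says they are meant to capture.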

[6] Pluralistic Off-policy Evaluation and Alignment

Chengkai Huang,Junda Wu,Zhouhang Xie,Yu Xia,Rui Wang,Tong Yu,Subrata Mitra,Julian McAuley,Lina Yao

Main category: cs.CL

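TL;DR: The paper proposes the Pluralistic Off-Policy Evaluation (POPE) framework for offline evaluation and preference alignment of large language models (LLMs) under diverse user preferences, addressing the fact that existing methods ignore preference pluralism and focus only on overall utility.

Details Motivation: Existing LLM preference alignment datasets are usually logged under policies that differ substantially from the model being evaluated, and existing off-policy estimators focus only on overall utility while ignoring preference pluralism. How to extend off-policy evaluation (OPE) to pluralistic preference alignment therefore remains an open question.

Contribution: The main contribution is the POPE framework, the first to support offline pluralistic preference evaluation and alignment. It evaluates and optimizes for pluralistic preferences through a unified reward function (combining a collaborative utility component and a diversity component) and decomposed inverse propensity scoring (IPS) estimators.

Method: The core of POPE consists of (1) a unified reward function combining collaborative utility with an entropy-based diversity measure; (2) decomposed IPS estimators that evaluate relevance and diversity separately; (3) a theoretical proof of a lower bound on the variance of the decomposed IPS estimators; and (4) direct optimization of pluralistic alignment using the off-policy evaluated value function.

Result: Experiments show that POPE efficiently improves pluralistic response generation while preserving the model's general capabilities on downstream tasks.

Insight: Preference alignment needs to account for diversity as well as overall utility in order to reflect the plurality of human preferences. POPE provides a theoretical and practical basis for pluralistic preference alignment.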

Abstract: Personalized preference alignment for LLMs with diverse human preferences requires evaluation and alignment methods that capture pluralism. Most existing preference alignment datasets are logged under policies that differ substantially from the evaluated LLMs, and existing off-policy estimators focus solely on overall utility while ignoring preference pluralism. Extending Off-Policy Evaluation (OPE) to pluralistic preference alignment, therefore, remains an open question. Thus, we propose Pluralistic Off-Policy Evaluation (POPE), the first framework for offline pluralistic preference evaluation and alignment in LLMs. POPE includes a unified reward function that combines (1) a collaborative utility component derived from human preference signals (e.g., upvotes or relevance scores) and (2) a diversity component inspired by entropy-based coverage measures, together reflecting pluralistic alignment. Furthermore, to estimate this reward from logged interactions, we derive decomposable inverse propensity scoring (IPS) estimators that separately evaluate relevance and diversity. Theoretically, we prove that our decomposed IPS estimators establish a lower bound on their variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance pluralistic alignment. Empirical results demonstrate that POPE efficiently enhances pluralistic response generation and maintains the models’ general capabilities on downstream tasks.
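
As a rough illustration of evaluating relevance and diversity separately from logged data, the sketch below applies the standard IPS estimator, the average of π(a|x)/μ(a|x) · r over logged samples, to each reward component and then combines them. It is not POPE's derived estimator: the weighting parameter `lam`, the Shannon-entropy diversity signal, and all function names are assumptions made here for illustration.

```python
import math


def ips_estimate(rewards, target_probs, logging_probs):
    """Standard IPS: mean of (pi(a|x) / mu(a|x)) * r over logged samples."""
    if not (len(rewards) == len(target_probs) == len(logging_probs)):
        raise ValueError("all inputs must have the same length")
    if not rewards:
        raise ValueError("need at least one logged sample")
    weighted = (r * p / q for r, p, q in zip(rewards, target_probs, logging_probs))
    return sum(weighted) / len(rewards)


def shannon_entropy(counts):
    """Entropy (in nats) of a count vector; one possible diversity signal."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


def decomposed_value(relevance, diversity, target_probs, logging_probs, lam=1.0):
    """Estimate the relevance and diversity values separately, then combine."""
    rel = ips_estimate(relevance, target_probs, logging_probs)
    div = ips_estimate(diversity, target_probs, logging_probs)
    return rel, div, rel + lam * div


# Illustrative logged data (made up for the example).
relevance = [1.0, 0.0, 1.0]
diversity = [0.5, 1.0, 0.0]
target_probs = [0.5, 0.2, 0.3]    # pi(a|x) under the evaluated policy
logging_probs = [0.25, 0.4, 0.3]  # mu(a|x) under the logging policy
rel, div, total = decomposed_value(
    relevance, diversity, target_probs, logging_probs, lam=0.5
)
```

Because IPS is linear in the reward, estimating the two components separately gives the same total as estimating the combined reward, while also exposing each component on its own.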

[7] SCORE: A Semantic Evaluation Framework for Generative Document Parsing

Renyu Li,Antonio Jimeno Yepes,Yao You,Kamil Pluciński,Maximilian Operlejn,Crag Wolfe

Main category: cs.CL

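TL;DR: SCORE is a semantic evaluation framework for generative document parsing systems. It addresses the evaluation bias that arises when traditional metrics ignore semantically valid variation in output.

Details Motivation: Traditional metrics (such as CER, WER, IoU, and TEDS) cannot recognize outputs of generative parsing systems that are semantically correct but structurally divergent, which leads to misjudgments and distorted evaluations. SCORE aims to provide a semantics-oriented framework that tolerates this diversity while still assessing semantic accuracy rigorously.

Contribution: 1. Proposes the SCORE framework, which combines adjusted content fidelity, diagnostics for hallucinations and omissions, semantic alignment for table evaluation, and hierarchy consistency checks. 2. Shows, by normalizing generative outputs into a format-agnostic representation, that generative parsing alone suffices for comprehensive evaluation. 3. Validates SCORE on a holistic benchmark and a field dataset.

Method: 1. Uses an adjusted edit distance to evaluate content fidelity. 2. Uses token-level diagnostics to distinguish hallucinations from omissions. 3. Introduces spatial tolerance and semantic alignment into table evaluation. 4. Applies hierarchy consistency checks to verify structural logic.

Result: Experiments on 1,114 pages show that SCORE correctly handles cases that traditional metrics misjudge (such as the 12-25% average penalty on pages with ambiguous table structures) and, by normalizing generative outputs, reproduces traditional scores (e.g., table F1 up to 0.93).

Insight: SCORE shows how interpretive diversity affects evaluation outcomes and offers a fair, practical, multi-dimensional evaluation standard for modern document parsing systems.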

Abstract: Multi-modal generative document parsing systems challenge traditional evaluation: unlike deterministic OCR or layout models, they often produce semantically correct yet structurally divergent outputs. Conventional metrics (CER, WER, IoU, or TEDS) misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE (Structural and COntent Robust Evaluation), an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks. Together, these dimensions enable evaluation that embraces representational diversity while enforcing semantic rigor. Across 1,114 pages spanning a holistic benchmark and a field dataset, SCORE consistently revealed cross-dataset performance patterns missed by standard metrics. In 2-5% of pages with ambiguous table structures, traditional metrics penalized systems by 12-25% on average, leading to distorted rankings. SCORE corrected these cases, recovering equivalence between alternative but valid interpretations. Moreover, by normalizing generative outputs into a format-agnostic representation, SCORE reproduces traditional scores (e.g., table F1 up to 0.93) without requiring object-detection pipelines, demonstrating that generative parsing alone suffices for comprehensive evaluation. By exposing how interpretive diversity impacts evaluation outcomes and providing multi-dimensional, interpretable diagnostics, SCORE establishes foundational principles for semantically grounded, fair, and practical benchmarking of modern document parsing systems.
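
The distinction between hallucinations and omissions in component (ii) can be pictured with a simple token-multiset comparison, shown below next to a plain Levenshtein distance. Both are simplified stand-ins: the abstract does not define SCORE's adjusted edit distance or its exact diagnostics, so the whitespace tokenization and function names here are assumptions.

```python
from collections import Counter


def token_diagnostics(reference: str, prediction: str) -> dict:
    """Split token-level errors into hallucinations and omissions."""
    ref = Counter(reference.split())
    pred = Counter(prediction.split())
    return {
        # Tokens the parser produced that the reference does not contain.
        "hallucinated": sorted((pred - ref).elements()),
        # Tokens in the reference that the parser failed to produce.
        "omitted": sorted((ref - pred).elements()),
    }


def levenshtein(a: str, b: str) -> int:
    """Plain (unadjusted) character-level edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,               # deletion
                    cur[j - 1] + 1,            # insertion
                    prev[j - 1] + (ca != cb),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


report = token_diagnostics("total revenue 2024", "total revenue 2024 USD")
```

Reporting the two error types separately shows whether a low score comes from invented content or from missing content, which a single edit-distance number cannot reveal.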

[8] Benchmarking ChatGPT and DeepSeek in April 2025: A Novel Dual Perspective Sentiment Analysis Using Lexicon-Based and Deep Learning Approaches

Maryam Mahdi Alhusseini,Mohammad-Reza Feizi-Derakhshi

Main category: cs.CL

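TL;DR: The study proposes a dual-perspective sentiment analysis approach that combines lexicon-based sentiment analysis with deep learning models (CNN and Bi-LSTM) to analyze user reviews of ChatGPT and DeepSeek. ChatGPT receives more positive sentiment, and CNN outperforms Bi-LSTM.

Details Motivation: To assess user satisfaction with large language model (LLM) applications more fully, the authors combine the strengths of lexicon-based methods and deep learning models, addressing the single-perspective limitation of earlier studies.

Contribution: Proposes a dual-perspective sentiment analysis approach that integrates lexicon-based and deep learning techniques, and provides a detailed comparison of user reviews for ChatGPT and DeepSeek.

Method: The dataset contains 4,000 user reviews. After preprocessing and oversampling to balance the classes, sentiment is classified with TextBlob (the lexicon-based method), CNN, and Bi-LSTM.

Result: ChatGPT receives more positive sentiment. Deep learning outperforms lexicon-based analysis; CNN reaches 96.41% accuracy and classifies negative reviews nearly perfectly.

Insight: The study offers a new approach to sentiment analysis for LLM applications and practical guidance for developers seeking to improve user experience.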

Abstract: This study presents a novel dual-perspective approach to analyzing user reviews for ChatGPT and DeepSeek on the Google Play Store, integrating lexicon-based sentiment analysis (TextBlob) with deep learning classification models, including Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks. Unlike prior research, which focuses on either lexicon-based strategies or predictive deep learning models in isolation, this study conducts an extensive investigation into user satisfaction with Large Language Model (LLM) based applications. A dataset of 4,000 authentic user reviews was collected, carefully preprocessed, and subjected to oversampling to achieve balanced classes. A balanced test set of 1,700 reviews was used for model testing. Results from the experiments reveal that ChatGPT received significantly more positive sentiment than DeepSeek. Furthermore, deep learning based classification demonstrated superior performance over lexicon analysis, with CNN outperforming Bi-LSTM by achieving 96.41% accuracy and near-perfect classification of negative reviews, alongside high F1-scores for neutral and positive sentiments. This research sets a new methodological standard for measuring sentiment in LLM-based applications and provides practical insights for developers and researchers seeking to improve user-centric AI system design.

[9] ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

Robert Tjarko Lange,Yuki Imajuku,Edoardo Cetin

Main category: cs.CL

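TL;DR: ShinkaEvolve is an open-source framework built on large language models (LLMs) that uses new sampling and search strategies to markedly improve the sample efficiency and solution quality of code evolution, and applies to a wide range of scientific discovery tasks.

Details Motivation: Current LLM-based code evolution methods are sample inefficient and closed-source, which limits broad adoption and extension. ShinkaEvolve aims to remove these limitations and advance open-ended scientific discovery.

Contribution: 1. A parent sampling technique that balances exploration and exploitation; 2. code novelty rejection-sampling; 3. a bandit-based LLM ensemble selection strategy.

Method: Combines parent sampling, code novelty rejection-sampling, and LLM ensemble selection to optimize the code evolution process.

Result: Improves sample efficiency and solution quality across multiple tasks, for example discovering a new state-of-the-art circle packing solution with only 150 samples.

Insight: With an open-source framework and efficient search strategies, ShinkaEvolve achieves high sample efficiency and broad applicability in scientific discovery, showing the potential of LLMs for open-ended problems.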

Abstract: We introduce ShinkaEvolve: a new open-source framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and unprecedented efficiency. Recent advances in scaling inference time compute of LLMs have enabled significant progress in generalized scientific discovery. These approaches rely on evolutionary agentic harnesses that leverage LLMs as mutation operators to generate candidate solutions. However, current code evolution methods suffer from critical limitations: they are sample inefficient, requiring thousands of samples to identify effective solutions, and remain closed-source, hindering broad adoption and extension. ShinkaEvolve addresses these limitations, introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. We evaluate ShinkaEvolve across diverse tasks, demonstrating consistent improvements in sample efficiency and solution quality. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions that illuminate the space of optimization strategies. Our results demonstrate that ShinkaEvolve achieves broad applicability with exceptional sample efficiency. By providing open-source accessibility and cost-efficiency, this work democratizes open-ended discovery across diverse computational problems.
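
One way to picture a bandit-based choice among several LLMs is the classic UCB1 rule, sketched below. The abstract does not say which bandit algorithm ShinkaEvolve uses, so UCB1 is a substitute chosen here for illustration; the model names and the fixed mock rewards are hypothetical.

```python
import math


class UCB1Selector:
    """Pick among arms (here: LLMs) with the UCB1 rule."""

    def __init__(self, arms, c=math.sqrt(2)):
        self.arms = list(arms)
        self.c = c  # exploration strength; sqrt(2) gives the textbook UCB1 bonus
        self.counts = {a: 0 for a in self.arms}
        self.totals = {a: 0.0 for a in self.arms}

    def select(self):
        # Try every arm once before using the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts.values())

        def upper_bound(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = self.c * math.sqrt(math.log(t) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=upper_bound)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical models with fixed mock rewards, e.g. the fitness gain of a mutation.
selector = UCB1Selector(["model_a", "model_b"])
mock_reward = {"model_a": 1.0, "model_b": 0.0}
for _ in range(50):
    arm = selector.select()
    selector.update(arm, mock_reward[arm])
```

In an evolutionary loop the reward would come from evaluating the candidate program that the chosen model proposed, so models that yield better mutations are queried more often while weaker ones are still sampled occasionally.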

[10] TriSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities

Jiajun Chen,Yangyang Wu,Xiaoye Miao,Mengying Zhu,Meng Xi

Main category: cs.CL

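TL;DR: TriSPrompt is a hierarchical soft prompt model that uses modality-aware (MA), modality-missing (MM), and mutual-views (MV) prompts to handle missing modalities in multimodal rumor detection, with an accuracy gain of over 13%.

Details Motivation: Missing modalities are common in multimodal data, yet existing methods rely on complete multimodal training data and cannot handle missing modalities in real-world settings. A detection method that adapts to incomplete modalities is therefore needed.

Contribution: 1. Proposes the TriSPrompt model, which integrates MA, MM, and MV prompts to detect rumors in incomplete multimodal data. 2. The MA prompt captures heterogeneous and homogeneous modality information, the MM prompt models missing states, and the MV prompt learns the relationship between subjective and objective views.

Method: 1. MA prompt: extracts modality-specific and shared features from the available data. 2. MM prompt: models the missing states of modalities to improve the model's adaptability to missing information. 3. MV prompt: learns the relationships among text, images, and comments to support rumor detection.

Result: On three real-world benchmark datasets, TriSPrompt achieves an accuracy gain of over 13% compared with existing methods.

Insight: Through its hierarchical prompt design, TriSPrompt handles missing modalities and also improves detection performance by modeling relationships across views.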

Abstract: The widespread presence of incomplete modalities in multimodal data poses a significant challenge to achieving accurate rumor detection. Existing multimodal rumor detection methods primarily focus on learning joint modality representations from complete multimodal training data, rendering them ineffective in addressing the common occurrence of missing modalities in real-world scenarios. In this paper, we propose a hierarchical soft prompt model TriSPrompt, which integrates three types of prompts, i.e., modality-aware (MA) prompt, modality-missing (MM) prompt, and mutual-views (MV) prompt, to effectively detect rumors in incomplete multimodal data. The MA prompt captures both heterogeneous information from specific modalities and homogeneous features from available data, aiding in modality recovery. The MM prompt models missing states in incomplete data, enhancing the model’s adaptability to missing information. The MV prompt learns relationships between subjective (i.e., text and image) and objective (i.e., comments) perspectives, effectively detecting rumors. Extensive experiments on three real-world benchmarks demonstrate that TriSPrompt achieves an accuracy gain of over 13% compared to state-of-the-art methods. The codes and datasets are available at https://anonymous.4open.science/r/code-3E88.

[11] RoadMind: Towards a Geospatial AI Expert for Disaster Response

Ahmed El Fekih Zguir,Ferda Ofli,Muhammad Imran

Main category: cs.CL

TL;DR: The paper introduces RoadMind, a self-supervised framework that enhances LLMs’ geospatial reasoning for disaster response by leveraging OpenStreetMap data.

Details Motivation: Current LLMs lack robust geospatial reasoning, which is critical for disaster response tasks like evacuation planning. RoadMind addresses this gap.

Contribution: RoadMind trains LLMs using structured OSM data, improving their ability to handle spatial tasks like road identification and distance estimation.

Method: Automated pipeline extracts OSM data, formats it for spatial tasks, and trains LLMs using QLoRA adapters and 4-bit quantization.

Result: RoadMind outperforms baseline LLMs in disaster-prone cities (LA, Christchurch, Manila) on tasks like road segment identification and nearest road retrieval.

Insight: Structured geospatial data can significantly enhance LLMs for offline disaster response, which illustrates the value of domain-specific training.

Abstract: Large Language Models (LLMs) have shown impressive performance across a range of natural language tasks, but remain limited in their ability to reason about geospatial data, particularly road networks, distances, and directions. This gap poses challenges in disaster scenarios, where spatial understanding is critical for tasks such as evacuation planning and resource allocation. In this work, we present RoadMind, a self-supervised framework that enhances the geospatial reasoning capabilities of LLMs using structured data from OpenStreetMap (OSM). Our automated pipeline extracts road infrastructure data for a given city and converts it into multiple supervision formats tailored to key spatial tasks. We pretrain and fine-tune LLMs on these representations using QLoRA adapters and 4-bit quantized models. We evaluate our approach on three disaster-prone cities with varying global representation, Los Angeles, Christchurch, and Manila, across tasks such as road segment identification, nearest road retrieval, and distance/direction estimation. Our results show that models trained via RoadMind significantly outperform strong baselines, including state-of-the-art LLMs equipped with advanced prompt engineering. This demonstrates the potential of structured geospatial data to enhance language models with robust spatial reasoning, enabling more effective offline AI systems for disaster response.
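
Ground truth for distance and direction estimation tasks like those described above can be computed from coordinates with the haversine formula and the initial great-circle bearing. The sketch below is a generic illustration, not RoadMind's pipeline: the function names, the mean Earth radius of 6371.0088 km, and the 8-point compass binning are choices made here.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


def compass_direction(bearing_deg):
    """Map a bearing to one of eight compass labels."""
    labels = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return labels[round(bearing_deg / 45.0) % 8]


# One degree of longitude along the equator.
distance = haversine_km(0.0, 0.0, 0.0, 1.0)
bearing = initial_bearing_deg(0.0, 0.0, 0.0, 1.0)
direction = compass_direction(bearing)
```

Question-answer pairs such as "how far and in which direction is road B from road A" can then be generated from road coordinates without manual labeling.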

[12] Benchmarking and Improving LLM Robustness for Personalized Generation

Chimaobi Okite,Naihao Deng,Kiran Bodipati,Huaidian Hou,Joyce Chai,Rada Mihalcea

Main category: cs.CL

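TL;DR: The paper proposes PERG, a framework for evaluating the robustness of large language models (LLMs) in personalized generation, together with a new dataset, PERGData. It finds that current models fall clearly short on factual accuracy and user preference alignment, and proposes Pref-Aligner, which improves robustness by 25% on average.

Details Motivation: Existing evaluations focus mainly on whether an LLM response matches user preferences and overlook factual accuracy, an equally important dimension, which makes models less reliable in personalized generation. The paper aims to close this gap.

Contribution: 1) Proposes the PERG framework and the PERGData dataset for evaluating LLM robustness in personalized generation; 2) exposes the shortcomings of current models in factuality and user preference alignment; 3) proposes Pref-Aligner, which markedly improves robustness.

Method: 1) Defines a response as robust if it is both factually accurate and aligned with user preferences; 2) evaluates 14 models with the PERG framework; 3) proposes Pref-Aligner, a two-stage approach for improving model robustness.

Result: 1) Even the strongest models (e.g., GPT-4.1) fail to maintain correctness in 5% of the cases they answered correctly without personalization; 2) smaller models (7B scale) fail more than 20% of the time; 3) Pref-Aligner improves robustness by 25% on average.

Insight: 1) Robustness is significantly affected by the nature of the query and the type of user preference; 2) current evaluation practice needs to cover both factuality and user preference; 3) Pref-Aligner offers an effective route to more reliable LLMs.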

Abstract: Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user’s preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and align with the user preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness in LLMs, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fail to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B-scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.
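
The two quantities in this abstract, robustness as "factually accurate and preference-aligned" and the share of previously successful cases that fail once personalization is added, can be computed from per-query boolean judgments as sketched below. The function names and input layout are assumptions for illustration, not PERG's actual metrics or API.

```python
def robust_rate(factual, aligned):
    """Fraction of responses that are both factually correct and preference-aligned."""
    if len(factual) != len(aligned):
        raise ValueError("inputs must have the same length")
    if not factual:
        return 0.0
    return sum(f and a for f, a in zip(factual, aligned)) / len(factual)


def breakage_rate(baseline_correct, personalized_correct):
    """Among queries answered correctly without personalization,
    the fraction that become incorrect once a preference is added."""
    if len(baseline_correct) != len(personalized_correct):
        raise ValueError("inputs must have the same length")
    kept = [p for b, p in zip(baseline_correct, personalized_correct) if b]
    if not kept:
        return 0.0
    return sum(not p for p in kept) / len(kept)


# Illustrative judgments for five queries (made-up values).
baseline = [True, True, True, True, False]
personalized = [True, False, True, True, True]
rate = breakage_rate(baseline, personalized)
```

Conditioning on baseline-correct queries separates failures caused by personalization from questions the model could not answer in the first place.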

[13] Semantic Representation Attack against Aligned Large Language Models

Jiawei Lian,Jianhong Pan,Lefan Wang,Yi Wang,Shaohui Mei,Lap-Pui Chau

Main category: cs.CL

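TL;DR: The paper proposes a new Semantic Representation Attack against aligned large language models (LLMs). By exploiting the semantic representation space to elicit diverse but semantically equivalent harmful responses, it resolves the trade-off between attack efficacy and prompt naturalness found in traditional attacks, and it introduces an efficient heuristic search algorithm.

Details Motivation: Current attacks on aligned LLMs usually target specific text patterns (such as "Sure, here is...") and suffer from poor convergence, unnatural prompts, and high computational cost. The authors redefine the attack objective in the semantic representation space to improve both efficacy and naturalness.

Contribution: 1. Introduces the Semantic Representation Attack paradigm, which uses diverse, semantically equivalent responses to improve attack efficacy; 2. designs the Semantic Representation Heuristic Search algorithm, which efficiently generates semantically coherent and concise adversarial prompts; 3. provides theoretical guarantees and demonstrates a high success rate (89.41% on average) and stealthiness across 18 LLMs.

Method: Defines the attack objective over the semantic representation space rather than specific text patterns, and uses a heuristic search algorithm that expands the adversarial prompt incrementally while keeping it semantically coherent and concise.

Result: The attack success rate across 18 LLMs averages 89.41%, reaching 100% on 11 of them, while remaining efficient and stealthy.

Insight: The limitation of traditional attacks lies in their dependence on specific text patterns. Targeting diverse, semantically equivalent responses improves both attack efficacy and naturalness.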

Abstract: Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as "Sure, here is…", suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.

[14] Meow: End-to-End Outline Writing for Automatic Academic Survey

Zhaoyu Ma,Yuan Shan,Jiahao Zhao,Nan Xu,Lei Wang

Main category: cs.CL

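TL;DR: The paper proposes Meow, a metadata-driven framework for automatically generating systematic, high-quality outlines for academic surveys, using an end-to-end task formulation and a two-stage training approach.

Details Motivation: As the number of academic papers grows exponentially, automated survey generation with LLMs has become a trend, but the outlines produced by existing methods lack in-depth understanding of the topic and fine-grained style, and need improvement.

Contribution: 1) The first metadata-driven, end-to-end outline writing framework; 2) a high-quality dataset and systematic evaluation metrics; 3) a two-stage training approach (supervised fine-tuning followed by reinforcement learning).

Method: 1) Formulates outline writing as an end-to-end task; 2) curates a dataset of surveys from arXiv, bioRxiv, and medRxiv; 3) trains an 8B reasoning model by combining supervised fine-tuning with reinforcement learning.

Result: The 8B reasoning model shows high structural fidelity and stylistic coherence in outline generation.

Insight: A metadata-driven framework can improve how systematic and stylistically consistent generated outlines are, and the two-stage training approach improves model performance.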

Abstract: As academic paper publication numbers grow exponentially, conducting in-depth surveys with LLMs automatically has become an inevitable trend. Outline writing, which aims to systematically organize related works, is critical for automated survey generation. Yet existing automatic survey methods treat outline writing as mere workflow steps in the overall pipeline. Such template-based workflows produce outlines that lack in-depth understanding of the survey topic and fine-grained styles. To address these limitations, we propose Meow, the first metadata-driven outline writing framework that produces organized and faithful outlines efficiently. Specifically, we first formulate outline writing as an end-to-end task that generates hierarchical structured outlines from paper metadata. We then curate a high-quality dataset of surveys from arXiv, bioRxiv, and medRxiv, and establish systematic evaluation metrics for outline quality assessment. Finally, we employ a two-stage training approach combining supervised fine-tuning and reinforcement learning. Our 8B reasoning model demonstrates strong performance with high structural fidelity and stylistic coherence.

[15] How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models

Kangtao Lv,Haibin Chen,Yujin Yuan,Langming Liu,Shilei Liu,Yongwei Wang,Wenbo Su,Bo Zheng

Main category: cs.CL

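TL;DR: The paper studies how to inject domain knowledge efficiently when pre-training large language models (LLMs) and proposes a knowledge infusion scaling law that balances domain specialization against catastrophic forgetting.

Details Motivation: Although LLMs perform well across a wide range of tasks, they can underperform on domain-specific tasks and even hallucinate. Prior work shows that injecting domain knowledge improves performance, but excessive infusion causes catastrophic forgetting, which must be addressed.

Contribution: 1) Identifies a critical collapse point in knowledge infusion and its correlation with model size; 2) proposes a knowledge infusion scaling law for predicting the optimal amount of knowledge to inject.

Method: Observes critical collapse points through systematic experiments and uses smaller models to predict the collapse points of larger ones, from which the scaling law is derived.

Result: Experiments validate the effectiveness and generalizability of the scaling law across model sizes and pre-training budgets.

Insight: Each model has a critical point during knowledge infusion beyond which its knowledge retention degrades sharply, and this point is correlated with model size.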

Abstract: Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucinations. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations: 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model’s size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pretraining token budgets validate both the effectiveness and generalizability of our scaling law.

[16] Do LLMs Encode Frame Semantics? Evidence from Frame Identification

Jayanth Krishna Chundru,Rudrashis Poddar,Jie Cao,Tianyu Jiang

Main category: cs.CL

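TL;DR: The paper investigates whether large language models implicitly encode knowledge of frame semantics, focusing on the frame identification task. Models can perform frame identification effectively even without explicit supervision, and fine-tuning improves performance substantially.

Details Motivation: To study whether large language models implicitly learn frame semantic knowledge, and in particular whether they can perform frame identification without supervision.

Contribution: 1. Shows that large language models can perform frame identification effectively; 2. improves in-domain and out-of-domain performance further through fine-tuning; 3. shows that the models can generate semantically coherent frame definitions, indicating an internalized understanding of frame semantics.

Method: 1. Evaluates models on frame identification using prompt-based inference; 2. fine-tunes on FrameNet data; 3. analyzes the semantic coherence of model-generated frame definitions.

Result: The models perform well without supervision, improve substantially after fine-tuning, and generalize well to out-of-domain data. They can also generate semantically plausible frame definitions.

Insight: Large language models implicitly encode frame semantic knowledge, which opens a direction for applying them to semantic parsing tasks.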

Abstract: We investigate whether large language models encode latent knowledge of frame semantics, focusing on frame identification, a core challenge in frame semantic parsing that involves selecting the appropriate semantic frame for a target word in context. Using the FrameNet lexical resource, we evaluate models under prompt-based inference and observe that they can perform frame identification effectively even without explicit supervision. To assess the impact of task-specific training, we fine-tune the model on FrameNet data, which substantially improves in-domain accuracy while generalizing well to out-of-domain benchmarks. Further analysis shows that the models can generate semantically coherent frame definitions, highlighting the model’s internalized understanding of frame semantics.

[17] LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines

Yanfang Ye,Zheyuan Zhang,Tianyi Ma,Zehong Wang,Yiyang Li,Shifu Hou,Weixiang Sun,Kaiwen Shi,Yijun Ma,Wei Song,Ahmed Abbasi,Ying Cheng,Jane Cleland-Huang,Steven Corcelli,Patricia Culligan,Robert Goulding,Ming Hu,Ting Hua,John Lalor,Fang Liu,Tengfei Luo,Ed Maginn,Nuno Moniz,Jason Rohr,Brett Savoie,Daniel Slate,Tom Stapleford,Matthew Webber,Olaf Wiest,Johnny Zhang,Nitesh Chawla

Main category: cs.CL

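TL;DR: The survey "LLMs4All" reviews the applications, limitations, and future directions of large language models (LLMs) across academic disciplines, covering the humanities, economics, science, and engineering.

Details Motivation: LLMs such as ChatGPT perform impressively on language tasks, which motivates exploring their use across disciplines, including arts, letters, and law, economics and business, and science and engineering, to advance both research and practice.

Contribution: Systematically reviews the current state of LLM applications across disciplines and summarizes key observations and insights, giving researchers and practitioners a broad view of the potential and challenges of LLMs in each field.

Method: A literature review and analysis that surveys concrete applications of LLMs and how they are integrated into arts, law, economics, science, and engineering.

Result: Shows the broad potential of LLMs across disciplines, and identifies model limitations (such as bias and interpretability), open challenges, and future directions.

Insight: LLMs are more than language tools and may reshape research across disciplines, but applying them successfully requires domain knowledge, ethical consideration, and technical refinement.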

Abstract: Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.

[18] GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models

Dylan Hutson,Daniel Vennemeyer,Aneesh Deshmukh,Justin Zhan,Tianyu Jiang

Main category: cs.CL

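TL;DR: The paper introduces GuessingGame, a protocol for evaluating how well large language models (LLMs) gain information through open-ended questions, proposes two information gain (IG) metrics, and shows that higher IG strongly predicts shorter games.

Details Motivation: To study the question-asking strategies and information-gain ability of LLMs in open-ended settings, with the aim of improving interactive reasoning.

Contribution: Proposes the GuessingGame protocol and two IG metrics, shows that IG correlates strongly with game efficiency, and shows that IG-guided prompting constraints substantially improve weaker models.

Method: Two model-agnostic IG metrics: a Bayesian method that tracks belief updates using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet.

Result: A one-standard-deviation increase in IG reduces expected game length by 43%; prompting constraints significantly improve model performance.

Insight: Question-asking in LLMs is measurable and improvable, and information gain is a key metric for interactive reasoning.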

Abstract: We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
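The entropy-based metric in the abstract can be illustrated with a minimal sketch. It assumes a uniform prior over a candidate list and treats an answered question as a hard filter; the function names and the `is_consistent` predicate are hypothetical, and the paper's ConceptNet-based filtering is not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_information_gain(candidates, is_consistent):
    """Information gain of one answered question.

    Assumption (not from the paper): the prior over `candidates` is
    uniform, and the answer removes every candidate for which the
    hypothetical predicate `is_consistent` returns False.
    Returns (gain_in_bits, remaining_candidates).
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    remaining = [c for c in candidates if is_consistent(c)]
    if not remaining:
        raise ValueError("answer is inconsistent with every candidate")
    before = entropy([1 / len(candidates)] * len(candidates))
    after = entropy([1 / len(remaining)] * len(remaining))
    return before - after, remaining
```

Under these assumptions, a question that narrows eight equally likely candidates down to two yields 2 bits of information gain, while a question that eliminates nothing yields 0.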

[19] Anatomy of a Feeling: Narrating Embodied Emotions via Large Vision-Language Models

Mohammad Saim,Phan Anh Duong,Cat Luong,Aniket Bhanderi,Tianyu Jiang

Main category: cs.CL

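TL;DR: The paper uses large vision-language models (LVLMs) to build the Embodied LVLM Emotion Narratives (ELENA) framework, which produces multi-layered text describing the salient body parts involved in emotional reactions. Models show a persistent bias toward the facial region, yet ELENA still outperforms baselines at recognizing emotions in face-masked images.

Details Motivation: Bodily expressions of emotion carry rich affective information, but existing models tend to focus on the face and neglect other body parts. The goal is a framework built on LVLMs that describes the body parts involved in emotional reactions more completely.

Contribution: Proposes ELENA, which uses LVLMs to generate multi-layered text descriptions focused on the salient body parts in emotional reactions; reveals the facial bias of contemporary models; and shows ELENA's advantage on face-masked images.

Method: Uses LVLMs to build a multi-layered text output framework that describes the body parts involved in emotional reactions, and analyzes where the models attend using attention maps.

Result: ELENA outperforms baselines at recognizing emotions, notably on face-masked images, without any fine-tuning.

Insight: Emotion analysis should not be limited to the face, since other body parts also express emotion; LVLMs already generalize to this setting to some degree without fine-tuning.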

Abstract: The embodiment of emotional reactions from body parts contains rich information about our affective experiences. We propose a framework that utilizes state-of-the-art large vision-language models (LVLMs) to generate Embodied LVLM Emotion Narratives (ELENA). These are well-defined, multi-layered text outputs, primarily comprising descriptions that focus on the salient body parts involved in emotional reactions. We also employ attention maps and observe that contemporary models exhibit a persistent bias towards the facial region. Despite this limitation, we observe that our employed framework can effectively recognize embodied emotions in face-masked images, outperforming baselines without any fine-tuning. ELENA opens a new trajectory for embodied emotion analysis across the modality of vision and enriches modeling in an affect-aware setting.

[20] Large Language Models for Pedestrian Safety: An Application to Predicting Driver Yielding Behavior at Unsignalized Intersections

Yicheng Yang,Zixian Li,Jean Paul Bizimana,Niaz Zafri,Yongfeng Dong,Tianyi Li

Main category: cs.CL

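TL;DR: The paper proposes a prompt design for multimodal large language models (LLMs) that predicts driver yielding behavior at unsignalized intersections, demonstrating potential for pedestrian safety applications.

Details Motivation: Pedestrian safety is a critical component of urban mobility, but traditional machine learning models struggle to capture the complexity and context dependence of driver-pedestrian interactions. LLMs are a candidate solution because of their ability to extract patterns from heterogeneous data.

Contribution: A multimodal LLM prompt design that combines domain-specific knowledge, structured reasoning, and few-shot prompting, applied to predicting driver yielding behavior.

Method: Prompts multimodal LLMs (such as GPT-4o and Deepseek-V3) with traffic data to perform context-aware inference of driver behavior, and benchmarks them against traditional classifiers.

Result: GPT-4o achieves the highest accuracy and recall, while Deepseek-V3 leads in precision; the results also expose a trade-off between model performance and computational efficiency.

Insight: LLMs have practical potential for pedestrian safety, but deployment requires weighing performance against computational cost.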

Abstract: Pedestrian safety is a critical component of urban mobility and is strongly influenced by the interactions between pedestrian decision-making and driver yielding behavior at crosswalks. Modeling driver–pedestrian interactions at intersections requires accurately capturing the complexity of these behaviors. Traditional machine learning models often struggle to capture the nuanced and context-dependent reasoning required for these multifactorial interactions, due to their reliance on fixed feature representations and limited interpretability. In contrast, large language models (LLMs) are suited for extracting patterns from heterogeneous traffic data, enabling accurate modeling of driver-pedestrian interactions. Therefore, this paper leverages multimodal LLMs through a novel prompt design that incorporates domain-specific knowledge, structured reasoning, and few-shot prompting, enabling interpretable and context-aware inference of driver yielding behavior, as an example application of modeling pedestrian–driver interaction. We benchmarked state-of-the-art LLMs against traditional classifiers, finding that GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision. These findings highlight the critical trade-offs between model performance and computational efficiency, offering practical guidance for deploying LLMs in real-world pedestrian safety systems.

[21] CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

Sina J. Semnani,Han Zhang,Xinyan He,Merve Tekgürler,Monica S. Lam

Main category: cs.CL

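TL;DR: CHURRO is a 3B-parameter open-weight vision-language model specialized for historical text recognition; it outperforms existing models while being highly cost-effective.

Details Motivation: Existing vision-language models target modern, standardized text and cannot handle the diverse languages, irregular layouts, and degradation found in historical documents. CHURRO aims to fill this gap.

Contribution: 1) CHURRO, a model designed for historical text recognition; 2) CHURRO-DS, a dataset spanning 22 centuries of historical text; 3) performance exceeding existing open-weight and closed models.

Method: Trains a 3B-parameter open-weight vision-language model on CHURRO-DS and evaluates its performance.

Result: On the test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model while being 15.5 times more cost-effective.

Insight: By releasing the model and dataset, CHURRO provides tools for historical text recognition research and may advance cultural heritage preservation and scholarship.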

Abstract: Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
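The reported metric, normalized Levenshtein similarity, can be sketched as follows. The abstract does not state the exact normalization, so this sketch assumes the common definition of one minus the edit distance divided by the length of the longer string; the function names are illustrative and not taken from the CHURRO code base.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(prediction: str, reference: str) -> float:
    """Similarity in [0, 1]; assumes normalization by the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(prediction, reference) / longest
```

With this definition, an exact transcription scores 1.0 and a transcription sharing no characters with an equally long reference scores 0.0.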

[22] EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation

Sen Yang,Yu Bao,Yu Lu,Jiajun Chen,Shujian Huang,Shanbo Cheng

Main category: cs.CL

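TL;DR: The paper leverages the strength of LLMs on English-centric language pairs, using synthetic data and preference optimization to improve direct non-English (x2x) translation, with significant gains across 72 x2x directions.

Details Motivation: LLMs translate well for English-centric language pairs but underperform in direct non-English (x2x) translation. The authors aim to use the models' English translation strength to improve x2x translation.

Contribution: A synthetic data generation framework that uses the English-to-x (en2x) capability of LLMs to build x2x datasets, together with an English-referenced quality evaluation proxy and preference optimization, yielding significant gains across 72 x2x directions.

Method: Extends English parallel corpora into omnidirectional datasets, uses an English-referenced quality evaluation proxy to collect high-quality x2x training data, then applies preference-based optimization.

Result: Significant improvement across 72 x2x directions, while also enhancing en2x performance.

Insight: Strategically exploiting English-centric strengths can bootstrap non-English translation capability in LLMs.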

Abstract: Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models’ established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvement across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release codes, datasets, and model checkpoints at https://github.com/NJUNLP/EAX

[23] bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Wence Ji,Jiancan Wu,Aiying Li,Shuyi Zhang,Junkang Wu,An Zhang,Xiang Wang,Xiangnan He

Main category: cs.CL

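TL;DR: The paper proposes bi-GRPO, a reinforcement-learning framework for embedding stealthy jailbreak backdoors in large language models (LLMs), addressing the poor generalization and limited stealthiness of existing methods.

Details Motivation: Existing jailbreak backdoor methods (such as SFT and RLHF) suffer from poor generalization, compromised stealthiness, or low usability of the generated jailbreak responses. The paper aims for a more effective optimization framework.

Contribution: bi-GRPO (bidirectional Group Relative Policy Optimization), an RL-based framework tailored to jailbreak backdoor injection that optimizes model behavior with pairwise rollouts and pairwise rewards.

Method: Uses pairwise rollouts and pairwise rewards, with a rule-based reward plus length and format incentives, to optimize the model to produce harmful content when the trigger is present and to remain safe otherwise.

Result: bi-GRPO outperforms existing methods in attack success rate (>99%), stealthiness, and usability of the generated jailbreak responses.

Insight: The framework removes the dependence on high-quality supervised datasets and on potentially flawed reward models, offering a new approach to studying jailbreak backdoor attacks.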

Abstract: With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers–such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)–each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.

[24] Benchmarking Gaslighting Attacks Against Speech Large Language Models

Jinyang Wu,Bin Zhu,Xiandong Zou,Qiquan Zhang,Xu Fang,Pan Zhou

Main category: cs.CL

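TL;DR: The paper introduces gaslighting attacks against Speech Large Language Models (Speech LLMs), using five crafted manipulation strategies (Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation) to evaluate robustness. Experiments, including multi-modal tests, reveal substantial performance degradation and behavioral vulnerability.

Details Motivation: As Speech LLMs spread into voice-based applications, robustness against manipulative input becomes critical. Yet the cognitive and perceptual challenges specific to speech interaction remain underexplored, and the ambiguity, continuity, and perceptual diversity of speech make adversarial attacks harder to detect.

Contribution: Proposes a gaslighting attack framework with five strategies to evaluate the vulnerability of speech and multi-modal LLMs; captures both performance degradation and behavioral responses (such as unsolicited apologies and refusals); shows that the attacks cause an average accuracy drop of 24.3%.

Method: Constructs five gaslighting strategies to test models across varied tasks; adds acoustic perturbation experiments to assess multi-modal robustness; evaluates 5 speech and multi-modal LLMs on over 10,000 test samples.

Result: The five gaslighting attacks reduce average accuracy by 24.3%, indicating significant behavioral vulnerability.

Insight: The distinctive properties of speech interaction make it more susceptible to manipulation; more resilient and trustworthy speech-based AI systems are needed.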

Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

[25] Future Policy Aware Preference Learning for Mathematical Reasoning

Minjae Oh,Yunho Choi,Dongmin Choi,Yohan Jo

Main category: cs.CL

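TL;DR: The paper proposes Future Policy Aware (FPA) preference learning to improve LLM training for mathematical reasoning. FPA regularizes gradients using an estimated future policy, which avoids over-penalizing useful tokens. Experiments show gains over existing methods on multiple benchmarks.

Details Motivation: Existing preference learning methods such as DPO are often ineffective for mathematical reasoning, mainly because token overlap between preferred and dispreferred trajectories causes over-penalization. Regularization based on the current policy may take effect only after the model has already begun to degrade.

Contribution: FPA preference learning, which estimates the future policy through lightweight logit-space extrapolation and preemptively regularizes gradients so that useful tokens are not penalized prematurely.

Method: FPA replaces the current policy with a future policy in the regularization term. The future policy is estimated by lightweight logit-space extrapolation from a reference model toward the current model.

Result: On the MATH and GSM8K benchmarks, FPA yields consistent gains, the largest with SimPER (up to 5.75%), and enables longer training without degradation.

Insight: By anticipating the behavior of the future policy, FPA balances protecting useful tokens against penalizing dispreferred trajectories, showing the promise of proactive regularization in preference learning.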

Abstract: Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
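The phrase "logit-space extrapolation from a reference model toward the current model" can be illustrated with a small sketch. The abstract does not give the exact formula, so the linear form below and the step-size parameter `lam` are assumptions made for illustration; the function names are hypothetical as well.

```python
import math


def extrapolate_logits(ref_logits, cur_logits, lam=1.0):
    """Estimate future-policy logits by continuing the reference-to-current shift.

    Assumed form (not confirmed by the abstract):
        future = current + lam * (current - reference)
    lam = 0 recovers the current policy; larger lam looks further ahead.
    """
    if len(ref_logits) != len(cur_logits):
        raise ValueError("logit vectors must have the same length")
    return [c + lam * (c - r) for r, c in zip(ref_logits, cur_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of two tokens: training has pushed token 0 up and token 1 down.
reference = [0.0, 0.0]
current = [1.0, -1.0]
future_probs = softmax(extrapolate_logits(reference, current, lam=1.0))
```

In this toy case the estimated future policy assigns token 1 an even lower probability than the current policy does, so a regularizer driven by the future policy would damp the gradient on that token earlier than one driven by the current policy.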

[26] WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction

Binbin Zhang,Chengdong Liang,Shuai Wang,Xuelong Geng,Zhao Guo,Haoyu Li,Hao Yin,Xipeng Yang,Pengshen Zhang,Changwei Ma,Lei Xie

Main category: cs.CL

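TL;DR: WEST is a speech toolkit based on a large language model (LLM) that supports speech understanding, generation, and interaction; it is fully LLM-based, full-stack, and simple to use.

Details Motivation: To handle the diverse needs of speech tasks (recognition, synthesis, understanding, and so on) in one easy-to-use toolkit while exploiting the strengths of LLMs.

Contribution: 1) A fully LLM-based architecture; 2) full-stack support for speech tasks; 3) a simple, easy-to-use toolkit design.

Method: Reuses mature large-model architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) to support recognition, synthesis, understanding, and other tasks, with extensibility to open-source models.

Result: Provides two types of recipes: 1) fully reproducible experiments based entirely on open-source models and data; 2) models trained on massive data that offer superior performance and can be used out of the box.

Insight: LLMs have broad potential for speech tasks, and a simple toolkit design can lower the technical barrier and make it easier to integrate tasks.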

Abstract: In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly available at https://github.com/wenet-e2e/west/

[27] CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems

Soham Bhattacharjee,Mukund K Roy,Yathish Poojary,Bhargav Dave,Mihir Raj,Vandan Mujadia,Baban Gain,Pruthwik Mishra,Arafat Ahsan,Parameswari Krishnamurthy,Ashwath Rao,Gurpreet Singh Josan,Preeti Dubey,Aadil Amin Kak,Anna Rao Kulkarni,Narendra VG,Sunita Arora,Rakesh Balbantray,Prasenjit Majumdar,Karunesh K Arora,Asif Ekbal,Dipti Mishra Sharma

Main category: cs.CL

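TL;DR: The paper presents CorIL, a large-scale, high-quality parallel corpus covering 11 languages, intended to enrich resources and research for machine translation between Indian languages.

Details Motivation: India is linguistically diverse, but high-quality parallel corpora are scarce, especially across varied domains. The authors therefore built a corpus covering 11 languages, categorized into Government, Health, and General domains.

Contribution: Releases the CorIL corpus of 772,000 bi-text sentence pairs, categorized into three key domains, and evaluates several state-of-the-art multilingual NMT models, analyzing performance trends and domain sensitivity.

Method: Builds and annotates a multilingual parallel corpus, then fine-tunes and evaluates IndicTrans2, NLLB, and BhashaVerse, analyzing performance across language scripts and domains.

Result: Massively multilingual models have an advantage on Perso-Arabic scripts (Urdu, Sindhi), while other models excel on Indic scripts. The domain-wise analysis also reveals differences in translation performance across domains.

Insight: Language script strongly affects model performance, which offers insight into cross-script transfer learning. The released corpus should become an important resource for machine translation research on Indian languages.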

Abstract: India’s linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 of these languages: English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati, comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus’s value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.

[28] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Tianqiao Liu,Xueyi Li,Hao Wang,Haoxuan Li,Zhichao Chen,Weiqi Luo,Zitao Liu

Main category: cs.CL

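TL;DR: The paper proposes TtT, a unified audio-text modeling framework that combines autoregressive text generation with non-autoregressive audio diffusion, avoiding the multi-stage training and high computational cost of existing methods.

Details Motivation: Existing audio-text multimodal models (such as MOSHI) require complex multi-stage training at high computational cost, and they overlook the asymmetry between the dependency structures of audio and text.

Contribution: TtT, a unified framework that integrates autoregressive text generation with non-autoregressive audio diffusion and is initialized from a pretrained LLM, which simplifies the training pipeline.

Method: Joint training within a single Transformer architecture, with autoregressive generation for text and a non-autoregressive diffusion model for audio.

Result: The framework avoids multi-stage training and improves efficiency while maintaining generation quality.

Insight: Audio and text have different dependency structures, so handling them differently (autoregressive vs. non-autoregressive) is more efficient.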

Abstract: Recent advances in large language models have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-in, speech-out conversational systems. However, existing multimodal models handling interleaved audio and text, such as MOSHI, require complex multi-stage training pipelines, incurring substantial computational costs. Moreover, these models uniformly apply autoregressive generation to both text and audio tokens, overlooking a fundamental asymmetry in their dependency structures: while text tokens exhibit strong target-target dependencies requiring causal ordering, audio tokens are predominantly driven by source-target dependencies, where audio outputs primarily condition on source text rather than preceding audio tokens. In this work, we propose TtT, a unified audio-text modeling framework that integrates AR text generation with non-autoregressive audio diffusion within a single Transformer architecture initialized from a pretrained LLM.

[29] Can Constructions “SCAN” Compositionality ?

Ganesh Katrapati,Manish Shrivastava

Main category: cs.CL

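TL;DR: The paper proposes an unsupervised method that automatically extracts variable-slot templates (pseudo-constructions) from training data to address the weak compositional and systematic generalization of sequence-to-sequence models, substantially improving out-of-distribution performance and data efficiency on the SCAN dataset.

Details Motivation: Sequence-to-sequence models excel at many tasks but struggle with compositionality and systematic generalization. The authors attribute this to a failure to internalize constructions, i.e. conventionalized form-meaning pairings, which limits the models' ability to recombine.

Contribution: An unsupervised procedure for mining pseudo-constructions that automatically extracts variable-slot templates from training data and substantially improves compositional and systematic generalization on SCAN.

Method: Mines pseudo-constructions (variable-slot templates) from training data in an unsupervised way, with no architectural changes or additional supervision, and applies them to the SCAN dataset.

Result: On SCAN, accuracy rises to 47.8% on ADD JUMP and to 20.3% on AROUND RIGHT, and competitive performance is reached with only 40% of the training data.

Insight: Construction-aware preprocessing of the data can substantially improve compositionality and generalization without relying on complex architectures or changes to the training regime.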

Abstract: Sequence to Sequence models struggle at compositionality and systematic generalisation even while they excel at many other tasks. We attribute this limitation to their failure to internalise constructions: conventionalised form-meaning pairings that license productive recombination. Building on these insights, we introduce an unsupervised procedure for mining pseudo-constructions: variable-slot templates automatically extracted from training data. When applied to the SCAN dataset, our method yields large gains on out-of-distribution splits: accuracy rises to 47.8% on ADD JUMP and to 20.3% on AROUND RIGHT without any architectural changes or additional supervision. The model also attains competitive performance with 40% of the original training data, demonstrating strong data efficiency. Our findings highlight the promise of construction-aware preprocessing as an alternative to heavy architectural or training-regime interventions.

[30] Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation

Chaojun Nie,Jun Zhou,Guanxiang Wang,Shisong Wud,Zichen Wang

Main category: cs.CL

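TL;DR: The paper proposes RLAG (Reinforcement Learning from Augmented Generation), which embeds domain knowledge into large language models and addresses the failure of existing methods to prioritize critical knowledge and build reasoning ability for domain tasks.

Details Motivation: Large language models (LLMs) perform poorly on domain-specific tasks, mainly because domain knowledge is underrepresented in their training data and that data is static. Existing approaches (such as CPT and SFT) fail to embed critical knowledge effectively or to build coherent knowledge structures for reasoning.

Contribution: RLAG, which iterates between generation and optimization with tailored reward metrics to embed critical domain knowledge and improve the model's reasoning ability.

Method: RLAG cycles between sampling generations and optimizing the model. It selects the generated outputs with the highest log probabilities and computes three tailored reward metrics to guide optimization.

Result: Experiments show that RLAG significantly outperforms baseline approaches across medical, legal, astronomy, and current events datasets.

Insight: By combining generation with optimization, RLAG embeds knowledge dynamically and improves reasoning, offering a new approach to embedding domain knowledge.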

Abstract: Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at https://github.com/ChaojunNie/RLAG.

[31] Thinking Augmented Pre-training

Liang Wang,Nan Yang,Shaohan Huang,Li Dong,Furu Wei

Main category: cs.CL

TL;DR: The paper proposes Thinking augmented Pre-Training (TPT), a method to enhance data efficiency in LLM training by augmenting text data with automatically generated thinking trajectories, improving performance and learnability of complex tokens.

Details Motivation: The motivation stems from the growing compute demands for LLM pre-training and the limited availability of high-quality data, making it crucial to maximize data utility.

Contribution: The main contribution is the introduction of TPT, a scalable approach that augments text data with thinking trajectories to improve data efficiency and model performance.

Method: TPT involves augmenting existing text data with automatically generated thinking trajectories, which decompose complex reasoning into step-by-step guidance, making high-quality tokens more learnable.

Result: Experiments show TPT improves data efficiency by 3x and boosts post-training performance by over 10% on reasoning benchmarks for a 3B parameter model.

Insight: The insight is that augmenting data with thinking trajectories can unlock the potential of fixed model capacity by breaking down complex reasoning processes into manageable steps.

Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.

[32] Play by the Type Rules: Inferring Constraints for LLM Functions in Declarative Programs

Parker Glenn,Alfy Samuel,Daben Liu

Main category: cs.CL

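TL;DR: This work explores how to integrate LLM-powered operators into declarative query languages to improve performance and accuracy, and proposes an efficient method for ensuring that LLM functions are well-typed.

Details Motivation: Current approaches rely on many LLM-based post-processing calls to align LLM outputs with database contents, which creates performance bottlenecks and calls for a more efficient solution.

Contribution: Proposes an efficient method for enforcing the well-typedness of LLM functions, achieving a 7% accuracy improvement and a 53% improvement in latency on a multi-hop question answering dataset.

Method: Studies the ability of open-source language models of various sizes to parse and execute functions in a SQL-based query language, and designs a scheme for enforcing type constraints.

Result: Experiments show that small models can excel as function executors over hybrid data sources, and that the new method improves accuracy by 7% and latency by 53% over comparable solutions.

Insight: Small language models can serve as efficient function executors for specific tasks, and enforcing type constraints is key to efficient integration.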

Abstract: Integrating LLM powered operators in declarative query languages allows for the combination of cheap and interpretable functions with powerful, generalizable language model reasoning. However, in order to benefit from the optimized execution of a database query language like SQL, generated outputs must align with the rules enforced by both type checkers and database contents. Current approaches address this challenge with orchestrations consisting of many LLM-based post-processing calls to ensure alignment between generated outputs and database values, introducing performance bottlenecks. We perform a study on the ability of various sized open-source language models to both parse and execute functions within a query language based on SQL, showing that small language models can excel as function executors over hybrid data sources. Then, we propose an efficient solution to enforce the well-typedness of LLM functions, demonstrating 7% accuracy improvement on a multi-hop question answering dataset with 53% improvement in latency over comparable solutions. We make our implementation available at https://github.com/parkervg/blendsql
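
To make "well-typed" concrete, here is a minimal sketch of a check that coerces a raw LLM string into the type a query expects, or rejects it. It illustrates the constraint only; it is not the paper's method and not the BlendSQL API. The function name `coerce_llm_output` and the type names it accepts are hypothetical.

```python
from typing import Union


def coerce_llm_output(raw: str, expected_type: str) -> Union[int, float, bool, str]:
    """Coerce a raw LLM string into the declared type, or raise ValueError.

    `expected_type` uses hypothetical SQL-like names chosen for this sketch.
    """
    text = raw.strip()
    if expected_type == "INTEGER":
        # int() raises ValueError on non-numeric text, which is the rejection path.
        return int(text.replace(",", ""))
    if expected_type == "REAL":
        return float(text.replace(",", ""))
    if expected_type == "BOOLEAN":
        lowered = text.lower()
        if lowered in {"true", "yes", "1"}:
            return True
        if lowered in {"false", "no", "0"}:
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    if expected_type == "TEXT":
        return text
    raise ValueError(f"unsupported type: {expected_type}")
```

A value that fails the check would otherwise surface later as a type error or a silent mismatch against database contents, which is the alignment problem the abstract describes.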

[33] Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks

Hailay Kidu Teklehaymanot,Gebrearegawi Gidey,Wolfgang Nejdl

Main category: cs.CL

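TL;DR: The paper studies how multilingual pretrained models can improve machine translation quality for Tigrinya, a low-resource language. It proposes an approach that combines language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning, and it builds a high-quality evaluation dataset.

Details Motivation: In neural machine translation, low-resource languages such as Tigrinya still suffer from limited corpora, inadequate tokenization strategies, and a lack of standardized evaluation benchmarks.

Contribution: Proposes an approach that combines language-specific tokenization with multilingual transfer learning and constructs a high-quality English-Tigrinya evaluation dataset; experiments show the approach substantially outperforms zero-shot baselines.

Method: Uses multilingual pretrained models with language-specific tokenization and domain-adaptive fine-tuning, and applies Bonferroni correction when testing statistical significance across configurations.

Result: A custom tokenizer combined with transfer learning substantially improves translation quality, with gains validated by BLEU, chrF, and qualitative human evaluation.

Insight: Linguistically aware modeling and reproducible benchmarks are crucial for improving performance on low-resource languages, and error analysis informs targeted refinements.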

Abstract: Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at https://github.com/hailaykidu/MachineT_TigEng and https://huggingface.co/Hailay/MachineT_TigEng
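
The abstract mentions Bonferroni correction without spelling it out. As a minimal sketch of the standard procedure: with m comparisons and a family-wise level alpha, each individual test is judged against alpha / m. The function name and the p-values in the usage note are made up for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it is significant after Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return []
    # Each test is compared against the family-wise level divided by the test count.
    threshold = alpha / m
    return [p <= threshold for p in p_values]
```

For example, with three hypothetical p-values `[0.001, 0.02, 0.04]` and alpha = 0.05, the per-test threshold is about 0.0167, so only the first comparison remains significant.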

[34] Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage

Zipeng Ling,Yuehao Tang,Chen Huang,Shuliang Liu,Gaoyang Jiang,Shenghong Fu,Junqi Yang,Yao Wan,Jiawan Zhang,Kejia Huang,Xuming Hu

Main category: cs.CL

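TL;DR: The paper introduces the Instruction Boundary problem, examines biases in LLM reasoning under different levels of prompt coverage, and develops the BiasDetector framework to measure them. It finds that even when LLMs achieve high headline accuracy, prompt design still causes substantial downstream biases.

Details Motivation: LLM reasoning is powerful, but its reliability is limited by prompt design. Users may unintentionally supply biased or incomplete prompts that mislead the model; the authors aim to improve LLM reliability by quantifying this bias.

Contribution: 1. Defines and quantifies the Instruction Boundary problem in LLM reasoning; 2. Develops BiasDetector, a framework for measuring the biases that arise from three instruction types (complete, redundant, and insufficient); 3. Presents an empirical study of LLM biases in downstream tasks and their practical impact.

Method: 1. Distills the Instruction Boundary problem into eight facets; 2. Proposes BiasDetector, which measures bias through standardized prompt design; 3. Runs experiments on several mainstream LLMs.

Result: Despite high headline accuracy, insufficient or redundant prompt coverage leads to substantial downstream biases, and the problem is more pronounced on complex tasks.

Insight: Prompt design is critical to the reliability of LLM reasoning; developers and users should pay closer attention to the completeness and precision of prompts to reduce the risk of bias.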

Abstract: Large-language-model (LLM) reasoning has long been regarded as a powerful tool for problem solving across domains, providing non-experts with valuable advice. However, their limitations - especially those stemming from prompt design - remain underexplored. Because users may supply biased or incomplete prompts - often unintentionally - LLMs can be misled, undermining reliability and creating risks. We refer to this vulnerability as the Instruction Boundary. To investigate the phenomenon, we distill it into eight concrete facets and introduce BiasDetector, a framework that measures biases arising from three instruction types: complete, redundant, and insufficient. We evaluate several mainstream LLMs and find that, despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. Our empirical study confirms that LLM reasoning reliability can still be significantly improved. We analyze the practical impact of these biases and outline mitigation strategies. Our findings underscore the need for developers to tackle biases and for users to craft options carefully.

[35] Feeding Two Birds or Favoring One? Adequacy-Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation

Behzad Shayegh,Jan-Thorsten Peter,David Vilar,Tobias Domhan,Juraj Juraska,Markus Freitag,Lili Mou

Main category: cs.CL

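TL;DR: The study examines the tradeoff between adequacy and fluency in machine translation, shows that current evaluation metrics lean toward adequacy, and proposes synthesizing translation systems in meta-evaluation to reduce the resulting bias.

Details Motivation: Machine translation evaluation usually covers two dimensions, adequacy and fluency, but existing metrics tend to favor adequacy. This can sideline fluency-oriented metrics and undermine the fairness of evaluation.

Contribution: 1. Shows that evaluation metrics lean toward adequacy and how severe the tradeoff is; 2. Analyzes the bias against fluency-oriented metrics in meta-evaluation; 3. Proposes a method that synthesizes translation systems to reduce bias in meta-evaluation.

Method: Analyzes where existing metrics fall within the tradeoff and proposes synthesizing translation systems in meta-evaluation to balance the weight given to adequacy and fluency.

Result: Current metrics generally lean toward adequacy, and a similar bias exists at the meta-evaluation level; the proposed method helps control this bias.

Insight: Researchers should account for the adequacy-fluency tradeoff and check that both metrics and meta-evaluation methods are fair, so that certain translation systems are not favored.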

Abstract: We investigate the tradeoff between adequacy and fluency in machine translation. We show the severity of this tradeoff at the evaluation level and analyze where popular metrics fall within it. Essentially, current metrics generally lean toward adequacy, meaning that their scores correlate more strongly with the adequacy of translations than with fluency. More importantly, we find that this tradeoff also persists at the meta-evaluation level, and that the standard WMT meta-evaluation favors adequacy-oriented metrics over fluency-oriented ones. We show that this bias is partially attributed to the composition of the systems included in the meta-evaluation datasets. To control this bias, we propose a method that synthesizes translation systems in meta-evaluation. Our findings highlight the importance of understanding this tradeoff in meta-evaluation and its impact on metric rankings.

[36] Multilingual Hope Speech Detection: A Comparative Study of Logistic Regression, mBERT, and XLM-RoBERTa with Active Learning

T. O. Abiola,K. D. Abiodun,O. E. Olumide,O. O. Adebanji,O. Hiram Calvo,Grigori Sidorov

Main category: cs.CL

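TL;DR: The paper studies multilingual hope speech detection, compares logistic regression, mBERT, and XLM-RoBERTa, and introduces an active learning strategy, demonstrating the effectiveness of transformer models in low-resource settings.

Details Motivation: Hope speech (encouraging, optimistic language) is vital to fostering positive discourse online, but detecting it in multilingual and low-resource settings remains challenging.

Contribution: Proposes a framework that combines active learning with multilingual transformer models and validates it on datasets in several languages.

Method: Uses logistic regression as the baseline, compares the performance of mBERT and XLM-RoBERTa, and introduces an active learning strategy to reduce the need for annotated data.

Result: XLM-RoBERTa performs best, and the active learning strategy maintains strong performance even with small annotated datasets.

Insight: Combining multilingual transformers with active learning offers an efficient solution for low-resource hope speech detection.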

Abstract: Hope speech, language that fosters encouragement and optimism, plays a vital role in promoting positive discourse online. However, its detection remains challenging, especially in multilingual and low-resource settings. This paper presents a multilingual framework for hope speech detection using an active learning approach and transformer-based models, including mBERT and XLM-RoBERTa. Experiments were conducted on datasets in English, Spanish, German, and Urdu, including benchmark test sets from recent shared tasks. Our results show that transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving the highest overall accuracy. Furthermore, our active learning strategy maintained strong performance even with small annotated datasets. This study highlights the effectiveness of combining multilingual transformers with data-efficient training strategies for hope speech detection.
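
The abstract does not say which active learning query strategy is used. As an illustration only, the sketch below shows least-confidence sampling, one common choice: from an unlabeled pool, pick the examples whose highest predicted class probability is lowest and send those for annotation. The function name and the probabilities in the usage note are assumptions, not details from the paper.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k pool examples the classifier is least sure about.

    `probabilities[i]` is the predicted class distribution for pool example i.
    """
    # Confidence is the probability assigned to the most likely class.
    scored = [(max(dist), index) for index, dist in enumerate(probabilities)]
    scored.sort()  # lowest confidence first; ties broken by index
    return [index for _, index in scored[:k]]
```

With a hypothetical pool of `[[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]` and `k=2`, the function selects examples 1 and 2, the two closest to the decision boundary.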

[37] SIM-CoT: Supervised Implicit Chain-of-Thought

Xilin Wei,Xiaoran Liu,Yuhang Zang,Xiaoyi Dong,Yuhang Cao,Jiaqi Wang,Xipeng Qiu,Dahua Lin

Main category: cs.CL

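TL;DR: The paper proposes SIM-CoT, which introduces step-level supervision to address instability in implicit chain-of-thought methods, significantly improving both performance and stability.

Details Motivation: Implicit chain-of-thought methods are computationally efficient in large language models, but training instability and insufficient semantic diversity limit their performance. This work addresses the problem with step-level supervision.

Contribution: 1. Proposes the SIM-CoT module, which introduces step-level supervision through an auxiliary decoder to stabilize and enrich the latent reasoning space; 2. Removes the auxiliary decoder after training, preserving inference efficiency; 3. Improves the performance and stability of implicit chain-of-thought methods.

Method: During training, an auxiliary decoder aligns each implicit token with its explicit reasoning step so that latent states capture meaningful semantic information; the auxiliary module is removed at inference.

Result: Significantly improves performance on models such as GPT-2 and LLaMA-3.1 8B (for example, +8.2% for Coconut on GPT-2) and narrows the performance gap on larger models.

Insight: Step-level supervision effectively resolves instability in implicit reasoning while preserving computational efficiency. SIM-CoT is general and applies to a variety of implicit chain-of-thought methods.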

Abstract: Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with $2.3\times$ greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.

[38] Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

Maria Teleki,Sai Janjur,Haoran Liu,Oliver Grabner,Ketan Verma,Thomas Docog,Xiangjue Dong,Lingfeng Shi,Cong Wang,Stephanie Birkelbach,Jason Kim,Yin Zhang,James Caverlee

Main category: cs.CL

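TL;DR: The paper proposes Z-Scores, a linguistically grounded evaluation metric for analyzing disfluency removal, which reveals systematic model weaknesses better than traditional word-level metrics.

Details Motivation: Traditional word-level metrics (precision, recall, and F1) cannot capture how a model performs on different types of disfluency, so a finer-grained evaluation method is needed.

Contribution: Introduces Z-Scores, a span-level metric based on linguistic categories that gives detailed diagnostics for different disfluency types (EDITED, INTJ, PRN), with a deterministic alignment module that maps generated text to disfluent transcripts.

Method: Z-Scores use span-level analysis together with a deterministic alignment module to evaluate, category by category, how well a model handles each disfluency type.

Result: Z-Scores reveal hidden challenges that LLMs face with INTJ and PRN disfluencies, which aggregate F1 does not expose, and directly inform model refinement strategies.

Insight: Linguistically categorized metrics provide more detailed model diagnostics and help design targeted interventions (such as tailored prompts or data augmentation) that improve performance.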

Abstract: Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions – such as tailored prompts or data augmentation – yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
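
To show why category-specific diagnostics can expose what an aggregate score hides, the sketch below computes a per-category removal rate. It is an illustration under simplifying assumptions, not the paper's Z-Score definition: it works on tokens rather than spans, and it assumes an alignment step has already determined which token positions the system removed. The function name and example labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Sequence, Set


def per_category_removal(
    labels: Sequence[Optional[str]], removed: Set[int]
) -> Dict[str, float]:
    """Fraction of disfluent tokens removed, broken down by disfluency category.

    `labels[i]` is the category of token i (for example "EDITED", "INTJ", "PRN"),
    or None if the token is fluent. `removed` holds the indices the system deleted.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for index, label in enumerate(labels):
        if label is None:
            continue  # fluent tokens do not count toward any category
        totals[label] += 1
        if index in removed:
            hits[label] += 1
    return {category: hits[category] / totals[category] for category in totals}
```

For labels `["INTJ", None, "EDITED", "EDITED", None, "PRN"]` with tokens 0 and 2 removed, the rates are 1.0 for INTJ, 0.5 for EDITED, and 0.0 for PRN, a spread that a single overall recall of 0.5 would not show.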

[39] DRES: Benchmarking LLMs for Disfluency Removal

Maria Teleki,Sai Janjur,Haoran Liu,Oliver Grabner,Ketan Verma,Thomas Docog,Xiangjue Dong,Lingfeng Shi,Cong Wang,Stephanie Birkelbach,Jason Kim,Yin Zhang,James Caverlee

Main category: cs.CL

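TL;DR: The paper introduces DRES (Disfluency Removal Evaluation Suite), a benchmark for evaluating large language models (LLMs) on disfluency removal. Using human-annotated Switchboard transcripts, the study finds that segmentation is effective, that reasoning-oriented models over-delete, and that fine-tuning has limitations.

Details Motivation: Disfluencies are a major challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. Prior work lacks a reproducible benchmark, which makes it hard to evaluate disfluency removal systematically.

Contribution: 1. Proposes the DRES benchmark, which establishes a reproducible semantic upper bound for disfluency removal; 2. Systematically evaluates LLMs across scales, architectures, and prompting strategies; 3. Offers nine practical deployment recommendations.

Method: 1. Builds DRES on human-annotated Switchboard transcripts; 2. Studies how segmentation, model architecture, and fine-tuning affect performance; 3. Analyzes error modes and derives recommendations.

Result: 1. Simple segmentation consistently improves performance; 2. Reasoning-oriented models tend to over-delete fluent tokens; 3. Fine-tuning achieves near state-of-the-art precision and recall but harms generalization.

Insight: 1. Segmentation is an effective strategy for disfluency removal; 2. Model architecture and scale affect performance; 3. Practical deployment must balance precision against generalization.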

Abstract: Disfluencies – such as “um,” “uh,” interjections, parentheticals, and edited statements – remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
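
The abstract reports that simple segmentation helps but does not describe the scheme. The sketch below shows one plausible form, splitting a transcript into fixed-size word windows so that each window can be processed separately. The window size and the function name are assumptions made for illustration; the benchmark's actual segmentation may differ.

```python
from typing import List


def segment_words(transcript: str, max_words: int = 50) -> List[str]:
    """Split a transcript into consecutive segments of at most `max_words` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = transcript.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Each segment would then be sent to the model on its own and the cleaned outputs concatenated in order.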

[40] Language Models that Think, Chat Better

Adithya Bhaskar,Xi Ye,Danqi Chen

Main category: cs.CL

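TL;DR: The paper proposes RLMT, a method that combines reinforcement learning with model-based rewards to improve language models' reasoning and chat abilities on open-ended tasks. It outperforms standard RLHF pipelines and performs strongly on multiple benchmarks.

Details Motivation: Reinforcement learning with verifiable rewards (RLVR) works well in verifiable domains such as mathematics and code but generalizes poorly to open-ended tasks such as writing or planning. The authors aim to extend the RLVR paradigm and improve language model performance on open-ended tasks.

Contribution: 1) Proposes RLMT, which uses model rewards to optimize the reasoning and chat abilities of language models; 2) Shows RLMT's advantage on multiple benchmarks, especially chat and creative writing tasks; 3) Demonstrates that RLMT can be applied directly to base models without a supervised fine-tuning (SFT) stage.

Method: RLMT requires the language model to generate long chain-of-thought (CoT) reasoning before responding and optimizes it with online reinforcement learning (DPO, PPO, and GRPO) against a preference-based reward model of the kind used in RLHF.

Result: RLMT yields substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2) and 1-3 point improvements on other tasks such as creative writing and general knowledge. The best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking).

Insight: RLMT's success suggests that reinforcement learning can be applied more broadly to open-ended tasks and offers pointers for future work on using thinking more effectively.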

Abstract: Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks – such as writing outline essays or making meal plans – where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before response, and optimizes them with online RL against a preference-based reward model used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-staged pipeline with 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results rethink the post-training pipeline and call upon future work to understand and employ thinking more broadly.

cs.CV [Back]

[41] Vision-Based Perception for Autonomous Vehicles in Off-Road Environment Using Deep Learning

Nelson Alves Ferreira Neto

Main category: cs.CV

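TL;DR: The paper proposes a perception system for autonomous driving in off-road environments. It uses the Configurable Modular Segmentation Network (CMSNet) to segment obstacles and trafficable ground in real time, and validates the system under adverse conditions with the new Kamino dataset.

Details Motivation: Unstructured off-road environments, such as open-pit mines and roads in developing countries, require low-latency intelligent systems for autonomous driving. Conventional approaches rely on predefined trails and adapt poorly to complex, changing off-road scenes.

Contribution: 1. Proposes CMSNet, a modular segmentation network that supports different architectural arrangements; 2. Releases the Kamino dataset, with almost 12,000 off-road images captured under adverse conditions; 3. Achieves real-time inference through TensorRT and CUDA optimization.

Method: Uses the CMSNet framework with deep learning to segment obstacles and trafficable ground. CNN layers are removed and fused to speed up inference, and the system is validated on Kamino and one other dataset.

Result: Experiments show that CMSNet effectively segments trafficable regions under adverse conditions (night, rain, dust) and reaches real-time processing after optimization.

Insight: Modular design makes the network more flexible; real data from adverse conditions is crucial for model robustness; inference optimization has practical value in autonomous driving.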

Abstract: Low-latency intelligent systems are required for autonomous driving on non-uniform terrain in open-pit mines and developing countries. This work proposes a perception system for autonomous vehicles on unpaved roads and off-road environments, capable of navigating rough terrain without a predefined trail. The Configurable Modular Segmentation Network (CMSNet) framework is proposed, facilitating different architectural arrangements. CMSNet configurations were trained to segment obstacles and trafficable ground on new images from unpaved/off-road scenarios with adverse conditions (night, rain, dust). We investigated applying deep learning to detect drivable regions without explicit track boundaries, studied algorithm behavior under visibility impairment, and evaluated field tests with real-time semantic segmentation. A new dataset, Kamino, is presented with almost 12,000 images from an operating vehicle with eight synchronized cameras. The Kamino dataset has a high number of labeled pixels compared to similar public collections and includes images from an off-road proving ground emulating a mine under adverse visibility. To achieve real-time inference, CMSNet CNN layers were methodically removed and fused using TensorRT, C++, and CUDA. Empirical experiments on two datasets validated the proposed system’s effectiveness.

[42] Overview of LifeCLEF Plant Identification task 2020

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

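TL;DR: The LifeCLEF 2020 Plant Identification task evaluates how herbarium data can improve automated plant identification systems, especially in data-deficient tropical regions.

Details Motivation: Despite progress in deep learning for plant identification, most data is concentrated in North America and Western Europe, with poor coverage of biodiversity-rich regions such as the tropics. Herbarium data offers a way to fill this gap.

Contribution: PlantCLEF 2020 provides a dataset of about 1,000 species focused on the Guiana Shield in South America, combining herbarium sheets with field photos in a cross-domain classification task.

Method: The training set combines several hundred thousand herbarium sheets with a few thousand field photos; the test set contains only field photos, so the task measures cross-domain identification performance.

Result: The paper summarizes the approaches used by participating teams and analyzes the main outcomes, showing the potential of herbarium data for improving automated plant identification in tropical regions.

Insight: Herbarium data can be an important complement in biodiversity-rich regions where field photos are scarce, and cross-domain learning may help address data imbalance.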

Abstract: Automated identification of plants has improved considerably thanks to the recent progress in deep learning and the availability of training data with more and more photos in the field. However, this profusion of data only concerns a few tens of thousands of species, mostly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have collected, catalogued and systematically stored plant specimens in herbaria, particularly in tropical regions, and the recent efforts by the biodiversity informatics community made it possible to put millions of digitized sheets online. The LifeCLEF 2020 Plant Identification challenge (or “PlantCLEF 2020”) was designed to evaluate to what extent automated identification on the flora of data-deficient regions can be improved by the use of herbarium collections. It is based on a dataset of about 1,000 species mainly focused on South America’s Guiana Shield, an area known to have one of the greatest diversities of plants in the world. The challenge was evaluated as a cross-domain classification task where the training set consists of several hundred thousand herbarium sheets and a few thousand photos to enable learning a mapping between the two domains. The test set was exclusively composed of photos in the field. This paper presents the resources and assessments of the conducted evaluation, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

[43] iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

Manyi Yao,Bingbing Zhuang,Sparsh Garg,Amit Roy-Chowdhury,Christian Shelton,Manmohan Chandraker,Abhishek Aich

Main category: cs.CV

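TL;DR: iFinder is a modular, training-free structured semantic grounding framework. It converts dash-cam videos into a hierarchical, interpretable data structure, decoupling perception from reasoning and improving LLM performance in zero-shot driving video understanding.

Details Motivation: Existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events, especially when vision is the only available modality, as with dash-cam video. iFinder aims to give LLMs domain-specific semantic grounding through structured hierarchical data.

Contribution: 1) Proposes iFinder, a training-free modular framework that decouples perception from reasoning through a hierarchical data structure; 2) Uses pretrained vision models to extract key driving-domain cues such as object pose and trajectories; 3) Proposes a three-block prompting strategy that supports step-wise reasoning and significantly improves zero-shot driving video understanding.

Method: iFinder converts video into hierarchical data structures at the frame and video levels, using pretrained vision models to extract key cues such as object pose, lane positions, and object trajectories. Combined with the three-block prompting strategy, this lets the LLM reason step by step and refine a V-VLM's outputs.

Result: On four public dash-cam video benchmarks, iFinder significantly outperforms end-to-end V-VLMs, with gains of up to 39% in accident reasoning accuracy.

Insight: Feeding domain-specific structured representations to the LLM allows a zero-shot approach to deliver strong, interpretable results without training, offering a new direction for driving video analysis.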

Abstract: Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues – object pose, lane positions, and object trajectories – which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM’s outputs and provide accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder’s proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.

[44] CURE: Centroid-guided Unsupervised Representation Erasure for Facial Recognition Systems

Fnu Shivam,Nima Najafzadeh,Yenumula Reddy,Prashnna Gyawali

Main category: cs.CV

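TL;DR: The paper proposes CURE, an unsupervised machine unlearning framework for facial recognition systems that removes targeted samples without identity labels while preserving model performance.

Details Motivation: The widespread use of facial recognition systems raises privacy concerns. Existing unlearning methods depend on supervised labels, which are hard to obtain in privacy-constrained situations or in large-scale, noisy datasets.

Contribution: 1. Proposes CURE, the first unsupervised unlearning framework for facial recognition systems; 2. Designs a new metric, the Unlearning Efficiency Score (UES), which balances forgetting and retention stability; 3. Demonstrates the practical use of unlearning low-quality images.

Method: CURE uses centroid-guided unsupervised representation erasure to selectively forget data without labels, and the UES metric is used to evaluate unlearning.

Result: CURE significantly outperforms unsupervised variants of existing unlearning methods and handles low-quality images effectively.

Insight: Unsupervised unlearning shows promise for privacy protection, and image quality plays an important role in unlearning.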

Abstract: In the current digital era, facial recognition systems offer significant utility and have been widely integrated into modern technological infrastructures; however, their widespread use has also raised serious privacy concerns, prompting regulations that mandate data removal upon request. Machine unlearning has emerged as a powerful solution to address this issue by selectively removing the influence of specific user data from trained models while preserving overall model performance. However, existing machine unlearning techniques largely depend on supervised techniques requiring identity labels, which are often unavailable in privacy-constrained situations or in large-scale, noisy datasets. To address this critical gap, we introduce CURE (Centroid-guided Unsupervised Representation Erasure), the first unsupervised unlearning framework for facial recognition systems that operates without the use of identity labels, effectively removing targeted samples while preserving overall performance. We also propose a novel metric, the Unlearning Efficiency Score (UES), which balances forgetting and retention stability, addressing shortcomings in the current evaluation metrics. CURE significantly outperforms unsupervised variants of existing unlearning methods. Additionally, we conducted quality-aware unlearning by designating low-quality images as the forget set, demonstrating its usability and benefits, and highlighting the role of image quality in machine unlearning.

[45] Raw-JPEG Adapter: Efficient Raw Image Compression with JPEG

Mahmoud Afifi,Ran Zhang,Michael S. Brown

Main category: cs.CV

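TL;DR: The paper proposes Raw-JPEG Adapter, a lightweight, learnable preprocessing pipeline that adapts raw images for standard JPEG compression while retaining the ability to reconstruct them with high fidelity.

Details Motivation: Raw images preserve full sensor information but require large storage, whereas JPEG is efficient and widely compatible but poorly suited to storing raw data. This work aims to resolve that tension.

Contribution: Proposes a lightweight, invertible preprocessing pipeline that adapts raw images for JPEG compression and stores its parameters in the JPEG comment field to enable accurate reconstruction.

Method: Preprocesses raw images with spatial and optional frequency-domain transforms, stores the compact parameters in the JPEG comment field, and supports other codecs as well.

Result: Experiments show that the method achieves higher fidelity than direct JPEG storage and provides a favorable trade-off between compression ratio and reconstruction accuracy.

Insight: Learnable preprocessing combined with embedded parameters makes it possible to preserve raw image information while staying JPEG-compatible, which gives a practical solution.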

Abstract: Digital cameras digitize scene light into linear raw representations, which the image signal processor (ISP) converts into display-ready outputs. While raw data preserves full sensor information, which is valuable for editing and vision tasks, formats such as Digital Negative (DNG) require large storage, making them impractical in constrained scenarios. In contrast, JPEG is a widely supported format, offering high compression efficiency and broad compatibility, but it is not well-suited for raw storage. This paper presents Raw-JPEG Adapter, a lightweight, learnable, and invertible preprocessing pipeline that adapts raw images for standard JPEG compression. Our method applies spatial and optional frequency-domain transforms, with compact parameters stored in the JPEG comment field, enabling accurate raw reconstruction. Experiments across multiple datasets show that our method achieves higher fidelity than direct JPEG storage, supports other codecs, and provides a favorable trade-off between compression ratio and reconstruction accuracy.

[46] The Impact of 2D Segmentation Backbones on Point Cloud Predictions Using 4D Radar

William L. Muckelroy III,Mohammed Alsakabi,John M. Dolan,Ozan K. Tonguz

Main category: cs.CV

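TL;DR: The paper studies how 2D segmentation backbones affect the quality of point clouds generated from 4D radar. It finds that higher-capacity models are not necessarily better, but an optimal backbone yields a 23.7% improvement over the state of the art.

Details Motivation: LiDAR is expensive, which limits its adoption in commercial autonomous driving systems. 4D radar is a lower-cost alternative that can be used to generate LiDAR-like point clouds, but the quality of those point clouds needs to improve.

Contribution: Studies the effect of 2D segmentation backbones of different capacities on point clouds generated from 4D radar, finds that an optimal backbone capacity exists, and reports a 23.7% improvement over the state of the art (SOTA) with the optimal backbone.

Method: Uses a 2D convolutional neural network (CNN) backbone with a temporal coherence network, trained on the RaDelft dataset, to examine how backbones of different capacities affect point cloud generation.

Result: Experiments show that very high-capacity backbones may hurt performance, while an optimal backbone significantly improves point cloud quality, exceeding the existing state of the art by 23.7%.

Insight: Network capacity should be balanced against the complexity of the problem; backbone choice is key to improving point cloud generation from 4D radar.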

Abstract: LiDAR’s dense, sharp point cloud (PC) representations of the surrounding environment enable accurate perception and significantly improve road safety by offering greater scene awareness and understanding. However, LiDAR’s high cost continues to restrict the broad adoption of high-level Autonomous Driving (AD) systems in commercially available vehicles. Prior research has shown progress towards circumventing the need for LiDAR by training a neural network, using LiDAR point clouds as ground truth (GT), to produce LiDAR-like 3D point clouds using only 4D Radars. One of the best examples is a neural network created to train a more efficient radar target detector with a modular 2D convolutional neural network (CNN) backbone and a temporal coherence network at its core that uses the RaDelft dataset for training (see arXiv:2406.04723). In this work, we investigate the impact of higher-capacity segmentation backbones on the quality of the produced point clouds. Our results show that while very high-capacity models may actually hurt performance, an optimal segmentation backbone can provide a 23.7% improvement over the state-of-the-art (SOTA).

[47] Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment

Aravind Narayanan,Vahid Reza Khazaie,Shaina Raza

Main category: cs.CV

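TL;DR: Using a news-image benchmark, the paper studies the risk that vision-language models (VLMs) absorb and reproduce harmful social stereotypes when interpreting images and text. It uses an LLM as judge and shows that visual context systematically shifts model outputs.

Details Motivation: Large VLMs may produce harmful stereotypes when given images containing social visual cues such as age, gender, race, and occupation. The work aims to quantify how prevalent these biases are and what impact they have.

Contribution: 1. Introduces a benchmark of 1,343 news image-question pairs annotated with ground-truth answers and demographic attributes; 2. Uses a large language model (LLM) as judge, with human verification; 3. Shows that visual context systematically affects VLM outputs and analyzes how bias varies across attributes and models.

Method: 1. Builds the news-image dataset; 2. Evaluates VLMs with an LLM as judge; 3. Combines this with human verification to analyze bias and systematic shifts in model outputs.

Result: 1. Visual context systematically shifts model outputs in open-ended settings; 2. Bias risk is highest for the gender and occupation attributes; 3. Higher faithfulness does not necessarily correspond to lower bias.

Insight: Fairness evaluation of VLMs needs to consider context and multiple attributes together; high accuracy or faithfulness can mask underlying bias.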

Abstract: Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.

[48] MoTiC: Momentum Tightness and Contrast for Few-Shot Class-Incremental Learning

Zeyu He,Shuai Huang,Yuwu Lu,Ming Zhao

Main category: cs.CV

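TL;DR: The paper proposes MoTiC, a framework that addresses estimation bias and feature tightness in few-shot class-incremental learning (FSCIL). It uses Bayesian analysis and contrastive learning to improve prototype accuracy and achieves state-of-the-art performance on several benchmarks.

Details Motivation: FSCIL faces the dual challenge of learning new classes from few samples while retaining knowledge of old classes. Existing methods use a frozen feature extractor and class-averaged prototypes, but new-class prototypes suffer significant bias because data is scarce.

Contribution: 1. Shows theoretically, via Bayesian analysis, that aligning new-class priors with old-class statistics reduces variance; 2. Proposes large-scale contrastive learning to strengthen feature tightness; 3. Combines momentum self-supervision with virtual categories to build a rich feature space.

Method: The MoTiC framework integrates Bayesian analysis, contrastive learning, momentum self-supervision, and virtual categories to balance feature tightness and diversity.

Result: Achieves state-of-the-art performance on three FSCIL benchmarks, particularly on the fine-grained task CUB-200, validating the method's ability to reduce bias and improve robustness.

Insight: Combining prior knowledge with contrastive learning can substantially improve prototype estimation and feature representation in FSCIL, offering a new direction for data-scarce settings.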

Abstract: Few-Shot Class-Incremental Learning (FSCIL) must contend with the dual challenge of learning new classes from scarce samples while preserving old class knowledge. Existing methods use the frozen feature extractor and class-averaged prototypes to mitigate against catastrophic forgetting and overfitting. However, new-class prototypes suffer significant estimation bias due to extreme data scarcity, whereas base-class prototypes benefit from sufficient data. In this work, we theoretically demonstrate that aligning the new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy. Furthermore, we propose large-scale contrastive learning to enforce cross-category feature tightness. To further enrich feature diversity and inject prior information for new-class prototypes, we integrate momentum self-supervision and virtual categories into the Momentum Tightness and Contrast framework (MoTiC), constructing a feature space with rich representations and enhanced interclass cohesion. Experiments on three FSCIL benchmarks produce state-of-the-art performances, particularly on the fine-grained task CUB-200, validating our method’s ability to reduce estimation bias and improve incremental learning robustness.
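
The abstract does not give the Bayesian derivation. As a rough illustration of why a prior built from old-class statistics can help, the sketch below uses the standard conjugate Gaussian result, in which the posterior mean is a weighted average of the few-shot sample mean and the prior mean. This is a textbook shrinkage estimator, not necessarily the paper's formulation; `kappa` (the ratio of within-class variance to prior variance) and the function names are assumptions.

```python
from typing import List, Sequence


def class_mean(samples: Sequence[Sequence[float]]) -> List[float]:
    """Class-averaged prototype: the per-dimension mean of the feature vectors."""
    count = len(samples)
    dim = len(samples[0])
    return [sum(sample[d] for sample in samples) / count for d in range(dim)]


def shrunk_prototype(
    samples: Sequence[Sequence[float]],
    prior_mean: Sequence[float],
    kappa: float = 1.0,
) -> List[float]:
    """Posterior-mean prototype under a Gaussian prior centred on `prior_mean`.

    With n samples the weight on the sample mean is n / (n + kappa), so the
    estimate leans on the prior when n is small and on the data when n is large.
    """
    count = len(samples)
    weight = count / (count + kappa)
    sample_mean = class_mean(samples)
    return [
        weight * m + (1.0 - weight) * p for m, p in zip(sample_mean, prior_mean)
    ]
```

With two samples `[[2.0, 0.0], [4.0, 0.0]]`, a prior mean of `[0.0, 0.0]`, and `kappa=2.0`, the weight is 0.5 and the prototype moves from the sample mean `[3.0, 0.0]` to `[1.5, 0.0]`; setting `kappa=0.0` recovers the plain class average.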

[49] Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies

Sumit Mamtani

Main category: cs.CV

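TL;DR: The paper proposes two lightweight optimization techniques, STA and ANF, to address structured noise in Vision Transformer feature maps, improving interpretability and downstream task performance.

Details Motivation: Vision Transformers perform well across many vision tasks, but structured noise in their feature maps hinders downstream applications such as segmentation and depth estimation.

Contribution: Proposes two new optimization methods: STA (Structured Token Augmentation), which increases token diversity through spatial perturbations, and ANF (Adaptive Noise Filtering), a learnable denoising step between layers. Both are architecture-agnostic.

Method: STA introduces spatial perturbations at the tokenization stage; ANF inserts a learnable denoising mechanism between transformer layers.

Result: On benchmarks including ImageNet, Ade20k, and NYUv2, the models show consistent improvements in visual quality and task performance.

Insight: Structured noise may be one of the performance bottlenecks of ViTs, and lightweight optimization strategies can mitigate it with little added computational burden.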

Abstract: Vision Transformers (ViTs) have demonstrated superior performance across a wide range of computer vision tasks. However, structured noise artifacts in their feature maps hinder downstream applications such as segmentation and depth estimation. We propose two novel and lightweight optimisation techniques, Structured Token Augmentation (STA) and Adaptive Noise Filtering (ANF), to improve interpretability and mitigate these artifacts. STA enhances token diversity through spatial perturbations during tokenisation, while ANF applies learnable inline denoising between transformer layers. These methods are architecture-agnostic and evaluated across standard benchmarks, including ImageNet, Ade20k, and NYUv2. Experimental results show consistent improvements in visual quality and task performance, highlighting the practical effectiveness of our approach.

[50] From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition

Ling Lo,Kelvin C. K. Chan,Wen-Huang Cheng,Ming-Hsuan Yang

Main category: cs.CV

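TL;DR: The paper proposes frame-wise guidance of the denoising process to produce smooth attribute transitions in video while preserving motion dynamics. It also introduces the CAT-Bench benchmark and evaluation metrics, which are used to demonstrate the method's effectiveness.

Details Motivation: Existing models are inconsistent when handling complex temporal changes such as gradual attribute transitions; prompt interpolation in particular struggles to produce smooth transitions.

Contribution: 1. A frame-wise guidance method that extends existing models to smooth attribute transitions; 2. the CAT-Bench benchmark and accompanying metrics for comprehensive evaluation.

Method: Introduces frame-wise guidance during denoising, constructing a data-specific transitional direction for each noisy latent so that the video shifts gradually from the initial attribute to the final one.

Result: Experiments show the method compares favorably with baseline models in visual fidelity, text alignment, and transition smoothness.

Insight: Frame-wise guidance can resolve inconsistencies in gradual attribute transitions while preserving the video's motion dynamics, offering a new approach for video generation.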

Abstract: Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In this work, we propose a simple yet effective method to extend existing models for smooth and consistent attribute transitions, through introducing frame-wise guidance during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions. Code and CATBench are released: https://github.com/lynn-ling-lo/Prompt2Progression.

[51] Anatomically Constrained Transformers for Cardiac Amyloidosis Classification

Alexander Thorley,Agis Chartsias,Jordan Strom,Roberto Lang,Jeremy Slivnick,Jamie O’Driscoll,Rajan Sharma,Dipak Kotecha,Jinming Duan,Alberto Gomez

Main category: cs.CV

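TL;DR: The paper proposes an anatomically constrained transformer for cardiac amyloidosis classification. Restricting the input to the myocardium, embedded as deforming points and image patches, improves classification performance.

Details Motivation: Diagnosis of cardiac amyloidosis (CA) usually relies on clinical features from echocardiograms, but existing neural network models give no assurance that classification is based on clinically relevant features. The paper uses an anatomical constraint so that the model attends only to the region associated with CA.

Contribution: 1. An anatomically constrained transformer whose classification is based only on the myocardium; 2. a self-supervised pre-training scheme that masks and reconstructs only anatomical patches; 3. better performance on CA classification than full-video transformers.

Method: 1. Represents the myocardium as deforming points and sampled image patches, embedded as input tokens; 2. constrains the transformer to the myocardial region; 3. masks and reconstructs only anatomical patches during self-supervised pre-training.

Result: On the CA classification task, the method outperforms full-video transformers and provides an explicit guarantee that classification is based only on the anatomical region.

Insight: Anatomical constraints can improve the reliability and interpretability of medical image classification and apply to both supervised and self-supervised learning.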

Abstract: Cardiac amyloidosis (CA) is a rare cardiomyopathy, with typical abnormalities in clinical measurements from echocardiograms such as reduced global longitudinal strain of the myocardium. An alternative approach for detecting CA is via neural networks, using video classification models such as convolutional neural networks. These models process entire video clips, but provide no assurance that classification is based on clinically relevant features known to be associated with CA. An alternative paradigm for disease classification is to apply models to quantitative features such as strain, ensuring that the classification relates to clinically relevant features. Drawing inspiration from this approach, we explicitly constrain a transformer model to the anatomical region where many known CA abnormalities occur – the myocardium, which we embed as a set of deforming points and corresponding sampled image patches into input tokens. We show that our anatomical constraint can also be applied to the popular self-supervised learning masked autoencoder pre-training, where we propose to mask and reconstruct only anatomical patches. We show that by constraining both the transformer and pre-training task to the myocardium where CA imaging features are localized, we achieve increased performance on a CA classification task compared to full video transformers. Our model provides an explicit guarantee that the classification is focused on only anatomical regions of the echo, and enables us to visualize transformer attention scores over the deforming myocardium.

[52] Learning to Stop: Reinforcement Learning for Efficient Patient-Level Echocardiographic Classification

Woo-Jin Cho Kim,Jorge Oliveira,Arian Beqiri,Alex Thorley,Jordan Strom,Jamie O’Driscoll,Rajan Sharma,Jeremy Slivnick,Roberto Lang,Alberto Gomez,Agisilaos Chartsias

Main category: cs.CV

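TL;DR: Proposes a reinforcement learning method that selects an optimal subset of echocardiographic video clips to make disease classification more efficient, together with an attention-based aggregation mechanism for fusing information across clips.

Details Motivation: Conventional methods either use a single clip, ignoring other information, or process all clips, which is inefficient. The paper uses reinforcement learning to decide dynamically when to stop processing clips, balancing performance against computational cost.

Contribution: 1. A reinforcement-learning-based dynamic clip selection method; 2. an attention-based strategy for fusing information from multiple clips; 3. an AUC of 0.91 on cardiac amyloidosis detection while processing only 30% of the clips.

Method: 1. A reinforcement learning framework trains an agent to decide whether to keep processing clips; 2. an attention mechanism aggregates information from multiple clips; 3. the objective is to reduce both classification uncertainty and computational burden.

Result: On cardiac amyloidosis detection, the AUC reaches 0.91, exceeding both the use of all clips and other benchmark methods.

Insight: Combining dynamic selection with attention-based aggregation can improve both the efficiency and the performance of medical image analysis, which suits settings with limited compute.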

Abstract: Guidelines for transthoracic echocardiographic examination recommend the acquisition of multiple video clips from different views of the heart, resulting in a large number of clips. Typically, automated methods, for instance disease classifiers, either use one clip or average predictions from all clips. Relying on one clip ignores complementary information available from other clips, while using all clips is computationally expensive and may be prohibitive for clinical adoption. To select the optimal subset of clips that maximize performance for a specific task (image-based disease classification), we propose a method optimized through reinforcement learning. In our method, an agent learns to either keep processing view-specific clips to reduce the disease classification uncertainty, or stop processing if the achieved classification confidence is sufficient. Furthermore, we propose a learnable attention-based aggregation method as a flexible way of fusing information from multiple clips. The proposed method obtains an AUC of 0.91 on the task of detecting cardiac amyloidosis using only 30% of all clips, exceeding the performance achieved from using all clips and from other benchmarks.
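The keep-processing-or-stop decision can be pictured with a much simpler stand-in. The sketch below replaces the learned agent with a fixed confidence threshold and the attention-based aggregation with a running mean, so it only illustrates the control flow; `classify_with_early_stop` and `threshold` are hypothetical names, not the paper's.

```python
def classify_with_early_stop(clip_probs, threshold=0.9):
    """Consume per-clip positive-class probabilities in order and stop early.

    Returns (aggregated_probability, clips_used). Aggregation is a plain
    running mean; stopping uses a fixed confidence threshold instead of a
    learned policy.
    """
    if not clip_probs:
        raise ValueError("at least one clip probability is required")
    running_sum = 0.0
    mean_p = 0.0
    used = 0
    for used, p in enumerate(clip_probs, start=1):
        running_sum += p
        mean_p = running_sum / used
        # Confidence in whichever class is currently favoured.
        confidence = max(mean_p, 1.0 - mean_p)
        if confidence >= threshold:
            break
    return mean_p, used


confident_prob, confident_used = classify_with_early_stop([0.95, 0.20, 0.30])
uncertain_prob, uncertain_used = classify_with_early_stop([0.60, 0.70, 0.80])
```

In the first call the first clip is already confident enough, so only one clip is processed; in the second the threshold is never reached and all three clips are used.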

[53] Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation

Bo Yu,Jianhua Yang,Zetao Du,Yan Huang,Chenglong Li,Liang Wang

Main category: cs.CV

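TL;DR: FMISeg is a frequency-domain multi-modal fusion model for language-guided medical image segmentation. It addresses complex lesion morphology and the semantic gap between the vision and language modalities.

Details Motivation: Medical image segmentation is essential for diagnosing pulmonary infectious diseases, but existing methods struggle to make effective use of clinical text reports to improve segmentation accuracy, mainly because of complex lesion morphology and the semantic gap between vision and language.

Contribution: Proposes FMISeg, which enhances visual features with a Frequency-domain Feature Bidirectional Interaction (FFBI) module and suppresses semantically irrelevant information with a Language-guided Frequency-domain Feature Interaction (LFFI) module.

Method: FMISeg is a late fusion model in which linguistic features interact with frequency-domain visual features in the decoder. The FFBI module fuses frequency-domain features, and the LFFI module uses linguistic information to suppress irrelevant visual features.

Result: Experiments on the QaTa-COV19 and MosMedData+ datasets show that FMISeg outperforms existing methods both qualitatively and quantitatively.

Insight: Frequency-domain features combined with language-guided multi-modal interaction can improve medical image segmentation accuracy, particularly for complex lesion morphology and cross-modal semantic alignment.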

Abstract: Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of the medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrated that our method outperforms the state-of-the-art methods qualitatively and quantitatively.

[54] PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

Yufei Han,Bowen Tie,Heng Guo,Youwei Lyu,Si Li,Boxin Shi,Yunpeng Jia,Zhanyu Ma

Main category: cs.CV

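TL;DR: PolGS is a fast reflective surface reconstruction method based on polarimetric Gaussian splatting. By integrating polarimetric constraints, it separates specular and diffuse components and improves reconstruction quality for complex reflective materials.

Details Motivation: Efficient reconstruction of surfaces with complex reflectance is crucial for real-time virtual reality. Existing 3D Gaussian Splatting methods are fast, but their reconstruction quality lags behind implicit neural representations, especially on complex reflective materials.

Contribution: Proposes PolGS, which integrates polarimetric constraints into the 3D Gaussian Splatting framework and reconstructs reflective surfaces in 10 minutes with improved quality.

Method: By integrating polarimetric information, PolGS separates specular and diffuse components and thereby improves surface reconstruction.

Result: Experiments on synthetic and real-world datasets validate the method's effectiveness on complex reflective materials.

Insight: Polarimetric information helps separate surface reflectance components and thus improves reconstruction quality, particularly for highly reflective surfaces.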

Abstract: Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly when recovering surfaces with complex reflectance. To address these problems, we propose PolGS, a Polarimetric Gaussian Splatting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on synthetic and real-world datasets validate the effectiveness of our method.

[55] CAMILA: Context-Aware Masking for Image Editing with Language Alignment

Hyunseung Kim,Chiho Choi,Srikanth Malla,Sai Prahladh Padmanabhan,Saurabh Bagchi,Joon Hee Choi

Main category: cs.CV

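TL;DR: CAMILA is a context-aware image editing method that validates the contextual coherence between instructions and the image, so that only relevant regions are edited and infeasible or contradictory instructions are not executed.

Details Motivation: Existing text-guided image editing models tend to follow every user instruction blindly, so infeasible or contradictory instructions produce nonsensical output. CAMILA aims to solve this problem.

Contribution: Proposes CAMILA, which achieves semantic alignment through context validation, and constructs datasets containing infeasible instructions to evaluate model performance comprehensively.

Method: Context-aware masking that validates the contextual coherence between instructions and the image and selectively edits the relevant regions.

Result: CAMILA outperforms existing models on complex instructions, better preserving image integrity and improving semantic alignment.

Insight: Context validation is key to text-guided image editing; CAMILA demonstrates the importance of executing instructions selectively.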

Abstract: Text-guided image editing allows users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even if those instructions are inherently infeasible or contradictory, often resulting in nonsensical output. To address these challenges, we propose a context-aware method for image editing named CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while ignoring non-executable instructions. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing, incorporating the presence of infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.

[56] Robust RGB-T Tracking via Learnable Visual Fourier Prompt Fine-tuning and Modality Fusion Prompt Generation

Hongtao Yang,Bineng Zhong,Qihua Liang,Zhiruo Zhu,Yaozong Zheng,Ning Li

Main category: cs.CV

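TL;DR: Proposes an efficient RGB-T tracking method based on visual Fourier prompt learning and modality fusion prompt generation, combining spatial-domain and frequency-domain information to improve performance.

Details Motivation: Existing RGB-T tracking methods based on parameter-efficient fine-tuning (PEFT) rely only on spatial-domain information as prompts and overlook frequency-domain information, which limits their performance.

Contribution: Introduces the Visual Fourier Prompt, which uses the FFT to extract frequency-domain information, and proposes the Modality Fusion Prompt Generator, which enables full interaction between multi-modal features.

Method: 1. Extracts RGB and TIR modality features with a symmetric feature extraction encoder with shared parameters; 2. combines spatial-domain and frequency-domain prompts; 3. generates a fused prompt that interacts with each modality.

Result: Achieves strong performance on three RGB-T tracking benchmarks.

Insight: Frequency-domain information plays an important role in RGB-T tracking, and full interaction between multi-modal features can improve performance.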

Abstract: Recently, visual prompt tuning has been introduced to RGB-Thermal (RGB-T) tracking as a parameter-efficient finetuning (PEFT) method. However, these PEFT-based RGB-T tracking methods typically rely solely on spatial domain information as prompts for feature extraction. As a result, they often fail to achieve optimal performance by overlooking the crucial role of frequency-domain information in prompt learning. To address this issue, we propose an efficient Visual Fourier Prompt Tracking (named VFPTrack) method to learn modality-related prompts via Fast Fourier Transform (FFT). Our method consists of a symmetric feature extraction encoder with shared parameters, visual Fourier prompts, and a Modality Fusion Prompt Generator that generates bidirectional interaction prompts through multi-modal feature fusion. Specifically, we first use a frozen feature extraction encoder to extract RGB and thermal infrared (TIR) modality features. Then, we combine the visual prompts in the spatial domain with the frequency domain prompts obtained from the FFT, which allows for the full extraction and understanding of modality features from different domain information. Finally, unlike previous fusion methods, the modality fusion prompt generation module we use combines features from different modalities to generate a fused modality prompt. This modality prompt interacts with each individual modality to fully enable feature interaction across different modalities. Extensive experiments conducted on three popular RGB-T tracking benchmarks show that our method demonstrates outstanding performance.

[57] Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation

Xinhao Zhong,Shuoyang Sun,Xulin Gu,Chenyang Zhu,Bin Chen,Yaowei Wang

Main category: cs.CV

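TL;DR: The paper proposes RD$^3$ and systematically studies how post-evaluation settings affect dataset distillation performance. It finds that performance differences stem mainly from inconsistent evaluation rather than from the quality of the methods themselves, and it provides a standardized benchmark and evaluation protocol.

Details Motivation: Existing decoupled dataset distillation methods use inconsistent protocols in the post-evaluation phase, which hinders progress in the field. The paper aims to resolve this and clarify how evaluation affects performance.

Contribution: Proposes RD$^3$, analyzes the effect of inconsistent evaluation on performance, and establishes a standardized benchmark and evaluation protocol as a basis for fair comparison in future research.

Method: Systematically studies how different post-evaluation settings affect test accuracy, identifies the sources of performance differences, and proposes strategies for improvement.

Result: Much of the performance variation is attributable to inconsistent evaluation rather than to the intrinsic quality of the methods.

Insight: A standardized evaluation protocol is essential for fair comparison in dataset distillation, and future research should pay attention to evaluation consistency.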

Abstract: Dataset distillation aims to generate compact synthetic datasets that enable models trained on them to achieve performance comparable to those trained on full real datasets, while substantially reducing storage and computational costs. Early bi-level optimization methods (e.g., MTT) have shown promising results on small-scale datasets, but their scalability is limited by high computational overhead. To address this limitation, recent decoupled dataset distillation methods (e.g., SRe$^2$L) separate the teacher model pre-training from the synthetic data generation process. These methods also introduce random data augmentation and epoch-wise soft labels during the post-evaluation phase to improve performance and generalization. However, existing decoupled distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress in the field. In this work, we propose Rectified Decoupled Dataset Distillation (RD$^3$), and systematically investigate how different post-evaluation settings affect test accuracy. We further examine whether the reported performance differences across existing methods reflect true methodological advances or stem from discrepancies in evaluation procedures. Our analysis reveals that much of the performance variation can be attributed to inconsistent evaluation rather than differences in the intrinsic quality of the synthetic data. In addition, we identify general strategies that improve the effectiveness of distilled datasets across settings. By establishing a standardized benchmark and rigorous evaluation protocol, RD$^3$ provides a foundation for fair and reproducible comparisons in future dataset distillation research.

[58] Talking Head Generation via AU-Guided Landmark Prediction

Shao-Yu Chang,Jingyi Xu,Hieu Le,Dimitris Samaras

Main category: cs.CV

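TL;DR: The paper proposes a two-stage framework, guided by facial Action Units (AUs), for audio-driven talking head generation with fine-grained expression control.

Details Motivation: Existing methods rely on emotion labels or implicit AU conditioning and cannot control expressions precisely. The paper addresses this with an explicit mapping from AUs to facial landmarks.

Contribution: The main contribution is an explicit mapping from AUs to 2D facial landmarks that enables physically grounded, per-frame expression control; a two-stage framework (motion generation and video synthesis) improves the expression accuracy and stability of the generated video.

Method: In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic video conditioned on a reference image and the landmarks.

Result: Experiments on the MEAD dataset show that the method outperforms existing baselines on multiple metrics.

Insight: Explicit AU-to-landmark modeling is advantageous for expression generation, enabling more precise expression control and more realistic video.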

Abstract: We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs). Unlike prior methods relying on emotion labels or implicit AU conditioning, our model explicitly maps AUs to 2D facial landmarks, enabling physically grounded, per-frame expression control. In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image. This separation of motion and appearance improves expression accuracy, temporal stability, and visual realism. Experiments on the MEAD dataset show that our method outperforms state-of-the-art baselines across multiple metrics, demonstrating the effectiveness of explicit AU-to-landmark modeling for expressive talking head generation.

[59] ExpFace: Exponential Angular Margin Loss for Deep Face Recognition

Jinhui Zheng,Xueyuan Gong

Main category: cs.CV

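TL;DR: The paper proposes ExpFace (Exponential Angular Margin Loss), which uses an angular exponential term as the margin to distinguish clean samples from noisy ones and improve face recognition performance.

Details Motivation: Face recognition is an open-set problem that requires high discriminative power so that intra-class distances stay smaller than inter-class distances. Existing margin-based softmax losses (such as SphereFace, CosFace, and ArcFace) overlook the impact of noisy samples. The authors observe that clean samples cluster in the center region while noisy samples shift toward the peripheral region, which motivates ExpFace.

Contribution: 1. Proposes the ExpFace loss, which adjusts the penalty through an angular exponential term to emphasize clean samples; 2. provides a unified analysis of ExpFace and classical losses; 3. shows experimentally that ExpFace reaches state-of-the-art performance on multiple datasets.

Method: ExpFace introduces an exponential term as the margin in the angular space, applying a larger penalty in the center region and a smaller penalty in the peripheral region, which suppresses noisy samples. The authors analyze ExpFace's similarity and gradient curves and compare them with those of SphereFace and ArcFace.

Result: ExpFace performs well on face recognition, avoids the training instability of SphereFace and the non-monotonicity of ArcFace, and achieves the best performance on multiple benchmarks.

Insight: 1. Noisy samples tend to lie in the peripheral region of the angular space; 2. adjusting the penalty by region helps improve model robustness; 3. a unified framework for analyzing loss functions can guide future loss design.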

Abstract: Face recognition is an open-set problem requiring high discriminative power to ensure that intra-class distances remain smaller than inter-class distances. Margin-based softmax losses, such as SphereFace, CosFace, and ArcFace, have been widely adopted to enhance intra-class compactness and inter-class separability, yet they overlook the impact of noisy samples. By examining the distribution of samples in the angular space, we observe that clean samples predominantly cluster in the center region, whereas noisy samples tend to shift toward the peripheral region. Motivated by this observation, we propose the Exponential Angular Margin Loss (ExpFace), which introduces an angular exponential term as the margin. This design applies a larger penalty in the center region and a smaller penalty in the peripheral region within the angular space, thereby emphasizing clean samples while suppressing noisy samples. We present a unified analysis of ExpFace and classical margin-based softmax losses in terms of margin embedding forms, similarity curves, and gradient curves, showing that ExpFace not only avoids the training instability of SphereFace and the non-monotonicity of ArcFace, but also exhibits a similarity curve that applies penalties in the same manner as the decision boundary in the angular space. Extensive experiments demonstrate that ExpFace achieves state-of-the-art performance. To facilitate future research, we have released the source code at: https://github.com/dfr-code/ExpFace.
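The non-monotonicity attributed to ArcFace can be checked numerically from the standard margin forms, cos(θ + m) for ArcFace and cos(θ) − m for CosFace. The sketch below covers only these two classical losses, since the abstract does not give ExpFace's margin formula; the margin values 0.5 and 0.35 are commonly used defaults assumed here for illustration.

```python
import math


def arcface_logit(theta, m=0.5):
    """Target-class similarity under an additive angular margin: cos(theta + m)."""
    return math.cos(theta + m)


def cosface_logit(theta, m=0.35):
    """Target-class similarity under an additive cosine margin: cos(theta) - m."""
    return math.cos(theta) - m


# cos(theta + m) reaches its minimum at theta = pi - m and rises again beyond it,
# so the penalised similarity is not monotonically decreasing on [0, pi].
turning_point = math.pi - 0.5
arc_at_turn = arcface_logit(turning_point)
arc_at_pi = arcface_logit(math.pi)

# cos(theta) - m keeps decreasing over the whole interval.
cos_at_turn = cosface_logit(turning_point)
cos_at_pi = cosface_logit(math.pi)
```

Here `arc_at_pi` is larger than `arc_at_turn`, meaning a sample at the largest possible angle is penalized less than one at a smaller angle, while the CosFace curve shows no such reversal.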

[60] Logics-Parsing Technical Report

Xiangyang Chen,Shuzhao Li,Xiuwen Zhu,Yongfan Chen,Fan Yang,Cheng Fang,Lin Qu,Xiaoxiao Xu,Hu Wei,Minggang Wu

Main category: cs.CV

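TL;DR: Logics-Parsing is an end-to-end LVLM-based model that uses reinforcement learning to strengthen layout analysis and reading order inference. It supports diverse data types and shows state-of-the-art performance on LogicsParsingBench.

Details Motivation: Existing LVLMs lack an explicit analysis stage for complex document layouts and reading order, which limits their performance on complex documents such as multi-column newspapers and posters.

Contribution: 1. Proposes Logics-Parsing, which uses reinforcement learning to optimize layout analysis and reading order inference; 2. extends the model to diverse data such as chemical formulas and handwritten Chinese characters; 3. introduces the LogicsParsingBench evaluation set.

Method: 1. An end-to-end LVLM that integrates OCR, table recognition, and related functions; 2. reinforcement learning with designed reward mechanisms; 3. diverse data added to supervised fine-tuning.

Result: The model shows state-of-the-art performance on LogicsParsingBench across a range of document analysis scenarios.

Insight: Reinforcement learning reward mechanisms can improve complex layout analysis, and diverse training data strengthens the model's generalization.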

Abstract: Recent advances in Large Vision-Language Models (LVLMs) have spurred significant progress in document parsing. Compared to traditional pipeline-based methods, end-to-end paradigms have shown their excellence in converting PDF images into structured outputs through integrated Optical Character Recognition (OCR), table recognition, mathematical formula recognition and so on. However, the absence of explicit analytical stages for document layouts and reading orders limits the LVLM’s capability in handling complex document types such as multi-column newspapers or posters. To address this limitation, we propose in this report Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model’s versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments conducted on LogicsParsingBench have validated the efficacy and state-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios. Project Page: https://github.com/alibaba/Logics-Parsing

[61] Sex-based Bias Inherent in the Dice Similarity Coefficient: A Model Independent Analysis for Multiple Anatomical Structures

Hartmut Häntze,Myrthe Buser,Alessa Hering,Lisa C. Adams,Keno K. Bressem

Main category: cs.CV

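TL;DR: The study finds that the Dice Similarity Coefficient (DSC) carries a sex-based bias when used to evaluate medical image segmentation: for smaller anatomical structures, size differences lead to systematically lower DSC scores for women, while large structures are barely affected.

Details Motivation: Previous work has examined sex-based differences in models or datasets, but none has investigated the bias that the DSC itself may introduce. The authors quantify how the DSC differs between sexes to make segmentation evaluation fairer.

Contribution: The first study to show a sex-based bias inherent in the DSC for medical image segmentation evaluation, demonstrating that even with errors of equal size, the DSC for small structures is systematically lower for women.

Method: Applies equally sized synthetic errors to manual MRI annotations from 50 participants to simulate an idealized setting, then compares the DSC and normalized DSC between sexes.

Result: The average DSC difference is about 0.03 for small structures and about 0.01 for medium-sized structures, while large structures (lungs and liver) are almost unaffected.

Insight: When the DSC is used to evaluate segmentation models, score differences between sexes may come from bias in the metric itself rather than from real differences in model performance. This matters for fair evaluation in medical image analysis.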

Abstract: Overlap-based metrics such as the Dice Similarity Coefficient (DSC) penalize segmentation errors more heavily in smaller structures. As organ size differs by sex, this implies that a segmentation error of equal magnitude may result in lower DSCs in women due to their smaller average organ volumes compared to men. While previous work has examined sex-based differences in models or datasets, no study has yet investigated the potential bias introduced by the DSC itself. This study quantifies sex-based differences of the DSC and the normalized DSC in an idealized setting independent of specific models. We applied equally-sized synthetic errors to manual MRI annotations from 50 participants to ensure sex-based comparability. Even minimal errors (e.g., a 1 mm boundary shift) produced systematic DSC differences between sexes. For small structures, average DSC differences were around 0.03; for medium-sized structures around 0.01. Only large structures (i.e., lungs and liver) were mostly unaffected, with sex-based DSC differences close to zero. These findings underline that fairness studies using the DSC as an evaluation metric should not expect identical scores between men and women, as the metric itself introduces bias. A segmentation model may perform equally well across sexes in terms of error magnitude, even if observed DSC values suggest otherwise. Importantly, our work raises awareness of a previously underexplored source of sex-based differences in segmentation performance, one that arises not from model behavior but from the metric itself. Recognizing this factor is essential for more accurate and fair evaluations in medical image analysis.
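The size dependence of the DSC is easy to reproduce on toy shapes. The sketch below is not the study's protocol: it uses axis-aligned voxel cubes rather than organ annotations, and the helper names `cube` and `dice` are hypothetical. It applies the same one-voxel shift to a small and a large cube and compares the resulting scores.

```python
from itertools import product


def cube(side, offset=0):
    """Voxel coordinates of an axis-aligned cube, shifted by `offset` along x."""
    return {(x + offset, y, z) for x, y, z in product(range(side), repeat=3)}


def dice(a, b):
    """Dice Similarity Coefficient of two voxel sets: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


# The same one-voxel boundary shift applied to a small and a large structure.
dice_small = dice(cube(10), cube(10, offset=1))
dice_large = dice(cube(40), cube(40, offset=1))
```

For a cube of side s shifted by one voxel, the overlap is (s − 1)·s² voxels, so the DSC equals (s − 1)/s: 0.9 for the small cube and 0.975 for the large one, even though the error magnitude is identical.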

[62] EfficienT-HDR: An Efficient Transformer-Based Framework via Multi-Exposure Fusion for HDR Reconstruction

Yu-Shen Huang,Tzu-Han Chen,Cheng-Yen Hsiao,Shaou-Gang Miaou

Main category: cs.CV

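TL;DR: The paper proposes EfficienT-HDR, a lightweight ViT-based framework that performs efficient HDR reconstruction through multi-exposure fusion, addressing high computational cost and ghosting artifacts.

Details Motivation: High-quality HDR imaging on resource-constrained edge devices is a key challenge; existing methods are computationally expensive and suffer from ghosting.

Contribution: Proposes a lightweight ViT architecture, designs an Intersection-Aware Adaptive Fusion (IAAF) module to suppress ghosting, and uses IRE, DyT, and E-MSDC to keep complexity low.

Method: Converts input images to the YCbCr color space, suppresses ghosting with IAAF, and reduces computational complexity with IRE, DyT, and E-MSDC.

Result: Compared to the baseline, the main version reduces FLOPs by about 67% and speeds up inference by more than 5x on CPU and 2.5x on an edge device.

Insight: Combining a lightweight design with a ghosting suppression module can raise efficiency substantially while maintaining high performance, which suits edge devices.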

Abstract: Achieving high-quality High Dynamic Range (HDR) imaging on resource-constrained edge devices is a critical challenge in computer vision, as its performance directly impacts downstream tasks such as intelligent surveillance and autonomous driving. Multi-Exposure Fusion (MEF) is a mainstream technique to achieve this goal; however, existing methods generally face the dual bottlenecks of high computational costs and ghosting artifacts, hindering their widespread deployment. To this end, this study proposes a light-weight Vision Transformer architecture designed explicitly for HDR reconstruction to overcome these limitations. This study is based on the Context-Aware Vision Transformer and begins by converting input images to the YCbCr color space to separate luminance and chrominance information. It then employs an Intersection-Aware Adaptive Fusion (IAAF) module to suppress ghosting effectively. To further achieve a light-weight design, we introduce Inverted Residual Embedding (IRE) and Dynamic Tanh (DyT), and propose Enhanced Multi-Scale Dilated Convolution (E-MSDC) to reduce computational complexity at multiple levels. Our study ultimately contributes two model versions: a main version for high visual quality and a light-weight version with advantages in computational efficiency, both of which achieve an excellent balance between performance and image quality. Experimental results demonstrate that, compared to the baseline, the main version reduces FLOPs by approximately 67% and increases inference speed by more than fivefold on CPU and 2.5 times on an edge device. These results confirm that our method provides an efficient and ghost-free HDR imaging solution for edge devices, demonstrating versatility and practicality across various dynamic scenarios.
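The first step, separating luminance from chrominance, can be sketched with a standard RGB to YCbCr conversion. The abstract does not say which YCbCr variant is used, so the full-range BT.601 coefficients (as in JPEG/JFIF) are an assumption here, and `rgb_to_ycbcr` is a hypothetical helper name.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range BT.601 YCbCr (JFIF convention).

    Y carries luminance; Cb and Cr carry chrominance, centred on 128.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


white = rgb_to_ycbcr(255, 255, 255)  # neutral colour: chroma stays at 128
red = rgb_to_ycbcr(255, 0, 0)        # saturated colour: Cr moves well above 128
```

A neutral pixel such as white maps to maximum luminance with both chroma channels at 128, which is why processing the Y channel separately isolates brightness differences between exposures from color information.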

[63] BiTAA: A Bi-Task Adversarial Attack for Object Detection and Depth Estimation via 3D Gaussian Splatting

Yixun Zhang,Feng Zhou,Jianqin Yin

Main category: cs.CV

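TL;DR: BiTAA is a bi-task adversarial attack based on 3D Gaussian Splatting that degrades object detection and depth estimation at the same time.

Details Motivation: Existing adversarial attacks are mostly designed per task, lack a mechanism for controllable depth bias, and have no standard for evaluating cross-task performance. BiTAA aims to fill this gap and study the interaction between object detection and depth estimation.

Contribution: 1. Proposes BiTAA, a framework that supports bi-task attacks; 2. designs a composite loss that achieves detection suppression and controllable depth bias; 3. introduces a unified cross-task evaluation protocol.

Method: Builds the attack framework on 3D Gaussian Splatting, supports both full-image and patch attacks, optionally uses expectation-over-transformation (EOT) for physical realism, and optimizes both detection and depth through the composite loss.

Result: Experiments show consistent cross-task degradation and reveal an asymmetry between detection-to-depth and depth-to-detection transfer.

Insight: Multi-task camera perception carries practical risks, and cross-task-aware defenses are needed.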

Abstract: Camera-based perception is critical to autonomous driving yet remains vulnerable to task-specific adversarial manipulations in object detection and monocular depth estimation. Most existing 2D/3D attacks are developed in task silos, lack mechanisms to induce controllable depth bias, and offer no standardized protocol to quantify cross-task transfer, leaving the interaction between detection and depth underexplored. We present BiTAA, a bi-task adversarial attack built on 3D Gaussian Splatting that yields a single perturbation capable of simultaneously degrading detection and biasing monocular depth. Specifically, we introduce a dual-model attack framework that supports both full-image and patch settings and is compatible with common detectors and depth estimators, with optional expectation-over-transformation (EOT) for physical reality. In addition, we design a composite loss that couples detection suppression with a signed, magnitude-controlled log-depth bias within regions of interest (ROIs), enabling controllable near or far misperception while maintaining stable optimization across tasks. We also propose a unified evaluation protocol with cross-task transfer metrics and real-world evaluations, showing consistent cross-task degradation and a clear asymmetry between detection-to-depth and depth-to-detection transfer. The results highlight practical risks for multi-task camera-only perception and motivate cross-task-aware defenses in autonomous driving scenarios.

[64] StrCGAN: A Generative Framework for Stellar Image Restoration

Shantanusinh Parmar

Main category: cs.CV

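TL;DR: StrCGAN is a generative model for astronomical image restoration. It adds 3D convolutions, multi-spectral fusion, and astrophysical regularization modules to the CycleGAN framework to reconstruct high-quality images of celestial objects.

Details Motivation: The limited resolution and quality of small-telescope observations make it hard to recover high-fidelity detail in astronomical images, and conventional GAN models such as CycleGAN, which are restricted to 2D mappings, tend to distort the morphology of celestial objects.

Contribution: 1. 3D convolutional layers that capture spatial correlations; 2. multi-spectral fusion that aligns the optical and near-infrared domains; 3. astrophysical regularization modules that preserve stellar morphology.

Method: Extends CycleGAN with 3D convolutions, multi-spectral data fusion, and astrophysical constraints, using data from multi-mission all-sky surveys spanning optical to near-infrared as ground-truth references during training.

Result: StrCGAN produces images that are visually sharper and more physically consistent, outperforming standard GAN models on astronomical image enhancement.

Insight: Combining domain knowledge (such as astrophysical constraints) with multi-modal data (multi-spectral) can improve the fidelity and practical usefulness of generative models.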

Abstract: We introduce StrCGAN (Stellar Cyclic GAN), a generative model designed to enhance low-resolution astrophotography images. Our goal is to reconstruct high-fidelity ground truth-like representations of celestial objects, a task that is challenging due to the limited resolution and quality of small-telescope observations such as the MobilTelesco dataset. Traditional models such as CycleGAN provide a foundation for image-to-image translation but are restricted to 2D mappings and often distort the morphology of stars and galaxies. To overcome these limitations, we extend the CycleGAN framework with three key innovations: 3D convolutional layers to capture volumetric spatial correlations, multi-spectral fusion to align optical and near-infrared (NIR) domains, and astrophysical regularization modules to preserve stellar morphology. Ground-truth references from multi-mission all-sky surveys spanning optical to NIR guide the training process, ensuring that reconstructions remain consistent across spectral bands. Together, these components allow StrCGAN to generate reconstructions that are not only visually sharper but also physically consistent, outperforming standard GAN models in the task of astrophysical image enhancement.

[65] ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection

Tai-Ming Huang,Wei-Tung Lin,Kai-Lung Hua,Wen-Huang Cheng,Junichi Yamagishi,Jun-Cheng Chen

Main category: cs.CV

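TL;DR: ThinkFake is a method based on a multimodal large language model (MLLM) that uses a reasoning prompt and reinforcement learning to deliver interpretable AI-generated image detection, with strong results on several benchmarks.

Details Motivation: As AI-generated images grow more realistic, misinformation and privacy violations worsen, creating an urgent need for accurate and interpretable detection. Most existing methods are binary classifiers or depend on supervised fine-tuning, which limits generalization.

Contribution: 1. Proposes the ThinkFake framework, which combines a reasoning prompt with reinforcement learning; 2. designs a structured detection pipeline to improve reasoning quality; 3. validates effectiveness and generalization on the GenImage and LOKI benchmarks.

Method: 1. Uses a multimodal large language model (MLLM); 2. introduces a forgery reasoning prompt and GRPO reinforcement learning; 3. improves reasoning through a structured detection pipeline.

Result: Outperforms existing methods on the GenImage benchmark and shows strong generalization in zero-shot evaluation on LOKI.

Insight: 1. Combining reasoning prompts with reinforcement learning can improve both interpretability and generalization; 2. a structured detection pipeline effectively raises reasoning quality.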

Abstract: The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations, highlighting the urgent need for accurate and interpretable detection methods. While existing approaches have made progress, most rely on binary classification without explanations or depend heavily on supervised fine-tuning, resulting in limited generalization. In this paper, we propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. Our method leverages a Multimodal Large Language Model (MLLM) equipped with a forgery reasoning prompt and is trained using Group Relative Policy Optimization (GRPO) reinforcement learning with carefully designed reward functions. This design enables the model to perform step-by-step reasoning and produce interpretable, structured outputs. We further introduce a structured detection pipeline to enhance reasoning quality and adaptability. Extensive experiments show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark. These results validate our framework’s effectiveness and robustness. Code will be released upon acceptance.
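The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than by a learned value function. A minimal sketch of that group-relative normalization follows; the reward values, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, and the paper's actual reward functions are not specified in the abstract.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    rewards: one scalar reward per response sampled for the same input.
    eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that beat the group average receive a positive advantage and are reinforced; a group in which every response earns the same reward yields all-zero advantages and therefore no learning signal.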

[66] PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents

Filippo Ziliotto,Jelin Raphael Akkara,Alessandro Daniele,Lamberto Ballan,Luciano Serafini,Tommaso Campari

Main category: cs.CV

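TL;DR: The paper introduces PersONAL, a benchmark for studying personalization in embodied agents. It contains over 2,000 high-quality episodes in which an agent must find a specific user's objects in a home environment from a natural-language query.

Details Motivation: Deploying embodied agents in realistic human-centered settings such as households remains difficult, particularly when it comes to modeling individual preferences and behaviors.

Contribution: Proposes the PersONAL benchmark, which supports navigation and object localization tasks and provides a large set of episodes with natural-language descriptions as a tool for research on personalized embodied agents.

Method: Builds over 2,000 episodes on the HM3D dataset in which agents complete tasks from natural-language queries, with two evaluation modes: active navigation and object grounding.

Result: Experiments show a substantial gap between existing methods and human performance, exposing weaknesses in how agents perceive, reason over, and remember personalized information.

Insight: Personalized tasks are a key direction for bringing embodied agents into real-world settings, and further work on perception and reasoning is needed.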

Abstract: Recent advances in Embodied AI have enabled agents to perform increasingly complex tasks and adapt to diverse environments. However, deploying such agents in realistic human-centered scenarios, such as domestic households, remains challenging, particularly due to the difficulty of modeling individual human preferences and behaviors. In this work, we introduce PersONAL (PERSonalized Object Navigation And Localization), a comprehensive benchmark designed to study personalization in Embodied AI. Agents must identify, retrieve, and navigate to objects associated with specific users, responding to natural-language queries such as “find Lily’s backpack”. PersONAL comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. Each episode includes a natural-language scene description with explicit associations between objects and their owners, requiring agents to reason over user-specific semantics. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes. Experiments with state-of-the-art baselines reveal a substantial gap to human performance, highlighting the need for embodied agents capable of perceiving, reasoning, and memorizing over personalized information, paving the way towards real-world assistive robots.

[67] FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models

Xin Wang,Jie Li,Zejia Weng,Yixu Wang,Yifeng Gao,Tianyu Pang,Chao Du,Yan Teng,Yingchun Wang,Zuxuan Wu,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

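TL;DR: The paper proposes FreezeVLA, a new adversarial attack on vision-language-action (VLA) models that uses min-max bi-level optimization to generate adversarial images which make the model ignore subsequent instructions, causing the robot to "freeze". Experiments report an attack success rate of 76.2%, and the adversarial images transfer strongly.

Details Motivation: VLA models perform well on robotic tasks, but their safety and robustness to adversarial attacks remain understudied. The paper exposes a "freezing" vulnerability that could cause a robot to fail during critical tasks, underlining the need for safety research.

Contribution: 1. Identifies and formalizes the "action-freezing" vulnerability of VLA models; 2. proposes the FreezeVLA attack framework, which generates adversarial examples through bi-level optimization; 3. experimentally verifies high success rates and transferability across multiple VLA models and tasks.

Method: FreezeVLA generates adversarial images through min-max bi-level optimization, with the goal of making the VLA model ignore subsequent instructions. The optimization objectives include maximizing the freezing effect and minimizing the adversarial perturbation.

Result: Across three VLA models and four robotic benchmarks, the average attack success rate reaches 76.2%, significantly higher than existing methods. The adversarial images also transfer across language instructions.

Insight: The study reveals a serious safety risk in VLA models and a realistic adversarial threat, and calls for more robust defenses to ensure the safe deployment of robotic systems.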

Abstract: Vision-Language-Action (VLA) models are driving rapid progress in robotics by enabling agents to interpret multimodal inputs and execute complex, long-horizon tasks. However, their safety and robustness against adversarial attacks remain largely underexplored. In this work, we identify and formalize a critical adversarial vulnerability in which adversarial images can “freeze” VLA models and cause them to ignore subsequent instructions. This threat effectively disconnects the robot’s digital mind from its physical actions, potentially inducing inaction during critical interventions. To systematically study this vulnerability, we propose FreezeVLA, a novel attack framework that generates and evaluates action-freezing attacks via min-max bi-level optimization. Experiments on three state-of-the-art VLA models and four robotic benchmarks show that FreezeVLA attains an average attack success rate of 76.2%, significantly outperforming existing methods. Moreover, adversarial images generated by FreezeVLA exhibit strong transferability, with a single image reliably inducing paralysis across diverse language prompts. Our findings expose a critical safety risk in VLA models and highlight the urgent need for robust defense mechanisms.

[68] Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection

Yunqing Hu,Zheming Yang,Chang Zhao,Wen Ji

Main category: cs.CV

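TL;DR: The paper proposes an adaptive, semantically enhanced edge-cloud collaborative object detection method based on a multimodal large language model (MLLM). By dynamically adjusting the parameters of the edge detector, it balances detection accuracy and efficiency in complex scenes.

Details Motivation: Traditional object detection degrades in complex scenes such as low light and heavy occlusion because it lacks high-level semantic understanding, so semantic information is needed to strengthen detection.

Contribution: 1) Proposes an MLLM-based adaptive semantic enhancement method for edge-cloud collaborative object detection; 2) designs instruction fine-tuning and a dynamic mapping mechanism that turn semantic information into adjustment signals for the edge detector; 3) dynamically chooses, within the edge-cloud framework, between cloud-side semantic enhancement and directly outputting edge detection results.

Method: 1) Uses instruction fine-tuning so that the MLLM generates structured scene descriptions; 2) designs an adaptive mapping mechanism that dynamically converts semantic information into parameter adjustment signals; 3) decides, based on confidence, whether to invoke cloud-side semantic enhancement.

Result: In low-light and heavily occluded scenes, latency drops by more than 79% and computational cost by 70%, while detection accuracy is maintained.

Insight: The semantic understanding of multimodal large language models can significantly improve object detection in complex scenes, and an edge-cloud collaborative framework keeps it efficient.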

Abstract: Traditional object detection methods face performance degradation challenges in complex scenarios such as low-light conditions and heavy occlusions due to a lack of high-level semantic understanding. To address this, this paper proposes an adaptive guidance-based semantic enhancement edge-cloud collaborative object detection method leveraging Multimodal Large Language Models (MLLM), achieving an effective balance between accuracy and efficiency. Specifically, the method first employs instruction fine-tuning to enable the MLLM to generate structured scene descriptions. It then designs an adaptive mapping mechanism that dynamically converts semantic information into parameter adjustment signals for edge detectors, achieving real-time semantic enhancement. Within an edge-cloud collaborative inference framework, the system automatically selects between invoking cloud-based semantic guidance or directly outputting edge detection results based on confidence scores. Experiments demonstrate that the proposed method effectively enhances detection accuracy and efficiency in complex scenes. Specifically, it can reduce latency by over 79% and computational cost by 70% in low-light and highly occluded scenes while maintaining accuracy.
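The confidence-based choice between edge output and cloud guidance can be pictured as a simple router. The sketch below is an assumption about how such a rule might be written: the function name `route_frame`, the use of the lowest per-detection confidence, and the threshold value are all hypothetical, since the abstract does not state the actual decision rule.

```python
def route_frame(edge_detections, confidence_threshold=0.6):
    """Decide where the final result for one frame should come from.

    edge_detections: list of (label, confidence) pairs from the edge detector.
    Returns "edge" when every detection clears the threshold, else "cloud".
    """
    if not edge_detections:
        # Assumption: an empty result is treated as a hard scene and escalated.
        return "cloud"
    lowest = min(confidence for _, confidence in edge_detections)
    return "edge" if lowest >= confidence_threshold else "cloud"
```

Under a rule like this, easy frames never leave the device, which is where latency and compute savings of the kind reported would come from.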

[69] Generalized Shortest Path-based Superpixels for 3D Spherical Image Segmentation

Rémi Giraud,Rodrigo Borba Pinheiro,Yannick Berthoumieu

Main category: cs.CV

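TL;DR: The paper proposes SphSPS, a new superpixel method dedicated to segmenting 360-degree spherical or omnidirectional images, improving on methods designed for standard 2D planar images when applied to spherical data.

Details Motivation: With wide-angle capture devices becoming common, computer vision needs fast and accurate analysis methods, and traditional superpixel segmentation performs poorly because it ignores the geometry of spherical images.

Contribution: 1. Proposes SphSPS, a shortest-path-based superpixel method; 2. generalizes the notion of shortest path to the 3D spherical space; 3. improves segmentation accuracy and superpixel shape regularity; 4. proposes a global regularity metric for the spherical space.

Method: Accounts for the geometry of the spherical acquisition space when computing the shortest path between a pixel and a superpixel center, and uses it to extract effective clustering features.

Result: On the standard 360-degree spherical panorama segmentation dataset and on synthetic omnidirectional road images, SphSPS significantly outperforms existing methods in segmentation accuracy, robustness to noise, and regularity.

Insight: The geometry of spherical images is crucial to segmentation quality. Traditional 2D methods cannot model the spherical space, and SphSPS fills this gap.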

Abstract: The growing use of wide-angle image capture devices and the need for fast and accurate image analysis in computer vision have reinforced the need for dedicated under-representation approaches. Most recent decomposition methods segment an image into a small number of irregular homogeneous regions, called superpixels. Nevertheless, these approaches are generally designed to segment standard 2D planar images, i.e., captured with a 90° angle of view without distortion. In this work, we introduce a new general superpixel method called SphSPS (for Spherical Shortest Path-based Superpixels), dedicated to wide 360° spherical or omnidirectional images. Our method respects the geometry of the 3D spherical acquisition space and generalizes the notion of shortest path between a pixel and a superpixel center to quickly extract relevant clustering features. We demonstrate that considering the geometry of the acquisition space when computing the shortest path makes it possible to jointly improve the segmentation accuracy and the shape regularity of superpixels. To evaluate this regularity aspect, we also generalize a global regularity metric to the spherical space, addressing the limitations of the only existing spherical compactness measure. Finally, the proposed SphSPS method is validated on the reference 360° spherical panorama segmentation dataset and on synthetic road omnidirectional images. Our method significantly outperforms both planar and spherical state-of-the-art approaches in terms of segmentation accuracy, robustness to noise, and regularity, providing a very interesting tool for superpixel-based applications on 360° images.
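Why planar distances mislead on spherical images can be seen with a small sketch. On the sphere, the shortest path between two points is a great-circle arc, so the code below maps equirectangular pixels to unit vectors and measures the arc length between them. The pixel-to-angle convention and the function names are assumptions for illustration; this is not the SphSPS feature extraction itself, which builds clustering features along such paths.

```python
import math

def pixel_to_unit_vector(row, col, height, width):
    """Map an equirectangular pixel center to a point on the unit sphere."""
    lon = (col + 0.5) / width * 2 * math.pi - math.pi   # longitude in [-pi, pi)
    lat = math.pi / 2 - (row + 0.5) / height * math.pi  # latitude in (-pi/2, pi/2)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def geodesic_distance(p, q, height, width):
    """Great-circle distance in radians between pixels p=(row, col) and q."""
    a = pixel_to_unit_vector(p[0], p[1], height, width)
    b = pixel_to_unit_vector(q[0], q[1], height, width)
    dot = sum(x * y for x, y in zip(a, b))
    # Clamp to the valid acos domain to absorb floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot)))
```

For a 180 x 360 image, the pixels at columns 0 and 359 of the same row are 359 columns apart in the plane but only about one degree apart on the sphere, which is the kind of wrap-around a planar superpixel method gets wrong.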

[70] Efficient Cell Painting Image Representation Learning via Cross-Well Aligned Masked Siamese Network

Pin-Jui Huang,Yu-Hsuan Liao,SooHeon Kim,NoSeong Park,JongBae Park,DongMyung Shin

Main category: cs.CV

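TL;DR: The paper proposes CWA-MSN, a new representation learning method for cell images that uses a cross-well aligned masked siamese network to address batch effects while being more efficient in both data volume and model size.

Details Motivation: Current self-supervised and contrastive learning methods struggle with batch effects in cell images and usually require large models or large datasets. The paper aims for a more efficient solution.

Contribution: Proposes the CWA-MSN framework, which achieves batch-robust representation learning through cross-well alignment and a masked siamese network, improving the extraction of biologically meaningful features.

Method: Combines a cross-well alignment strategy with a masked siamese network so that embeddings of cells under the same perturbation are aligned across different wells, yielding semantically consistent representations.

Result: On a gene-gene relationship retrieval task, CWA-MSN clearly outperforms OpenPhenom and CellCLIP, by 29% and 9% respectively, with less data or a smaller model.

Insight: Cross-well alignment effectively mitigates batch effects, and a masked siamese network can still learn efficient representations with little data and a small model, which offers a new avenue for drug discovery.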

Abstract: Computational models that predict cellular phenotypic responses to chemical and genetic perturbations can accelerate drug discovery by prioritizing therapeutic hypotheses and reducing costly wet-lab iteration. However, extracting biologically meaningful and batch-robust cell painting representations remains challenging. Conventional self-supervised and contrastive learning approaches often require a large-scale model and/or a huge amount of carefully curated data, still struggling with batch effects. We present Cross-Well Aligned Masked Siamese Network (CWA-MSN), a novel representation learning framework that aligns embeddings of cells subjected to the same perturbation across different wells, enforcing semantic consistency despite batch effects. Integrated into a masked siamese architecture, this alignment yields features that capture fine-grained morphology while remaining data- and parameter-efficient. For instance, in a gene-gene relationship retrieval benchmark, CWA-MSN outperforms the state-of-the-art publicly available self-supervised (OpenPhenom) and contrastive learning (CellCLIP) methods, improving the benchmark scores by +29% and +9%, respectively, while training on substantially fewer data (e.g., 0.2M images for CWA-MSN vs. 2.2M images for OpenPhenom) or smaller model size (e.g., 22M parameters for CWA-MSN vs. 1.48B parameters for CellCLIP). Extensive experiments demonstrate that CWA-MSN is a simple and effective way to learn cell image representation, enabling efficient phenotype modeling even under limited data and parameter budgets.

[71] Aerial-Ground Image Feature Matching via 3D Gaussian Splatting-based Intermediate View Rendering

Jiangxue Yu,Hui Wang,San Jiang,Xing Zhang,Dejin Zhang,Qingquan Li

Main category: cs.CV

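TL;DR: The paper proposes generating intermediate views with 3D Gaussian Splatting to address the perspective distortion that viewpoint changes cause in aerial-ground image matching, significantly increasing the number and quality of matches.

Details Motivation: Combining aerial and ground images is promising for 3D modeling of complex scenes, but perspective distortion from viewpoint changes is the main obstacle to reliable matching, so a method to alleviate it is needed.

Contribution: The main contribution is reducing viewpoint-induced perspective distortion through intermediate view generation: a rendering method based on 3D Gaussian Splatting, combined with match transfer, enables high-quality feature matching between aerial and ground images.

Method: Three steps: 1) reconstruct a sparse model with incremental SfM; 2) generate intermediate views with 3D Gaussian Splatting; 3) match aerial and ground images by transferring correspondences through the intermediate views.

Result: Experiments show a clear increase in the number of initial and refined matches, enough to support accurate incremental SfM reconstruction and complete 3D Gaussian Splatting scene rendering.

Insight: Intermediate view generation is an effective strategy for multi-view image matching, and 3D Gaussian Splatting delivers high-quality rendering for this kind of task.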

Abstract: The integration of aerial and ground images has been a promising solution in 3D modeling of complex scenes, but it is seriously restricted by the difficulty of finding reliable correspondences. The primary contribution of this study is a feature matching algorithm for aerial and ground images, whose core idea is to generate intermediate views to alleviate perspective distortions caused by the extensive viewpoint changes. First, by using aerial images only, sparse models are reconstructed through an incremental SfM (Structure from Motion) engine due to their large scene coverage. Second, 3D Gaussian Splatting is adopted for scene rendering by taking as inputs sparse points and oriented images. For accurate view rendering, a render viewpoint determination algorithm is designed by using the oriented camera poses of aerial images, which is used to generate high-quality intermediate images that can bridge the gap between aerial and ground images. Third, with the aid of intermediate images, reliable feature matching is conducted for match pairs from render-aerial and render-ground images, and final matches can be generated by transmitting correspondences through intermediate views. By using real aerial and ground datasets, the proposed solution has been validated in terms of feature matching and scene rendering and compared comprehensively with widely used methods. The experimental results demonstrate that the proposed solution can provide reliable feature matches for aerial and ground images with an obvious increase in the number of initial and refined matches, and it can provide enough matches to achieve accurate ISfM reconstruction and complete 3DGS-based scene rendering.
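The third step, transmitting correspondences through an intermediate view, amounts to chaining two sets of matches that share keypoints in the rendered image. A minimal sketch under stated assumptions: matches are lists of 2D pixel coordinates, keypoints from the two matching runs are linked when they fall within a hypothetical pixel tolerance `tol`, and the function name `transfer_matches` is illustrative rather than taken from the paper.

```python
import math

def transfer_matches(aerial_render, render_ground, tol=1.5):
    """Chain aerial->render and render->ground matches into aerial->ground.

    aerial_render: list of (aerial_xy, render_xy) pairs.
    render_ground: list of (render_xy, ground_xy) pairs.
    Two render keypoints are treated as the same point when they lie
    within `tol` pixels of each other.
    """
    final = []
    for aerial_xy, render_xy in aerial_render:
        # Nearest render keypoint among the render-ground matches.
        best = min(render_ground,
                   key=lambda m: math.dist(render_xy, m[0]),
                   default=None)
        if best is not None and math.dist(render_xy, best[0]) <= tol:
            final.append((aerial_xy, best[1]))
    return final
```

A real pipeline would follow this with geometric verification to reject outliers, which the abstract refers to as refined matches.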

[72] CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation

Miren Samaniego,Igor Rodriguez,Elena Lazkano

Main category: cs.CV

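TL;DR: CapStARE is a capsule-based spatiotemporal architecture for efficient and robust gaze estimation, combining a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders for slow and rapid gaze dynamics.

Details Motivation: Current gaze estimation methods fall short in complex scenes and lack robustness and real-time capability.

Contribution: Proposes a modular design that combines capsule networks with spatiotemporal modeling, significantly improving performance while reducing the parameter count.

Method: Uses ConvNeXt as the backbone, introduces capsule structure with attention routing, and pairs it with dual GRU decoders that handle gaze changes with different dynamics.

Result: Reaches state-of-the-art performance on several datasets (ETH-XGaze, MPIIFaceGaze, and others) with real-time inference (<10 ms).

Insight: The capsule structure and the dual GRU decoders model part-whole relationships effectively and offer better interpretability.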

Abstract: We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06) and human-robot interaction scenarios in RT-GENE (4.76), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found on: https://github.com/toukapy/capsStare

[73] GS-RoadPatching: Inpainting Gaussians via 3D Searching and Placing for Driving Scenes

Guo Chen,Jiarun Liu,Sicong Du,Chenming Wu,Deqi Li,Shi-Sheng Huang,Guofeng Zhang,Sheng Yang

Main category: cs.CV

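TL;DR: GS-RoadPatching is a driving-scene inpainting method based on 3D Gaussian Splatting (3DGS) that completes scenes efficiently by searching and substituting in 3D space, avoiding the limitations of conventional 2D-view-based methods.

Details Motivation: Existing 3DGS inpainting methods rely on 2D-view generative models (diffusion models or GANs) to predict missing regions, which falls short in spatial-temporal consistency and efficiency. The paper proposes completing and editing directly in the 3DGS modality, avoiding cross-modal consistency problems and the cost of retraining Gaussians.

Contribution: 1. Proposes a 3DGS-based substitutional inpainting method that searches for and substitutes similar patches directly in 3D space; 2. introduces feature-embedded 3DGS scenes and a multi-scale local context abstraction method; 3. designs a simple and effective substitution-and-fusion optimization strategy.

Method: 1. Build feature-embedded 3DGS scenes and extract multi-scale local context; 2. efficiently search for candidate patches in 3D space; 3. apply a substitution-and-fusion optimization strategy for visual harmony.

Result: Experiments on multiple public datasets show that the method outperforms baselines in quality and efficiency, performing best in driving scenes. Experiments in general scenes also confirm its broad applicability.

Insight: The highly repetitive patterns in driving scenes share multi-modal similarities in the implicit 3DGS feature space, which makes them well suited to efficient inpainting through structural matching.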

Abstract: This paper presents GS-RoadPatching, an inpainting method for driving scene completion by referring to completely reconstructed regions, which are represented by 3D Gaussian Splatting (3DGS). Unlike existing 3DGS inpainting methods that perform generative completion relying on 2D perspective-view-based diffusion or GAN models to predict limited appearance or depth cues for missing regions, our approach enables substitutional scene inpainting and editing directly through the 3DGS modality, extricating it from requiring spatial-temporal consistency of 2D cross-modals and eliminating the need for time-intensive retraining of Gaussians. Our key insight is that the highly repetitive patterns in driving scenes often share multi-modal similarities within the implicit 3DGS feature space and are particularly suitable for structural matching to enable effective 3DGS-based substitutional inpainting. Practically, we construct feature-embedded 3DGS scenes to incorporate a patch measurement method for abstracting local context at different scales and, subsequently, propose a structural search method to find candidate patches in 3D space effectively. Finally, we propose a simple yet effective substitution-and-fusion optimization for better visual harmony. We conduct extensive experiments on multiple publicly available datasets to demonstrate the effectiveness and efficiency of our proposed method in driving scenes, and the results validate that our method achieves state-of-the-art performance compared to the baseline methods in terms of both quality and interoperability. Additional experiments in general scenes also demonstrate the applicability of the proposed 3D inpainting strategy. The project page and code are available at: https://shanzhaguoo.github.io/GS-RoadPatching/

[74] When Words Can’t Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset

Sarmistha Das,R E Zera Marveen Lyngkhoi,Kirtan Jain,Vinayak Goyal,Sriparna Saha,Manish Gupta

Main category: cs.CV

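TL;DR: The paper proposes a video-based user complaint text generation task (CoD-V) and introduces ComVID, a multimodal video complaint dataset. With a new evaluation metric, CR, and a multimodal RAG model built on VideoLLaMA2-7b, it demonstrates effectiveness on complaint generation.

Details Motivation: Existing complaint mining research relies mainly on text, but users often struggle to express a complaint clearly in words, whereas a video shows the problem directly. The paper therefore uses video to help users generate more accurate complaint text.

Contribution: 1. Proposes the new CoD-V task; 2. releases the ComVID dataset, with 1,175 complaint videos and their descriptions; 3. proposes the CR evaluation metric; 4. designs a VideoLLaMA2-7b model with embedded multimodal RAG.

Method: Uses multimodal retrieval-augmented generation (RAG) with the VideoLLaMA2-7b model to generate complaint text while accounting for the user's emotional state. Evaluation uses several metrics, including METEOR and perplexity.

Result: The study shows that the proposed method does better on complaint generation than standard video summarization and description approaches, and the CR metric confirms its effectiveness.

Insight: Video has a distinctive advantage for expressing complaints, and multimodal models capture user emotion and problem details better, opening a new direction for complaint mining.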

Abstract: While there exists a lot of work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as ‘worst product’ paired with a 5-second video depicting a broken headphone with the right earcup). This paper formulates a new task in the field of complaint mining to aid the common users’ need to write an expressive complaint, which is Complaint Description from Videos (CoD-V) (e.g., to help the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos and the corresponding descriptions, also annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that discriminates the proposed (CoD-V) task against standard video summary generation and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG) embedded VideoLLaMA2-7b model, designed to generate complaints while accounting for the user’s emotional state. We conduct a comprehensive evaluation of several Video Language Models on several tasks (pre-trained and fine-tuned versions) with a range of established evaluation metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction to provide a platform for users to express complaints through video. Dataset and resources are available at: https://github.com/sarmistha-D/CoD-V.

[75] SynchroRaMa: Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding

Phyo Thet Yee,Dimitrios Kollias,Sudeepta Mishra,Abhinav Dhall

Main category: cs.CV

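TL;DR: SynchroRaMa is a multi-modal emotion embedding framework that combines emotional signals from text and audio to generate more expressive and realistic talking face videos, with strong head motion and lip synchronization.

Details Motivation: Most existing methods rely on a single-modality emotion embedding and cannot capture nuanced affective cues. They also condition on a single reference image, which makes dynamic actions or attribute changes hard to represent.

Contribution: Proposes a multi-modal emotion embedding (combining text and audio emotional signals), an audio-to-motion (A2M) module, and the use of LLM-generated scene descriptions to better capture dynamic semantics.

Method: Builds the emotion embedding from audio and text sentiment analysis, uses the A2M module for lip synchronization and natural head motion, and adds LLM-generated scene descriptions to improve dynamic behavior.

Result: On benchmark datasets, SynchroRaMa outperforms existing methods in image quality, expression preservation, and motion realism, and a user study confirms higher naturalness and smoothness.

Insight: Combining multi-modal emotion embeddings with dynamic scene descriptions significantly improves the expressiveness and temporal consistency of talking face generation.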

Abstract: Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image) for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single reference image, restricting the model’s ability to represent dynamic changes in actions or attributes across time. To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features), enabling the generation of talking face videos with richer and more authentic emotional expressiveness and fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa incorporates scene descriptions generated by a Large Language Model (LLM) as additional textual input, enabling it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism. A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. Our project page is available at https://novicemm.github.io/synchrorama.

[76] OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

Pei Liu,Hongliang Lu,Haichao Liu,Haipeng Liu,Xin Liu,Ruoyu Yao,Shengbo Eben Li,Jun Ma

Main category: cs.CV

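TL;DR: OmniScene introduces a vision-language model and a hierarchical fusion strategy to build a human-like 4D scene understanding framework, significantly improving perception and understanding in autonomous driving systems.

Details Motivation: Current autonomous driving systems rely mainly on depth-based 3D reconstruction and lack true scene understanding. The work aims for more comprehensive scene understanding by combining multimodal perception with a human-like attention mechanism.

Contribution: 1. Proposes the OmniScene framework and its vision-language model OmniVLM; 2. embeds textual representations through knowledge distillation to strengthen semantic supervision; 3. designs a Hierarchical Fusion Strategy (HFS) to improve multimodal feature integration.

Method: 1. Uses a teacher-student OmniVLM architecture; 2. aligns textual representations with 3D instance features through knowledge distillation; 3. applies HFS to dynamically calibrate the contributions of geometric and semantic features.

Result: On the nuScenes dataset, OmniScene outperforms more than ten state-of-the-art models on perception, prediction, planning, and visual question answering, setting new performance benchmarks.

Insight: Combining visual and textual modalities and mimicking human attention can significantly improve scene understanding and behavioral adaptability in autonomous driving systems.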

Abstract: Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.

[77] CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion

Chenhao Ji,Chaohui Yu,Junyao Gao,Fan Wang,Cairong Zhao

Main category: cs.CV

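TL;DR: The paper proposes CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses, addressing the geometric consistency problems that conventional methods face in panoramic video generation.

Details Motivation: Existing camera-controlled video generation methods focus mainly on perspective-projection video, while geometrically consistent panoramic video generation remains difficult. The paper targets the complexity of panoramic pose representation and spherical projection.

Contribution: 1. Proposes a panoramic Plücker embedding that encodes camera extrinsics through a spherical coordinate transformation; 2. designs a spherical epipolar module that enforces geometric constraints through adaptive attention masking, improving cross-view feature aggregation.

Method: 1. A panoramic Plücker embedding encodes camera poses; 2. a spherical epipolar module uses epipolar constraints to improve feature aggregation; 3. a diffusion-based framework generates the panoramic video.

Result: Experiments show that CamPVG generates high-quality panoramic videos consistent with the camera trajectory, clearly outperforming existing methods.

Insight: Combining geometric constraints with spherical projection can significantly improve the geometric consistency of panoramic video generation, opening a new approach to camera-controlled panoramic content generation.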

Abstract: Recently, camera-controlled video generation has seen rapid development, offering more precise control over video generation. However, existing methods predominantly focus on camera control in perspective projection video generation, while geometrically consistent panoramic video generation remains challenging. This limitation is primarily due to the inherent complexities in panoramic pose representation and spherical projection. To address this issue, we propose CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses. We achieve camera position encoding for panoramic images and cross-view feature aggregation based on spherical projection. Specifically, we propose a panoramic Plücker embedding that encodes camera extrinsic parameters through spherical coordinate transformation. This pose encoder effectively captures panoramic geometry, overcoming the limitations of traditional methods when applied to equirectangular projections. Additionally, we introduce a spherical epipolar module that enforces geometric constraints through adaptive attention masking along epipolar lines. This module enables fine-grained cross-view feature aggregation, substantially enhancing the quality and consistency of generated panoramic videos. Extensive experiments demonstrate that our method generates high-quality panoramic videos consistent with camera trajectories, far surpassing existing methods in panoramic video generation.
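A Plücker embedding represents each pixel's viewing ray by a six-dimensional vector: the moment (camera center crossed with the ray direction) followed by the direction itself. The sketch below applies that standard construction to an equirectangular pixel, obtaining the direction from spherical coordinates. The axis convention, the argument layout, and the function names are assumptions for illustration and may differ from the paper's panoramic formulation.

```python
import math

def spherical_ray(row, col, height, width):
    """Unit viewing direction of an equirectangular pixel in the camera frame."""
    lon = (col + 0.5) / width * 2 * math.pi - math.pi
    lat = math.pi / 2 - (row + 0.5) / height * math.pi
    # Assumed convention: y up, z forward, x right.
    return (math.cos(lat) * math.sin(lon), math.sin(lat), math.cos(lat) * math.cos(lon))

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_embedding(row, col, height, width, rotation, center):
    """Return the 6-tuple (moment, direction) for one pixel.

    rotation: 3x3 camera-to-world rotation as nested lists.
    center: camera center in world coordinates.
    """
    d_cam = spherical_ray(row, col, height, width)
    d = tuple(sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    return cross(center, d) + d
```

Because the moment is a cross product with the direction, it is always orthogonal to it, and it vanishes when the camera sits at the world origin; both properties give quick sanity checks for an implementation.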

[78] SDE-DET: A Precision Network for Shatian Pomelo Detection in Complex Orchard Environments

Yihao Hu,Pan Wang,Xiaodong Bai,Shijie Cai,Hang Wang,Huazhong Liu,Aiping Yang,Xiangxiang Li,Meiping Ding,Hongyan Liu,Jianguo Yao

Main category: cs.CV

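TL;DR: The paper proposes SDE-DET, a model for detecting Shatian pomelos in complex orchard environments that combines the Star Block, Deformable Attention, and multi-scale attention, performing well in both accuracy and computational efficiency.

Details Motivation: Shatian pomelo detection is essential for automated harvesting and maturity analysis, but multi-scale targets, occlusion, and small objects in complex orchards make detection harder.

Contribution: Proposes the SDE-DET model, which combines the Star Block and Deformable Attention modules and significantly improves detection of occluded and small objects.

Method: Uses the Star Block to extract high-dimensional information, Deformable Attention to strengthen detection under occlusion, and multi-scale attention to improve small object detection.

Result: On the STP-AgriData dataset, SDE-DET surpasses mainstream detectors (such as the YOLO series) in precision, recall, mAP, and other metrics, reaching state-of-the-art performance.

Insight: SDE-DET offers an efficient solution for object detection in complex environments and lays groundwork for automated harvesting robots.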

Abstract: Pomelo detection is an essential step for fruit localization, automated robotic harvesting, and maturity analysis. However, detecting Shatian pomelo in complex orchard environments poses significant challenges, including multi-scale issues, obstructions from trunks and leaves, small object detection, etc. To address these issues, this study constructs a custom dataset STP-AgriData and proposes the SDE-DET model for Shatian pomelo detection. SDE-DET first utilizes the Star Block to effectively acquire high-dimensional information without increasing the computational overhead. Furthermore, the presented model adopts Deformable Attention in its backbone, to enhance its ability to detect pomelos under occluded conditions. Finally, multiple Efficient Multi-Scale Attention mechanisms are integrated into our model to reduce the computational overhead and extract deep visual representations, thereby improving the capacity for small object detection. In the experiment, we compared SDE-DET with the YOLO series and other mainstream detection models in Shatian pomelo detection. The presented SDE-DET model achieved scores of 0.883, 0.771, 0.838, 0.497, and 0.823 in Precision, Recall, mAP@0.5, mAP@0.5:0.95 and F1-score, respectively. SDE-DET has achieved state-of-the-art performance on the STP-AgriData dataset. Experiments indicate that the SDE-DET provides a reliable method for Shatian pomelo detection, laying the foundation for the further development of automatic harvest robots.
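The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. A short worked check, using only the numbers quoted in the abstract (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision 0.883 and recall 0.771, as reported in the abstract.
print(round(f1_score(0.883, 0.771), 3))  # 0.823
```

The result matches the reported F1 of 0.823, so the three figures are mutually consistent.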

[79] Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

Zhifang Zhang,Jiahan Zhang,Shengjie Zhou,Qi Wei,Shuo He,Feng Liu,Lei Feng

Main category: cs.CV

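TL;DR: The paper proposes Proxy Targeted Attack (PTA), a new method that addresses the limited generalizability and undetectability of targeted adversarial attacks on multimodal pre-trained models. By optimizing adversarial examples with multiple source-modal and target-modal proxies, PTA keeps the attack success rate high while evading defensive detection.

Details Motivation: Multimodal pre-trained models such as ImageBind perform well on downstream tasks, but their wide adoption raises security concerns, especially about targeted adversarial attacks. Existing attacks fall short in generalizability and undetectability and need improvement.

Contribution: 1. Proposes PTA, which optimizes adversarial examples with multiple source-modal and target-modal proxies to improve generalizability and undetectability. 2. Provides theoretical analysis that clarifies the relationship between generalizability and undetectability. 3. Shows experimentally that PTA performs well across a variety of related targets and anomaly detection methods.

Method: PTA optimizes adversarial examples with multiple source-modal and target-modal proxies so that they align with multiple potential targets while evading defensive detection. The theoretical analysis further guides this process toward the best generalizability that still meets the undetectability requirement.

Result: Experiments show that PTA achieves a high attack success rate across various related targets and remains undetected by multiple anomaly detection methods.

Insight: Optimizing attack examples with multimodal proxies can significantly improve the generalizability and undetectability of targeted adversarial attacks, giving robustness research on multimodal models a new angle.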

Abstract: Multimodal pre-trained models (e.g., ImageBind), which align distinct data modalities into a shared embedding space, have shown remarkable success across downstream tasks. However, their increasing adoption raises serious security concerns, especially regarding targeted adversarial attacks. In this paper, we show that existing targeted adversarial attacks on multimodal pre-trained models still have limitations in two aspects: generalizability and undetectability. Specifically, the crafted targeted adversarial examples (AEs) exhibit limited generalization to partially known or semantically similar targets in cross-modal alignment tasks (i.e., limited generalizability) and can be easily detected by simple anomaly detection methods (i.e., limited undetectability). To address these limitations, we propose a novel method called Proxy Targeted Attack (PTA), which leverages multiple source-modal and target-modal proxies to optimize targeted AEs, ensuring they remain evasive to defenses while aligning with multiple potential targets. We also provide theoretical analyses to highlight the relationship between generalizability and undetectability and to ensure optimal generalizability while meeting the specified requirements for undetectability. Furthermore, experimental results demonstrate that our PTA can achieve a high success rate across various related targets and remain undetectable against multiple anomaly detection methods.

[80] Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture

Nico Schulthess,Ender Konukoglu

Main category: cs.CV

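TL;DR: This paper proposes an unsupervised anomaly detection method for medical imaging that models DINOv2 embeddings with a Dirichlet Process Mixture model (DPMM), reducing the computational burden while remaining very competitive in detection performance.

Details Motivation: Memory-bank approaches to anomaly detection in medical imaging work for small datasets but become computationally expensive on large ones. The paper uses DINOv2 embeddings and a DPMM to address efficiency and performance at scale.

Contribution: The main contribution is combining DINOv2 embeddings with a DPMM for efficient unsupervised anomaly detection, together with the finding that normalized embeddings work better for this task.

Method: Features are extracted with DINOv2, a DPMM automatically adjusts the number of mixture components to the data, and the similarity between component centers and embeddings serves as the anomaly score used to produce a coarse anomaly segmentation mask.

Result: Experiments show very competitive anomaly detection performance on medical imaging benchmarks with inference computation time at least halved, and indicate that normalized embeddings are better suited to anomaly detection.

Insight: Normalized DINOv2 embeddings stay aligned with anatomical structures even when anomalies are present, which makes them well suited as representations for anomaly detection.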

Abstract: In this work, we leverage informative embeddings from foundational models for unsupervised anomaly detection in medical imaging. For small datasets, a memory-bank of normative features can directly be used for anomaly detection which has been demonstrated recently. However, this is unsuitable for large medical datasets as the computational burden increases substantially. Therefore, we propose to model the distribution of normative DINOv2 embeddings with a Dirichlet Process Mixture model (DPMM), a non-parametric mixture model that automatically adjusts the number of mixture components to the data at hand. Rather than using a memory bank, we use the similarity between the component centers and the embeddings as anomaly score function to create a coarse anomaly segmentation mask. Our experiments show that through DPMM embeddings of DINOv2, despite being trained on natural images, achieve very competitive anomaly detection performance on medical imaging benchmarks and can do this while at least halving the computation time at inference. Our analysis further indicates that normalized DINOv2 embeddings are generally more aligned with anatomical structures than unnormalized features, even in the presence of anomalies, making them great representations for anomaly detection. The code is available at https://github.com/NicoSchulthess/anomalydino-dpmm.
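
The scoring step described above can be illustrated with a minimal sketch. It assumes the mixture has already been fitted and uses cosine similarity on normalized vectors; the function names, the toy 2-D centers, and the choice of `1 - max similarity` as the score are illustrative assumptions, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def anomaly_score(embedding, centers):
    """Return 1 - max cosine similarity between an embedding and the centers."""
    e = l2_normalize(embedding)
    best = max(sum(a * b for a, b in zip(e, l2_normalize(c))) for c in centers)
    return 1.0 - best

# Hypothetical 2-D component centers standing in for fitted DPMM means.
centers = [[1.0, 0.0], [0.0, 1.0]]
normal_patch = [0.9, 0.1]   # close to the first center
odd_patch = [-1.0, -1.0]    # far from both centers
scores = [anomaly_score(p, centers) for p in (normal_patch, odd_patch)]
```

Applying such a score to every patch embedding of an image and arranging the results on the patch grid would give a coarse anomaly map of the kind the abstract mentions.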

[81] Table Detection with Active Learning

Somraj Gautam,Nachiketa Purohit,Gaurav Harit

Main category: cs.CV

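TL;DR: This paper combines active learning (AL) with a diversity-based strategy for table detection, reducing annotation cost while improving model performance.

Details Motivation: Efficient data annotation is a key challenge in machine learning, especially for object detection tasks that need large amounts of labeled data. Active learning minimizes annotation cost by selecting the most informative samples, and adding a diversity-based strategy can further improve sampling efficiency.

Contribution: Proposes an active learning method for table detection that combines uncertainty and diversity, reducing annotation cost and reaching a higher mAP under the same annotation budget.

Method: Uses the table detection architectures CascadeTabNet and YOLOv9 together with the AL selection strategy, evaluated on two benchmark datasets (TableBank-LaTeX and TableBank-Word).

Result: Experiments show that AL-based selection significantly outperforms random sampling, maintains performance comparable to fully supervised models under a limited annotation budget, and achieves higher mAP within the same budget.

Insight: Adding a diversity-based strategy to active learning improves sampling efficiency, particularly in object detection, compared with relying only on traditional uncertainty-based selection.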

Abstract: Efficient data annotation remains a critical challenge in machine learning, particularly for object detection tasks requiring extensive labeled data. Active learning (AL) has emerged as a promising solution to minimize annotation costs by selecting the most informative samples. While traditional AL approaches primarily rely on uncertainty-based selection, recent advances suggest that incorporating diversity-based strategies can enhance sampling efficiency in object detection tasks. Our approach ensures the selection of representative examples that improve model generalization. We evaluate our method on two benchmark datasets (TableBank-LaTeX, TableBank-Word) using state-of-the-art table detection architectures, CascadeTabNet and YOLOv9. Our results demonstrate that AL-based example selection significantly outperforms random sampling, reducing annotation effort given a limited budget while maintaining comparable performance to fully supervised models. Our method achieves higher mAP scores within the same annotation budget.
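
One common way to combine uncertainty with diversity is to shortlist the most uncertain samples and then spread the picks out in feature space. The sketch below follows that pattern; the abstract does not specify the selection rule, so `select_batch`, `pool_factor`, and the farthest-point step are assumptions for illustration only.

```python
import math

def select_batch(features, uncertainties, budget, pool_factor=3):
    """Pick `budget` sample indices: shortlist by uncertainty, then spread out.

    features: list of feature vectors (one per unlabeled image).
    uncertainties: list of scores, higher means the detector is less sure.
    """
    # Step 1: keep the most uncertain candidates (pool_factor * budget of them).
    ranked = sorted(range(len(features)), key=lambda i: -uncertainties[i])
    pool = ranked[: pool_factor * budget]

    # Step 2: greedy farthest-point selection inside the shortlist.
    chosen = [pool[0]]  # start from the single most uncertain sample
    while len(chosen) < min(budget, len(pool)):
        def dist_to_chosen(i):
            return min(math.dist(features[i], features[j]) for j in chosen)
        remaining = [i for i in pool if i not in chosen]
        chosen.append(max(remaining, key=dist_to_chosen))
    return chosen

# Toy pool: two near-duplicate pairs plus one confident outlier.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
uncertainties = [0.9, 0.8, 0.7, 0.6, 0.1]
picked = select_batch(features, uncertainties, budget=2, pool_factor=2)
```

Pure uncertainty ranking would pick the two near-duplicates at indices 0 and 1; the diversity step instead pairs index 0 with a sample from the other cluster.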

[82] Does the Manipulation Process Matter? RITA: Reasoning Composite Image Manipulations via Reversely-Ordered Incremental-Transition Autoregression

Xuekang Zhu,Ji-Zhe Zhou,Kaiwen Feng,Chenfan Qu,Yunfei Wang,Liting Zhou,Jian Liu

Main category: cs.CV

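TL;DR: The paper proposes RITA, a framework that reformulates image manipulation localization as a conditional sequence prediction task. It predicts manipulated regions layer by layer and models the temporal and hierarchical dependencies among editing operations, addressing the fact that existing methods ignore the manipulation process.

Details Motivation: Existing image manipulation localization (IML) methods ignore the sequential and hierarchical nature of the manipulation process and output a localization mask in a single one-shot prediction, which causes dimensional collapse. RITA is the first to recast IML as a sequence prediction task so that these properties can be modeled.

Contribution: 1. Proposes the RITA framework, which reformulates IML as a conditional sequence prediction problem; 2. Synthesizes multi-step manipulation data, builds the HSIM benchmark, and introduces the HSS metric; 3. Shows experimentally that RITA reaches SOTA on traditional benchmarks and on the new hierarchical localization task.

Method: RITA uses reversely-ordered incremental-transition autoregression to predict manipulated regions layer by layer, with each step's prediction serving as the condition for the next, explicitly modeling temporal and hierarchical dependencies among editing operations.

Result: RITA achieves SOTA on traditional benchmarks and provides a solid foundation for the new hierarchical localization task.

Insight: Modeling the sequential and hierarchical nature of the manipulation process is key to improving IML, and a sequence prediction paradigm is a more natural fit for the localization problem.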

Abstract: Image manipulations often entail a complex manipulation process, comprising a series of editing operations to create a deceptive image, exhibiting sequentiality and hierarchical characteristics. However, existing IML methods remain manipulation-process-agnostic, directly producing localization masks in a one-shot prediction paradigm without modeling the underlying editing steps. This one-shot paradigm compresses the high-dimensional compositional space into a single binary mask, inducing severe dimensional collapse, thereby creating a fundamental mismatch with the intrinsic nature of the IML task. To address this, we are the first to reformulate image manipulation localization as a conditional sequence prediction task, proposing the RITA framework. RITA predicts manipulated regions layer-by-layer in an ordered manner, using each step’s prediction as the condition for the next, thereby explicitly modeling temporal dependencies and hierarchical structures among editing operations. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark HSIM. We further propose the HSS metric to assess sequential order and hierarchical alignment. Extensive experiments show RITA achieves SOTA on traditional benchmarks and provides a solid foundation for the novel hierarchical localization task, validating its potential as a general and effective paradigm. The code and dataset will be publicly available.

[83] PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction

Manahil Raza,Ayesha Azam,Talha Qaiser,Nasir Rajpoot

Main category: cs.CV

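TL;DR: PS3 is a multimodal Transformer that fuses pathology reports, histology images, and biological pathway data for cancer survival prediction, outperforming existing methods.

Details Motivation: Existing multimodal fusion methods focus on combining histology images with genomic data and overlook pathology reports, which carry clinical context and expert interpretation and can supply complementary information. The heterogeneity of the modalities (high-dimensional images versus variable-length text) makes fusion challenging.

Contribution: 1) Diagnostic prototypes that extract the key content of pathology reports; 2) histological prototypes that compactly represent morphological patterns in WSIs; 3) biological pathway prototypes that encode transcriptomic data. PS3 combines the three through a Transformer for multimodal interaction and prediction.

Method: 1) Generate diagnostic prototypes from text using self-attention; 2) extract histological prototypes from WSIs; 3) encode pathway prototypes from transcriptomic data. A Transformer fuses the prototypes from the three modalities and models intra-modal and cross-modal interactions.

Result: On six TCGA datasets, PS3 outperforms existing unimodal and multimodal baselines, confirming that pathology reports add value for survival prediction. The code is open source.

Insight: 1) A standardized representation of pathology reports improves model performance; 2) prototype-based multimodal fusion mitigates data heterogeneity; 3) Transformers are well suited to modeling complex interactions between modalities.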

Abstract: Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms state-of-the-art methods when evaluated against clinical, unimodal and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is available at: https://github.com/manahilr/PS3.

[84] Predictive Quality Assessment for Mobile Secure Graphics

Cas Steigstra,Sergey Milyaev,Shaodi You

Main category: cs.CV

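TL;DR: The paper proposes a lightweight framework that predicts frame quality for secure graphic verification on mobile devices, addressing the high false rejection rates caused by poor image acquisition.

Details Motivation: Uncontrolled smartphone captures of high-entropy secure graphics lead to high error rates in verification, so a method that predicts image quality is needed to make the downstream verification task more reliable.

Contribution: Proposes a predictive quality assessment framework in which a lightweight model predicts a quality score for each video frame to decide whether it is suitable for the resource-intensive verification model.

Method: Uses a lightweight model and validates the framework on a large-scale dataset with re-contextualized FNMR and ISRR metrics. A cross-domain analysis across printing technologies finds that a frozen pre-trained network generalizes better than a fine-tuned model.

Result: The framework is validated on a large-scale dataset of more than 32,000 images from 105 smartphones, and the analysis reveals the generalization advantage of a frozen pre-trained network in cross-domain settings.

Insight: For domain shifts that originate in physical manufacturing, a frozen general-purpose pre-trained backbone can be more robust than a fully fine-tuned model, which can overfit to source-domain artifacts.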

Abstract: The reliability of secure graphic verification, a key anti-counterfeiting tool, is undermined by poor image acquisition on smartphones. Uncontrolled user captures of these high-entropy patterns cause high false rejection rates, creating a significant ‘reliability gap’. To bridge this gap, we depart from traditional perceptual IQA and introduce a framework that predictively estimates a frame’s utility for the downstream verification task. We propose a lightweight model to predict a quality score for a video frame, determining its suitability for a resource-intensive oracle model. Our framework is validated using re-contextualized FNMR and ISRR metrics on a large-scale dataset of 32,000+ images from 105 smartphones. Furthermore, a novel cross-domain analysis on graphics from different industrial printing presses reveals a key finding: a lightweight probe on a frozen, ImageNet-pretrained network generalizes better to an unseen printing technology than a fully fine-tuned model. This provides a key insight for real-world generalization: for domain shifts from physical manufacturing, a frozen general-purpose backbone can be more robust than full fine-tuning, which can overfit to source-domain artifacts.
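
The gating idea can be sketched as follows: a cheap quality score decides which frames reach the expensive verifier, and the false non-match rate (FNMR) is compared with and without the gate. The sketch uses the standard biometric definition of FNMR (share of genuine attempts rejected); the paper's re-contextualized FNMR and ISRR definitions may differ, and the function names, threshold, and data are hypothetical.

```python
def gate_frames(quality_scores, threshold):
    """Return indices of frames whose predicted quality clears the threshold."""
    return [i for i, q in enumerate(quality_scores) if q >= threshold]

def fnmr(oracle_accepts):
    """False non-match rate: share of genuine attempts the oracle rejects."""
    if not oracle_accepts:
        return 0.0
    return sum(1 for ok in oracle_accepts if not ok) / len(oracle_accepts)

# Hypothetical data: every frame shows a genuine secure graphic.
quality_scores = [0.95, 0.20, 0.80, 0.35, 0.90]
oracle_accepts = [True, False, True, False, True]  # oracle verdict per frame

kept = gate_frames(quality_scores, threshold=0.5)
fnmr_all = fnmr(oracle_accepts)                       # no gating
fnmr_gated = fnmr([oracle_accepts[i] for i in kept])  # only gated frames
```

In this toy case the predicted quality tracks the oracle's verdict perfectly, so gating removes every false rejection; a real predictor would only reduce them.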

[85] SHMoAReg: Spark Deformable Image Registration via Spatial Heterogeneous Mixture of Experts and Attention Heads

Yuxi Zheng,Jianhui Feng,Tianran Li,Marius Staring,Yuchuan Qiao

Main category: cs.CV

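TL;DR: The paper proposes SHMoAReg, a deformable image registration network built on a Mixture of Experts mechanism. A Mixture of Attention heads and Spatial Heterogeneous Mixture of Experts make feature extraction more specialized and deformation field prediction heterogeneous, and experiments show consistent improvements over existing methods.

Details Motivation: Current encoder-decoder methods for deformable image registration lack specialization in feature extraction and heterogeneity in deformation field prediction, which limits performance.

Contribution: 1. Introduces the Mixture of Experts mechanism into deformable image registration for the first time; 2. proposes Mixture of Attention heads (MoA) for the encoder and Spatial Heterogeneous Mixture of Experts (SHMoE) for the decoder; 3. demonstrates improved performance and model interpretability experimentally.

Method: 1. The encoder uses MoA to dynamically select the best combination of attention heads for each image token; 2. the decoder uses SHMoE to predict the deformation field heterogeneously in the three directions for each voxel; 3. the experts use varying kernel sizes.

Result: On two publicly available datasets the method improves consistently over various baselines; on the abdominal CT dataset the Dice score rises from 60.58% to 65.58%.

Insight: The Mixture of Experts mechanism improves both performance and interpretability in registration, and a heterogeneous design better matches the nature of three-dimensional deformation fields.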

Abstract: Encoder-Decoder architectures are widely used in deep learning-based Deformable Image Registration (DIR), where the encoder extracts multi-scale features and the decoder predicts deformation fields by recovering spatial locations. However, current methods lack specialized extraction of features (that are useful for registration) and predict deformation jointly and homogeneously in all three directions. In this paper, we propose a novel expert-guided DIR network with Mixture of Experts (MoE) mechanism applied in both encoder and decoder, named SHMoAReg. Specifically, we incorporate Mixture of Attention heads (MoA) into encoder layers, while Spatial Heterogeneous Mixture of Experts (SHMoE) into the decoder layers. The MoA enhances the specialization of feature extraction by dynamically selecting the optimal combination of attention heads for each image token. Meanwhile, the SHMoE predicts deformation fields heterogeneously in three directions for each voxel using experts with varying kernel sizes. Extensive experiments conducted on two publicly available datasets show consistent improvements over various methods, with a notable increase from 60.58% to 65.58% in Dice score for the abdominal CT dataset. Furthermore, SHMoAReg enhances model interpretability by differentiating experts’ utilities across/within different resolution layers. To the best of our knowledge, we are the first to introduce MoE mechanism into DIR tasks. The code will be released soon.
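
Per-token head selection of the kind MoA describes is often implemented as top-k routing: a router scores every head, the k best are kept, and their outputs are mixed with renormalized weights. The sketch below shows that generic pattern for a single token; the router, the value of k, and the softmax renormalization are assumptions, not details taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_heads(router_logits, head_outputs, k):
    """Mix the outputs of the k heads with the largest router scores.

    router_logits: one score per attention head for this token.
    head_outputs: one output vector per attention head for this token.
    """
    top = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalize over top-k
    dim = len(head_outputs[0])
    mixed = [sum(w * head_outputs[i][d] for w, i in zip(weights, top))
             for d in range(dim)]
    return top, mixed

# Toy token with four heads; heads 0 and 2 get the highest router scores.
logits = [2.0, -1.0, 2.0, 0.0]
outputs = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [5.0, 5.0]]
selected, token_out = route_heads(logits, outputs, k=2)
```

Because only the selected heads contribute, different tokens can end up using different head combinations, which is the specialization the abstract refers to.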

[86] Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing

Zizheng Yang,Hu Yu,Bing Li,Jinghao Zhang,Jie Huang,Feng Zhao

Main category: cs.CV

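TL;DR: The paper proposes DiffLI$^2$D, an image dehazing method based on the semantic latent space of pre-trained diffusion models. It avoids retraining the diffusion model and the iterative sampling process, and outperforms existing methods.

Details Motivation: Diffusion models show strong potential for image dehazing, but the heavy cost of retraining and the many sampling steps at inference limit their wider use. The paper studies the semantic latent space of pre-trained diffusion models to reduce this overhead.

Contribution: Proposes DiffLI$^2$D, which uses the semantic latent space of a pre-trained diffusion model to represent the content and haze characteristics of hazy images, avoiding retraining and iterative sampling.

Method: After analyzing the latent representations of the diffusion model at different time-steps, the authors design a dehazing network that integrates these representations to guide dehazing.

Result: Experiments on multiple datasets show that the method outperforms existing dehazing methods.

Insight: The semantic latent space of pre-trained diffusion models captures the characteristics of hazy images effectively, offering a new perspective on applying diffusion models to image dehazing.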

Abstract: Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by the retraining of diffusion models, coupled with the extensive sampling steps during the inference, limit the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI$^2$D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI$^2$D avoids re-training diffusion models and iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods. Code is available at https://github.com/aaaasan111/difflid.

[87] Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models

Juana Valeria Hurtado,Rohit Mohan,Abhinav Valada

Main category: cs.CV

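TL;DR: The paper proposes the Hyperspectral Adapter, an architecture that uses pre-trained vision foundation models (such as ViT) to learn effectively from hyperspectral data. A spectral transformer and a spectrum-aware spatial prior module extract rich spatial-spectral features, and the method achieves state-of-the-art semantic segmentation on three autonomous driving benchmark datasets.

Details Motivation: Hyperspectral imaging (HSI) provides rich spatial and spectral information, but hyperspectral semantic segmentation underperforms because current methods are designed mainly for RGB inputs. The paper addresses this by adapting pre-trained vision foundation models (VFMs).

Contribution: 1. Designs the Hyperspectral Adapter architecture, comprising a spectral transformer and a spectrum-aware spatial prior module; 2. introduces a modality-aware interaction block that integrates hyperspectral representations with frozen vision Transformer features; 3. validates the method on autonomous driving datasets.

Method: 1. A spectral transformer extracts hyperspectral features; 2. the spectrum-aware spatial prior module models spatial-spectral relationships; 3. the modality-aware interaction block integrates hyperspectral and VFM features through extraction and injection mechanisms.

Result: Achieves state-of-the-art semantic segmentation on three autonomous driving datasets, outperforming both RGB-based and hyperspectral segmentation methods.

Insight: With an adapted pre-trained vision foundation model, hyperspectral data can substantially improve semantic segmentation in complex environments, which offers a new direction for robotic perception.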

Abstract: Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at https://hyperspectraladapter.cs.uni-freiburg.de.

[88] A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA

Belal Shoer,Yova Kementchedjhieva

Main category: cs.CV

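TL;DR: The paper proposes a simple data augmentation strategy that converts existing image-text pairs into a unified image format to generate synthetic data for scientific visual question answering, yielding notable gains for a multilingual model.

Details Motivation: Scientific visual question answering is challenging because of the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and the text as separate inputs, whereas EXAMS-V embeds both in a single image, a setup that still requires task-specific fine-tuning. The authors propose the augmentation strategy to address the scarcity of training data in this format.

Contribution: 1. Proposes a data augmentation method that converts separate image-text pairs into unified images. 2. Shows that the method is effective for multilingual scientific visual question answering, with strong average improvements.

Method: Existing image-text pairs are converted into a unified "text-in-image" format to create synthetic data, and a small multilingual multimodal model is fine-tuned on a mix of this data and EXAMS-V.

Result: In experiments across 13 languages, the method clearly outperforms the zero-shot baseline, showing strong average improvements and cross-lingual transfer.

Insight: A simple data augmentation strategy can substantially improve multilingual multimodal models on complex tasks such as scientific visual question answering, especially when training data is scarce.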

Abstract: Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this “text-in-image” format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.

[89] EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models

Botai Yuan,Yutian Zhou,Yingjie Wang,Fushuo Huo,Yongcheng Jing,Li Shen,Ying Wei,Zhiqi Shen,Ziwei Liu,Tianwei Zhang,Jie Yang,Dacheng Tao

Main category: cs.CV

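TL;DR: EchoBench is a benchmark for evaluating sycophancy in medical Large Vision-Language Models (LVLMs). It shows that mainstream models respond uncritically when users supply biased input.

Details Motivation: Current medical LVLM benchmarks concentrate on accuracy and overlook reliability and safety. Sycophancy is particularly dangerous in high-stakes clinical settings and calls for systematic evaluation.

Contribution: Introduces the EchoBench benchmark, with 2,122 images and 90 prompts that simulate biased inputs, systematically evaluates sycophancy across different types of models, and examines mitigation strategies.

Method: Biased inputs from simulated patients, medical students, and physicians are used to measure sycophancy in medical-specific, open-source, and proprietary LVLMs, followed by an analysis of contributing factors and mitigation methods.

Result: All models exhibit substantial sycophancy. The best proprietary model, Claude 3.7 Sonnet, still shows 45.98%, GPT-4.1 reaches 59.15%, and many medical-specific models exceed 95%. Higher data quality and diversity and stronger domain knowledge reduce sycophancy.

Insight: Accuracy alone is not enough to assess model safety. Data diversity and domain knowledge reduce sycophancy, and prompt-level interventions and training strategies can serve as mitigations.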

Abstract: Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy – models’ tendency to uncritically echo user-provided information – in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.
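
A sycophancy percentage like those reported above can be computed as the share of biased trials in which the model repeats the incorrect claim pushed by the user. The abstract does not give the exact formula, so the definition and the record keys `suggested` and `answer` below are assumptions made for illustration.

```python
def sycophancy_rate(records):
    """Share of biased trials in which the model echoes the user's wrong claim.

    Each record is a dict with hypothetical keys:
      'suggested' - the incorrect answer pushed by the simulated user
      'answer'    - the model's answer under that biased prompt
    """
    if not records:
        return 0.0
    echoed = sum(1 for r in records if r["answer"] == r["suggested"])
    return echoed / len(records)

trials = [
    {"suggested": "B", "answer": "B"},  # model gives in
    {"suggested": "C", "answer": "A"},  # model holds its ground
    {"suggested": "D", "answer": "D"},  # model gives in
    {"suggested": "B", "answer": "A"},  # model holds its ground
]
rate = sycophancy_rate(trials)
```

Grouping the records by bias type, department, or modality before calling the function would give the kind of fine-grained breakdown the abstract describes.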

[90] C$^2$MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis

Min Cen,Zhenfeng Zhuang,Yuzhe Zhang,Min Zeng,Baptiste Magnier,Lequan Yu,Hong Zhang,Liansheng Wang

Main category: cs.CV

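TL;DR: C$^2$MIL is a dual causal graph-based multiple instance learning method that jointly optimizes semantic and topological causalities to make survival analysis more robust and interpretable.

Details Motivation: Graph-based MIL methods for survival analysis suffer from semantic bias and topological noise, which hurt generalization and interpretability.

Contribution: Proposes the C$^2$MIL model, which builds on a dual causal structure that combines semantic intervention with topological causal discovery to improve the MIL framework.

Method: Introduces a cross-scale adaptive feature disentangling module (semantic intervention) and a Bernoulli differentiable causal subgraph sampling method (topological discovery), trained with a joint optimization strategy.

Result: Experiments show that C$^2$MIL improves generalization and interpretability over existing methods and can enhance a range of MIL baselines.

Insight: Jointly refining semantic and topological causalities is key to improving MIL models. The code is open source, which supports reproduction and reuse.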

Abstract: Graph-based Multiple Instance Learning (MIL) is widely used in survival analysis with Hematoxylin and Eosin (H&E)-stained whole slide images (WSIs) due to its ability to capture topological information. However, variations in staining and scanning can introduce semantic bias, while topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To tackle this, we introduce a dual structural causal model as the theoretical foundation and propose a novel and interpretable dual causal graph-based MIL model, C$^2$MIL. C$^2$MIL incorporates a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy combining disentangling supervision and contrastive learning enables simultaneous refinement of both semantic and topological causalities. Experiments demonstrate that C$^2$MIL consistently improves generalization and interpretability over existing methods and can serve as a causal enhancement for diverse MIL baselines. The code is available at https://github.com/mimic0127/C2MIL.

[91] U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT

Zhi Qin Tan,Xiatian Zhu,Owen Addison,Yunpeng Li

Main category: cs.CV

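TL;DR: The paper proposes U-Mamba2-SSL, a semi-supervised learning framework built on U-Mamba2 for tooth and pulp segmentation in CBCT. It combines self-supervised pre-training, consistency regularization, and pseudo-labeling to reach strong performance.

Details Motivation: Accurate segmentation of teeth and pulp in CBCT is vital for clinical diagnosis and treatment planning, but doing it manually requires expertise and is time-consuming. The authors therefore propose an automated algorithm that makes effective use of unlabeled data.

Contribution: The main contributions are: 1) the U-Mamba2-SSL framework; 2) a combination of self-supervised pre-training with a multi-stage semi-supervised training strategy; 3) consistency regularization based on input and feature perturbations.

Method: The method has three stages: 1) self-supervised pre-training of U-Mamba2 with a disruptive autoencoder; 2) consistency regularization to exploit unlabeled data; 3) pseudo-labeling with a reduced loss weight to limit the impact of errors.

Result: On the validation set, U-Mamba2-SSL achieves an average score of 0.872 and a DSC of 0.969.

Insight: Combining self-supervised learning with semi-supervised strategies improves CBCT segmentation, particularly for tasks where annotation is expensive.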

Abstract: Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating the superior performance of our approach. The code is available at https://github.com/zhiqin1998/UMamba2.
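
Two quantities from the summary lend themselves to a short sketch: the Dice similarity coefficient (DSC) used for evaluation, and a loss that down-weights the pseudo-label term. The DSC follows its standard definition; `total_loss` and its default weights are hypothetical and only illustrate the reduced weighting, since the paper trains in stages and its actual loss terms and weights are not given in the abstract.

```python
def dice(pred, target):
    """Dice similarity coefficient for two flat binary masks (lists of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 if denom == 0 else 2.0 * inter / denom

def total_loss(sup_loss, consistency_loss, pseudo_loss,
               w_consistency=1.0, w_pseudo=0.1):
    """Combine the terms; the pseudo-label term is deliberately down-weighted."""
    return sup_loss + w_consistency * consistency_loss + w_pseudo * pseudo_loss

# Toy masks: the prediction over-segments by one voxel.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
loss = total_loss(sup_loss=0.5, consistency_loss=0.2, pseudo_loss=1.0)
```

Keeping `w_pseudo` small means that a wrong pseudo-label moves the model less than a wrong ground-truth label would, which is the stated purpose of the reduced weighting.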

[92] Optical Ocean Recipes: Creating Realistic Datasets to Facilitate Underwater Vision Research

Patricia Schöntag,David Nakath,Judith Fischer,Rüdiger Röttgers,Kevin Köser

Main category: cs.CV

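TL;DR: The paper proposes the Optical Ocean Recipes framework for creating realistic datasets under controlled underwater conditions, addressing the lack of generalizability and control in underwater machine vision research.

Details Motivation: Underwater machine vision research is held back by the absence of test environments that cover different optical water types and imaging conditions, so algorithms struggle to adapt to diverse real-world scenes.

Contribution: Proposes a repeatable and controlled testing framework that reproduces the effect of different water compositions on image appearance and produces realistic datasets for a range of vision tasks.

Method: Calibrated color and scattering additives are used to create controlled underwater optical environments that mimic different water compositions and optical conditions, from which diverse datasets are generated.

Result: Provides a demonstration dataset and shows the framework applied to two underwater vision tasks. The dataset and evaluation code will be made available.

Insight: Generating realistic data in controlled environments gives underwater machine vision research a new way to work around the limits of open-water testing.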

Abstract: The development and evaluation of machine vision in underwater environments remains challenging, often relying on trial-and-error-based testing tailored to specific applications. This is partly due to the lack of controlled, ground-truthed testing environments that account for the optical challenges, such as color distortion from spectrally variant light attenuation, reduced contrast and blur from backscatter and volume scattering, and dynamic light patterns from natural or artificial illumination. Additionally, the appearance of ocean water in images varies significantly across regions, depths, and seasons. However, most machine vision evaluations are conducted under specific optical water types and imaging conditions, therefore often lack generalizability. Exhaustive testing across diverse open-water scenarios is technically impractical. To address this, we introduce the Optical Ocean Recipes, a framework for creating realistic datasets under controlled underwater conditions. Unlike synthetic or open-water data, these recipes, using calibrated color and scattering additives, enable repeatable and controlled testing of the impact of water composition on image appearance. Hence, this provides a unique framework for analyzing machine vision in realistic, yet controlled underwater scenarios. The controlled environment enables the creation of ground-truth data for a range of vision tasks, including water parameter estimation, image restoration, segmentation, visual SLAM, and underwater image synthesis. We provide a demonstration dataset generated using the Optical Ocean Recipes and briefly demonstrate the use of our system for two underwater vision tasks. The dataset and evaluation code will be made available.

[93] Universal Camouflage Attack on Vision-Language Models for Autonomous Driving

Dehong Kong,Sifan Yu,Siyuan Liang,Jiawei Liang,Jianhou Gan,Aishan Liu,Wenqi Ren

Main category: cs.CV

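TL;DR: The paper proposes the first Universal Camouflage Attack (UCA) framework for vision-language models for autonomous driving (VLM-AD). It generates physically realizable camouflage textures in feature space that mislead model decisions and generalize well across scenarios and models.

Details Motivation: Existing adversarial attacks either target vision modules and are hard to transfer directly to VLM-AD systems, or are limited to the digital level. The paper aims to overcome both limits with an effective attack framework that can be realized in the physical world.

Contribution: 1. Proposes the first Universal Camouflage Attack (UCA) framework for VLM-AD, which generates physically realizable camouflage textures in feature space; 2. designs a feature divergence loss (FDL) and a multi-scale learning strategy to improve the generalization and robustness of the attack.

Method: UCA optimizes camouflage textures in feature space, uses FDL to maximize the representational discrepancy between clean and adversarial images, and combines multi-scale learning with an adjusted sampling ratio to improve adaptability and training stability.

Result: Experiments show that UCA misleads a variety of VLM-AD models, improving on existing attacks by 30% in 3-P metrics, and stays robust under diverse viewpoints and dynamic conditions.

Insight: The encoder and projection layers of VLM-AD are vulnerable. UCA's success points to the value of feature-space optimization and multi-scale learning and opens a direction for future safety research.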

Abstract: Visual language modeling for automated driving is emerging as a promising research direction with substantial improvements in multimodal reasoning capabilities. Despite its advanced reasoning abilities, VLM-AD remains vulnerable to serious security threats from adversarial attacks, which involve misleading model decisions through carefully crafted perturbations. Existing attacks have obvious challenges: 1) Physical adversarial attacks primarily target vision modules. They are difficult to directly transfer to VLM-AD systems because they typically attack low-level perceptual components. 2) Adversarial attacks against VLM-AD have largely concentrated on the digital level. To address these challenges, we propose the first Universal Camouflage Attack (UCA) framework for VLM-AD. Unlike previous methods that focus on optimizing the logit layer, UCA operates in the feature space to generate physically realizable camouflage textures that exhibit strong generalization across different user commands and model architectures. Motivated by the observed vulnerability of encoder and projection layers in VLM-AD, UCA introduces a feature divergence loss (FDL) that maximizes the representational discrepancy between clean and adversarial images. In addition, UCA incorporates a multi-scale learning strategy and adjusts the sampling ratio to enhance its adaptability to changes in scale and viewpoint diversity in real-world scenarios, thereby improving training stability. Extensive experiments demonstrate that UCA can induce incorrect driving commands across various VLM-AD models and driving scenarios, significantly surpassing existing state-of-the-art attack methods (improving 30% in 3-P metrics). Furthermore, UCA exhibits strong attack robustness under diverse viewpoints and dynamic conditions, indicating high potential for practical deployment.
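
A feature divergence objective can be sketched as a similarity between clean and adversarial features that the attacker drives down. The abstract only says that FDL maximizes the representational discrepancy, so the use of cosine similarity, the averaging over layers, and the function names below are assumptions; the sketch also assumes non-zero feature vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def feature_divergence_loss(clean_feats, adv_feats):
    """Mean cosine similarity over layers; minimizing it pushes features apart."""
    sims = [cosine(c, a) for c, a in zip(clean_feats, adv_feats)]
    return sum(sims) / len(sims)

# Hypothetical features from two layers (e.g. encoder and projector outputs).
clean = [[1.0, 0.0], [0.0, 2.0]]
unchanged = [[1.0, 0.0], [0.0, 2.0]]
flipped = [[-1.0, 0.0], [0.0, -2.0]]
loss_unchanged = feature_divergence_loss(clean, unchanged)
loss_flipped = feature_divergence_loss(clean, flipped)
```

In an actual attack the adversarial features would depend on the texture being optimized, and the loss would be minimized with respect to that texture by gradient descent.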

[94] PU-Gaussian: Point Cloud Upsampling using 3D Gaussian Representation

Mahmoud Khater,Mona Strauss,Philipp von Olshausen,Alexander Reiterer

Main category: cs.CV

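TL;DR: PU-Gaussian is a point cloud upsampling method based on a 3D Gaussian representation. It models local geometric neighborhoods and samples from them explicitly to generate dense point clouds, achieving state-of-the-art performance on multiple datasets.

Details Motivation: Existing point cloud upsampling methods often sacrifice geometric interpretability or lack robustness to sparse input. PU-Gaussian aims to overcome these limits by modeling local geometric structure with 3D Gaussian distributions.

Contribution: 1. Proposes a point cloud upsampling network based on 3D Gaussian distributions that models local geometric structure explicitly; 2. combines sampling with a refinement network to produce dense, high-quality point clouds; 3. validates its performance on multiple datasets.

Method: 1. Model an anisotropic 3D Gaussian distribution for each point to capture local geometric structure; 2. generate a dense but coarse point cloud by direct sampling; 3. use a refinement network to make the point distribution more uniform and the edges sharper.

Result: Achieves state-of-the-art performance on the PU1K and PUGAN datasets, producing denser and higher-quality point clouds.

Insight: By modeling local geometric structure explicitly, PU-Gaussian gains robustness and geometric fidelity in point cloud upsampling, and the combination of sampling and refinement modules improves output quality.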

Abstract: Point clouds produced by 3D sensors are often sparse and noisy, posing challenges for tasks requiring dense and high-fidelity 3D representations. Prior work has explored both implicit feature-based upsampling and distance-function learning to address this, but often at the expense of geometric interpretability or robustness to input sparsity. To overcome these limitations, we propose PU-Gaussian, a novel upsampling network that models the local neighborhood around each point using anisotropic 3D Gaussian distributions. These Gaussians capture the underlying geometric structure, allowing us to perform upsampling explicitly in the local geometric domain by direct point sampling. The sampling process generates a dense, but coarse, point cloud. A subsequent refinement network adjusts the coarse output to produce a more uniform distribution and sharper edges. We perform extensive testing on the PU1K and PUGAN datasets, demonstrating that PU-Gaussian achieves state-of-the-art performance. We make code and model weights publicly available at https://github.com/mvg-inatech/PU-Gaussian.git.
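
The direct sampling step can be illustrated with the standard construction x = mean + L z, where L is the Cholesky factor of the covariance and z is standard normal. The sketch below covers only this sampling step with a hand-picked covariance; in the paper the Gaussians are predicted by a network, and the helper names here are hypothetical.

```python
import math
import random

def cholesky3(cov):
    """Lower-triangular L with L @ L^T == cov for a 3x3 SPD matrix."""
    L = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def sample_points(mean, cov, n, rng):
    """Draw n points from N(mean, cov) via x = mean + L z, z ~ N(0, I)."""
    L = cholesky3(cov)
    points = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        points.append([mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                       for i in range(3)])
    return points

rng = random.Random(0)
mean = [0.0, 0.0, 0.0]
# Anisotropic covariance: elongated along x, thin along y and z.
cov = [[4.0, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
dense = sample_points(mean, cov, 4000, rng)
```

An elongated covariance like this one spreads the new points along a local edge or surface direction rather than in a ball around the input point, which is what makes the anisotropic form useful for upsampling.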

[95] ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression

Tom Burgert,Oliver Stoll,Paolo Rota,Begüm Demir

Main category: cs.CV

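TL;DR: Using a framework that systematically suppresses shape, texture, and color cues, this paper challenges the assumption that CNNs are texture-biased. It finds that CNNs rely mainly on local shape features, that this reliance can be mitigated through training strategies or architectures, and that models in different domains show different reliance patterns.

Details Motivation: The assumption that CNNs are texture-biased may stem from limitations in experimental design. The paper re-examines it with a more rigorous experimental method.

Contribution: 1. Proposes a domain-agnostic framework that quantifies feature reliance through systematic suppression of cues. 2. Finds that CNNs are not inherently texture-biased but rely predominantly on local shape features. 3. Reveals systematic differences in feature reliance across domains (computer vision, medical imaging, remote sensing).

Method: A domain-agnostic framework suppresses shape, texture, and color signals in a controlled way, avoiding the confounds of forced-choice cue conflicts, and is used to evaluate both humans and neural networks.

Result: CNNs rely mainly on local shape features rather than texture, and this reliance can be substantially mitigated by modern training strategies or architectures (ConvNeXt, ViTs). Reliance patterns differ systematically between domains.

Insight: 1. Feature reliance in CNNs is flexible and can be changed through training or architecture. 2. Data characteristics in different domains lead to different reliance patterns, which underlines the importance of domain-specific analysis.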

Abstract: The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance towards texture. Code is available at https://github.com/tomburgert/feature-reliance.
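
Cue suppression can be pictured with two simple image transforms: converting to gray removes color, and shuffling tiles disrupts global shape while leaving local texture largely intact. These are common stand-ins chosen for illustration; the abstract does not state which transforms the framework uses, and the sketch assumes the image height and width are divisible by the patch size.

```python
import random

def suppress_color(image):
    """Replace each RGB pixel with its luma so only intensity remains."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = 0.299 * r + 0.587 * g + 0.114 * b  # ITU-R BT.601 luma weights
            new_row.append((y, y, y))
        out.append(new_row)
    return out

def shuffle_patches(image, patch, rng):
    """Permute non-overlapping patch x patch tiles to disrupt global shape."""
    h, w = len(image), len(image[0])
    coords = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    tiles = [[row[c:c + patch] for row in image[r:r + patch]] for r, c in coords]
    rng.shuffle(tiles)
    out = [[None] * w for _ in range(h)]
    for (r, c), tile in zip(coords, tiles):
        for dr, tile_row in enumerate(tile):
            for dc, px in enumerate(tile_row):
                out[r + dr][c + dc] = px
    return out

# Toy 4x4 image with a distinct red value per pixel.
img = [[(r * 4 + c, 0, 0) for c in range(4)] for r in range(4)]
gray = suppress_color(img)
shuffled = shuffle_patches(img, 2, random.Random(0))
```

Measuring how much accuracy drops under each transform, relative to the unmodified images, gives a per-cue reliance estimate without forcing a choice between conflicting cues.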

[96] An Anisotropic Cross-View Texture Transfer with Multi-Reference Non-Local Attention for CT Slice Interpolation

Kwang-Hyun Uhm,Hyunjun Cho,Sung-Hoo Hong,Seung-Won Jung

Main category: cs.CV

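TL;DR: The paper proposes an anisotropic cross-view texture transfer method built on multi-reference non-local attention for CT slice interpolation, substantially improving the quality of generated intermediate slices.

Details Motivation: Because of high storage and operation-time costs, clinical CT images are usually acquired with large slice thickness, yielding anisotropic volumes. Existing methods have not fully exploited this anisotropy.

Contribution: 1. A new cross-view texture transfer framework; 2. a multi-reference non-local attention module; 3. validation of the method's advantage on public CT datasets.

Method: High-resolution in-plane textures serve as references and are transferred to low-resolution through-plane images by the multi-reference non-local attention module, reconstructing high-frequency detail.

Result: Outperforms existing methods on public CT datasets, including a real-paired benchmark.

Insight: The anisotropy of CT volumes can be turned into an advantage, and cross-view texture transfer effectively improves interpolation quality.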

Abstract: Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volume has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation by fully utilizing the anisotropic nature of 3D CT volume. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at https://github.com/khuhm/ACVTT.
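
A rough sketch of what attending from through-plane features to several in-plane references might look like is given below. It is plain scaled dot-product attention over the pooled reference positions, which is an assumption: the abstract does not specify the module's internals, and the function names and feature shapes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_reference_attention(query_feats, ref_feats_list):
    """Aggregate reference features for every query position.

    query_feats: (Nq, C) features from a low-resolution through-plane image.
    ref_feats_list: list of (Nr, C) features from high-resolution in-plane images.
    Returns (Nq, C) features, each a weighted mix of all reference positions.
    """
    refs = np.concatenate(ref_feats_list, axis=0)         # pool positions from all references
    scale = np.sqrt(query_feats.shape[1])
    attn = softmax(query_feats @ refs.T / scale, axis=1)  # non-local: every query sees every position
    return attn @ refs
```

Because each output row is a convex combination of reference features, the transferred texture always stays within the range spanned by the references.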

[97] 4D Driving Scene Generation With Stereo Forcing

Hao Lu,Zhuang Ma,Guangfeng Jiang,Wenhang Ge,Bohan Li,Yuzhan Cai,Wenzhao Zheng,Yunpeng Zhang,Yingcong Chen

Main category: cs.CV

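TL;DR: PhiGenesis is a unified framework for 4D driving scene generation that combines geometric and temporal consistency and supports temporal extrapolation and spatial novel view synthesis.

Details Motivation: Current generative models struggle to synthesize dynamic 4D driving scenes that support both temporal extrapolation and spatial novel view synthesis without per-scene optimization.

Contribution: Proposes the PhiGenesis framework, which combines a geometry-guided video diffusion model with the Stereo Forcing strategy to address geometric exposure bias.

Method: Two stages: 1) feed-forward 4D reconstruction using a pre-trained video VAE with a newly designed range-view adapter; 2) generation of future views with a geometry-guided video diffusion model.

Result: State-of-the-art performance on appearance and geometric reconstruction, temporal generation, and novel view synthesis tasks.

Insight: Stereo Forcing improves temporal coherence in novel view synthesis by dynamically adjusting generative influence according to geometric uncertainty.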

Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. Bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometric-guided video diffusion model, using rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations. Our experimental results demonstrate that our method achieves state-of-the-art performance in both appearance and geometric reconstruction, temporal generation and novel view synthesis (NVS) tasks, while simultaneously delivering competitive performance in downstream evaluations. Homepage: https://jiangxb98.github.io/PhiGensis

[98] A co-evolving agentic AI system for medical imaging analysis

Songhao Li,Jonathan Xu,Tiancheng Bao,Yuxuan Liu,Yuchen Liu,Yihang Liu,Lilin Wang,Wenhui Lei,Sheng Wang,Yinuo Xu,Yan Cui,Jialu Yao,Shunsuke Koga,Zhi Huang

Main category: cs.CV

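TL;DR: TissueLab is a co-evolving agentic AI system for medical image analysis that integrates tools across multiple domains and supports real-time interaction and expert feedback, achieving strong performance and rapid adaptation to new tasks.

Details Motivation: The performance and adoption of agentic AI in medical image analysis remain limited by the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time expert feedback.

Contribution: Presents TissueLab, a co-evolving AI system that automatically plans explainable workflows, lets experts intervene and refine results in real time, and integrates tools across domains.

Method: Standardizes the inputs, outputs, and capabilities of tools so they can be invoked automatically; combines real-time expert feedback with active learning to keep improving models and decision strategies.

Result: Outperforms end-to-end vision-language models and other agentic AI systems on clinical tasks, and adapts quickly to unseen disease contexts.

Insight: Co-evolution and real-time feedback are key to improving the performance and adaptability of medical AI systems.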

Abstract: Agentic AI is rapidly advancing in healthcare and biomedical research. However, in medical image analysis, its performance and adoption remain limited due to the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time interactive expert feedback. Here we present “TissueLab”, a co-evolving agentic AI system that allows researchers to ask direct questions, automatically plan and generate explainable workflows, and conduct real-time analyses where experts can visualize intermediate results and refine them. TissueLab integrates tool factories across pathology, radiology, and spatial omics domains. By standardizing inputs, outputs, and capabilities of diverse tools, the system determines when and how to invoke them to address research and clinical questions. Across diverse tasks with clinically meaningful quantifications that inform staging, prognosis, and treatment planning, TissueLab achieves state-of-the-art performance compared with end-to-end vision-language models (VLMs) and other agentic AI systems such as GPT-5. Moreover, TissueLab continuously learns from clinicians, evolving toward improved classifiers and more effective decision strategies. With active learning, it delivers accurate results in unseen disease contexts within minutes, without requiring massive datasets or prolonged retraining. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging while establishing a foundation for the next generation of medical AI.

[99] HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy

Dayu Tan,Zhenpeng Xu,Yansen Su,Xin Peng,Chunhou Zheng,Weimin Zhong

Main category: cs.CV

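TL;DR: HiPerformer is a high-performance global-local segmentation model whose modular hierarchical architecture and dynamic feature fusion strategy address the feature inconsistency and information loss of existing methods.

Details Motivation: Both local detail and global context matter in medical image segmentation, but existing CNN-Transformer hybrid architectures fuse features in simple ways that lead to information conflict and loss.

Contribution: 1. A modular hierarchical architecture for dynamic multi-source feature fusion; 2. a Local-Global Feature Fusion (LGFF) module that alleviates feature inconsistency; 3. a Progressive Pyramid Aggregation (PPA) module that strengthens multi-scale feature representation.

Method: 1. The encoder uses the modular hierarchical architecture to fuse features dynamically; 2. the LGFF module integrates local and global information; 3. the PPA module replaces traditional skip connections to improve multi-scale feature representation.

Result: Experiments on eleven public datasets show that HiPerformer outperforms existing segmentation methods in accuracy and robustness.

Insight: Modular design and dynamic feature fusion help avoid information loss, and joint local-global optimization is key to better segmentation performance.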

Abstract: Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address the aforementioned challenges, we innovatively propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.

[100] PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

Chen Wang,Chuhao Chen,Yiming Huang,Zhiyang Dou,Yuan Liu,Jiatao Gu,Lingjie Liu

Main category: cs.CV

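TL;DR: PhysCtrl is a framework for controllable, physics-grounded video generation. A diffusion model conditioned on physical parameters and forces generates physically plausible 3D motion trajectories.

Details Motivation: Existing video generation models lack physical plausibility and 3D controllability, so a method that supports control through physical parameters and forces is needed.

Contribution: Proposes PhysCtrl, whose generative physics network learns the physical dynamics of multiple materials, combining a diffusion model with a spatiotemporal attention module to produce physically plausible video.

Method: A diffusion model learns the distribution of physical dynamics from a large-scale synthetic dataset, with a spatiotemporal attention module that emulates particle interactions and physics-based constraints applied during training.

Result: Experiments show that videos generated with PhysCtrl outperform existing methods in both visual quality and physical plausibility.

Insight: Combining physics simulation with generative models can markedly improve the physical plausibility and controllability of video generation.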

Abstract: Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl

[101] EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Xuan Ju,Tianyu Wang,Yuqian Zhou,He Zhang,Qing Liu,Nanxuan Zhao,Zhifei Zhang,Yijun Li,Yuanhao Cai,Shaoteng Liu,Daniil Pakhomov,Zhe Lin,Soo Ye Kim,Qiang Xu

Main category: cs.CV

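TL;DR: EditVerse is a unified framework for image and video generation and editing. It represents text, images, and video as a single token sequence and applies self-attention across modalities. To address the scarcity of video editing data, the authors design a data pipeline and introduce EditVerseBench, the first benchmark for instruction-based video editing. Experiments show that EditVerse surpasses existing open-source and commercial models and exhibits emergent cross-modal editing and generation abilities.

Details Motivation: Image generation and editing have converged on unified frameworks, but the video domain remains fragmented, mainly because of architectural limitations and data scarcity. EditVerse addresses these problems with a unified multimodal representation and self-attention.

Contribution: 1. EditVerse, which unifies image and video generation and editing in a single model; 2. a scalable data pipeline that addresses the scarcity of video editing data; 3. EditVerseBench, the first benchmark for instruction-based video editing.

Method: EditVerse represents text, images, and video as a unified token sequence and uses self-attention to achieve in-context learning, cross-modal knowledge transfer, and handling of inputs and outputs with arbitrary resolutions and durations.

Result: Experiments and user studies show that EditVerse surpasses existing open-source and commercial models and exhibits emergent editing and generation abilities across modalities.

Insight: EditVerse's results suggest that a unified multimodal representation with self-attention can overcome fragmentation in the video domain, and that an efficient data pipeline is essential when data is scarce.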

Abstract: Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.

cs.MM [Back]

[102] MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization

Jianxuan Yang,Xiaoran Yang,Lipan Zhang,Xinyue Guo,Zhao Wang,Gongping Huang

Main category: cs.MM

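TL;DR: MultiSoundGen is a new video-to-audio (V2A) generation framework. It uses SlowFast Contrastive Audio-Visual Pretraining (SF-CAVP) and direct preference optimization (DPO) to address semantic alignment and audio quality in multi-event scenarios, where it achieves state-of-the-art performance.

Details Motivation: Current V2A methods fall short in aligning semantics and capturing dynamic features in multi-event scenarios, and they lack quantitative preference optimization, which hurts generation quality. This study aims to address both problems.

Contribution: Proposes SF-CAVP, a unified dual-stream model that explicitly aligns semantic and dynamic features; introduces DPO into the V2A task and proposes AVP-RPO, which uses SF-CAVP as a reward model to optimize generation quality.

Method: Uses SF-CAVP for audio-visual pretraining, aligning semantic and dynamic features with a dual-stream architecture; applies DPO through AVP-RPO to quantify semantic-temporal alignment and audio quality.

Result: Experiments show that MultiSoundGen achieves state-of-the-art performance in multi-event scenarios, with gains in distribution matching, audio quality, semantic alignment, and temporal synchronization.

Insight: Combining SF-CAVP's dual-stream design with DPO is notably effective in multi-event scenarios and offers a new approach to complex V2A tasks.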

Abstract: Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations: the first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity; second, we integrate the DPO method into V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. The complete code and dataset will be released soon.
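
For readers unfamiliar with DPO, the standard per-pair loss can be sketched as below, together with one hypothetical way of turning reward-model scores into a preferred/rejected pair. The abstract does not describe how AVP-RPO builds its pairs, so `pick_preference_pair` (best versus worst candidate by score) is purely an illustrative assumption, as are the names and the default `beta`.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of log-probabilities.

    logp_*: log-likelihoods under the model being trained.
    ref_logp_*: log-likelihoods under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def pick_preference_pair(candidates, reward_scores):
    """Hypothetical pair construction: highest-scored candidate wins, lowest loses."""
    ranked = sorted(zip(reward_scores, candidates), key=lambda p: p[0], reverse=True)
    return ranked[0][1], ranked[-1][1]
```

In a V2A setting the candidates would be generated audio clips and the scores would come from an audio-visual reward model; the loss shrinks as the trained model raises the preferred clip's likelihood relative to the reference.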

cs.HC [Back]

[103] Human-AI Narrative Synthesis to Foster Shared Understanding in Civic Decision-Making

Cassandra Overney,Hang Jiang,Urooj Haider,Cassandra Moe,Jasmine Mangat,Frank Pantano,Effie G. McMillian,Paul Riggins,Nabeel Gillani

Main category: cs.HC

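TL;DR: The paper presents StoryBuilder, a human-AI collaborative narrative synthesis system for fostering shared understanding in civic decision-making. It generates first-person narratives that help community members relate across diverse perspectives; a controlled experiment shows that experience-grounded narratives generate more trust and respect than opinion-heavy ones.

Details Motivation: Traditional methods of synthesizing community feedback are overwhelmed by large volumes of data, which hinders shared understanding between constituents and civic leaders. The paper aims to improve this process through human-AI collaboration and to strengthen communication and understanding among community members.

Contribution: 1. StoryBuilder, a human-AI collaborative narrative synthesis system; 2. validation of the system in a real-world civic decision-making setting; 3. the finding that experience-grounded narratives foster more trust and respect than opinion-heavy narratives.

Method: 1. The StoryBuilder pipeline transforms community feedback into first-person narratives; 2. the mobile-friendly StorySharer interface presents the narratives; 3. the system is evaluated through a four-month field deployment, user studies with 21 community members, and a controlled experiment.

Result: 1. The field deployment shows that narratives helped community members relate across diverse perspectives; 2. the experiment shows that experience-grounded narratives generate more trust and respect.

Insight: Human-AI narrative synthesis can improve shared understanding in civic decision-making, and grounding narratives in experience strengthens mutual trust and respect among community members.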

Abstract: Community engagement processes in representative political contexts, like school districts, generate massive volumes of feedback that overwhelm traditional synthesis methods, creating barriers to shared understanding not only between civic leaders and constituents but also among community members. To address these barriers, we developed StoryBuilder, a human-AI collaborative pipeline that transforms community input into accessible first-person narratives. Using 2,480 community responses from an ongoing school rezoning process, we generated 124 composite stories and deployed them through a mobile-friendly StorySharer interface. Our mixed-methods evaluation combined a four-month field deployment, user studies with 21 community members, and a controlled experiment examining how narrative composition affects participant reactions. Field results demonstrate that narratives helped community members relate across diverse perspectives. In the experiment, experience-grounded narratives generated greater respect and trust than opinion-heavy narratives. We contribute a human-AI narrative synthesis system and insights on its varied acceptance and effectiveness in a real-world civic context.

eess.AS [Back]

[104] Advancing Speech Summarization in Multi-modal LLMs with Reinforcement Learning

Shaoshi Ling,Gang Liu,Guoli Ye,Jinyu Li

Main category: eess.AS

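TL;DR: The paper proposes a multi-stage reinforcement learning training framework that substantially improves the speech summarization performance of multi-modal large language models (MLLMs) and narrows the gap with text-based LLMs.

Details Motivation: With the rapid growth of spoken and audiovisual data, speech summarization has become a key technology for understanding spoken content. MLLMs can generate text summaries directly from speech, but their performance still lags behind text-based LLMs, which limits practical deployment.

Contribution: A multi-stage reinforcement learning training framework that substantially improves MLLM speech summarization and narrows the gap with text-based LLMs.

Method: A multi-stage reinforcement learning training framework optimizes the model for the speech summarization task, improving the quality of the generated summaries.

Result: The model improves substantially over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.

Insight: Optimizing MLLMs for speech summarization with reinforcement learning underscores the value of combining speech and text in multimodal models and shows the potential of reinforcement learning for generation tasks.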

Abstract: Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.

cs.AI [Back]

[105] Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning

Sai Teja Reddy Adapala

Main category: cs.AI

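TL;DR: The paper studies how cognitive load limits the performance of large language models (LLMs) and proposes a theoretical framework for analyzing how context saturation and attentional residue affect reasoning. On the ICE benchmark, multi-hop reasoning performance varies widely across models, and smaller open-source models fail completely, scoring 0% in every condition.

Details Motivation: LLMs perform well on static tasks, but their fragility in dynamic, information-rich environments is not well understood. The author studies the effect of cognitive load on reasoning to expose the key mechanisms behind performance degradation.

Contribution: 1. A formal theory of computational cognitive load; 2. the ICE benchmark, which systematically manipulates context saturation and attentional residue; 3. experiments that reveal performance differences across models on multi-hop reasoning tasks.

Method: The author designs the Interleaved Cognitive Evaluation (ICE) benchmark, which controls context saturation and task-switching interference to evaluate models on multi-hop reasoning tasks. The experiments cover five instruction-tuned models, with 10 replications per item across 200 questions.

Result: Smaller open-source models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) achieve 0% accuracy across all conditions, including clean controls. Gemini-2.0-Flash-001 reaches 85% accuracy in control conditions but degrades significantly under context saturation.

Insight: Cognitive load is a key contributor to LLM reasoning failures. Dynamic, cognitive-aware stress testing such as ICE is essential for evaluating the true resilience and safety of AI systems.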

Abstract: The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.

[106] UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

Cheng Qian,Zuxin Liu,Akshara Prabhakar,Jielin Qiu,Zhiwei Liu,Haolin Chen,Shirley Kokane,Heng Ji,Weiran Yao,Shelby Heinecke,Silvio Savarese,Caiming Xiong,Huan Wang

Main category: cs.AI

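TL;DR: The paper proposes UserRL, a framework that trains user-centric agent models through standardized gym environments and simulated users. Using the GRPO algorithm, it analyzes how reward assignment and trajectory scoring affect learning, and finds that SFT cold start and trajectory scoring design are critical for efficient multi-turn interaction.

Details Motivation: Reinforcement learning has shown promise for dynamic, multi-turn interaction tasks, but training agents that are truly user-centric remains difficult because of the diversity and dynamics of user interaction.

Contribution: Proposes the UserRL framework, which pairs standardized gym environments with simulated users, and systematically analyzes how reward assignment and trajectory scoring affect learning.

Method: Uses the GRPO algorithm while varying turn-level reward assignment and trajectory-level scoring, with experiments on both open-source simulated users (e.g., Qwen3-32B) and commercial ones (e.g., GPT-4o).

Result: SFT cold start is critical for initial interaction ability, deliberate trajectory scoring makes multi-turn interaction more efficient and effective, and open-source simulated users are a cost-effective option.

Insight: Reward design and the choice of user simulator matter as much as model scale, and UserRL offers a practical path toward robust user-centric agent models.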

Abstract: Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet, the ultimate value of such agents lies in their ability to assist users, a setting where diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitate training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that careful design of reward shaping and user simulation choice is as crucial as model scale, and establish UserRL as a practical pathway for developing robust user-centric agentic models. All code and data are public for future research.
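
To make the two levels of reward concrete, the sketch below collapses turn-level rewards into a trajectory-level score and then computes GRPO-style group-relative advantages. The discounted sum is only one hypothetical scoring choice (the abstract says several formulations were compared but does not list them), and the function names, `gamma`, and `eps` are illustrative assumptions.

```python
import numpy as np

def trajectory_score(turn_rewards, gamma=1.0):
    """Collapse per-turn rewards into one score via a discounted sum (an assumed formulation)."""
    return sum(r * gamma**t for t, r in enumerate(turn_rewards))

def group_relative_advantages(scores, eps=1e-8):
    """GRPO-style advantages: standardize each rollout's score within its sampled group."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + eps)
```

With `gamma < 1`, rewards earned in earlier turns count more, so a rollout that satisfies the user quickly scores higher than one that reaches the same outcome late; that is one way a scoring rule can favor efficient multi-turn interaction.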

eess.IV [Back]

[107] Frequency-Aware Ensemble Learning for BraTS 2025 Pediatric Brain Tumor Segmentation

Yuxiao Yi,Qingyao Zhuang,Zhi-Qin John Xu

Main category: eess.IV

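TL;DR: To address the particular challenges of pediatric brain tumor segmentation, the paper proposes an ensemble of nnU-Net, Swin UNETR, and HFF-Net, improved through adjustable initialization scales, transfer learning, and frequency domain decomposition, and reports its Dice scores on the BraTS-PED 2025 challenge.

Details Motivation: The rarity and heterogeneity of pediatric brain tumors make segmentation uniquely challenging, yet accurate segmentation is critical for clinical diagnosis and treatment planning.

Contribution: An ensemble of three models (nnU-Net, Swin UNETR, HFF-Net), with performance improved through adjustable initialization scales, transfer learning, and frequency domain decomposition.

Method: 1) Adjustable initialization scales to control nnU-Net complexity; 2) transfer learning from BraTS 2021 pre-trained models to improve Swin UNETR's generalization; 3) frequency domain decomposition in HFF-Net to separate low-frequency tissue contours from high-frequency texture details.

Result: On BraTS-PED 2025, the ensemble achieves Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT).

Insight: Frequency domain decomposition and model ensembling offer a new angle on pediatric brain tumor segmentation, and transfer learning and initialization tuning can improve performance on small datasets.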

Abstract: Pediatric brain tumor segmentation presents unique challenges due to the rarity and heterogeneity of these malignancies, yet remains critical for clinical diagnosis and treatment planning. We propose an ensemble approach integrating nnU-Net, Swin UNETR, and HFF-Net for the BraTS-PED 2025 challenge. Our method incorporates three key extensions: adjustable initialization scales for optimal nnU-Net complexity control, transfer learning from BraTS 2021 pre-trained models to enhance Swin UNETR’s generalization on the pediatric dataset, and frequency domain decomposition for HFF-Net to separate low-frequency tissue contours from high-frequency texture details. Our final ensemble combines nnU-Net ($\gamma=0.7$), fine-tuned Swin UNETR, and HFF-Net, achieving Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT).
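
Two of the ingredients above are easy to illustrate: the Dice score used for reporting, and a split of an image into low- and high-frequency parts. The sketch below is not HFF-Net's decomposition, which the abstract does not specify; it swaps in a simple circular low-pass mask in the FFT domain, and the function names and the `radius` parameter are hypothetical.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def split_frequencies(image, radius=4):
    """Split a 2D image into low- and high-frequency parts (assumed FFT-mask approach)."""
    f = np.fft.fftshift(np.fft.fft2(image))  # move the zero frequency to the centre
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real  # smooth structure such as contours
    high = image - low                                    # residual fine texture
    return low, high
```

By construction `low + high` reproduces the input, so the two parts can be processed by separate branches without losing information.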

[108] Ensuring Reliable Participation in Subjective Video Quality Tests Across Platforms

Babak Naderi,Ross Cutler

Main category: eess.IV

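TL;DR: The paper examines how to ensure participant reliability across platforms in subjective video quality tests, proposes detectors for remote-desktop users, and compares two mainstream crowdsourcing platforms on their susceptibility and mitigation.

Details Motivation: Subjective video quality assessment is the gold standard for measuring end-user experience, but crowdsourced tests can collect unreliable data from workers who ignore instructions or game rewards, in particular through remote-desktop connections or exploits of video metadata.

Contribution: Proposes objective and subjective detectors for identifying remote-desktop users, and compares two crowdsourcing platforms on susceptibility and mitigation under realistic test conditions and task designs.

Method: Combines objective and subjective methods to detect remote-desktop users and runs experiments on two mainstream crowdsourcing platforms under realistic test conditions.

Result: The study shows that remote-desktop use biases test results and proposes detectors for it.

Insight: Detecting and mitigating unreliable participant behavior can improve the accuracy and reliability of subjective video quality assessment.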

Abstract: Subjective video quality assessment (VQA) is the gold standard for measuring end-user experience across communication, streaming, and UGC pipelines. Beyond high-validity lab studies, crowdsourcing offers accurate, reliable, faster, and cheaper evaluation, but suffers from unreliable submissions by workers who ignore instructions or game rewards. Recent tests reveal sophisticated exploits of video metadata and rising use of remote-desktop (RD) connections, both of which bias results. We propose objective and subjective detectors for RD users and compare two mainstream crowdsourcing platforms on their susceptibility and mitigation under realistic test conditions and task designs.

cs.CE [Back]

[109] Multimodal Language Models with Modality-Specific Experts for Financial Forecasting from Interleaved Sequences of Text and Time Series

Ross Koval,Nicholas Andrews,Xifeng Yan

Main category: cs.CE

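TL;DR: The paper proposes a multimodal language model for financial forecasting that processes interleaved sequences of text and time series with modality-specific experts and introduces a cross-modal alignment framework, achieving state-of-the-art forecasting performance and meaningful economic gains.

Details Motivation: Text and time series data offer complementary information about financial markets, but integrating these interleaved modalities effectively to improve forecasting remains challenging.

Contribution: 1. A unified neural architecture that models interleaved sequences with modality-specific experts; 2. a cross-modal alignment framework that focuses on the most informative tokens; 3. an interpretability method that reveals the value of time series context.

Method: Modality-specific experts learn the patterns of text and time series separately, and cross-modal alignment with a salient token weighting mechanism integrates information across modalities.

Result: Strong results on a large-scale financial forecasting task, surpassing a range of unimodal and multimodal baselines, with economic gains in investment simulations.

Insight: Cross-modal alignment with salient token weighting improves the forecasting ability of multimodal models, and time series context carries real value for financial forecasting.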

Abstract: Text and time series data offer complementary views of financial markets: news articles provide narrative context about company events, while stock prices reflect how markets react to those events. However, despite their complementary nature, effectively integrating these interleaved modalities for improved forecasting remains challenging. In this work, we propose a unified neural architecture that models these interleaved sequences using modality-specific experts, allowing the model to learn unique time series patterns, while still enabling joint reasoning across modalities and preserving pretrained language understanding capabilities. To further improve multimodal understanding, we introduce a cross-modal alignment framework with a salient token weighting mechanism that learns to align representations across modalities with a focus on the most informative tokens. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task, achieving state-of-the-art performance against a wide variety of strong unimodal and multimodal baselines. We develop an interpretability method that reveals insights into the value of time series context and reinforces the design of our cross-modal alignment objective. Finally, we demonstrate that these improvements translate to meaningful economic gains in investment simulations.

cs.RO [Back]

[110] HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

Alessandro Saviolo,Jeffrey Mao,Giuseppe Loianno

Main category: cs.RO

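TL;DR: HUNT is a real-time framework that uses instantaneous relative frames to achieve high-speed UAV navigation and target tracking in unstructured environments, addressing the absence of global localization.

Details Motivation: Search and rescue missions require UAVs to fly at high speed through unknown unstructured environments and to track targets, but achieving both under degraded sensing and without global localization remains an open problem.

Contribution: HUNT is a unified real-time framework that combines high-speed navigation and target tracking in a single relative navigation formulation, addressing navigation when no target is visible.

Method: HUNT defines navigation objectives from onboard instantaneous observables (such as attitude, altitude, and velocity), enabling reactive high-speed flight; once a target is detected, the same perception-control pipeline switches seamlessly to tracking.

Result: Outdoor experiments in dense forests, container compounds, and search-and-rescue operations show that HUNT remains robust where global methods fail.

Insight: The relative navigation paradigm gives high-speed UAV missions without global localization a workable solution.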

Abstract: Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

[111] Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action

Sacha Morin,Kumaraditya Gupta,Mahtab Sandhu,Charlie Gauthier,Francesco Argenziano,Kirsty Ellis,Liam Paull

Main category: cs.RO

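TL;DR: ASP is a language-conditioned robot policy framework built on modern scene representations. By explicitly reasoning about object affordances and the spatial semantics of the scene, it executes open-vocabulary queries zero-shot and carries out complex instructions.

Details Motivation: End-to-end policy models struggle with complex instructions and new scenes. The paper instead uses an explicit scene representation as a queryable interface between the robot and the world, with query results guiding motion planning.

Contribution: Proposes Agentic Scene Policies (ASP), which uses the semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy.

Method: ASP explicitly reasons about object affordances and the spatial semantics of the scene, using advanced querying capabilities to execute open-vocabulary queries and complex skills.

Result: Experiments compare ASP with VLAs on tabletop manipulation and show that ASP can handle room-level queries through affordance-guided navigation and a scaled-up scene representation.

Insight: Combining explicit scene representations with affordance reasoning can improve a robot's ability to handle complex instructions and new scenes.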

Abstract: Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-action models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation and a scaled-up scene representation. (Project page: https://montrealrobotics.ca/agentic-scene-policies.github.io/)

[112] EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data

Ryan Punamiya,Dhruv Patel,Patcharapong Aphiwetsa,Pranav Kuppili,Lawrence Y. Zhu,Simar Kareer,Judy Hoffman,Danfei Xu

Main category: cs.RO

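TL;DR: EgoBridge is a domain adaptation framework that aligns the latent spaces of human and robot data, transferring knowledge from egocentric human data to robot imitation learning and substantially raising task success rates.

Details Motivation: Egocentric human data is a rich resource for robot imitation learning, but domain gaps in visual appearance, sensor modalities, and kinematics make direct transfer ineffective. EgoBridge targets this problem.

Contribution: Introduces EgoBridge, a unified co-training framework that uses Optimal Transport (OT) to align the policy latent spaces of human and robot data, enabling knowledge transfer and task generalization.

Method: An OT-based discrepancy measure over the joint policy latent features and actions is used to learn observation representations that align the two domains while preserving action-relevant information.

Result: On three real-world single-arm and bimanual manipulation tasks, EgoBridge improves absolute policy success rate by 44% over baselines and generalizes to new objects, scenes, and tasks seen only in human data.

Insight: Aligning the domains while preserving action information shows that knowledge extracted from human data can be put to use in robot tasks.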

Abstract: Egocentric human experience data presents a vast resource for scaling up end-to-end imitation learning for robotic manipulation. However, significant domain gaps in visual appearance, sensor modalities, and kinematics between humans and robots impede knowledge transfer. This paper presents EgoBridge, a unified co-training framework that explicitly aligns the policy latent spaces between human and robot data using domain adaptation. Through a measure of discrepancy on the joint policy latent features and actions based on Optimal Transport (OT), we learn observation representations that not only align between the human and robot domains but also preserve the action-relevant information critical for policy learning. EgoBridge achieves a significant absolute policy success rate improvement of 44% over human-augmented cross-embodiment baselines in three real-world single-arm and bimanual manipulation tasks. EgoBridge also generalizes to new objects, scenes, and tasks seen only in human data, where baselines fail entirely. Videos and additional information can be found at https://ego-bridge.github.io
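
To make the idea of an OT-based discrepancy on joint latent features and actions more concrete, the sketch below computes an entropy-regularized OT cost between two small sets of (latent, action) vectors using Sinkhorn iterations. This is an illustration only, not EgoBridge's implementation: the Sinkhorn solver, the squared-Euclidean cost, the `eps` value, and all data and function names are assumptions chosen for the example.

```python
import math

def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def sinkhorn_cost(source, target, eps=0.1, iters=200):
    """Entropy-regularized OT cost between two uniformly weighted point sets."""
    n, m = len(source), len(target)
    a = [1.0 / n] * n  # uniform mass on source samples
    b = [1.0 / m] * m  # uniform mass on target samples
    cost = [[squared_distance(s, t) for t in target] for s in source]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan

# Hypothetical samples: each vector is a 2-d latent feature followed by a 1-d action.
human = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
robot_aligned = [(0.1, 0.0, 0.0), (1.0, 0.1, 1.0)]
robot_shifted = [(0.0, 2.0, 0.0), (1.0, 2.0, 1.0)]

aligned_cost, aligned_plan = sinkhorn_cost(human, robot_aligned)
shifted_cost, _ = sinkhorn_cost(human, robot_shifted)
print(aligned_cost < shifted_cost)
```

Because the action is part of each vector, two samples are cheap to match only when both their latent features and their actions are close. In a training setting a quantity like this would act as a loss term that shrinks as the two domains align.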

cs.LG [Back]

[113] VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

Guochao Jiang,Wenfeng Feng,Guofeng Quan,Chuzhan Hao,Yuewei Zhang,Guohua Liu,Hao Wang

Main category: cs.LG

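TL;DR: VCRL is a variance-based curriculum reinforcement learning framework that dynamically adjusts the difficulty of training samples to improve large language models (LLMs) on mathematical reasoning tasks.

Details Motivation: Existing rollout-based reinforcement learning methods (e.g., GRPO, DAPO, GSPO) do not explicitly account for an LLM's ability to learn from samples of differing difficulty, whereas human cognition proceeds from easy to difficult.

Contribution: Proposes the VCRL framework, which uses the variance of rollout group rewards to dynamically control training sample difficulty, better matching how LLMs learn.

Method: Sample difficulty is judged from the variance of rollout group rewards: samples that are too easy or too difficult have lower variance, while moderately difficult samples have higher variance. Training difficulty is adjusted dynamically on this basis.

Result: Experiments on five mathematical benchmarks and two models show VCRL outperforming existing LLM reinforcement learning baselines.

Insight: Reward variance partly reflects sample difficulty, and adjusting difficulty dynamically can improve LLM performance more efficiently.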

Abstract: Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly consider LLMs’ learning ability for samples of different difficulty levels, which runs contrary to the human cognitive process of working through mathematical reasoning tasks from easy to difficult. Intuitively, we find that the variance of the rollout group’s reward in RLVR partly reflects the difficulty of the current sample for LLMs. Samples that are too easy or too difficult have a lower variance, while samples with moderate difficulty have a higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models reveal the advantages of VCRL over the current LLM RL baselines.
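
The variance observation can be checked with a small worked example. For binary (0/1) rewards with success rate p, the group variance is p(1 - p), which is zero when every rollout succeeds or fails and peaks at p = 0.5. The sketch below computes this for three hypothetical rollout groups and keeps the prompts whose variance clears a threshold. The threshold value and the selection rule are assumptions for illustration; the abstract does not specify how VCRL uses the variance.

```python
from statistics import pvariance

def group_reward_variance(rewards):
    """Population variance of the rewards from one rollout group."""
    return pvariance(rewards)

def select_by_variance(groups, threshold):
    """Keep prompts whose rollout-group reward variance is at least `threshold`."""
    return [name for name, rewards in groups.items()
            if group_reward_variance(rewards) >= threshold]

# Hypothetical rollout groups: 8 rollouts per prompt, reward 1.0 if the answer verified.
groups = {
    "too_easy": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # p = 1.0
    "moderate": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # p = 0.5
    "too_hard": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # p = 0.125
}

for name, rewards in groups.items():
    print(name, group_reward_variance(rewards))

# The 0.2 threshold is an illustrative assumption, not a value from the paper.
selected = select_by_variance(groups, threshold=0.2)
print(selected)
```

Here only the moderate prompt is kept: its variance is 0.25, against 0 for the all-correct group and about 0.109 for the mostly failing one.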

[114] PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning

Xueliang Zhao,Wei Wu,Jian Guan,Zhuocheng Gong,Lingpeng Kong

Main category: cs.LG

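TL;DR: PromptCoT 2.0 is a scalable framework that optimizes prompt synthesis with an EM loop, generating harder and more diverse training problems and significantly improving the reasoning ability of large language models.

Details Motivation: A shortage of high-quality training problems limits improvements in LLM reasoning. PromptCoT 2.0 addresses this by automatically synthesizing more challenging prompts.

Contribution: 1. Proposes the PromptCoT 2.0 framework, which optimizes prompt synthesis with an EM loop; 2. Provides two post-training regimes (Self-Play and SFT) to validate its effectiveness; 3. Achieves state-of-the-art results at the 30B scale on multiple benchmarks.

Method: An expectation-maximization (EM) loop iteratively refines the rationales that guide prompt construction, producing harder and more diverse problems. The synthetic prompts support both Self-Play and SFT training.

Result: In self-play, Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale; in SFT, Qwen2.5-7B-Instruct trained solely on synthetic prompts surpasses models trained on human or hybrid data.

Insight: Prompt synthesis can serve as a new axis for scaling LLM reasoning, and PromptCoT 2.0 offers a scalable foundation for open-source models.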

Abstract: Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.

cs.DB [Back]

[115] STARQA: A Question Answering Dataset for Complex Analytical Reasoning over Structured Databases

Mounica Maddela,Lingjue Xie,Daniel Preotiuc-Pietro,Mausam

Main category: cs.DB

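TL;DR: STARQA is the first public human-created dataset of complex analytical reasoning questions. It focuses on question answering that requires operations such as calculations over aggregate analytics and time series analysis, and shows that combining SQL with Python improves performance.

Details Motivation: In existing text-to-SQL benchmarks, question complexity is limited by the expressiveness of the query language, and none focus on complex analytical reasoning. STARQA fills this gap.

Contribution: 1) Introduces STARQA, the first public dataset of complex analytical reasoning questions; 2) Proposes Text2SQLCode, an approach that combines SQL (data fetching) with Python (reasoning).

Method: The task is decomposed into a combination of SQL for data fetching and Python for complex reasoning, drawing on the strengths of each.

Result: The combined approach outperforms using SQL alone, but the dataset remains challenging for current state-of-the-art LLMs.

Insight: Combining SQL and Python handles complex analytical tasks more naturally, highlighting the potential of dividing work across the two languages.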

Abstract: Semantic parsing methods for converting text to SQL queries enable question answering over structured data and can greatly benefit analysts who routinely perform complex analytics on vast data stored in specialized relational databases. Although several benchmarks measure text-to-SQL abilities, the complexity of their questions is inherently limited by the level of expressiveness in query languages, and none focus explicitly on questions involving complex analytical reasoning, which require operations such as calculations over aggregate analytics, time series analysis or scenario understanding. In this paper, we introduce STARQA, the first public human-created dataset of complex analytical reasoning questions and answers on three specialized-domain databases. In addition to generating SQL directly using LLMs, we evaluate a novel approach (Text2SQLCode) that decomposes the task into a combination of SQL and Python: SQL is responsible for data fetching, and Python more naturally performs reasoning. Our results demonstrate that identifying and combining the abilities of SQL and Python is beneficial compared to using SQL alone, yet the dataset still remains quite challenging for the existing state-of-the-art LLMs.
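
The SQL-plus-Python decomposition can be illustrated with a toy question: "Which month had the largest month-over-month revenue growth?" In the sketch below, SQL handles the fetching and aggregation and Python handles the time series reasoning. The `sales` table, its columns, the data, and the function names are all hypothetical; this is not the Text2SQLCode pipeline itself, only an example of the division of labour the abstract describes.

```python
import sqlite3

def fetch_monthly_revenue(conn):
    """SQL step: aggregate raw rows into one revenue total per month."""
    return conn.execute(
        "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
    ).fetchall()

def largest_growth(monthly):
    """Python step: find the month with the largest growth over the previous month."""
    best = None
    for (_, prev), (month, cur) in zip(monthly, monthly[1:]):
        growth = (cur - prev) / prev
        if best is None or growth > best[1]:
            best = (month, growth)
    return best

# Hypothetical in-memory database standing in for a specialized-domain database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2024-01", 100.0), ("2024-01", 50.0),
    ("2024-02", 180.0),
    ("2024-03", 120.0), ("2024-03", 150.0),
])

monthly = fetch_monthly_revenue(conn)
best_month, best_growth = largest_growth(monthly)
print(monthly)
print(best_month, best_growth)
```

The growth comparison could be written in SQL with window functions, but expressing it as a short Python loop keeps the query simple and leaves the multi-step reasoning in a general-purpose language.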