Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 64]
- cs.AI [Total: 2]
- cs.IR [Total: 2]
- cs.LG [Total: 7]
- cs.DB [Total: 1]
- cs.CR [Total: 1]
- eess.AS [Total: 1]
- cs.RO [Total: 2]
- cs.MA [Total: 1]
- eess.IV [Total: 8]
cs.CL
[1] MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered
Imran Mirza,Cole Huang,Ishwara Vasista,Rohan Patil,Asli Akalin,Sean O’Brien,Kevin Zhu
Main category: cs.CL
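TL;DR: This paper introduces MALIBU, a benchmark for assessing how LLM-based multi-agent systems implicitly reinforce social biases and stereotypes. It quantifies these biases and finds that mitigation can favor marginalized personas rather than achieve neutrality.
Details
Motivation: Multi-agent systems are increasingly used for persona-based interactions, but without careful design they can reinforce implicit biases in LLMs, raising concerns about fairness and equitable representation.
Contribution: Proposes the MALIBU benchmark, which quantifies bias in LLM-based multi-agent systems through scenario-based assessments, and finds that bias mitigation may favor marginalized personas over true neutrality.
Method: MALIBU uses a scenario-based, two-phase evaluation. Judges first score responses labeled with specific demographic personas across four metrics, then compare paired responses assigned to different personas and select the better one.
Result: The study finds bias in LLM-generated outputs and shows that bias mitigation strategies may favor marginalized personas instead of achieving true neutrality.
Insight: Multi-agent systems need more nuanced bias detection, balanced fairness strategies, and transparent evaluation benchmarks to ensure fairness and equitable representation.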
Abstract: Multi-agent systems, which consist of multiple AI models interacting within a shared environment, are increasingly used for persona-based interactions. However, if not carefully designed, these systems can reinforce implicit biases in large language models (LLMs), raising concerns about fairness and equitable representation. We present MALIBU, a novel benchmark developed to assess the degree to which LLM-based multi-agent systems implicitly reinforce social biases and stereotypes. MALIBU evaluates bias in LLM-based multi-agent systems through scenario-based assessments. AI models complete tasks within predefined contexts, and their responses undergo evaluation by an LLM-based multi-agent judging system in two phases. In the first phase, judges score responses labeled with specific demographic personas (e.g., gender, race, religion) across four metrics. In the second phase, judges compare paired responses assigned to different personas, scoring them and selecting the superior response. Our study quantifies biases in LLM-generated outputs, revealing that bias mitigation may favor marginalized personas over true neutrality, emphasizing the need for nuanced detection, balanced fairness strategies, and transparent evaluation benchmarks in multi-agent systems.
[2] Matching and Linking Entries in Historical Swedish Encyclopedias
Simon Börjesson,Erik Ersmark,Pierre Nugues
Main category: cs.CL
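TL;DR: This paper studies two editions of the 19th- and 20th-century Swedish encyclopedia Nordisk familjebok. It matches entries using semantic sentence embeddings, extracts geographical entries with a transformer-based classifier, and finds that the geographic focus shifted away from Europe toward North America, Africa, and other regions between the first and second editions.
Details
Motivation: To analyze how encyclopedia entries change over time as knowledge evolves, and in particular how geographical entries reflect changes in the social and political context.
Contribution: (1) A method for matching entries with semantic sentence embeddings and classifying geographical entries; (2) evidence of a small but significant shift in geographic focus that reflects historical events.
Method: (1) Resegment the raw text into entries; (2) match entries across editions using semantic sentence embeddings; (3) extract geographical entries with a transformer-based classifier and link them to Wikidata.
Result: From the first edition (1876-1899) to the second (1904-1926), the geographic focus shifted away from Europe and toward North America, Africa, Asia, Australia, and northern Scandinavia, which the authors attribute to the First World War and the rise of new powers.
Insight: Encyclopedias act as a mirror of changing knowledge, and modern NLP techniques offer an effective way to analyze historical texts.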
Abstract: The Nordisk familjebok is a Swedish encyclopedia from the 19th and 20th centuries. It was written by a team of experts and aimed to be an intellectual reference, stressing precision and accuracy. This encyclopedia had four main editions remarkable by their size, ranging from 20 to 38 volumes. As a consequence, the Nordisk familjebok had a considerable influence in universities, schools, the media, and society overall. As new editions were released, the selection of entries and their content evolved, reflecting intellectual changes in Sweden. In this paper, we used digitized versions from Project Runeberg. We first resegmented the raw text into entries and matched pairs of entries between the first and second editions using semantic sentence embeddings. We then extracted the geographical entries from both editions using a transformer-based classifier and linked them to Wikidata. This enabled us to identify geographic trends and possible shifts between the first and second editions, written between 1876-1899 and 1904-1926, respectively. Interpreting the results, we observe a small but significant shift in geographic focus away from Europe and towards North America, Africa, Asia, Australia, and northern Scandinavia from the first to the second edition, confirming the influence of the First World War and the rise of new powers. The code and data are available on GitHub at https://github.com/sibbo/nordisk-familjebok.
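The entry-matching step described above can be illustrated with a minimal sketch, assuming each entry has already been encoded as a vector. The toy three-dimensional vectors, the `match_entries` helper, and the 0.8 similarity threshold are illustrative assumptions and not the authors' pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_entries(first_edition, second_edition, threshold=0.8):
    """For each first-edition entry, return the most similar second-edition
    entry, or None when no candidate reaches the (assumed) threshold."""
    matches = {}
    for name_a, vec_a in first_edition.items():
        best_name, best_sim = None, threshold
        for name_b, vec_b in second_edition.items():
            sim = cosine(vec_a, vec_b)
            if sim >= best_sim:
                best_name, best_sim = name_b, sim
        matches[name_a] = best_name
    return matches

# Toy "embeddings" (hypothetical); real sentence embeddings have hundreds of dimensions.
first = {"Stockholm": [0.9, 0.1, 0.0], "Algebra": [0.0, 0.2, 0.9]}
second = {"Stockholm (stad)": [0.8, 0.2, 0.1], "Botanik": [0.1, 0.9, 0.1]}
result = match_entries(first, second)
```

An entry with no sufficiently similar counterpart maps to None, which is one simple way to represent entries that were dropped or added between editions.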
[3] Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening
Cindy Lie Tabuse,David Restepo,Carolina Gracitelli,Fernando Korn Malerbi,Caio Regatieri,Luis Filipe Nakayama
Main category: cs.CL
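TL;DR: Evaluates GPT-4 on multimodal simulated ophthalmic decision-making (diabetic retinopathy and glaucoma screening) and finds moderate performance on basic tasks but a lack of precision on complex ones.
Details
Motivation: To explore the clinical reasoning ability of large language models (LLMs) in ophthalmology, specifically diabetic retinopathy (DR) and glaucoma screening based on descriptions of retinal fundus photographs.
Contribution: Evaluates GPT-4's performance on simulated ophthalmic clinical decisions, analyzes its behavior across tasks, and tests the effect of adding real or synthetic clinical metadata.
Method: Using 300 annotated retinal fundus images, GPT-4 received structured text descriptions of each image and was asked to assign an ICDR severity score, recommend DR referral, and estimate the cup-to-disc ratio for glaucoma referral. Performance was measured with accuracy, F1 scores, and Cohen's kappa.
Result: GPT-4 performed moderately on ICDR classification (accuracy 67.5%, kappa 0.25) and better on binary DR referral (accuracy 82.3%, F1 0.54, kappa 0.44). Glaucoma referral was poor despite an accuracy of about 78%, with F1 below 0.04 and kappa below 0.03. Adding metadata had no significant effect.
Insight: GPT-4 is not suitable for clinical use, but LLMs may assist in supporting workflows such as education, documentation, or image annotation.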
Abstract: Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4’s ability to interpret structured textual descriptions of retinal fundus photographs and simulate clinical decisions for diabetic retinopathy (DR) and glaucoma screening, including the impact of adding real or synthetic clinical metadata. We conducted a retrospective diagnostic validation study using 300 annotated fundus images. GPT-4 received structured prompts describing each image, with or without patient metadata. The model was tasked with assigning an ICDR severity score, recommending DR referral, and estimating the cup-to-disc ratio for glaucoma referral. Performance was evaluated using accuracy, macro and weighted F1 scores, and Cohen’s kappa. McNemar’s test and change rate analysis were used to assess the influence of metadata. GPT-4 showed moderate performance for ICDR classification (accuracy 67.5%, macro F1 0.33, weighted F1 0.67, kappa 0.25), driven mainly by correct identification of normal cases. Performance improved in the binary DR referral task (accuracy 82.3%, F1 0.54, kappa 0.44). For glaucoma referral, performance was poor across all settings (accuracy ~78%, F1 <0.04, kappa <0.03). Metadata inclusion did not significantly affect outcomes (McNemar p > 0.05), and predictions remained consistent across conditions. GPT-4 can simulate basic ophthalmic decision-making from structured prompts but lacks precision for complex tasks. While not suitable for clinical use, LLMs may assist in education, documentation, or image annotation workflows in ophthalmology.
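The glaucoma result pairs an accuracy near 78% with near-zero F1 and kappa. A minimal sketch of how that can happen on imbalanced data follows. The labels below are hypothetical toy data (22 positives out of 100, with a model that never predicts the positive class) chosen only to illustrate the metrics; they are not the study's data.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, F1 for the positive class, and Cohen's kappa for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_expected = p_yes + p_no
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0
    return accuracy, f1, kappa

# Hypothetical imbalanced data: 22 referable cases, 78 non-referable.
y_true = [1] * 22 + [0] * 78
y_pred = [0] * 100  # a degenerate model that never recommends referral
accuracy, f1, kappa = binary_metrics(y_true, y_pred)
```

Here accuracy is 0.78 while F1 and kappa are both 0, which is why the abstract reports all three rather than accuracy alone.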
[4] Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
Xinxi Lyu,Michael Duan,Rulin Shao,Pang Wei Koh,Sewon Min
Main category: cs.CL
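TL;DR: This paper presents a simple yet effective retrieval-augmented generation (RAG) pipeline built on the CompactDS datastore, which substantially improves performance on reasoning-intensive benchmarks.
Details
Motivation: Prior work has seen limited gains from RAG on reasoning-intensive benchmarks. The authors attribute this to the lack of a usable, high-quality, web-scale datastore aligned with the breadth of pretraining data.
Contribution: (1) Designs and releases CompactDS, a diverse, high-quality, web-scale datastore; (2) shows that a minimal RAG pipeline yields consistent gains across several benchmarks.
Method: (1) Build CompactDS by filtering web content down to a compact, high-quality subset; (2) combine in-memory approximate nearest neighbor (ANN) retrieval with on-disk exact search to balance speed and recall; (3) evaluate the RAG pipeline on multiple benchmarks.
Result: The RAG pipeline improves accuracy consistently across all benchmarks and model sizes (8B to 70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH.
Insight: (1) Diversity of data sources is essential, since no single source suffices alone; (2) a simply designed RAG pipeline can match or outperform complex systems while remaining simple and reproducible.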
Abstract: Retrieval-augmented Generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single-node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B–70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems–all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
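The idea of pairing a cheap approximate search with an exact rerank can be sketched as follows. This is a toy illustration only: the sign-quantized codes are a stand-in for a real ANN index, stage 1 here still scans every document, and the function names, vectors, and parameters are assumptions rather than anything taken from CompactDS.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def quantize(v):
    """Crude 1-bit-per-dimension code: keep only the sign of each component."""
    return [1 if x >= 0 else -1 for x in v]

def two_stage_search(query, docs, shortlist_size=2, top_k=1):
    """Shortlist with approximate scores, then rerank the shortlist exactly."""
    # Stage 1 (approximate, cheap): compare compact codes instead of full vectors.
    q_code = quantize(query)
    coarse = sorted(docs, key=lambda d: dot(q_code, quantize(docs[d])), reverse=True)
    shortlist = coarse[:shortlist_size]
    # Stage 2 (exact): rescore only the shortlist with full-precision vectors.
    exact = sorted(shortlist, key=lambda d: dot(query, docs[d]), reverse=True)
    return exact[:top_k]

# Hypothetical document vectors keyed by document id.
docs = {
    "a": [0.8, -0.1, 0.5],
    "b": [0.1, -0.9, 0.1],
    "c": [-0.7, 0.3, 0.2],
    "d": [0.5, 0.6, -0.3],
}
query = [0.9, -0.2, 0.4]
top = two_stage_search(query, docs)
```

Documents "a" and "b" tie under the coarse codes, and only the exact stage separates them, which is the trade-off the abstract describes: the first stage buys speed, the second restores recall.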
[5] Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs
Nifu Dan,Yujun Cai,Yiwei Wang
Main category: cs.CL
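TL;DR: The paper examines how instruction-tuned reasoning models such as Deepseek-R1 solve physics problems from the SciBench benchmark. It finds strong accuracy and a marked reliance on symbolic derivation, and shows that few-shot prompting can improve performance further.
Details
Motivation: Solving physics problems requires both conceptual understanding and problem-solving technique. The study explores how far large language models (LLMs) can be pushed on such tasks.
Contribution: (1) Demonstrates the strong performance of reasoning models on complex physics problems; (2) identifies their distinctive reasoning pattern, which emphasizes symbolic derivation; (3) finds that few-shot prompting can still yield measurable accuracy gains.
Method: Experiments with instruction-tuned reasoning models (such as Deepseek-R1) on the SciBench benchmark, combined with a few-shot prompting strategy.
Result: The models achieve state-of-the-art accuracy on physics problem solving and show strong symbolic derivation, and few-shot prompting improves accuracy further.
Insight: Symbolic derivation may be a key capability for LLMs solving physics problems, and the potential of few-shot prompting deserves further study.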
Abstract: Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.
[6] LEDOM: An Open and Fundamental Reverse Language Model
Xunjian Yin,Sitao Cheng,Yuxi Xie,Xinyu Hu,Li Lin,Xinyi Wang,Liangming Pan,William Yang Wang,Xiaojun Wan
Main category: cs.CL
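TL;DR: LEDOM is presented as the first purely reverse language model. It processes sequences in reverse temporal order and is proposed as a potential foundation model for general tasks. Its Reverse Reward application substantially improves performance on mathematical reasoning tasks.
Details
Motivation: Current language models mostly process text in forward temporal order. LEDOM explores reverse-order processing and the distinctive advantages of backward reasoning.
Contribution: (1) The first purely reverse language model; (2) Reverse Reward, an application that uses backward reasoning to refine generation quality; (3) a planned open release of models, training code, and pre-training data.
Method: Trains 2B and 7B parameter variants on 435B tokens using reverse autoregressive (previous-token) prediction, and proposes Reverse Reward, in which LEDOM reranks the outputs of a forward language model.
Result: Reverse Reward substantially improves performance on mathematical reasoning tasks, suggesting broad application potential for backward reasoning.
Insight: Reverse language models have distinctive reasoning capabilities and can serve as foundation models for a range of tasks, especially those that benefit from posterior evaluation.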
Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM’s unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.
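Previous-token prediction can be viewed as ordinary autoregressive training on reversed sequences. The sketch below shows the training pairs this implies for a toy tokenized sentence; the helper name and the whitespace-level "tokens" are illustrative assumptions, not LEDOM's actual data pipeline.

```python
def reverse_training_pairs(tokens):
    """Build (context, target) pairs for previous-token prediction.

    The sequence is reversed, so each target is the token that PRECEDES
    the context in the original reading order.
    """
    reversed_tokens = tokens[::-1]
    return [
        (reversed_tokens[:i], reversed_tokens[i])
        for i in range(1, len(reversed_tokens))
    ]

pairs = reverse_training_pairs(["the", "cat", "sat"])
for context, target in pairs:
    print(context, "->", target)
```

Given the ending "sat", the model is asked for "cat", and given "sat cat" it is asked for "the", so it learns to continue a text backward from its conclusion.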
[7] Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Chris Yuhao Liu,Liang Zeng,Yuzhen Xiao,Jujie He,Jiacai Liu,Chaojie Wang,Rui Yan,Wei Shen,Fuxiang Zhang,Jiacheng Xu,Yang Liu,Yahui Zhou
Main category: cs.CL
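TL;DR: This paper proposes a two-stage human-AI data curation pipeline, builds SynPref-40M, a large-scale dataset of 40 million preference pairs, and uses it to train Skywork-Reward-V2, a suite of eight reward models from 0.6B to 8B parameters that achieves state-of-the-art results on multiple benchmarks.
Details
Motivation: Current open reward models perform poorly on most evaluation benchmarks and fail to capture the nuance of human preferences. The authors attribute this mainly to preference datasets that are narrowly scoped, synthetically labeled, or lacking rigorous quality control.
Contribution: (1) SynPref-40M, a large-scale dataset of 40 million preference pairs; (2) a two-stage human-AI synergistic curation pipeline; (3) the Skywork-Reward-V2 series of reward models, ranging from 0.6B to 8B parameters, with strong results across multiple benchmarks.
Method: (1) Build SynPref-40M through a human-AI pipeline in which humans provide verified annotations and large language models curate automatically under human guidance; (2) select a curated subset of 26 million preference pairs from SynPref-40M and train eight reward models of different sizes.
Result: Skywork-Reward-V2 achieves state-of-the-art performance on seven major reward model benchmarks, covering alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling.
Insight: Reward model gains depend not only on data scale but also on high-quality curation, and human-AI curation is an effective way to unlock the potential of existing preference datasets.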
Abstract: Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
[8] Clinical NLP with Attention-Based Deep Learning for Multi-Disease Prediction
Ting Xu,Xiaoxiao Deng,Xiandong Meng,Haifeng Yang,Yan Wu
Main category: cs.CL
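TL;DR: Proposes an attention-based deep learning model that processes electronic health record text for information extraction and multi-label disease prediction. On the MIMIC-IV dataset it outperforms existing methods and generalizes well.
Details
Motivation: The unstructured nature and high-dimensional semantic complexity of electronic health record text are major challenges. A unified modeling approach is needed that both extracts key information and predicts multiple disease labels.
Contribution: (1) A Transformer-based model whose multi-layer self-attention captures medical entities and their contextual relationships; (2) a context-aware semantic alignment mechanism that improves performance under label co-occurrence and sparse information; (3) validation of the model on the MIMIC-IV dataset.
Method: A Transformer architecture learns representations of clinical text, multi-layer self-attention extracts key information, and a Sigmoid-based multi-label classifier predicts diseases. A context-aware semantic alignment mechanism strengthens the model's representational capacity.
Result: The model outperforms existing methods on multiple performance metrics and generalizes well across data scales, interference levels, and model depths.
Insight: Attention mechanisms capture key medical information in clinical text effectively, and context alignment further improves performance in complex scenarios. The framework offers an efficient algorithmic foundation for processing real-world clinical text.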
Abstract: This paper addresses the challenges posed by the unstructured nature and high-dimensional semantic complexity of electronic health record texts. A deep learning method based on attention mechanisms is proposed to achieve unified modeling for information extraction and multi-label disease prediction. The study is conducted on the MIMIC-IV dataset. A Transformer-based architecture is used to perform representation learning over clinical text. Multi-layer self-attention mechanisms are employed to capture key medical entities and their contextual relationships. A Sigmoid-based multi-label classifier is then applied to predict multiple disease labels. The model incorporates a context-aware semantic alignment mechanism, enhancing its representational capacity in typical medical scenarios such as label co-occurrence and sparse information. To comprehensively evaluate model performance, a series of experiments were conducted, including baseline comparisons, hyperparameter sensitivity analysis, data perturbation studies, and noise injection tests. Results demonstrate that the proposed method consistently outperforms representative existing approaches across multiple performance metrics. The model maintains strong generalization under varying data scales, interference levels, and model depth configurations. The framework developed in this study offers an efficient algorithmic foundation for processing real-world clinical texts and presents practical significance for multi-label medical text modeling tasks.
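A Sigmoid-based multi-label head scores each label independently, so several diseases can be predicted for the same note. The sketch below illustrates only that final step; the logits, the label names, and the 0.5 threshold are hypothetical, and the Transformer encoder that would produce the logits is omitted.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real-valued logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, label_names, threshold=0.5):
    """Turn per-label logits into independent probabilities and a label set.

    Unlike softmax, the probabilities are not forced to sum to 1, so any
    number of labels (including none) can pass the threshold.
    """
    probs = {name: sigmoid(z) for name, z in zip(label_names, logits)}
    predicted = [name for name, p in probs.items() if p >= threshold]
    return probs, predicted

# Hypothetical logits for one clinical note.
labels = ["diabetes", "hypertension", "pneumonia"]
probs, predicted = predict_labels([2.0, 0.3, -1.5], labels)
```

Two labels pass the threshold here, and the three probabilities sum to more than 1, which is expected for independent per-label scores.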
[9] Is External Information Useful for Stance Detection with LLMs?
Quang Minh Nguyen,Taegyoon Kim
Main category: cs.CL
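TL;DR: The paper examines how external information (such as Wikipedia excerpts) affects large language models (LLMs) on stance detection. In most cases it lowers performance, because LLMs tend to predict from the provided information rather than from the true stance of the text.
Details
Motivation: To test whether external information helps LLMs on stance detection, in contrast to earlier findings on BERT-based systems, where external information improved performance.
Contribution: A systematic evaluation showing that external information usually harms LLM stance detection, and that LLMs are susceptible to information bias.
Method: Experiments on eight LLMs and three datasets covering 12 targets, comparing external information from Wikipedia and web search and analyzing model behavior.
Result: External information degrades performance in most cases, with macro F1 dropping by up to 27.9%. The degradation persists with chain-of-thought prompting, and fine-tuning mitigates but does not fully eliminate it.
Insight: In stance detection, LLMs tend to align with the stance and sentiment of the external information rather than the ground-truth stance of the text, which highlights the risk of information bias.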
Abstract: In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation on how Wikipedia and web search external information can affect stance detection across eight LLMs and in three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9%. We explain this through experiments showing LLMs’ tendency to align their predictions with the stance and sentiment of the provided information rather than the ground truth stance of the given text. We also find that performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers. Code is available at https://github.com/ngqm/acl2025-stance-detection.
[10] Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation
Shutong Feng,Hsien-chin Lin,Nurul Lubis,Carel van Niekerk,Michael Heck,Benjamin Ruppik,Renato Vukovic,Milica Gašić
Main category: cs.CL
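TL;DR: This paper proposes LUSTER, an LLM-based task-oriented dialogue system that uses end-to-end reinforcement learning to optimize both task success and emotional responsiveness.
Details
Motivation: Although LLMs have improved linguistic fluency and contextual understanding, building task-oriented dialogue systems that are both effective and emotionally intelligent remains a complex challenge.
Contribution: Proposes LUSTER, which combines LLM capability with structured reward modelling to optimize short-term (user sentiment) and long-term (task success) goals, making dialogue systems more resilient and emotionally responsive.
Method: An end-to-end reinforcement learning framework in a unified system architecture, evaluated in an environment that couples a natural-language user simulator with an imperfect natural language understanding module.
Result: Combining LLM capability with structured reward modelling yields more resilient and emotionally responsive task-oriented dialogue systems, offering a practical path for next-generation conversational agents.
Insight: Combining emotional intelligence with task orientation is an important direction for dialogue systems, and structured reward design can balance short-term and long-term goals.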
Abstract: Task-oriented dialogue (ToD) systems are designed to help users achieve specific goals through natural language interaction. While recent advances in large language models (LLMs) have significantly improved linguistic fluency and contextual understanding, building effective and emotionally intelligent ToD systems remains a complex challenge. Effective ToD systems must optimise for task success, emotional understanding and responsiveness, and precise information conveyance, all within inherently noisy and ambiguous conversational environments. In this work, we investigate architectural, representational, optimisational as well as emotional considerations of ToD systems. We set up systems covering these design considerations with a challenging evaluation environment composed of a natural-language user simulator coupled with an imperfect natural language understanding module. We propose LUSTER, an LLM-based Unified System for Task-oriented dialogue with End-to-end Reinforcement learning with both short-term (user sentiment) and long-term (task success) rewards. Our findings demonstrate that combining LLM capability with structured reward modelling leads to more resilient and emotionally responsive ToD systems, offering a practical path forward for next-generation conversational agents.
[11] Chart Question Answering from Real-World Analytical Narratives
Maeve Hutchinson,Radu Jianu,Aidan Slingsby,Jo Wood,Pranava Madhyastha
Main category: cs.CL
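TL;DR: The paper introduces a new chart question answering dataset built from real-world visualization notebooks. It pairs multi-view charts with natural language questions grounded in analytical narratives, reflecting realistic reasoning workflows. Current multimodal large models perform only moderately on it, with GPT-4.1 reaching 69.3% accuracy.
Details
Motivation: Existing chart question answering datasets are mostly based on synthetic or simplified settings and lack the complexity of real analysis workflows. The authors aim to fill this gap by building a dataset from visualization notebooks.
Contribution: (1) A chart question answering dataset grounded in real analytical narratives; (2) multi-view charts and natural language questions that are closer to real reasoning scenarios; (3) a benchmark of current multimodal large models on the task that exposes their limitations.
Method: (1) Collect charts and questions from visualization notebooks; (2) pair multi-view charts with natural language questions; (3) evaluate multimodal large models such as GPT-4.1.
Result: GPT-4.1 reaches 69.3% accuracy on the dataset, indicating that current models still struggle with realistic chart question answering.
Insight: Real-world chart question answering is more complex and demands stronger reasoning and multimodal understanding. Future work should focus on ecologically valid datasets and on improving models.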
Abstract: We present a new dataset for chart question answering (CQA) constructed from visualization notebooks. The dataset features real-world, multi-view charts paired with natural language questions grounded in analytical narratives. Unlike prior benchmarks, our data reflects ecologically valid reasoning workflows. Benchmarking state-of-the-art multimodal large language models reveals a significant performance gap, with GPT-4.1 achieving an accuracy of 69.3%, underscoring the challenges posed by this more authentic CQA setting.
[12] Confidence and Stability of Global and Pairwise Scores in NLP Evaluation
Georgii Levtsov,Dmitry Ustalov
Main category: cs.CL
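TL;DR: This paper compares the strengths and weaknesses of global scores and pairwise comparisons in NLP evaluation. Global scores give more reliable overall rankings but can underestimate strong models, while pairwise comparisons are good at identifying strong contenders among lower-scoring models but need more comparisons to converge when ties are frequent.
Details
Motivation: With the rise of instruction-tuned neural language models, NLP benchmarking is shifting from traditional global pointwise scores (e.g., GLUE) toward pairwise comparison leaderboards (e.g., LMSYS Arena). The authors compare the two approaches empirically to inform the choice of evaluation strategy.
Contribution: (1) An empirical comparison of global scores and pairwise comparisons; (2) the finding that global scores are more reliable overall but can underestimate strong models, while pairwise comparisons help identify strong models with lower global scores; (3) practical guidance for choosing an NLP evaluation strategy.
Method: Computational experiments on synthetic and real-world datasets, using standard global metrics and the Bradley-Terry model for pairwise comparisons, with an analysis of the confidence and stability of both approaches.
Result: Global scores provide more reliable overall rankings but can underestimate strong models that make rare, significant errors or have low confidence. Pairwise comparisons effectively identify strong contenders among models with lower global scores, but require more comparisons to converge if ties are frequent.
Insight: The choice between global scores and pairwise comparisons depends on the goal: global scores suit overall ranking, while pairwise comparisons suit identifying specific strong models, especially where quality metrics are hard to define.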
Abstract: With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at https://github.com/HSPyroblast/srw-ranking under a permissive license.
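The Bradley-Terry model gives each model i a positive strength p_i and sets the probability that i beats j to p_i / (p_i + p_j). A minimal sketch of fitting those strengths with the standard iterative (minorization-maximization) update follows. The win counts are hypothetical, ties are ignored for simplicity, and this is not the authors' released code.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is the number of times a beat b. Returns a dict of
    strengths normalized to sum to 1. Assumes every model has at least
    one win, otherwise its estimate collapses to zero.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                # Number of comparisons between i and j, in either direction.
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength

# Hypothetical head-to-head results among three models.
wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
strength = fit_bradley_terry(wins)
ranking = sorted(strength, key=strength.get, reverse=True)
```

Sorting by fitted strength gives the pairwise leaderboard, which can then be compared against a ranking by global score.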
[13] AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness
Zixin Chen,Hongzhan Lin,Kaixin Li,Ziyang Luo,Zhen Ye,Guang Chen,Zhiyong Huang,Jing Ma
Main category: cs.CL
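TL;DR: This paper proposes AdamMeme, a multi-agent adaptive evaluation framework that dynamically probes the reasoning ability of multimodal large language models (mLLMs) on harmful meme understanding.
Details
Motivation: The spread of multimodal memes on social media requires mLLMs to understand meme harmfulness effectively, but existing evaluations based on static datasets cannot keep up with rapidly evolving content.
Contribution: (1) The AdamMeme framework, which uses multi-agent collaboration and iteratively updated challenging samples to evaluate mLLM reasoning on harmful memes comprehensively; (2) an account of the specific limitations of different mLLMs in interpreting harmfulness.
Method: Multi-agent collaboration iteratively updates the meme data with challenging samples in order to probe the reasoning ability of mLLMs.
Result: Experiments show that AdamMeme systematically reveals performance differences among target mLLMs and provides fine-grained analyses of model-specific weaknesses.
Insight: A dynamic evaluation framework reveals a model's actual reasoning ability better than a static dataset, especially for fast-changing social media content.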
Abstract: The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up-to-date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi-agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in-depth, fine-grained analyses of model-specific weaknesses. Our code is available at https://github.com/Lbotirx/AdamMeme.
[14] How Do Vision-Language Models Process Conflicting Information Across Modalities?
Tianze Hua,Tian Yun,Ellie Pavlick
Main category: cs.CL
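TL;DR: This study examines how vision-language models handle conflicting information across modalities. It finds that models tend to favor one modality and identifies the internal mechanisms and attention heads that shape this preference.
Details
Motivation: As AI models increasingly need to process multimodal inputs, it becomes essential to understand how they behave when the modalities conflict.
Contribution: Characterizes the behavior of vision-language models under conflicting information and identifies internal attention mechanisms, including "router heads", that influence modality preference.
Method: Provide inconsistent vision-language inputs (for example, an image that conflicts with its caption), ask the model to report the content of one specific modality, and analyze its internal representations and attention mechanisms.
Result: Models often favor one modality (for example, the image), and different models favor different modalities. Modality-agnostic "router heads" appear to promote answers about the modality requested in the instruction, and they can be manipulated or transferred to improve performance across datasets and modalities.
Insight: A model's modality preference is reflected in its internal representational structure and attention mechanisms, which provides a basis for controlling and improving the behavior of multimodal models.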
Abstract: AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption “A photo of a cat”) and ask the model to report the information present in one of the specific modalities (e.g., “What does the caption say / What is in the image?”). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic “router heads” which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
[15] The Anatomy of Evidence: An Investigation Into Explainable ICD Coding
Katharina Beckh,Elisa Studeny,Sujan Sai Gannamaneni,Dario Antweiler,Stefan Rüping
Main category: cs.CL
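TL;DR: This paper investigates explainable ICD coding. It analyzes the MDACE dataset, evaluates the plausibility of current explainable medical coding systems, and offers recommendations for improvement.
Details
Motivation: Automatic medical coding could ease documentation and billing, and transparency matters to medical coders and regulatory bodies. A scarcity of annotated data has so far limited the evaluation of explainability methods.
Contribution: An in-depth analysis of the MDACE dataset, a plausibility evaluation of existing explainable medical coding systems, proposed match measures, and recommendations for development and evaluation.
Method: Analyze the MDACE dataset, evaluate the plausibility of the evidence extracted by existing approaches, and propose new match measures.
Result: Ground-truth evidence aligns with code descriptions to a certain degree, and state-of-the-art approaches overlap highly with ground-truth evidence. Success and failure cases are highlighted.
Insight: The paper examines how well ground-truth evidence agrees with code descriptions and offers practical recommendations for developing future explainable medical coding systems.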
Abstract: Automatic medical coding has the potential to ease documentation and billing processes. For this task, transparency plays an important role for medical coders and regulatory bodies, which can be achieved using explainability methods. However, the evaluation of these approaches has been mostly limited to short text and binary settings due to a scarcity of annotated data. Recent efforts by Cheng et al. (2023) have introduced the MDACE dataset, which provides a valuable resource containing code evidence in clinical records. In this work, we conduct an in-depth analysis of the MDACE dataset and perform plausibility evaluation of current explainable medical coding systems from an applied perspective. With this, we contribute to a deeper understanding of automatic medical coding and evidence extraction. Our findings reveal that ground truth evidence aligns with code descriptions to a certain degree. An investigation into state-of-the-art approaches shows a high overlap with ground truth evidence. We propose match measures and highlight success and failure cases. Based on our findings, we provide recommendations for developing and evaluating explainable medical coding systems.
[16] Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes
Nikita Neveditsin,Pawan Lingras,Vijay Mago
Main category: cs.CL
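TL;DR: The paper compares the parseability of structured outputs produced by small language models for open attribute-value extraction from clinical notes. JSON performs best. Structural robustness improves with targeted prompting and larger models, but declines for longer documents and certain note types.
Details
Motivation: To provide practical guidance for deploying language models in privacy-sensitive clinical settings, particularly on choosing an output format and designing prompts.
Contribution: A comparison of the parseability of three serialization formats (JSON, YAML, XML), the finding that JSON is consistently the most parseable, and an account of the performance challenges posed by long documents and specific note types.
Method: Evaluate the three serialization formats in practice, analyze variation by text length and note type, and conduct an analysis of error patterns.
Result: JSON consistently yields the highest parseability. Structural robustness improves with targeted prompting and larger models, but declines for longer documents and certain note types.
Insight: The paper offers practical advice for deploying small language models in clinical settings, stressing the suitability of JSON and the importance of prompt design.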
Abstract: We present a comparative analysis of the parseability of structured outputs generated by small language models for open attribute-value extraction from clinical notes. We evaluate three widely used serialization formats: JSON, YAML, and XML, and find that JSON consistently yields the highest parseability. Structural robustness improves with targeted prompting and larger models, but declines for longer documents and certain note types. Our error analysis identifies recurring format-specific failure patterns. These findings offer practical guidance for selecting serialization formats and designing prompts when deploying language models in privacy-sensitive clinical settings.
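Parseability can be measured as the share of raw model outputs that a standard parser accepts without any repair. The sketch below does this for JSON and XML using only the Python standard library; YAML is left out because it needs a third-party parser. The sample outputs and helper names are hypothetical and are not taken from the paper.

```python
import json
import xml.etree.ElementTree as ET

def is_parseable(text, fmt):
    """Return True if `text` parses as the given format ("json" or "xml")."""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        else:
            raise ValueError(f"unsupported format: {fmt}")
        return True
    except (json.JSONDecodeError, ET.ParseError):
        return False

def parse_rate(outputs, fmt):
    """Fraction of outputs that parse cleanly in the given format."""
    return sum(is_parseable(o, fmt) for o in outputs) / len(outputs)

# Hypothetical model outputs showing typical failure patterns.
json_outputs = [
    '{"blood_pressure": "120/80"}',  # valid
    '{"blood_pressure": "120/80"',   # truncated: missing closing brace
    'Sure! {"heart_rate": 72}',      # extra prose before the object
]
xml_outputs = [
    "<note><heart_rate>72</heart_rate></note>",  # valid
    "<note><heart_rate>72</note>",               # mismatched closing tag
]
json_rate = parse_rate(json_outputs, "json")
xml_rate = parse_rate(xml_outputs, "xml")
```

Computing the rate per format, per note type, or per document-length bucket gives the kind of breakdown the abstract describes.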
[17] Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
Samridhi Raj Sinha,Rajvee Sheth,Abhishek Upperwal,Mayank Singh
Main category: cs.CL
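TL;DR: EKA-EVAL is an evaluation framework for large language models (LLMs) in Indian languages. It integrates over 35 benchmarks (including 10 Indic-specific datasets), supports distributed inference, quantization, and multi-GPU usage, and aims to lower the barrier to multilingual benchmarking.
Details
Motivation: Existing evaluation frameworks are mostly English-centric and lack support for multilingual settings, especially Indian languages, so a comprehensive, production-ready framework is needed.
Contribution: Presents EKA-EVAL, described as the first end-to-end, extensible evaluation suite for both global and Indic LLMs. It covers multiple task categories, supports multilingual and large-scale evaluation, is open source, and is part of the EKA initiative.
Method: Integrates over 35 benchmarks (including 10 Indic-specific datasets) into a unified evaluation framework with support for distributed inference, quantization, and multi-GPU usage.
Result: EKA-EVAL offers broader benchmark coverage than existing Indian language evaluation tools and significantly lowers the technical barrier to multilingual benchmarking.
Insight: Multilingual evaluation needs broader benchmarks and supporting tools. EKA-EVAL fills a gap in Indian language evaluation and supports the use of LLMs in diverse language settings.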
Abstract: The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that go beyond English-centric benchmarks and address the requirements of linguistically diverse regions such as India. We present EKA-EVAL, a unified and production-ready evaluation framework that integrates over 35 benchmarks, including 10 Indic-specific datasets, spanning categories like reasoning, mathematics, tool use, long-context understanding, and reading comprehension. Compared to existing Indian language evaluation tools, EKA-EVAL offers broader benchmark coverage, with built-in support for distributed inference, quantization, and multi-GPU usage. Our systematic comparison positions EKA-EVAL as the first end-to-end, extensible evaluation suite tailored for both global and Indic LLMs, significantly lowering the barrier to multilingual benchmarking. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval and is part of the ongoing EKA initiative (https://eka.soket.ai), which aims to scale up to over 100 benchmarks and establish a robust, multilingual evaluation ecosystem for LLMs.
[18] DIY-MKG: An LLM-Based Polyglot Language Learning System
Kenan Tang,Yanhong Li,Yao Qin
Main category: cs.CL
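TL;DR: DIY-MKG is an LLM-based polyglot language learning system that supports multilingual learners through personalized vocabulary knowledge graphs and an adaptive review module.
Details
Motivation: Existing language learning tools fall short in connecting vocabulary across languages, in personalization, and in avoiding detrimental cognitive offloading. Contribution: Proposes DIY-MKG, a system that lets users build personalized vocabulary graphs and uses LLMs for vocabulary expansion and dynamic quiz generation.
Method: An LLM suggests related words, from which users selectively build a personalized knowledge graph; user feedback is then used to refine prompts and quizzes.
Result: The evaluation shows that vocabulary expansion is reliable and fair across multiple languages and that the generated quizzes are highly accurate.
Insight: LLMs can effectively support the personalized and dynamic needs of polyglot learning, and user feedback can improve system robustness.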
Abstract: Existing language learning tools, even those powered by Large Language Models (LLMs), often lack support for polyglot learners to build linguistic connections across vocabularies in multiple languages, provide limited customization for individual learning paces or needs, and suffer from detrimental cognitive offloading. To address these limitations, we design Do-It-Yourself Multilingual Knowledge Graph (DIY-MKG), an open-source system that supports polyglot language learning. DIY-MKG allows the user to build personalized vocabulary knowledge graphs, which are constructed by selective expansion with related words suggested by an LLM. The system further enhances learning through rich annotation capabilities and an adaptive review module that leverages LLMs for dynamic, personalized quiz generation. In addition, DIY-MKG allows users to flag incorrect quiz questions, simultaneously increasing user engagement and providing a feedback loop for prompt refinement. Our evaluation of LLM-based components in DIY-MKG shows that vocabulary expansion is reliable and fair across multiple languages, and that the generated quizzes are highly accurate, validating the robustness of DIY-MKG.
[19] MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants
Dongyi Ding,Tiannan Wang,Chenghao Zhu,Meiling Tao,Yuchen Eleanor Jiang,Wangchunshu Zhou
Main category: cs.CL
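TL;DR: MiCoTA introduces intermediate-sized teacher assistants and intermediate-length CoT sequences to address the difficulty small language models (SLMs) have in learning long-chain reasoning, significantly improving their performance.
Details
Motivation: Large language models (LLMs) excel at complex reasoning tasks, but their size and compute cost limit practical deployment. Small language models (SLMs), with their limited capacity, struggle to learn long chain-of-thought (CoT) reasoning, which creates the so-called “SLMs Learnability Gap”. Contribution: Proposes the MiCoTA framework, which uses intermediate-sized teacher assistants and intermediate-length CoT sequences to bridge the SLM's gaps in capacity and reasoning length, significantly improving performance on complex tasks.
Method: MiCoTA uses intermediate-sized models as teacher assistants to generate intermediate-length CoT sequences, which are then distilled into the SLM to help it learn long-chain reasoning.
Result: On multiple benchmarks (AIME2024, AMC, and others), MiCoTA significantly improves SLM performance; for example, the average scores of Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct rise by 3.47 and 3.93, respectively.
Insight: Data produced by MiCoTA is more closely aligned with the SLM's own distribution, which offers guidance for future work on long-CoT data distillation for SLMs.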
Abstract: Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands are impractical for widespread deployment. Yet, small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the “SLMs Learnability Gap”. To address this, we introduce Mid-CoT Teacher Assistant Distillation (MiCoTA), a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity and reasoning length gaps. Our experiments on downstream tasks demonstrate that although SLMs distilled from large teachers can perform poorly, by applying MiCoTA, they achieve significant improvements in reasoning performance. Specifically, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct achieve an improvement of 3.47 and 3.93 respectively on average score on AIME2024, AMC, Olympiad, MATH-500 and GSM8K benchmarks. To better understand the mechanism behind MiCoTA, we perform a quantitative experiment demonstrating that our method produces data more closely aligned with base SLM distributions. Our insights pave the way for future research into long-CoT data distillation for SLMs.
[20] AI4Research: A Survey of Artificial Intelligence for Scientific Research
Qiguang Chen,Mingda Yang,Libo Qin,Jinhao Liu,Zheng Yan,Jiannan Guan,Dengyun Peng,Yiyan Ji,Hanjing Li,Mengkang Hu,Yimeng Zhang,Yihao Liang,Yuhang Zhou,Jiaqi Wang,Zhi Chen,Wanxiang Che
Main category: cs.CL
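TL;DR: This paper is a comprehensive survey of artificial intelligence for scientific research (AI4Research). It proposes a systematic taxonomy, identifies research gaps and future directions, and compiles a large set of resources and tools.
Details
Motivation: In recent years, AI (large language models in particular) has shown strong capabilities in areas such as logical reasoning and experimental coding, prompting exploration of AI in scientific research. The lack of a systematic survey, however, hinders further progress. Contribution: 1. Proposes a systematic taxonomy of AI4Research; 2. Identifies research gaps and future directions; 3. Compiles multidisciplinary applications, data resources, and tools.
Method: Uses a survey methodology: systematically classifies existing AI4Research tasks, analyzes the current state of research and its challenges, and proposes future directions.
Result: Provides a unified perspective and a rich resource collection to support research and innovation in AI4Research.
Insight: AI4Research has great potential, but the rigor and scalability of automated experiments need attention, and its societal impact must be assessed.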
Abstract: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.
[21] Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models
Chengao Li,Hanyu Zhang,Yunkun Xu,Hongyan Xue,Xiang Ao,Qing He
Main category: cs.CL
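TL;DR: The paper proposes Gradient-Adaptive Policy Optimization (GAPO), which uses multiple-gradient descent to address the multi-objective alignment of large language models (LLMs) with diverse human preferences, and further introduces P-GAPO to incorporate user preferences. Experiments show that it outperforms existing methods.
Details
Motivation: Conventional RLHF methods struggle to align models with diverse and potentially conflicting human preferences, so a new approach to this multi-objective optimization problem is needed. Contribution: 1. Proposes GAPO, which optimizes multi-objective alignment through adaptive gradient rescaling; 2. Introduces P-GAPO to incorporate user preferences; 3. Proves theoretically that GAPO converges towards a Pareto optimal solution; 4. Validates its advantage empirically on Mistral-7B.
Method: Uses multiple-gradient descent, dynamically rescaling the gradient of each objective to balance them, and incorporates user preferences to produce Pareto solutions.
Result: On Mistral-7B, GAPO outperforms existing methods in both helpfulness and harmlessness.
Insight: Multi-objective optimization frameworks are promising for LLM alignment, and adaptive gradient rescaling is an effective way to handle conflicting objectives.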
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant challenge, particularly when they conflict. To address this issue, we frame human value alignment as a multi-objective optimization problem, aiming to maximize a set of potentially conflicting objectives. We introduce Gradient-Adaptive Policy Optimization (GAPO), a novel fine-tuning paradigm that employs multiple-gradient descent to align LLMs with diverse preference distributions. GAPO adaptively rescales the gradients for each objective to determine an update direction that optimally balances the trade-offs between objectives. Additionally, we introduce P-GAPO, which incorporates user preferences across different objectives and achieves Pareto solutions that better align with the user’s specific needs. Our theoretical analysis demonstrates that GAPO converges towards a Pareto optimal solution for multiple objectives. Empirical results on Mistral-7B show that GAPO outperforms current state-of-the-art methods, achieving superior performance in both helpfulness and harmlessness.
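To make the idea of multiple-gradient descent concrete, here is a minimal sketch of the classical two-objective case: the shared update direction is the minimum-norm point in the convex hull of the two objective gradients. This is the textbook closed form, not GAPO's adaptive rescaling rule, which the abstract does not spell out; the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any weight gives the same direction.
        return 0.5
    # Unconstrained minimizer, then clipped to the valid range.
    alpha = dot(g2, [b - a for a, b in zip(g1, g2)]) / denom
    return min(1.0, max(0.0, alpha))


def common_direction(g1, g2):
    """Min-norm convex combination of the two objective gradients.

    A zero vector means no direction improves both objectives at once
    (a Pareto-stationary point).
    """
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
```

For orthogonal unit gradients the two objectives are weighted equally, while for directly opposing gradients of equal length the result is the zero vector, signalling that the objectives cannot both be improved.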
[22] NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
Yang Li,Youssef Emad,Karthik Padthe,Jack Lanchantin,Weizhe Yuan,Thao Nguyen,Jason Weston,Shang-Wen Li,Dong Wang,Ilia Kulikov,Xian Li
Main category: cs.CL
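TL;DR: The paper studies how to select and distill high-quality reasoning traces (NaturalThoughts) from a teacher model to improve a student model's general reasoning ability. Its systematic analysis finds that selecting suitably difficult examples is more efficient than simply scaling up the data.
Details
Motivation: There has been no systematic study of which reasoning demonstrations from a teacher model are most effective. This paper aims to fill that gap by selecting and distilling high-quality reasoning traces to improve the student model's reasoning. Contribution: 1. Proposes NaturalThoughts, a curated set of high-quality reasoning traces from a teacher model; 2. Systematically analyzes the key factors that affect the distillation of reasoning capabilities; 3. Shows that selecting difficult examples is more efficient than random data scaling.
Method: 1. Draw questions from the NaturalReasoning dataset; 2. Generate reasoning traces with the teacher model; 3. Improve distillation by selecting difficult and diverse examples.
Result: On Llama and Qwen models, NaturalThoughts outperforms existing datasets (such as OpenThoughts and LIMO) on benchmarks including GPQA-Diamond, MMLU-Pro, and SuperGPQA.
Insight: Selecting difficult examples that require diverse reasoning strategies is the key to sample-efficient distillation; simply scaling up the data is a strong baseline but uses samples less efficiently.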
Abstract: Recent work has shown that distilling reasoning traces from a larger teacher model via supervised finetuning outperforms reinforcement learning with the smaller student model alone (Guo et al. 2025). However, there has not been a systematic study of what kind of reasoning demonstrations from the teacher are most effective in improving the student model’s reasoning capabilities. In this work we curate high-quality “NaturalThoughts” by selecting reasoning traces from a strong teacher model based on a large pool of questions from NaturalReasoning (Yuan et al. 2025). We first conduct a systematic analysis of factors that affect distilling reasoning capabilities, in terms of sample efficiency and scalability for general reasoning tasks. We observe that simply scaling up data size with random sampling is a strong baseline with steady performance gains. Further, we find that selecting difficult examples that require more diverse reasoning strategies is more sample-efficient to transfer the teacher model’s reasoning skills. Evaluated on both Llama and Qwen models, training with NaturalThoughts outperforms existing reasoning datasets such as OpenThoughts, LIMO, etc. on general STEM reasoning benchmarks including GPQA-Diamond, MMLU-Pro and SuperGPQA.
[23] The Thin Line Between Comprehension and Persuasion in LLMs
Adrian de Wynter,Tangming Yuan
Main category: cs.CL
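TL;DR: The paper studies how large language models (LLMs) perform in debates and finds that they can sustain coherent, persuasive dialogue but lack an understanding of its deeper structure.
Details
Motivation: As LLMs are rapidly deployed in sensitive areas such as peer review and mental health applications, their dialogue comprehension urgently needs to be assessed. Contribution: Reveals a disconnect between LLMs' persuasiveness and their comprehension in debates, and shows that their understanding of deeper dialogue structure is insufficient.
Method: Evaluates LLMs' performance in debates, analyzes their dialogue comprehension, and measures how persuasiveness relates to comprehension.
Result: LLMs can sustain coherent and highly persuasive debates but cannot demonstrate an understanding of the deeper structures of dialogue.
Insight: An LLM's effectiveness does not depend on understanding the topic; modelling pragmatic context and coherence is secondary to effectiveness.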
Abstract: Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being rapidly deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts of their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs’ ability to maintain a debate, one of the purest yet most complex forms of human communication. Then we measure how this capability relates to their understanding of what is being talked about, namely, their comprehension of dialogical structures and the pragmatic context. We find that LLMs are capable of maintaining coherent, persuasive debates, often swaying the beliefs of participants and audiences alike. We also note that awareness or suspicion of AI involvement encourages people to be more critical of the arguments made. When polling LLMs on their comprehension of deeper structures of dialogue, however, they cannot demonstrate said understanding. Our findings tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand the context. More broadly, for the field of argumentation theory we posit that, if an agent can convincingly maintain a dialogue, it is not necessary for it to know what it is talking about. Hence, the modelling of pragmatic context and coherence is secondary to effectiveness.
cs.CV [Back]
[24] Geometry-aware 4D Video Generation for Robot Manipulation
Zeyi Liu,Shuang Li,Eric Cousineau,Siyuan Feng,Benjamin Burchfiel,Shuran Song
Main category: cs.CV
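TL;DR: The paper proposes a geometry-aware 4D video generation model that uses multi-view 3D consistency supervision to improve the temporal coherence and geometric consistency of generated videos, supporting robot manipulation and generalization.
Details
Motivation: Existing video generation models are strong at modeling dynamic scenes but struggle to keep videos temporally coherent and geometrically consistent across camera views, which limits their usefulness for robot manipulation. Contribution: Proposes a 4D video generation model trained with cross-view pointmap alignment supervision. It learns a shared 3D representation of the scene, predicts future video sequences without camera poses as input, and supports robot manipulation tasks.
Method: Uses cross-view pointmap alignment as geometric supervision so that the model learns a shared 3D representation and generates temporally coherent, geometrically consistent 4D videos.
Result: On simulated and real-world robotic datasets, the generated videos are more visually stable and spatially aligned than those of existing baselines, and they can be used to recover robot end-effector trajectories.
Insight: Geometric supervision is an effective way to improve multi-view consistency in video generation, and the generated 4D videos can directly support robot manipulation, pointing to applications that span perception and control.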
Abstract: Understanding and predicting the dynamics of the physical world can enhance a robot’s ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of videos by supervising the model with cross-view pointmap alignment during training. This geometric supervision enables the model to learn a shared 3D representation of the scene, allowing it to predict future video sequences from novel viewpoints based solely on the given RGB-D observations, without requiring camera poses as inputs. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, supporting robust robot manipulation and generalization to novel camera viewpoints.
[25] Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions
Rahul A. Burange,Harsh K. Shinde,Omkar Mutyalwar
Main category: cs.CV
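TL;DR: The paper proposes an approach that combines multi-source satellite data with deep learning models to improve landslide identification and prediction. It uses Sentinel-2 multispectral data and ALOS PALSAR-derived terrain features, assesses influencing factors through geospatial analysis, and compares several deep learning segmentation models for landslide detection.
Details
Motivation: Landslides pose severe threats to infrastructure, economies, and human lives, so accurate detection and prediction across geographic regions are needed to strengthen disaster risk management. Contribution: Proposes an integrated approach that combines multi-source satellite data with deep learning models, improving the accuracy of landslide identification and the scalability of prediction, and supporting reliable early warning systems.
Method: Extracts environmental features from Sentinel-2 and ALOS PALSAR data, uses geospatial analysis to assess the factors that influence detection, and compares deep learning models including U-Net, DeepLabV3+, and Res-Net.
Result: The results indicate that combining multi-source data with deep learning improves landslide detection accuracy and that the resulting models are transferable and scalable.
Insight: Combining deep learning with multi-source remote sensing offers a more robust basis for landslide prediction and supports sustainable land-use planning and disaster risk management.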
Abstract: Landslides pose severe threats to infrastructure, economies, and human lives, necessitating accurate detection and predictive mapping across diverse geographic regions. With advancements in deep learning and remote sensing, automated landslide detection has become increasingly effective. This study presents a comprehensive approach integrating multi-source satellite imagery and deep learning models to enhance landslide identification and prediction. We leverage Sentinel-2 multispectral data and ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers to capture critical environmental features influencing landslide occurrences. Various geospatial analysis techniques are employed to assess the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy. Additionally, we evaluate the performance of multiple state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, and Res-Net, to determine their effectiveness in landslide detection. The proposed framework contributes to the development of reliable early warning systems, improved disaster risk management, and sustainable land-use planning. Our findings provide valuable insights into the potential of deep learning and multi-source remote sensing in creating robust, scalable, and transferable landslide prediction models.
[26] cp_measure: API-first feature extraction for image-based profiling workflows
Alán F. Muñoz,Tim Treis,Alexandr A. Kalinin,Shatavisha Dasgupta,Fabian Theis,Anne E. Carpenter,Shantanu Singh
Main category: cs.CV
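TL;DR: cp_measure is a Python library that modularizes CellProfiler's core measurement functionality, enabling programmatic feature extraction for machine learning and reproducible bioimage analysis.
Details
Motivation: Traditional bioimage analysis tools such as CellProfiler create bottlenecks for automation, reproducibility, and machine learning integration; cp_measure aims to remove them. Contribution: cp_measure modularizes CellProfiler's feature extraction into an API-first tool that integrates with the Python ecosystem and supports automated pipelines.
Method: Extracts CellProfiler's core measurement capabilities into a modular, API-first Python library and verifies that its features retain high fidelity with the original CellProfiler features.
Result: Applications to 3D astrocyte imaging and spatial transcriptomics demonstrate cp_measure's efficiency, scalability, and compatibility with machine learning tooling.
Insight: A modular, API-first design can substantially improve the automation and reproducibility of bioimage analysis and advance machine learning applications in computational biology.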
Abstract: Biological image analysis has traditionally focused on measuring specific visual properties of interest for cells or other entities. A complementary paradigm gaining increasing traction is image-based profiling - quantifying many distinct visual features to form comprehensive profiles which may reveal hidden patterns in cellular states, drug responses, and disease mechanisms. While current tools like CellProfiler can generate these feature sets, they pose significant barriers to automated and reproducible analyses, hindering machine learning workflows. Here we introduce cp_measure, a Python library that extracts CellProfiler’s core measurement capabilities into a modular, API-first tool designed for programmatic feature extraction. We demonstrate that cp_measure features retain high fidelity with CellProfiler features while enabling seamless integration with the scientific Python ecosystem. Through applications to 3D astrocyte imaging and spatial transcriptomics, we showcase how cp_measure enables reproducible, automated image-based profiling pipelines that scale effectively for machine learning applications in computational biology.
[27] Rapid Salient Object Detection with Difference Convolutional Neural Networks
Zhuo Su,Li Liu,Matthias Müller,Jiehua Zhang,Diana Wofk,Ming-Ming Cheng,Matti Pietikäinen
Main category: cs.CV
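TL;DR: The paper proposes an efficient salient object detection method that combines the contrast cues of traditional saliency detection with the representation power of modern CNNs. Using Pixel Difference Convolutions (PDCs) and a difference convolution reparameterization (DCR) strategy, it markedly improves real-time performance on resource-constrained devices.
Details
Motivation: Existing salient object detection (SOD) models are computationally expensive and struggle to run in real time on resource-constrained devices. The paper combines traditional SOD methods with modern CNNs to obtain an efficient and accurate solution. Contribution: 1. Proposes Pixel Difference Convolution (PDC) to capture feature contrast cues; 2. Introduces the difference convolution reparameterization (DCR) strategy to reduce computation and parameters at inference; 3. Designs SpatioTemporal Difference Convolution (STDC) to extend the approach to video SOD; 4. The proposed SDNet and STDNet models clearly outperform existing lightweight models in both efficiency and accuracy.
Method: 1. Encode feature contrasts with PDCs; 2. Embed PDCs into standard convolutions via DCR for efficient inference; 3. Introduce STDC for video SOD to strengthen spatiotemporal contrast capture.
Result: On a Jetson Orin device, the models have fewer than 1M parameters and run at 46 FPS for image SOD and 150 FPS for video SOD, with speed and accuracy clearly ahead of other lightweight models.
Insight: Combining traditional methods with modern deep learning can effectively improve model efficiency and offers a path to real-time applications on resource-constrained devices.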
Abstract: This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD and the representation power of modern CNNs. Like biologically-inspired classical SOD methods relying on computing contrast cues to determine saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode the feature contrasts. Unlike those methods, PDCs are incorporated in a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in efficiency-accuracy trade-offs. On a Jetson Orin device, our models with $<$ 1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than $2\times$ and $3\times$ in speed with superior accuracy. Code will be available at https://github.com/hellozhuo/stdnet.git.
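The claim that a pixel difference convolution can be folded into a standard convolution can be checked on a single 3x3 patch. The sketch below assumes the central-difference form, where each neighbour is compared against the centre pixel; the paper's exact PDC variants and its DCR procedure may differ, and all function names here are hypothetical.

```python
def central_pdc(patch, weights):
    """Central pixel-difference response: sum_i w_i * (x_i - x_center)."""
    center = patch[1][1]
    return sum(weights[r][c] * (patch[r][c] - center)
               for r in range(3) for c in range(3))


def standard_conv(patch, weights):
    """Ordinary convolution response on one patch: sum_i w_i * x_i."""
    return sum(weights[r][c] * patch[r][c]
               for r in range(3) for c in range(3))


def reparameterize(weights):
    """Fold the difference into the kernel.

    sum_i w_i*(x_i - x_c) = sum_i w_i*x_i - (sum_i w_i)*x_c, so subtracting
    the weight total from the centre weight yields an equivalent standard
    kernel that needs no explicit differences at inference time.
    """
    total = sum(sum(row) for row in weights)
    merged = [row[:] for row in weights]
    merged[1][1] -= total
    return merged
```

Because the merged kernel has the same shape as the original, the difference operation costs nothing extra once the weights are converted.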
[28] Robust Brain Tumor Segmentation with Incomplete MRI Modalities Using Hölder Divergence and Mutual Information-Enhanced Knowledge Transfer
Runze Cheng,Xihang Qiu,Ming Li,Ye Zhang,Chun Li,Fei Yu
Main category: cs.CV
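TL;DR: The paper proposes a robust brain tumor segmentation method based on Hölder divergence and mutual information-enhanced knowledge transfer, addressing the problem of missing modalities in multimodal MRI data.
Details
Motivation: Multimodal MRI usually provides complementary information for accurate brain tumor segmentation, but in clinical practice some modalities may be missing because of image quality, protocol inconsistencies, patient allergies, or financial constraints. Existing methods perform poorly in this setting, so a robust solution is needed. Contribution: 1) Proposes a single-modality parallel processing framework that maintains high segmentation accuracy when modalities are missing; 2) Introduces Hölder divergence and mutual information as loss functions and dynamically adjusts network parameters; 3) Validates the method's advantage on the BraTS datasets.
Method: Builds loss functions from Hölder divergence and mutual information to quantify the discrepancy between predictions and ground-truth labels, and adapts network parameters through the parallel processing framework.
Result: On the BraTS 2018 and BraTS 2020 datasets, the method clearly outperforms existing techniques, especially when modalities are missing.
Insight: Hölder divergence and mutual information capture relationships between modalities effectively, and the dynamic adjustment strategy makes the model more robust to missing modalities, suggesting a new direction for segmentation with incomplete data.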
Abstract: Multimodal MRI provides critical complementary information for accurate brain tumor segmentation. However, conventional methods struggle when certain modalities are missing due to issues such as image quality, protocol inconsistencies, patient allergies, or financial constraints. To address this, we propose a robust single-modality parallel processing framework that achieves high segmentation accuracy even with incomplete modalities. Leveraging Hölder divergence and mutual information, our model maintains modality-specific features while dynamically adjusting network parameters based on the available inputs. By using these divergence- and information-based loss functions, the framework effectively quantifies discrepancies between predictions and ground-truth labels, resulting in consistently accurate segmentation. Extensive evaluations on the BraTS 2018 and BraTS 2020 datasets demonstrate superior performance over existing methods in handling missing modalities.
[29] AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation
Xiao Liu,Jiawei Zhang
Main category: cs.CV
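TL;DR: The paper proposes AIGVE-MACS, which combines vision-language models with a new loss function and a dynamic frame sampling strategy to give AI-generated videos both scores and multi-aspect language comments, significantly improving alignment with human evaluation.
Details
Motivation: Existing evaluation methods for AI-generated video produce only numerical scores without explanatory comments, which leads to low interpretability and poor agreement with human evaluation. Contribution: 1. Proposes the AIGVE-MACS model, which provides both scores and multi-aspect comment feedback; 2. Releases the AIGVE-BENCH 2 benchmark dataset; 3. Introduces a multi-agent framework that drives iterative improvement of video generation.
Method: Builds on vision-language models with a token-wise weighted loss and a dynamic frame sampling strategy to improve alignment with human evaluators.
Result: On supervised and zero-shot benchmarks, AIGVE-MACS outperforms existing baselines (such as GPT-4o and VideoScore) in scoring correlation and comment quality, and the multi-agent framework yields a 53.5% quality improvement.
Insight: A comprehensive, human-aligned evaluation framework can significantly improve the quality and interpretability of AI-generated videos.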
Abstract: The rapid advancement of AI-generated video models has created a pressing need for robust and interpretable evaluation frameworks. Existing metrics are limited to producing numerical scores without explanatory comments, resulting in low interpretability and poor alignment with human evaluation. To address these challenges, we introduce AIGVE-MACS, a unified model for AI-Generated Video Evaluation (AIGVE), which can provide not only numerical scores but also multi-aspect language comment feedback in evaluating these generated videos. Central to our approach is AIGVE-BENCH 2, a large-scale benchmark comprising 2,500 AI-generated videos and 22,500 human-annotated detailed comments and numerical scores across nine critical evaluation aspects. Leveraging AIGVE-BENCH 2, AIGVE-MACS incorporates recent Vision-Language Models with a novel token-wise weighted loss and a dynamic frame sampling strategy to better align with human evaluators. Comprehensive experiments across supervised and zero-shot benchmarks demonstrate that AIGVE-MACS achieves state-of-the-art performance in both scoring correlation and comment quality, significantly outperforming prior baselines including GPT-4o and VideoScore. In addition, we showcase a multi-agent refinement framework where feedback from AIGVE-MACS drives iterative improvements in video generation, leading to a 53.5% quality enhancement. This work establishes a new paradigm for comprehensive, human-aligned evaluation of AI-generated videos. We release AIGVE-BENCH 2 and AIGVE-MACS at https://huggingface.co/xiaoliux/AIGVE-MACS.
[30] Advancements in Weed Mapping: A Systematic Review
Mohammad Jahanbakht,Alex Olsen,Ross Marchant,Emilie Fillols,Mostafa Rahimi Azghadi
Main category: cs.CV
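TL;DR: The paper systematically reviews recent advances in weed mapping, focusing on end-to-end methods for data acquisition, processing, and mapping, and fills the gap left by the lack of a comprehensive literature review in this area.
Details
Motivation: Weed mapping is critical to precision management, but existing studies do not cover the full pipeline from data acquisition to processing, which limits progress in the field. This paper aims to fill that gap. Contribution: The main contribution is a systematic analysis of the entire weed mapping pipeline, covering data acquisition technologies (such as RGB cameras and remote sensing), data processing (such as big data analytics and machine learning), and mapping tools (such as spatiotemporal analysis and decision support).
Method: Reviews the literature systematically following PRISMA guidelines, covering sensor and platform technologies, data annotation and modelling, and spatiotemporal analysis and decision support tools.
Result: The review summarizes the state of the art in weed mapping, proposes future research directions, and offers guidance for developing efficient, sustainable weed management systems.
Insight: Future weed mapping needs to integrate multi-source data further and use machine learning and AI to improve decision support systems for more precise, sustainable management.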
Abstract: Weed mapping plays a critical role in precision management by providing accurate and timely data on weed distribution, enabling targeted control and reduced herbicide use. This minimizes environmental impacts, supports sustainable land management, and improves outcomes across agricultural and natural environments. Recent advances in weed mapping leverage ground-vehicle Red Green Blue (RGB) cameras, satellite and drone-based remote sensing combined with sensors such as spectral, Near Infra-Red (NIR), and thermal cameras. The resulting data are processed using advanced techniques including big data analytics and machine learning, significantly improving the spatial and temporal resolution of weed maps and enabling site-specific management decisions. Despite a growing body of research in this domain, there is a lack of comprehensive literature reviews specifically focused on weed mapping. In particular, the absence of a structured analysis spanning the entire mapping pipeline, from data acquisition to processing techniques and mapping tools, limits progress in the field. This review addresses these gaps by systematically examining state-of-the-art methods in data acquisition (sensor and platform technologies), data processing (including annotation and modelling), and mapping techniques (such as spatiotemporal analysis and decision support tools). Following PRISMA guidelines, we critically evaluate and synthesize key findings from the literature to provide a holistic understanding of the weed mapping landscape. This review serves as a foundational reference to guide future research and support the development of efficient, scalable, and sustainable weed management systems.
[31] Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing
Chengxu Liu,Lu Qi,Jinshan Pan,Xueming Qian,Ming-Hsuan Yang
Main category: cs.CV
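TL;DR: The paper proposes a frequency domain-based diffusion model (FD-Diffusion) for unpaired image dehazing, improving results by fully exploiting frequency-domain properties.
Details
Motivation: Existing dehazing methods based on contrastive learning introduce haze-unrelated content information and ignore the frequency-domain properties of haze (for example, haze-related degradation in the amplitude spectrum). The authors therefore propose a frequency-domain modelling approach. Contribution: 1. Proposes the first frequency domain-based diffusion model (FD-Diffusion); 2. Designs an Amplitude Residual Encoder (ARE) to compensate for the amplitude spectrum gap; 3. Proposes a Phase Correction Module (PCM) that refines the phase spectrum with an attention mechanism.
Method: 1. Train the diffusion model in the frequency domain to produce an amplitude spectrum consistent with the distribution of clear images; 2. Use the ARE to extract amplitude residuals that supervise the diffusion model; 3. Use the PCM to correct the phase spectrum through an attention mechanism.
Result: Outperforms other state-of-the-art methods on both synthetic and real-world datasets.
Insight: Frequency-domain modelling captures haze-related degradation effectively, and the generative ability of diffusion models is promising for dehazing.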
Abstract: Unpaired image dehazing has attracted increasing attention due to its flexible data requirements during model training. Dominant methods based on contrastive learning not only introduce haze-unrelated content information, but also ignore haze-specific properties in the frequency domain (i.e., haze-related degradation is mainly manifested in the amplitude spectrum). To address these issues, we propose a novel frequency domain-based diffusion model for fully exploiting the beneficial knowledge in unpaired clear data. In particular, inspired by the strong generative ability shown by Diffusion Models (DMs), we tackle the dehazing task from the perspective of frequency domain reconstruction and use DMs to yield an amplitude spectrum consistent with the distribution of clear images. To implement it, we propose an Amplitude Residual Encoder (ARE) to extract the amplitude residuals, which effectively compensates for the amplitude gap from the hazy to clear domains and provides supervision for the DMs training. In addition, we propose a Phase Correction Module (PCM) to eliminate artifacts by further refining the phase spectrum during dehazing with a simple attention mechanism. Experimental results demonstrate that our method outperforms other state-of-the-art methods on both synthetic and real-world datasets.
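The amplitude/phase split that this abstract relies on can be shown with a toy 1-D signal. The sketch below uses a naive discrete Fourier transform from the Python standard library and simply transplants the amplitude spectrum of one signal onto the phase spectrum of another. It is an illustration of the decomposition only: the paper works on 2-D images and predicts the clear amplitude with a diffusion model, and every function name here is hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]


def split(spectrum):
    """Return (amplitude, phase) lists for a complex spectrum."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def swap_amplitude(hazy, clear):
    """Combine the amplitude spectrum of `clear` with the phase of `hazy`."""
    amp_clear, _ = split(dft(clear))
    _, phase_hazy = split(dft(hazy))
    merged = [cmath.rect(a, p) for a, p in zip(amp_clear, phase_hazy)]
    # For real inputs of equal length the imaginary parts are numerical noise.
    return [v.real for v in idft(merged)]
```

If the two signals differ only by a global scale, their phases agree, so the swap recovers the scaled signal exactly up to floating-point error.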
[32] Learning an Ensemble Token from Task-driven Priors in Facial Analysis
Sunyong Seo,Semin Kim,Jongha Lee
Main category: cs.CV
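TL;DR: The paper proposes ET-Fuser, which uses task-driven priors and attention mechanisms to learn a unified ensemble token that improves feature representations in facial analysis tasks.
Details
Motivation: CNNs and ViTs have been successful at representing spatial and semantic information respectively, but how to unify feature representations in single-task learning remains under-studied. ET-Fuser aims to address this by combining task priors with attention mechanisms. Contribution: Proposes ET-Fuser, which generates an ensemble token from task priors and attention to obtain an efficient, unified feature representation.
Method: Uses priors from pre-trained models and designs a self-attention-based ensemble token generation method that shares mutual information across the pre-trained encoders.
Result: Achieves statistically significant improvements across several facial analysis tasks at negligible computational cost.
Insight: Integrating task priors can noticeably improve facial analysis performance without adding significant computational overhead.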
Abstract: Facial analysis exhibits task-specific feature variations. While Convolutional Neural Networks (CNNs) have enabled the fine-grained representation of spatial information, Vision Transformers (ViTs) have facilitated the representation of semantic information at the patch level. Although the generalization of conventional methodologies has advanced visual interpretability, there remains a paucity of research that preserves a unified feature representation in single-task learning during the training process. In this work, we introduce ET-Fuser, a novel methodology for learning an ensemble token by leveraging attention mechanisms based on task priors derived from pre-trained models for facial analysis. Specifically, we propose a robust prior unification learning method that generates an ensemble token within a self-attention mechanism, which shares the mutual information along the pre-trained encoders. This ensemble token approach offers high efficiency with negligible computational cost. Our results show improvements across a variety of facial analysis tasks, with statistically significant enhancements observed in the feature representations.
[33] Physics-informed Ground Reaction Dynamics from Human Motion Capture
Cuong Le,Huy-Phuong Le,Duc Le,Minh-Thien Duong,Van-Binh Nguyen,My-Ha Le
Main category: cs.CV
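TL;DR: The paper proposes a physics-constrained method that estimates ground reaction forces directly from motion capture data, replacing the traditional reliance on specialized force plates. It computes the forces with Euler integration and a PD algorithm and shows its advantage on the GroundLink dataset.
Details
Motivation: Traditional methods rely on laboratory force plates to obtain ground reaction forces, which limits the breadth and applicability of the data. The paper aims to estimate these forces from motion capture data under physics constraints, without force plates. Contribution: Proposes a method that estimates ground reaction forces directly from motion capture data, combining Euler integration with a PD algorithm and using physics constraints to improve estimation accuracy.
Method: Uses Euler integration and a PD algorithm, with physics laws and computational simulation as constraints, to compute ground reaction forces from motion capture data, and feeds this physics information into the learning model.
Result: On the GroundLink dataset, the method outperforms the baseline model in both ground reaction force estimation accuracy and simulated trajectory precision.
Insight: Physics constraints improve the accuracy of motion dynamics estimation and show that reliable dynamics analysis is possible without force plates.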
Abstract: Body dynamics are crucial information for the analysis of human motions in important research fields, ranging from biomechanics and sports science to computer vision and graphics. Modern approaches collect the body dynamics, external reactive force specifically, via force plates, synchronizing with human motion capture data, and learn to estimate the dynamics from a black-box deep learning model. Being specialized devices, force plates can only be installed in laboratory setups, imposing a significant limitation on the learning of human dynamics. To this end, we propose a novel method for estimating human ground reaction dynamics directly from the more reliable motion capture data with physics laws and computational simulation as constraints. We introduce a highly accurate and robust method for computing ground reaction forces from motion capture data using Euler’s integration scheme and PD algorithm. The physics-based reactive forces are used to inform the learning model about the physics-informed motion dynamics, thus improving the estimation accuracy. The proposed approach was tested on the GroundLink dataset, outperforming the baseline model on: 1) the ground reaction force estimation accuracy compared to the force plate measurements; and 2) our simulated root trajectory precision. The implementation code is available at https://github.com/cuongle1206/Phys-GRD
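The combination of a PD rule with Euler integration can be sketched on a 1-D point mass. The example below assumes a single vertical degree of freedom, explicit Euler steps, and a PD term with gravity compensation; the paper's actual full-body formulation is not given in the abstract, and the function name, gains, and time step are all hypothetical.

```python
GRAVITY = -9.81  # m/s^2, pointing downwards


def simulate_pd_tracking(reference, mass, kp, kd, dt):
    """Track a reference height with a PD force and explicit Euler steps.

    Returns (positions, forces), where each force is the upward support force
    needed at that step: the PD correction plus the weight of the mass.
    """
    pos, vel = reference[0], 0.0
    positions, forces = [pos], []
    for i in range(1, len(reference)):
        target = reference[i]
        # Finite-difference estimate of the reference velocity.
        target_vel = (reference[i] - reference[i - 1]) / dt
        # PD correction plus gravity compensation.
        force = kp * (target - pos) + kd * (target_vel - vel) - mass * GRAVITY
        acc = force / mass + GRAVITY
        # Explicit Euler: advance position with the old velocity first.
        pos += vel * dt
        vel += acc * dt
        positions.append(pos)
        forces.append(force)
    return positions, forces
```

For a stationary reference the estimated support force equals the body weight, which is the sanity check one would expect from a force plate under a person standing still.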
[34] Learning Camera-Agnostic White-Balance Preferences
Luxi Zhao,Mahmoud Afifi,Michael S. Brown
Main category: cs.CV
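TL;DR: The paper proposes a lightweight method for learning camera-agnostic white-balance preferences. It learns a post-illuminant-estimation mapping that converts neutral white balance into an aesthetically preferred rendering, achieving cross-camera consistency at low computational cost.
Details
Motivation: Commercial auto white balance (AWB) systems usually aim for an aesthetic result rather than neutral color correction, while existing learning-based methods generalize poorly across cameras. The paper addresses aesthetically consistent white balance across cameras. Contribution: 1. Is the first to learn a camera-agnostic aesthetic white-balance mapping; 2. Designs a lightweight model (about 500 parameters) that is compatible with existing cross-camera AWB techniques; 3. Achieves the best performance on a smartphone image dataset.
Method: Applies a post-illuminant-estimation mapping that converts neutral white-balance corrections into aesthetically preferred ones; the model is lightweight and efficient, taking only 0.024 milliseconds to run.
Result: On a dataset of 771 smartphone images from three different cameras, the method achieves state-of-the-art performance with very low computational overhead.
Insight: Aesthetic white balance can be made consistent across cameras with a lightweight model that does not depend on a specific sensor, which is practical for multi-camera devices.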
Abstract: The image signal processor (ISP) pipeline in modern cameras consists of several modules that transform raw sensor data into visually pleasing images in a display color space. Among these, the auto white balance (AWB) module is essential for compensating for scene illumination. However, commercial AWB systems often strive to compute aesthetic white-balance preferences rather than accurate neutral color correction. While learning-based methods have improved AWB accuracy, they typically struggle to generalize across different camera sensors – an issue for smartphones with multiple cameras. Recent work has explored cross-camera AWB, but most methods remain focused on achieving neutral white balance. In contrast, this paper is the first to address aesthetic consistency by learning a post-illuminant-estimation mapping that transforms neutral illuminant corrections into aesthetically preferred corrections in a camera-agnostic space. Once trained, our mapping can be applied after any neutral AWB module to enable consistent and stylized color rendering across unseen cameras. Our proposed model is lightweight – containing only $\sim$500 parameters – and runs in just 0.024 milliseconds on a typical flagship mobile CPU. Evaluated on a dataset of 771 smartphone images from three different cameras, our method achieves state-of-the-art performance while remaining fully compatible with existing cross-camera AWB techniques, introducing minimal computational and memory overhead.
[35] Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation
Andrei Jelea,Ahmed Nabil Belbachir,Marius Leordeanu
Main category: cs.CV
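TL;DR: The paper proposes Generalized Test-Time Augmentation (GTTA), which builds robust ensembles by randomly perturbing the PCA subspace projection of a test input, and adds self-supervised distillation to reduce computational cost.
Details
Motivation: Existing test-time augmentation methods lack generality and are computationally expensive. GTTA aims to be a general, efficient method that applies to many vision and non-vision tasks.
Contribution: 1. GTTA, a general method applicable to classification, regression, segmentation, and other tasks; 2. A self-supervised distillation stage that significantly reduces test-time computational cost; 3. Validation on multiple datasets and tasks.
Method: 1. Generate multiple test inputs by randomly perturbing the PCA subspace projection; 2. Build a robust ensemble; 3. Use the ensemble output as an unsupervised teacher to train the initial model in a self-supervised manner.
Result: GTTA outperforms existing methods on image classification, segmentation, speech recognition, and other tasks, and its effectiveness is also shown on salmon segmentation in low-visibility underwater videos.
Insight: Random subspace perturbation effectively filters noise in the data, and self-supervised distillation further improves efficiency, offering a new direction for general test-time augmentation.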
Abstract: We introduce Generalized Test-Time Augmentation (GTTA), a highly effective method for improving the performance of a trained model, which, unlike other existing Test-Time Augmentation approaches from the literature, is general enough to be used off-the-shelf for many vision and non-vision tasks, such as classification, regression, image segmentation and object detection. By applying a new general data transformation that randomly perturbs the PCA subspace projection of a test input multiple times, GTTA forms robust ensembles at test time in which, due to sound statistical properties, the structural and systematic noise in the initial input data is filtered out and final estimator errors are reduced. Different from other existing methods, we also propose a final self-supervised learning stage in which the ensemble output, acting as an unsupervised teacher, is used to train the initial single student model, thus significantly reducing the test-time computational cost at no loss in accuracy. Our tests and comparisons to strong TTA approaches and SoTA models on various well-known vision and non-vision datasets and tasks, such as image classification and segmentation, speech recognition and house price prediction, validate the generality of the proposed GTTA. Furthermore, we also demonstrate its effectiveness on the more specific real-world task of salmon segmentation and detection in low-visibility underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature.
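The ensemble step described above might look roughly like the following sketch. The abstract does not specify the perturbation distribution, so the Gaussian noise on the PCA coefficients, the scalar-output `model`, and the names `gtta_predict`, `n_aug`, and `sigma` are assumptions; the PCA mean and orthonormal components are taken as precomputed inputs.

```python
import random


def gtta_predict(model, x, mean, components, n_aug=16, sigma=0.1, seed=0):
    """Average model outputs over randomly perturbed PCA-subspace reconstructions of x.

    `components` is a list of orthonormal basis vectors (assumed precomputed),
    `model` maps a feature list to a scalar. Hypothetical helper, not the paper's code.
    """
    rng = random.Random(seed)
    centered = [xi - mi for xi, mi in zip(x, mean)]
    # Project the centered input onto each principal direction.
    coeffs = [sum(c * v for c, v in zip(comp, centered)) for comp in components]
    preds = []
    for _ in range(n_aug):
        # Perturb the subspace coordinates (Gaussian noise is an assumption).
        noisy = [a + rng.gauss(0.0, sigma) for a in coeffs]
        # Reconstruct an augmented input from the perturbed coordinates.
        x_aug = [
            m + sum(a * comp[j] for a, comp in zip(noisy, components))
            for j, m in enumerate(mean)
        ]
        preds.append(model(x_aug))
    return sum(preds) / len(preds)
```

The averaged prediction could then serve as the teacher target when distilling back into the single student model, as the abstract describes.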
[36] Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model
Chaoxiang Cai,Longrong Yang,Kaibing Chen,Fan Yang,Xi Li
Main category: cs.CV
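TL;DR: The paper proposes a Long-Tailed Distribution-aware Router (LTDR) for mixture-of-experts (MoE) architectures in vision-language models. With modality-specific routing and stronger expert activation for vision tail tokens, it addresses the fact that existing MoE frameworks overlook distributional differences between vision and language.
Details
Motivation: Existing MoE frameworks for vision-language models focus on token-to-expert routing (TER) but overlook the inherent distributional differences between the two modalities: language TER follows a uniform distribution, whereas vision TER is long-tailed. Each modality therefore needs its own routing strategy.
Contribution: The Long-Tailed Distribution-aware Router (LTDR), with two main parts: 1) a distribution-aware, modality-specific routing strategy; 2) an oversampling-like strategy that enhances expert activation for vision tail tokens.
Method: LTDR has two parts: 1) separate distribution-aware routing strategies for the language and vision modalities; 2) activating more experts for vision tail tokens to improve how they are processed.
Result: Experiments on extensive benchmarks validate the effectiveness of LTDR and show clear gains for MoE architectures in vision-language models.
Insight: The distributional differences between vision and language tokens should inform MoE router design, and activating more experts for vision tail tokens can markedly improve model quality. This is a useful lesson for future multimodal MoE architectures.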
Abstract: The mixture-of-experts (MoE), which replaces dense models with sparse architectures, has gained attention in large vision-language models (LVLMs) for achieving comparable performance with fewer activated parameters. Existing MoE frameworks for LVLMs focus on token-to-expert routing (TER), encouraging different experts to specialize in processing distinct tokens. However, these frameworks often rely on the load balancing mechanism, overlooking the inherent distributional differences between vision and language. To this end, we propose a Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, tackling two challenges: (1) Distribution-aware router for modality-specific routing. We observe that language TER follows a uniform distribution, whereas vision TER exhibits a long-tailed distribution. This discrepancy necessitates distinct routing strategies tailored to each modality. (2) Enhancing expert activation for vision tail tokens. Recognizing the importance of vision tail tokens, we introduce an oversampling-like strategy by increasing the number of activated experts for these tokens. Experiments on extensive benchmarks validate the effectiveness of our approach.
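The oversampling-like idea of activating more experts for vision tail tokens can be sketched as a top-k router whose k depends on the token. How tail tokens are identified is not stated in the abstract, so the `is_vision_tail` flag, the values of `base_k` and `tail_k`, and the function name are hypothetical.

```python
import math


def route_token(logits, is_vision_tail, base_k=2, tail_k=4):
    """Return {expert_index: gate_weight} using more experts for vision tail tokens.

    Illustrative top-k softmax gating; the tail criterion is supplied by the caller.
    """
    k = tail_k if is_vision_tail else base_k
    # Pick the k experts with the highest router logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected experts only (shifted by the max for stability).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}
```

Language tokens and non-tail vision tokens keep the usual sparse budget, while tail tokens spread their computation over a larger set of experts.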
[37] Activation Reward Models for Few-Shot Model Alignment
Tianning Chai,Chancharik Mitra,Brandon Huang,Gautam Rajendrakumar Gare,Zhiqiu Lin,Assaf Arbelle,Leonid Karlinsky,Rogerio Feris,Trevor Darrell,Deva Ramanan,Roei Herzig
Main category: cs.CV
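TL;DR: The paper proposes Activation Reward Models (Activation RMs), a few-shot reward modeling method that uses activation steering to build well-aligned reward signals without additional finetuning. It outperforms existing methods and performs strongly on reward hacking.
Details
Motivation: Traditional reward modeling adapts poorly to new preferences because it needs large datasets and a separately trained reward model. The paper aims for a method that builds reward signals from minimal supervision with no additional finetuning.
Contribution: 1. Activation RMs, a few-shot reward modeling method; 2. PreferenceHack, a benchmark that tests reward models on reward hacking; 3. Validation on multiple benchmarks, where Activation RMs surpass even GPT-4o.
Method: Activation steering is used to construct reward signals from the model's internal activations, with no additional training.
Result: Activation RMs perform strongly on standard reward modeling benchmarks and on PreferenceHack, outperform existing few-shot methods, and effectively mitigate reward hacking behaviors.
Insight: Activation steering can build reward signals efficiently from minimal supervision, offering a new route to model alignment with particular promise for safety-critical applications.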
Abstract: Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models’ generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training using reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires a separate reward model, commonly trained on large preference datasets. To address this, we introduce Activation Reward Models (Activation RMs) – a novel few-shot reward modeling method that leverages activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based scoring, and token probability scoring on standard reward modeling benchmarks. Furthermore, we demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications. Toward this end, we propose PreferenceHack, a novel few-shot setting benchmark, the first to test reward models on reward hacking in a paired preference format. Finally, we show that Activation RM achieves state-of-the-art performance on this benchmark, surpassing even GPT-4o.
[38] MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
Langyu Wang,Bingke Zhu,Yingying Chen,Yiyuan Zhang,Ming Tang,Jinqiao Wang
Main category: cs.CV
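TL;DR: The paper proposes MUG, an audio-visual Mamba network with pseudo-labeling augmentation for weakly supervised audio-visual video parsing (AVVP). By generating new data and processing modality features, it clearly improves segment-level and event-level prediction.
Details
Motivation: Existing weakly supervised AVVP methods fall short on segment-level and event-level prediction, mainly because of model architecture deficiencies and the limits of weak supervision.
Contribution: 1. MUG, an audio-visual Mamba network with pseudo-labeling augmentation; 2. New training data generated by cross-modal random combination, which strengthens the model's ability to parse segment-level events; 3. State-of-the-art performance on the LLP dataset.
Method: 1. Generate new data from pseudo-labels; 2. Process features with an audio-visual Mamba network (AV-Mamba) that excludes noise interference.
Result: On the LLP dataset, MUG gains 2.1% and 1.2% on the visual segment-level and audio segment-level metrics, respectively.
Insight: Pseudo-label augmentation and cross-modal feature interaction effectively improve performance on weakly supervised tasks.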
Abstract: Weakly supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of weak supervision and deficiencies in model architecture, existing methods struggle to improve segment-level and event-level prediction simultaneously. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding the noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model's ability to parse various segment-level event combinations. For feature processing and interaction, we employ an audio-visual Mamba network. The AV-Mamba enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on the LLP dataset in all metrics (e.g., gains of 2.1% and 1.2% in terms of visual segment-level and audio segment-level metrics). Our code is available at https://github.com/WangLY136/MUG.
[39] FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases
Shuai Tan,Bill Gong,Bin Ji,Ye Pan
Main category: cs.CV
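TL;DR: FixTalk is a new framework that tackles identity leakage and rendering artifacts in high-quality talking head generation by decoupling identity information from motion features while reusing the leaked identity information to fill in details.
Details
Motivation: Existing methods suffer from identity leakage (IL) and rendering artifacts (RA) in extreme cases, which limits the quality and credibility of the generated videos.
Contribution: The FixTalk framework, which decouples identity information with an Enhanced Motion Indicator (EMI) and repairs artifacts with an Enhanced Detail Indicator (EDI).
Method: 1. EMI separates identity information from motion features; 2. EDI uses the leaked identity information to supplement missing details.
Result: Experiments show that FixTalk reduces IL and RA better than existing methods.
Insight: Identity information is both the source of one problem (IL) and the key to solving the other (repairing RA).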
Abstract: Talking head generation is gaining significant importance across various domains, with a growing demand for high-quality rendering. However, existing methods often suffer from identity leakage (IL) and rendering artifacts (RA), particularly in extreme cases. Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. Firstly, we propose an Enhanced Motion Indicator (EMI) to effectively decouple identity information from motion features, mitigating the impact of IL on generated talking heads. To address RA, we introduce an Enhanced Detail Indicator (EDI), which utilizes the leaked identity information to supplement missing details, thus fixing the artifacts. Extensive experiments demonstrate that FixTalk effectively mitigates IL and RA, achieving superior performance compared to state-of-the-art methods.
[40] Coherent Online Road Topology Estimation and Reasoning with Standard-Definition Maps
Khanh Son Pham,Christian Witte,Jens Behley,Johannes Betz,Cyrill Stachniss
Main category: cs.CV
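TL;DR: The paper proposes a coherent approach that uses standard-definition (SD) maps to estimate and reason about road topology online, for high-definition (HD) map generation in autonomous vehicles. Its network architecture combines prior information with denoising techniques, and experiments confirm its advantage.
Details
Motivation: Autonomous cars usually rely on HD maps, which are hard to generate and update online. Existing methods struggle to model complex road topology in a unified and coherent way. This work uses the prior information in SD maps to address that difficulty.
Contribution: 1. A coherent approach that uses SD maps to predict lane segments, their topology, and road boundaries; 2. A network architecture that combines prior information with denoising techniques to improve training stability and performance; 3. A temporal consistency mechanism that uses past frames to improve results.
Method: The approach: 1. uses prior information from SD maps; 2. introduces hybrid lane segment encodings that combine priors with denoising techniques; 3. uses past frames for temporal consistency. The network architecture improves both training and inference.
Result: Experiments show that the approach outperforms previous methods by a large margin, validating the modeling scheme.
Insight: Prior information from SD maps can substantially improve online HD map generation, and a temporal consistency mechanism further improves coherence. The approach offers a new way to handle map updating in autonomous driving.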
Abstract: Most autonomous cars rely on the availability of high-definition (HD) maps. Current research aims to address this constraint by directly predicting HD map elements from onboard sensors and reasoning about the relationships between the predicted map and traffic elements. Despite recent advancements, the coherent online construction of HD maps remains a challenging endeavor, as it necessitates modeling the high complexity of road topologies in a unified and consistent manner. To address this challenge, we propose a coherent approach to predict lane segments and their corresponding topology, as well as road boundaries, all by leveraging prior map information represented by commonly available standard-definition (SD) maps. We propose a network architecture that leverages hybrid lane segment encodings comprising prior information and denoising techniques to enhance training stability and performance. Furthermore, we leverage past frames for temporal consistency. Our experimental evaluation demonstrates that our approach outperforms previous methods by a large margin, highlighting the benefits of our modeling scheme.
[41] Medical-Knowledge Driven Multiple Instance Learning for Classifying Severe Abdominal Anomalies on Prenatal Ultrasound
Huanwen Liang,Jingxian Xu,Yuanji Zhang,Yuhao Huang,Yuhan Zhang,Xin Yang,Ran Li,Xuedong Deng,Yanjun Liu,Guowei Tao,Yun Wu,Sheng Zhao,Xinru Gao,Dong Ni
Main category: cs.CV
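TL;DR: The paper proposes a multiple instance learning (MIL) method, free of standard plane localization, for case-level classification of fetal abdominal anomalies in prenatal ultrasound. It comprises a mixture-of-attention-experts module (MoAE), a medical-knowledge-driven feature selection module (MFS), and prompt-based prototype learning (PPL).
Details
Motivation: Fetal abdominal malformations are serious congenital anomalies that need accurate diagnosis to guide pregnancy management. AI has great potential in medical diagnosis, but its use for prenatal abdominal anomalies remains limited. Existing studies mostly address image-level classification and pay little attention to case-level diagnosis.
Contribution: 1. The MoAE module, which weights attention heads for different planes; 2. The MFS module, which aligns image features with medical knowledge; 3. PPL, which enhances the MFS.
Method: A multiple instance learning framework combining the MoAE, MFS, and PPL modules for case-level classification.
Result: On a dataset of 2,419 cases and 24,748 images, the method outperforms existing methods.
Insight: Medical-knowledge-driven feature selection and a multi-expert attention mechanism effectively improve diagnostic accuracy for prenatal abdominal anomalies.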
Abstract: Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality. Although AI has demonstrated significant potential in medical diagnosis, its application to prenatal abdominal anomalies remains limited. Most existing studies focus on image-level classification and rely on standard plane localization, placing less emphasis on case-level diagnosis. In this paper, we develop a case-level multiple instance learning (MIL)-based method, free of standard plane localization, for classifying fetal abdominal anomalies in prenatal ultrasound. Our contribution is three-fold. First, we adopt a mixture-of-attention-experts module (MoAE) to weight different attention heads for various planes. Second, we propose a medical-knowledge-driven feature selection module (MFS) to align image features with medical knowledge, performing self-supervised image token selection at the case level. Finally, we propose prompt-based prototype learning (PPL) to enhance the MFS. Extensively validated on a large prenatal abdominal ultrasound dataset containing 2,419 cases, with a total of 24,748 images and 6 categories, our proposed method outperforms state-of-the-art competitors. Code is available at: https://github.com/LL-AC/AAcls.
[42] CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning
Kuniaki Saito,Donghyun Kim,Kwanyong Park,Atsushi Hashimoto,Yoshitaka Ushiku
Main category: cs.CV
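TL;DR: The paper proposes CaptionSmiths, a new method for flexibly controlling language patterns (such as descriptiveness and length) in image captioning. It quantifies three caption properties and uses interpolation to transition smoothly between them.
Details
Motivation: Existing captioning models are not given language properties as a condition during training and cannot switch language patterns smoothly, so the authors propose a new method for fine-grained control over generated captions.
Contribution: CaptionSmiths, which quantifies three caption properties (length, descriptiveness, and word uniqueness) and interpolates between endpoint vectors to control language patterns flexibly.
Method: The method: (1) quantifies caption properties as continuous scalar values; (2) represents the condition by interpolating between endpoint vectors; (3) trains a single model that supports diverse language patterns.
Result: Experiments show that CaptionSmiths adjusts the properties of output captions smoothly and achieves higher lexical alignment than baselines; for example, it reduces the error in controlling caption length by 506%.
Insight: Quantifying language properties and interpolating between them is an effective means of fine-grained control, enabling flexible switching of language patterns without human annotation.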
Abstract: An image captioning model that flexibly switches its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy for two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition their language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption, namely length, descriptiveness, and uniqueness of a word, as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and shows higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506% despite better lexical alignment. Code will be available at https://github.com/omron-sinicx/captionsmiths.
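The endpoint-vector conditioning described above amounts to a linear interpolation driven by a normalized property value. The sketch below assumes a single property, a min-max normalization, and plain Python lists for the two learned endpoint vectors; the name `condition_vector` and the clamping behavior are illustrative choices, not details from the paper.

```python
def condition_vector(value, v_min, v_max, low_vec, high_vec):
    """Interpolate between two endpoint vectors according to a scalar property.

    `low_vec` and `high_vec` stand in for the learned embeddings of the two
    extreme states (e.g. very short vs. very long captions).
    """
    # Normalize the property to [0, 1] and clamp out-of-range requests.
    t = (value - v_min) / (v_max - v_min)
    t = min(max(t, 0.0), 1.0)
    return [(1.0 - t) * lo + t * hi for lo, hi in zip(low_vec, high_vec)]
```

Because the condition varies continuously with the property value, intermediate requests such as a medium-length caption map to points between the two learned extremes.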
[43] Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention
Jiawei Gu,Ziyue Qiao,Zechao Li
Main category: cs.CV
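TL;DR: The paper proposes an efficient out-of-distribution (OOD) detection method. It analyzes how gradient directions differ between ID and OOD samples and uses feature intervention to block spurious gradients while preserving ID classification accuracy.
Details
Motivation: In open-world environments, OOD inputs can cause unsafe behavior in deep models. The authors observe that gradient directions for OOD samples differ from those for ID samples, which motivates an OOD detection method based on gradient direction.
Contribution: A gradient-direction-based feature intervention technique (Gradient Short-Circuit) that separates ID and OOD samples efficiently, together with a local first-order approximation that avoids a second forward pass.
Method: The method analyzes the difference in gradient directions between ID and OOD samples and short-circuits the feature coordinates that spurious gradients exploit. To reduce computational cost, a local first-order approximation directly estimates the post-intervention outputs.
Result: On standard OOD benchmarks, the method yields substantial improvements in detection performance while remaining lightweight and requiring minimal changes to the inference pipeline.
Insight: Gradient direction is an effective signal for separating ID from OOD samples, and feature intervention is an efficient OOD detection mechanism suited to real deployment.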
Abstract: Out-of-Distribution (OOD) detection is critical for safely deploying deep models in open-world environments, where inputs may lie outside the training distribution. During inference on a model trained exclusively with In-Distribution (ID) data, we observe a salient gradient phenomenon: around an ID sample, the local gradient directions for “enhancing” that sample’s predicted class remain relatively consistent, whereas OOD samples–unseen in training–exhibit disorganized or conflicting gradient directions in the same neighborhood. Motivated by this observation, we propose an inference-stage technique to short-circuit those feature coordinates that spurious gradients exploit to inflate OOD confidence, while leaving ID classification largely intact. To circumvent the expense of recomputing the logits after this gradient short-circuit, we further introduce a local first-order approximation that accurately captures the post-modification outputs without a second forward pass. Experiments on standard OOD benchmarks show our approach yields substantial improvements. Moreover, the method is lightweight and requires minimal changes to the standard inference pipeline, offering a practical path toward robust OOD detection in real-world applications.
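The first-order shortcut in the abstract can be illustrated on a linear classification head, where the approximation is exact. The selection rule used here (zeroing the `k` coordinates with the largest gradient-times-activation contribution to the predicted class) is an assumption for illustration, as are the names `short_circuit_logits` and `k`; the paper's actual criterion is not given in the abstract.

```python
def short_circuit_logits(weights, feats, k=1):
    """Estimate logits after zeroing k feature coordinates, without a second forward pass.

    weights: one row of head weights per class; feats: penultimate features.
    Returns (approximate post-intervention logits, indices of zeroed coordinates).
    """
    logits = [sum(w * h for w, h in zip(row, feats)) for row in weights]
    c = max(range(len(logits)), key=logits.__getitem__)
    # For a linear head, the gradient of the predicted-class logit w.r.t. feats is its weight row.
    grad = weights[c]
    contrib = [g * h for g, h in zip(grad, feats)]
    drop = sorted(range(len(feats)), key=contrib.__getitem__, reverse=True)[:k]
    # Zeroing coordinate j changes the features by -feats[j].
    delta = [-feats[j] if j in drop else 0.0 for j in range(len(feats))]
    # First-order update: logits + J @ delta, with J equal to the weight matrix.
    approx = [z + sum(w * d for w, d in zip(row, delta)) for z, row in zip(logits, weights)]
    return approx, drop
```

With a nonlinear head the same update would only approximate the true post-intervention logits, which is the trade-off the abstract accepts to avoid recomputation.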
[44] DocShaDiffusion: Diffusion Model in Latent Space for Document Image Shadow Removal
Wenjie Liu,Bingshu Wang,Ze Wang,C. L. Philip Chen
Main category: cs.CV
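TL;DR: The paper proposes DocShaDiffusion, a latent-space diffusion model for document image shadow removal. A soft-mask generation module and a mask-guided diffusion module handle color shadows effectively, and a shadow-robust perceptual feature loss preserves details.
Details
Motivation: Existing document shadow removal methods usually ignore color shadows or handle only a constant background color, which leads to poor results.
Contribution: The DocShaDiffusion model, including a soft-mask generation module and a mask-guided diffusion module, plus SDCSRD, a large-scale synthetic dataset.
Method: A diffusion model processes shadows in latent space: SSGM produces an accurate soft mask, SMGDM guides the diffusion process, and a perceptual feature loss refines details.
Result: Experiments on three public datasets show that the method outperforms the state of the art.
Insight: Latent-space processing and mask guidance markedly improve shadow removal, especially for color shadows.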
Abstract: Document shadow removal is a crucial task in the field of document image enhancement. However, existing methods tend to remove shadows only on constant-color backgrounds and ignore color shadows. In this paper, we first design a diffusion model in latent space for document image shadow removal, called DocShaDiffusion. It translates shadow images from pixel space to latent space, enabling the model to more easily capture essential features. To address the issue of color shadows, we design a shadow soft-mask generation module (SSGM). It produces an accurate shadow mask and adds noise specifically to shadow regions. Guided by the shadow mask, a shadow mask-aware guided diffusion module (SMGDM) is proposed to remove shadows from document images by supervising the diffusion and denoising process. We also propose a shadow-robust perceptual feature loss to preserve details and structures in document images. Moreover, we develop a large-scale synthetic document color shadow removal dataset (SDCSRD). It simulates the distribution of realistic color shadows and provides strong support for model training. Experiments on three public datasets validate the proposed method's superiority over the state of the art. Our code and dataset will be publicly available.
[45] DiffMark: Diffusion-based Robust Watermark Against Deepfakes
Chen Sun,Haiyang Sun,Zhiqing Guo,Yunfeng Diao,Liejun Wang,Dan Ma,Gaobo Yang,Keqin Li
Main category: cs.CV
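TL;DR: DiffMark is a robust diffusion-based watermarking framework. It modifies the training and sampling scheme of the diffusion model and adds conditional guidance and a watermark fusion module to generate watermarked images that resist Deepfake manipulation.
Details
Motivation: Deepfakes threaten security and privacy, and existing watermarking methods are not robust enough against Deepfake manipulation.
Contribution: 1. DiffMark, a diffusion-based watermarking framework; 2. A timestep-dependent weighting strategy for the facial condition; 3. A cross information fusion module; 4. A frozen autoencoder and adversarial guidance to strengthen resistance to Deepfakes.
Method: 1. Modify the training and sampling pipeline of the diffusion model, using the facial image and the watermark as conditions; 2. Fuse watermark features with the CIF module; 3. Simulate Deepfake manipulation with a frozen autoencoder; 4. Use adversarial guidance to refine watermark generation.
Result: Experiments show that DiffMark performs well under typical Deepfake manipulations.
Insight: With conditional guidance and feature fusion, diffusion models can generate highly robust watermarks that withstand Deepfake manipulation.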
Abstract: Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of a watermark with an image during generation. In this study, we propose a novel robust watermarking framework based on a diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate the corresponding watermarked image. In the construction of the facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity as the noise decreases, thus better adapting to the sampling process of the diffusion model. To achieve the fusion of the watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during the training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs a specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at https://github.com/vpsg-research/DiffMark.
[46] NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation
Max Gandyra,Alessandro Santonicola,Michael Beetz
Main category: cs.CV
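TL;DR: NOCTIS is a retraining-free instance segmentation framework that uses Grounded-SAM 2 and DINOv2 to segment novel object instances, and it performs strongly on the BOP 2023 challenge.
Details
Motivation: Existing methods for instance segmentation of novel objects usually need retraining, which limits where they can be applied. NOCTIS aims to be a general framework that needs no retraining.
Contribution: A cyclic-threshold-based instance segmentation method that improves the object matching score and the quality of the feature embeddings, substantially improving novel object segmentation.
Method: NOCTIS uses Grounded-SAM 2 to generate object proposals and segmentation masks, and DINOv2's zero-shot capabilities to generate image embeddings. Cyclic threshold filtering and a weighted score provide efficient object matching.
Result: On the seven core datasets of the BOP 2023 challenge, NOCTIS outperforms other RGB and RGB-D methods without additional training.
Insight: NOCTIS shows that combining foundation models with simple matching techniques can handle instance segmentation of novel objects efficiently without retraining.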
Abstract: Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we propose a simple yet powerful framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). This work stems from and improves upon previous ones like CNOS, SAM-6D and NIDS-Net; thus, it also leverages recent vision foundation models, namely Grounded-SAM 2 and DINOv2. It utilises Grounded-SAM 2 to obtain object proposals with precise bounding boxes and their corresponding segmentation masks, while DINOv2's zero-shot capabilities are employed to generate the image embeddings. The quality of those masks, together with their embeddings, is of vital importance to our approach, as the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings. Unlike SAM-6D, calculating the latter involves a prior patch filtering based on the distance between each patch and its corresponding cyclic/roundtrip patch in the image grid. Furthermore, the average confidence of the proposals' bounding box and mask is used as an additional weighting factor for the object matching score. We empirically show that NOCTIS, without further training/fine-tuning, outperforms the best RGB and RGB-D methods on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
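One way to read the cyclic/roundtrip patch filtering is sketched below: each proposal patch is matched to its most similar template patch, that template patch is matched back to the proposal, and the patch is kept only if the roundtrip lands close to where it started on the image grid. The row-major grid layout, the Euclidean distance threshold `max_dist`, and the function names are assumptions; the class-embedding term and the confidence weighting from the abstract are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cyclic_patch_score(prop_patches, tmpl_patches, grid_width, max_dist=1.0):
    """Average max patch similarity over proposal patches that pass a roundtrip check.

    prop_patches are proposal patch embeddings in row-major grid order.
    """
    sims = [[cosine(p, t) for t in tmpl_patches] for p in prop_patches]
    kept = []
    for i, row in enumerate(sims):
        # Forward match: best template patch for proposal patch i.
        j = max(range(len(row)), key=row.__getitem__)
        # Backward match: best proposal patch for that template patch.
        back = max(range(len(prop_patches)), key=lambda k: sims[k][j])
        dy = i // grid_width - back // grid_width
        dx = i % grid_width - back % grid_width
        # Keep the patch only if the roundtrip stays within max_dist grid cells.
        if math.hypot(dx, dy) <= max_dist:
            kept.append(row[j])
    return sum(kept) / len(kept) if kept else 0.0
```

In a fuller version this patch score would be combined with the class-embedding similarity and scaled by the proposal's average box and mask confidence, as the abstract describes.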
[47] Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
Ge Wu,Shen Zhang,Ruijing Shi,Shanghua Gao,Zhenyuan Chen,Lei Wang,Zhaowei Chen,Hongcheng Gao,Yao Tang,Jian Yang,Ming-Ming Cheng,Xiang Li
Main category: cs.CV
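TL;DR: The paper proposes a simple method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a high-level class token from a pretrained foundation model. It substantially improves generation quality and training efficiency with negligible inference overhead.
Details
Motivation: Existing methods such as REPA ease diffusion model training by aligning with external visual representations, but they do not fully exploit the potential of discriminative representations.
Contribution: REG, which entangles a high-level class token with image latents to generate coherent image-class pairs directly, substantially improving generation quality and training efficiency.
Method: REG entangles low-level image latents with a high-level class token from a pretrained model and needs only one additional token for denoising (<0.5% increase in FLOPs and latency).
Result: On ImageNet 256×256, SiT-XL/2 + REG trains 63× and 23× faster than SiT-XL/2 and SiT-XL/2 + REPA, respectively, and SiT-L/2 + REG trained for only 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations.
Insight: High-level semantic knowledge can actively guide image generation. This confirms the potential of representation entanglement and suggests a new route to efficient diffusion model training.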
Abstract: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.
[48] What Really Matters for Robust Multi-Sensor HD Map Construction?
Xiaoshuai Hao,Yuting Zhao,Yuheng Ji,Luanyuan Dai,Peng Hao,Dingzhe Li,Shuai Cheng,Rong Yin
Main category: cs.CV
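TL;DR: The paper studies how data augmentation, a multi-modal fusion module, and a modality dropout training strategy can make high-definition (HD) map construction more robust, and validates the approach on the NuScenes dataset.
Details
Motivation: Existing Camera-LiDAR fusion methods focus mainly on accuracy and neglect robustness, which is critical for real-world autonomous driving systems.
Contribution: Three key components (data augmentation, a novel multi-modal fusion module, and a modality dropout training strategy) that significantly improve the robustness of HD map construction.
Method: Data augmentation, a multi-modal fusion module, and modality dropout training are combined to improve both the robustness and the accuracy of multi-modal fusion.
Result: On the NuScenes dataset, the proposed methods significantly improve the robustness of baseline models and achieve state-of-the-art performance on the clean validation set.
Insight: Robustness is a dimension that multi-sensor HD map construction cannot ignore, and combining data augmentation with modality dropout training effectively improves reliability in real-world scenarios.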
Abstract: High-definition (HD) map construction methods are crucial for providing precise and comprehensive static environmental information, which is essential for autonomous driving systems. While Camera-LiDAR fusion techniques have shown promising results by integrating data from both modalities, existing approaches primarily focus on improving model accuracy and often neglect the robustness of perception models, which is a critical aspect for real-world applications. In this paper, we explore strategies to enhance the robustness of multi-modal fusion methods for HD map construction while maintaining high accuracy. We propose three key components: data augmentation, a novel multi-modal fusion module, and a modality dropout training strategy. These components are evaluated on a challenging dataset containing 10 days of NuScenes data. Our experimental results demonstrate that our proposed methods significantly enhance the robustness of baseline methods. Furthermore, our approach achieves state-of-the-art performance on the clean validation set of the NuScenes dataset. Our findings provide valuable insights for developing more robust and reliable HD map construction models, advancing their applicability in real-world autonomous driving scenarios. Project website: https://robomap-123.github.io.
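A modality dropout training strategy of the kind named above could be sketched as follows. The abstract does not give the dropout schedule, so zero-filling a whole modality, splitting the drop probability evenly between camera and LiDAR, and the names `modality_dropout` and `p_drop` are assumptions.

```python
import random


def modality_dropout(camera_feat, lidar_feat, p_drop=0.3, rng=random):
    """Randomly zero out one modality's features during training (never both).

    With probability p_drop/2 the camera features are dropped, with probability
    p_drop/2 the LiDAR features are dropped, otherwise both are kept.
    """
    r = rng.random()
    if r < p_drop / 2:
        camera_feat = [0.0] * len(camera_feat)
    elif r < p_drop:
        lidar_feat = [0.0] * len(lidar_feat)
    return camera_feat, lidar_feat
```

Applied before the fusion module at each training step, this forces the fused representation to remain usable when one sensor is missing or corrupted.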
[49] AVC-DPO: Aligned Video Captioning via Direct Preference Optimization
Jiyang Tang,Hengyi Li,Yifan Du,Wayne Xin Zhao
Main category: cs.CV
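TL;DR: AVC-DPO aligns video captioning through Direct Preference Optimization, addressing the difficulty video multimodal LLMs have in adjusting caption focus to human preferences. It clearly improves performance and leads the VDC benchmark.
Details
Motivation: Although video multimodal large language models have progressed on video captioning, their output often fails to match human preferences for temporal dynamics and spatial information.
Contribution: The AVC-DPO framework, which aligns video captions with preferences through Direct Preference Optimization, and enhanced prompts that target temporal dynamics and spatial information.
Method: Responses from the same foundation model under different prompt conditions are used for preference-aware training and alignment, producing captions that match human preferences.
Result: The method performs strongly in the LOVE@CVPR'25 Workshop Track 1A: Video Detailed Captioning Challenge, taking first place on the VDC benchmark.
Insight: Direct optimization against human preferences clearly improves caption quality, especially in conveying temporal dynamics and spatial information.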
Abstract: Although video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks, it remains challenging to adjust the focal emphasis of video captions according to human preferences. To address this limitation, we propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. Our approach designs enhanced prompts that specifically target temporal dynamics and spatial information, two key factors that humans care about when watching a video, thereby incorporating human-centric preferences. AVC-DPO leverages the same foundation model's caption generation responses under varied prompt conditions to conduct preference-aware training and caption alignment. Using this framework, we have achieved exceptional performance in the LOVE@CVPR'25 Workshop Track 1A: Video Detailed Captioning Challenge, achieving first place on the Video Detailed Captioning (VDC) benchmark according to the VDCSCORE evaluation metric.
[50] Crop Pest Classification Using Deep Learning Techniques: A Review
Muhammad Hassam Ejaz,Muhammad Bilal,Usman Habib
Main category: cs.CV
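TL;DR: This review covers deep-learning-based crop pest classification, compares the performance of CNNs, ViTs, and hybrid models, and identifies the field's current challenges and future directions.
Details
Motivation: Traditional pest monitoring is inefficient and hard to scale, so deep learning is needed to automate pest detection and improve efficiency and accuracy.
Contribution: 1. A systematic review of 37 related studies; 2. Organization by crop type, pest species, and model architecture; 3. A summary of datasets and technical challenges.
Method: The review collects research published between 2018 and 2025 and analyzes the use and performance of CNNs, ViTs, and hybrid models in pest classification.
Result: Hybrid models and ViTs perform better than traditional CNNs, but imbalanced datasets, small-object detection, poor generalization, and edge deployment remain open problems.
Insight: Future work should focus on data augmentation, better small-object detection, and lightweight models to bring AI-based pest monitoring into practical agricultural use.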
Abstract: Insect pests continue to pose a serious threat to crop yields around the world, and traditional methods for monitoring them are often slow, manual, and difficult to scale. In recent years, deep learning has emerged as a powerful solution, with techniques like convolutional neural networks (CNNs), vision transformers (ViTs), and hybrid models gaining popularity for automating pest detection. This review looks at 37 carefully selected studies published between 2018 and 2025, all focused on AI-based pest classification. The selected research is organized by crop type, pest species, model architecture, dataset usage, and key technical challenges. Early studies relied heavily on CNNs, but the latest work is shifting toward hybrid and transformer-based models that deliver higher accuracy and better contextual understanding. Still, challenges like imbalanced datasets, difficulty in detecting small pests, limited generalizability, and deployment on edge devices remain significant hurdles. Overall, this review offers a structured overview of the field, highlights useful datasets, and outlines the key challenges and future directions for AI-based pest monitoring systems.
[51] ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
Jimyeong Kim,Jungwon Park,Yeji Song,Nojun Kwak,Wonjong Rhee
Main category: cs.CV
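TL;DR: Proposes a text-guided real-image editing method for Rectified Flow (ReFlow) that analyzes intermediate representations and extracts key features, enabling training-free, mask-free editing that outperforms prior methods.
Details
Motivation: Rectified Flow surpasses diffusion models in image quality and text alignment, but applying it to real-image editing remains challenging. The paper addresses this problem. Contribution: 1. A ReFlow-based real-image editing method; 2. Mid-step feature extraction and attention adaptation to improve editing; 3. No training or user-provided mask is required, and the method works even without a source prompt.
Method: 1. Analyzes the intermediate representations of multimodal transformer blocks and identifies three key features; 2. Uses a mid-step latent to preserve structural information; 3. Adapts attention to improve editability and alignment with the target text.
Result: Outperforms nine baselines on two benchmarks; human evaluations further confirm a user preference for the method.
Insight: Mid-step feature extraction and attention adaptation are key to ReFlow's performance on real-image editing.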
Abstract: Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.
[52] Integrating Traditional and Deep Learning Methods to Detect Tree Crowns in Satellite Images
Ozan Durgut,Beril Kallfelz-Sirmacek,Cem Unsalan
Main category: cs.CV
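TL;DR: Proposes a rule-based algorithm that combines traditional and deep learning methods to improve the robustness and accuracy of tree crown detection in satellite images.
Details
Motivation: Global warming, biodiversity loss, and air pollution call for forest monitoring solutions. Traditional and deep learning methods each have limitations when used alone, so their strengths need to be combined. Contribution: A novel rule-based approach that integrates traditional methods (feature extraction and segmentation) with deep learning (tree crown detection) and improves the results through post-processing.
Method: 1. Traditional methods extract features and segment forested areas; 2. Deep learning detects tree crowns; 3. A rule-based step post-processes the results, using neighboring trees and localized operations to increase the number of detected crowns.
Result: The methods are compared by the number of detected tree crowns, and the paper reports the advantages, disadvantages, and areas for improvement of the outcomes.
Insight: Combining traditional and deep learning methods can improve robustness and accuracy, especially in complex scenes.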
Abstract: Global warming, loss of biodiversity, and air pollution are among the most significant problems facing Earth. One of the primary challenges in addressing these issues is the lack of monitoring forests to protect them. To tackle this problem, it is important to leverage remote sensing and computer vision methods to automate monitoring applications. Hence, automatic tree crown detection algorithms emerged based on traditional and deep learning methods. In this study, we first introduce two different tree crown detection methods based on these approaches. Then, we form a novel rule-based approach that integrates these two methods to enhance robustness and accuracy of tree crown detection results. While traditional methods are employed for feature extraction and segmentation of forested areas, deep learning methods are used to detect tree crowns in our method. With the proposed rule-based approach, we post-process these results, aiming to increase the number of detected tree crowns through neighboring trees and localized operations. We compare the obtained results with the proposed method in terms of the number of detected tree crowns and report the advantages, disadvantages, and areas for improvement of the obtained outcomes.
[53] Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence
Robert Aufschläger,Youssef Shoeb,Azarm Nowzad,Michael Heigl,Fabian Bally,Martin Schramm
Main category: cs.CV
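TL;DR: Proposes cRID, a cross-modal framework that combines large vision-language models and graph attention networks to detect textually describable PII clues and enhance person re-identification (Re-ID), and shows its effectiveness in practical cross-dataset scenarios.
Details
Motivation: Street-level recordings released as open data advance autonomous driving and AI research, but they also pose privacy risks, particularly through pedestrians' personally identifiable information (PII). The paper addresses this with cross-modal intelligence. Contribution: 1. The cRID framework, which combines vision-language models and graph attention networks to detect semantically meaningful PII clues. 2. A systematic evaluation of PII presence in person image datasets.
Method: 1. Extracts cross-modal features with large vision-language models and graph attention networks. 2. Identifies textually describable PII clues through representation learning. 3. Validates the approach in the Market-1501 to CUHK03-np setting.
Result: Experiments show improved performance in cross-dataset Re-ID, notably from Market-1501 to CUHK03-np.
Insight: Cross-modal intelligence can detect and exploit semantic PII clues rather than relying only on low-level appearance features, which makes Re-ID more practical and supports privacy protection.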
Abstract: The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textual describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework’s practical utility. Code is available at https://github.com/RAufschlaeger/cRID.
[54] Mamba Guided Boundary Prior Matters: A New Perspective for Generalized Polyp Segmentation
Tapas K. Dutta,Snehashis Majhi,Deepak Ranjan Nayak,Debesh Jha
Main category: cs.CV
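TL;DR: SAM-MaGuP is a new polyp segmentation method that combines a boundary distillation module with a 1D-2D Mamba adapter, improving segmentation of weak boundaries.
Details
Motivation: Polyp segmentation is challenging because of wide variation in shape, size, and color and because boundaries are often indistinct; existing methods struggle to segment weak boundaries stably. Contribution: Proposes SAM-MaGuP, which introduces a Mamba-guided boundary prior and a 1D-2D Mamba block, improving segmentation accuracy and robustness.
Method: Incorporates a boundary distillation module and a 1D-2D Mamba adapter into the Segment Anything Model (SAM) to strengthen global contextual interaction and boundary feature learning.
Result: Outperforms existing methods on five datasets, with higher segmentation accuracy and robustness.
Insight: Mamba structures handle weak boundaries well, and global contextual interaction is crucial for segmentation.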
Abstract: Polyp segmentation in colonoscopy images is crucial for early detection and diagnosis of colorectal cancer. However, this task remains a significant challenge due to the substantial variations in polyp shape, size, and color, as well as the high similarity between polyps and surrounding tissues, often compounded by indistinct boundaries. While existing encoder-decoder CNN and transformer-based approaches have shown promising results, they struggle with stable segmentation performance on polyps with weak or blurry boundaries. These methods exhibit limited abilities to distinguish between polyps and non-polyps and capture essential boundary cues. Moreover, their generalizability still falls short of meeting the demands of real-time clinical applications. To address these limitations, we propose SAM-MaGuP, a groundbreaking approach for robust polyp segmentation. By incorporating a boundary distillation module and a 1D-2D Mamba adapter within the Segment Anything Model (SAM), SAM-MaGuP excels at resolving weak boundary challenges and amplifies feature learning through enriched global contextual interactions. Extensive evaluations across five diverse datasets reveal that SAM-MaGuP outperforms state-of-the-art methods, achieving unmatched segmentation accuracy and robustness. Our key innovations, a Mamba-guided boundary prior and a 1D-2D Mamba block, set a new benchmark in the field, pushing the boundaries of polyp segmentation to new heights.
[55] TrackingMiM: Efficient Mamba-in-Mamba Serialization for Real-time UAV Object Tracking
Bingxi Liu,Calvin Chen,Junhao Li,Guyang Yu,Haoqian Song,Xuchen Liu,Jinqiang Cui,Hong Zhang
Main category: cs.CV
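TL;DR: Proposes TrackingMiM, an efficient Mamba-in-Mamba architecture for real-time unmanned aerial vehicle (UAV) object tracking that addresses the temporal inconsistency caused by existing methods ignoring temporal continuity.
Details
Motivation: Vision Transformers (ViTs) face quadratic complexity, which is especially critical in UAV tracking, where data must be processed in real time. The study uses the computational efficiency and long-sequence modeling ability of the state-space model Mamba to address this. Contribution: Proposes the TrackingMiM architecture, which performs Mamba scans in a nested way, processes temporally and spatially coherent patch tokens independently, and encodes the template frame as a query token for tracking.
Method: A Mamba-in-Mamba structure runs nested Mamba scans over temporally and spatially coherent patch tokens separately, with the template frame used as a query token for tracking.
Result: On five UAV tracking benchmarks, TrackingMiM achieves state-of-the-art precision at noticeably higher speed.
Insight: Accounting for temporal continuity in the Mamba scanning mechanism improves both the real-time performance and the accuracy of UAV tracking.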
Abstract: The Vision Transformer (ViT) model has long struggled with the challenge of quadratic complexity, a limitation that becomes especially critical in unmanned aerial vehicle (UAV) tracking systems, where data must be processed in real time. In this study, we explore the recently proposed State-Space Model, Mamba, leveraging its computational efficiency and capability for long-sequence modeling to effectively process dense image sequences in tracking tasks. First, we highlight the issue of temporal inconsistency in existing Mamba-based methods, specifically the failure to account for temporal continuity in the Mamba scanning mechanism. Second, building upon this insight, we propose TrackingMiM, a Mamba-in-Mamba architecture with a minimal computational burden for handling image sequences in tracking problems. In our framework, the Mamba scan is performed in a nested way while independently processing temporally and spatially coherent patch tokens, and the template frame is encoded as a query token and utilized for tracking in every scan. Extensive experiments conducted on five UAV tracking benchmarks confirm that the proposed TrackingMiM achieves state-of-the-art precision while offering noticeably higher speed in UAV tracking.
[56] Interpolation-Based Event Visual Data Filtering Algorithms
Marcin Kowlaczyk,Tomasz Kryjak
Main category: cs.CV
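TL;DR: Proposes interpolation-based filtering algorithms for event vision data: four methods built on a matrix of infinite impulse response (IIR) filters that remove most of the noise in event camera data while preserving the valid signal.
Details
Motivation: Event cameras are increasingly used in neuromorphic vision, but their data streams carry significant noise, so efficient real-time filtering is needed. Contribution: Four algorithms based on a matrix of IIR filters that remove approximately 99% of noise while preserving most of the valid signal and that suit embedded devices.
Method: Designs four interpolation-based algorithms on a matrix of IIR filters and evaluates them on event datasets with added synthetic noise and noise recorded with a dynamic vision sensor.
Result: The algorithms need only about 30 KB of memory for a 1280 x 720 sensor, making them suitable for embedded implementation, and remove approximately 99% of noise.
Insight: Interpolation combined with IIR filters can handle event data noise efficiently, offering a practical option for embedded real-time applications.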
Abstract: The field of neuromorphic vision is developing rapidly, and event cameras are finding their way into more and more applications. However, the data stream from these sensors is characterised by significant noise. In this paper, we propose a method for event data that is capable of removing approximately 99% of noise while preserving the majority of the valid signal. We propose four algorithms based on the matrix of infinite impulse response (IIR) filters method. We compared them on several event datasets that were further modified by adding artificially generated noise and noise recorded with a dynamic vision sensor. The proposed methods use about 30 KB of memory for a sensor with a resolution of 1280 x 720 and are therefore well suited for implementation in embedded devices.
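To give a feel for what a matrix of IIR filters over an event stream might look like, here is a minimal sketch that keeps one first-order IIR filter per coarse cell of the sensor and passes an event only if its timestamp is close to the cell's smoothed activity timestamp. This is not one of the paper's four algorithms and it omits their interpolation step; the class name, cell size, `alpha`, and `max_dt` are all hypothetical choices made for illustration.

```python
class CoarseIIRFilterGrid:
    """Illustrative event filter: one first-order IIR filter per coarse cell."""

    def __init__(self, width, height, cell=8, alpha=0.25, max_dt=10_000):
        self.cell = cell        # cell side length in pixels (assumed)
        self.alpha = alpha      # IIR smoothing coefficient (assumed)
        self.max_dt = max_dt    # accepted timestamp gap in microseconds (assumed)
        cols = (width + cell - 1) // cell
        rows = (height + cell - 1) // cell
        # Filter state per cell: a smoothed timestamp of recent activity.
        self.state = [[None] * cols for _ in range(rows)]

    def process(self, x, y, t):
        """Return True if the event at pixel (x, y) and time t is kept."""
        r, c = y // self.cell, x // self.cell
        prev = self.state[r][c]
        # Keep the event only if the cell was recently active.
        keep = prev is not None and abs(t - prev) <= self.max_dt
        # First-order IIR update: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1].
        if prev is None:
            self.state[r][c] = float(t)
        else:
            self.state[r][c] = self.alpha * t + (1 - self.alpha) * prev
        return keep

grid = CoarseIIRFilterGrid(1280, 720)
kept = [grid.process(10, 10, 1000), grid.process(12, 11, 1500)]
```

Storing state per cell rather than per pixel is what keeps the memory footprint small in a design like this; isolated events with no recent neighbors in their cell are treated as noise.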
[57] A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation
Hao Wang,Keyan Hu,Xin Guo,Haifeng Li,Chao Tao
Main category: cs.CV
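TL;DR: Proposes IDGBR, a framework that integrates discriminative and diffusion-based generative learning to improve boundary precision in remote sensing semantic segmentation, addressing the weakness of existing methods in learning high-frequency details.
Details
Motivation: Existing remote sensing segmentation methods rely heavily on discriminative learning, which captures low-frequency features well but neglects high-frequency boundary details. Diffusion generative models are good at generating high-frequency details but weak at semantic inference, so the strengths of both need to be combined. Contribution: Proposes the IDGBR framework, which combines discriminative learning with diffusion-based generative learning and refines boundaries through an iterative denoising process.
Method: A discriminative backbone generates a coarse segmentation map; a conditioning guidance network then takes this map and the original image to learn a guidance representation, which an iterative denoising diffusion process uses to refine the coarse segmentation.
Result: Boundary refinement is validated on five remote sensing semantic segmentation datasets, covering both binary and multi-class segmentation.
Insight: Combining the strengths of discriminative and generative learning improves boundary precision in semantic segmentation, bridging low- and high-frequency feature learning.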
Abstract: Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model’s ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework’s capability of consistent boundary refinement for coarse results from diverse discriminative architectures. The source code will be available at https://github.com/KeyanHu-git/IDGBR.
[58] SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation
Bryan Constantine Sadihin,Michael Hua Wang,Shei Pern Chua,Hang Su
Main category: cs.CV
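TL;DR: SketchColour is a sketch-to-colour pipeline for 2D animation built on a diffusion transformer (DiT). It injects sketch information through lightweight channel-concatenation adapters with LoRA finetuning, greatly reducing parameter count and GPU memory use, and outperforms existing methods.
Details
Motivation: Traditional 2D animation requires drawing and colouring a large number of frames by hand, which is slow and labour-intensive. Existing colourization methods automate part of the work but carry heavy parameter and memory overhead and struggle to keep results temporally consistent. Contribution: 1. The first DiT-based sketch-to-colour pipeline for 2D animation; 2. Lightweight channel-concatenation adapters with LoRA finetuning that integrate sketch information efficiently; 3. Better results than existing methods on the SAKUGA dataset using only half the training data.
Method: Replaces the U-Net denoiser with a DiT architecture, injects sketch information through channel-concatenation adapters, and uses LoRA finetuning for efficient conditioning.
Result: Performs strongly on the SAKUGA dataset, producing temporally coherent animations with few artifacts such as colour bleeding or object deformation.
Insight: DiT architectures show promise for video colourization, and lightweight conditioning can greatly reduce compute requirements.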
Abstract: The production of high-quality 2D animation is a highly labor-intensive process, as animators are currently required to draw and color a large number of frames by hand. We present SketchColour, the first sketch-to-colour pipeline for 2D animation built on a diffusion transformer (DiT) backbone. By replacing the conventional U-Net denoiser with a DiT-style architecture and injecting sketch information via lightweight channel-concatenation adapters accompanied by LoRA finetuning, our method natively integrates conditioning without the parameter and memory bloat of a duplicated ControlNet, greatly reducing parameter count and GPU memory usage. Evaluated on the SAKUGA dataset, SketchColour outperforms previous state-of-the-art video colourization methods across all metrics, despite using only half the training data of competing models. Our approach produces temporally coherent animations with minimal artifacts such as colour bleeding or object deformation. Our code is available at: https://bconstantine.github.io/SketchColour .
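The parameter savings attributed to LoRA finetuning come from training a low-rank update instead of the full weight matrix. The following is a minimal sketch of the generic LoRA forward pass, y = W x + (alpha / r) B (A x), in plain Python. It is not SketchColour's implementation; the function names, matrix sizes, and `alpha` values are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Generic LoRA layer: frozen weight W plus a trainable low-rank update.

    W is d_out x d_in (frozen), A is r x d_in, B is d_out x r.
    Only A and B would be trained, which is r * (d_in + d_out) parameters
    instead of d_in * d_out.
    """
    r = len(A)  # rank of the update
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight (toy example)
A = [[1.0, 1.0]]               # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # B starts at zero, so the layer starts unchanged
y = lora_forward(W, A, B_zero, [1.0, 2.0])
```

Initializing B to zero is the usual LoRA convention: the adapted layer reproduces the pretrained layer exactly at the start of finetuning.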
[59] Autonomous AI Surveillance: Multimodal Deep Learning for Cognitive and Behavioral Monitoring
Ameer Hamza,Zuhaib Hussain But,Umar Arif,Samiya,M. Abdullah Asad,Muhammad Naeem
Main category: cs.CV
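TL;DR: Proposes a multimodal deep learning system for real-time monitoring of student attentiveness that integrates drowsiness detection, mobile phone usage tracking, and face recognition, built on YOLOv8, LResNet Occ FC, and related techniques, and reports promising evaluation results.
Details
Motivation: Traditional classroom monitoring cannot fully assess student attentiveness and behavior, so a multimodal, automated solution is needed. Contribution: 1) A deep learning system that integrates multiple modalities, 2) detection of drowsiness, phone usage, and faces, 3) a practical application framework built on PHP and ESP32-CAM.
Method: YOLOv8 detects mobile phone usage and sleep; face recognition uses LResNet Occ FC together with YOLO and MTCNN; the models are trained on the RMFD dataset and a Roboflow dataset.
Result: Sleep detection reaches 97.42% mAP@50, face recognition reaches 86.45% validation accuracy, and mobile phone detection reaches 85.89% mAP@50.
Insight: Multimodal fusion makes the monitoring system more comprehensive and accurate, suiting it to real-time behavior analysis in educational settings.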
Abstract: This study presents a novel classroom surveillance system that integrates multiple modalities, including drowsiness detection, tracking of mobile phone usage, and face recognition, to assess student attentiveness with enhanced precision. The system leverages the YOLOv8 model to detect both mobile phone usage and sleep (Ghatge et al., 2024), while facial recognition is achieved through LResNet Occ FC body tracking using YOLO and MTCNN (Durai et al., 2024). These models work in synergy to provide comprehensive, real-time monitoring, offering insights into student engagement and behavior (S et al., 2023). The framework is trained on specialized datasets, such as the RMFD dataset for face recognition and a Roboflow dataset for mobile phone detection. The extensive evaluation of the system shows promising results. Sleep detection achieves 97.42% mAP@50, face recognition achieves 86.45% validation accuracy, and mobile phone detection reaches 85.89% mAP@50. The system is implemented within a core PHP web application and utilizes ESP32-CAM hardware for seamless data capture (Neto et al., 2024). This integrated approach not only enhances classroom monitoring, but also ensures automatic attendance recording via face recognition as students remain seated in the classroom, offering scalability for diverse educational environments (Banada, 2025).
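For readers unfamiliar with the mAP@50 figures quoted above: the "50" refers to an intersection-over-union (IoU) threshold of 0.5 used to decide whether a predicted box matches a ground-truth box. The sketch below shows that matching test only, not the full mean average precision computation; the function names and example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def matches_at_50(predicted, ground_truth, threshold=0.5):
    """A detection counts as correct at mAP@50 when IoU >= 0.5."""
    return iou(predicted, ground_truth) >= threshold

# Hypothetical boxes: the prediction is shifted 2 pixels to the right.
ok = matches_at_50((2, 0, 12, 10), (0, 0, 10, 10))
```

Average precision is then computed per class from the ranked detections and averaged across classes to give mAP.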
[60] DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation
Yue-Jiang Dong,Wang Zhao,Jiale Xu,Ying Shan,Song-Hai Zhang
Main category: cs.CV
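TL;DR: DepthSync is a training-free framework that uses diffusion guidance to achieve scale- and geometry-consistent video depth estimation, addressing accumulated scale discrepancies and geometric inconsistency in long-video depth prediction.
Details
Motivation: Existing video depth estimation methods typically split videos into overlapping sliding windows, which accumulates scale discrepancies across windows. They also rely solely on 2D diffusion priors and overlook 3D geometric structure, producing geometrically inconsistent predictions. Contribution: Proposes the DepthSync framework, which combines scale guidance and geometry guidance to achieve scale- and geometry-consistent depth for long videos without additional training.
Method: Scale guidance synchronizes the depth scale across windows, and geometry guidance enforces geometric alignment within windows based on 3D constraints; together they steer the denoising process.
Result: Experiments on multiple datasets show improved scale and geometry consistency, particularly for long videos.
Insight: By adding 3D geometric constraints and cross-window scale synchronization, DepthSync shows the potential of training-free diffusion guidance for complex tasks.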
Abstract: Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
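To make the cross-window scale discrepancy concrete, the sketch below fits a least-squares scale and shift that maps one window's depth values onto another's over their shared frames. This is a common post-hoc alignment baseline and not DepthSync's method, which applies guidance during denoising; the function name and sample values are assumptions for illustration.

```python
def fit_scale_shift(reference, source):
    """Least-squares (scale, shift) so that scale * source + shift ~ reference.

    Both inputs are equal-length lists of depth values taken from the frames
    that two sliding windows have in common.
    """
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_src = sum(source) / n
    var_src = sum((s - mean_src) ** 2 for s in source)
    cov = sum((s - mean_src) * (r - mean_ref) for s, r in zip(source, reference))
    scale = cov / var_src
    shift = mean_ref - scale * mean_src
    return scale, shift

# Hypothetical overlap: the second window's depths are off by a scale and shift.
window_b = [1.0, 2.0, 3.0, 4.0]
window_a = [2.5, 4.5, 6.5, 8.5]
scale, shift = fit_scale_shift(window_a, window_b)
```

Chaining such pairwise fits over many windows lets small estimation errors compound, which is one way to picture the accumulated discrepancy the abstract describes for long videos.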
[61] Survivability of Backdoor Attacks on Unconstrained Face Recognition Systems
Quentin Le Roux,Yannick Teglia,Teddy Furon,Philippe Loubet-Moundi,Eric Bourbao
Main category: cs.CV
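TL;DR: Presents the first system-level study of backdoor attacks on deep learning face recognition systems, introduces two new attacks on the face detection task, shows that feature extractors are also vulnerable, and offers countermeasures to stakeholders.
Details
Motivation: The widespread use of deep learning face recognition raises security concerns, yet backdoor attacks on real-life, unconstrained systems have received little attention. The paper aims to fill this gap. Contribution: 1. The first system-level study of backdoors in face recognition systems; 2. Two new backdoor attacks on face detection (face generation and face landmark shift); 3. Evidence that feature extractors trained with large margin losses are vulnerable; 4. Experiments demonstrating the threat backdoors pose to the pipeline as a whole.
Method: Evaluates 20 pipeline configurations and 15 attack cases to test the feasibility of backdoor attacks on real-world unconstrained systems.
Result: A single backdoor enables an attacker to bypass the entire function of a system.
Insight: The paper exposes a security blind spot in deep learning face recognition systems and provides best practices and countermeasures against backdoors.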
Abstract: The widespread use of deep learning face recognition raises several security concerns. Although prior works point at existing vulnerabilities, DNN backdoor attacks against real-life, unconstrained systems dealing with images captured in the wild remain a blind spot of the literature. This paper conducts the first system-level study of backdoors in deep learning-based face recognition systems. This paper yields four contributions by exploring the feasibility of DNN backdoors on these pipelines in a holistic fashion. We demonstrate for the first time two backdoor attacks on the face detection task: face generation and face landmark shift attacks. We then show that face feature extractors trained with large margin losses also fall victim to backdoor attacks. Combining our models, we then show using 20 possible pipeline configurations and 15 attack cases that a single backdoor enables an attacker to bypass the entire function of a system. Finally, we provide stakeholders with several best practices and countermeasures.
[62] Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference
Xu Zhang,Ming Lu,Yan Chen,Zhan Ma
Main category: cs.CV
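TL;DR: Proposes Perception-Oriented Latent Coding (POLC), which enriches the semantic content of latent features to improve compressed-domain semantic inference while reducing fine-tuning overhead.
Details
Motivation: Compressed-domain semantic inference has relied mainly on learned image coding models optimized for MSE, whose latent spaces have limited semantic richness and which often require compute-intensive fine-tuning of the entire vision model. Contribution: Proposes POLC, which enriches the semantics of the latent space through perception-oriented coding and needs only a lightweight plug-and-play adapter for fine-tuning, greatly reducing overhead.
Method: POLC designs the latent coding around a perception-oriented objective, yielding a semantically rich latent space, and fine-tunes efficiently through a plug-and-play adapter.
Result: POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly improving performance on vision tasks.
Insight: Perception-oriented latent coding can markedly improve semantic inference without sacrificing compression efficiency, suggesting a new approach to efficient vision tasks.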
Abstract: In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at https://github.com/NJUVISION/POLC.
[63] Depth Anything at Any Condition
Boyuan Sun,Modi Jin,Bowen Yin,Qibin Hou
Main category: cs.CV
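TL;DR: DepthAnything-AC is a foundation monocular depth estimation model that handles diverse environmental conditions, improved through an unsupervised consistency regularization finetuning paradigm that uses a small amount of unlabeled data.
Details
Motivation: Existing foundation monocular depth estimation models perform poorly in complex open-world environments (illumination variations, adverse weather, sensor-induced distortions), mainly because of data scarcity and the inability to generate high-quality pseudo-labels from corrupted images. Contribution: Proposes an unsupervised consistency regularization finetuning paradigm that needs only a small amount of unlabeled data, and introduces a Spatial Distance Constraint for learning patch-level relative relationships, giving clearer semantic boundaries and more accurate details.
Method: Applies unsupervised consistency regularization finetuning together with the Spatial Distance Constraint, which explicitly enforces learning of patch-level relative relationships.
Result: Demonstrates zero-shot capability across diverse benchmarks, including real-world adverse weather, synthetic corruption, and general benchmarks.
Insight: With unsupervised learning and spatial constraints, the model still produces high-quality depth estimates under data scarcity and challenging conditions, showing strong generalization.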
Abstract: We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but do not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability to generate high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks. Project Page: https://ghost233lism.github.io/depthanything-AC-page Code: https://github.com/HVision-NKU/DepthAnythingAC
[64] ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving
Kai Chen,Ruiyuan Gao,Lanqing Hong,Hang Xu,Xu Jia,Holger Caesar,Dengxin Dai,Bingbing Liu,Dzmitry Tsishkou,Songcen Xu,Chunjing Xu,Qiang Xu,Huchuan Lu,Dit-Yan Yeung
Main category: cs.CV
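TL;DR: Presents the 1st W-CODA workshop at ECCV 2024, which focuses on addressing autonomous driving corner cases through multimodal perception and comprehension.
Details
Motivation: Autonomous driving systems handle corner cases poorly and need more advanced solutions to improve robustness and reliability. Contribution: 1. Organized the first W-CODA workshop, inviting experts from academia and industry to share their latest progress; 2. Held a dual-track challenge covering corner case scene understanding and generation.
Method: Uses the workshop format, combining invited talks, collected research papers, and a challenge, to advance multimodal perception and comprehension techniques.
Result: The workshop brought together frontier research and practical results, offering a range of approaches to corner cases in autonomous driving.
Insight: Corner cases are a key difficulty for autonomous driving, and multimodal perception and comprehension may substantially improve how systems handle them.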
Abstract: In this paper, we present details of the 1st W-CODA workshop, held in conjunction with ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. Five speakers from both academia and industry are invited to share their latest progress and opinions. We collect research papers and hold a dual-track challenge, including both corner case scene understanding and generation. As a pioneering effort, we will continue to bridge the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving agents robust towards corner cases.
[65] SAILViT: Towards Robust and Generalizable Visual Backbones for MLLMs via Gradual Feature Refinement
Weijie Yin,Dingkang Yang,Hongyuan Dong,Zijian Kang,Jiacong Wang,Xiao Liang,Chao Feng,Jiao Ran
Main category: cs.CV
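TL;DR: SAILViT is a Vision Transformer (ViT) with gradual feature refinement that strengthens the visual comprehension of multimodal large language models (MLLMs), using staged feature alignment and knowledge infusion to resolve the conflicts and semantic gaps that arise when ViTs are co-trained with LLMs.
Details
Motivation: Existing ViTs perform well through image-text contrastive learning or self-supervised mechanisms, but co-training them with LLMs suffers from parameter initialization conflicts and modality semantic gaps. SAILViT aims to solve these problems and improve MLLM performance in complex multimodal interactions. Contribution: Proposes SAILViT, a ViT with gradual feature refinement that achieves coarse-to-fine feature alignment and world knowledge infusion, improving MLLM performance, and verifies its robustness and generalizability across parameter sizes, architectures, training strategies, and data scales.
Method: A gradual feature refinement strategy performs feature alignment and knowledge infusion in stages to suit MLLM training demands. Effectiveness is evaluated on the OpenCompass benchmark.
Result: SAILViT yields significant and consistent gains for MLLMs on downstream tasks and shows strong robustness and generalizability across dimensions such as parameter size and architecture.
Insight: Gradual feature refinement is an effective way to bridge the semantic gap between ViTs and LLMs, suggesting a new approach to co-training multimodal models.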
Abstract: Vision Transformers (ViTs) are essential as foundation backbones in establishing the visual comprehension capabilities of Multimodal Large Language Models (MLLMs). Although most ViTs achieve impressive performance through image-text pair-based contrastive learning or self-supervised mechanisms, they struggle to engage in connector-based co-training directly with LLMs due to potential parameter initialization conflicts and modality semantic gaps. To address the above challenges, this paper proposes SAILViT, a gradual feature learning-enhanced ViT for facilitating MLLMs to break through performance bottlenecks in complex multimodal interactions. SAILViT achieves coarse-to-fine-grained feature alignment and world knowledge infusion with gradual feature refinement, which better serves target training demands. We perform thorough empirical analyses to confirm the powerful robustness and generalizability of SAILViT across different dimensions, including parameter sizes, model architectures, training strategies, and data scales. Equipped with SAILViT, existing MLLMs show significant and consistent performance improvements on the OpenCompass benchmark across extensive downstream tasks. SAILViT series models are released at https://huggingface.co/BytedanceDouyinContent.
[66] SPoT: Subpixel Placement of Tokens in Vision Transformers
Martine Hjelkrem-Tan,Marius Aasan,Gabriel Y. Arteaga,Adín Ramírez Rivera
Main category: cs.CV
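TL;DR: SPoT is a new tokenization strategy for Vision Transformers (ViTs) that places tokens continuously within the image, avoiding grid-based limitations and exploiting sparsity more effectively.
Details
Motivation: Standard tokenization confines features to discrete patch grids, which limits what models can do in sparse regimes. SPoT aims to exploit sparsity better through continuous token placement. Contribution: Proposes SPoT (Subpixel Placement of Tokens), a strategy that places tokens continuously and drastically reduces the number of tokens needed at inference.
Method: Uses an oracle-guided search to explore ideal subpixel token positions, revealing the performance gains achievable with fewer tokens.
Result: Experiments show that SPoT keeps accuracy high while using fewer tokens, pointing toward more efficient ViT designs.
Insight: SPoT treats sparsity as a strategic advantage rather than a limitation, opening new options for flexible and efficient ViTs.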
Abstract: Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
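Placing a token at a continuous position means reading image values at non-integer coordinates. One common way to do that is bilinear interpolation, sketched below for a single-channel image stored as a list of rows. Whether SPoT samples features exactly this way is not stated in the abstract, so treat this as an assumption; the function name and toy image are illustrative.

```python
import math

def bilinear_sample(image, x, y):
    """Sample a 2D grid (list of rows) at continuous pixel coordinates (x, y)."""
    height, width = len(image), len(image[0])
    # Clamp to the valid coordinate range so border samples stay defined.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom

image = [[0.0, 10.0],
         [20.0, 30.0]]
center_value = bilinear_sample(image, 0.5, 0.5)  # midpoint of the four pixels
```

Because the sampled value varies smoothly with (x, y), token positions of this kind can in principle be searched or optimized rather than fixed to a patch grid.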
[67] What does really matter in image goal navigation?
Gianluca Monaci,Philippe Weinzaepfel,Christian Wolf
Main category: cs.CV
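TL;DR: Investigates whether image goal navigation can be solved with end-to-end reinforcement learning, analyzes the effect of architectural choices and simulator settings, and finds evidence that navigation performance correlates with relative pose estimation ability.
Details
Motivation: To study whether image goal navigation can be solved with end-to-end reinforcement learning, which would show that relative pose estimation can be learned from the navigation task alone. Contribution: 1. Examines the potential of end-to-end reinforcement learning for image goal navigation; 2. Analyzes how architectural choices affect navigation performance; 3. Reveals the influence of simulator settings on results; 4. Shows a correlation between navigation performance and relative pose estimation ability.
Method: A large-scale study of architectural choices (late fusion, channel stacking, space-to-depth projections, and cross-attention), training full agents with reinforcement learning.
Result: Simulator settings can lead to shortcut learning, but some of the learned capabilities transfer to more realistic settings; navigation performance correlates with relative pose estimation performance.
Insight: End-to-end reinforcement learning can be used for image goal navigation, but simulator settings may bias results; gains in navigation may depend on the emergence of sub-skills such as relative pose estimation.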
Abstract: Image goal navigation requires two different skills: firstly, core navigation skills, including the detection of free space and obstacles, and taking decisions based on an internal representation; and secondly, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods either rely on dedicated image-matching, or pre-training of computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved with end-to-end training of full agents with RL, as has been claimed by recent work. A positive answer would have impact beyond Embodied AI and allow training of relative pose estimation from reward for navigation alone. In a large study we investigate the effect of architectural choices like late fusion, channel stacking, space-to-depth projections and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced to a certain extent by simulator settings, leading to shortcuts in simulation. However, we also show that these capabilities can be transferred to more realistic settings, to some extent. We also find evidence for correlations between navigation performance and probed (emerging) relative pose estimation performance, an important sub-skill.
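Two of the architectural choices named above, channel stacking and space-to-depth projection, are simple tensor rearrangements. The sketch below shows both on images stored as nested lists of shape H x W x C. It illustrates the generic operations only; the paper's exact layouts and block sizes are not given in the abstract, and the function names are hypothetical.

```python
def channel_stack(observation, goal):
    """Concatenate two H x W x C images along the channel axis."""
    return [
        [obs_px + goal_px for obs_px, goal_px in zip(obs_row, goal_row)]
        for obs_row, goal_row in zip(observation, goal)
    ]

def space_to_depth(image, block=2):
    """Fold each block x block spatial patch into the channel axis.

    H x W x C becomes (H / block) x (W / block) x (C * block * block).
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image size must be divisible by block")
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            cell = []
            # Visit the patch in row-major order and append its channels.
            for di in range(block):
                for dj in range(block):
                    cell.extend(image[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out

observation = [[[1], [2]], [[3], [4]]]   # toy 2 x 2 x 1 image
goal = [[[9], [8]], [[7], [6]]]
stacked = channel_stack(observation, goal)   # 2 x 2 x 2
folded = space_to_depth(observation)         # 1 x 1 x 4
```

Channel stacking lets early layers compare the observation and the goal pixel by pixel, while space-to-depth trades spatial resolution for channels without discarding any values.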
[68] Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition
Muzammil Behzad
Main category: cs.CV
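TL;DR: Proposes FACET-VLM, a vision-language framework for 3D/4D facial expression recognition that improves performance through multiview fusion and text-based semantic guidance.
Details
Motivation: 3D/4D facial expression recognition is a challenging problem in affective computing, yet it matters for human behavior understanding, healthcare monitoring, and human-computer interaction. The paper addresses the complexity of multiview facial dynamics. Contribution: 1. The FACET-VLM framework, which combines multiview facial representation learning with semantic guidance from natural language prompts; 2. Three key components: CVSA, MTGF, and a multiview consistency loss; 3. State-of-the-art results on multiple benchmark datasets, with an extension to 4D micro-expression recognition.
Method: 1. CVSA performs view-consistent fusion; 2. MTGF aligns facial emotions with text guidance; 3. A multiview consistency loss enforces structural coherence across views.
Result: Achieves state-of-the-art performance on BU-3DFE, Bosphorus, BU-4DFE, and BP4D-Spontaneous, and shows strong micro-expression recognition on the 4DME dataset.
Insight: Multiview fusion that combines visual and language modalities can markedly improve facial expression recognition, especially in complex dynamic scenes.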
Abstract: Facial expression recognition (FER) in 3D and 4D domains presents a significant challenge in affective computing due to the complexity of spatial and temporal facial dynamics. Its success is crucial for advancing applications in human behavior understanding, healthcare monitoring, and human-computer interaction. In this work, we propose FACET-VLM, a vision-language framework for 3D/4D FER that integrates multiview facial representation learning with semantic guidance from natural language prompts. FACET-VLM introduces three key components: Cross-View Semantic Aggregation (CVSA) for view-consistent fusion, Multiview Text-Guided Fusion (MTGF) for semantically aligned facial emotions, and a multiview consistency loss to enforce structural coherence across views. Our model achieves state-of-the-art accuracy across multiple benchmarks, including BU-3DFE, Bosphorus, BU-4DFE, and BP4D-Spontaneous. We further extend FACET-VLM to 4D micro-expression recognition (MER) on the 4DME dataset, demonstrating strong performance in capturing subtle, short-lived emotional cues. The extensive experimental results confirm the effectiveness and substantial contributions of each individual component within the framework. Overall, FACET-VLM offers a robust, extensible, and high-performing solution for multimodal FER in both posed and spontaneous settings.
[69] Using Wavelet Domain Fingerprints to Improve Source Camera Identification
Xinle Tian,Matthew Nunes,Emiko Dupont,Shaunagh Downing,Freddie Lichtenstein,Matt Burns
Main category: cs.CV
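TL;DR: The paper proposes a wavelet-domain fingerprint extraction method for source camera identification that avoids the inversion step of conventional methods and improves both detection accuracy and speed.
Details
Motivation: Conventional sensor pattern noise (SPN) extraction constructs the fingerprint as an image, which requires a final inversion step and makes the procedure more complex and less efficient. Contribution: Introduces the notion of a wavelet domain fingerprint, so that fingerprints are compared directly in the wavelet domain, which streamlines extraction and comparison.
Method: Modifies wavelet-based SPN extraction to keep the features in the wavelet domain rather than converting them back to an image, thereby avoiding the inversion step.
Result: Experiments on real-world datasets show higher detection accuracy and significantly faster processing.
Insight: Operating directly in the wavelet domain reduces computational complexity while preserving, or even improving, the discriminative power of the features.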
Abstract: Camera fingerprint detection plays a crucial role in source identification and image forensics, with wavelet denoising approaches proving to be particularly effective in extracting sensor pattern noise (SPN). In this article, we propose a modification to wavelet-based SPN extraction. Rather than constructing the fingerprint as an image, we introduce the notion of a wavelet domain fingerprint. This avoids the final inversion step of the denoising algorithm and allows fingerprint comparisons to be made directly in the wavelet domain. As such, our modification streamlines the extraction and comparison process. Experimental results on real-world datasets demonstrate that our method not only achieves higher detection accuracy but can also significantly improve processing speed.
[70] Soft Self-labeling and Potts Relaxations for Weakly-Supervised Segmentation
Zhongwen Zhang,Yuri Boykov
Main category: cs.CV
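TL;DR: The paper proposes a soft self-labeling method that optimizes relaxations of the CRF/Potts loss on unlabeled pixels, improving weakly supervised segmentation and outperforming more complex specialized systems while using only standard architectures.
Details
Motivation: Conventional weakly supervised segmentation methods rely on hard pseudo-labels, which cannot represent class uncertainty or errors and therefore limit performance. Soft self-labeling handles these issues better. Contribution: 1. Proposes a soft self-labeling method; 2. Systematically evaluates CRF relaxations (convex and non-convex); 3. Proposes a general continuous sub-problem solver.
Method: 1. Performs self-labeling by optimizing relaxations of the CRF/Potts loss; 2. Introduces soft pseudo-labels to represent class uncertainty; 3. Uses a continuous sub-problem solver.
Result: With standard architectures, soft self-labeling performs strongly and can even outperform full pixel-precise supervision.
Insight: Combining soft pseudo-labels with flexible CRF relaxations can substantially improve weakly supervised segmentation, and the approach is general.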
Abstract: We consider weakly supervised segmentation where only a fraction of pixels have ground truth labels (scribbles) and focus on a self-labeling approach optimizing relaxations of the standard unsupervised CRF/Potts loss on unlabeled pixels. While weakly supervised semantic segmentation (WSSS) methods can directly optimize such losses via gradient descent, prior work suggests that higher-order optimization can improve network training by introducing hidden pseudo-labels and powerful CRF sub-problem solvers, e.g. graph cut. However, previously used hard pseudo-labels cannot represent class uncertainty or errors, which motivates soft self-labeling. We derive a principled auxiliary loss and systematically evaluate standard and new CRF relaxations (convex and non-convex), neighborhood systems, and terms connecting network predictions with soft pseudo-labels. We also propose a general continuous sub-problem solver. Using only standard architectures, soft self-labeling consistently improves scribble-based training and outperforms significantly more complex specialized WSSS systems. It can outperform full pixel-precise supervision. Our general ideas apply to other weakly-supervised problems/systems.
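To make the idea of relaxing the Potts loss concrete, the following minimal Python sketch compares two textbook relaxations on soft pseudo-labels: a bilinear form and a quadratic form. Both reduce to the discrete Potts penalty on one-hot labels but behave differently on uncertain ones. The function names and the 1D pixel chain are illustrative assumptions, not the authors' implementation, and the abstract does not state which relaxations the paper ends up favoring.

```python
from typing import Callable, List, Sequence

Dist = Sequence[float]  # a soft label: class probabilities for one pixel


def bilinear_potts(p: Dist, q: Dist) -> float:
    """Bilinear relaxation 1 - <p, q> (not jointly convex in p and q)."""
    return 1.0 - sum(a * b for a, b in zip(p, q))


def quadratic_potts(p: Dist, q: Dist) -> float:
    """Quadratic relaxation 0.5 * ||p - q||^2 (convex)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(p, q))


def chain_loss(labels: List[Dist], pairwise: Callable[[Dist, Dist], float]) -> float:
    """Sum a pairwise term over neighbouring pixels of a 1D chain (toy neighbourhood)."""
    return sum(pairwise(labels[i], labels[i + 1]) for i in range(len(labels) - 1))


# Hypothetical two-class examples on three pixels.
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one-hot labels with one label change
soft = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # fully uncertain but identical labels

hard_bilinear = chain_loss(hard, bilinear_potts)
hard_quadratic = chain_loss(hard, quadratic_potts)
soft_bilinear = chain_loss(soft, bilinear_potts)
soft_quadratic = chain_loss(soft, quadratic_potts)
```

On the one-hot labels both relaxations count the single label change. On identical but uncertain labels the quadratic term vanishes while the bilinear term does not, so the two relaxations push soft pseudo-labels in different directions.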
[71] When Does Pruning Benefit Vision Representations?
Enrico Cassano,Riccardo Renzulli,Andrea Bragagnolo,Marco Grangetto
Main category: cs.CV
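TL;DR: This paper studies how pruning affects vision models along three key dimensions: interpretability, unsupervised object discovery, and alignment with human perception. It reveals "sweet spots", specific sparsity levels at which models perform best, and shows that this effect depends strongly on network architecture and parameter count.
Details
Motivation: Pruning is widely used to reduce the complexity of deep learning models, but its effects on representation learning, interpretability, and alignment with human perception remain unclear; the paper sets out to investigate these relationships. Contribution: 1. Analyzes the combined effect of pruning on vision models across three dimensions; 2. Identifies sparsity sweet spots; 3. Shows that network architecture and parameter count play a key role in how well pruning works.
Method: Experimentally analyzes, at different sparsity levels, the networks' interpretability (e.g., with feature attribution methods), their unsupervised object discovery ability, and their degree of alignment with human perception.
Result: Under specific conditions (the sweet spots), sparse models exhibit higher interpretability, downstream generalization, and human alignment, but the effect varies with architecture and parameter count.
Insight: The relationship between pruning and the quality of vision representations is complex; the sparsity level must be weighed against the specific network design and application scenario.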
Abstract: Pruning is widely used to reduce the complexity of deep learning models, but its effects on interpretability and representation learning remain poorly understood. This paper investigates how pruning influences vision models across three key dimensions: (i) interpretability, (ii) unsupervised object discovery, and (iii) alignment with human perception. We first analyze different vision network architectures to examine how varying sparsity levels affect feature attribution interpretability methods. Additionally, we explore whether pruning promotes more succinct and structured representations, potentially improving unsupervised object discovery by discarding redundant information while preserving essential features. Finally, we assess whether pruning enhances the alignment between model representations and human perception, investigating whether sparser models focus on more discriminative features similarly to humans. Our findings also reveal the presence of sweet spots, where sparse models exhibit higher interpretability, downstream generalization and human alignment. However, these spots highly depend on the network architectures and their size in terms of trainable parameters. Our results suggest a complex interplay between these three dimensions, highlighting the importance of investigating when and how pruning benefits vision representations.
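For readers unfamiliar with how a sparsity level is imposed on a network, here is a minimal sketch of magnitude pruning on a flat list of weights. The abstract does not say which pruning criterion the authors use, so magnitude pruning is an assumption chosen because it is a common baseline; `magnitude_prune` and the example weights are hypothetical.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    num_to_drop = int(round(sparsity * len(weights)))
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:num_to_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def measured_sparsity(weights: List[float]) -> float:
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


example = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2]
pruned = magnitude_prune(example, sparsity=0.5)
```

Sweeping `sparsity` over a range and evaluating the pruned model at each value is the kind of experiment in which a sweet spot would show up as a peak in the chosen metric.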
[72] DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
Ming Dai,Wenxuan Cheng,Jiang-jiang Liu,Sen Yang,Wenxiao Cai,Yanpeng Sun,Wankou Yang
Main category: cs.CV
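TL;DR: DeRIS is a novel framework that improves Referring Image Segmentation (RIS) by decoupling perception and cognition, and it shows that the main bottleneck of current RIS lies in insufficient multi-modal cognitive capacity. With a Loopback Synergy mechanism and a data augmentation method, DeRIS improves performance significantly and adapts naturally to non-referent and multi-referent scenarios.
Details
Motivation: Existing RIS research focuses mainly on vision-language interaction and fine-grained localization, while a systematic analysis of the bottlenecks in RIS frameworks is lacking. DeRIS aims to analyze and address the core limitations of RIS by decoupling perception and cognition. Contribution: 1. Proposes DeRIS, a framework that modularly decouples perception and cognition; 2. Shows that the core bottleneck of current RIS is insufficient cognitive capacity; 3. Introduces a Loopback Synergy mechanism to improve multi-modal synergy; 4. Proposes a data augmentation method targeting the long-tail distribution.
Method: 1. Decomposes the RIS task into a perception module and a cognition module; 2. Designs a Loopback Synergy mechanism that strengthens the synergy between the modules; 3. Introduces non-referent sample conversion data augmentation to address the long-tail distribution issue.
Result: DeRIS performs well in both precise segmentation and image-text comprehension, and adapts to non-referent and multi-referent scenarios without specialized modifications. Code and models are open source.
Insight: Decoupling perception and cognition not only supports a systematic analysis of bottlenecks but also improves performance markedly once the synergy between the modules is strengthened. Data augmentation, in turn, effectively alleviates the long-tail distribution issue.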
Abstract: Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non-referent and multi-referent scenarios without requiring specialized architectural modifications, enhancing its general applicability. The codes and models are available at https://github.com/Dmmm1997/DeRIS.
[73] Calibrated Self-supervised Vision Transformers Improve Intracranial Arterial Calcification Segmentation from Clinical CT Head Scans
Benjamin Jin,Grant Mair,Joanna M. Wardlaw,Maria del C. Valdés Hernández
Main category: cs.CV
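TL;DR: The paper proposes a calibrated self-supervised ViT model for automatic segmentation of intracranial arterial calcification (IAC) in clinical CT head scans, which clearly outperforms a conventional supervised approach in both performance and clinical utility.
Details
Motivation: Intracranial arterial calcification (IAC) is an imaging biomarker linked to neurovascular diseases such as stroke and dementia, and automated IAC quantification could support large-scale risk assessment. 3D Vision Transformers (ViTs) have been less successful in 3D medical image segmentation, but self-supervised training (e.g., within the MAE framework) reduces the dependence on expensive manual annotations. Contribution: 1. Applies MAE-pretrained ViTs to IAC segmentation for the first time, beating a supervised nnU-Net baseline by 3.2 Dice points; 2. Finds that small patch sizes and interpolation upsampling are crucial for ViT segmentation quality; 3. The model is more robust to higher slice thicknesses and improves clinical risk group classification by 46%.
Method: Pre-trains ViTs in a self-supervised manner with the MAE framework, then fine-tunes them for IAC segmentation on heterogeneous data from the IST-3 clinical trial. The effects of patch size and upsampling method on performance are analyzed.
Result: The calibrated self-supervised ViT beats the baseline by 3.2 Dice points, is more stable at higher slice thicknesses, and improves risk group classification in a clinical scenario by 46%.
Insight: Self-supervised ViTs show promise for medical image segmentation, especially for tasks where annotated data is scarce; small patches and interpolation upsampling are key factors in optimizing ViT segmentation performance.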
Abstract: Vision Transformers (ViTs) have gained significant popularity in the natural image domain but have been less successful in 3D medical image segmentation. Nevertheless, 3D ViTs are particularly interesting for large medical imaging volumes due to their efficient self-supervised training within the masked autoencoder (MAE) framework, which enables the use of imaging data without the need for expensive manual annotations. Intracranial arterial calcification (IAC) is an imaging biomarker visible on routinely acquired CT scans linked to neurovascular diseases such as stroke and dementia, and automated IAC quantification could enable their large-scale risk assessment. We pre-train ViTs with MAE and fine-tune them for IAC segmentation for the first time. To develop our models, we use highly heterogeneous data from a large clinical trial, the third International Stroke Trial (IST-3). We evaluate key aspects of MAE pre-trained ViTs in IAC segmentation, and analyse the clinical implications. We show: 1) our calibrated self-supervised ViT beats a strong supervised nnU-Net baseline by 3.2 Dice points, 2) low patch sizes are crucial for ViTs for IAC segmentation and interpolation upsampling with regular convolutions is preferable to transposed convolutions for ViT-based models, and 3) our ViTs increase robustness to higher slice thicknesses and improve risk group classification in a clinical scenario by 46%. Our code is available online.
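Since the headline result is stated in Dice points, a short sketch of the Dice coefficient on binary masks may help. It assumes flattened 0/1 masks and reads "Dice points" as percentage points on a 0 to 100 scale, which is the usual convention but is not spelled out in the abstract; `dice_score` and the toy masks are hypothetical.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty: treat as perfect agreement by convention.
        return 1.0
    return 2.0 * intersection / total


# Toy 6-voxel example: 3 predicted voxels, 3 true voxels, 2 in common.
predicted_mask = [1, 1, 0, 0, 1, 0]
reference_mask = [1, 0, 0, 0, 1, 1]
score = dice_score(predicted_mask, reference_mask)
score_in_points = 100.0 * score
```

Under this reading, a gain of 3.2 Dice points corresponds to a difference of 0.032 in the coefficient returned by `dice_score`.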
[74] SSL4SAR: Self-Supervised Learning for Glacier Calving Front Extraction from SAR Imagery
Nora Gourmelon,Marcel Dreier,Martin Mayr,Thorsten Seehaus,Dakota Pyles,Matthias Braun,Andreas Maier,Vincent Christlein
Main category: cs.CV
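TL;DR: The paper proposes SSL4SAR, a self-supervised learning approach for extracting glacier calving fronts from SAR imagery; together with a new dataset and a hybrid model architecture, it improves performance significantly.
Details
Motivation: Glacier ice mass loss is accelerating, which makes accurate monitoring of calving fronts urgent. Existing ImageNet-pretrained models suffer from the domain shift between natural images and remote sensing imagery. Contribution: 1. Proposes two novel self-supervised multimodal pretraining techniques; 2. Introduces SSL4SAR, a new unlabeled dataset; 3. Designs a hybrid architecture combining a Swin Transformer encoder with a CNN decoder.
Method: Pretrains the model with self-supervised learning on multimodal optical and SAR imagery; uses a hybrid architecture with a Swin Transformer encoder and a residual CNN decoder.
Result: On the CaFFe benchmark dataset, the model achieves a mean distance error of 293 m, outperforming the prior best model by 67 m; on a multi-annotator study of the benchmark, an ensemble reaches 75 m, approaching human performance (38 m).
Insight: Self-supervised learning effectively mitigates the domain shift; the hybrid architecture combines global and local feature extraction and markedly improves the accuracy of calving front detection.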
Abstract: Glaciers are losing ice mass at unprecedented rates, increasing the need for accurate, year-round monitoring to understand frontal ablation, particularly the factors driving the calving process. Deep learning models can extract calving front positions from Synthetic Aperture Radar imagery to track seasonal ice losses at the calving fronts of marine- and lake-terminating glaciers. The current state-of-the-art model relies on ImageNet-pretrained weights. However, they are suboptimal due to the domain shift between the natural images in ImageNet and the specialized characteristics of remote sensing imagery, in particular for Synthetic Aperture Radar imagery. To address this challenge, we propose two novel self-supervised multimodal pretraining techniques that leverage SSL4SAR, a new unlabeled dataset comprising 9,563 Sentinel-1 and 14 Sentinel-2 images of Arctic glaciers, with one optical image per glacier in the dataset. Additionally, we introduce a novel hybrid model architecture that combines a Swin Transformer encoder with a residual Convolutional Neural Network (CNN) decoder. When pretrained on SSL4SAR, this model achieves a mean distance error of 293 m on the “CAlving Fronts and where to Find thEm” (CaFFe) benchmark dataset, outperforming the prior best model by 67 m. Evaluating an ensemble of the proposed model on a multi-annotator study of the benchmark dataset reveals a mean distance error of 75 m, approaching the human performance of 38 m. This advancement enables precise monitoring of seasonal changes in glacier calving fronts.
[75] Are Vision Transformer Representations Semantically Meaningful? A Case Study in Medical Imaging
Montasir Shams,Chashi Mahiul Islam,Shaeke Salman,Phat Tran,Xiuwen Liu
Main category: cs.CV
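TL;DR: Using a projected gradient-based method, this paper finds that vision transformer (ViT) representations of medical images lack semantic meaning and are vulnerable to small changes, which leads to unreliable classification results.
Details
Motivation: To examine whether ViT representations in medical imaging tasks are semantically meaningful and to expose their potential fragility. Contribution: The first work to systematically demonstrate the lack of semantic meaningfulness in ViT representations for medical image classification, highlighting a challenge for deployment in safety-critical systems.
Method: Analyzes ViT representations with a projected gradient-based algorithm and studies how small changes affect representations and classification results.
Result: ViTs are highly sensitive to small changes: imperceptible image differences can reduce classification accuracy by over 60%.
Insight: ViT representations lack semantic meaning, which may limit their practical use in safety-critical domains such as medicine.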
Abstract: Vision transformers (ViTs) have rapidly gained prominence in medical imaging tasks such as disease classification, segmentation, and detection due to their superior accuracy compared to conventional deep learning models. However, due to their size and complex interactions via the self-attention mechanism, they are not well understood. In particular, it is unclear whether the representations produced by such models are semantically meaningful. In this paper, using a projected gradient-based algorithm, we show that their representations are not semantically meaningful and they are inherently vulnerable to small changes. Images with imperceptible differences can have very different representations; on the other hand, images that should belong to different semantic classes can have nearly identical representations. Such vulnerability can lead to unreliable classification results; for example, unnoticeable changes cause the classification accuracy to be reduced by over 60%. To the best of our knowledge, this is the first work to systematically demonstrate this fundamental lack of semantic meaningfulness in ViT representations for medical image classification, revealing a critical challenge for their deployment in safety-critical systems.
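The general shape of a projected gradient procedure can be sketched in a few lines. The example below perturbs an input, within an L-infinity budget `eps`, so that its representation moves toward that of a different input. A fixed linear map stands in for the ViT encoder, and all names and numbers are hypothetical; this illustrates the generic technique rather than the authors' algorithm.

```python
from typing import List

Matrix = List[List[float]]


def matvec(W: Matrix, x: List[float]) -> List[float]:
    """Toy 'encoder': a linear map standing in for a real network."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sq_dist(a: List[float], b: List[float]) -> float:
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def pgd_match(W: Matrix, x: List[float], target_rep: List[float],
              eps: float, step: float, iters: int) -> List[float]:
    """Find delta with max|delta| <= eps that pulls W(x + delta) toward target_rep."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        resid = [r - t for r, t in zip(matvec(W, x_adv), target_rep)]
        # Gradient of ||W x_adv - target_rep||^2 with respect to the input.
        grad = [2.0 * sum(W[r][c] * resid[r] for r in range(len(W)))
                for c in range(len(x))]
        # Signed descent step, then projection onto the L-infinity ball.
        delta = [max(-eps, min(eps, d - step * sign(g)))
                 for d, g in zip(delta, grad)]
    return delta


W = [[1.0, 2.0], [3.0, -1.0]]
x = [0.5, 0.5]
other = [0.6, 0.4]                      # an input with a different representation
target_rep = matvec(W, other)
eps = 0.05
before = sq_dist(matvec(W, x), target_rep)
delta = pgd_match(W, x, target_rep, eps=eps, step=0.01, iters=20)
after = sq_dist(matvec(W, [xi + di for xi, di in zip(x, delta)]), target_rep)
```

The projection step is what keeps the perturbation small; in an image setting, `eps` would be chosen so that the change stays imperceptible.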
[76] FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization
Peng Zheng,Ye Wang,Rui Ma,Zuxuan Wu
Main category: cs.CV
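TL;DR: FreeLoRA is a training-free framework that achieves multi-subject personalized generation by fusing subject-specific LoRA modules, avoiding the complex re-tuning that existing methods need when combining independently adapted modules.
Details
Motivation: Existing subject-driven generation methods struggle with multi-subject personalization and often require complex joint optimization or re-tuning; FreeLoRA aims to fuse multiple subject modules without training. Contribution: Proposes FreeLoRA, a training-free LoRA fusion framework that achieves multi-subject personalized generation through Full Token Tuning and Subject-Aware Inference.
Method: Each subject's LoRA module is adapted with a Full Token Tuning strategy; at inference, Subject-Aware Inference activates each module only on its corresponding subject tokens.
Result: Experiments show that FreeLoRA performs strongly in both subject fidelity and prompt consistency.
Insight: By separating module adaptation from activation at inference, FreeLoRA mitigates mutual interference between subjects and overfitting, offering a simple and effective solution for multi-subject generation.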
Abstract: Subject-driven image generation plays a crucial role in applications such as virtual try-on and poster design. Existing approaches typically fine-tune pretrained generative models or apply LoRA-based adaptations for individual subjects. However, these methods struggle with multi-subject personalization, as combining independently adapted modules often requires complex re-tuning or joint optimization. We present FreeLoRA, a simple and generalizable framework that enables training-free fusion of subject-specific LoRA modules for multi-subject personalization. Each LoRA module is adapted on a few images of a specific subject using a Full Token Tuning strategy, where it is applied across all tokens in the prompt to encourage weakly supervised token-content alignment. At inference, we adopt Subject-Aware Inference, activating each module only on its corresponding subject tokens. This enables training-free fusion of multiple personalized subjects within a single image, while mitigating overfitting and mutual interference between subjects. Extensive experiments show that FreeLoRA achieves strong performance in both subject fidelity and prompt consistency.
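The inference-time idea, activating each LoRA module only on its own subject's tokens, can be sketched with plain lists. The sketch assumes a single linear layer with frozen weight `W`, one low-rank pair `(A, B)` per subject, and a per-token subject tag; it omits the LoRA scaling factor and says nothing about which layers of the real model are adapted. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def subject_aware_forward(
    W: Matrix,
    loras: Dict[str, Tuple[Matrix, Matrix]],
    tokens: List[Vector],
    token_subjects: List[Optional[str]],
) -> List[Vector]:
    """Apply W to every token; add B @ A @ x only where the token belongs to a subject."""
    outputs = []
    for x, subject in zip(tokens, token_subjects):
        y = matvec(W, x)
        if subject is not None:
            A, B = loras[subject]           # A: rank x d_in, B: d_out x rank
            update = matvec(B, matvec(A, x))
            y = [yi + ui for yi, ui in zip(y, update)]
        outputs.append(y)
    return outputs


W = [[1.0, 0.0], [0.0, 1.0]]                 # frozen base weight (identity for clarity)
loras = {
    "dog": ([[1.0, 0.0]], [[1.0], [0.0]]),   # rank-1 module for a hypothetical subject
    "cat": ([[0.0, 1.0]], [[0.0], [2.0]]),
}
tokens = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
token_subjects = [None, "dog", "cat"]        # e.g. "a", "<dog>", "<cat>"
outputs = subject_aware_forward(W, loras, tokens, token_subjects)
```

Because each module touches only its own tokens, two independently adapted modules never modify the same token, which is one plausible reading of why no joint re-tuning is needed.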
[77] HCNQA: Enhancing 3D VQA with Hierarchical Concentration Narrowing Supervision
Shengli Zhou,Jianuo Zhu,Qilin Huang,Fangjing Wang,Yanfu Zhang,Feng Zheng
Main category: cs.CV
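TL;DR: This paper proposes HCNQA, a model that improves 3D visual question answering through hierarchical concentration narrowing supervision, mimicking the human process of gradually narrowing focus and avoiding superficial shortcut reasoning.
Details
Motivation: Existing 3D VQA models usually supervise only the final output, so a model may learn answers from superficial patterns without developing a rational reasoning pathway. In addition, although slow-thinking methods help large language models, they still suffer from underthinking. Contribution: Proposes HCNQA, which introduces a hierarchical concentration narrowing supervision method with three supervised phases (from a broad area to specific objects) to ensure that the model develops a rational and effective reasoning pathway.
Method: Hierarchical concentration narrowing supervision guides the model through three phases of progressively narrower focus, from a broad search area to specific objects. Key checkpoints are supervised to keep the reasoning pathway valid.
Result: Experiments show that HCNQA effectively guides the model toward a rational reasoning pathway and performs better on 3D VQA tasks.
Insight: Mimicking the human process of gradually narrowing focus can markedly improve a model's generalization and task performance while avoiding superficial shortcut learning.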
Abstract: 3D Visual Question-Answering (3D VQA) is pivotal for models to perceive the physical world and perform spatial reasoning. Answer-centric supervision is a commonly used training method for 3D VQA models. Many models that utilize this strategy have achieved promising results in 3D VQA tasks. However, the answer-centric approach only supervises the final output of models and allows models to develop reasoning pathways freely. The absence of supervision on the reasoning pathway enables the potential for developing superficial shortcuts through common patterns in question-answer pairs. Moreover, although slow-thinking methods advance large language models, they suffer from underthinking. To address these issues, we propose HCNQA, a 3D VQA model leveraging a hierarchical concentration narrowing supervision method. By mimicking the human process of gradually focusing from a broad area to specific objects while searching for answers, our method guides the model to perform three phases of concentration narrowing through hierarchical supervision. By supervising key checkpoints on a general reasoning pathway, our method can ensure the development of a rational and effective reasoning pathway. Extensive experimental results demonstrate that our method can effectively ensure that the model develops a rational reasoning pathway and performs better. The code is available at https://github.com/JianuoZhu/HCNQA.
[78] Modulate and Reconstruct: Learning Hyperspectral Imaging from Misaligned Smartphone Views
Daniil Reutsky,Daniil Vladimirov,Yasin Mamedov,Georgy Perevozchikov,Nancy Mehta,Egor Ershov,Radu Timofte
Main category: cs.CV
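TL;DR: The paper proposes a new framework for multi-image-to-hyperspectral reconstruction (MI-HSR) that uses a triple-camera smartphone system with carefully selected spectral filters to overcome the limitations of reconstructing hyperspectral data from a single RGB image.
Details
Motivation: Conventional hyperspectral reconstruction (HSR) relies on a single RGB image, and the severe loss of spectral information limits reconstruction accuracy. This work aims to improve accuracy by using multi-view, multi-spectral information. Contribution: 1) Proposes the first multi-image hyperspectral reconstruction (MI-HSR) framework, built on a triple-camera smartphone system; 2) Releases Doomer, the first MI-HSR dataset; 3) Shows experimentally that spectra are estimated 30% more accurately than with a single-camera system.
Method: 1) Two of the lenses are equipped with spectral filters to capture multi-spectral information; 2) The system configuration is designed on the basis of theoretical and empirical analysis; 3) A new HSR model handles the misalignment between views.
Result: On the Doomer dataset, the proposed HSR model outperforms existing methods, and spectral estimation accuracy improves by 30% over an ordinary RGB camera.
Insight: Multi-view spectral filtering with commodity hardware enables more practical and accurate hyperspectral imaging than conventional single-camera systems can offer.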
Abstract: Hyperspectral reconstruction (HSR) from RGB images is a fundamentally ill-posed problem due to severe spectral information loss. Existing approaches typically rely on a single RGB image, limiting reconstruction accuracy. In this work, we propose a novel multi-image-to-hyperspectral reconstruction (MI-HSR) framework that leverages a triple-camera smartphone system, where two lenses are equipped with carefully selected spectral filters. Our configuration, grounded in theoretical and empirical analysis, enables richer and more diverse spectral observations than conventional single-camera setups. To support this new paradigm, we introduce Doomer, the first dataset for MI-HSR, comprising aligned images from three smartphone cameras and a hyperspectral reference camera across diverse scenes. We show that the proposed HSR model achieves consistent improvements over existing methods on the newly proposed benchmark. In a nutshell, our setup yields spectra that are estimated 30% more accurately than with an ordinary RGB camera. Our findings suggest that multi-view spectral filtering with commodity hardware can unlock more accurate and practical hyperspectral imaging solutions.
[79] Future Slot Prediction for Unsupervised Object Discovery in Surgical Video
Guiqiu Liao,Matjaz Jogan,Marcel Hussing,Edward Zhang,Eric Eaton,Daniel A. Hashimoto
Main category: cs.CV
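TL;DR: The paper proposes a dynamic temporal slot transformer (DTST) module for unsupervised object discovery in surgical video, improving performance through temporal reasoning and prediction of the future slot initialization.
Details
Motivation: Scenes in surgical video are complex and heterogeneous. Existing approaches with an adaptive slot count perform well on images but poorly on surgical video, so a new method is needed. Contribution: Proposes the DTST module, which combines temporal reasoning with prediction of the future slot initialization and significantly improves unsupervised object discovery in surgical video.
Method: Building on the object-centric slot attention paradigm, the DTST module is trained both for temporal reasoning and for predicting the optimal future slot initialization, which enables efficient parsing of surgical video.
Result: The model achieves state-of-the-art performance on multiple surgical databases, confirming that unsupervised object-centric methods are applicable to real-world medical data.
Insight: By combining temporal information with a dynamic prediction mechanism, unsupervised object-centric learning can be applied effectively to complex medical scenes.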
Abstract: Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations (slots). This enables effective reasoning about objects and events at a low computational cost and is thus applicable to critical healthcare applications, such as real-time interpretation of surgical video. The heterogeneous scenes in real-world applications like surgery are, however, difficult to parse into a meaningful set of slots. Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low. To address this challenge, we propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot initialization. The model achieves state-of-the-art performance on multiple surgical databases, demonstrating that unsupervised object-centric methods can be applied to real-world data and become part of the common arsenal in healthcare applications.
[80] Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning
Qingdong He,Xueqin Chen,Chaoyi Wang,Yanjie Pan,Xiaobin Hu,Zhenye Gan,Yabiao Wang,Chengjie Wang,Xiangtai Li,Jiangning Zhang
Main category: cs.CV
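TL;DR: The paper introduces the Reason50K dataset and the ReasonBrain framework, which target the complex reasoning required for image editing from hypothetical instructions. By combining multimodal large language models with a diffusion model and adding a fine-grained reasoning cue extraction module, the approach handles demanding tasks involving physical, temporal, causal, and story reasoning.
Details
Motivation: Existing instruction-based image editing methods struggle with complex hypothetical instructions that require deep reasoning, and datasets and architectures that support such tasks are lacking. Contribution: 1. Proposes Reason50K, a dataset of over 50K samples covering four key reasoning scenarios. 2. Proposes the ReasonBrain framework, which combines MLLMs with a diffusion model and introduces the FRCE module to extract fine-grained semantics. 3. Designs a Cross-Modal Enhancer (CME) to mitigate semantic loss.
Method: 1. MLLMs generate the editing guidance. 2. A diffusion model performs image synthesis. 3. The FRCE module extracts detailed visual and textual semantics. 4. The CME enriches cross-modal interaction.
Result: ReasonBrain outperforms existing baselines on reasoning scenarios and shows strong zero-shot generalization to conventional IIE tasks.
Insight: 1. Fine-grained semantic extraction is essential for reasoning over complex instructions. 2. Cross-modal interaction design can effectively mitigate semantic loss. 3. The approach offers a new direction for complex image editing tasks.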
Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.
[81] 3D Reconstruction and Information Fusion between Dormant and Canopy Seasons in Commercial Orchards Using Deep Learning and Fast GICP
Ranjan Sapkota,Zhichao Meng,Martin Churuvija,Xiaoqiang Du,Zenghong Ma,Manoj Karkee
Main category: cs.CV
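TL;DR: The paper proposes a framework that fuses multi-seasonal orchard structure information, combining deep learning with 3D reconstruction to address canopy occlusion during the foliage season.
Details
Motivation: In orchard automation, dense foliage during the canopy season severely occludes trunks and branches, whereas tree structure is clearly visible in the dormant season. The study aims to fuse multi-seasonal data to support crop management throughout the growing season. Contribution: Develops an information fusion framework that uses YOLOv9-Seg for instance segmentation, Kinect Fusion for 3D reconstruction, and Fast GICP for model alignment, achieving accurate cross-seasonal structural modeling.
Method: Combines YOLOv9-Seg segmentation, Kinect Fusion 3D reconstruction, and Fast GICP model alignment on RGB-D imagery.
Result: YOLOv9-Seg reaches segmentation mAP@50 scores of up to 0.78 for trunks in the dormant season dataset; Kinect Fusion reconstruction yields RMSEs of 5.23 mm for trunk diameter, 4.50 mm for branch diameter, and 13.72 mm for branch spacing; Fast GICP registration achieves a minimum fitness score of 0.00197.
Insight: Multi-seasonal data fusion overcomes occlusion during the canopy season and provides more precise structural information for automated orchard operations such as pruning.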
Abstract: In orchard automation, dense foliage during the canopy season severely occludes tree structures, minimizing visibility to various canopy parts such as trunks and branches, which limits the ability of a machine vision system. However, canopy structure is more open and visible during the dormant season when trees are defoliated. In this work, we present an information fusion framework that integrates multi-seasonal structural data to support robotic and automated crop load management during the entire growing season. The framework combines high-resolution RGB-D imagery from both dormant and canopy periods using YOLOv9-Seg for instance segmentation, Kinect Fusion for 3D reconstruction, and Fast Generalized Iterative Closest Point (Fast GICP) for model alignment. Segmentation outputs from YOLOv9-Seg were used to extract depth-informed masks, which enabled accurate 3D point cloud reconstruction via Kinect Fusion; these reconstructed models from each season were subsequently aligned using Fast GICP to achieve spatially coherent multi-season fusion. The YOLOv9-Seg model, trained on manually annotated images, achieved a mean squared error (MSE) of 0.0047 and segmentation mAP@50 scores up to 0.78 for trunks in dormant season dataset. Kinect Fusion enabled accurate reconstruction of tree geometry, validated with field measurements resulting in root mean square errors (RMSE) of 5.23 mm for trunk diameter, 4.50 mm for branch diameter, and 13.72 mm for branch spacing. Fast GICP achieved precise cross-seasonal registration with a minimum fitness score of 0.00197, allowing integrated, comprehensive tree structure modeling despite heavy occlusions during the growing season. This fused structural representation enables robotic systems to access otherwise obscured architectural information, improving the precision of pruning, thinning, and other automated orchard operations.
[82] IC-Custom: Diverse Image Customization via In-Context Learning
Yaowei Li,Xiaoyu Li,Zhaoyang Zhang,Yuxuan Bian,Gan Liu,Xinyuan Li,Jiale Xu,Wenbo Hu,Yating Liu,Lingen Li,Jing Cai,Yuexian Zou,Yancheng He,Ying Shan
Main category: cs.CV
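TL;DR: IC-Custom is a framework that unifies position-aware and position-free image customization through in-context learning. Using a multi-modal attention mechanism and a high-quality dataset, it significantly improves performance across a range of industrial applications.
Details
Motivation: Current image customization methods lack a unified framework, which limits their use across diverse scenarios; a universal and efficient solution is needed. Contribution: Proposes the IC-Custom framework, which unifies position-aware and position-free customization, and introduces the ICMA mechanism and a high-quality dataset, significantly improving task performance.
Method: Uses in-context learning: reference images are concatenated with target images into a polyptych, and the model relies on DiT's multi-modal attention mechanism together with task-oriented register tokens and boundary-aware positional embeddings.
Result: Performs strongly on ProductBench and DreamBench, with approximately 73% higher human preference, while training only 0.4% of the original model parameters.
Insight: In-context learning and multi-modal attention can effectively unify diverse customization tasks, and high-quality data is essential to avoid the artifacts of synthetic data.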
Abstract: Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images to a polyptych, leveraging DiT’s multi-modal attention mechanism for fine-grained token-level interactions. We introduce the In-context Multi-Modal Attention (ICMA) mechanism with learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to correctly handle different task types and distinguish various inputs in polyptych configurations. To bridge the data gap, we carefully curated a high-quality dataset of 12k identity-consistent samples with 8k from real-world sources and 4k from high-quality synthetic data, avoiding the overly glossy and over-saturated synthetic appearance. IC-Custom supports various industrial applications, including try-on, accessory placement, furniture arrangement, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves approximately 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters. Project page: https://liyaowei-stu.github.io/project/IC_Custom
[83] evMLP: An Efficient Event-Driven MLP Architecture for Vision
Zhentan Zheng
Main category: cs.CV
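TL;DR: The paper proposes evMLP, an event-driven MLP architecture for efficient vision processing, particularly of video data. By selectively processing only the image patches that change (events), evMLP significantly reduces redundant computation.
Details
Motivation: Although CNNs and ViTs perform very well on vision tasks, MLP architectures remain a promising research direction. evMLP is motivated by using an event-driven mechanism to make MLP computation more efficient, especially for sequential data such as video. Contribution: Proposes the evMLP architecture with an event-driven local update mechanism that selectively processes changed patches, substantially reducing computation while maintaining output consistency with its non-event-driven baseline.
Method: evMLP processes image patches with MLPs and detects differences between consecutive frames (events) to selectively update only the affected regions, avoiding redundant computation on unchanged regions.
Result: On ImageNet classification, evMLP attains accuracy competitive with state-of-the-art models. On video datasets, it reduces computational cost while maintaining output consistency with its non-event-driven baseline.
Insight: The event-driven mechanism offers a new approach to processing sequential visual data efficiently and could be extended to more dynamic-scene tasks in the future.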
Abstract: Deep neural networks have achieved remarkable results in computer vision tasks. In the early days, Convolutional Neural Networks (CNNs) were the mainstream architecture. In recent years, Vision Transformers (ViTs) have become increasingly popular. In addition, exploring applications of multi-layer perceptrons (MLPs) has provided new perspectives for research into vision model architectures. In this paper, we present evMLP accompanied by a simple event-driven local update mechanism. The proposed evMLP can independently process patches on images or feature maps via MLPs. We define changes between consecutive frames as “events”. Under the event-driven local update mechanism, evMLP selectively processes patches where events occur. For sequential image data (e.g., video processing), this approach improves computational performance by avoiding redundant computations. Through ImageNet image classification experiments, evMLP attains accuracy competitive with state-of-the-art models. More significantly, experimental results on multiple video datasets demonstrate that evMLP reduces computational cost via its event-driven local update mechanism while maintaining output consistency with its non-event-driven baseline. The code and trained models are available at https://github.com/i-evi/evMLP.
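The event-driven local update described above can be illustrated with a small sketch: because patches are processed independently, the output for an unchanged patch can be reused from the previous frame. The change test (mean absolute difference against a threshold), the toy per-patch function, and all names are assumptions for illustration; the actual evMLP design may differ.

```python
from typing import Callable, List, Sequence, Tuple

Patch = Sequence[float]


def is_event(prev: Patch, curr: Patch, threshold: float) -> bool:
    """Flag a patch as an 'event' when its mean absolute change exceeds the threshold."""
    mean_abs_diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
    return mean_abs_diff > threshold


def event_driven_update(
    prev_patches: List[Patch],
    curr_patches: List[Patch],
    cached_outputs: List[float],
    patch_fn: Callable[[Patch], float],
    threshold: float = 0.0,
) -> Tuple[List[float], int]:
    """Recompute patch_fn only where an event occurred; reuse cached outputs elsewhere."""
    outputs = list(cached_outputs)
    recomputed = 0
    for i, (prev, curr) in enumerate(zip(prev_patches, curr_patches)):
        if is_event(prev, curr, threshold):
            outputs[i] = patch_fn(curr)
            recomputed += 1
    return outputs, recomputed


def toy_mlp(patch: Patch) -> float:
    """Stand-in for a per-patch MLP: one fixed linear layer followed by ReLU."""
    weights = [0.5, -0.25, 1.0, 0.75]
    return max(0.0, sum(w * x for w, x in zip(weights, patch)))


frame0 = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.2, 0.4, 0.6, 0.8]]
frame1 = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.9, 0.4, 0.6, 0.8]]

cache = [toy_mlp(p) for p in frame0]            # full pass on the first frame
outputs, recomputed = event_driven_update(frame0, frame1, cache, toy_mlp)
full_pass = [toy_mlp(p) for p in frame1]        # non-event-driven baseline
```

With a threshold of zero the event-driven result matches the full pass exactly while recomputing only the changed patch; a larger threshold would trade some output consistency for further savings.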
[84] CI-VID: A Coherent Interleaved Text-Video Dataset
Yiming Ju,Jijin Hu,Zhengxiong Luo,Haoge Deng,Hanyu Zhao,Li Du,Chengwei Wu,Donglin Hao,Xinlong Wang,Tengfei Pan
Main category: cs.CV
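TL;DR: CI-VID is a dataset for generating coherent multi-scene video sequences. It moves beyond isolated text-video pairs and supports text-and-video-to-video (TV2V) generation.
Details
Motivation: Existing public datasets consist mainly of isolated text-video pairs and cannot support modeling of coherent multi-clip video sequences, which limits the temporal consistency of generated videos.
Contribution: Introduces the CI-VID dataset of over 340,000 samples, each providing a coherent sequence of video clips with text captions, supporting visually and textually grounded generation.
Method: Designs a multi-dimensional benchmark that combines human evaluation, VLM-based assessment, and similarity-based metrics to validate the dataset.
Result: Models trained on CI-VID show significant improvements in accuracy and content consistency when generating video sequences.
Insight: CI-VID provides high-quality data for story-driven content generation and advances temporal modeling in video generation.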
Abstract: Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation, enabling models to produce coherent, multi-scene video sequences. CI-VID contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions that capture both the individual content of each clip and the transitions between them, enabling visually and textually grounded generation. To further validate the effectiveness of CI-VID, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences. This facilitates the creation of story-driven content with smooth visual transitions and strong temporal coherence, underscoring the quality and practical utility of the CI-VID dataset. We release the CI-VID dataset and the accompanying code for data construction and evaluation at: https://github.com/ymju-BAAI/CI-VID
[85] LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
Nan Chen,Mengqi Huang,Yihao Meng,Zhendong Mao
Main category: cs.CV
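TL;DR: This paper proposes LongAnimation, a framework that addresses long-term color consistency in long animation colorization by means of a Dynamic Global-Local Memory module and a Color Consistency Reward.
Details
Motivation: Long animation colorization is costly in animation production. Existing methods are limited to short-term colorization and neglect global information, so long-term color consistency suffers.
Contribution: Proposes the LongAnimation framework, comprising SketchDiT, a Dynamic Global-Local Memory (DGLM) module, and a Color Consistency Reward, to address long-term color consistency.
Method: SketchDiT extracts hybrid reference features; the DGLM module dynamically fuses global historical features with the current generation features; a Color Consistency Reward further refines color consistency.
Result: Experiments on short-term (14 frames) and long-term (average 500 frames) animations show that LongAnimation maintains color consistency in open-domain animation colorization.
Insight: A dynamic global-local paradigm can improve color consistency in long animation colorization and offers the industry an automated solution.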
Abstract: Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for open-domain animation colorization task. The code can be found at https://cn-makers.github.io/long_animation_web/.
[86] Kwai Keye-VL Technical Report
Kwai Keye Team,Biao Yang,Bin Wen,Changyi Liu,Chenglong Chu,Chengru Song,Chongling Rao,Chuan Yi,Da Li,Dunju Zang,Fan Yang,Guorui Zhou,Hao Peng,Haojie Ding,Jiaming Huang,Jiangxia Cao,Jiankang Chen,Jingyun Hua,Jin Ouyang,Kaibing Chen,Kaiyu Jiang,Kaiyu Tang,Kun Gai,Shengnan Zhang,Siyang Mao,Sui Huang,Tianke Zhang,Tingting Gao,Wei Chen,Wei Yuan,Xiangyu Wu,Xiao Hu,Xingyu Lu,Yang Zhou,Yi-Fan Zhang,Yiping Yang,Yulong Chen,Zhenhua Wu,Zhenyu Li,Zhixin Ling,Ziming Li,Dehua Ma,Di Xu,Haixuan Gao,Hang Li,Jiawei Guo,Jing Wang,Lejian Ren,Muhao Wei,Qianqian Wang,Qigen Hu,Shiyao Wang,Tao Yu,Xinchen Luo,Yan Li,Yiming Liang,Yuhang Hu,Zeyi Lu,Zhuoran Yang,Zixing Zhang
Main category: cs.CV
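TL;DR: Kwai Keye-VL is an 8B-parameter multimodal foundation model designed for short-video understanding that also performs strongly on general vision-language tasks. Its core ingredients are a large-scale, high-quality dataset and a new training recipe.
Details
Motivation: Current multimodal large language models fall short on dynamic, information-dense short videos; Kwai Keye-VL aims to close this gap.
Contribution: 1) Proposes the Kwai Keye-VL model; 2) builds a high-quality dataset of over 600B tokens; 3) introduces a four-stage pre-training and two-phase post-training recipe; 4) releases the new KC-MMBench benchmark.
Method: A four-stage pre-training process establishes vision-language alignment. Post-training has two phases: the first strengthens foundational capabilities and the second targets advanced reasoning, using a five-mode data mixture and reinforcement learning to refine model behavior.
Result: Achieves state-of-the-art results on public video benchmarks, stays competitive on general image tasks, and shows a significant advantage on KC-MMBench.
Insight: Data diversity and a staged training strategy let the model excel at short-video understanding while retaining general capabilities, offering a useful reference for multimodal model design.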
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today’s digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.
[87] How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Rahul Ramachandran,Ali Garjani,Roman Bachmann,Andrei Atanov,Oğuzhan Fatih Kar,Amir Zamir
Main category: cs.CV
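TL;DR: The paper translates standard computer vision tasks into a text-promptable framework to evaluate multimodal foundation models (such as GPT-4o) on visual understanding. The models trail specialist models but prove to be respectable generalists.
Details
Motivation: To study how multimodal foundation models perform on standard computer vision tasks, filling a gap in the evaluation of their visual understanding.
Contribution: Proposes a standardized framework for benchmarking multimodal foundation models and reveals their performance gap between semantic and geometric tasks as well as their sensitivity to prompt variations.
Method: Uses prompt chaining to translate standard vision tasks into text-promptable, API-compatible tasks and evaluates the models on multiple datasets.
Result: Multimodal models fall short of specialist models but are capable generalists; GPT-4o performs best among non-reasoning models; reasoning models do better on geometric tasks.
Insight: Multimodal models handle semantic tasks better than geometric ones, and the effect of prompt chaining varies by model; models with native image generation can exhibit hallucinations and spatial misalignments.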
Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
cs.AI [Back]
[88] Pensieve Grader: An AI-Powered, Ready-to-Use Platform for Effortless Handwritten STEM Grading
Yoonseok Yang,Minjune Kim,Marlon Rondinelli,Keren Shao
Main category: cs.AI
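TL;DR: Pensieve Grader is an AI-powered platform for grading handwritten STEM work. It uses large language models to transcribe and score submissions, reducing instructor grading time.
Details
Motivation: Grading handwritten, open-ended responses is a major bottleneck in university STEM courses, and existing tools do not cover the full pipeline from transcription to feedback.
Contribution: Presents the Pensieve platform, which integrates transcription, scoring, and feedback with a human in the loop. It has been deployed at over 20 institutions and has graded more than 300,000 student responses.
Method: Uses large language models (LLMs) to transcribe and score student work, returning rubric-aligned scores, transcriptions, and confidence ratings.
Result: Reduces grading time by 65% on average; high-confidence predictions agree with instructor-assigned grades 95.4% of the time.
Insight: AI-assisted grading tools can improve efficiency in education, but they need human involvement to ensure quality.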
Abstract: Grading handwritten, open-ended responses remains a major bottleneck in large university STEM courses. We introduce Pensieve (https://www.pensieve.co), an AI-assisted grading platform that leverages large language models (LLMs) to transcribe and evaluate student work, providing instructors with rubric-aligned scores, transcriptions, and confidence ratings. Unlike prior tools that focus narrowly on specific tasks like transcription or rubric generation, Pensieve supports the entire grading pipeline, from scanned student submissions to final feedback, within a human-in-the-loop interface. Pensieve has been deployed in real-world courses at over 20 institutions and has graded more than 300,000 student responses. We present system details and empirical results across four core STEM disciplines: Computer Science, Mathematics, Physics, and Chemistry. Our findings show that Pensieve reduces grading time by an average of 65%, while maintaining a 95.4% agreement rate with instructor-assigned grades for high-confidence predictions.
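To make the "agreement rate for high-confidence predictions" concrete, here is a small sketch of confidence-gated routing in a human-in-the-loop grader. It is an illustration only: the abstract does not say how Pensieve defines "high confidence", so the 0.9 threshold, the record fields, and the toy data below are all assumptions, and the numbers are not results from the paper.

```python
def route_predictions(predictions, threshold=0.9):
    """Split AI grades into an auto-accept queue and a human-review queue.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    accepted = [p for p in predictions if p["confidence"] >= threshold]
    review = [p for p in predictions if p["confidence"] < threshold]
    return accepted, review


def agreement_rate(predictions):
    """Fraction of predictions whose AI score equals the instructor score."""
    if not predictions:
        return 0.0
    matches = sum(1 for p in predictions
                  if p["ai_score"] == p["instructor_score"])
    return matches / len(predictions)


# Toy records, invented for illustration.
predictions = [
    {"id": "q1", "ai_score": 5, "instructor_score": 5, "confidence": 0.97},
    {"id": "q2", "ai_score": 3, "instructor_score": 3, "confidence": 0.93},
    {"id": "q3", "ai_score": 4, "instructor_score": 2, "confidence": 0.55},
    {"id": "q4", "ai_score": 1, "instructor_score": 1, "confidence": 0.62},
]

accepted, review = route_predictions(predictions)
high_conf_agreement = agreement_rate(accepted)
overall_agreement = agreement_rate(predictions)
```

Reporting agreement only on the accepted subset, as the abstract does, measures how trustworthy the automatic path is; the low-confidence items are the ones an instructor would still check by hand.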
[89] T3DM: Test-Time Training-Guided Distribution Shift Modelling for Temporal Knowledge Graph Reasoning
Yuehang Si,Zefan Zeng,Jincai Huang,Qing Cheng
Main category: cs.AI
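TL;DR: The paper proposes T3DM, a method that tackles distribution shift and low-quality negative sampling in temporal knowledge graph reasoning, improving model performance through test-time training and adversarial training.
Details
Motivation: Existing temporal knowledge graph reasoning methods model the event distribution shift between training and test samples inadequately, and they rely on random entity substitution, which produces low-quality negative samples.
Contribution: 1) Proposes T3DM, which models distribution shift through test-time training; 2) designs an adversarial-training-based strategy for generating higher-quality negative samples.
Method: 1) Uses test-time training to adjust the model to distribution shift; 2) improves negative sample generation through adversarial training.
Result: T3DM outperforms existing baselines in most cases and yields more robust reasoning results.
Insight: Modeling distribution shift and improving negative sample generation both matter for temporal knowledge graph reasoning.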
Abstract: Temporal Knowledge Graph (TKG) is an efficient method for describing the dynamic development of facts along a timeline. Most research on TKG reasoning (TKGR) focuses on modelling the repetition of global facts and designing patterns of local historical facts. However, they face two significant challenges: inadequate modeling of the event distribution shift between training and test samples, and reliance on random entity substitution for generating negative samples, which often results in low-quality sampling. To this end, we propose a novel distributional feature modeling approach for training TKGR models, Test-Time Training-guided Distribution shift Modelling (T3DM), to adjust the model based on distribution shift and ensure the global consistency of model reasoning. In addition, we design a negative-sampling strategy to generate higher-quality negative quadruples based on adversarial training. Extensive experiments show that T3DM provides better and more robust results than the state-of-the-art baselines in most cases.
cs.IR [Back]
[90] Can Argus Judge Them All? Comparing VLMs Across Domains
Harsh Joshi,Gautam Siddharth Kashyap,Rafiq Ali,Ebad Shabbir,Niharika Jain,Sarthak Jain,Jiechao Gao,Usman Naseem
Main category: cs.IR
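TL;DR: The paper compares three vision-language models (CLIP, BLIP, and LXMERT) across tasks, proposes a Cross-Dataset Consistency (CDC) metric, and exposes the trade-off between generalization and specialization.
Details
Motivation: Vision-language models (VLMs) are advancing quickly in multimodal AI, but the consistency of their performance across tasks is underexamined, so a systematic evaluation and comparison is needed.
Contribution: 1) Compares CLIP, BLIP, and LXMERT across multiple tasks; 2) proposes the Cross-Dataset Consistency (CDC) metric; 3) reveals how the models differ in generalization versus specialization.
Method: Evaluates the three models on retrieval, captioning, and reasoning tasks, measuring task accuracy, generation quality, efficiency, and CDC.
Result: CLIP generalizes best (CDC: 0.92), BLIP excels on curated data, and LXMERT leads in structured reasoning.
Insight: Models that generalize well (such as CLIP) are robust across tasks, while more specialized models (such as BLIP and LXMERT) have the edge on particular tasks, which informs model selection for industrial deployment.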
Abstract: Vision-Language Models (VLMs) are advancing multimodal AI, yet their performance consistency across tasks is underexamined. We benchmark CLIP, BLIP, and LXMERT across diverse datasets spanning retrieval, captioning, and reasoning. Our evaluation includes task accuracy, generation quality, efficiency, and a novel Cross-Dataset Consistency (CDC) metric. CLIP shows strongest generalization (CDC: 0.92), BLIP excels on curated data, and LXMERT leads in structured reasoning. These results expose trade-offs between generalization and specialization, informing industrial deployment of VLMs and guiding development toward robust, task-flexible architectures.
[91] Embedding-based Retrieval in Multimodal Content Moderation
Hanzhong Liang,Jinghao Shi,Xiang Shen,Zixuan Wang,Vera Wen,Ardalan Mehrani,Zhiqian Chen,Yifan Wu,Zhixin Zhang
Main category: cs.IR
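TL;DR: The paper proposes an Embedding-Based Retrieval (EBR) method for multimodal content moderation. Embedding models are trained with Supervised Contrastive Learning (SCL), improving performance and lowering operational costs.
Details
Motivation: Traditional classification approaches struggle with rapid, cost-efficient responses, especially for emerging trends and urgent escalations. The paper proposes an embedding-based retrieval method to address this.
Contribution: 1) Proposes an SCL-based framework for training embedding models that outperforms CLIP and MoCo; 2) designs an EBR system that improves the efficiency and flexibility of content moderation.
Method: Trains single-modal and multi-modal embedding models with Supervised Contrastive Learning (SCL) and builds a system that integrates embedding generation and video retrieval.
Result: In offline experiments, EBR improves ROC-AUC from 0.85 to 0.99 and PR-AUC from 0.35 to 0.95; in online experiments, it increases action rates by 10.32% and reduces operational costs by over 80%.
Insight: Embedding-based retrieval is efficient and flexible for content moderation and reduces costs; it is proposed as a complement to traditional classification.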
Abstract: Video understanding plays a fundamental role for content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. We first leverage a Supervised Contrastive Learning (SCL) framework to train a suite of foundation embedding models, including both single-modal and multi-modal architectures. Our models demonstrate superior performance over established contrastive learning methods such as CLIP and MoCo. Building on these embedding models, we design and implement the embedding-based retrieval system that integrates embedding generation and video retrieval to enable efficient and effective trend handling. Comprehensive offline experiments on 25 diverse emerging trends show that EBR improves ROC-AUC from 0.85 to 0.99 and PR-AUC from 0.35 to 0.95. Further online experiments reveal that EBR increases action rates by 10.32% and reduces operational costs by over 80%, while also enhancing interpretability and flexibility compared to classification-based solutions.
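The retrieval step of an embedding-based moderation system can be illustrated with a brute-force cosine-similarity search. This is a minimal sketch under stated assumptions: the abstract does not describe the index, similarity measure, or thresholds the authors use, so `cosine`, `retrieve`, `top_k`, `min_score`, and the three-dimensional toy embeddings are hypothetical. A production system would typically use an approximate nearest-neighbor index instead of a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, index, top_k=2, min_score=0.8):
    """Return up to top_k (video_id, score) pairs most similar to the query.

    top_k and min_score are illustrative defaults, not values from the paper.
    """
    scored = [(vid, cosine(query, emb)) for vid, emb in index.items()]
    scored = [(vid, s) for vid, s in scored if s >= min_score]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; a real system would use the trained embedding models.
index = {
    "video_a": [1.0, 0.0, 0.0],
    "video_b": [0.9, 0.1, 0.0],
    "video_c": [0.0, 1.0, 0.0],
}
seed = [1.0, 0.05, 0.0]  # embedding of a known example of a new trend
hits = retrieve(seed, index)
```

The appeal for trend handling is that a new trend needs only a few seed examples to query with, rather than a retrained classifier.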
cs.LG [Back]
[92] PathCoT: Chain-of-Thought Prompting for Zero-shot Pathology Visual Reasoning
Junjie Zhou,Yingli Zuo,Shichang Feng,Peng Wan,Qi Zhu,Daoqiang Zhang,Wei Shao
Main category: cs.LG
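TL;DR: PathCoT is a zero-shot chain-of-thought prompting method that combines pathology expert knowledge with a self-evaluation step to improve multimodal large language models on pathology visual reasoning.
Details
Motivation: Existing multimodal large language models perform poorly on pathology visual reasoning, mainly because they lack domain knowledge and because extra reasoning steps introduce errors.
Contribution: PathCoT integrates pathology expert knowledge into the reasoning process and uses self-evaluation to reduce reasoning errors.
Method: PathCoT guides the model's reasoning with expert knowledge, then self-evaluates the directly generated answer against the chain-of-thought answer to determine the final answer.
Result: Experiments on the PathMMU dataset demonstrate the effectiveness of PathCoT for pathology visual understanding and reasoning.
Insight: Introducing domain knowledge and a self-evaluation step can improve a model's performance and reliability on specialized tasks.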
Abstract: With the development of generative artificial intelligence and instruction tuning techniques, multimodal large language models (MLLMs) have made impressive progress on general reasoning tasks. Benefiting from the chain-of-thought (CoT) methodology, MLLMs can solve visual reasoning problems step by step. However, existing MLLMs still face significant challenges when applied to pathology visual reasoning tasks: (1) LLMs often underperform because they lack domain-specific information, which can lead to model hallucinations. (2) The additional reasoning steps in CoT may introduce errors, leading to the divergence of answers. To address these limitations, we propose PathCoT, a novel zero-shot CoT prompting method that integrates pathology expert knowledge into the reasoning process of MLLMs and incorporates self-evaluation to mitigate divergence of answers. Specifically, PathCoT guides the MLLM with prior knowledge to act as a pathology expert and provide a comprehensive analysis of the image using domain-specific knowledge. By incorporating the experts’ knowledge, PathCoT can obtain the answers with CoT reasoning. Furthermore, PathCoT incorporates a self-evaluation step that assesses both the results generated directly by MLLMs and those derived through CoT, finally determining the reliable answer. The experimental results on the PathMMU dataset demonstrate the effectiveness of our method on pathology visual understanding and reasoning.
[93] Self-Guided Process Reward Optimization with Masked Step Advantage for Process Reinforcement Learning
Wu Fei,Hao Kong,Shuxian Liang,Yang Lin,Yibo Yang,Jing Tang,Lei Chen,Xiansheng Hua
Main category: cs.LG
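TL;DR: The paper proposes SPRO, a self-guided process reward optimization framework that derives process rewards from the policy model itself and introduces Masked Step Advantage (MSA), improving the efficiency and performance of process reinforcement learning.
Details
Motivation: Process reinforcement learning (PRL) shows potential for strengthening the reasoning of large language models (LLMs), but existing methods incur substantial computational overhead and lack a unified theoretical framework.
Contribution: 1) Shows theoretically that process rewards can be derived from the policy model itself; 2) proposes Masked Step Advantage (MSA) to improve step-level advantage estimation.
Method: Combines cumulative process rewards with Masked Step Advantage (MSA) computed within shared-prompt sampling groups to achieve efficient process-aware reinforcement learning.
Result: SPRO achieves 3.4x higher training efficiency than GRPO and a 17.5% improvement in test accuracy, while shortening responses and maintaining stable policy entropy.
Insight: SPRO achieves effective exploration and reward optimization with no additional computational overhead, which makes it suitable for industrial deployment.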
Abstract: Process Reinforcement Learning (PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose Self-Guided Process Reward Optimization (SPRO), a novel framework that enables process-aware RL through two key innovations: (1) we first theoretically demonstrate that process rewards can be derived intrinsically from the policy model itself, and (2) we introduce well-defined cumulative process rewards and Masked Step Advantage (MSA), which facilitates rigorous step-wise action advantage estimation within shared-prompt sampling groups. Our experimental results demonstrate that SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5% test accuracy improvement. Furthermore, SPRO maintains a stable and elevated policy entropy throughout training while reducing the average response length by approximately 1/3, evidencing sufficient exploration and prevention of reward hacking. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefits industrial implementation.
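The abstract does not give the formula for Masked Step Advantage, so the following is only one plausible reading of "step-wise advantage estimation within shared-prompt sampling groups": at each step, subtract the mean cumulative process reward of the responses in the group that actually reach that step, masking out the shorter ones. The function name, the input layout, and the toy rewards are assumptions made for illustration and should not be taken as the authors' definition.

```python
def masked_step_advantage(group_rewards):
    """Per-step advantage within one shared-prompt sampling group.

    group_rewards[i][t] is the cumulative process reward of response i at
    step t. Responses may have different lengths; a response that has
    already ended is masked out of the step-t baseline. This baseline
    choice is an assumption, not the paper's stated formula.
    """
    max_len = max(len(r) for r in group_rewards)
    baselines = []
    for t in range(max_len):
        present = [r[t] for r in group_rewards if t < len(r)]
        baselines.append(sum(present) / len(present))
    return [[r[t] - baselines[t] for t in range(len(r))]
            for r in group_rewards]


# Three responses to the same prompt, with 3, 2, and 1 reasoning steps.
group = [[1.0, 2.0, 3.0], [0.0, 1.0], [2.0]]
advantages = masked_step_advantage(group)
```

Under this reading, a response is compared only against peers at the same depth of reasoning, so a long response is not penalized or rewarded merely for being long.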
[94] Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
Zeyu Huang,Tianhao Cheng,Zihan Qiu,Zili Wang,Yinghui Xu,Edoardo M. Ponti,Ivan Titov
Main category: cs.LG
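TL;DR: The paper proposes Prefix-RFT, a hybrid method that combines the strengths of Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) by unifying the two paradigms through prefix sampling. On mathematical reasoning tasks it outperforms standalone SFT, standalone RFT, and mixed-policy RFT methods.
Details
Motivation: The two existing post-training techniques for large language models each have trade-offs: SFT is good at imitating demonstration data but tends to overfit, while RFT can raise performance but is sensitive to the initial policy and may learn unexpected behaviors. The authors aim to unify the two so that they complement each other.
Contribution: Proposes Prefix-RFT, which combines the strengths of SFT and RFT; shows that it outperforms existing methods on mathematical reasoning tasks; highlights the complementary nature of SFT and RFT; verifies robustness to the quality and quantity of demonstration data.
Method: Introduces Prefix-RFT, which uses prefix sampling to combine learning from demonstration (SFT) with learning from exploration (RFT). Prefix sampling lets the model balance the two paradigms during training and needs only minimal changes to the standard RFT pipeline.
Result: Prefix-RFT surpasses standalone SFT and RFT as well as parallel mixed-policy RFT methods. Experiments confirm its robustness to variations in the demonstration data.
Insight: SFT and RFT are complementary; a unified paradigm that integrates demonstration and exploration is a promising direction for LLM post-training; the simplicity of Prefix-RFT makes it easy to integrate into existing frameworks.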
Abstract: Existing post-training techniques for large language models are broadly categorized into Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT). Each paradigm presents a distinct trade-off: SFT excels at mimicking demonstration data but can lead to problematic generalization as a form of behavior cloning. Conversely, RFT can significantly enhance a model’s performance but is prone to learn unexpected behaviors, and its performance is highly sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a testbed, we empirically demonstrate that Prefix-RFT is both simple and effective. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods. A key advantage is its seamless integration into existing open-source frameworks, requiring only minimal modifications to the standard RFT pipeline. Our analysis highlights the complementary nature of SFT and RFT, and validates that Prefix-RFT effectively harmonizes these two learning paradigms. Furthermore, ablation studies confirm the method’s robustness to variations in the quality and quantity of demonstration data. We hope this work offers a new perspective on LLM post-training, suggesting that a unified paradigm that judiciously integrates demonstration and exploration could be a promising direction for future research.
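As a rough picture of what "learning from both demonstration and exploration" through prefixes could look like, the sketch below builds one rollout by keeping the first part of a demonstration and letting the policy generate the rest. The abstract does not specify how prefix lengths are chosen or how the loss treats prefix tokens, so the uniform cut point, `prefix_rollout`, and the placeholder `toy_policy` are all assumptions rather than the authors' procedure.

```python
import random


def prefix_rollout(demo_tokens, continue_fn, rng, max_new_tokens=8):
    """Build one rollout: a demonstration prefix plus a policy continuation.

    The cut point is drawn uniformly from 0..len(demo_tokens); this
    schedule is an assumption for illustration only.
    """
    cut = rng.randint(0, len(demo_tokens))  # randint includes both ends
    prefix = demo_tokens[:cut]
    continuation = continue_fn(prefix, max_new_tokens)
    return prefix, continuation


def toy_policy(prefix, max_new_tokens):
    """Hypothetical stand-in for the policy model: emits placeholder tokens."""
    return ["<gen>"] * max_new_tokens


rng = random.Random(0)
demo = ["Let", "x", "=", "2", ";", "then", "x+1", "=", "3"]
prefix, continuation = prefix_rollout(demo, toy_policy, rng, max_new_tokens=3)
```

A cut at zero gives a pure exploration rollout and a cut at the full length gives a pure demonstration, so a single rollout routine spans both paradigms.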
[95] Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training
Ismail Labiad,Mathurin Videau,Matthieu Kowalski,Marc Schoenauer,Alessandro Leite,Julia Kempe,Olivier Teytaud
Main category: cs.LG
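TL;DR: This paper proposes BBoxER, an evolutionary black-box method for post-training large language models (LLMs). It induces an information bottleneck by implicitly compressing the training data, yielding theoretical guarantees on generalization, differential privacy, and robustness to data poisoning and extraction attacks.
Details
Motivation: Gradient-based optimization carries privacy and security risks (such as data poisoning and overfitting). Black-box optimization avoids these risks but scales poorly to high-dimensional parameter spaces such as those of LLMs and is computationally expensive. BBoxER aims to address these problems.
Contribution: Proposes BBoxER, whose information bottleneck provides theoretical guarantees on privacy, generalization, and attack robustness for LLM post-training, and demonstrates its practical effect on reasoning tasks.
Method: Uses an evolutionary black-box optimization method (BBoxER) that implicitly compresses the training data to induce an information bottleneck, with theoretical analysis supporting its generalization, privacy, and robustness.
Result: Experiments show that a few iterations of BBoxER improve LLM performance and generalize well on a benchmark of reasoning datasets.
Insight: As a black-box method, BBoxER offers a lightweight, modular enhancement for LLMs that is particularly suited to restricted or privacy-sensitive deployments.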
Abstract: Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, its reliance on large volumes of labeled data raises privacy and security concerns such as susceptibility to data poisoning attacks and the risk of overfitting. In contrast, black box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. However, black box methods also pose significant challenges, including poor scalability to high-dimensional parameter spaces, as prevalent in large language models (LLMs), and high computational costs due to reliance on numerous model evaluations. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide strong theoretical bounds on generalization, differential privacy, susceptibility to data poisoning attacks, and robustness to extraction attacks. BBoxER operates on top of pre-trained LLMs, offering a lightweight and modular enhancement suitable for deployment in restricted or privacy-sensitive environments, in addition to non-vacuous generalization guarantees. In experiments with LLMs, we demonstrate empirically that Retrofitting methods are able to learn, showing how a few iterations of BBoxER improve performance and generalize well on a benchmark of reasoning datasets. This positions BBoxER as an attractive add-on on top of gradient-based optimization.
[96] LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs
Reza Arabpour,Haitz Sáez de Ocáriz Borde,Anastasis Kratsios
Main category: cs.LG
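TL;DR: This paper proposes a CPU-based approach to LoRA (low-rank adapter) fine-tuning that generates new adapters through meta-operations on pre-trained adapters, avoiding expensive GPU training.
Details
Motivation: LoRA is widely used as a parameter-efficient way to fine-tune large language models, but its reliance on GPU resources limits adoption. The paper aims to offer a practical alternative for users who have only CPUs.
Contribution: Proposes a theoretically grounded meta-generation framework that learns a mapping from the input data distribution to LoRA weights and generates adapters directly on CPU without gradient updates.
Method: Uses a bank of pre-trained adapters and generates new adapters through lightweight combinations, avoiding compute-intensive GPU training. Implemented on the Mistral-7B-Instruct-v0.2 model.
Result: The generated adapters do not match GPU-trained ones, but they consistently outperform the base model on downstream tasks, giving resource-constrained users a workable option.
Insight: Demonstrates that adapters can be generated efficiently on CPU, broadening the reach of LoRA to environments with limited compute.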
Abstract: Low-Rank Adapters (LoRAs) have transformed the fine-tuning of Large Language Models (LLMs) by enabling parameter-efficient updates. However, their widespread adoption remains limited by the reliance on GPU-based training. In this work, we propose a theoretically grounded approach to LoRA fine-tuning designed specifically for users with limited computational resources, particularly those restricted to standard laptop CPUs. Our method learns a meta-operator that maps any input dataset, represented as a probability distribution, to a set of LoRA weights by leveraging a large bank of pre-trained adapters for the Mistral-7B-Instruct-v0.2 model. Instead of performing new gradient-based updates, our pipeline constructs adapters via lightweight combinations of existing LoRAs directly on CPU. While the resulting adapters do not match the performance of GPU-trained counterparts, they consistently outperform the base Mistral model on downstream tasks, offering a practical and accessible alternative to traditional GPU-based fine-tuning.
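The phrase "lightweight combinations of existing LoRAs" can be pictured as a weighted sum of adapter updates. The sketch below assumes each adapter in the bank is stored as its full update matrix and that mixing coefficients are already available; in the paper those coefficients would come from the learned meta-operator, which is not reproduced here. `combine_adapters`, the 2x2 toy matrices, and the coefficient values are hypothetical.

```python
def combine_adapters(adapters, coefficients):
    """Weighted sum of same-shaped adapter update matrices (lists of rows).

    No gradients are involved, so this runs comfortably on a CPU.
    """
    if len(adapters) != len(coefficients):
        raise ValueError("need exactly one coefficient per adapter")
    rows, cols = len(adapters[0]), len(adapters[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for adapter, coeff in zip(adapters, coefficients):
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += coeff * adapter[i][j]
    return combined


# Toy bank of two adapters, each given as a 2x2 update matrix.
bank = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
# Hand-picked coefficients standing in for the meta-operator's output.
new_adapter = combine_adapters(bank, [0.5, 0.25])
```

The cost is linear in the number of adapters and their size, which is what makes the approach feasible on a laptop CPU compared with gradient-based training.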
[97] Test-Time Scaling with Reflective Generative Model
Zixiao Wang,Yuxin Wang,Xiaorui Wang,Mengting Xing,Jie Gao,Jianjun Xu,Guangcan Liu,Chenhui Jin,Zhuo Wang,Shengzhuo Zhang,Hongtao Xie
Main category: cs.LG
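TL;DR: The paper presents MetaStone-S1, a reflective generative model that uses a self-supervised process reward model (SPRM) to unify the policy model and the process reward model (PRM), enabling efficient reasoning and test-time scaling (TTS).
Details
Motivation: Existing large models typically need many parameters and extra annotation for complex reasoning tasks. By sharing a backbone and adding task-specific heads, SPRM cuts PRM parameters by over 99% while improving performance.
Contribution: 1) Proposes SPRM, which unifies the policy model and the PRM in one interface without extra process annotation; 2) provides three reasoning modes based on controllable thinking length; 3) establishes a scaling law relating thinking computation to TTS performance.
Method: SPRM shares a backbone network and uses task-specific heads (for next-token prediction and process scoring, respectively) to integrate the policy model and the PRM into a single framework. The model supports test-time scaling (TTS) and three reasoning effort modes.
Result: With only 32B parameters, MetaStone-S1 reaches performance comparable to the OpenAI o3-mini series; the model has been open-sourced.
Insight: The shared backbone with task-specific heads removes most parameter redundancy while supporting flexible reasoning modes, offering a useful design pattern for efficient large models.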
Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3’s performance via the self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without extra process annotation, reducing PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suitable for test-time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high), based on the controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that our MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
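One common way a process-scoring head is used at test time is best-of-N selection: sample several reasoning trajectories, score each step, and keep the trajectory with the best aggregate score. The abstract does not say how MetaStone-S1 aggregates step scores or selects answers, so the geometric mean, the function names, and the toy scores below are assumptions meant only to illustrate the idea.

```python
import math


def trajectory_score(step_scores):
    """Geometric mean of per-step scores, each assumed to lie in (0, 1].

    The aggregation rule is an assumption, not taken from the paper.
    """
    return math.exp(sum(math.log(s) for s in step_scores) / len(step_scores))


def best_of_n(candidates):
    """candidates: list of (answer, step_scores); return the best answer."""
    return max(candidates, key=lambda c: trajectory_score(c[1]))[0]


# Toy candidates; in practice the scores would come from the scoring head.
candidates = [
    ("answer A", [0.9, 0.8, 0.9]),
    ("answer B", [0.95, 0.3, 0.9]),
    ("answer C", [0.7, 0.7, 0.7]),
]
chosen = best_of_n(candidates)
```

A geometric mean penalizes a single weak step more than an arithmetic mean would, which is why candidate B loses despite two strong steps. Sampling more candidates is one way to spend additional test-time compute.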
[98] Escaping Plato’s Cave: JAM for Aligning Independently Trained Vision and Language Models
Hyoseo Yoon,Yisong Yue,Been Kim
Main category: cs.LG
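TL;DR: The paper proposes JAM, a framework that aligns the representation spaces of independently trained vision and language models through multi-objective optimization.
Details
Motivation: Vision and language models perform well in their own domains, but their representation spaces are disjoint. The Platonic Representation Hypothesis suggests that such models may share a statistical model of reality, which calls for a method that aligns these representations explicitly.
Contribution: Proposes the JAM framework, which jointly trains modality-specific autoencoders with reconstruction and cross-modal objectives to align the representations of independently trained vision and language models.
Method: JAM is a multi-objective optimization framework consisting of modality-specific autoencoders and cross-modal alignment objectives (such as contrastive loss and Spread loss), trained jointly on the latent representations of pre-trained models.
Result: Experiments show that JAM reliably aligns frozen, independently trained representations across different alignment objectives, layer depths, and model scales.
Insight: A lightweight Pareto-efficient framework can align representations without fine-tuning the pre-trained models, offering a practical path for turning generalist unimodal models into multimodal ones.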
Abstract: Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. Yet an emerging hypothesis - the Platonic Representation Hypothesis - suggests that such models may nonetheless converge toward a shared statistical model of reality. This compatibility, if it exists, raises a fundamental question: can we move beyond post-hoc statistical detection of alignment and explicitly optimize for it between such disjoint representations? We cast this Platonic alignment problem as a multi-objective optimization task - preserve each modality’s native structure while aligning for mutual coherence. We introduce the Joint Autoencoder Modulator (JAM) framework that jointly trains modality-specific autoencoders on the latent representations of pre-trained single modality models, encouraging alignment through both reconstruction and cross-modal objectives. By analogy, this framework serves as a method to escape Plato’s Cave, enabling the emergence of shared structure from disjoint inputs. We evaluate this framework across three critical design axes: (i) the alignment objective - comparing contrastive loss (Con), its hard-negative variant (NegCon), and our Spread loss, (ii) the layer depth at which alignment is most effective, and (iii) the impact of foundation model scale on representational convergence. Our findings show that our lightweight Pareto-efficient framework reliably induces alignment, even across frozen, independently trained representations, offering both theoretical insight and practical pathways for transforming generalist unimodal foundations into specialist multimodal models.
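For readers unfamiliar with the contrastive objective mentioned in this abstract, the sketch below computes a standard InfoNCE-style loss between paired image and text latents. It illustrates the general technique only: the paper's exact Con, NegCon, and Spread losses are not specified in the abstract, the autoencoders and reconstruction terms are omitted, and the temperature and toy latents are assumptions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(image_latents, text_latents, temperature=0.1):
    """Image-to-text InfoNCE loss; row i of each list forms a matched pair.

    A standard formulation used for illustration, not the paper's loss.
    """
    total = 0.0
    for i, img in enumerate(image_latents):
        logits = [dot(img, txt) / temperature for txt in text_latents]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(image_latents)


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image latent is closest to its own text latent and large when the pairing is swapped. In a joint objective of the kind the abstract describes, a term like this would be weighed against each modality's reconstruction loss.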
cs.DB [Back]
[99] Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems
Zhaoyan Sun,Jiayi Wang,Xinyang Zhao,Jiachi Wang,Guoliang Li
Main category: cs.DB
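TL;DR: This paper proposes the "Data Agent", a comprehensive architecture that orchestrates Data+AI ecosystems and tackles data-related tasks by integrating knowledge comprehension, reasoning, and planning capabilities.
Details
Motivation: Traditional Data+AI systems rely on human experts to orchestrate system pipelines and cannot adapt dynamically to changes in data, queries, tasks, and environments. The success of large language models (LLMs) opens new possibilities for better semantic understanding, reasoning, and planning.
Contribution: Proposes the "Data Agent" architecture and discusses its design challenges, including understanding data, queries, environments, and tools; orchestrating pipelines; optimizing and executing them; and pipeline self-reflection. Also presents several example data agent systems.
Method: Integrates the capabilities of large language models (LLMs) into a comprehensive architecture that coordinates knowledge comprehension, reasoning, and planning in Data+AI applications.
Result: Presents examples of data agent systems, such as a data science agent, data analytics agents, and a database administrator (DBA) agent, and outlines open design challenges.
Insight: LLM capabilities can improve the semantic understanding and planning abilities of data systems, enabling more effective orchestration of Data+AI applications.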
Abstract: Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For instance, while there are numerous data science tools available, developing a pipeline planning system to coordinate these tools remains challenging. This difficulty arises because existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning. Fortunately, we have witnessed the success of large language models (LLMs) in enhancing semantic understanding, reasoning, and planning abilities. It is crucial to incorporate LLM techniques to revolutionize data systems for orchestrating Data+AI applications effectively. To achieve this, we propose the concept of a ‘Data Agent’ - a comprehensive architecture designed to orchestrate Data+AI ecosystems, which focuses on tackling data-related tasks by integrating knowledge comprehension, reasoning, and planning capabilities. We delve into the challenges involved in designing data agents, such as understanding data/queries/environments/tools, orchestrating pipelines/workflows, optimizing and executing pipelines, and fostering pipeline self-reflection. Furthermore, we present examples of data agent systems, including a data science agent, data analytics agents (such as unstructured data analytics agent, semantic structured data analytics agent, data lake analytics agent, and multi-modal data analytics agent), and a database administrator (DBA) agent. We also outline several open challenges associated with designing data agent systems.
cs.CR [Back]
[100] SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism
Beitao Chen,Xinyu Lyu,Lianli Gao,Jingkuan Song,Heng Tao Shen
Main category: cs.CR
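TL;DR: SafePTR is a training-free defense framework that improves the safety of MLLMs by selectively pruning harmful multimodal tokens while preserving model efficiency.
Details
Motivation: Existing multimodal defenses fail to uncover the root cause of how multimodal tokens trigger jailbreak vulnerabilities, which leads to weak defense or over-defense.
Contribution: 1) Analyzes the key role of harmful multimodal tokens in MLLMs; 2) proposes the SafePTR framework, which defends efficiently through a prune-then-restore mechanism.
Method: Analysis shows that harmful tokens in early-middle layers are critical; a selective prune-then-restore mechanism then defends against multimodal jailbreak attacks without any training.
Result: Across three MLLMs and five benchmarks, SafePTR shows strong defensive capability without compromising model utility.
Insight: A very small fraction of harmful tokens drives jailbreaks, so pruning them precisely is enough to improve safety efficiently.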
Abstract: By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs’ built-in safeguards. Yet they fall short in uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training overhead. To bridge this gap, we present a comprehensive analysis of where, how, and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that fewer than 1% of tokens in early-middle layers are responsible for inducing unsafe behaviors, which suggests that precisely removing a small subset of harmful tokens, without requiring safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers. Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR’s state-of-the-art performance in mitigating jailbreak risks without compromising utility.
eess.AS [Back]
[101] Scalable Offline ASR for Command-Style Dictation in Courtrooms
Kumarmanas Nethil,Vaibhav Mishra,Kriti Anandan,Kavya Manohar
Main category: eess.AS
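TL;DR: This paper presents an open-source framework for command-style dictation that combines voice activity detection (VAD) with parallel transcription (e.g., Whisper models), bridging the gap between resource-intensive online systems and high-latency batch processing; its efficiency is validated in real courtroom deployments.
Details
Motivation: For command-style dictation, existing ASR systems are either resource-intensive (online systems) or high-latency (batch processing). The authors aim for an efficient, scalable solution suited to real-world settings such as courtrooms.
Contribution: 1) An open-source framework compatible with multiple ASR architectures (e.g., CTC-based models); 2) an efficient multiplexing technique based on VAD and parallel transcription; 3) validation in real deployment (used in around 15% of India's courtrooms).
Method: 1) Segment audio with VAD; 2) transcribe the segments in parallel (Whisper models); 3) multiplex across audios to raise compute utilization.
Result: Evaluations on live data show consistent latency reduction relative to sequential batch processing as user concurrency increases.
Insight: Combining VAD with parallel processing can markedly improve the efficiency and scalability of ASR systems. The open-source design accommodates diverse needs and demonstrates practicality in resource-constrained environments.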
Abstract: We propose an open-source framework for Command-style dictation that addresses the gap between resource-intensive Online systems and high-latency Batch processing. Our approach uses Voice Activity Detection (VAD) to segment audio and transcribes these segments in parallel using Whisper models, enabling efficient multiplexing across audios. Unlike proprietary systems like SuperWhisper, this framework is also compatible with most ASR architectures, including widely used CTC-based models. Our multiplexing technique maximizes compute utilization in real-world settings, as demonstrated by its deployment in around 15% of India’s courtrooms. Evaluations on live data show consistent latency reduction as user concurrency increases, compared to sequential batch processing. The live demonstration will showcase our open-sourced implementation and allow attendees to interact with it in real-time.
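The segment-then-transcribe-in-parallel flow described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the energy-threshold VAD, the `fake_transcribe` stub (standing in for a Whisper or CTC model), and every function name and parameter here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def vad_segments(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample ranges whose mean frame energy exceeds threshold.

    Assumption: a simple energy threshold stands in for a real VAD model.
    """
    segments, start = [], None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold and start is None:
            start = i                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


def fake_transcribe(segment):
    """Hypothetical placeholder for a real ASR call."""
    return f"<{len(segment)} samples>"


def transcribe_many(audios, workers=4):
    """Pool segments from all audios into one job list so workers stay busy."""
    jobs = [(aid, s, e)
            for aid, audio in enumerate(audios)
            for s, e in vad_segments(audio)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves job order, so segments stay in sequence per audio.
        texts = list(pool.map(
            lambda job: fake_transcribe(audios[job[0]][job[1]:job[2]]), jobs))
    results = {aid: [] for aid in range(len(audios))}
    for (aid, _, _), text in zip(jobs, texts):
        results[aid].append(text)
    return results
```

Pooling segments from several users into one job list is the multiplexing idea in miniature: a worker that finishes one user's segment picks up another user's segment straight away.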
cs.RO [Back]
[102] VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process
Cristian Gariboldi,Hayato Tokida,Ken Kinjo,Yuki Asada,Alexander Carballo
Main category: cs.RO
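TL;DR: This paper proposes VLAD, which combines a fine-tuned vision-language model (VLM) with VAD, a state-of-the-art end-to-end autonomous driving system, to improve driving decisions and provide interpretable explanations. Experiments on the nuScenes dataset show reduced collision rates.
Details
Motivation: Existing end-to-end autonomous driving systems are mostly black-box models that lack transparency and interpretability, while the general knowledge of VLMs offers new opportunities to enhance perception and decision-making.
Contribution: 1. Proposes the VLAD framework, combining a VLM with the VAD system; 2. fine-tunes the VLM on custom question-answer datasets to improve spatial reasoning; 3. generates interpretable natural-language explanations of driving decisions.
Method: 1. Build custom question-answer datasets to fine-tune the VLM; 2. the VLM generates high-level navigational commands; 3. VAD processes these commands to guide vehicle control; 4. the system outputs natural-language explanations of driving decisions.
Result: On the nuScenes dataset, VLAD reduces average collision rates by 31.82% compared to baseline methods.
Insight: Combining the general knowledge of VLMs with end-to-end systems can both improve performance and increase the transparency and interpretability of autonomous driving systems.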
Abstract: Recent advancements in open-source Visual Language Models (VLMs) such as LLaVA, Qwen-VL, and Llama have catalyzed extensive research on their integration with diverse systems. The internet-scale general knowledge encapsulated within these models presents significant opportunities for enhancing autonomous driving perception, prediction, and planning capabilities. In this paper we propose VLAD, a vision-language autonomous driving model, which integrates a fine-tuned VLM with VAD, a state-of-the-art end-to-end system. We implement a specialized fine-tuning approach using custom question-answer datasets designed specifically to improve the spatial reasoning capabilities of the model. The enhanced VLM generates high-level navigational commands that VAD subsequently processes to guide vehicle operation. Additionally, our system produces interpretable natural language explanations of driving decisions, thereby increasing transparency and trustworthiness of the traditionally black-box end-to-end architecture. Comprehensive evaluation on the real-world nuScenes dataset demonstrates that our integrated system reduces average collision rates by 31.82% compared to baseline methodologies, establishing a new benchmark for VLM-augmented autonomous driving systems.
[103] LANet: A Lane Boundaries-Aware Approach For Robust Trajectory Prediction
Muhammad Atta ur Rahman,Dooseop Choi,KyoungWook Min
Main category: cs.RO
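TL;DR: LANet is an enhanced motion forecasting model informed by lane boundaries and road edges; it fuses multiple vector map elements for a richer representation of the driving environment and uses a pruning mechanism to improve computational efficiency. The method is shown to be competitive on the Argoverse 2 dataset.
Details
Motivation: Current motion prediction models rely mainly on lane centerline representations, which limits their ability to capture complex road environments and traffic rules, so a richer yet efficient representation is needed.
Contribution: 1. A motion forecasting model based on lane boundaries and road edges; 2. an effective feature fusion strategy and a pruning mechanism that balance computational efficiency against information completeness.
Method: 1. Represent the driving environment with multiple vector map elements (such as lane boundaries and road edges); 2. merge information from the different map components through a feature fusion strategy; 3. apply a pruning mechanism to filter out irrelevant map connections.
Result: Experiments on the Argoverse 2 dataset show the method is competitive while achieving improved performance.
Insight: Fine-grained information such as lane boundaries matters for motion forecasting, and pruning can reduce computational cost while preserving performance.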
Abstract: Accurate motion forecasting is critical for safe and efficient autonomous driving, enabling vehicles to predict future trajectories and make informed decisions in complex traffic scenarios. Most of the current designs of motion prediction models are based on the major representation of lane centerlines, which limits their capability to capture critical road environments and traffic rules and constraints. In this work, we propose an enhanced motion forecasting model informed by multiple vector map elements, including lane boundaries and road edges, that facilitates a richer and more complete representation of driving environments. An effective feature fusion strategy is developed to merge information in different vector map components, where the model learns holistic information on road structures and their interactions with agents. Since encoding more information about the road environment increases memory usage and is computationally expensive, we developed an effective pruning mechanism that filters the most relevant map connections to the target agent, ensuring computational efficiency while maintaining essential spatial and semantic relationships for accurate trajectory prediction. Overcoming the limitations of lane centerline-based models, our method provides a more informative and efficient representation of the driving environment and advances the state of the art for autonomous vehicle motion forecasting. We verify our approach with extensive experiments on the Argoverse 2 motion forecasting dataset, where our method maintains competitiveness on AV2 while achieving improved performance. Index Terms: Autonomous driving, trajectory prediction, vector map elements, road topology, connection pruning, Argoverse 2.
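The idea of keeping only the map elements most relevant to the target agent can be illustrated with a toy example. The abstract does not state the relevance criterion, so the sketch below assumes plain Euclidean distance to the nearest polyline point; the function name, the `points` key, and the parameter `k` are hypothetical and not taken from the paper.

```python
import math


def prune_map_elements(agent_xy, elements, k=2):
    """Keep the k map elements whose closest point lies nearest to the agent.

    Assumption: relevance is approximated by distance; the paper's actual
    pruning criterion may differ.
    """
    def min_dist(polyline):
        return min(math.dist(agent_xy, point) for point in polyline)

    ranked = sorted(elements, key=lambda element: min_dist(element["points"]))
    return ranked[:k]


# Usage: three vector map elements, of which the farthest one is dropped.
elements = [
    {"id": "lane_boundary_left", "points": [(0.0, 1.5), (10.0, 1.5)]},
    {"id": "road_edge_far", "points": [(0.0, 40.0), (10.0, 40.0)]},
    {"id": "lane_boundary_right", "points": [(0.0, -1.5), (10.0, -1.5)]},
]
kept = prune_map_elements((1.0, 0.0), elements, k=2)
```

Pruning before encoding shrinks the set of agent-to-map connections the model has to process, which is where the memory and compute savings mentioned in the abstract would come from.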
cs.MA [Back]
[104] Automated Vehicles Should be Connected with Natural Language
Xiangbo Gao,Keshu Wu,Hao Zhang,Kexin Tian,Yang Zhou,Zhengzhong Tu
Main category: cs.MA
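TL;DR: This paper argues that automated vehicles should communicate through natural language to overcome the limitations of existing perception-data exchange and to improve the safety and efficiency of collaborative driving.
Details
Motivation: Existing communication media for multi-agent collaborative driving (raw sensor data, neural network features, and so on) fall short in bandwidth efficiency, information completeness, and agent interoperability, and they neglect decision-level fusion.
Contribution: Proposes natural language as the communication medium, conveying intent and reasoning directly and thereby enabling proactive coordination in collaborative driving.
Method: Communicate intentions, decisions, and rationales directly in natural language instead of sharing perception data in the traditional way.
Result: Natural language communication balances semantic density and bandwidth, adapts to real-time conditions, and supports collaboration across heterogeneous agent platforms.
Insight: Natural language not only improves communication efficiency but also strengthens the transparency and decision-making capability of collaborative driving.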
Abstract: Multi-agent collaborative driving promises improvements in traffic safety and efficiency through collective perception and decision making. However, existing communication media – including raw sensor data, neural network features, and perception results – suffer limitations in bandwidth efficiency, information completeness, and agent interoperability. Moreover, traditional approaches have largely ignored decision-level fusion, neglecting critical dimensions of collaborative driving. In this paper we argue that addressing these challenges requires a transition from purely perception-oriented data exchanges to explicit intent and reasoning communication using natural language. Natural language balances semantic density and communication bandwidth, adapts flexibly to real-time conditions, and bridges heterogeneous agent platforms. By enabling the direct communication of intentions, rationales, and decisions, it transforms collaborative driving from reactive perception-data sharing into proactive coordination, advancing safety, efficiency, and transparency in intelligent transportation systems.
eess.IV [Back]
[105] Prompt Mechanisms in Medical Imaging: A Comprehensive Survey
Hao Yang,Xinlong Liang,Zhang Li,Yue Sun,Zheyu Hu,Xinghe Xie,Behdad Dashtbozorg,Jincheng Huang,Shiwei Zhu,Luyi Han,Jiong Zhang,Shanshan Wang,Ritse Mann,Qifeng Yu,Tao Tan
Main category: eess.IV
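TL;DR: This survey systematically examines prompt-based deep learning methods in medical imaging, analyzes the different types of prompt mechanisms and how they perform across tasks, and outlines future research directions.
Details
Motivation: Clinical adoption of deep learning in medical imaging is hampered by data scarcity, distribution shifts, and limited generalization, whereas prompt mechanisms adapt flexibly to different tasks and reduce reliance on large annotated datasets.
Contribution: Provides a comprehensive summary of research on prompt mechanisms in medical imaging, explaining how they improve accuracy, robustness, and data efficiency and foster model interpretability.
Method: Analyzes several prompt modalities, including textual instructions, visual prompts, and learnable embeddings, and examines how they are integrated into image generation, segmentation, and classification tasks.
Result: Prompt mechanisms improve performance on medical imaging tasks and reduce reliance on manual feature engineering, but prompt design optimization and data heterogeneity remain challenges.
Insight: Future directions include advanced multimodal prompting and further progress toward clinical deployment; prompt-driven AI has great potential in diagnostics and personalized treatment.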
Abstract: Deep learning offers transformative potential in medical imaging, yet its clinical adoption is frequently hampered by challenges such as data scarcity, distribution shifts, and the need for robust task generalization. Prompt-based methodologies have emerged as a pivotal strategy to guide deep learning models, providing flexible, domain-specific adaptations that significantly enhance model performance and adaptability without extensive retraining. This systematic review critically examines the burgeoning landscape of prompt engineering in medical imaging. We dissect diverse prompt modalities, including textual instructions, visual prompts, and learnable embeddings, and analyze their integration for core tasks such as image generation, segmentation, and classification. Our synthesis reveals how these mechanisms improve task-specific outcomes by enhancing accuracy, robustness, and data efficiency and reducing reliance on manual feature engineering while fostering greater model interpretability by making the model’s guidance explicit. Despite substantial advancements, we identify persistent challenges, particularly in prompt design optimization, data heterogeneity, and ensuring scalability for clinical deployment. Finally, this review outlines promising future trajectories, including advanced multimodal prompting and robust clinical integration, underscoring the critical role of prompt-driven AI in accelerating the revolution of diagnostics and personalized treatment planning in medicine.
[106] MID-INFRARED (MIR) OCT-based inspection in industry
N. P. García-de-la-Puente,Rocío del Amor,Fernando García-Torres,Niels Møller Israelsen,Coraline Lapre,Christian Rosenberg Petersen,Ole Bang,Dominik Brouczek,Martin Schwentenwein,Kevin Neumann,Niels Benson,Valery Naranjo
Main category: eess.IV
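TL;DR: This paper studies industrial inspection based on mid-infrared (MIR) optical coherence tomography (OCT), evaluating its ability to penetrate materials and detect sub-surface irregularities, and explores preprocessing and AI-enhanced vision algorithms for anomaly detection.
Details
Motivation: Industry needs non-destructive testing (NDT) techniques to monitor production processes and detect defects inside materials; MIR OCT is a candidate solution because of its penetration capability.
Contribution: 1. Evaluates the penetration and anomaly-detection capability of MIR OCT systems on composites and ceramics; 2. explores the effectiveness of preprocessing and AI algorithms for anomaly detection; 3. discusses criteria for selecting system parameters, along with limitations.
Method: 1. Make several acquisitions on composite and ceramic samples; 2. process the data with preprocessing and AI-enhanced vision algorithms to detect abnormal zones; 3. analyze how parameter selection affects the results.
Result: The experiments show the potential of MIR OCT for industrial inspection, though parameters and algorithms need optimization to improve accuracy.
Insight: MIR OCT is promising for industrial NDT, but further research is needed to address its limitations and optimize performance.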
Abstract: This paper evaluates mid-infrared (MIR) Optical Coherence Tomography (OCT) systems as a tool for penetrating different materials and detecting sub-surface irregularities. This capability is useful for monitoring production processes and enables non-destructive inspection techniques of great value to industry. In this exploratory study, several acquisitions are made on composite and ceramic samples to assess the capabilities of the system. In addition, we assess which preprocessing and AI-enhanced vision algorithms can serve as anomaly-detection methods for finding abnormal zones in the analyzed objects. Limitations and criteria for selecting optimal parameters are discussed, and strengths and weaknesses are highlighted.
[107] Structure and Smoothness Constrained Dual Networks for MR Bias Field Correction
Dong Liang,Xingyu Qiu,Yuzhen Li,Wei Wang,Kuanquan Wang,Suyu Dong,Gongning Luo
Main category: eess.IV
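TL;DR: This paper proposes S2DNets, structure and smoothness constrained dual networks for self-supervised MR bias field correction; by introducing piece-wise structural constraints and bias field smoothness, the model removes intensity inhomogeneity while retaining more structural detail.
Details
Motivation: Because of device limitations, MR images often show significant intensity inhomogeneity, which impedes qualitative and quantitative medical analysis. Existing deep learning models focus only on global appearance learning and neglect constraints from image structure and bias field smoothness, which distorts the corrected results.
Contribution: Proposes S2DNets, which combines piece-wise structural constraints on the image with a smoothness constraint on the bias field for more accurate MR bias field correction.
Method: Trains a dual-network architecture under piece-wise structural constraints and a bias field smoothness constraint to perform self-supervised bias field correction.
Result: Experiments on clinical and simulated MR datasets show that S2DNets outperforms conventional and deep learning-based methods on visual metrics and on downstream segmentation tasks.
Insight: Structural and smoothness constraints make it possible to remove intensity inhomogeneity while retaining more image detail, suggesting a new direction for MR image processing.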
Abstract: MR imaging techniques are of great benefit to disease diagnosis. However, due to the limitation of MR devices, significant intensity inhomogeneity often exists in imaging results, which impedes both qualitative and quantitative medical analysis. Recently, several unsupervised deep learning-based models have been proposed for MR image improvement. However, these models merely concentrate on global appearance learning, and neglect constraints from image structures and smoothness of bias field, leading to distorted corrected results. In this paper, novel structure and smoothness constrained dual networks, named S2DNets, are proposed for self-supervised bias field correction. S2DNets introduce piece-wise structural constraints and smoothness of bias field for network training to effectively remove non-uniform intensity and retain much more structural details. Extensive experiments executed on both clinical and simulated MR datasets show that the proposed model outperforms other conventional and deep learning-based models. In addition to comparison on visual metrics, downstream MR image segmentation tasks are also used to evaluate the impact of the proposed model. The source code is available at: https://github.com/LeongDong/S2DNets.
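To make the notion of a smooth bias field concrete, the sketch below uses the multiplicative corruption model commonly assumed in bias field correction (observed = true image × bias field) together with a finite-difference smoothness penalty. Both are assumptions chosen for illustration; the abstract does not give the S2DNets loss terms, and all function names here are hypothetical.

```python
def apply_bias(image, bias):
    """Multiplicative corruption model (assumed): observed = true * bias."""
    return [[v * b for v, b in zip(irow, brow)]
            for irow, brow in zip(image, bias)]


def correct(observed, bias_estimate):
    """Undo the corruption by dividing out the estimated bias field."""
    return [[o / b for o, b in zip(orow, brow)]
            for orow, brow in zip(observed, bias_estimate)]


def smoothness_penalty(field):
    """Mean squared difference between vertically and horizontally adjacent pixels."""
    rows, cols = len(field), len(field[0])
    total, count = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                total += (field[r + 1][c] - field[r][c]) ** 2
                count += 1
            if c + 1 < cols:
                total += (field[r][c + 1] - field[r][c]) ** 2
                count += 1
    return total / count if count else 0.0
```

A slowly varying field yields a small penalty and an abruptly varying one a large penalty, so adding such a term to a training objective pushes the estimated bias field toward smooth solutions.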
[108] BronchoGAN: Anatomically consistent and domain-agnostic image-to-image translation for video bronchoscopy
Ahmad Soliman,Ron Keuth,Marian Himstedt
Main category: eess.IV
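TL;DR: BronchoGAN is a conditional-GAN-based image-to-image translation method that introduces anatomical constraints and an intermediate depth image representation to translate robustly across domains (virtual bronchoscopy, phantoms, and so on), improving the anatomical consistency of synthesized images.
Details
Motivation: The scarcity of bronchoscopy images limits the training of deep learning models. Image translation across domains (virtual bronchoscopy, phantom, in-vivo and ex-vivo data) is pivotal for clinical applications, but existing methods fall short on anatomical consistency.
Contribution: 1. Proposes BronchoGAN, which integrates anatomical constraints (matching bronchial orifices) into a conditional GAN; 2. uses foundation model-generated depth images as an intermediate representation to improve robustness across domains; 3. offers a simple way to construct paired training data.
Method: 1. Embed anatomical constraints (bronchial orifice matching) into a conditional GAN; 2. use foundation model-generated depth images as the intermediate representation; 3. generate realistic bronchoscopy images from cross-domain inputs (e.g., virtual bronchoscopy).
Result: Experiments show that BronchoGAN translates input images from different domains successfully and robustly preserves anatomical structures (e.g., bronchial orifices), with improved FID, SSIM, and Dice scores (Dice improves by up to 0.43).
Insight: 1. Combining anatomical constraints with an intermediate depth representation improves the robustness of cross-domain translation; 2. public CT data can be used to generate large-scale bronchoscopy image datasets, easing data scarcity.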
Abstract: The limited availability of bronchoscopy images makes image synthesis particularly interesting for training deep learning models. Robust image translation across different domains – virtual bronchoscopy, phantom as well as in-vivo and ex-vivo image data – is pivotal for clinical applications. This paper proposes BronchoGAN, which integrates anatomical constraints for image-to-image translation into a conditional GAN. In particular, we force bronchial orifices to match across input and output images. We further propose using foundation model-generated depth images as an intermediate representation, which ensures robustness across a variety of input domains and yields models with substantially less reliance on individual training datasets. Moreover, our intermediate depth image representation makes it easy to construct paired image data for training. Our experiments show that input images from different domains (e.g. virtual bronchoscopy, phantoms) can be successfully translated to images mimicking realistic human airway appearance. We demonstrate that anatomical settings (i.e. bronchial orifices) can be robustly preserved with our approach, which is shown qualitatively and quantitatively by improved FID, SSIM, and Dice coefficient scores. Our anatomical constraints enabled an improvement in the Dice coefficient of up to 0.43 for synthetic images. By combining foundation models for intermediate depth representations with bronchial orifice segmentation integrated as anatomical constraints into conditional GANs, we are able to robustly translate images from different bronchoscopy input domains. BronchoGAN makes it possible to incorporate public CT scan data (virtual bronchoscopy) to generate large-scale bronchoscopy image datasets with realistic appearance, helping to bridge the gap of missing public bronchoscopy images.
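For readers unfamiliar with the Dice coefficient quoted above, it measures the overlap between two binary masks as 2|A ∩ B| / (|A| + |B|). The helper below is a generic sketch of that standard formula on flattened masks; it is not taken from the BronchoGAN code, and treating two empty masks as a perfect match is a convention chosen for this example.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two equally sized binary masks (flat 0/1 sequences)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_sum = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if size_sum == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


# Usage: orifice mask from the input image versus the translated image.
input_orifices = [1, 1, 0, 0]
output_orifices = [1, 0, 0, 0]
score = dice_coefficient(input_orifices, output_orifices)  # 2*1 / (2+1)
```

Because the score lies between 0 and 1, the reported gain of up to 0.43 is an absolute difference in Dice, not a percentage.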
[109] Multi Source COVID-19 Detection via Kernel-Density-based Slice Sampling
Chia-Ming Lee,Bo-Cheng Qiu,Ting-Yao Chen,Ming-Han Sun,Fang-Ying Lin,Jung-Tse Tsai,I-An Tsai,Yu-Fan Lin,Chih-Chung Hsu
Main category: eess.IV
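TL;DR: This paper presents a multi-source COVID-19 detection method based on Kernel-Density-based Slice Sampling (KDS); with a refined preprocessing pipeline and model selection, it achieves strong results on CT scan classification.
Details
Motivation: Variability in multi-source data (CT scans from different medical centers) makes COVID-19 detection challenging. The paper aims to improve classification through better slice sampling and model selection.
Contribution: (1) Kernel-Density-based Slice Sampling (KDS); (2) a preprocessing pipeline combining lung region extraction and quality control; (3) validation on data from four centers.
Method: Uses the Spatial-Slice Feature Learning (SSFL) framework with KDS to sample slices from CT scans, and compares EfficientNet and Swin Transformer models.
Result: EfficientNet achieves an F1-score of 94.68% on the validation set, compared with 93.34% for the Swin Transformer, demonstrating the effectiveness of the KDS pipeline on multi-source data.
Insight: The paper highlights the importance of dataset balance in multi-institutional medical imaging evaluation and offers an efficient preprocessing approach for multi-source medical image analysis.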
Abstract: We present our solution for the Multi-Source COVID-19 Detection Challenge, which classifies chest CT scans from four distinct medical centers. To address multi-source variability, we employ the Spatial-Slice Feature Learning (SSFL) framework with Kernel-Density-based Slice Sampling (KDS). Our preprocessing pipeline combines lung region extraction, quality control, and adaptive slice sampling to select eight representative slices per scan. We compare EfficientNet and Swin Transformer architectures on the validation set. The EfficientNet model achieves an F1-score of 94.68%, compared to the Swin Transformer’s 93.34%. The results demonstrate the effectiveness of our KDS-based pipeline on multi-source data and highlight the importance of dataset balance in multi-institutional medical imaging evaluation.
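One plausible reading of density-based slice sampling is sketched below: estimate a smooth density over slice positions weighted by lung area, then pick slices at evenly spaced quantiles of that density. The abstract only states that eight representative slices are selected per scan, so the area weighting, Gaussian kernel, bandwidth, and quantile rule are all assumptions, and the function name is hypothetical.

```python
import math


def kde_slice_sampling(lung_areas, n_samples=8, bandwidth=2.0):
    """Pick slice indices at evenly spaced quantiles of an area-weighted KDE.

    Assumptions: lung_areas[i] is the segmented lung area of slice i, and a
    Gaussian kernel over the slice index smooths that profile.
    """
    density = [
        sum(area * math.exp(-0.5 * ((i - j) / bandwidth) ** 2)
            for j, area in enumerate(lung_areas))
        for i in range(len(lung_areas))
    ]
    total = sum(density)
    if total <= 0:
        raise ValueError("no lung area found in any slice")

    # Cumulative distribution over slice indices.
    cdf, running = [], 0.0
    for value in density:
        running += value / total
        cdf.append(running)

    picks = []
    for k in range(n_samples):
        quantile = (k + 0.5) / n_samples
        # First slice whose cumulative mass reaches the quantile.
        index = next((i for i, c in enumerate(cdf) if c >= quantile),
                     len(cdf) - 1)
        picks.append(index)
    return picks
```

Sampling by quantile concentrates the chosen slices where lung tissue is present and skips slices near the top and bottom of the scan that contain little or none.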
[110] Robust brain age estimation from structural MRI with contrastive learning
Carlo Alberto Barbano,Benoit Dufumier,Edouard Duchesnay,Marco Grangetto,Pietro Gori
Main category: eess.IV
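TL;DR: This paper proposes a robust contrastive learning approach to brain age estimation built on a novel contrastive loss, $\mathcal{L}^{exp}$, and validates its performance and generalization on large, multi-site MRI datasets.
Details
Motivation: Traditional supervised approaches to brain age estimation suffer from limited generalization and robustness, especially with multi-site data affected by confounds such as scanner differences. Contrastive learning is a candidate solution because it adapts well to diverse data.
Contribution: 1. A novel contrastive loss function, $\mathcal{L}^{exp}$; 2. validation on large multi-site datasets (over 20,000 scans); 3. evidence of the advantages of contrastive learning in generalization, robustness, and clinical relevance.
Method: Uses contrastive learning with the $\mathcal{L}^{exp}$ loss, pre-trains and tests on multi-site MRI data, and analyzes generalization, scanner invariance, and clinical relevance.
Result: 1. Scaling pre-training cuts external MAE nearly in half; 2. $\mathcal{L}^{exp}$ is robust to scanner differences; 3. the models capture accelerated aging in patients with cognitive impairment and Alzheimer's disease; 4. with $\mathcal{L}^{exp}$, brain age accuracy correlates strongly with downstream diagnostic performance.
Insight: Contrastive learning is a promising way to build generalizable and clinically meaningful brain imaging representations, particularly for multi-site data.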
Abstract: Estimating brain age from structural MRI has emerged as a powerful tool for characterizing normative and pathological aging. In this work, we explore contrastive learning as a scalable and robust alternative to supervised approaches for brain age estimation. We introduce a novel contrastive loss function, $\mathcal{L}^{exp}$, and evaluate it across multiple public neuroimaging datasets comprising over 20,000 scans. Our experiments reveal four key findings. First, scaling pre-training on diverse, multi-site data consistently improves generalization performance, cutting external mean absolute error (MAE) nearly in half. Second, $\mathcal{L}^{exp}$ is robust to site-related confounds, maintaining low scanner-predictability as training size increases. Third, contrastive models reliably capture accelerated aging in patients with cognitive impairment and Alzheimer’s disease, as shown through brain age gap analysis, ROC curves, and longitudinal trends. Lastly, unlike supervised baselines, $\mathcal{L}^{exp}$ maintains a strong correlation between brain age accuracy and downstream diagnostic performance, supporting its potential as a foundation model for neuroimaging. These results position contrastive learning as a promising direction for building generalizable and clinically meaningful brain representations.
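Two quantities in this abstract have simple standard definitions: the mean absolute error (MAE) between predicted and chronological age, and the brain age gap (predicted minus chronological age), where a positive gap is read as accelerated aging. The sketch below illustrates only these evaluation quantities with made-up numbers; it does not implement the $\mathcal{L}^{exp}$ loss, whose form is not given here.

```python
def mean_absolute_error(predicted, chronological):
    """Average absolute difference between predicted and true ages (years)."""
    return sum(abs(p - t) for p, t in zip(predicted, chronological)) / len(predicted)


def brain_age_gaps(predicted, chronological):
    """Brain age gap per subject: predicted age minus chronological age."""
    return [p - t for p, t in zip(predicted, chronological)]


# Illustrative (made-up) values, not data from the paper.
predicted_ages = [72.0, 65.0, 80.0]
chronological_ages = [70.0, 66.0, 75.0]
mae = mean_absolute_error(predicted_ages, chronological_ages)
gaps = brain_age_gaps(predicted_ages, chronological_ages)
```

Comparing the distribution of gaps between a patient group and healthy controls is the usual way a brain age gap analysis, as mentioned in the abstract, is carried out.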
[111] Autoadaptive Medical Segment Anything Model
Tyler Ward,Meredith K. Owen,O’Kira Coleman,Brian Noehren,Abdullah-Al-Zubaer Imran
Main category: eess.IV
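TL;DR: ADA-SAM is a multitask learning framework for medical image segmentation that combines classification and segmentation and uses a gradient feedback mechanism to improve performance, clearly outperforming baselines when labeled data are limited.
Details
Motivation: Medical image segmentation usually depends on large amounts of labeled data, but manual annotation is expensive, time-consuming, and error-prone, so accurate, automatic, and annotation-efficient methods are needed.
Contribution: Proposes the ADA-SAM framework, which uses class activation maps to assist the segmentation task and introduces a gradient feedback mechanism that strengthens joint learning between classification and segmentation.
Method: Built on the Segment Anything (SAM) framework; a multitask setup uses class activation maps from an auxiliary classifier to guide segmentation, and a gradient feedback mechanism uses segmentation gradients to improve the classification predictions.
Result: On real-world clinical data, ADA-SAM outperforms both fully-supervised and semi-supervised baselines by double digits in limited-label settings.
Insight: Combining gradient feedback with multitask learning can improve medical image segmentation, especially when labeled data are limited.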
Abstract: Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose ADA-SAM (automated, domain-specific, and adaptive segment anything model), a novel multitask learning framework for medical image segmentation that leverages class activation maps from an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the Segment Anything (SAM) framework. Additionally, our ADA-SAM model employs a novel gradient feedback mechanism to create a learnable connection between the segmentation and classification branches by using the segmentation gradients to guide and improve the classification predictions. We validate ADA-SAM on real-world clinical data collected during rehabilitation trials, and demonstrate that our proposed method outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings. Our code is available at: https://github.com/tbwa233/ADA-SAM.
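A class activation map (CAM) in its classic form is a weighted sum of a classifier's final feature maps, using the weights of the target class. The sketch below computes such a map and turns it into a bounding box that could serve as a location hint for a segmentation model. How ADA-SAM actually feeds CAMs into its segmentation branch is not specified in the abstract, so the box-prompt step, the threshold, and all names are assumptions.

```python
def class_activation_map(feature_maps, class_weights):
    """Classic CAM: sum over channels of weight_k * feature_map_k."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * cols for _ in range(rows)]
    for fmap, weight in zip(feature_maps, class_weights):
        for r in range(rows):
            for c in range(cols):
                cam[r][c] += weight * fmap[r][c]
    return cam


def cam_to_box(cam, fraction=0.5):
    """Bounding box (row_min, col_min, row_max, col_max) of strong activations.

    Assumes the CAM has a positive peak; pixels at or above fraction * peak count.
    """
    peak = max(max(row) for row in cam)
    hits = [(r, c) for r, row in enumerate(cam)
            for c, value in enumerate(row) if value >= fraction * peak]
    row_ids = [r for r, _ in hits]
    col_ids = [c for _, c in hits]
    return (min(row_ids), min(col_ids), max(row_ids), max(col_ids))


# Usage with two toy 3x3 feature maps.
features = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
cam = class_activation_map(features, class_weights=[1.0, 0.2])
box = cam_to_box(cam)
```

In this toy case the weak activation in the top-left corner falls below the threshold, so the box covers only the strongly activated lower-right region.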
[112] A computationally frugal open-source foundation model for thoracic disease detection in lung cancer screening programs
Niccolò McConnell,Pardeep Vasudev,Daisuke Yamada,Daryl Cheng,Mehran Azimbagirad,John McCabe,Shahab Aslani,Ahmed H. Shahin,Yukun Zhou,The SUMMIT Consortium,Andre Altmann,Yipeng Hu,Paul Taylor,Sam M. Janes,Daniel C. Alexander,Joseph Jacob
Main category: eess.IV
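TL;DR: The paper presents TANGERINE, a computationally frugal, open-source foundation model for detecting multiple thoracic diseases from low-dose CT (LDCT) scans in lung cancer screening. Pretrained with self-supervised learning, it adapts quickly to different disease detection tasks with modest compute and data requirements.
Details
Motivation: Lung cancer screening (LCS) programs are increasing in uptake worldwide, but a shortage of radiologists makes interpreting scans at scale a bottleneck. An automated solution that is computationally light and easy to adapt to many tasks is needed.
Contribution: TANGERINE is an open-source, computationally frugal foundation model for 3D medical imaging that supports multiple thoracic disease detection tasks and substantially reduces the need for GPU resources and labeled data.
Method: The model extends a masked autoencoder framework to 3D, is pretrained with self-supervised learning on over 98,000 LDCTs, and supports fast fine-tuning with strong label efficiency.
Result: Achieves state-of-the-art performance across 14 disease classification tasks, including lung cancer and multiple respiratory diseases, and generalizes robustly across diverse clinical centres.
Insight: The open-source, lightweight design lays a foundation for next-generation medical imaging tools and could shift lung cancer screening from a singular focus on cancer detection to comprehensive respiratory disease management.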
Abstract: Low-dose computed tomography (LDCT) imaging employed in lung cancer screening (LCS) programs is increasing in uptake worldwide. LCS programs herald a generational opportunity to simultaneously detect cancer and non-cancer-related early-stage lung disease. Yet these efforts are hampered by a shortage of radiologists to interpret scans at scale. Here, we present TANGERINE, a computationally frugal, open-source vision foundation model for volumetric LDCT analysis. Designed for broad accessibility and rapid adaptation, TANGERINE can be fine-tuned off the shelf for a wide range of disease-specific tasks with limited computational resources and training data. Relative to models trained from scratch, TANGERINE demonstrates fast convergence during fine-tuning, thereby requiring significantly fewer GPU hours, and displays strong label efficiency, achieving comparable or superior performance with a fraction of fine-tuning data. Pretrained using self-supervised learning on over 98,000 thoracic LDCTs, including the UK’s largest LCS initiative to date and 27 public datasets, TANGERINE achieves state-of-the-art performance across 14 disease classification tasks, including lung cancer and multiple respiratory diseases, while generalising robustly across diverse clinical centres. By extending a masked autoencoder framework to 3D imaging, TANGERINE offers a scalable solution for LDCT analysis, departing from recent closed, resource-intensive models by combining architectural simplicity, public availability, and modest computational requirements. Its accessible, open-source lightweight design lays the foundation for rapid integration into next-generation medical imaging tools that could transform LCS initiatives, allowing them to pivot from a singular focus on lung cancer detection to comprehensive respiratory disease management in high-risk populations.
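The core step of masked autoencoder pretraining, extended to 3D, is to cut the volume into patches, hide a random subset, and train the model to reconstruct the hidden patches from the visible ones. The sketch below shows only the patch-masking step. The patch size, the 75% mask ratio (a common default in the 2D masked autoencoder literature), and the function name are assumptions, not TANGERINE's published settings.

```python
import random


def random_patch_mask(volume_shape, patch=16, mask_ratio=0.75, seed=0):
    """Split a (depth, height, width) volume into cubic patches and mask some.

    Returns (visible, masked): two lists of patch grid coordinates (z, y, x).
    Assumes each dimension is at least one patch long; remainders are ignored.
    """
    depth, height, width = (size // patch for size in volume_shape)
    patch_ids = [(z, y, x)
                 for z in range(depth)
                 for y in range(height)
                 for x in range(width)]
    rng = random.Random(seed)        # seeded for reproducibility
    rng.shuffle(patch_ids)
    n_masked = int(len(patch_ids) * mask_ratio)
    return patch_ids[n_masked:], patch_ids[:n_masked]


# Usage: a 64^3 volume yields 4*4*4 = 64 patches, 48 of them masked.
visible, masked = random_patch_mask((64, 64, 64))
```

Only the visible patches would be passed to the encoder, which is one reason this style of pretraining can be comparatively cheap on large volumes.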