cs.CL

[1] ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning

Shu Zhao,Tan Yu,Anbang Xu,Japinder Singh,Aaditya Shukla,Rama Akkiraju

Main category: cs.CL

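TL;DR: ParallelSearch is a framework that uses reinforcement learning to train large language models (LLMs) to decompose queries and run the resulting searches in parallel, improving both computational efficiency and performance on parallelizable questions.

Motivation: Existing RL-based retrieval agents such as Search-R1 issue queries sequentially during multi-step information retrieval, even when some of the queries are mutually independent and could be parallelized. This sequential bottleneck limits computational efficiency, especially for queries that compare multiple entities.

Contribution: Proposes the ParallelSearch framework, which trains LLMs with reinforcement learning to recognize parallelizable query structures and execute multiple search operations concurrently. Dedicated reward functions encourage the decomposition of independent query components while balancing accuracy, decomposition quality, and the benefits of parallel execution.

Method: Reinforcement learning with dedicated reward functions optimizes the LLM's ability to decompose queries and search in parallel. The rewards jointly consider answer correctness, query decomposition quality, and parallel execution benefits.

Result: Across seven question-answering benchmarks, ParallelSearch improves average performance by 2.9%. On parallelizable questions it improves performance by 12.7% while requiring only 69.6% of the LLM calls used by sequential approaches.

Insight: Executing independent query components in parallel can substantially improve computational efficiency while preserving answer accuracy, which suggests a new direction for applying LLMs to complex reasoning tasks.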

Abstract: Reasoning-augmented search agents such as Search-R1, trained via reinforcement learning with verifiable rewards (RLVR), demonstrate remarkable capabilities in multi-step information retrieval from external knowledge sources. These agents address the limitations of their parametric memory by dynamically gathering relevant facts to address complex reasoning tasks. However, existing approaches suffer from a fundamental architectural limitation: they process search queries strictly sequentially, even when handling inherently parallelizable and logically independent comparisons. This sequential bottleneck significantly constrains computational efficiency, particularly for queries that require multiple entity comparisons. To address this critical limitation, we propose ParallelSearch, a novel reinforcement learning framework that empowers large language models (LLMs) to recognize parallelizable query structures and execute multiple search operations concurrently. Our approach introduces dedicated reward functions that incentivize the identification of independent query components while preserving answer accuracy through jointly considering correctness, query decomposition quality, and parallel execution benefits. Comprehensive experiments demonstrate that ParallelSearch outperforms state-of-the-art baselines by an average performance gain of 2.9% across seven question-answering benchmarks. Notably, on parallelizable questions, our method achieves a 12.7% performance improvement while requiring only 69.6% of the LLM calls compared to sequential approaches.
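
The efficiency argument above can be illustrated with a minimal sketch, assuming two logically independent sub-queries and a stand-in retrieval call. The `search` coroutine, its simulated latency, and the example sub-queries are hypothetical and are not part of ParallelSearch; the sketch only shows why issuing independent searches concurrently takes fewer sequential rounds.

```python
import asyncio
import time

async def search(query: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a retrieval call; sleeps to mimic latency."""
    await asyncio.sleep(delay)
    return f"results for {query!r}"

async def run_sequential(queries):
    # One search at a time: total latency grows with the number of queries.
    return [await search(q) for q in queries]

async def run_parallel(queries):
    # Independent searches are issued together and awaited as a group.
    return list(await asyncio.gather(*(search(q) for q in queries)))

def timed(coro_fn, queries):
    start = time.perf_counter()
    results = asyncio.run(coro_fn(queries))
    return results, time.perf_counter() - start

# Assumed decomposition of a comparison question into independent sub-queries.
sub_queries = ["height of Mount Everest", "height of K2"]
seq_results, seq_time = timed(run_sequential, sub_queries)
par_results, par_time = timed(run_parallel, sub_queries)
print(f"sequential: {seq_time:.3f}s, parallel: {par_time:.3f}s")
```

Both paths return the same results in the same order; only the waiting time differs, which is the benefit the reward design is meant to encourage.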

[2] TEN: Table Explicitization, Neurosymbolically

Nikita Mehrotra,Aayush Kumar,Sumit Gulwani,Arjun Radhakrishna,Ashish Tiwari

Main category: cs.CL

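TL;DR: TEN is a neurosymbolic approach for extracting tabular data from semistructured text. It combines a large language model with a symbolic checker and significantly outperforms purely neural baselines.

Motivation: Extracting tables from semistructured text is hard, especially when the text does not use delimiters consistently. Purely neural approaches are prone to hallucination and cannot enforce hard constraints.

Contribution: Proposes TEN, a neurosymbolic approach that combines Structural Decomposition prompting with a self-debug loop to improve the accuracy and verifiability of table extraction.

Method: An LLM generates an initial table, a symbolic checker evaluates its well-formedness and flags hallucinated or forgotten content, and a critique-LLM turns the checker's output into guidance for fixing the table. The guidance is fed back to the original LLM in a self-debug loop.

Result: TEN significantly outperforms purely neural baselines across multiple datasets and metrics. In a user study, its tables were rated more accurate and easier to verify, and participants preferred it in over 60% of cases.

Insight: Combining the strengths of neural and symbolic methods addresses hallucination and constraint enforcement in structured data extraction, improving practicality and user experience.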

Abstract: We present a neurosymbolic approach, TEN, for extracting tabular data from semistructured input text. This task is particularly challenging for text input that does not use special delimiters consistently to separate columns and rows. Purely neural approaches perform poorly due to hallucinations and their inability to enforce hard constraints. TEN uses Structural Decomposition prompting - a specialized chain-of-thought prompting approach - on a large language model (LLM) to generate an initial table, and thereafter uses a symbolic checker to evaluate not only the well-formedness of that table, but also detect cases of hallucinations or forgetting. The output of the symbolic checker is processed by a critique-LLM to generate guidance for fixing the table, which is presented to the original LLM in a self-debug loop. Our extensive experiments demonstrate that TEN significantly outperforms purely neural baselines across multiple datasets and metrics, achieving significantly higher exact match accuracy and substantially reduced hallucination rates. A 21-participant user study further confirms that TEN’s tables are rated significantly more accurate (mean score: 5.0 vs 4.3; p = 0.021), and are consistently preferred for ease of verification and correction, with participants favoring our method in over 60% of the cases.
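
To make the role of the symbolic checker more concrete, here is a minimal sketch of the kind of checks the abstract describes: well-formedness, possible hallucination, and possible forgetting. The function `check_table`, the token-overlap heuristics, and the example data are assumptions made for illustration; the abstract does not specify TEN's actual checker.

```python
import re

def tokens(text):
    """Lower-cased alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9.]+", text.lower())

def check_table(source_text, header, rows):
    """Return a list of human-readable issues found in a candidate table."""
    issues = []
    # Well-formedness: every row must have one cell per header column.
    for i, row in enumerate(rows):
        if len(row) != len(header):
            issues.append(f"row {i}: expected {len(header)} cells, got {len(row)}")
    source_tokens = set(tokens(source_text))
    cell_tokens = {tok for row in rows for cell in row for tok in tokens(cell)}
    header_tokens = {tok for h in header for tok in tokens(h)}
    # Possible hallucination: cell content that never appears in the source.
    for tok in sorted(cell_tokens - source_tokens):
        issues.append(f"possible hallucination: {tok!r} not in source")
    # Possible forgetting: source content missing from the table.
    for tok in sorted(source_tokens - cell_tokens - header_tokens):
        issues.append(f"possible forgetting: {tok!r} missing from table")
    return issues

source = "Alice 30 Paris\nBob 25 Rome"
header = ["name", "age", "city"]
good = [["Alice", "30", "Paris"], ["Bob", "25", "Rome"]]
bad = [["Alice", "30", "Paris"], ["Bob", "26"]]
good_issues = check_table(source, header, good)
bad_issues = check_table(source, header, bad)
```

In a self-debug loop of the kind described, a non-empty issue list would be passed to a critique step and then back to the generator; that loop is omitted here.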

[3] Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

Ju-Chieh Chou,Jiawei Zhou,Karen Livescu

Main category: cs.CL

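TL;DR: The paper proposes Flow-SLM, which jointly models linguistic and acoustic information by generating semantic tokens together with a continuous representation of the acoustic frame. It matches existing models on linguistic likelihood benchmarks while providing better acoustic detail.

Motivation: Current textless spoken language models learn only semantic tokens and rely on an external vocoder, so they have no control over acoustic details. This work aims to model linguistic and acoustic information jointly.

Contribution: Jointly learns linguistic and acoustic information, uses a flow-matching objective to predict continuous acoustic frames, and shows that predicting multiple future semantic tokens helps preserve linguistic information.

Method: The model generates semantic tokens and a continuous acoustic vector, using a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. The paper explores the design space and validates the effectiveness of multi-token prediction.

Result: On par with existing models on linguistic likelihood benchmarks, with better acoustic detail in prompted generation.

Insight: Jointly modeling linguistic and acoustic information can improve the acoustic quality of generated speech, and multi-token prediction helps preserve linguistic information.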

Abstract: Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
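
As a rough illustration of a flow-matching objective, the sketch below builds one training pair under a common formulation: a straight-line path from a noise sample to the data vector, with the constant velocity along that path as the regression target. The abstract does not say which flow-matching variant Flow-SLM uses, so the linear path, the toy three-dimensional "acoustic frame", and `toy_velocity_model` are all assumptions.

```python
import random

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus its velocity."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # d x_t / d t on a linear path
    return x_t, velocity

def squared_error(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def toy_velocity_model(x_t, t, semantic_tokens):
    """Hypothetical untrained predictor; a real model would be a neural network."""
    return [0.0 for _ in x_t]

random.seed(0)
acoustic_frame = [0.5, -1.0, 2.0]                      # toy data vector x1
noise = [random.gauss(0, 1) for _ in acoustic_frame]   # noise sample x0
t = 0.3
x_t, target_velocity = flow_matching_pair(noise, acoustic_frame, t)

# The loss regresses the predicted velocity onto the target, with the
# semantic tokens (placeholder ids here) supplied as conditioning.
loss = squared_error(toy_velocity_model(x_t, t, [12, 7]), target_velocity)
```

Training would average this loss over sampled frames, noise draws, and time steps; generation would integrate the learned velocity field from noise to a frame.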

[4] Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models

Ting Cai,Stephen Sheen,AnHai Doan

Main category: cs.CL

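TL;DR: Columbo is an LLM-based method for expanding abbreviated column names in tabular data (e.g., "esal" to "employee salary"). The paper also introduces new datasets and evaluation measures to address the shortcomings of prior work.

Motivation: In many settings (enterprises, domain sciences, government agencies), abbreviated column names must be expanded before downstream tasks can use them. Prior work relies on synthetic data and on accuracy measures that undercount correct expansions.

Contribution: 1. Introduces four new datasets with real-world abbreviations; 2. proposes synonym-aware accuracy measures; 3. develops Columbo, an LLM-based solution that exploits context, rules, and chain-of-thought reasoning.

Method: Columbo uses an LLM and expands abbreviated column names through context analysis, rules, chain-of-thought reasoning, and token-level analysis.

Result: Across 5 datasets, Columbo outperforms NameGuess, the current state of the art, by 4-29%, and it has been used in production on EDI, a data portal for environmental sciences.

Insight: Real-world datasets and synonym-aware measures are essential for this task, and an LLM method that combines several reasoning signals (context, rules, chain-of-thought) substantially improves performance.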

Abstract: Expanding the abbreviated column names of tables, such as "esal" to "employee salary", is critical for numerous downstream data tasks. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper we make three contributions that significantly advance the state of the art. First, we show that synthetic public data used by prior work has major limitations, and we introduce 4 new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more accurately. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29%, over 5 datasets. Columbo has been used in production on EDI, a major data portal for environmental sciences.
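
The point about undercounting can be seen in a small sketch that scores the same predictions with exact match and with a synonym-aware match. The `SYNONYMS` table, the word-level canonicalization, and the example pairs are hypothetical; the paper's actual measures are not described in the abstract.

```python
# Hypothetical synonym groups, keyed by a canonical form.
SYNONYMS = {
    "salary": {"salary", "pay", "wage"},
    "identifier": {"identifier", "id"},
}

def canonical(word):
    """Map a word to its canonical form if it belongs to a synonym group."""
    for canon, variants in SYNONYMS.items():
        if word in variants:
            return canon
    return word

def exact_match(pred, gold):
    return pred.lower().split() == gold.lower().split()

def synonym_aware_match(pred, gold):
    p = [canonical(w) for w in pred.lower().split()]
    g = [canonical(w) for w in gold.lower().split()]
    return p == g

# (predicted expansion, gold expansion)
pairs = [
    ("employee pay", "employee salary"),         # correct up to a synonym
    ("customer id", "customer identifier"),      # correct up to a synonym
    ("employee name", "employee salary"),        # genuinely wrong
]
exact_acc = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
synonym_acc = sum(synonym_aware_match(p, g) for p, g in pairs) / len(pairs)
```

Here exact match credits none of the three predictions, while the synonym-aware score credits the two that differ from the gold label only by a synonym.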

[5] Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech

Lavanya Shankar,Leibny Paola Garcia Perera

Main category: cs.CL

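TL;DR: The paper uses Zipformer for language identification in child-directed speech that mixes two imbalanced languages (Mandarin and English), and reports a significant performance gain.

Motivation: Language identification and code-switching in child-directed speech are challenging in bilingual environments, particularly when the two languages are imbalanced.

Contribution: Applies Zipformer to imbalanced bilingual speech, shows that its internal layers effectively encode language characteristics, and achieves a significant improvement in language identification.

Method: Presents a methodology for selecting the inner Zipformer layers from which embeddings are extracted, and compares different back-ends to show that Zipformer is robust across them.

Result: The approach reaches a Balanced Accuracy (BAC) of 81.89% on imbalanced data, a 15.47% improvement over the language identification baseline.

Insight: The findings highlight the potential of the transformer encoder architecture in real scenarios, particularly the ability of its internal layers to encode language characteristics.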

Abstract: Code-switching and language identification in child-directed scenarios present significant challenges, particularly in bilingual environments. This paper addresses this challenge by using Zipformer to handle the nuances of speech, which contains two imbalanced languages, Mandarin and English, in an utterance. This work demonstrates that the internal layers of the Zipformer effectively encode the language characteristics, which can be leveraged in language identification. We present the selection methodology of the inner layers to extract the embeddings and make a comparison with different back-ends. Our analysis shows that Zipformer is robust across these backends. Our approach effectively handles imbalanced data, achieving a Balanced Accuracy (BAC) of 81.89%, a 15.47% improvement over the language identification baseline. These findings highlight the potential of the transformer encoder architecture model in real scenarios.
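
Balanced Accuracy is the metric reported above; it is commonly defined as the mean of per-class recall, which is why it suits imbalanced data. The sketch below uses that standard definition on made-up labels (the `zh`/`en` tags and counts are illustrative, not from the paper) to show how it differs from plain accuracy.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy imbalanced set: 8 Mandarin segments, 2 English segments.
y_true = ["zh"] * 8 + ["en"] * 2
# A degenerate classifier that always predicts the majority language.
y_pred = ["zh"] * 10

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bac = balanced_accuracy(y_true, y_pred)
```

The majority-class predictor looks strong under plain accuracy but scores only chance level under balanced accuracy, since it never recovers the minority language.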

[6] From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text

Ridwan Mahbub,Mohammed Saidul Islam,Mir Tafseer Nayeem,Md Tahmid Rahman Laskar,Mizanur Rahman,Shafiq Joty,Enamul Hoque

Main category: cs.CL

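TL;DR: The paper investigates how large vision-language models (VLMs) can amplify geo-economic biases in the chart-to-text task. It finds that the sentiment of generated summaries varies with a country's economic status and evaluates how well prompt-based debiasing works.

Motivation: Charts are widely used to explore data and communicate insights, but extracting key takeaways and articulating them in natural language can be challenging. Despite the rapid progress of VLMs, little attention has been paid to potential biases in their outputs, which may cause societal harm.

Contribution: 1. A large-scale evaluation of geo-economic bias in VLM-generated chart summaries; 2. the finding that models generate more positive descriptions for high-income countries; 3. a test of prompt-based debiasing that documents its limitations.

Method: 1. Evaluates 6,000 chart-country pairs; 2. analyzes the sentiment of outputs from six widely used VLMs; 3. explores inference-time prompt-based debiasing techniques.

Result: Descriptions of high-income countries are more positive, and prompt-based debiasing is only partially effective, underscoring the complexity of the issue.

Insight: Geo-economic bias appears widespread across VLMs and calls for more robust debiasing strategies, potentially spanning data, models, and prompts.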

Abstract: Charts are very common for exploring data and communicating insights, but extracting key takeaways from charts and articulating them in natural language can be challenging. The chart-to-text task aims to automate this process by generating textual summaries of charts. While with the rapid advancement of large Vision-Language Models (VLMs), we have witnessed great progress in this domain, little to no attention has been given to potential biases in their outputs. This paper investigates how VLMs can amplify geo-economic biases when generating chart summaries, potentially causing societal harm. Specifically, we conduct a large-scale evaluation of geo-economic biases in VLM-generated chart summaries across 6,000 chart-country pairs from six widely used proprietary and open-source models to understand how a country’s economic status influences the sentiment of generated summaries. Our analysis reveals that existing VLMs tend to produce more positive descriptions for high-income countries compared to middle- or low-income countries, even when country attribution is the only variable changed. We also find that models such as GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 exhibit varying degrees of bias. We further explore inference-time prompt-based debiasing techniques using positive distractors but find them only partially effective, underscoring the complexity of the issue and the need for more robust debiasing strategies. Our code and dataset are publicly available here.

[7] From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation

Siyuan Meng,Junming Liu,Yirong Chen,Song Mao,Pinlong Cai,Guohang Yan,Botian Shi,Ding Wang

Main category: cs.CL

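TL;DR: The paper proposes a Dynamic Passage Selector (DPS) that improves passage selection in retrieval-augmented generation (RAG) systems. It avoids the shortcomings of a fixed Top-K cutoff and markedly improves performance on complex queries.

Motivation: Reranking modules in traditional RAG systems typically score passages independently and select a fixed Top-K. This struggles with complex multi-hop queries that need evidence from multiple documents: a small K omits crucial information, while a large K introduces noise.

Contribution: Proposes DPS, which treats passage selection as a supervised learning problem, captures inter-passage dependencies, and dynamically selects the most relevant passages without any change to the standard RAG pipeline.

Method: DPS is trained with supervised learning to identify inter-passage dependencies and dynamically decide on the best set of passages, and it works as a plug-and-play module.

Result: On five benchmarks, DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. On MuSiQue it improves the F1-score by 30.06% over Qwen3-reranker and by 15.4% over RankingGPT.

Insight: Adaptive evidence selection substantially improves RAG reasoning in complex scenarios, and the success of DPS shows the importance of modeling inter-passage dependencies.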

Abstract: Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios.

[8] COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation

Yunxiao Wang,Meng Liu,Wenqi Liu,Kaiyu Jiang,Bin Wen,Fan Yang,Tingting Gao,Guorui Zhou,Liqiang Nie

Main category: cs.CL

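TL;DR: The paper proposes controllable empathetic reasoning (COMPEER), which combines natural language reasoning with structured psychological steps and uses reinforcement learning with a reward model to improve the quality of emotional support conversations.

Motivation: Current emotional support conversation models lack deep empathetic reasoning grounded in psychological principles, which limits the quality of the support they provide.

Contribution: 1. Proposes controllable empathetic reasoning; 2. constructs a fine-grained annotated dataset; 3. designs a unified reward model and a redundancy-aware reward reweighting strategy.

Method: Combines natural language reasoning with psychological steps, trains with reinforcement learning, and introduces personality-based dialogue rewriting and redundancy-aware reward reweighting.

Result: Significantly improves the model's emotional support ability and produces more empathetic, human-like responses.

Insight: Combining psychological principles with natural language processing can effectively improve the empathy of dialogue systems, and careful reward design is essential for avoiding repetitive responses.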

Abstract: Emotional support conversations are crucial for promoting emotional well-being, yet current models often lack deep empathetic reasoning grounded in psychological principles. To address this, we propose controllable empathetic reasoning, which combines natural language reasoning with structured psychological steps. We construct a fine-grained dataset annotated with reasoning correctness and response preferences to enable this capability. To further enhance training, we employ reinforcement learning with a unified process-outcome reward model that delivers precise feedback. To mitigate response repetitiveness from entropy collapse, we introduce personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Our approach significantly improves the model’s emotional support ability, advancing the development of empathetic, human-like support systems.

[9] Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation

Ziyang Ma,Qingyue Yuan,Linhai Zhang,Deyu Zhou

Main category: cs.CL

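TL;DR: The paper proposes SLowED, a safe chain-of-thought distillation method whose Slow Tuning and Low-Entropy Masking modules improve the reasoning ability of small models while preserving their safety.

Motivation: Existing chain-of-thought distillation methods focus on improving the reasoning ability of small language models but overlook the negative effect that training has on their safety. This paper aims to address that problem.

Contribution: Proposes SLowED, which consists of two modules, Slow Tuning and Low-Entropy Masking, and balances gains in reasoning ability against the preservation of safety.

Method: Slow Tuning scales down the magnitude of weight changes so that optimization stays in the neighborhood of the initial weights. Low-Entropy Masking masks low-entropy tokens to exclude them from fine-tuning.

Result: Experiments on reasoning benchmarks (BBH, BB-Sub, and others) and a safety evaluation (AdvBench) show that SLowED retains the safety of small models while improving their reasoning ability comparably to existing distillation methods.

Insight: Slow Tuning maintains safety in the early stage of training, while Low-Entropy Masking prolongs the number of epochs over which training stays safe; the two modules are complementary.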

Abstract: Previous chain-of-thought (CoT) distillation methods primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects on SLM safety brought by the training, which are revealed in this study. Although there are works on safety alignment that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data, and probably impact the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning scales down the magnitude of model weight changes to optimize the model weights in the neighboring space near the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, to exclude them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench) show that SLowED retains the safety of SLMs and comparably improves their reasoning capability compared to existing distillation methods. Furthermore, our ablation study presents the effectiveness of Slow Tuning and Low-Entropy Masking, with the former maintaining the model’s safety in the early stage and the latter prolonging the safe training epochs.
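
The two modules can be sketched in a few lines under simplifying assumptions. Here entropy is computed from a per-token predictive distribution, tokens below a hypothetical `threshold` are dropped from the loss, and Slow Tuning is approximated by scaling the step from the initial weights by a hypothetical `scale` factor. The abstract does not give the exact formulas, so this is one plausible reading rather than the authors' implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def low_entropy_mask(token_distributions, threshold):
    """1 keeps a token as a training target, 0 masks it out."""
    return [1 if entropy(d) >= threshold else 0 for d in token_distributions]

def masked_nll(token_distributions, target_ids, mask):
    """Average negative log-likelihood over the unmasked tokens only."""
    kept = [-math.log(d[i])
            for d, i, m in zip(token_distributions, target_ids, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

def slow_update(initial_w, proposed_w, scale):
    """Shrink the weight change so the result stays near the initial weights."""
    return [w0 + scale * (w1 - w0) for w0, w1 in zip(initial_w, proposed_w)]

# Toy predictive distributions for three target tokens.
dists = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3], [0.5, 0.5, 0.0]]
targets = [0, 0, 1]
mask = low_entropy_mask(dists, threshold=0.5)   # first token is near-certain
loss = masked_nll(dists, targets, mask)
new_w = slow_update([1.0, 2.0], [2.0, 0.0], scale=0.1)
```

In this toy case only the near-deterministic first token is masked, and the weight update moves one tenth of the way toward the proposed weights.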

Rahul Hemrajani

Main category: cs.CL

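TL;DR: The paper empirically evaluates large language models (such as GPT, Claude, and Llama) in Indian legal practice. They excel at drafting and issue spotting but fall short in specialised legal research and reasoning.

Motivation: The study explores how capable AI actually is in the legal profession, specifically in the Indian legal context.

Contribution: An empirical evaluation of LLM performance on legal tasks, compared against the work of a junior lawyer.

Method: A survey experiment in which advanced law students rate the outputs of LLMs and a human lawyer on helpfulness, accuracy, and comprehensiveness.

Result: LLMs perform well on drafting and issue spotting, but in specialised legal research and reasoning they tend to produce incorrect or fabricated content.

Insight: LLMs can assist with some legal tasks, but human experts remain indispensable for complex reasoning and the precise application of law.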

Abstract: The integration of Artificial Intelligence (AI) into the legal profession raises significant questions about the capacity of Large Language Models (LLMs) to perform key legal tasks. In this paper, I empirically evaluate how well LLMs, such as GPT, Claude, and Llama, perform key legal tasks in the Indian context, including issue spotting, legal drafting, advice, research, and reasoning. Through a survey experiment, I compare outputs from LLMs with those of a junior lawyer, with advanced law students rating the work on helpfulness, accuracy, and comprehensiveness. LLMs excel in drafting and issue spotting, often matching or surpassing human work. However, they struggle with specialised legal research, frequently generating hallucinations (factually incorrect or fabricated outputs). I conclude that while LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law.

[11] The Perils of Chart Deception: How Misleading Visualizations Affect Vision-Language Models

Ridwan Mahbub,Mohammed Saidul Islam,Md Tahmid Rahman Laskar,Mizanur Rahman,Mir Tafseer Nayeem,Enamul Hoque

Main category: cs.CL

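TL;DR: The paper studies how vision-language models (VLMs) interpret charts with misleading designs. Most VLMs are deceived and misread the charts, which highlights the need for stronger safeguards against visual misinformation.

Motivation: As VLMs are increasingly used to interpret charts, especially by non-expert users, it is critical to understand how susceptible these models are to deceptive visual designs.

Contribution: An in-depth evaluation of how VLMs interpret misleading charts, revealing how easily the models are deceived.

Method: Analyzes over 16,000 responses from 10 different VLMs across 8 types of misleading chart designs.

Result: Most VLMs are deceived by misleading charts and misinterpret their content, even though the underlying data is unchanged.

Insight: The widespread use of VLMs calls for stronger safeguards against visual misinformation, and model robustness needs to improve, particularly for information visualization.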

Abstract: Information visualizations are powerful tools that help users quickly identify patterns, trends, and outliers, facilitating informed decision-making. However, when visualizations incorporate deceptive design elements (such as truncated or inverted axes, unjustified 3D effects, or violations of best practices), they can mislead viewers and distort understanding, spreading misinformation. While some deceptive tactics are obvious, others subtly manipulate perception while maintaining a facade of legitimacy. As Vision-Language Models (VLMs) are increasingly used to interpret visualizations, especially by non-expert users, it is critical to understand how susceptible these models are to deceptive visual designs. In this study, we conduct an in-depth evaluation of VLMs’ ability to interpret misleading visualizations. By analyzing over 16,000 responses from ten different models across eight distinct types of misleading chart designs, we demonstrate that most VLMs are deceived by them. This leads to altered interpretations of charts, despite the underlying data remaining the same. Our findings highlight the need for robust safeguards in VLMs against visual misinformation.

[12] Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

Vaishnavi Shrivastava,Ahmed Awadallah,Vidhisha Balachandran,Shivam Garg,Harkirat Behl,Dimitris Papailiopoulos

Main category: cs.CL

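TL;DR: The paper proposes GFPO, which samples larger groups of responses during training and filters them for short, token-efficient responses, curbing length inflation at inference time while maintaining accuracy.

Motivation: Large language models trained with reinforcement learning tend to inflate response length to gain accuracy, producing a lot of redundant "filler". GFPO aims to address this.

Contribution: 1. Proposes GFPO, which cuts length inflation by 46-71%; 2. proposes Adaptive Difficulty GFPO, which allocates training resources dynamically.

Method: Samples larger groups of responses during training, filters them by length and by reward-per-token ratio to keep short, efficient responses, and adjusts the allocation of training resources dynamically.

Result: On several challenging benchmarks, GFPO substantially reduces length inflation (by up to 85%) while maintaining accuracy.

Insight: Spending more compute at training time translates directly into less compute at inference time, an effective trade-off for efficient reasoning.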

Abstract: Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length, inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely “filler”: repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering the responses to train on based on two key metrics: (1) response length and (2) token efficiency, the reward-per-token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO’s length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases reductions in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy especially on difficult questions. GFPO demonstrates that increased training-time compute directly translates to reduced test-time compute, a simple yet effective trade-off for efficient reasoning.
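
The filtering step can be sketched as follows. The sketch assumes a GRPO-style group-normalized advantage, computes it over the retained responses only, and gives filtered-out responses zero advantage; those normalization details, the function name `gfpo_advantages`, and the toy rewards and lengths are assumptions, since the abstract only states that responses are filtered by length or by reward per token.

```python
from statistics import mean, pstdev

def gfpo_advantages(rewards, lengths, keep, metric="length"):
    """Keep the `keep` best responses by the chosen metric; others get 0."""
    if metric == "length":
        key = lambda i: lengths[i]                 # shortest first
    else:
        key = lambda i: -rewards[i] / lengths[i]   # highest reward per token first
    kept = sorted(range(len(rewards)), key=key)[:keep]
    kept_rewards = [rewards[i] for i in kept]
    mu, sigma = mean(kept_rewards), pstdev(kept_rewards)
    advantages = [0.0] * len(rewards)
    for i in kept:
        # Group-normalized advantage over the retained subset (assumed form).
        advantages[i] = (rewards[i] - mu) / sigma if sigma > 0 else 0.0
    return advantages

# One problem, four sampled responses: verifiable reward and token count.
rewards = [1.0, 1.0, 0.0, 1.0]
lengths = [900, 300, 200, 1200]
adv_by_length = gfpo_advantages(rewards, lengths, keep=2, metric="length")
adv_by_efficiency = gfpo_advantages(rewards, lengths, keep=2, metric="efficiency")
```

With the length metric, the two shortest responses are retained, so the long correct answers contribute no gradient signal; with the efficiency metric, the two correct answers with the best reward-per-token ratio are retained instead.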

[13] Transforming Questions and Documents for Semantically Aligned Retrieval-Augmented Generation

Seokgi Lee

Main category: cs.CL

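TL;DR: Proposes a new retrieval-augmented generation (RAG) framework for multi-hop question answering that decomposes questions into sub-questions and generates answerable questions from documents to improve retrieval.

Motivation: Directly embedding complex multi-hop questions and raw documents makes retrieval ambiguous; clearer retrieval targets and better semantic alignment are needed.

Contribution: 1) LLM-based decomposition of multi-hop questions into single-hop sub-questions; 2) embedding answerable questions generated from documents to improve retrieval.

Method: An LLM decomposes each multi-hop question into single-hop sub-questions. Answerable questions are generated from the documents and embedded, and retrieval uses question-question similarity.

Result: On MuSiQue and other multi-hop datasets, the method outperforms baseline systems, supporting the use of answerable-question embeddings and multi-hop question decomposition.

Insight: Semantically aligned retrieval (such as question embeddings) combined with multi-hop question decomposition can improve RAG performance on multi-hop question answering.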

Abstract: We introduce a novel retrieval-augmented generation (RAG) framework tailored for multihop question answering. First, our system uses a large language model (LLM) to decompose complex multihop questions into a sequence of single-hop subquestions that guide document retrieval. This decomposition mitigates the ambiguity inherent in multi-hop queries by clearly targeting distinct knowledge facets. Second, instead of embedding raw or chunked documents directly, we generate answerable questions from each document chunk using Qwen3-8B, embed these generated questions, and retrieve relevant chunks via question-question embedding similarity. During inference, the retrieved chunks are then fed along with the original question into the RAG pipeline. We evaluate on three multihop question datasets (MuSiQue, 2WikiMultiHopQa, HotpotQA) from LongBench. Our method improves RAG performance compared to baseline systems. Our contributions highlight the benefits of using answerable-question embeddings for RAG, and the effectiveness of LLM-based query decomposition for multihop scenarios.
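
A minimal sketch of question-question retrieval follows. It indexes questions generated from each chunk and matches incoming sub-questions against them. The bag-of-words `embed` function stands in for a neural embedding model, and the chunk ids, generated questions, and sub-questions are hand-written placeholders; in the described system an LLM would produce both the decomposition and the generated questions.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Chunk id -> answerable questions generated from that chunk (placeholders).
generated_questions = {
    "chunk_paris": ["What is the capital of France?"],
    "chunk_curie": ["Who discovered polonium?", "Where was Marie Curie born?"],
}
# The index stores question embeddings, each pointing back to its chunk.
index = [(embed(q), cid) for cid, qs in generated_questions.items() for q in qs]

def retrieve(sub_question, top_k=1):
    """Rank chunks by similarity between the query and their questions."""
    q_vec = embed(sub_question)
    ranked = sorted(index, key=lambda e: cosine(q_vec, e[0]), reverse=True)
    chunks = []
    for _, cid in ranked:
        if cid not in chunks:
            chunks.append(cid)
    return chunks[:top_k]

# Assumed single-hop sub-questions from decomposing a multi-hop question.
sub_questions = ["Where was Marie Curie born?", "What is the capital of France?"]
retrieved = [retrieve(q)[0] for q in sub_questions]
```

Because queries and index entries are both questions, they live in the same phrasing space, which is the semantic alignment the framework relies on.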

[14] Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges

Mahdi Dhaini,Tobias Müller,Roksoliana Rabets,Gjergji Kasneci

Main category: cs.CL

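TL;DR: Through interviews with industry practitioners and academic researchers, the paper examines the motivations, techniques, satisfaction levels, and challenges of explainable natural language processing (NLP) in practice. It reveals conceptual gaps, low satisfaction with current methods, and evaluation difficulties, and stresses the importance of clear definitions and user-centric frameworks.

Motivation: Complex NLP models are increasingly opaque, so their decisions need transparency and explanation, especially in high-stakes environments. Yet practitioners' perspectives on the practical adoption and effectiveness of explainable NLP remain underexplored.

Contribution: Systematically analyzes the practices and challenges of explainable NLP from industry and academic perspectives, reveals conceptual gaps and low satisfaction with current methods, and identifies the need for user-centric frameworks.

Method: A qualitative interview study with industry practitioners and academic researchers that analyzes their views on motivations, techniques, satisfaction, and challenges in explainable NLP.

Result: Satisfaction with explainable NLP is low, and there are conceptual gaps and evaluation challenges. Clear definitions and user-centric frameworks are needed to improve practical adoption.

Insight: Explainable NLP needs to move from theory to practical use, and clearly defined user needs and evaluation standards are key.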

Abstract: The field of explainable natural language processing (NLP) has grown rapidly in recent years. The growing opacity of complex models calls for transparency and explanations of their decisions, which is crucial to understand their reasoning and facilitate deployment, especially in high-stakes environments. Despite increasing attention given to explainable NLP, practitioners’ perspectives regarding its practical adoption and effectiveness remain underexplored. This paper addresses this research gap by investigating practitioners’ experiences with explainability methods, specifically focusing on their motivations for adopting such methods, the techniques employed, satisfaction levels, and the practical challenges encountered in real-world NLP applications. Through a qualitative interview-based study with industry practitioners and complementary interviews with academic researchers, we systematically analyze and compare their perspectives. Our findings reveal conceptual gaps, low satisfaction with current explainability methods, and highlight evaluation challenges. Our findings emphasize the need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice.

[15] BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning

Ahmed Masry,Abhay Puri,Masoud Hashemi,Juan A. Rodriguez,Megh Thakkar,Khyati Mahajan,Vikas Yadav,Sathwik Tejaswi Madhusudhan,Alexandre Piché,Dzmitry Bahdanau,Christopher Pal,David Vazquez,Enamul Hoque,Perouz Taslakian,Sai Rajeswar,Spandana Gella

Main category: cs.CL

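TL;DR: The study proposes BigCharts, a chart dataset creation pipeline grounded in real-world data, and fine-tunes with GRPO-based reinforcement learning, significantly improving performance on chart understanding tasks.

Motivation: Existing vision-language models perform poorly on chart understanding, mainly because their training data lacks diversity and authenticity, and because relying only on supervised fine-tuning with low-quality data limits model performance.

Contribution: 1. Proposes the BigCharts dataset creation pipeline, which incorporates real-world chart data to ensure visual diversity and authenticity; 2. introduces GRPO-based reinforcement learning with dedicated reward signals to improve model robustness and generalization.

Method: Generates a diverse dataset from real-world charts and trains the model with supervised fine-tuning combined with GRPO-based reinforcement learning to improve chart reasoning.

Result: Experiments show that BigCharts-R1 surpasses existing open-source and closed-source models on multiple chart question-answering benchmarks.

Insight: Combining real-world data with reinforcement learning can effectively improve chart understanding, particularly in terms of diversity and generalization.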

Abstract: Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. Although current vision-language models (VLMs) have made significant progress, they continue to struggle with chart comprehension due to training on datasets that lack diversity and real-world authenticity, or on automatically extracted underlying data tables of charts, which can contain numerous estimation errors. Furthermore, existing models only rely on supervised fine-tuning using these low-quality datasets, severely limiting their effectiveness. To address these issues, we first propose BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts sourced from multiple online platforms. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity, while still retaining accurate underlying data due to our proposed replotting process. Additionally, we introduce a comprehensive training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization across diverse chart styles and domains, resulting in a state-of-the-art chart reasoning model, BigCharts-R1. Extensive experiments demonstrate that our models surpass existing methods on multiple chart question-answering benchmarks compared to even larger open-source and closed-source models.

[16] A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems

Aishik Mandal,Prottay Kumar Adhikary,Hiba Arnaout,Iryna Gurevych,Tanmoy Chakraborty

Main category: cs.CL

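TL;DR: The paper presents the first comprehensive survey of datasets for developing clinical mental health AI systems. It categorizes them by disorder, modality, task, and other dimensions, identifies gaps in existing data, and offers recommendations.

Motivation: Mental health problems are rising worldwide while clinical resources remain scarce. AI systems could help bridge this gap, but the lack of high-quality data hinders the development and deployment of AI models.

Contribution: 1. The first systematic survey of clinical mental health datasets; 2. a categorization of the datasets and an analysis of key issues; 3. recommendations for standardizing and diversifying future datasets.

Method: Through a literature review and dataset analysis, categorizes datasets by mental disorder, modality, task type, accessibility, and sociocultural context, and assesses their limitations.

Result: Existing datasets have significant gaps in longitudinal data, cultural diversity, and annotation standards, and synthetic data in particular covers few modalities.

Insight: Future work should build multimodal, multicultural, and standardized datasets to support more equitable and robust mental health AI systems.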

Abstract: Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. However, the development of efficient, reliable, and ethical AI to assist clinicians is heavily dependent on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets largely remain scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorders (e.g., depression, schizophrenia), data modalities (e.g., text, speech, physiological signals), task types (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted or private), and sociocultural context (e.g., language and cultural background). Along with these, we also investigate synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. We conclude by outlining key challenges in curating and standardizing future datasets and provide actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems.

[17] Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Weigao Sun,Jiaxi Hu,Yucheng Zhou,Jusen Du,Disen Lan,Kexin Wang,Tong Zhu,Xiaoye Qu,Yu Zhang,Xiaoyu Mo,Daizong Liu,Yuxuan Liang,Wenliang Chen,Guoqi Li,Yu Cheng

Main category: cs.CL

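TL;DR: The paper systematically surveys efficient architectures for large language models (LLMs) that address the compute and scalability limitations of traditional Transformers. It covers linear and sparse sequence modeling, efficient attention variants, mixture-of-experts models, and other techniques, along with their multimodal applications.

Motivation: Traditional Transformer models perform well on language understanding and generation, but their computational cost makes large-scale training and deployment difficult. Research on efficient LLM architectures is therefore urgently needed.

Contribution: Summarizes a range of approaches to efficient LLM architectures, including linear and sparse sequence modeling, efficient attention, and mixture-of-experts, and presents a blueprint for research on future efficient AI systems.

Method: Surveys linear and sparse sequence modeling, efficient attention variants, sparse mixture-of-experts, hybrid architectures, and emerging diffusion LLMs, and discusses the extension of these techniques to other modalities.

Result: Presents a comprehensive taxonomy of efficient LLM architectures that gives direction to future research on efficient, scalable AI systems.

Insight: Combining several efficiency techniques enables resource-aware LLM design while maintaining or improving model performance.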

Abstract: Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.

[18] PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Mo Yu,Tsz Ting Chung,Chulun Zhou,Tong Li,Rui Lu,Jiangnan Li,Liyan Xu,Haoshu Lu,Ning Zhang,Jing Li,Jie Zhou

Main category: cs.CL

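TL;DR: PRELUDE is a benchmark for long-context understanding and reasoning that asks whether a character's prequel story is consistent with the canonical narrative of the original book. Experiments show that existing approaches, including state-of-the-art LLMs and commercial services, lag well behind humans.

Details Motivation: Existing benchmarks fall short on global comprehension and deep reasoning over long texts. PRELUDE fills this gap with a prequel-consistency task that requires integrating evidence scattered across the narrative.

Contribution: Introduces the PRELUDE benchmark, which targets global comprehension and cross-passage reasoning over long texts, and exposes a significant gap between current models and humans.

Method: Designs a prequel-consistency task in which evidence must be drawn from information in the original book that is only indirectly related, and compares humans against in-context learning, RAG, in-domain training with LLMs, and commercial DeepResearch services.

Result: 88% of instances require evidence from multiple parts of the narrative, and models lag behind humans by more than 15%. A human study finds that models often reach correct answers through flawed reasoning, leaving a gap of over 30% in reasoning accuracy relative to humans.

Insight: Long-context understanding and reasoning still have substantial room to improve. Models tend to produce answers that look correct but rest on faulty logic, so explicit reasoning ability is a direction for future improvement.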

Abstract: We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks – as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.

[19] Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription

Abdul Rehman Antall,Naveed Akhtar

Main category: cs.CL

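TL;DR: This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for low-resource Urdu speech recognition. Whisper-Small performs best (33.68% WER), but challenges in phonetic accuracy and lexical coherence remain.

Details Motivation: Urdu is the 10th most spoken language in the world, with over 230 million speakers, yet it remains underrepresented in automatic speech recognition (ASR) because of dialectal diversity, code-switching, and sparse training data.

Contribution: Benchmarks Whisper models in a low-resource setting, finds that Whisper-Small performs best, and lays groundwork for future low-resource ASR research.

Method: Evaluates Whisper models (Tiny, Base, Small) without fine-tuning on a curated Urdu dataset, using word error rate (WER) as the metric.

Result: Whisper-Small achieves the lowest error rate (33.68% WER), ahead of Base (53.67%) and Tiny (67.08%), but phonetic accuracy and lexical coherence remain problematic for complex utterances.

Insight: Lightweight Whisper models show promise for low-resource Urdu ASR, but phonetic and lexical challenges still need to be addressed.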

Abstract: This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. Despite Urdu being the 10th most spoken language globally with over 230 million speakers, its representation in automatic speech recognition (ASR) systems remains limited due to dialectal diversity, code-switching, and sparse training data. We benchmark these models on a curated Urdu dataset using word error rate (WER), without fine-tuning. Results show Whisper-Small achieves the lowest error rates (33.68% WER), outperforming Tiny (67.08% WER) and Base (53.67% WER). Qualitative analysis reveals persistent challenges in phonetic accuracy and lexical coherence, particularly for complex utterances. While Whisper-Small demonstrates promise for deployable Urdu ASR, significant gaps remain. Our findings lay the groundwork for future research into effective, low-resource ASR systems.
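
The benchmark above reports word error rate (WER), which is the word-level edit distance between a reference transcript and a model transcript, divided by the number of reference words. The following is a minimal sketch of that metric, assuming simple whitespace tokenization; the function name is hypothetical and the paper's own text normalization is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 33.68% corresponds to roughly one word-level error for every three reference words. WER can exceed 100% when the hypothesis contains many insertions.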

[20] Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Jiaqi Cao,Jiarui Wang,Rubin Wei,Qipeng Guo,Kai Chen,Bowen Zhou,Zhouhan Lin

Main category: cs.CL

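TL;DR: Proposes Memory Decoder, a plug-and-play pretrained memory module that efficiently adapts large language models to specific domains without modifying the original model's parameters.

Details Motivation: Current domain adaptation methods such as DAPT require costly full-parameter training and are prone to catastrophic forgetting, while retrieval-augmented generation (RAG) adds substantial inference latency from expensive nearest-neighbor searches and longer contexts.

Contribution: Designs a small pretrained Transformer decoder that imitates the behavior of an external non-parametric retriever, enabling plug-and-play domain adaptation.

Method: Memory Decoder trains a small decoder to learn the retriever's behavior. Once trained, it can be integrated with any pretrained model that shares the same tokenizer.

Result: Across three domains (biomedicine, finance, and law), the method reduces perplexity by an average of 6.17 points on various Qwen and Llama models.

Insight: A plug-and-play paradigm built on a pretrained memory module offers an efficient and general approach to domain adaptation.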

Abstract: Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model’s parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
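
The result above is stated in perplexity points. As a reminder of what that unit measures, here is a minimal sketch of perplexity computed from per-token log-probabilities; the function name is hypothetical, natural logarithms are assumed, and the paper's evaluation protocol is not described in the abstract.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    Assumption: `token_log_probs` holds natural-log probabilities that a
    model assigned to each observed token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every observed token has perplexity 4, so lower values mean the model is less "surprised" by domain text.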

[21] A Survey of Cognitive Distortion Detection and Classification in NLP

Archie Sage,Jeroen Keppens,Helen Yannakoudakis

Main category: cs.CL

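TL;DR: This paper surveys research on detecting and classifying cognitive distortions (CDs) in natural language processing (NLP). It reviews 38 studies spanning two decades, gives an overview of datasets, modelling approaches, and evaluation strategies, and identifies open challenges.

Details Motivation: As NLP is increasingly applied to mental health, interest in automatically detecting and classifying CDs has grown. The field nonetheless suffers from inconsistent taxonomies, varying task formulations, and non-uniform evaluation practices.

Contribution: Provides a consolidated CD taxonomy reference, summarises common task setups, and highlights open challenges to support more coherent and reproducible research.

Method: Reviews 38 studies spanning two decades and systematically analyses the datasets, modelling approaches, and evaluation strategies used for CD detection and classification.

Result: Consolidates CD taxonomies and summarises limitations of current research, such as insufficient datasets and non-uniform evaluation standards.

Insight: Future work needs more annotated data, unified evaluation standards, and more advanced modelling approaches to bring CD detection and classification into practical use.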

Abstract: As interest grows in the application of natural language processing (NLP) techniques to mental health, a growing body of work explores the automatic detection and classification of cognitive distortions (CDs). CDs are habitual patterns of negatively biased or flawed thinking that distort how people perceive events, judge themselves, and react to the world around them. Identifying and addressing them is an important part of therapy. Despite its momentum, the field remains fragmented, with inconsistencies in CD taxonomies, task formulations, and evaluation practices. This survey reviews 38 studies spanning two decades, providing a structured overview of datasets, modelling approaches, and evaluation strategies. We provide a consolidated CD taxonomy reference, summarise common task setups, and highlight open challenges to support more coherent and reproducible research in this emerging area.

[22] VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

Lingjie Jiang,Shaohan Huang,Xun Wu,Yixia Li,Dongdong Zhang,Furu Wei

Main category: cs.CL

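TL;DR: VisCodex is a unified framework that merges vision and coding language models to strengthen the code generation abilities of multimodal large language models. It also introduces a new dataset and benchmark, and experiments show performance approaching that of proprietary models.

Details Motivation: Existing multimodal LLMs remain limited in generating code from multimodal inputs, and the authors aim to improve this ability.

Contribution: 1) Proposes the VisCodex framework, which merges vision and coding models; 2) introduces the Multimodal Coding Dataset (MCD); 3) designs the InfiBench-V benchmark.

Method: Uses a task vector-based model merging technique to integrate a coding LLM into a vision-language backbone.

Result: VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models such as GPT-4o.

Insight: Merging vision and coding models effectively improves multimodal code generation, and the new dataset and benchmark are important for progress in this area.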

Abstract: Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
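
Task vector-based merging, as named above, is commonly formulated as adding the difference between a fine-tuned model and its base model to another model that shares the same base. The sketch below illustrates that generic task-arithmetic idea on plain Python lists. It is an assumption about the general technique, not the paper's exact procedure: the function name, the `scale` coefficient, and the parameter names are all hypothetical.

```python
def merge_with_task_vector(base, finetuned, target, scale=1.0):
    """Return target + scale * (finetuned - base) for every shared parameter.

    Assumptions (hypothetical, for illustration only):
    - each model is a dict mapping parameter names to flat lists of floats;
    - `finetuned` (e.g. a coding LLM) and `target` (e.g. the language
      backbone of a vision-language model) derive from the same `base`.
    Parameters that exist only in `target` are copied unchanged.
    """
    merged = {}
    for name, target_values in target.items():
        if name in base and name in finetuned:
            merged[name] = [
                t + scale * (f - b)
                for t, f, b in zip(target_values, finetuned[name], base[name])
            ]
        else:
            # e.g. vision-encoder weights that the coding model does not have
            merged[name] = list(target_values)
    return merged


base_llm = {"w": [1.0, 2.0]}
coding_llm = {"w": [1.5, 2.0]}
vision_language_model = {"w": [1.0, 3.0], "vision": [0.1]}
merged = merge_with_task_vector(base_llm, coding_llm, vision_language_model, scale=0.5)
```

In this toy setup only the shared language parameter `w` is shifted toward the coding model, while the vision-only parameter is left untouched, which mirrors the stated goal of preserving visual comprehension.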

[23] Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

Fares Antaki,David Mikhail,Daniel Milad,Danny A Mammo,Sumit Sharma,Sunil K Srivastava,Bing Yu Chen,Samir Touma,Mertcan Sevgi,Jonathan El-Khoury,Pearse A Keane,Qingyu Chen,Yih Chung Tham,Renaud Duval

Main category: cs.CL

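TL;DR: GPT-5-high achieves the highest accuracy on ophthalmology question answering but does not significantly outperform o3-high, while GPT-5-mini-low offers the most favorable balance of low cost and high performance. The paper also introduces an autograder framework for assessing the quality of LLM-generated answers.

Details Motivation: To examine how LLMs such as GPT-5 perform on complex medical question answering and to identify the configurations that best balance accuracy and cost-efficiency.

Contribution: 1) Evaluates 12 GPT-5 configurations on ophthalmology QA; 2) ranks models head-to-head with a Bradley-Terry model; 3) introduces an autograder framework for evaluating LLM-generated answers.

Method: 1) Measures accuracy on 260 multiple-choice ophthalmology questions; 2) ranks models with a Bradley-Terry model; 3) analyzes cost-effectiveness using token-based cost estimates.

Result: GPT-5-high has the highest accuracy (0.965) but is not significantly better than o3-high (0.958). GPT-5-mini-low offers the best balance of low cost and high performance.

Insight: Reasoning effort affects model accuracy, and the study identifies the most cost-effective model configurations.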

Abstract: Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI’s GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology.
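
The head-to-head ranking above relies on a Bradley-Terry model, in which item i beats item j with probability s_i / (s_i + s_j) for positive strengths s. The following is a minimal sketch of fitting those strengths with the standard minorize-maximize iteration; the function name and the toy win counts are hypothetical, and the paper's actual fitting procedure is not described in the abstract.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times item i beat item j.
    Returns strengths normalized to sum to 1.
    Note: an item that never wins is driven to strength 0 (a known
    degenerate case of the maximum-likelihood fit).
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = 0.0
            for j in range(n):
                games = wins[i][j] + wins[j][i]
                if j != i and games > 0:
                    denom += games / (strengths[i] + strengths[j])
            # Items with no recorded games keep their current strength.
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Toy example: model A beats model B in 3 of 4 comparisons.
toy_strengths = fit_bradley_terry([[0, 3], [1, 0]])
```

In the toy example the fitted strength ratio is 3, matching the 3:1 win record. A ratio of fitted strengths is one plausible reading of a statement such as "1.66x stronger", though the abstract does not spell out that definition.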

[24] Which one Performs Better? Wav2Vec or Whisper? Applying both in Badini Kurdish Speech to Text (BKSTT)

Renas Adnan,Hossein Hassani

Main category: cs.CL

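TL;DR: This paper evaluates Wav2Vec2-Large-XLSR-53 and Whisper-small on Badini Kurdish speech-to-text and finds that the Wav2Vec2 model is significantly more accurate and readable than the Whisper model.

Details Motivation: Badini Kurdish is a low-resource dialect that lacks high-quality speech-to-text systems, which limits its community's use of digital technology. The paper aims to fill this gap by building a language model for the dialect.

Contribution: 1. Builds a Badini Kurdish speech dataset (about 17 hours of recordings) and language models. 2. Compares Wav2Vec2 and Whisper on the task, with supporting experimental data.

Method: 1. Data collection: Badini children's stories (78 stories) serve as the textual input, recorded by six narrators. 2. Preprocessing: cleaning, segmentation, and tokenization. 3. Modeling: Wav2Vec2-Large-XLSR-53 and Whisper-small are each used to build a language model and are then evaluated.

Result: The Wav2Vec2 model reaches 90.38% readability and 82.67% accuracy, significantly higher than Whisper-small's 65.45% and 53.17%.

Insight: For low-resource languages, Wav2Vec2-based models may be the better choice, while Whisper performs worse when resources are limited.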

Abstract: Speech-to-text (STT) systems have a wide range of applications. They are available in many languages, albeit at different quality levels. Although Kurdish is considered a less-resourced language from a processing perspective, STT is available for some of the Kurdish dialects, for instance, Sorani (Central Kurdish). However, this is not the case for other Kurdish dialects, such as Badini and Hawrami. This research is an attempt to address this gap. Badini has approximately two million speakers, and STT systems can help their community use mobile and computer-based technologies while giving their dialect more global visibility. We aim to create a language model based on Badini speech and evaluate its performance. To cover a conversational aspect, to have reasonable confidence in grammatical accuracy, and to have transcriptions readily available, we chose Badini kids’ stories, eight books including 78 stories, as the textual input. Six narrators narrated the books, which resulted in approximately 17 hours of recording. We cleaned, segmented, and tokenized the input. The preprocessing produced nearly 15 hours of speech, including 19193 segments and 25221 words. We used Wav2Vec2-Large-XLSR-53 and Whisper-small to develop the language models. The experiments indicate that the transcription process based on the Wav2Vec2-Large-XLSR-53 model provides a significantly more accurate and readable output than the Whisper-small model, with 90.38% and 65.45% readability, and 82.67% and 53.17% accuracy, respectively.

cs.CV [Back]

[25] A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection

Mohammad Zia Ur Rehman,Sufyaan Zahoor,Areeb Manzoor,Musharaf Maqbool,Nagendra Kumar

Main category: cs.CV

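TL;DR: The paper proposes a multimodal framework based on context-aware attention and graph neural networks for detecting misogynistic content on social media, using three modules to substantially improve detection performance.

Details Motivation: A substantial portion of offensive content on social media targets women, and general offensive-content detectors struggle to identify it, so tailored solutions are needed.

Contribution: Proposes a framework comprising the Multimodal Attention module (MANM), the Graph-based Feature Reconstruction Module (GFRM), and the Content-specific Features Learning Module (CFLM), which markedly improves misogyny detection.

Method: Combines adaptive gating-based multimodal context-aware attention (MANM), graph-based feature refinement (GFRM), and learning of text- and image-specific features (CFLM), and adds test-time augmentation in feature space.

Result: On the MAMI and MMHS150K datasets, macro-F1 improves on average by 10.17% and 8.88%, respectively, over existing methods.

Insight: Combining multimodal context-aware attention with graph neural networks captures hateful content on social media effectively, and test-time feature augmentation further improves generalization.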

Abstract: A substantial portion of offensive content on social media is directed towards women. Since general offensive content detection approaches struggle to detect misogynistic content, solutions tailored to offensive content against women are required. To this end, we propose a novel multimodal framework for the detection of misogynistic and sexist content. The framework comprises three modules: the Multimodal Attention module (MANM), the Graph-based Feature Reconstruction Module (GFRM), and the Content-specific Features Learning Module (CFLM). The MANM employs adaptive gating-based multimodal context-aware attention, enabling the model to focus on relevant visual and textual information and generating contextually relevant features. The GFRM module utilizes graphs to refine features within individual modalities, while the CFLM focuses on learning text and image-specific features such as toxicity features and caption features. Additionally, we curate a set of misogynous lexicons to compute the misogyny-specific lexicon score from the text. We apply test-time augmentation in feature space to better generalize the predictions on diverse inputs. The performance of the proposed approach has been evaluated on two multimodal datasets, MAMI and MMHS150K, with 11,000 and 13,494 samples, respectively. The proposed method demonstrates an average improvement of 10.17% and 8.88% in macro-F1 over existing methods on the MAMI and MMHS150K datasets, respectively.
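
The improvements above are reported in macro-F1, which averages per-class F1 scores with equal weight so that a minority class counts as much as the majority class. A minimal sketch of the metric follows; the function name and toy labels are hypothetical, and classes with no true or predicted positives are scored as 0 here, which is one common convention.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: 1 = misogynous, 0 = not misogynous.
toy_score = macro_f1([1, 1, 1, 0], [1, 1, 0, 0])
```

In the toy example the positive class scores 0.8 and the negative class 2/3, so the macro average is about 0.733 even though plain accuracy is 0.75.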

[26] IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection

Yanhui Li,Yunkang Cao,Chengliang Liu,Yuan Xiong,Xinghui Dong,Chao Huang

Main category: cs.CV

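TL;DR: IAD-R1 is a universal post-training framework that substantially improves industrial anomaly detection for vision-language models of different architectures and parameter scales. With a two-stage training strategy and carefully designed reward functions, a trained model surpasses commercial models such as GPT-4.1 in zero-shot settings.

Details Motivation: Defective samples are scarce in industrial anomaly detection, so traditional methods generalize poorly, while existing vision-language models perform poorly on the task.

Contribution: Proposes IAD-R1, a post-training framework applicable to different VLMs that significantly improves anomaly detection through a two-stage training strategy.

Method: Two-stage training: the PA-SFT stage uses the high-quality Expert-AD dataset to strengthen anomaly perception, and the SC-GRPO stage uses reward functions to move from anomaly perception to anomaly interpretation.

Result: IAD-R1 improves 7 VLMs, with gains of up to 43.3% in average accuracy. The 0.5B-parameter model surpasses GPT-4.1 and Claude-Sonnet-4 in zero-shot settings.

Insight: A high-quality Chain-of-Thought dataset and well-designed reward functions are key to improving industrial anomaly detection.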

Abstract: Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from “Anomaly Perception” to “Anomaly Interpretation”. Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs, attaining up to 43.3% enhancement in average accuracy on 6 industrial anomaly detection benchmark datasets. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1.
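
The second training stage above builds on Group Relative Policy Optimization (GRPO). In the standard GRPO formulation, several responses are sampled for the same prompt and each response's advantage is its reward standardized within that group. The sketch below shows only that generic step; the function name and `eps` are hypothetical, and the "Structured Control" modifications of SC-GRPO are not described in the abstract, so they are not modeled here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    Assumption: population standard deviation; `eps` avoids division by
    zero when every response in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group: two responses judged correct (reward 1) and two incorrect (reward 0).
toy_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean get positive advantages and are reinforced, while below-average responses are discouraged, which removes the need for a separately learned value function.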

[27] A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality

Rongqian Chen,Allison Andreyev,Yanming Xiu,Mahdi Imani,Bin Li,Maria Gorlatova,Gang Tan,Tian Lan

Main category: cs.CV

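TL;DR: The paper proposes CADAR, a neurosymbolic approach to cognitive attack detection in augmented reality (AR) that combines the adaptability of vision-language models with the interpretability of particle filtering.

Details Motivation: As AR grows in popularity, cognitive attacks that alter AR content to manipulate users' semantic perception are receiving increasing attention. Existing methods either lack semantic reasoning or rely on black-box models with limited interpretability.

Contribution: Proposes CADAR, a neurosymbolic framework that fuses multimodal vision-language inputs with particle-filter statistical reasoning, improving both the accuracy and the interpretability of cognitive attack detection.

Method: Uses pre-trained vision-language models to build a symbolic perception-graph representation that incorporates prior knowledge, salience weighting, and temporal correlations, then performs statistical reasoning with a particle filter.

Result: On an extended AR cognitive attack dataset, CADAR improves accuracy by up to 10.7% over strong baselines.

Insight: Neurosymbolic methods show promise for AR cognitive attack detection, keeping the adaptability of neural networks while gaining reasoning and interpretability from symbolic methods.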

Abstract: Augmented Reality (AR) enriches perception by overlaying virtual elements on the physical world. Due to its growing popularity, cognitive attacks that alter AR content to manipulate users’ semantic perception have received increasing attention. Existing detection methods often focus on visual changes, which are restricted to pixel- or image-level processing and lack semantic reasoning capabilities, or they rely on pre-trained vision-language models (VLMs), which function as black-box approaches with limited interpretability. In this paper, we present CADAR, a novel neurosymbolic approach for cognitive attack detection in AR. It fuses multimodal vision-language inputs using neural VLMs to obtain a symbolic perception-graph representation, incorporating prior knowledge, salience weighting, and temporal correlations. The model then enables particle-filter-based statistical reasoning – a sequential Monte Carlo method – to detect cognitive attacks. Thus, CADAR inherits the adaptability of pre-trained VLMs and the interpretability and reasoning rigor of particle filtering. Experiments on an extended AR cognitive attack dataset show accuracy improvements of up to 10.7% over strong baselines on challenging AR attack scenarios, underscoring the promise of neurosymbolic methods for effective and interpretable cognitive attack detection.
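
Particle filtering, the sequential Monte Carlo method named above, tracks a belief over a hidden state by repeatedly predicting, weighting by the likelihood of each new observation, and resampling. The sketch below is a toy bootstrap particle filter over a binary hidden state (attack or no attack). It is only an illustration of the general technique: CADAR reasons over a perception graph, and every name, probability, and the binary state here is a hypothetical simplification.

```python
import random


def particle_filter_attack_probability(observations, num_particles=1000,
                                       stay_prob=0.95, hit_prob=0.9, seed=0):
    """Estimate P(attack) after each observation with a bootstrap particle filter.

    Assumptions (hypothetical): the hidden state is a boolean (True = attack),
    it persists between frames with probability `stay_prob`, and each
    observation (True = frame looks suspicious) agrees with the hidden state
    with probability `hit_prob`.
    """
    rng = random.Random(seed)
    particles = [rng.random() < 0.5 for _ in range(num_particles)]
    estimates = []
    for suspicious in observations:
        # Predict: propagate each particle through the transition model.
        particles = [s if rng.random() < stay_prob else not s for s in particles]
        # Weight: likelihood of the observation under each particle's state.
        weights = [hit_prob if s == suspicious else 1.0 - hit_prob for s in particles]
        # Resample: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=num_particles)
        estimates.append(sum(particles) / num_particles)
    return estimates


belief_under_attack = particle_filter_attack_probability([True] * 10)
belief_benign = particle_filter_attack_probability([False] * 10)
```

Because the belief is carried by explicit particles and weights, each step of the inference can be inspected, which is the kind of interpretability the abstract attributes to this family of methods.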

[28] RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System

Abdolazim Rezaei,Mehdi Sookhak,Mahboobeh Haghparast

Main category: cs.CV

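TL;DR: The paper proposes RL-MoE, a framework that protects privacy by converting visual data into textual descriptions, combining Mixture-of-Experts and reinforcement learning to pursue both semantic accuracy and privacy preservation.

Details Motivation: The wide deployment of AI-powered cameras in intelligent transportation systems creates a conflict between privacy and data utility. Existing methods such as blurring or encryption fall short, forcing a trade-off between privacy and data quality.

Contribution: Proposes RL-MoE, which transmits textual descriptions instead of images and combines MoE with reinforcement learning to optimize semantic accuracy and privacy preservation, markedly lowering the success rate of replay attacks.

Method: Uses an MoE architecture for multi-aspect scene decomposition and a reinforcement learning agent that optimizes the generated text for the dual objective of privacy and semantic accuracy.

Result: On the CFP-FP dataset, RL-MoE reduces the replay attack success rate to 9.4% while generating richer textual content than baseline methods.

Insight: RL-MoE offers a practical, scalable solution for privacy-sensitive domains, supporting trustworthy smart city and autonomous vehicle networks.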

Abstract: The proliferation of AI-powered cameras in Intelligent Transportation Systems (ITS) creates a severe conflict between the need for rich visual data and the fundamental right to privacy. Existing privacy-preserving mechanisms, such as blurring or encryption, are often insufficient, creating an undesirable trade-off where either privacy is compromised against advanced reconstruction attacks or data utility is critically degraded. To resolve this impasse, we propose RL-MoE, a novel framework that transforms sensitive visual data into privacy-preserving textual descriptions, eliminating the need for direct image transmission. RL-MoE uniquely combines a Mixture-of-Experts (MoE) architecture for nuanced, multi-aspect scene decomposition with a Reinforcement Learning (RL) agent that optimizes the generated text for a dual objective of semantic accuracy and privacy preservation. Extensive experiments demonstrate that RL-MoE provides superior privacy protection, reducing the success rate of replay attacks to just 9.4% on the CFP-FP dataset, while simultaneously generating richer textual content than baseline methods. Our work provides a practical and scalable solution for building trustworthy AI systems in privacy-sensitive domains, paving the way for more secure smart city and autonomous vehicle networks.

[29] Δ-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation

Jucheng Hu,Suorong Yang,Dongzhan Zhou

Main category: cs.CV

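TL;DR: Δ-AttnMask is an efficient data selection and augmentation framework that quantifies sample quality through attention-guided masking of hidden states. It needs no extra labels or training and markedly improves the efficiency and performance of visual instruction finetuning.

Details Motivation: Visual instruction finetuning (VIF) requires large amounts of multimodal data, while conventional approaches to data selection and augmentation are inefficient and poorly targeted. Δ-AttnMask aims to assess sample quality efficiently and address the data challenges of VIF.

Contribution: 1. Proposes the Δ-AttnMask framework, which assesses sample quality through attention masking and loss differences; 2. requires no extra labels, auxiliary models, or training, and applies across modalities and architectures; 3. shows experimentally that 20% of the data suffices to surpass full-dataset baselines, with 5x faster training.

Method: 1. Uses the model's attention to build masks over high-attention regions; 2. computes the loss difference (Δ) between the original hidden states and the masked states to quantify sample quality; 3. selects high-quality samples or augments data based on Δ.

Result: Across multiple VLMs and datasets, Δ-AttnMask reaches SOTA performance with only 20% of the data, speeds up training by 5x, and surpasses full-dataset baselines by 10.1% in overall accuracy.

Insight: Attention masks directly reflect what the model attends to and can guide data selection efficiently. The loss difference is an intrinsic sample-quality measure that needs no auxiliary models or extra training.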

Abstract: Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose Δ-AttnMask, a data-efficient framework that quantifies sample quality through attention-guided masking of the model’s hidden states, jointly evaluating image-text pairs without requiring domain labels, auxiliary models, or extra training. By computing loss differences (Δ) between the original states and states masked using high-attention regions, Δ-AttnMask intrinsically assesses sample quality. Experiments across multiple VLMs and datasets show that Δ-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy. Its model-agnostic and data-agnostic design ensures broad applicability across modalities and architectures.
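
The selection step described above can be pictured as scoring each sample by a loss difference Δ and keeping a fixed fraction of the data. The sketch below assumes the two per-sample losses have already been computed by the model. The sign convention (Δ = masked loss minus original loss), the rule of keeping the samples with the largest Δ, and all names are assumptions made for illustration, since the abstract does not specify them.

```python
def select_by_loss_delta(original_losses, masked_losses, keep_fraction=0.2):
    """Rank samples by delta = masked_loss - original_loss and keep the top fraction.

    Assumptions (hypothetical): a larger delta means that masking the
    high-attention hidden states hurt the prediction more, and such samples
    are treated as higher quality. Returns (kept sample indices, deltas).
    """
    if len(original_losses) != len(masked_losses):
        raise ValueError("loss lists must have the same length")
    deltas = [m - o for o, m in zip(original_losses, masked_losses)]
    num_keep = max(1, round(keep_fraction * len(deltas)))
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return sorted(ranked[:num_keep]), deltas


# Toy losses for five image-text pairs.
kept, deltas = select_by_loss_delta(
    original_losses=[1.0, 1.0, 1.0, 1.0, 1.0],
    masked_losses=[1.1, 3.0, 1.0, 1.5, 1.2],
    keep_fraction=0.4,
)
```

Because the score comes from two forward passes of the model being tuned, no labels or auxiliary scoring models are involved, which is consistent with the claim in the abstract.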

[30] Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method

Masoumeh Sharafi,Soufiane Belharbi,Houssem Ben Salem,Ali Etemad,Alessandro Lameiras Koerich,Marco Pedersoli,Simon Bacon,Eric Granger

Main category: cs.CV

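TL;DR: The paper proposes personalized feature translation (PFT), a source-free domain adaptation method that translates features in the latent space, to adapt a model when only unlabeled neutral-expression target data is available.

Details Motivation: Deep facial expression recognition models struggle in real-world settings, especially with subtle expressions and high inter-subject variability. Existing source-free domain adaptation methods are typically not designed for target data from a single (neutral) class, and image-based methods are computationally intensive.

Contribution: Proposes personalized feature translation (PFT), which translates features in the latent space, avoiding the complexity and noise of image synthesis and reducing computational overhead.

Method: Pre-trains a translator on source-domain data to transform subject-specific style features from one subject into another, preserving expression information through expression-consistency and style-aware objectives. The translator is then adapted on neutral target data without source data or image synthesis.

Result: PFT avoids image synthesis and produces discriminative embeddings optimized for classification, with lower computational overhead than image-based translation.

Insight: Feature translation in the latent space is an efficient and stable alternative, well suited to resource-constrained practical settings.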

Abstract: Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation.

[31] GANime: Generating Anime and Manga Character Drawings from Sketches with Deep Learning

Tai Vu,Robert Yang

Main category: cs.CV

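TL;DR: The paper studies deep learning methods for generating high-quality anime character drawings from sketches and finds that C-GAN performs best.

Details Motivation: In the anime and manga industry, producing fully colorized drawings from sketches is costly, and the authors hope deep learning can improve efficiency and quality.

Contribution: Compares several image-to-image translation methods and identifies C-GAN as the most effective for producing high-resolution drawings close to human work.

Method: Evaluates Neural Style Transfer, C-GAN, and CycleGAN through qualitative and quantitative analysis to identify the best model.

Result: C-GAN produces high-quality, high-resolution anime images close to those created by humans.

Insight: C-GAN shows strong generative ability on anime drawing tasks, offering the industry an efficient technical solution.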

Abstract: The process of generating fully colorized drawings from sketches is a large, usually costly bottleneck in the manga and anime industry. In this study, we examine multiple models for image-to-image translation between anime characters and their sketches, including Neural Style Transfer, C-GAN, and CycleGAN. By assessing them qualitatively and quantitatively, we find that C-GAN is the most effective model that is able to produce high-quality and high-resolution images close to those created by humans.

[32] MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

Fan Zhang,Zebang Cheng,Chong Deng,Haoxuan Li,Zheng Lian,Qian Chen,Huadai Liu,Wen Wang,Yi-Fan Zhang,Renrui Zhang,Ziyu Guo,Zhihong Zhu,Hao Wu,Haixin Wang,Yefeng Zheng,Xiaojiang Peng,Xian Wu,Kun Wang,Xiangang Li,Jieping Ye,Pheng-Ann Heng

Main category: cs.CV

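TL;DR: MME-Emotion is a systematic benchmark for evaluating the emotional intelligence of multimodal large language models (MLLMs), covering emotional understanding and reasoning tasks with over 6,000 video clips and question-answer pairs.

Details Motivation: Current emotional benchmarks leave open how well MLLMs generalize across scenarios and how well they reason about the triggers of emotional states. MME-Emotion is proposed to fill these gaps and advance MLLM emotional intelligence.

Contribution: Proposes a comprehensive emotional intelligence benchmark (MME-Emotion) with scalable capacity, diverse settings, and unified protocols, and evaluates 20 advanced MLLMs.

Method: Evaluates MLLMs on emotion recognition and reasoning with hybrid metrics analyzed through a multi-agent system framework, and defines eight emotional tasks with corresponding QA pairs.

Result: Current MLLMs show unsatisfactory emotional intelligence: the best model reaches only a 39.3% recognition score and a 56.0% CoT score. Generalist models derive their ability from general multimodal understanding, while specialist models can reach comparable performance through domain-specific post-training adaptation.

Insight: Generalist models rely on broad multimodal understanding for emotional tasks, while domain-adapted specialist models can match them. Emotional intelligence in MLLMs still needs further research.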

Abstract: Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, offering scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only 39.3% recognition score and 56.0% Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (e.g., Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (e.g., R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope that it can serve as a foundation for advancing MLLMs’ emotional intelligence in the future.

[33] Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

Zuoou Li,Weitong Zhang,Jingyuan Wang,Shuyuan Zhang,Wenjia Bai,Bernhard Kainz,Mengyun Qiao

Main category: cs.CV

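TL;DR: The paper addresses the vulnerability of multimodal large language models (MLLMs) to adversarial prompts, proposing a four-axis evaluation framework and a recursive rewriting strategy (BSD) that markedly raises attack success rate and output harmfulness.

Details Motivation: MLLM safety mechanisms often fail under adversarial prompts, yet current evaluation standards may overestimate attack effectiveness, so more accurate evaluation methods and more effective attack strategies are needed.

Contribution: 1) Proposes a four-axis evaluation framework (input on-topicness, input OOD intensity, output harmfulness, output refusal rate); 2) develops the BSD strategy, which balances relevance and novelty to improve attack effectiveness; 3) validates BSD on 13 MLLMs.

Method: 1) The four-axis framework quantifies attack effectiveness; 2) BSD recursively rewrites malicious prompts into sub-tasks and introduces OOD signals and visual cues.

Result: BSD improves attack success rate by 67% and harmfulness by 21% over previous methods, revealing an underappreciated weakness in current multimodal safety systems.

Insight: The balance between relevance and novelty is key: highly on-topic prompts are often blocked, while prompts that are too OOD fail to produce harmful content. BSD exploits this through structured decomposition and subtle perturbations.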

Abstract: Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as “successful” are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked by safety filters, whereas those that are too OOD often evade detection but fail to produce harmful content. However, prompts that balance relevance and novelty are more likely to evade filters and trigger dangerous output. Building on this insight, we develop a recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks, while introducing subtle OOD signals and visual cues that make the inputs harder to detect. BSD was tested across 13 commercial and open-source MLLMs, where it consistently led to higher attack success rates, more harmful outputs, and fewer refusals. Compared to previous methods, it improves success rates by 67% and harmfulness by 21%, revealing a previously underappreciated weakness in current multimodal safety systems.

[34] Towards Scalable Training for Handwritten Mathematical Expression Recognition

Haoyang Li,Jiaqing Li,Jialun Cao,Zongyuan Yang,Yongping Xiong

Main category: cs.CV

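TL;DR: The paper integrates limited handwritten formulas with large-scale LaTeX-rendered formulas to address data scarcity in handwritten mathematical expression recognition (HMER), and builds Tex80M, the largest formula dataset to date.

Details Motivation: HMER suffers from data scarcity because manual annotation is costly, which has prevented training at scale.

Contribution: 1. Proposes a scalable data engine that generates complex and consistent LaTeX sequences. 2. Builds Tex80M, the largest formula dataset to date (over 80 million samples). 3. Proposes TexTeller, the first HMER model trained at scale, which reaches SOTA performance on multiple benchmarks.

Method: Combines limited handwritten formulas with large-scale LaTeX-rendered formulas, uses the data engine to generate high-quality training data, and mix-trains Tex80M with a relatively small HME dataset.

Result: TexTeller reaches SOTA performance on nearly all benchmarks.

Insight: Data synthesis combined with mixed training can substantially improve HMER performance and offers a direction for learning when labeled data is scarce.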

Abstract: Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of Handwritten Mathematical Expression Recognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed Tex80M, comprising over 80 million high-quality training instances. Then we propose TexTeller, the first HMER model trained at scale, by mix-training Tex80M with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped TexTeller with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.

[35] Gradient-Direction-Aware Density Control for 3D Gaussian Splatting

Zheng Zhou,Yu-Jie Xiong,Chun-Ming Xia,Jia-Chen Zhang,Hong-Jian Zhan

Main category: cs.CV

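TL;DR: GDAGS is a gradient-direction-aware adaptive density control framework that addresses over-reconstruction and over-densification in 3DGS. It introduces a gradient coherence ratio and a nonlinear dynamic weighting mechanism, improving rendering quality while reducing memory consumption.

Details Motivation: Existing 3DGS methods suffer from over-reconstruction and over-densification in complex scenes, which degrades rendering quality and increases memory overhead. An efficient method for adaptive density control is needed.

Contribution: Proposes GDAGS, a gradient-direction-aware density control framework built on the gradient coherence ratio (GCR) and a nonlinear dynamic weighting mechanism, improving rendering quality and reducing memory consumption by 50%.

Method: The GCR distinguishes Gaussians whose gradient directions are concordant from those whose directions conflict. A dynamic weighting mechanism then adapts splitting and cloning: splitting prioritizes conflicting-gradient Gaussians to enhance geometric detail, while cloning favors concordant-direction Gaussians for structural completion.

Result: On diverse real-world benchmarks, GDAGS achieves better rendering quality, mitigates over-reconstruction and over-densification, and reduces memory consumption by 50%.

Insight: Gradient direction information can effectively guide adaptive density control of Gaussians, giving an efficient and compact solution for 3D scene reconstruction and rendering.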

Abstract: The emergence of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis through explicit scene representation, enabling real-time photorealistic rendering. However, existing approaches manifest two critical limitations in complex scenarios: (1) Over-reconstruction occurs when persistent large Gaussians cannot meet adaptive splitting thresholds during density control. This is exacerbated by conflicting gradient directions that prevent effective splitting of these Gaussians; (2) Over-densification of Gaussians occurs in regions with aligned gradient aggregation, leading to redundant component proliferation. This redundancy significantly increases memory overhead due to unnecessary data retention. We present Gradient-Direction-Aware Gaussian Splatting (GDAGS), a gradient-direction-aware adaptive density control framework to address these challenges. Our key innovations are the gradient coherence ratio (GCR), computed through normalized gradient vector norms, which explicitly discriminates Gaussians with concordant versus conflicting gradient directions, and a nonlinear dynamic weighting mechanism that leverages the GCR to enable gradient-direction-aware density control. Specifically, GDAGS prioritizes conflicting-gradient Gaussians during splitting operations to enhance geometric details while suppressing redundant concordant-direction Gaussians. Conversely, in cloning processes, GDAGS promotes concordant-direction Gaussian densification for structural completion while preventing conflicting-direction Gaussian overpopulation. Comprehensive evaluations across diverse real-world benchmarks demonstrate that GDAGS achieves superior rendering quality while effectively mitigating over-reconstruction, suppressing over-densification, and constructing compact scene representations with 50% reduced memory consumption through optimized Gaussian utilization.
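
The abstract only states that the GCR is computed through normalized gradient vector norms and does not give the formula. As a rough illustration, one plausible reading is the norm of the summed gradients divided by the sum of their norms. The formula, the function name, and the 2D toy gradients below are all assumptions, not the authors' definition.

```python
import math

def gradient_coherence_ratio(grads):
    """Return ||sum g_i|| / sum ||g_i|| for a list of 2D gradient vectors.

    Assumed formula (not taken from the paper): values near 1 mean the
    gradients point the same way, values near 0 mean they cancel out.
    """
    sum_x = sum(g[0] for g in grads)
    sum_y = sum(g[1] for g in grads)
    norm_of_sum = math.hypot(sum_x, sum_y)
    sum_of_norms = sum(math.hypot(g[0], g[1]) for g in grads)
    if sum_of_norms == 0.0:
        return 0.0  # no gradient signal at all
    return norm_of_sum / sum_of_norms

# Toy per-view gradients for two hypothetical Gaussians.
concordant = [(1.0, 0.0), (2.0, 0.0), (0.5, 0.0)]
conflicting = [(1.0, 0.0), (-1.0, 0.0)]

print(gradient_coherence_ratio(concordant))
print(gradient_coherence_ratio(conflicting))
```

Under this reading, a low ratio would flag a Gaussian whose gradients conflict (a splitting candidate) and a high ratio one whose gradients agree (a cloning candidate), matching the behavior the abstract describes.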

[36] FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

Fengxian Ji,Jingpu Yang,Zirui Song,Yuanxi Wang,Zhexuan Cui,Yuke Li,Qian Jiang,Miao Fang,Xiuying Chen

Main category: cs.CV

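TL;DR: FineState-Bench is the first benchmark focused on fine-grained control in GUI agents. It fills the gap left by existing evaluation frameworks, which overlook fine-grained control, and provides comprehensive assessment through multi-platform tasks and a Visual Diagnostic Assistant (VDA).

Details Motivation: Current evaluation frameworks for GUI agents focus too heavily on coarse-grained task completion and neglect the fine-grained control capabilities that real-world applications need.

Contribution: 1. Proposes FineState-Bench, the first evaluation standard for fine-grained GUI agent operations; 2. develops the VDA tool, which enables a quantitative, decoupled analysis of perception and localization capabilities; 3. shows experimentally that the most advanced models reach only 32.8% fine-grained interaction accuracy and that visual localization is the main bottleneck.

Method: 1. Builds a multi-platform (desktop, Web, mobile) evaluation set of 2257 tasks; 2. designs a four-phase indicator that assesses the full path from perception to control; 3. develops the VDA tool to quantify the impact of visual capabilities.

Result: Experiments show that ideal visual localization raises the success rate of Gemini-2.5-Flash by 14.9%, confirming that visual localization is the main bottleneck for current GUI agents.

Insight: Fine-grained control is key to deploying GUI agents in practice, and the visual localization ability of current models still needs substantial improvement.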

Abstract: With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI proxy operations, designed to quantify fine-grained control. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using our VDA in controlled experiments, quantifying the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash’s success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI proxies is basic visual positioning capability. All resources are fully open-source. GitHub: https://github.com/AnonymousThewarehouse/FineState-Bench Hugging Face: https://huggingface.co/datasets/Willtime2006/Static-FineBench

[37] Beyond Blanket Masking: Examining Granularity for Privacy Protection in Images Captured by Blind and Low Vision Users

Jeffri Murrugarra-LLerena,Haoran Niu,K. Suzanne Barber,Hal Daumé III,Yang Trista Cao,Paola Cascante-Bonilla

Main category: cs.CV

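TL;DR: This paper proposes FiG-Priv, a fine-grained privacy protection framework that selectively masks only high-risk private information, improving both the usability and the privacy protection of visual assistant systems for blind and low vision users.

Details Motivation: As visual language models (VLMs) become more prevalent, blind and low vision users may unknowingly capture private information when using visual assistants. Existing privacy protection methods rely on coarse-grained segmentation, which reduces image usability.

Contribution: Proposes the FiG-Priv framework, which combines fine-grained segmentation with data-driven risk scoring to selectively mask high-risk private information while preserving low-risk content.

Method: Fine-grained segmentation combined with a risk scoring mechanism, evaluated on the BIV-Priv-Seg dataset.

Result: FiG-Priv preserves 26% more image content and improves the ability of VLMs to provide useful responses by 11% and to identify image content by 45%, while ensuring privacy protection.

Insight: Fine-grained privacy protection can substantially improve user experience and system functionality while avoiding the usability loss caused by over-masking.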

Abstract: As visual assistant systems powered by visual language models (VLMs) become more prevalent, concerns over user privacy have grown, particularly for blind and low vision users who may unknowingly capture personal private information in their images. Existing privacy protection methods rely on coarse-grained segmentation, which uniformly masks entire private objects, often at the cost of usability. In this work, we propose FiG-Priv, a fine-grained privacy protection framework that selectively masks only high-risk private information while preserving low-risk information. Our approach integrates fine-grained segmentation with a data-driven risk scoring mechanism. We evaluate our framework using the BIV-Priv-Seg dataset and show that FiG-Priv preserves +26% of image content, enhancing the ability of VLMs to provide useful responses by 11% and identify the image content by 45%, while ensuring privacy protection. Project Page: https://artcs1.github.io/VLMPrivacy/

[38] Harnessing Input-Adaptive Inference for Efficient VLN

Dongwoo Kang,Akhil Perincherry,Zachary Coalson,Aiden Gabriel,Stefan Lee,Sanghyun Hong

Main category: cs.CV

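TL;DR: This paper proposes an input-adaptive inference method that improves the efficiency of vision-and-language navigation (VLN) models. Three adaptive algorithms target spatial, intra-model, and temporal efficiency respectively, cutting computation by more than 2× on multiple benchmarks while maintaining performance.

Details Motivation: Multi-modal transformer models perform well on VLN tasks, but their computational demands are a bottleneck for practical deployment. This work aims to improve efficiency through input-adaptive mechanisms without sacrificing performance.

Contribution: Proposes an input-adaptive navigation method with three algorithms: selective processing of panoramic views, importance-based adaptive thresholding for early exit, and a caching mechanism for view processing. Together they substantially reduce computational cost.

Method: 1. Selectively process panoramic views (spatial efficiency); 2. apply importance-based adaptive thresholds for early exit (intra-model efficiency); 3. cache already processed views to avoid recomputation (temporal efficiency).

Result: On seven VLN benchmarks, the method reduces computation by more than 2× across three off-the-shelf agents in both standard and continuous environments, without substantial performance degradation.

Insight: Input-adaptive mechanisms are an effective way to make VLN models more efficient. Optimizing at the spatial, intra-model, and temporal levels can substantially cut computation while preserving model performance.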

Abstract: An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for the early-exit methods. (3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views previously seen by the agent. In evaluations on seven VLN benchmarks, we demonstrate over a 2× reduction in computation across three off-the-shelf agents in both standard and continuous environments. Our code is publicly available at https://github.com/secure-ai-systems-group/adaptive-vision-and-language-navigation.
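
The temporal-efficiency idea (item 3 in the abstract) can be illustrated with a simple memoization sketch. This is not the authors' implementation: the `ViewCache` class, the `(node, heading)` cache key, and the toy encoder are hypothetical stand-ins chosen only to show how a revisited view can skip re-encoding.

```python
class ViewCache:
    """Memoize per-view features so a revisited view is not re-encoded."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn  # hypothetical expensive visual encoder
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, view_id, image):
        # Return the cached feature when this view was seen before.
        if view_id in self.store:
            self.hits += 1
            return self.store[view_id]
        self.misses += 1
        feature = self.encode_fn(image)
        self.store[view_id] = feature
        return feature

def toy_encoder(image):
    # Stand-in for a vision backbone: mean pixel value.
    return sum(image) / len(image)

cache = ViewCache(toy_encoder)
# The agent visits node_a, moves to node_b, then returns to node_a.
trajectory = [("node_a", 0), ("node_b", 3), ("node_a", 0)]
images = {("node_a", 0): [1, 2, 3], ("node_b", 3): [4, 5, 6]}
features = [cache.get(view, images[view]) for view in trajectory]
print(features, cache.hits, cache.misses)
```

In this toy run the third step reuses the feature computed at the first step, so the encoder runs twice for three observations.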

[39] SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning

Alexandre Brown,Glen Berseth

Main category: cs.CV

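TL;DR: The paper proposes SegDAC, a segmentation-driven actor-critic method that combines Segment Anything (SAM) and YOLO-World for object-centric decomposition and semantic grounding, improving the generalization and sample efficiency of visual reinforcement learning.

Details Motivation: Visual reinforcement learning (RL) must cope with high-dimensional inputs and noisy rewards, and it remains unclear how to integrate existing large perception models into RL effectively for visual generalization and better sample efficiency.

Contribution: 1. Proposes SegDAC, which combines SAM and YOLO-World for object segmentation and semantic grounding; 2. designs a transformer architecture that supports a dynamic number of segments; 3. learns which segments matter through online RL, without human labels.

Method: SegDAC uses SAM for object-centric segmentation and YOLO-World to ground segments semantically via text prompts. Its transformer architecture supports a dynamic number of segments and learns which regions to focus on through online RL.

Result: On a ManiSkill3 benchmark with strong visual perturbations, SegDAC substantially improves visual generalization, doubles prior performance on the hardest setting, and matches or surpasses existing methods in sample efficiency.

Insight: Through object-centric decomposition and semantic grounding, SegDAC shows that dynamically selecting key objects in complex visual environments can substantially improve RL performance and generalization.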

Abstract: Visual reinforcement learning (RL) is challenging due to the need to learn both perception and actions from high-dimensional inputs and noisy rewards. Although large perception models exist, how to integrate them effectively into RL for visual generalization and improved sample efficiency remains unclear. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. By evaluating SegDAC over a challenging visual generalization benchmark using ManiSkill3, which covers diverse manipulation tasks under strong visual perturbations, we demonstrate that SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting and matching or surpassing prior methods in sample efficiency across all evaluated tasks.

[40] Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model

Yifan Jiang,Ahmad Shariftabrizi,Venkata SK. Manem

Main category: cs.CV

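TL;DR: Lung-DDPM+ is an improved diffusion probabilistic model for efficiently synthesizing high-quality thoracic CT images, addressing the low efficiency and anatomical imprecision of existing generative models.

Details Motivation: Existing generative models for lung CT image synthesis are inefficient and anatomically imprecise, which limits their clinical applicability. Lung-DDPM+ aims to address both problems and improve synthesis efficiency and quality.

Contribution: Proposes Lung-DDPM+, a diffusion probabilistic model guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, which substantially improves sampling efficiency while maintaining image quality.

Method: Combines nodule semantic layout guidance with pulmonary DPM-solver acceleration, focusing on lesion areas to balance sampling efficiency and quality.

Result: On the LIDC-IDRI dataset, Lung-DDPM+ achieves 8× fewer FLOPs, 6.8× lower GPU memory consumption, and 14× faster sampling than Lung-DDPM, while maintaining sample quality comparable to SOTA models in downstream segmentation tasks.

Insight: With semantic layout guidance and a localized acceleration strategy, Lung-DDPM+ shows potential for efficient, high-fidelity medical image synthesis and may extend to other lesion generation tasks.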

Abstract: Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8× fewer FLOPs (floating point operations), 6.8× lower GPU memory consumption, and 14× faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM-PLUS.

[41] UltraLight Med-Vision Mamba for Classification of Neoplastic Progression in Tubular Adenomas

Aqsa Sultana,Nordin Abouzahra,Ahmed Rahu,Brian Shula,Brandon Combs,Derrick Forchetti,Theus Aspiras,Vijayan K. Asari

Main category: cs.CV

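TL;DR: The paper presents Ultralight Med-Vision Mamba, a state-space model (SSM) for classifying neoplastic progression in tubular adenomas, with efficient computation and good scalability.

Details Motivation: Identifying precancerous polyps is essential for lowering colorectal cancer risk, and existing deep learning methods still fall short in modeling long- and short-range dependencies and in image generalization.

Contribution: Proposes an efficient SSM that is strong at modeling long- and short-range dependencies and at image generalization, is suited to whole slide image analysis, and can be deployed in real time in clinical settings.

Method: Ultralight Med-Vision Mamba, an SSM-based model optimized for computational efficiency and scalability.

Result: The model performs strongly in adenoma classification and stratification, improving the accuracy of risk assessment.

Insight: Efficient state-space models can substantially improve the performance and practicality of medical image analysis and support personalized medicine.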

Abstract: Identification of precancerous polyps during routine colonoscopy screenings is vital for their excision, lowering the risk of developing colorectal cancer. Advanced deep learning algorithms enable precise adenoma classification and stratification, improving risk assessment accuracy and enabling personalized surveillance protocols that optimize patient outcomes. Ultralight Med-Vision Mamba, a state-space based model (SSM), has excelled in modeling long- and short-range dependencies and image generalization, critical factors for analyzing whole slide images. Furthermore, Ultralight Med-Vision Mamba’s efficient architecture offers advantages in both computational speed and scalability, making it a promising tool for real-time clinical deployment.

Anushka Bhatt

Main category: cs.CV

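TL;DR: The paper proposes a real-time communication system based on blink detection and Morse code translation to assist people with severe motor impairments.

Details Motivation: To give people with severe motor impairments a low-cost, real-time assistive communication method.

Contribution: Develops a system based on a standard webcam and computer vision that translates eye blinks into Morse code in real time.

Method: The system uses a standard webcam to detect blinks and classify them as short (dot) or long (dash), then decodes them into alphanumeric characters.

Result: In experiments with five participants, the system achieved 62% decoding accuracy and response times of 18-20 seconds.

Insight: Camera-based blink detection is a viable, low-cost assistive communication solution.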

Abstract: This study proposes a real-time system that translates voluntary eye blinks into Morse code, enabling communication for individuals with severe motor impairments. Using a standard webcam and computer vision, the system detects and classifies blinks as short (dot) or long (dash), then decodes them into alphanumeric characters. Experiments with five participants show 62% decoding accuracy and response times of 18-20 seconds, demonstrating a viable, low-cost assistive communication method.
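
The decoding stage described above can be sketched as follows, assuming blink durations and pauses have already been measured by some upstream detector. The 0.4 s long-blink threshold, the 1.0 s letter gap, and the function names are illustrative assumptions, not values from the study.

```python
MORSE_TO_CHAR = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E",
    "..-.": "F", "--.": "G", "....": "H", "..": "I", ".---": "J",
    "-.-": "K", ".-..": "L", "--": "M", "-.": "N", "---": "O",
    ".--.": "P", "--.-": "Q", ".-.": "R", "...": "S", "-": "T",
    "..-": "U", "...-": "V", ".--": "W", "-..-": "X", "-.--": "Y",
    "--..": "Z", "-----": "0", ".----": "1", "..---": "2",
    "...--": "3", "....-": "4", ".....": "5", "-....": "6",
    "--...": "7", "---..": "8", "----.": "9",
}

def classify_blink(duration_s, long_threshold_s=0.4):
    """Map a blink duration to a dash (long) or a dot (short)."""
    return "-" if duration_s >= long_threshold_s else "."

def decode_blinks(blinks, letter_gap_s=1.0):
    """Decode (blink_duration_s, pause_after_s) tuples into text.

    A pause of at least letter_gap_s closes the current letter;
    unknown symbol sequences decode to "?".
    """
    text, symbol = [], ""
    for duration_s, pause_s in blinks:
        symbol += classify_blink(duration_s)
        if pause_s >= letter_gap_s:
            text.append(MORSE_TO_CHAR.get(symbol, "?"))
            symbol = ""
    if symbol:  # flush the last, unterminated letter
        text.append(MORSE_TO_CHAR.get(symbol, "?"))
    return "".join(text)

# Three short, three long, three short blinks.
sos = [(0.1, 0.2), (0.1, 0.2), (0.1, 1.5),
       (0.6, 0.2), (0.6, 0.2), (0.6, 1.5),
       (0.1, 0.2), (0.1, 0.2), (0.1, 0.0)]
print(decode_blinks(sos))
```

Fixed thresholds keep the sketch simple; a real system would likely need per-user calibration, since blink speed varies between people.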

[43] FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition

Md. Milon Islam,Md Rezwanul Haque,S M Taslim Uddin Raju,Fakhri Karray

Main category: cs.CV

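TL;DR: FusionEnsemble-Net is an attention-based ensemble of spatiotemporal networks for multimodal sign language recognition. It dynamically fuses visual and motion data to improve accuracy and reaches 99.44% test accuracy on an Italian Sign Language dataset.

Details Motivation: Sign language recognition in healthcare communication is challenging and requires frameworks that can accurately interpret complex multimodal gestures.

Contribution: Proposes FusionEnsemble-Net, a novel attention-based ensemble of spatiotemporal networks that dynamically fuses multimodal data to improve recognition accuracy.

Method: Four different spatiotemporal networks process RGB video and radar data synchronously. An attention-based fusion module fuses the features of both modalities, and an ensemble of classifiers improves robustness.

Result: Achieves 99.44% test accuracy on MultiMeDaLIS, an Italian Sign Language dataset, outperforming existing methods.

Insight: Combining attention mechanisms with an ensemble of networks can substantially improve the accuracy and robustness of multimodal gesture recognition.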

Abstract: Accurate recognition of sign language in healthcare communication poses a significant challenge, requiring frameworks that can accurately interpret complex multimodal gestures. To deal with this, we propose FusionEnsemble-Net, a novel attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data to enhance recognition accuracy. The proposed approach processes RGB video and range Doppler map radar modalities synchronously through four different spatiotemporal networks. For each network, features from both modalities are continuously fused using an attention-based fusion module before being fed into an ensemble of classifiers. Finally, the outputs of these four different fused channels are combined in an ensemble classification head, thereby enhancing the model’s robustness. Experiments demonstrate that FusionEnsemble-Net outperforms state-of-the-art approaches with a test accuracy of 99.44% on the large-scale MultiMeDaLIS dataset for Italian Sign Language. Our findings indicate that an ensemble of diverse spatiotemporal networks, unified by attention-based fusion, yields a robust and accurate framework for complex, multimodal isolated gesture recognition tasks. The source code is available at: https://github.com/rezwanh001/Multimodal-Isolated-Italian-Sign-Language-Recognition.
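
The abstract does not specify the internals of its attention-based fusion module. As a generic illustration of the idea only, the sketch below weights each modality's feature vector by a softmax over dot-product scores against a query vector. The scoring rule, the fixed query, and all names are assumptions rather than the paper's design.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features, query):
    """Fuse modality features with softmax(dot(feature, query)) weights."""
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights

rgb_feat = [1.0, 0.0]    # toy feature from the RGB branch
radar_feat = [0.0, 1.0]  # toy feature from the radar branch
fused, weights = attention_fuse([rgb_feat, radar_feat], query=[1.0, 1.0])
print(fused, weights)
```

With a query that scores both modalities equally, the fused vector is their average; a learned query would shift weight toward whichever modality is more informative for a given input.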

[44] X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents

Guoxian Song,Hongyi Xu,Xiaochen Zhao,You Xie,Tianpei Gu,Zenan Li,Chenxu Zhang,Linjie Luo

Main category: cs.CV

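TL;DR: X-UniMotion proposes a unified implicit latent representation for whole-body human motion (facial expressions, body poses, and hand gestures) and achieves high-fidelity cross-identity motion transfer through a self-supervised learning framework.

Details Motivation: Prior motion transfer methods rely on explicit skeletal poses and heuristic cross-identity adjustments, which makes high-fidelity, detailed cross-identity motion transfer difficult. X-UniMotion addresses this by learning identity-agnostic motion latents.

Contribution: 1. Proposes a unified, expressive implicit latent representation (X-UniMotion); 2. designs a self-supervised, end-to-end framework that jointly learns the motion encoder and the latent representation; 3. disentangles motion from identity using auxiliary decoders and data augmentation.

Method: 1. Encodes multi-granular motion with four disentangled latent tokens (one for facial expression, one for body pose, and one for each hand); 2. jointly trains the motion encoder and a video generative model in a self-supervised framework; 3. introduces auxiliary decoders and data augmentation (2D spatial and color augmentations, synthetic 3D renderings) to promote motion-identity disentanglement.

Result: Experiments show that X-UniMotion outperforms existing methods in motion fidelity and identity preservation and produces highly expressive animations.

Insight: Motion and identity can be disentangled through implicit latents and self-supervised learning, without explicit pose estimation or heuristic adjustments, which offers a new direction for cross-identity motion transfer.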

Abstract: We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens – one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion-identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.

[45] DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection

Kang Ni,Minrui Zou,Yuxuan Li,Xiang Li,Kehua Guo,Ming-Ming Cheng,Yimian Dai

Main category: cs.CV

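TL;DR: DenoDet V2 uses a phase-amplitude cross denoising mechanism to improve synthetic aperture radar (SAR) object detection while halving model complexity.

Details Motivation: Coherent noise is a major challenge in SAR object detection, and most existing methods rely on analyzing or enhancing spatial-domain features. DenoDet V2 takes a different perspective: deconstructing and modulating features in the transform domain.

Contribution: 1. Proposes a band-wise mutual modulation mechanism that exploits the complementary nature of phase and amplitude; 2. achieves reciprocal enhancement between phase and amplitude spectra through a carefully designed attention architecture; 3. reaches state-of-the-art performance on multiple SAR datasets.

Method: DenoDet V2 designs a band-wise mutual modulation mechanism that deconstructs and modulates features in the transform domain via an attention architecture, using the complementary nature of phase and amplitude information for reciprocal enhancement.

Result: On the SARDet-100K dataset, DenoDet V2 improves on DenoDet V1 by 0.8% while halving model complexity.

Insight: Exploiting the complementary nature of phase and amplitude information in the transform domain can improve robustness to coherent noise while simplifying the model.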

Abstract: One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on the SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half. The code is available at https://github.com/GrokCV/GrokSAR.
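
To make the amplitude/phase vocabulary concrete, the sketch below splits a signal's spectrum into amplitude and phase and then recombines them. It assumes a plain 1D discrete Fourier transform as the transform domain, which the abstract does not specify, and it does not attempt the paper's mutual modulation step. All function names are illustrative.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]

def split_amplitude_phase(spectrum):
    """Decompose complex coefficients into amplitude and phase lists."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

def merge_amplitude_phase(amplitude, phase):
    """Rebuild complex coefficients from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]

signal = [1.0, 2.0, 0.0, -1.0]
amplitude, phase = split_amplitude_phase(dft(signal))
restored = [c.real for c in idft(merge_amplitude_phase(amplitude, phase))]
print(amplitude, phase, restored)
```

Because the split is lossless, any modulation applied to one spectrum (for example, reweighting amplitude bands using information from the phase) can be mapped back to the feature domain by the same merge and inverse transform.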

[46] Skyshield: Event-Driven Submillimetre Thin Obstacle Detection for Drone Flight Safety

Zhengli Zhang,Xinyu Luo,Yuchen Sun,Wenhua Ding,Dongyu Huang,Xinlei Chen

Main category: cs.CV

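TL;DR: SkyShield is an event-driven framework for detecting submillimeter thin obstacles for drones. It achieves precise detection with a lightweight U-Net architecture and a novel Dice-Contour Regularization Loss.

Details Motivation: Drones flying in complex environments face submillimeter thin obstacles (such as steel wires and kite strings) that conventional sensors struggle to detect, so an efficient solution is needed.

Contribution: Proposes SkyShield, an event-driven, end-to-end framework that combines a lightweight U-Net with a Dice-Contour Regularization Loss for efficient detection.

Method: Exploits the distinctive features that thin obstacles present in the event stream, and designs a lightweight U-Net and a Dice-Contour Regularization Loss to improve detection precision.

Result: Experiments show a mean F1 score of 0.7088 with a latency of only 21.2 ms, suitable for deployment on edge and mobile platforms.

Insight: Event-driven perception is promising for thin obstacle detection, and lightweight design and regularization are key.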

Abstract: Drones operating in complex environments face a significant threat from thin obstacles, such as steel wires and kite strings at the submillimeter level, which are notoriously difficult for conventional sensors like RGB cameras, LiDAR, and depth cameras to detect. This paper introduces SkyShield, an event-driven, end-to-end framework designed for the perception of submillimeter scale obstacles. Drawing upon the unique features that thin obstacles present in the event stream, our method employs a lightweight U-Net architecture and an innovative Dice-Contour Regularization Loss to ensure precise detection. Experimental results demonstrate that our event-based approach achieves a mean F1 score of 0.7088 with a low latency of 21.2 ms, making it ideal for deployment on edge and mobile platforms.
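
The abstract names a Dice-Contour Regularization Loss without defining it. The sketch below shows only the standard Dice component on flattened binary masks, which is well suited to thin structures because it is insensitive to the large background area. The contour regularization term is not reproduced, and the smoothing constant is an assumption.

```python
def dice_coefficient(pred, target, eps=1e-6):
    """Dice overlap between two flattened masks with values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)

def dice_loss(pred, target):
    """Standard Dice loss: 0 for perfect overlap, near 1 for none."""
    return 1.0 - dice_coefficient(pred, target)

# Toy flattened masks: a thin wire occupies two of six pixels.
target = [0, 0, 1, 0, 0, 1]
perfect = [0, 0, 1, 0, 0, 1]
half = [0, 0, 1, 0, 0, 0]

print(dice_loss(perfect, target))
print(dice_loss(half, target))
```

Missing one of the two wire pixels drops the Dice coefficient to about 2/3, whereas plain pixel accuracy would still report 5/6, which is why overlap-based losses are a common choice for sparse targets.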

[47] Autonomous AI Bird Feeder for Backyard Biodiversity Monitoring

El Mustapha Mansouri

Main category: cs.CV

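TL;DR: This paper presents a low-cost, locally run system for autonomous backyard bird monitoring that combines a motion-triggered camera with object detection and classification to identify birds with high accuracy.

Details Motivation: To monitor bird diversity in urban gardens cheaply and efficiently while preserving privacy and avoiding cloud fees, the author developed this on-premise system.

Contribution: The main contributions are a motion-triggered local processing pipeline, a detection-then-classification approach that combines Detectron2 with EfficientNet-B3, and a physical feeder design suited to small birds.

Method: An IP camera triggers the upload of video clips, Detectron2 localizes birds and crops the regions, and a fine-tuned EfficientNet-B3 model classifies the crops.

Result: The classifier reaches about 99.5% validation accuracy on a 40-species Belgian subset and about 88% top-1 accuracy in field tests, demonstrating the feasibility of citizen-science-grade monitoring at home.

Insight: Detector-guided cropping improves classification accuracy over raw-frame classification, and the small entry ports of the feeder reduce nuisance triggers, providing a practical reference for similar systems.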

Abstract: This paper presents a low-cost, on-premise system for autonomous backyard bird monitoring in Belgian urban gardens. A motion-triggered IP camera uploads short clips via FTP to a local server, where frames are sampled and birds are localized with Detectron2; cropped regions are then classified by an EfficientNet-B3 model fine-tuned on a 40-species Belgian subset derived from a larger Kaggle corpus. All processing runs on commodity hardware without a discrete GPU, preserving privacy and avoiding cloud fees. The physical feeder uses small entry ports (30 mm) to exclude pigeons and reduce nuisance triggers. Detector-guided cropping improves classification accuracy over raw-frame classification. The classifier attains high validation performance on the curated subset (about 99.5 percent) and delivers practical field accuracy (top-1 about 88 percent) on held-out species, demonstrating feasibility for citizen-science-grade biodiversity logging at home.

[48] Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving

Guangxun Zhu,Shiyu Fan,Hang Dai,Edmond S. L. Ho

Main category: cs.CV

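TL;DR: Waymo-3DSkelMo is the first large-scale, high-quality 3D skeletal motion dataset derived from the Waymo Perception dataset. It provides temporally coherent motion with explicit interaction semantics for pedestrian interaction modeling in autonomous driving.

Details Motivation: Existing 3D motion datasets mostly rely on poses estimated from monocular RGB video, which suffer from occlusion and a lack of temporal continuity and therefore yield unrealistic, low-quality motion. Autonomous driving needs high-quality datasets to understand complex pedestrian interactions.

Contribution: (1) Presents the first large-scale, high-quality 3D skeletal motion dataset; (2) uses 3D human body shape and motion priors to improve the quality of 3D pose sequences extracted from LiDAR point clouds; (3) establishes 3D pose forecasting benchmarks.

Method: Uses 3D human body shape and motion priors to extract high-quality, temporally coherent 3D pose sequences from LiDAR point clouds, with explicit interaction semantics.

Result: The dataset covers more than 800 real driving scenarios and over 14,000 seconds, with interactions among an average of 27 agents per scene (up to 250 in the largest scene). Benchmark results demonstrate its value.

Insight: Combining 3D priors with LiDAR data can substantially improve 3D motion quality and provides an important resource for understanding pedestrian behavior in complex urban environments.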

Abstract: Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo-3DSkelMo

[49] What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset

Yuxiao Wang,Yu Lei,Wolin Liang,Weiying Xue,Zhenao Wei,Nan Zhuang,Qi Liu

Main category: cs.CV

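TL;DR: The paper introduces a new vision task that jointly predicts high-level action semantics and fine-grained body-part contact regions. A new dataset and the PaIR-Net framework (with three modules: CPAM, PGCS, and IIM) substantially improve performance.

Details Motivation: Existing methods fail to jointly model action semantics and their spatial context, which limits how comprehensively actions can be understood. The paper fills this gap with a task that learns actions and contact regions in a unified way.

Contribution: 1. Proposes a new vision task that jointly learns action semantics and contact regions; 2. proposes the PaIR-Net framework (CPAM, PGCS, and IIM); 3. releases the PaIR dataset (13,979 images, 654 actions, 80 object categories, and 17 body parts).

Method: PaIR-Net combines three modules: CPAM identifies contact-relevant body parts, PGCS performs pixel-wise contact segmentation, and IIM integrates global interaction relationships.

Result: Experiments show that PaIR-Net significantly outperforms baseline methods, and ablation studies confirm the effectiveness of each module.

Insight: Understanding actions requires considering both high-level semantics and fine-grained spatial context. The new task and dataset give future research a direction.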

Abstract: People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider what action is occurring and where it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component. The code and dataset will be released upon publication.

[50] Animate-X++: Universal Character Image Animation with Dynamic Backgrounds

Shuai Tan,Biao Gong,Zhuoxin Liu,Yan Wang,Xi Chen,Yifan Feng,Hengshuang Zhao

Main category: cs.CV

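TL;DR: The paper proposes Animate-X++, a DiT-based universal character animation framework that handles many character types, including anthropomorphic characters, and generates dynamic backgrounds through multi-task training.

Details Motivation: Existing character animation methods mainly target human figures and cannot handle dynamic backgrounds, which limits their use in industries such as gaming and entertainment.

Contribution: 1. Proposes the Animate-X++ framework, which supports anthropomorphic character animation; 2. introduces the Pose Indicator to enhance motion representation; 3. uses multi-task training to generate dynamic backgrounds; 4. introduces the A2Bench benchmark.

Method: 1. Builds on the DiT framework; 2. captures the motion pattern of the driving video with the Pose Indicator, in both an implicit and an explicit manner; 3. jointly trains the animation and TI2V tasks; 4. applies a partial parameter training strategy.

Result: Experiments demonstrate the superiority and effectiveness of Animate-X++ for universal character animation and dynamic background generation.

Insight: 1. Character animation needs to model motion patterns, not only pose sequences; 2. text-driven dynamic backgrounds make videos more realistic; 3. joint multi-task training helps the model generalize.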

Abstract: Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods apply only to human figures and usually do not generalize well to anthropomorphic characters commonly used in industries like gaming and entertainment. Furthermore, previous methods could only generate videos with static backgrounds, which limits the realism of the videos. For the first challenge, our in-depth analysis attributes this limitation to insufficient modeling of motion: these methods cannot comprehend the movement pattern of the driving video and therefore impose a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X++, a universal animation framework based on DiT for various character types, including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both an implicit and an explicit manner. The former leverages CLIP visual features of a driving video to extract its gist of motion, like the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of DiT by simulating possible inputs in advance that may arise during inference. For the second challenge, we introduce a multi-task training strategy that jointly trains the animation and TI2V tasks. Combined with the proposed partial parameter training, this approach achieves not only character animation but also text-driven background dynamics, making the videos more realistic. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A2Bench) to evaluate the performance of Animate-X++ on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X++.

[51] IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding

Junxian Li,Beining Xu,Di Zhang

Main category: cs.CV

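TL;DR: This paper proposes IAG, a novel input-aware backdoor attack on vision-language models (VLMs) that manipulates the model's visual grounding behavior so that it ignores the user's query and grounds a specific target object. An adaptive trigger generator and a reconstruction loss keep the attack stealthy, and experiments show a high attack success rate with little impact on clean samples.

Details Motivation: VLMs perform well on visual grounding, but their security, especially against backdoor attacks, remains underexplored. This paper aims to fill that gap with a backdoor attack that targets VLM visual grounding.

Contribution: 1. Proposes IAG, an input-aware backdoor attack that embeds the semantic information of the attack target into the image through an adaptive trigger generator.
2. Designs a reconstruction loss that keeps the attack stealthy by minimizing the visual difference between poisoned and clean images.
3. Provides a unified method for generating attack data and validates the attack's effectiveness and transferability experimentally.

Method: 1. A text-conditional U-Net generates adaptive triggers that embed the semantics of the attack target's description into the image.
2. A reconstruction loss minimizes the visual difference between the poisoned image and the original image.
3. A unified attack data generation framework is designed, and the attack is evaluated on several VLMs (e.g., InternVL-2.5-8B and Ferret-7B).

Result: On InternVL-2.5-8B, the attack success rate (ASR@0.5) exceeds 65%, and the attack also shows promising results on Ferret-7B and LlaVA-1.5-7B. The accuracy drop on clean samples is small.

Insight: 1. An input-aware backdoor attack can effectively manipulate the visual grounding behavior of VLMs while remaining stealthy.
2. IAG transfers well and applies to multiple VLMs.
3. The method offers a new perspective for VLM security research and exposes the potential threat of backdoor attacks.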

Abstract: Vision-language models (VLMs) have shown significant advancements in tasks such as visual grounding, where they localize specific objects in images based on natural language queries and images. However, security issues in visual grounding tasks for VLMs remain underexplored, especially in the context of backdoor attacks. In this paper, we introduce a novel input-aware backdoor attack method, IAG, designed to manipulate the grounding behavior of VLMs. This attack forces the model to ground a specific target object in the input image, regardless of the user’s query. We propose an adaptive trigger generator that embeds the semantic information of the attack target’s description into the original image using a text-conditional U-Net, thereby overcoming the open-vocabulary attack challenge. To ensure the attack’s stealthiness, we utilize a reconstruction loss to minimize visual discrepancies between poisoned and clean images. Additionally, we introduce a unified method for generating attack data. IAG is evaluated theoretically and empirically, demonstrating its feasibility and effectiveness. Notably, our ASR@0.5 on InternVL-2.5-8B reaches over 65% on various testing sets. IAG also shows promising potential on manipulating Ferret-7B and LlaVA-1.5-7B with very little accuracy decrease on clean samples. Extensive specific experiments, such as ablation study and potential defense, also indicate the robustness and transferability of our attack.
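The abstract reports ASR@0.5 without defining it. A minimal sketch of one common reading, assumed here and not confirmed by the source: an attack counts as successful when the box predicted on a poisoned image overlaps the attacker's target box with an intersection over union (IoU) of at least 0.5. The function names and the `(x1, y1, x2, y2)` box layout are illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def attack_success_rate(pred_boxes, target_boxes, threshold=0.5):
    """Fraction of poisoned samples whose prediction hits the attack target.

    Assumption: "hit" means IoU(prediction, attacker's target) >= threshold.
    """
    hits = sum(iou(p, t) >= threshold for p, t in zip(pred_boxes, target_boxes))
    return hits / len(pred_boxes)
```

Under this reading, clean-sample accuracy would be measured the same way but against the box the user's query actually refers to.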

[52] RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

Wen Huang,Jiarui Yang,Tao Dai,Jiawei Li,Shaoxiong Zhan,Bin Wang,Shu-Tao Xia

Main category: cs.CV

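TL;DR: RelayFormer is a unified local-global attention framework for scalable image and video manipulation localization that processes high-resolution or long-duration inputs efficiently across modalities.

Details Motivation: Existing visual manipulation localization methods lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently.

Contribution: Proposes RelayFormer, a modular architecture that supports scalable cross-modal processing through the GLoRA mechanism and lightweight modules.

Method: Combines local units with the GLoRA attention mechanism and remains compatible with existing Transformer backbones such as ViT and SegFormer.

Result: Achieves state-of-the-art localization performance on multiple benchmarks, setting a new baseline for scalable, modality-agnostic visual manipulation localization.

Insight: A unified local-global attention framework can markedly improve the generalization and efficiency of manipulation localization without disrupting pretrained representations.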

Abstract: Visual manipulation localization (VML) – across both images and videos – is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.

[53] Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy

Hao Yu,Rupayan Mallick,Margrit Betke,Sarah Adel Bargal

Main category: cs.CV

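TL;DR: GEN-AFFECT is a personalized avatar generation framework for fine-grained facial expressions. By conditioning a multimodal diffusion transformer on an identity-expression representation, it achieves both expression diversity and identity consistency.

Details Motivation: Existing methods struggle to capture fine-grained facial expressions and cannot guarantee identity consistency across expressions, although virtual communication and gaming applications strongly need both.

Contribution: Proposes the GEN-AFFECT framework, which combines a multimodal diffusion transformer with a consistent attention mechanism to achieve expression diversity and identity consistency.

Method: Conditions a multimodal diffusion transformer on an identity-expression representation and shares information across generated expressions through consistent attention at inference.

Result: Outperforms existing methods in expression accuracy, identity preservation, and consistency of the target identity.

Insight: Combining a multimodal transformer with attention-based information sharing effectively improves the expression diversity and identity consistency of generated avatars.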

Abstract: Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose GEN-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. GEN-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions.

[54] Event-driven Robust Fitting on Neuromorphic Hardware

Tam Ngoc-Bang Nguyen,Anh-Dzung Doan,Zhipeng Cai,Tat-Jun Chin

Main category: cs.CV

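TL;DR: This paper proposes an energy-efficient robust fitting method based on neuromorphic computing. A novel spiking neural network runs on Intel Loihi 2 hardware and consumes only 15% of the energy of a conventional CPU-based method.

Details Motivation: Conventional robust fitting methods pay little attention to energy efficiency, while the high energy consumption of AI is a growing concern. This paper uses the neuromorphic computing paradigm to address that challenge.

Contribution: 1. Proposes the first robust fitting method implemented on real neuromorphic hardware; 2. Designs event-driven formulations of model estimation; 3. Overcomes the limited precision and instruction set of the hardware.

Method: Designs a novel spiking neural network that runs on Intel Loihi 2, achieving efficient robust fitting through event-driven model estimation and supporting algorithmic strategies.

Result: Tests show that the method consumes only 15% of the energy of the conventional CPU-based method while reaching equivalent accuracy.

Insight: Neuromorphic hardware opens a new route to reducing the energy cost of computer vision tasks, especially in scenarios that require real-time processing.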

Abstract: Robust fitting of geometric models is a fundamental task in many computer vision pipelines. Numerous innovations have been produced on the topic, from improving the efficiency and accuracy of random sampling heuristics to generating novel theoretical insights that underpin new approaches with mathematical guarantees. However, one aspect of robust fitting that has received little attention is energy efficiency. This performance metric has become critical as high energy consumption is a growing concern for AI adoption. In this paper, we explore energy-efficient robust fitting via the neuromorphic computing paradigm. Specifically, we designed a novel spiking neural network for robust fitting on real neuromorphic hardware, the Intel Loihi 2. Enabling this are novel event-driven formulations of model estimation that allow robust fitting to be implemented in the unique architecture of Loihi 2, and algorithmic strategies to alleviate the current limited precision and instruction set of the hardware. Results show that our neuromorphic robust fitting consumes only a fraction (15%) of the energy required to run the established robust fitting algorithm on a standard CPU to equivalent accuracy.
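For readers unfamiliar with the "random sampling heuristics" the abstract refers to, the following is a minimal sketch of a RANSAC-style line fit on a CPU. It is an assumption that this resembles the established baseline; the abstract does not name the algorithm, and this sketch is not the paper's spiking-network formulation. All function names and default values are illustrative.

```python
import random


def count_inliers(points, slope, intercept, tol):
    """Number of points whose vertical residual to the line is within tol."""
    return sum(abs(y - (slope * x + intercept)) <= tol for x, y in points)


def ransac_line(points, iterations=200, tol=0.5, seed=0):
    """Return (slope, intercept, inlier_count) of the best sampled line.

    Each iteration fits y = slope * x + intercept to two random points and
    keeps the hypothesis with the largest consensus set.
    """
    rng = random.Random(seed)
    best = (0.0, 0.0, -1)
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:  # vertical lines are outside this simple model
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        score = count_inliers(points, slope, intercept, tol)
        if score > best[2]:
            best = (slope, intercept, score)
    return best
```

The repeated hypothesize-and-verify loop is where a CPU spends its energy, which is the cost the neuromorphic approach aims to reduce.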

[55] CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios

Jialei Xu,Zizhuang Wei,Weikang You,Linyun Li,Weijian Sun

Main category: cs.CV

TL;DR: CitySeg is a 3D semantic segmentation foundation model for city-scale point clouds, leveraging text modality for open vocabulary segmentation and zero-shot inference, achieving SOTA performance.

Details Motivation: Existing models struggle with limited data scale and domain gaps, reducing generalization. CitySeg addresses these issues with text modality and hierarchical classification.

Contribution: Proposes CitySeg with text modality for open vocabulary segmentation, hierarchical classification, and local-global cross-attention to enhance UAV perception.

Method: Uses custom data preprocessing, local-global cross-attention, hierarchical graph for label consolidation, and two-stage training with hinge loss.

Result: Achieves SOTA on nine benchmarks and enables zero-shot generalization in city-scale scenarios without visual information.

Insight: Combining text modality and hierarchical classification effectively addresses domain gaps and label discrepancies in 3D point cloud segmentation.

Abstract: Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city-scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero-shot inference. Specifically, in order to mitigate the issue of non-uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local-global cross-attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two-stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA) performance on nine closed-set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero-shot generalization in city-scale point cloud scenarios without relying on visual information.

[56] From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts

Yuji Wang,Moran Li,Xiaobin Hu,Ran Yi,Jiangning Zhang,Chengming Xu,Weijian Cao,Yabiao Wang,Chengjie Wang,Lizhuang Ma

Main category: cs.CV

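TL;DR: This paper proposes Mixture of Facial Experts (MoFE), which dynamically combines three expert modules that capture different facial features. Together with LFA, a purpose-built dataset of large-angle faces, it substantially improves identity consistency and large-angle handling in video generation.

Details Motivation: Existing video generation models struggle to preserve identity under large facial angles, mainly because they lack an effective mechanism for integrating identity features into the DiT structure and because open-source datasets contain too few large-angle faces.

Contribution: 1. Proposes the MoFE module, which dynamically integrates features from three experts (identity, semantic, detail); 2. Designs a data processing pipeline built on Face Constraints and Identity Consistency, and uses it to construct the LFA dataset.

Method: The MoFE module dynamically fuses identity-sensitive features, visual semantics, and pixel-level details, and training is improved with the angle-annotated large-angle face data in the LFA dataset.

Result: On the LFA benchmark, the method significantly outperforms existing methods in face similarity, face FID, and CLIP semantic alignment.

Insight: Dynamic expert combination and targeted dataset design (large-angle coverage and identity consistency) are the keys to better identity consistency in video generation.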

Abstract: Current video generation models struggle with identity preservation under large facial angles, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into the DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture distinct but mutually reinforcing aspects of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Furthermore, to mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences, collectively addressing the scarcity of large facial angles and identity-stable training data in existing datasets. Leveraging this pipeline, we have curated and refined a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, empowered by the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. The code and dataset will be made publicly available at https://github.com/rain152/LFA-Video-Generation.
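The abstract says MoFE "dynamically combines" cues from three experts but does not give the gating rule. A minimal sketch, assuming a standard softmax-gated mixture: each expert contributes a feature vector, and gate logits (hypothetical here, in practice produced by a learned network) decide how much each expert counts. The function names are illustrative and are not taken from the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_features, gate_logits):
    """Weighted sum of expert feature vectors.

    expert_features: one vector per expert (identity, semantic, detail).
    gate_logits: one score per expert; assumed to come from a learned gate.
    """
    weights = softmax(gate_logits)
    dim = len(expert_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, expert_features))
        for d in range(dim)
    ]
```

With equal logits the mixture reduces to a plain average; a large logit for one expert lets that expert dominate the fused feature.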

[57] CLIP-Flow: A Universal Discriminator for AI-Generated Images Inspired by Anomaly Detection

Zhipeng Yuan,Kai Wang,Weize Quan,Dong-Ming Yan,Tieru Wu

Main category: cs.CV

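TL;DR: The paper proposes CLIP-Flow, a universal detector for AI-generated images built on anomaly detection. It never sees an AI-generated image during training and reaches strong generalization through unsupervised learning and proxy images.

Details Motivation: As AI generative models advance rapidly, the quality of AI-generated images approaches that of natural images, which raises security concerns. Existing detectors generalize poorly to unseen generative models.

Contribution: Proposes a universal AI-generated image detector that needs no AI-generated images as training data and improves generalization through unsupervised learning and proxy images.

Method: Uses a pre-trained CLIP encoder as the feature extractor and designs a normalizing flow-like unsupervised model, trained by minimizing the likelihood of proxy images (optionally combined with maximizing the likelihood of natural images).

Result: Experiments show that the method effectively detects AI-generated images produced by a variety of image generators.

Insight: Designing the detector from an anomaly detection perspective, combined with unsupervised learning and proxy images, is an effective way to address the generalization problem in AI-generated image detection.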

Abstract: With the rapid advancement of AI generative models, the visual quality of AI-generated images (AIIs) has become increasingly close to natural images, which inevitably raises security concerns. Most AII detectors employ the conventional image classification pipeline with natural images and AIIs (generated by a generative model), which can result in limited detection performance for AIIs from unseen generative models. To solve this, we propose a universal AI-generated image detector from the perspective of anomaly detection. Our discriminator does not need access to any AIIs and learns a generalizable representation with unsupervised learning. Specifically, we use the pre-trained CLIP encoder as the feature extractor and design a normalizing flow-like unsupervised model. Instead of AIIs, proxy images, e.g., obtained by applying a spectral modification operation on natural images, are used for training. Our models are trained by minimizing the likelihood of proxy images, optionally combined with maximizing the likelihood of natural images. Extensive experiments demonstrate the effectiveness of our method on AIIs produced by various image generators.

[58] SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

Xuejun Huang,Xinyi Liu,Yi Wan,Zhi Zheng,Bin Zhang,Mingtao Xiong,Yingying Pei,Yongjun Zhang

Main category: cs.CV

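TL;DR: SkySplat is a self-supervised framework that integrates the RPC model into a generalizable 3D Gaussian Splatting pipeline, substantially improving 3D reconstruction from sparse satellite images.

Details Motivation: Existing 3D Gaussian Splatting (3DGS) methods perform poorly on sparse satellite images, mainly because they are incompatible with the RPC model and generalize poorly. SkySplat aims to solve these problems.

Contribution: 1. Proposes SkySplat, a self-supervised framework that integrates the RPC model into generalizable 3DGS; 2. Proposes the Cross-Self Consistency Module (CSCM) to reduce interference from transient objects; 3. Introduces a multi-view consistency aggregation strategy that improves reconstruction; 4. Delivers large gains in speed and accuracy.

Method: 1. Integrates the RPC model to exploit sparse geometric cues; 2. Requires only RGB images and radiometric-robust relative height supervision; 3. CSCM reduces interference through consistency-based masking; 4. Multi-view consistency aggregation refines the results.

Result: 1. 86 times faster than EOGS with higher accuracy; 2. Reduces MAE from 13.18 m to 1.80 m on the DFC19 dataset; 3. Shows strong generalization on the MVS3D benchmark.

Insight: By combining self-supervision with generalizable 3DGS, SkySplat markedly improves the efficiency and accuracy of sparse satellite image reconstruction while removing the dependence on ground-truth height maps.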

Abstract: Three-dimensional scene reconstruction from sparse-view satellite images is a long-standing and challenging task. While 3D Gaussian Splatting (3DGS) and its variants have recently attracted attention for their high efficiency, existing methods remain unsuitable for satellite images due to incompatibility with rational polynomial coefficient (RPC) models and limited generalization capability. Recent advances in generalizable 3DGS approaches show potential, but they perform poorly on multi-temporal sparse satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies. To address these limitations, we propose SkySplat, a novel self-supervised framework that integrates the RPC model into the generalizable 3DGS pipeline, enabling more effective use of sparse geometric cues for improved reconstruction. SkySplat relies only on RGB images and radiometric-robust relative height supervision, thereby eliminating the need for ground-truth height maps. Key components include a Cross-Self Consistency Module (CSCM), which mitigates transient object interference via consistency-based masking, and a multi-view consistency aggregation strategy that refines reconstruction results. Compared to per-scene optimization methods, SkySplat achieves an 86-fold speedup over EOGS with higher accuracy. It also outperforms generalizable 3DGS baselines, significantly reducing MAE from 13.18 m to 1.80 m on the DFC19 dataset, and demonstrates strong cross-dataset generalization on the MVS3D benchmark.

[59] Episodic Memory Representation for Long-form Video Understanding

Yun Wang,Long Zhang,Jingren Liu,Jiaqi Yan,Zhanjie Zhang,Jiahao Zheng,Xun Yang,Dapeng Wu,Xiangyu Chen,Xuelong Li

Main category: cs.CV

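TL;DR: Video-EM is a training-free framework based on the principles of human episodic memory. By modeling keyframes as temporally ordered episodic events, it addresses the context limits and keyframe redundancy that hamper long-form video understanding.

Details Motivation: Existing video large language models (Video-LLMs) are constrained by their context window on long videos, and keyframe retrieval methods ignore spatio-temporal relationships, losing scene transitions and contextual continuity.

Contribution: 1. Proposes Video-EM, a training-free framework that models keyframes as temporally ordered episodic events and captures spatio-temporal dynamics; 2. Uses chain-of-thought (CoT) reasoning to iteratively select the most informative episodic memories, making question answering more efficient.

Method: 1. Models keyframes as temporally ordered episodic events; 2. Uses chain-of-thought reasoning with an LLM to iteratively refine keyframe selection.

Result: On benchmarks including Video-MME and EgoSchema, Video-EM improves performance by 4-9% while using fewer frames.

Insight: Designing systems around episodic memory principles helps with the spatio-temporal modeling and contextual continuity problems of long-form video understanding.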

Abstract: Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods simplify the problem to static text-image matching, overlooking spatio-temporal relationships crucial for capturing scene transitions and contextual continuity, and may yield redundant keyframes with limited information, diluting salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training-free framework inspired by the principles of human episodic memory, designed to facilitate robust and contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both spatial relationships and temporal dynamics necessary for accurately reconstructing the underlying narrative. Furthermore, the framework leverages chain-of-thought (CoT) thinking with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM, which achieves highly competitive results with performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
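To make "temporally ordered episodic events" concrete, here is a minimal sketch that groups retrieved keyframe timestamps into events whenever consecutive frames are close in time. The gap-based grouping rule and the `max_gap` parameter are assumptions made for illustration; the abstract does not describe how Video-EM actually forms its events.

```python
def group_into_events(timestamps, max_gap):
    """Group keyframe timestamps (seconds) into temporally ordered events.

    Assumption: two keyframes belong to the same event when they are at most
    max_gap seconds apart after sorting by time.
    """
    events = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] <= max_gap:
            events[-1].append(t)  # extend the current event
        else:
            events.append([t])  # start a new event
    return events
```

Treating each group rather than each frame as the retrieval unit is one way to keep scene transitions together while dropping near-duplicate frames within an event.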

[60] SARE: Semantic-Aware Reconstruction Error for Generalizable Diffusion-Generated Image Detection

Ju Yeon Kang,Jaehong Park,Semin Kim,Ji Won Yoon,Nam Soo Kim

Main category: cs.CV

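TL;DR: Proposes Semantic-Aware Reconstruction Error (SARE) for detecting diffusion-generated images. By quantifying the semantic difference between an image and its caption-guided reconstruction, the method improves detection generalization.

Details Motivation: The rapid progress of diffusion models raises concerns about misuse, and existing detectors degrade on unseen, out-of-distribution (OOD) generative models because they rely mainly on model-specific artifacts.

Contribution: Proposes SARE, which uses a quantified semantic difference as a discriminative feature to improve the robustness and generalization of detection across images from many generative models.

Method: Runs a caption-guided image reconstruction process, compares the semantics of the original and reconstructed images, and uses the resulting SARE feature for detection.

Result: Performs strongly on benchmarks including GenImage and CommunityForensics, outperforming existing baselines.

Insight: Fake images tend to agree more closely with their captions, whereas captions of real images rarely capture their full visual complexity, so real images show a larger semantic shift under reconstruction. This property can be used for detection.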

Abstract: Recently, diffusion-generated image detection has gained increasing attention, as the rapid advancement of diffusion models has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts. To address this limitation, we explore a fundamental property commonly observed in fake images. Motivated by the observation that fake images tend to exhibit higher similarity to their captions than real images, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. By quantifying these semantic shifts, SARE can be utilized as a discriminative feature for robust detection across diverse generative models. We empirically demonstrate that the proposed method exhibits strong generalization, outperforming existing baselines on benchmarks including GenImage and CommunityForensics.
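The core quantity is a semantic difference between an image and its caption-guided reconstruction. A minimal sketch, assuming both images have already been mapped to embedding vectors by some semantic encoder (not specified in the abstract) and that the difference is measured as one minus cosine similarity. The distance choice and function names are illustrative assumptions, not the paper's definition.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_shift(image_emb, recon_emb):
    """Assumed SARE-style score: larger means the reconstruction drifted more."""
    return 1.0 - cosine_similarity(image_emb, recon_emb)
```

Following the hypothesis in the abstract, a real image would be expected to produce a larger `semantic_shift` than a generated one, so the score (or a feature built from it) can feed a downstream classifier.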

[61] CWFBind: Geometry-Awareness for Fast and Accurate Protein-Ligand Docking

Liyan Jia,Chuan-Xian Ren,Hong Yan

Main category: cs.CV

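TL;DR: CWFBind is a fast and accurate protein-ligand docking method based on local curvature features. By integrating geometric information and an improved message passing mechanism, it raises both docking accuracy and efficiency.

Details Motivation: Existing deep learning methods for protein-ligand docking often neglect geometric information, which leads to inaccurate pocket localization and binding conformations. CWFBind addresses this with geometry-aware feature extraction and message passing.

Contribution: 1. Proposes a geometry-aware docking method based on local curvature features; 2. Introduces a degree-aware weighting mechanism to strengthen message passing; 3. Adopts a ligand-aware dynamic radius strategy and an enhanced loss function to handle class imbalance.

Method: CWFBind integrates local curvature descriptors to enrich the geometric representation of proteins and ligands, complementing chemical, sequence, and structural features. It also improves message passing with degree-aware weighting and refines pocket prediction with a dynamic radius strategy.

Result: Experiments show that CWFBind achieves competitive performance on multiple docking benchmarks, balancing accuracy and efficiency.

Insight: Incorporating geometric information and handling class imbalance effectively are key to better protein-ligand docking performance.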

Abstract: Accurately predicting the binding conformation of small-molecule ligands to protein targets is a critical step in rational drug design. Although recent deep learning-based docking surpasses traditional methods in speed and accuracy, many approaches rely on graph representations and language model-inspired encoders while neglecting critical geometric information, resulting in inaccurate pocket localization and unrealistic binding conformations. In this study, we introduce CWFBind, a weighted, fast, and accurate docking method based on local curvature features. Specifically, we integrate local curvature descriptors during the feature extraction phase to enrich the geometric representation of both proteins and ligands, complementing existing chemical, sequence, and structural features. Furthermore, we embed degree-aware weighting mechanisms into the message passing process, enhancing the model’s ability to capture spatial structural distinctions and interaction strengths. To address the class imbalance challenge in pocket prediction, CWFBind employs a ligand-aware dynamic radius strategy alongside an enhanced loss function, facilitating more precise identification of binding regions and key residues. Comprehensive experimental evaluations demonstrate that CWFBind achieves competitive performance across multiple docking benchmarks, offering a balanced trade-off between accuracy and efficiency.

[62] Generation of Indian Sign Language Letters, Numbers, and Words

Ajeet Kumar Yadav,Nishant Kumar,Rathna G N

Main category: cs.CV

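TL;DR: The paper proposes a GAN variant that combines ProGAN and SAGAN to generate high-quality, high-resolution images of Indian Sign Language letters, numbers, and words. It clearly improves generation quality and comes with a large released dataset.

Details Motivation: Sign language is an important medium for communicating with hard-of-hearing people, but generating high-quality sign language images remains underexplored. Combining ProGAN's high-resolution generation with SAGAN's feature richness can further improve sign language generation.

Contribution: 1. Proposes a modified attention-based model that combines ProGAN and SAGAN to generate high-quality sign language images; 2. Significantly outperforms the traditional ProGAN on Inception Score and Fréchet Inception Distance; 3. Releases a large, high-quality dataset of Indian Sign Language letters, numbers, and 129 words.

Method: Combines the high-resolution generation ability of Progressive Growing of GAN (ProGAN) with the feature extraction ability of Self-Attention GAN (SAGAN) to build a modified attention-based GAN focused on high-quality Indian Sign Language images.

Result: The modified model improves Inception Score (IS) by 3.2 and Fréchet Inception Distance (FID) by 30.12 over the traditional ProGAN.

Insight: Combining the strengths of the two GANs allows high-resolution generation while preserving rich feature detail, which matters especially for sign language generation.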

Abstract: Sign language, which contains hand movements, facial expressions and bodily gestures, is a significant medium for communicating with hard-of-hearing people. A well-trained sign language community communicates easily, but those who don’t know sign language face significant challenges. Recognition and generation are basic communication methods between hearing and hard-of-hearing individuals. Despite progress in recognition, sign language generation still needs to be explored. The Progressive Growing of Generative Adversarial Network (ProGAN) excels at producing high-quality images, while the Self-Attention Generative Adversarial Network (SAGAN) generates feature-rich images at medium resolutions. Balancing resolution and detail is crucial for sign language image generation. We are developing a Generative Adversarial Network (GAN) variant that combines both models to generate feature-rich, high-resolution, and class-conditional sign language images. Our modified Attention-based model generates high-quality images of Indian Sign Language letters, numbers, and words, outperforming the traditional ProGAN in Inception Score (IS) and Fréchet Inception Distance (FID), with improvements of 3.2 and 30.12, respectively. Additionally, we are publishing a large dataset incorporating high-quality images of Indian Sign Language alphabets, numbers, and 129 words.

[63] SOI is the Root of All Evil: Quantifying and Breaking Similar Object Interference in Single Object Tracking

Yipei Wang,Shiyu Hu,Shukun Jia,Panxi Xu,Hongfei Ma,Yiping Ma,Jing Zhang,Xiaobo Lu,Xin Zhao

Main category: cs.CV

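TL;DR: The paper presents the first systematic study and quantification of Similar Object Interference (SOI) in single object tracking, shows experimentally that removing interference sources substantially improves tracking, and proposes a new paradigm that uses natural language as external cognitive guidance.

Details Motivation: Similar Object Interference (SOI) is a long-overlooked bottleneck that seriously degrades single object tracking (SOT). The paper sets out to quantify the impact of SOI and to explore external cognitive guidance as a remedy.

Contribution: 1) First systematic quantification of the impact of SOI; 2) Builds the SOIBench benchmark, which automatically mines data targeting SOI challenges and provides semantic guidance texts; 3) Proposes a new paradigm based on large-scale vision-language models (VLMs) that markedly improves existing trackers.

Method: 1) Quantifies the impact of SOI through Online Interference Masking (OIM) experiments; 2) Builds SOIBench through multi-tracker collective judgment; 3) Designs a multi-level annotation protocol to generate semantic guidance texts; 4) Integrates a large-scale VLM as an external cognitive engine.

Result: Removing interference sources yields AUC gains of up to 4.35. Existing VLT methods fail to exploit semantic guidance effectively (AUC changes of -0.26 to +0.71), whereas the new VLM paradigm yields AUC gains of up to 0.93.

Insight: SOI is a primary bottleneck in single object tracking. Natural language is a practical form of external cognitive guidance, and large-scale VLMs perform well on semantically guided tracking.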

Abstract: In this paper, we present the first systematic investigation and quantification of Similar Object Interference (SOI), a long-overlooked yet critical bottleneck in Single Object Tracking (SOT). Through controlled Online Interference Masking (OIM) experiments, we quantitatively demonstrate that eliminating interference sources leads to substantial performance improvements (AUC gains up to 4.35) across all SOTA trackers, directly validating SOI as a primary constraint for robust tracking and highlighting the feasibility of external cognitive guidance. Building upon these insights, we adopt natural language as a practical form of external guidance, and construct SOIBench, the first semantic cognitive guidance benchmark specifically targeting SOI challenges. It automatically mines SOI frames through multi-tracker collective judgment and introduces a multi-level annotation protocol to generate precise semantic guidance texts. Systematic evaluation on SOIBench reveals a striking finding: existing vision-language tracking (VLT) methods fail to effectively exploit semantic cognitive guidance, achieving only marginal improvements or even performance degradation (AUC changes of -0.26 to +0.71). In contrast, we propose a novel paradigm employing large-scale vision-language models (VLM) as external cognitive engines that can be seamlessly integrated into arbitrary RGB trackers. This approach demonstrates substantial improvements under semantic cognitive guidance (AUC gains up to 0.93), representing a significant advancement over existing VLT methods. We hope SOIBench will serve as a standardized evaluation platform to advance semantic cognitive tracking research and contribute new insights to the tracking research community.

[64] Learning Spatial Decay for Vision Transformers

Yuxin Mao,Zhen Qin,Jinxing Zhou,Bin Fan,Jing Zhang,Yiran Zhong,Yuchao Dai

Main category: cs.CV

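TL;DR: The paper proposes the Spatial Decay Transformer (SDT), which uses a Context-Aware Gating (CAG) mechanism to generate data-dependent spatial decay dynamically, improving ViT performance on spatially structured tasks.

Details Motivation: ViT self-attention lacks an explicit spatial inductive bias, which leads to suboptimal performance on spatially structured tasks. Existing methods introduce spatial decay through fixed distance metrics and cannot adapt to diverse visual scenes.

Contribution: 1. First successful adaptation of data-dependent spatial decay to 2D vision transformers; 2. Proposes the Context-Aware Gating (CAG) mechanism; 3. Solves the 1D-to-2D adaptation problem through a unified spatial-content fusion framework.

Method: SDT uses the CAG mechanism to combine Manhattan distance-based spatial priors with learned content representations, dynamically modulating the decay of spatial attention.

Result: On ImageNet-1K classification and generation tasks, SDT consistently improves over strong baselines.

Insight: Data-dependent spatial decay offers a new paradigm for enhancing spatial attention in vision transformers.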

Abstract: Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing approaches introduce data-independent spatial decay based on fixed distance metrics, applying uniform attention weighting regardless of image content and limiting adaptability to diverse visual scenarios. Inspired by recent advances in large language models where content-aware gating mechanisms (e.g., GLA, HGRN2, FOX) significantly outperform static alternatives, we present the first successful adaptation of data-dependent spatial decay to 2D vision transformers. We introduce Spatial Decay Transformer (SDT), featuring a novel Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Our approach learns to modulate spatial attention based on both content relevance and spatial proximity. We address the fundamental challenge of 1D-to-2D adaptation through a unified spatial-content fusion framework that integrates Manhattan distance-based spatial priors with learned content representations. Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines. Our work establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers.
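A minimal sketch of what data-dependent spatial decay over a 2D patch grid could look like. It assumes the decay takes the form gate raised to the power of the Manhattan distance, applied inside the softmax, with one gate value in (0, 1) per query patch. The abstract does not give the exact formulation, so this form, the `gates` input (which a learned CAG module would produce from patch content), and the function names are all illustrative assumptions.

```python
import math


def manhattan(p, q):
    """Manhattan distance between two (row, col) patch positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def decayed_attention(scores, positions, gates):
    """Row-wise softmax of scores with a per-query spatial decay.

    scores[i][j]: raw attention logit from query patch i to key patch j.
    gates[i]: value in (0, 1); assumed to be predicted from patch content.
    The weight of key j is multiplied by gates[i] ** distance(i, j).
    """
    out = []
    for i, row in enumerate(scores):
        logits = [
            s + manhattan(positions[i], positions[j]) * math.log(gates[i])
            for j, s in enumerate(row)
        ]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out
```

A gate close to 1 leaves attention nearly global, while a small gate concentrates it on nearby patches. Letting the gate depend on content is what distinguishes this from a fixed, data-independent decay.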

[65] COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection

Peiran Peng,Tingfa Xu,Liqiang Song,Mengqi Zhu,Yuqiang Fang,Jianan Li

Main category: cs.CV

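TL;DR: COXNet substantially improves RGBT tiny object detection through cross-layer fusion, dynamic alignment and scale refinement, and an improved label assignment strategy.

Details Motivation: Spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds make it hard for current methods to exploit the complementary information in the visible and thermal modalities, so RGBT tiny object detection remains challenging.

Contribution: 1) Proposes the Cross-Layer Fusion Module; 2) Proposes the Dynamic Alignment and Scale Refinement module; 3) Introduces a label assignment strategy based on geometric shape similarity.

Method: Combines visible and thermal information through cross-layer feature fusion, dynamic alignment of the multimodal data, scale refinement, and an improved label assignment strategy.

Result: On the RGBTDronePerson dataset, COXNet improves mAP$_{50}$ by 3.32% over the state of the art.

Insight: Effective multimodal feature fusion combined with dynamic alignment can substantially improve tiny object detection in complex environments.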

Abstract: Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32% mAP$_{50}$ improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.

[66] Iterative Volume Fusion for Asymmetric Stereo Matching

Yuanting Gao,Linghao Shen

Main category: cs.CV

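TL;DR: The paper proposes IVF-AStereo, a two-phase iterative volume fusion network for asymmetric stereo matching that uses two kinds of cost volume together to handle visual asymmetry, and validates it on benchmark datasets.

Details Motivation: The rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges the assumption of symmetric visual properties made by conventional stereo matching algorithms. Visual asymmetry distorts the cost volume computation and thereby degrades stereo matching.

Contribution: The two-phase Iterative Volume Fusion network (IVF-AStereo), which addresses asymmetric stereo matching by jointly exploiting the information in two cost volumes, with experiments demonstrating its advantage.

Method: The aggregated concatenation volume first refines the correlation volume; the two volumes are then fused to enhance fine details. This two-phase iterative fusion is what handles the visual asymmetry.

Result: Extensive comparative experiments and ablation studies on benchmark datasets confirm the robustness and effectiveness of IVF-AStereo in asymmetric settings with resolution and color degradation.

Insight: Each of the two cost volumes suffers a distinct information distortion in asymmetric settings, so both need to be used together to solve the problem.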

Abstract: Stereo matching is vital in 3D computer vision, with most algorithms assuming symmetric visual properties between binocular visions. However, the rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges this assumption and complicates stereo matching. Visual asymmetry disrupts stereo matching by affecting the crucial cost volume computation. To address this, we explore the matching cost distribution of two established cost volume construction methods in asymmetric stereo. We find that each cost volume experiences distinct information distortion, indicating that both should be comprehensively utilized to solve the issue. Based on this, we propose the two-phase Iterative Volume Fusion network for Asymmetric Stereo matching (IVF-AStereo). Initially, the aggregated concatenation volume refines the correlation volume. Subsequently, both volumes are fused to enhance fine details. Our method excels in asymmetric scenarios and shows robust performance against significant visual asymmetry. Extensive comparative experiments on benchmark datasets, along with ablation studies, confirm the effectiveness of our approach in asymmetric stereo with resolution and color degradation.
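
The two cost volume types named in the abstract can be illustrated with a minimal sketch on a single scanline. It follows the common textbook construction (a channel-averaged dot product for the correlation volume, feature stacking for the concatenation volume) and is not taken from IVF-AStereo itself; the function names, the 1D simplification, and the zero-filling of out-of-view positions are all assumptions made for illustration.

```python
def correlation_volume(left, right, max_disp):
    """corr[d][x] = channel-averaged dot product of left[x] and right[x - d].

    left and right are lists of per-pixel feature vectors along one scanline.
    Positions where x - d falls outside the right image are filled with 0.0.
    """
    width, channels = len(left), len(left[0])
    volume = []
    for d in range(max_disp):
        row = []
        for x in range(width):
            if x - d < 0:
                row.append(0.0)
            else:
                dot = sum(l * r for l, r in zip(left[x], right[x - d]))
                row.append(dot / channels)
        volume.append(row)
    return volume


def concatenation_volume(left, right, max_disp):
    """concat[d][x] = left[x] features followed by right[x - d] features.

    Keeps the raw features of both views (2 * channels values per cell)
    instead of collapsing them to one similarity score.
    """
    width, channels = len(left), len(left[0])
    volume = []
    for d in range(max_disp):
        row = []
        for x in range(width):
            if x - d < 0:
                row.append([0.0] * (2 * channels))
            else:
                row.append(list(left[x]) + list(right[x - d]))
        volume.append(row)
    return volume


# Toy scanline: the left view equals the right view shifted by one pixel.
left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
right = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
corr = correlation_volume(left, right, max_disp=2)
concat = concatenation_volume(left, right, max_disp=2)
```

The correlation volume is compact but reduces each match to a single number, while the concatenation volume retains full feature content at a higher memory cost; this difference is one plausible reason the two react differently to degraded inputs.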

[67] GoViG: Goal-Conditioned Visual Navigation Instruction Generation

Fengyi Wu,Yifei Dong,Zhi-Qi Cheng,Yilong Dai,Guangyu Chen,Hang Wang,Qi Dai,Alexander G. Hauptmann

Main category: cs.CV

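TL;DR: GoViG introduces goal-conditioned visual navigation instruction generation, a task that produces precise and contextually coherent navigation instructions from egocentric visual observations of the initial and goal states alone.

Details Motivation: Conventional methods rely on structured inputs such as semantic annotations or environmental maps, which limits their adaptability to unseen and unstructured environments. GoViG addresses this by using raw visual data directly.

Contribution: 1) The GoViG task and a goal-conditioned framework for visual navigation instruction generation; 2) two subtasks, visual forecasting and instruction generation; 3) multimodal reasoning strategies and an autoregressive multimodal large language model.

Method: 1) Decompose the task into visual forecasting and instruction generation; 2) train an autoregressive multimodal large language model with tailored objectives; 3) propose one-pass and interleaved multimodal reasoning strategies.

Result: On the R2R-Goal dataset, the method significantly outperforms existing methods, with higher BLEU-4 and CIDEr scores and robust cross-domain generalization.

Insight: Precise navigation instructions can be generated without structured inputs, and multimodal reasoning strategies that mimic human cognitive processes are key to generation quality.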

Abstract: We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.

[68] Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification

Haowen Wang,Guowei Zhang,Xiang Zhang,Zeyuan Chen,Haiyang Xu,Dou Hoon Kwark,Zhuowen Tu

Main category: cs.CV

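TL;DR: The paper examines whether closed-set generative data augmentation (data produced by a generative model trained on the training set itself) is equivalent to real data augmentation for image classification, and empirically quantifies how much synthetic data is needed.

Details Motivation: To study whether, for an image classification task, a generative model trained on the given training set can produce closed-set synthetic data that improves classification performance, and how this compares with real data augmentation.

Contribution: 1. A systematic analysis of the distinctions and similarities between closed-set synthetic data and real data; 2. quantification of the equivalent scale of synthetic data augmentation; 3. an empirical equivalence between closed-set generative augmentation and real data augmentation.

Method: Experiments compare closed-set generative augmentation with real data augmentation, quantify the scale of synthetic data required, and analyze how the baseline training set size and the amount of synthetic data affect the results.

Result: Closed-set generative augmentation can match the classification performance of real data augmentation but requires a larger amount of synthetic data. The effect also depends on the baseline training set size.

Insight: Real data is generally preferred, but closed-set generative augmentation is a viable alternative when resources are limited, provided the larger synthetic data scale needed to compensate is taken into account.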

Abstract: In this paper, we address a key scientific problem in machine learning: Given a training set for an image classification task, can we train a generative model on this dataset to enhance the classification performance? (i.e., closed-set generative data augmentation). We start by exploring the distinctions and similarities between real images and closed-set synthetic images generated by advanced generative models. Through extensive experiments, we offer systematic insights into the effective use of closed-set synthetic data for augmentation. Notably, we empirically determine the equivalent scale of synthetic images needed for augmentation. In addition, we also show quantitative equivalence between the real data augmentation and open-set generative augmentation (generative models trained using data beyond the given training set). While it aligns with the common intuition that real images are generally preferred, our empirical formulation also offers a guideline to quantify the increased scale of synthetic data augmentation required to achieve comparable image classification performance. Our results on natural and medical image datasets further illustrate how this effect varies with the baseline training set size and the amount of synthetic data incorporated.

[69] Topological Invariant-Based Iris Identification via Digital Homology and Machine Learning

Ahmet Öztel,İsmet Karaca

Main category: cs.CV

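TL;DR: The paper proposes an iris identification method based on topological invariants, combining digital homology with machine learning to reach high classification accuracy while being more interpretable and computationally cheaper than deep learning models.

Details Motivation: Iris recognition usually relies on complex deep learning models that lack interpretability and are computationally expensive. The authors use topological invariants such as Betti numbers to provide a more compact, interpretable, and efficient method.

Contribution: 1. The first use of digital homology theory for iris recognition; 2. a compact feature representation based on Betti numbers; 3. evidence that topological features give high accuracy with low variance.

Method: Each normalized iris image is divided into a grid; Betti0, Betti1, and their ratio are computed for each subregion to form a feature matrix, which is used with classifiers such as logistic regression. A CNN is trained for comparison.

Result: Logistic regression reaches 97.78% accuracy, outperforming the CNN (96.44%) and other methods; the topological features show high accuracy with low variance.

Insight: Topological features extend beyond iris recognition to other domains (e.g., medical imaging and remote sensing), and their interpretability and computational efficiency suit security-critical settings.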

Abstract: Objective - This study presents a biometric identification method based on topological invariants from 2D iris images, representing iris texture via formally defined digital homology and evaluating classification performance. Methods - Each normalized iris image (48x482 pixels) is divided into grids (e.g., 6x54 or 3x27). For each subregion, we compute Betti0, Betti1, and their ratio using a recent algorithm for homology groups in 2D digital images. The resulting invariants form a feature matrix used with logistic regression, KNN, and SVM (with PCA and 100 randomized repetitions). A convolutional neural network (CNN) is trained on raw images for comparison. Results - Logistic regression achieved 97.78 +/- 0.82% accuracy, outperforming CNN (96.44 +/- 1.32%) and other feature-based models. The topological features showed high accuracy with low variance. Conclusion - This is the first use of topological invariants from formal digital homology for iris recognition. The method offers a compact, interpretable, and accurate alternative to deep learning, useful when explainability or limited data is important. Beyond iris recognition, it can apply to other biometrics, medical imaging, materials science, remote sensing, and interpretable AI. It runs efficiently on CPU-only systems and produces robust, explainable features valuable for security-critical domains.
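
The per-subregion features described in the Methods sentence can be sketched as follows. This is a simplified illustration, not the paper's digital homology algorithm: it assumes the image has already been binarized (the abstract does not say how), counts Betti0 as 8-connected foreground components, and counts Betti1 as enclosed 4-connected background regions. The function names, the adjacency choice, and the handling of a zero Betti0 in the ratio are all assumptions.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, value, neighbours):
    """Count connected components of cells equal to `value` (breadth-first search)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == value and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def betti_numbers(block):
    """Return (betti0, betti1) of a binary block under the assumed adjacencies."""
    betti0 = count_components(block, 1, N8)
    # Pad with a background border so the outer background forms one component;
    # every additional background component is an enclosed hole.
    cols = len(block[0])
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in block]
              + [[0] * (cols + 2)])
    betti1 = count_components(padded, 0, N4) - 1
    return betti0, betti1


def block_features(image, block_rows, block_cols):
    """Split a binary image into non-overlapping blocks; return [b0, b1, ratio] per block."""
    features = []
    for r0 in range(0, len(image), block_rows):
        for c0 in range(0, len(image[0]), block_cols):
            block = [row[c0:c0 + block_cols] for row in image[r0:r0 + block_rows]]
            b0, b1 = betti_numbers(block)
            ratio = b1 / b0 if b0 else 0.0  # ratio direction and zero case are assumptions
            features.append([b0, b1, ratio])
    return features


ring = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
image = [[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]]
features = block_features(image, 3, 3)
```

The resulting rows would form the feature matrix passed to a classifier such as logistic regression.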

[70] Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

Junyan Ye,Dongzhi Jiang,Zihao Wang,Leqi Zhu,Zhenghao Hu,Zilong Huang,Jun He,Zhiyuan Yan,Jinghua Yu,Hongsheng Li,Conghui He,Weijia Li

Main category: cs.CV

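TL;DR: Echo-4o uses GPT-4o-generated synthetic images to improve open-source image generation models, introduces the Echo-4o-Image dataset, and demonstrates its performance with new evaluation benchmarks.

Details Motivation: Real-world image datasets are high quality but do not cover rare scenarios and contain noise and text-image misalignment; synthetic images can fill these gaps and provide controllable supervision.

Contribution: 1) The Echo-4o-Image synthetic dataset; 2) two new evaluation benchmarks, GenEval++ and Imagine-Bench; 3) evidence that the synthetic data transfers well across multimodal foundation models.

Method: Generate the 180K-scale synthetic dataset Echo-4o-Image with GPT-4o, then fine-tune the Bagel model on it to obtain Echo-4o.

Result: Echo-4o performs strongly on standard benchmarks, and Echo-4o-Image also yields gains on other foundation models (e.g., OmniGen2, BLIP3-o).

Insight: Synthetic data can complement real data, particularly for rare scenarios and precise text-image alignment.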

Abstract: Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability.

[71] WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization

Jiahao Wen,Hang Yu,Zhedong Zheng

Main category: cs.CV

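TL;DR: WeatherPrompt is a multi-modality learning approach that fuses image embeddings with text context to obtain weather-invariant representations for visual geo-localization.

Details Motivation: Current drone visual geo-localization methods degrade markedly in adverse weather. Existing methods depend on a limited set of weather categories and disentangle scene and weather features poorly through pseudo weather categories.

Contribution: 1) A training-free weather reasoning mechanism built on off-the-shelf large multi-modality models; 2) a dynamic gating mechanism that improves feature disentanglement.

Method: A multi-modality framework with a dynamic gating mechanism, driven by the text embedding, adaptively reweights and fuses visual features; the representation space is optimized with image-text contrastive learning and image-text matching.

Result: Under diverse weather conditions, Recall@1 improves by 13.37% under night conditions and by 18.69% under fog and snow conditions.

Insight: Text context helps disentangle weather and scene features, and cross-modal learning improves generalization to complex weather.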

Abstract: Visual geo-localization for drones faces critical degradation under weather perturbations, e.g., rain and fog, where existing methods struggle with two inherent limitations: 1) Heavy reliance on limited weather categories that constrain generalization, and 2) Suboptimal disentanglement of entangled scene-weather features through pseudo weather categories. We present WeatherPrompt, a multi-modality learning paradigm that establishes weather-invariant representations through fusing the image embedding with the text context. Our framework introduces two key contributions: First, a Training-free Weather Reasoning mechanism that employs off-the-shelf large multi-modality models to synthesize multi-weather textual descriptions through human-like reasoning. It improves the scalability to unseen or complex weather, and can reflect different weather intensities. Second, to better disentangle the scene and weather features, we propose a multi-modality framework with a dynamic gating mechanism driven by the text embedding to adaptively reweight and fuse visual features across modalities. The framework is further optimized by cross-modal objectives, including image-text contrastive learning and image-text matching, which map the same scene under different weather conditions closer in the representation space. Extensive experiments validate that, under diverse weather conditions, our method achieves competitive recall rates compared to state-of-the-art drone geo-localization methods. Notably, it improves Recall@1 by 13.37% under night conditions and by 18.69% under fog and snow conditions.

[72] A Chain of Diagnosis Framework for Accurate and Explainable Radiology Report Generation

Haibo Jin,Haoxuan Che,Sunan He,Hao Chen

Main category: cs.CV

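TL;DR: The paper proposes Chain of Diagnosis (CoD), a framework for accurate and explainable radiology report generation. It generates QA pairs through diagnostic conversation to extract key findings, prompts a large language model to write the report, and adds diagnosis grounding and lesion grounding modules to improve explainability and radiologists' working efficiency.

Details Motivation: Existing radiology report generation (RRG) methods fall short in clinical efficacy, especially in describing lesion attributes, and in explainability, which makes the results hard to trust. The paper targets a trustworthy RRG model that both describes abnormalities accurately and provides the basis for its predictions.

Contribution: 1) The CoD framework, which generates QA pairs through diagnostic conversation and uses a large language model to generate the report; 2) diagnosis grounding and lesion grounding modules that improve explainability; 3) an omni-supervised learning strategy that leverages multiple annotation types; 4) an omni-labeled dataset and an evaluation tool.

Method: CoD comprises QA-pair generation through diagnostic conversation, report generation by a large language model, diagnosis grounding and lesion grounding modules, and an omni-supervised learning strategy with clinical consistency.

Result: On two RRG benchmarks, CoD consistently outperforms both specialist and generalist models and accurately grounds generated sentences to QA diagnoses and images.

Insight: With the chain-of-diagnosis framework and omni-supervised learning, RRG gains both report accuracy and explainability, giving better support for clinical practice.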

Abstract: Despite the progress of radiology report generation (RRG), existing works face two challenges: 1) The performances in clinical efficacy are unsatisfactory, especially for lesion attributes description; 2) the generated text lacks explainability, making it difficult for radiologists to trust the results. To address the challenges, we focus on a trustworthy RRG model, which not only generates accurate descriptions of abnormalities, but also provides the basis for its predictions. To this end, we propose a framework named chain of diagnosis (CoD), which maintains a chain of diagnostic process for clinically accurate and explainable RRG. It first generates question-answer (QA) pairs via diagnostic conversation to extract key findings, then prompts a large language model with QA diagnoses for accurate generation. To enhance explainability, a diagnosis grounding module is designed to match QA diagnoses and generated sentences, where the diagnoses act as a reference. Moreover, a lesion grounding module is designed to locate abnormalities in the image, further improving the working efficiency of radiologists. To facilitate label-efficient training, we propose an omni-supervised learning strategy with clinical consistency to leverage various types of annotations from different datasets. Our efforts lead to 1) an omni-labeled RRG dataset with QA pairs and lesion boxes; 2) an evaluation tool for assessing the accuracy of reports in describing lesion location and severity; 3) extensive experiments to demonstrate the effectiveness of CoD, where it outperforms both specialist and generalist models consistently on two RRG benchmarks and shows promising explainability by accurately grounding generated sentences to QA diagnoses and images.

[73] SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs

Bei Yan,Zhiyuan Chen,Yuecong Min,Jie Zhang,Jiahao Wang,Xiaozhen Wang,Shiguang Shan

Main category: cs.CV

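TL;DR: The paper proposes SHALE, a benchmark for fine-grained hallucination evaluation, built with an automated data construction pipeline and a hierarchical hallucination induction framework to address the shortcomings of existing evaluation methods.

Details Motivation: Large Vision-Language Models (LVLMs) suffer from hallucinations; existing evaluation is coarse-grained and relies on manually curated data, which is hard to scale.

Contribution: An automated data construction pipeline, a hierarchical hallucination induction framework, and the SHALE benchmark built from diverse data.

Method: Input perturbations simulate noisy scenarios; hallucinations (faithfulness and factuality) are assessed with a fine-grained categorization scheme; over 30K image-instruction pairs are generated.

Result: Experiments show that mainstream LVLMs exhibit significant factuality hallucinations and are sensitive to semantic perturbations.

Insight: SHALE offers a scalable evaluation tool and exposes how LVLM performance degrades under noisy conditions.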

Abstract: Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.

[74] Offline Auto Labeling: BAAS

Stefan Haag,Bharanidhar Duraisamy,Felix Govaers,Wolfgang Koch,Martin Fritzsche,Juergen Dickmann

Main category: cs.CV

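TL;DR: BAAS is an Extended Object Tracking (EOT) and fusion-based label annotation framework for radar detections. It uses Bayesian methods and fusion to provide precise object trajectories and shape estimates, supports annotation under various supervision levels, and enables closed-loop continuous improvement.

Details Motivation: Annotating radar detections for autonomous driving usually requires extensive manual work, which is costly and inefficient. BAAS aims to label offline and automatically through Bayesian tracking and fusion, reducing manual effort and improving label quality.

Contribution: The BAAS framework, which combines Bayesian tracking, smoothing, and fusion methods for precise automatic annotation of radar detections, with support for multiple supervision levels and closed-loop improvement.

Method: Based on Bayesian tracking and fusion, the framework estimates object trajectories and shapes from radar detections; its processing modules can be analyzed independently or in combination to improve annotation performance.

Result: Tracking performance and label annotation errors are evaluated in a challenging real-world urban scenario, covering various dynamic objects and class types.

Insight: Automatic labeling lowers manual cost while supporting closed-loop improvement, offering an efficient way to annotate data for autonomous driving perception systems.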

Abstract: This paper introduces BAAS, a new Extended Object Tracking (EOT) and fusion-based label annotation framework for radar detections in autonomous driving. Our framework utilizes Bayesian-based tracking, smoothing, and eventually fusion methods to provide veritable and precise object trajectories along with shape estimation to provide annotation labels on the detection level under various supervision levels. Simultaneously, the framework provides evaluation of tracking performance and label annotation. If manually labeled data is available, each processing module can be analyzed independently or combined with other modules to enable closed-loop continuous improvements. The framework performance is evaluated in a challenging urban real-world scenario in terms of tracking performance and the label annotation errors. We demonstrate the functionality of the proposed approach for varying dynamic objects and class types.

[75] Hierarchical Brain Structure Modeling for Predicting Genotype of Glioma

Haotian Tang,Jianwei Chen,Xinrui Tang,Yunjia Wu,Zhengyang Miao,Chao Li

Main category: cs.CV

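TL;DR: The paper proposes Hi-SMGNN, a hierarchical framework that integrates structural and morphological connectomes from the regional to the modular level to predict glioma genotype (IDH mutation status), and reports superior performance on the UCSF-PDGM dataset.

Details Motivation: IDH mutation status is a crucial biomarker for glioma prognosis, but current prediction methods are limited by the low availability and noise of functional MRI. Structural and morphological connectomes offer a non-invasive alternative, yet existing approaches ignore the brain's hierarchical organization and multiscale interactions.

Contribution: The Hi-SMGNN framework, which integrates multimodal connectome data; a multimodal interaction module (Siamese network and cross-modal attention) and a multiscale feature fusion mechanism; and a personalized modular partitioning strategy that enhances individual specificity and interpretability.

Method: Brain structure is modeled hierarchically from the regional to the modular level; a Siamese network and cross-modal attention provide multimodal interaction; multiscale feature fusion reduces redundancy; personalized modular partitioning improves individual specificity.

Result: On the UCSF-PDGM dataset, Hi-SMGNN outperforms baseline and state-of-the-art models, showing improved robustness and effectiveness.

Insight: The brain's hierarchical organization and multiscale interactions matter for predicting IDH mutation status, and personalized modular partitioning improves both predictive performance and interpretability.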

Abstract: Isocitrate DeHydrogenase (IDH) mutation status is a crucial biomarker for glioma prognosis. However, current prediction methods are limited by the low availability and noise of functional MRI. Structural and morphological connectomes offer a non-invasive alternative, yet existing approaches often ignore the brain’s hierarchical organisation and multiscale interactions. To address this, we propose Hi-SMGNN, a hierarchical framework that integrates structural and morphological connectomes from regional to modular levels. It features a multimodal interaction module with a Siamese network and cross-modal attention, a multiscale feature fusion mechanism for reducing redundancy, and a personalised modular partitioning strategy to enhance individual specificity and interpretability. Experiments on the UCSF-PDGM dataset demonstrate that Hi-SMGNN outperforms baseline and state-of-the-art models, showing improved robustness and effectiveness in IDH mutation prediction.

[76] SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing

Heyi Sun,Cong Wang,Tian-Xing Xu,Jingwei Huang,Di Kang,Chunchao Guo,Song-Hai Zhang

Main category: cs.CV

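TL;DR: SVG-Head proposes a hybrid surface-volumetric Gaussian representation for head reconstruction that supports high-fidelity rendering and real-time appearance editing by disentangling geometry from global appearance.

Details Motivation: Reconstruction and editing of head avatars struggle to achieve both high fidelity and real-time editing because of implicit representations and the entangled modeling of geometry and appearance. SVG-Head addresses this with explicit modeling and disentangled textures.

Contribution: 1) A hybrid surface-volumetric Gaussian representation; 2) texture images that disentangle the global appearance; 3) a mesh-aware Gaussian UV mapping method; 4) a hierarchical optimization strategy that improves reconstruction quality and editing flexibility.

Method: SVG-Head combines surface Gaussians (which explicitly model appearance through learnable texture images) with volumetric Gaussians (which improve reconstruction of non-Lambertian regions), uses the UV coordinates of the FLAME mesh for efficient texture mapping, and applies hierarchical optimization to balance reconstruction quality and editing flexibility.

Result: On the NeRSemble dataset, SVG-Head achieves high-fidelity rendering and is the first method to obtain explicit texture images for Gaussian head avatars and support real-time appearance editing.

Insight: Explicit modeling that disentangles geometry from appearance is key to high fidelity and real-time editing, and the hybrid representation handles difficult regions such as hair and lips efficiently.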

Abstract: Creating high-fidelity and editable head avatars is a pivotal challenge in computer vision and graphics, boosting many AR/VR applications. While recent advancements have achieved photorealistic renderings and plausible animation, head editing, especially real-time appearance editing, remains challenging due to the implicit representation and entangled modeling of the geometry and global appearance. To address this, we propose Surface-Volumetric Gaussian Head Avatar (SVG-Head), a novel hybrid representation that explicitly models the geometry with 3D Gaussians bound on a FLAME mesh and leverages disentangled texture images to capture the global appearance. Technically, it contains two types of Gaussians, in which surface Gaussians explicitly model the appearance of head avatars using learnable texture images, facilitating real-time texture editing, while volumetric Gaussians enhance the reconstruction quality of non-Lambertian regions (e.g., lips and hair). To model the correspondence between 3D world and texture space, we provide a mesh-aware Gaussian UV mapping method, which leverages UV coordinates given by the FLAME mesh to obtain sharp texture images and real-time rendering speed. A hierarchical optimization strategy is further designed to pursue the optimal performance in both reconstruction quality and editing flexibility. Experiments on the NeRSemble dataset show that SVG-Head not only generates high-fidelity rendering results, but also is the first method to obtain explicit texture images for Gaussian head avatars and support real-time appearance editing.

[77] BridgeTA: Bridging the Representation Gap in Knowledge Distillation via Teacher Assistant for Bird’s Eye View Map Segmentation

Beomjun Kim,Suhan Woo,Sejong Heo,Euntai Kim

Main category: cs.CV

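TL;DR: The paper proposes BridgeTA, a knowledge distillation framework that introduces a Teacher Assistant (TA) network to narrow the gap between LiDAR-Camera (LC) fusion and camera-only models on BEV map segmentation without increasing the student's inference cost.

Details Motivation: Camera-only BEV segmentation is cheaper than LiDAR-Camera fusion but still lags in performance. Existing knowledge distillation methods mainly enlarge the student by mimicking the teacher's architecture, which raises inference cost. The authors seek a more effective, lower-cost way to narrow the gap.

Contribution: 1. The BridgeTA framework, which bridges the representation gap between teacher and student through a Teacher Assistant (TA) network; 2. a distillation loss derived from Young's inequality that stabilizes optimization and strengthens knowledge transfer; 3. significant gains on the nuScenes dataset, exceeding other distillation methods.

Method: 1. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation; 2. the distillation loss derived from Young's inequality decomposes the direct teacher-student path into teacher-TA and TA-student paths; 3. the student's architecture and inference cost stay unchanged.

Result: On the nuScenes dataset, BridgeTA improves mIoU by 4.2% over the camera-only baseline, an improvement up to 45% higher than that of other state-of-the-art KD methods.

Insight: The TA network not only eases the representation gap between teacher and student but also provides a theoretically grounded optimization path, suggesting a new design direction for knowledge distillation.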

Abstract: Bird’s-Eye-View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher’s architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models through a Teacher Assistant (TA) network while keeping the student’s architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young’s Inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods.
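
One way the Young's-inequality decomposition could look is sketched below. The abstract does not give the loss, so the symbols $F_T$, $F_A$, $F_S$ (teacher, TA, and student BEV features) and the squared-norm form are assumptions; the inequality itself is the standard one, valid for any $\varepsilon > 0$.

```latex
\begin{aligned}
\|F_T - F_S\|^2
  &= \|(F_T - F_A) + (F_A - F_S)\|^2 \\
  &= \|F_T - F_A\|^2 + 2\,\langle F_T - F_A,\; F_A - F_S\rangle + \|F_A - F_S\|^2 \\
  &\le (1+\varepsilon)\,\|F_T - F_A\|^2
     + \left(1 + \tfrac{1}{\varepsilon}\right)\|F_A - F_S\|^2 ,
\end{aligned}
\qquad\text{using}\quad
2\,\langle a, b\rangle \le \varepsilon\,\|a\|^2 + \tfrac{1}{\varepsilon}\,\|b\|^2 .
```

Under this reading, minimizing the two terms on the right (teacher-TA and TA-student) bounds the direct teacher-student discrepancy on the left, which matches the dual-path description in the abstract.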

[78] Plane Detection and Ranking via Model Information Optimization

Daoxin Zhong,Jun Li,Meng Yee Michael Chuah

Main category: cs.CV

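TL;DR: The paper proposes a plane detection and ranking framework based on model information optimization, addressing the false positive detections that RANSAC produces because of its ambiguous inlier threshold.

Details Motivation: In complex scenes, the ambiguous inlier threshold of RANSAC easily leads to false positive plane detections, especially when the true number of planes is unknown. The paper aims to provide a more objective detection mechanism through information optimization.

Contribution: A generalised framework based on probability distribution constraints and information optimization that detects and ranks planes more accurately, with acceleration obtained by partitioning the depth map.

Method: Depth readings are treated as discrete random variables; candidate models are generated by repeated random sub-sampling; the information of each model is computed using the physics and noise model of the depth sensor; the model with the least information is accepted as the most likely ground truth. Neural network segmentation partitions the depth map to accelerate the algorithm.

Result: On synthetic data the method estimates plane parameters more accurately than the default Open3D RANSAC plane segmentation, and it yields more realistic plane parameters on real-world data.

Insight: Information optimization gives an objective way to determine the number of planes and avoid false positives, providing a new theoretical basis for plane detection.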

Abstract: Plane detection from depth images is a crucial subtask with broad robotic applications, often accomplished by iterative methods such as Random Sample Consensus (RANSAC). While RANSAC is a robust strategy with strong probabilistic guarantees, the ambiguity of its inlier threshold criterion makes it susceptible to false positive plane detections. This issue is particularly prevalent in complex real-world scenes, where the true number of planes is unknown and multiple planes coexist. In this paper, we aim to address this limitation by proposing a generalised framework for plane detection based on model information optimization. Building on previous works, we treat the observed depth readings as discrete random variables, with their probability distributions constrained by the ground truth planes. Various models containing different candidate plane constraints are then generated through repeated random sub-sampling to explain our observations. By incorporating the physics and noise model of the depth sensor, we can calculate the information for each model, and the model with the least information is accepted as the most likely ground truth. This information optimization process serves as an objective mechanism for determining the true number of planes and preventing false positive detections. Additionally, the quality of each detected plane can be ranked by summing the information reduction of inlier points for each plane. We validate these properties through experiments with synthetic data and find that our algorithm estimates plane parameters more accurately compared to the default Open3D RANSAC plane segmentation. Furthermore, we accelerate our algorithm by partitioning the depth map using neural network segmentation, which enhances its ability to generate more realistic plane parameters in real-world data.

[79] Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation

Xu Tang,Junan Jia,Yijing Wang,Jingjing Ma,Xiangrong Zhang

Main category: cs.CV

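TL;DR: The paper proposes SAD-Splat, which uses a semantic-aware Gaussian point drop module and a pseudo-label generation pipeline to address semantic ambiguity in 3D aerial-view scene semantic segmentation, and validates its efficiency and scalability on a new benchmark dataset.

Details Motivation: Traditional methods for 3D aerial-view scene semantic segmentation suffer from semantic ambiguity caused by scale variations and structural occlusions, which limits segmentation accuracy and consistency. SAD-Splat is proposed to address this.

Contribution: 1. A semantic-aware Gaussian point drop module that learns sparsity through the Hard Concrete distribution and eliminates redundant and semantically ambiguous Gaussian points. 2. A pseudo-label generation pipeline based on 2D foundation models that strengthens supervision. 3. A new 3D aerial semantic benchmark dataset, 3D-AS.

Method: 1. Integrate semantic confidence estimation with a learnable sparsity mechanism based on the Hard Concrete distribution. 2. Use 2D foundation models to generate high-confidence pseudo-labels that strengthen supervision when annotated data is limited.

Result: Experiments show that SAD-Splat achieves an excellent balance between segmentation accuracy and representation compactness, offering an efficient solution for 3D aerial scene understanding.

Insight: Semantic-aware sparsity and pseudo-label supervision together resolve semantic ambiguity in aerial scenes and offer a new approach for sparsely annotated data.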

Abstract: In the task of 3D Aerial-view Scene Semantic Segmentation (3D-AVS-SS), traditional methods struggle to address semantic ambiguity caused by scale variations and structural occlusions in aerial images. This limits their segmentation accuracy and consistency. To tackle these challenges, we propose a novel 3D-AVS-SS approach named SAD-Splat. Our method introduces a Gaussian point drop module, which integrates semantic confidence estimation with a learnable sparsity mechanism based on the Hard Concrete distribution. This module effectively eliminates redundant and semantically ambiguous Gaussian points, enhancing both segmentation performance and representation compactness. Furthermore, SAD-Splat incorporates a high-confidence pseudo-label generation pipeline. It leverages 2D foundation models to enhance supervision when ground-truth labels are limited, thereby further improving segmentation accuracy. To advance research in this domain, we introduce a challenging benchmark dataset: 3D Aerial Semantic (3D-AS), which encompasses diverse real-world aerial scenes with sparse annotations. Experimental results demonstrate that SAD-Splat achieves an excellent balance between segmentation accuracy and representation compactness. It offers an efficient and scalable solution for 3D aerial scene understanding.
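
The Hard Concrete distribution behind the learnable sparsity mechanism can be sketched as a per-Gaussian gate. This follows the general L0-regularization recipe rather than SAD-Splat's actual module; the constants `GAMMA`, `ZETA`, and `BETA` are commonly used defaults assumed here, `log_alpha` stands in for a learnable per-point parameter, and how the paper couples the gate with semantic confidence is not shown.

```python
import math
import random

# Stretch interval (GAMMA, ZETA) and temperature BETA: assumed common defaults.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha, rng):
    """Draw one stochastic gate in [0, 1] (training-time behaviour)."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))    # hard clip gives exact 0s and 1s


def deterministic_gate(log_alpha):
    """Gate value used at inference; a value of 0.0 means the point is dropped."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def open_probability(log_alpha):
    """P(gate > 0): the differentiable term summed to penalise kept points."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


rng = random.Random(0)
samples = [sample_gate(0.0, rng) for _ in range(100)]
kept = [la for la in (-20.0, 0.0, 20.0) if deterministic_gate(la) > 0.0]
```

Because the clipped gate can be exactly zero, points whose gate closes can be removed outright after training, which is what makes the representation more compact.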

[80] Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors

Giorgos Karvounas,Nikolaos Kyriazis,Iason Oikonomidis,Georgios Pavlakos,Antonis A. Argyros

Main category: cs.CV

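TL;DR: The paper introduces a lightweight texture module that uses texture alignment as a supervisory signal, improving the accuracy and realism of monocular 3D hand reconstruction.

Details Motivation: Even in high-performing 3D hand reconstruction models, the overlay between predicted geometry and image appearance is often imperfect. The authors argue that texture alignment is an underused supervisory signal that can actively support pose and shape estimation.

Contribution: A lightweight texture module that embeds per-pixel observations into UV texture space, together with a new dense alignment loss, improving the accuracy and realism of existing reconstruction models.

Method: Assuming a differentiable rendering pipeline and a model that maps images to 3D hand meshes with known topology, the textured hand is back-projected onto the image for pixel-based alignment. The module plugs easily into existing reconstruction pipelines.

Result: Augmenting HaMeR with the module shows that texture-guided supervision improves reconstruction accuracy and visual realism.

Insight: Texture serves not only photorealism but also as a dense spatial cue that directly supports 3D reconstruction.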

Abstract: We revisit the role of texture in monocular 3D hand reconstruction, not as an afterthought for photorealism, but as a dense, spatially grounded cue that can actively support pose and shape estimation. Our observation is simple: even in high-performing models, the overlay between predicted hand geometry and image appearance is often imperfect, suggesting that texture alignment may be an underused supervisory signal. We propose a lightweight texture module that embeds per-pixel observations into UV texture space and enables a novel dense alignment loss between predicted and observed hand appearances. Our approach assumes access to a differentiable rendering pipeline and a model that maps images to 3D hand meshes with known topology, allowing us to back-project a textured hand onto the image and perform pixel-based alignment. The module is self-contained and easily pluggable into existing reconstruction pipelines. To isolate and highlight the value of texture-guided supervision, we augment HaMeR, a high-performing yet unadorned transformer architecture for 3D hand pose estimation. The resulting system improves both accuracy and realism, demonstrating the value of appearance-guided alignment in hand reconstruction.

[81] Preacher: Paper-to-Video Agentic System

Jingwei Liu,Ling Yang,Hao Luo,Fan Wang,Hongyan Li,Mengdi Wang

Main category: cs.CV

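TL;DR: Preacher is a paper-to-video agentic system that decomposes, summarizes, and reformulates a paper and then generates diverse video segments, addressing the limits of existing video generation models in context window, video duration, stylistic diversity, and domain knowledge representation.

Details Motivation: Current video generation models have limited context windows, rigid video durations, limited stylistic diversity, and no ability to represent domain-specific knowledge, all of which constrain performance on the paper-to-video task. Preacher aims to remove these limitations.

Contribution: 1. Preacher, the first paper-to-video agentic system; 2. top-down decomposition of the paper followed by bottom-up video generation; 3. key scene definitions and a Progressive Chain of Thought (P-CoT) for cross-modal alignment.

Method: 1. Decompose and summarize the paper hierarchically; 2. plan iteratively with P-CoT to generate diverse video segments; 3. assemble the segments into a coherent abstract.

Result: Preacher generates high-quality video abstracts across five research fields and outperforms existing video generation models.

Insight: Hierarchical planning and a progressive chain of thought improve cross-modal generation, particularly in representing specialized domain knowledge.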

Abstract: The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/GenVerse/Paper2Video

[82] Multi-Sequence Parotid Gland Lesion Segmentation via Expert Text-Guided Segment Anything Model

Zhongyuan Wu,Chuan-Xian Ren,Yu Wang,Xiaohua Ban,Jianning Xiao,Xiaohui Duan

Main category: cs.CV

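TL;DR: The paper proposes PG-SAM, a Segment Anything Model guided by expert diagnosis text, for multi-sequence parotid gland lesion segmentation. It addresses two limitations of standard SAM: reliance on precise prompts and disregard for medical expert knowledge.

Details Motivation: Parotid gland lesion segmentation is essential for treatment but difficult because lesions vary in size and have complex boundaries. Existing SAM variants depend on precise prompts, and medical image segmentation methods often ignore expert knowledge.

Contribution: 1) An expert diagnosis report guided prompt generation module that automatically produces prompts carrying prior knowledge; 2) A cross-sequence attention module that integrates complementary information across modalities to improve segmentation; 3) Validation of PG-SAM's performance and clinical applicability at three independent clinical centers.

Method: 1) Expert diagnosis text-guided prompt generation module; 2) Cross-sequence attention module that integrates multi-modal information; 3) Multi-sequence image features and the generated prompts are fed into the decoder to obtain the segmentation result.

Result: PG-SAM achieves state-of-the-art performance in parotid gland lesion segmentation, validating its effectiveness in clinical settings and the benefit of diagnostic text for segmentation.

Insight: Expert diagnostic text can serve as effective prior knowledge and improve medical image segmentation, especially for multi-modal medical images.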

Abstract: Parotid gland lesion segmentation is essential for the treatment of parotid gland diseases. However, due to the variable size and complex lesion boundaries, accurate parotid gland lesion segmentation remains challenging. Recently, fine-tuning the Segment Anything Model (SAM) has shown remarkable performance in the field of medical image segmentation. Nevertheless, SAM’s interactive segmentation model relies heavily on precise lesion prompts (points, boxes, masks, etc.), which are very difficult to obtain in real-world applications. Besides, current medical image segmentation methods produce results fully automatically, ignoring the domain knowledge of medical experts when performing segmentation. To address these limitations, we propose the parotid gland segment anything model (PG-SAM), an expert diagnosis text-guided SAM incorporating expert domain knowledge for cross-sequence parotid gland lesion segmentation. Specifically, we first propose an expert diagnosis report guided prompt generation module that can automatically generate prompt information containing the prior domain knowledge to guide the subsequent lesion segmentation process. Then, we introduce a cross-sequence attention module, which integrates the complementary information of different modalities to enhance the segmentation effect. Finally, the multi-sequence image features and generated prompts are fed into the decoder to obtain the segmentation result. Experimental results demonstrate that PG-SAM achieves state-of-the-art performance in parotid gland lesion segmentation across three independent clinical centers, validating its clinical applicability and the effectiveness of diagnostic text for enhancing image segmentation in real-world clinical settings.

[83] The Brain Resection Multimodal Image Registration (ReMIND2Reg) 2025 Challenge

Reuben Dorent,Laura Rigolo,Colin P. Galvin,Junyu Chen,Mattias P. Heinrich,Aaron Carass,Olivier Colliot,Demian Wassermann,Alexandra Golby,Tina Kapur,William Wells

Main category: cs.CV

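TL;DR: The ReMIND2Reg 2025 Challenge provides the largest public benchmark for multimodal image registration in brain tumor surgery. It aims to drive algorithm development for registering preoperative MRI to post-resection intraoperative ultrasound under brain shift.

Details Motivation: During brain tumor surgery, neuronavigation based on preoperative MRI loses accuracy because of brain shift. Registering post-resection intraoperative ultrasound (iUS) to the preoperative MRI can restore spatial accuracy, but the task is hard because of large anatomical and topological changes and a substantial intensity gap between modalities.

Contribution: 1. Provides the largest public benchmark for this task, ReMIND2Reg, with 99 training cases, 5 validation cases, and 10 private test cases;
2. Establishes a standardized evaluation framework with target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime as metrics.

Method: 1. Each case pairs preoperative 3D ceT1 MRI and T2 MRI with a post-resection 3D iUS volume;
2. Training data are provided without annotations, while validation and test performance are evaluated on manually annotated anatomical landmarks.

Result: No experimental results are reported; the challenge format is intended to drive the development and improvement of algorithms.

Insight: 1. Multimodal image registration has significant clinical value in brain tumor surgery;
2. Large public datasets and standardized evaluation are essential for advancing the field.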

Abstract: Accurate intraoperative image guidance is critical for achieving maximal safe resection in brain tumor surgery, yet neuronavigation systems based on preoperative MRI lose accuracy during the procedure due to brain shift. Aligning post-resection intraoperative ultrasound (iUS) with preoperative MRI can restore spatial accuracy by estimating brain shift deformations, but it remains a challenging problem given the large anatomical and topological changes and substantial modality intensity gap. The ReMIND2Reg 2025 Challenge provides the largest public benchmark for this task, built upon the ReMIND dataset. It offers 99 training cases, 5 validation cases, and 10 private test cases comprising paired 3D ceT1 MRI, T2 MRI, and post-resection 3D iUS volumes. Data are provided without annotations for training, while validation and test performance are evaluated on manually annotated anatomical landmarks. Metrics include target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime. By establishing a standardized evaluation framework for this clinically critical and technically complex problem, ReMIND2Reg aims to accelerate the development of robust, generalizable, and clinically deployable multimodal registration algorithms for image-guided neurosurgery.
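
The landmark-based metrics named in the abstract can be illustrated with a short sketch. It assumes that TRE is the Euclidean distance between corresponding landmarks after registration and that TRE30 is the mean of the worst 30% of those distances; the abstract does not spell out either definition, and the function names are hypothetical.

```python
import math


def target_registration_errors(warped_landmarks, fixed_landmarks):
    """Per-landmark Euclidean distance, in the units of the inputs (e.g. mm)."""
    return [math.dist(p, q) for p, q in zip(warped_landmarks, fixed_landmarks)]


def mean_tre(errors):
    """Average target registration error over all landmarks."""
    return sum(errors) / len(errors)


def tre30(errors):
    """Mean of the worst 30% of landmark errors (assumed definition)."""
    k = max(1, math.ceil(0.3 * len(errors)))
    worst = sorted(errors, reverse=True)[:k]
    return sum(worst) / k


# Toy example: four landmarks after applying a registration.
warped = [(0, 0, 0), (1, 0, 0), (0, 3, 4), (2, 2, 2)]
fixed = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (2, 2, 2)]
errors = target_registration_errors(warped, fixed)
print(mean_tre(errors), tre30(errors))
```

Reporting a worst-case statistic next to the mean makes a method that fails badly on a few landmarks visible even when its average error looks good.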

[84] TOTNet: Occlusion-Aware Temporal Tracking for Robust Ball Detection in Sports Videos

Hao Xu,Arbind Agrahari Baniya,Sam Wells,Mohamed Reda Bouadjenek,Richard Dazely,Sunil Aryal

Main category: cs.CV

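TL;DR: TOTNet is a Temporal Occlusion Tracking Network that uses 3D convolutions, a visibility-weighted loss, and occlusion augmentation to track the ball robustly under occlusion in sports videos.

Details Motivation: Tracking the ball under occlusion is a key challenge in sports video analysis and affects event detection and officiating. TOTNet aims to improve performance in occluded scenes.

Contribution: Proposes the TOTNet framework, which combines 3D convolutions, a visibility-weighted loss, and occlusion augmentation; releases TTA, an occlusion-rich table tennis dataset.

Method: Uses 3D convolutions to capture spatio-temporal information, a visibility-weighted loss to emphasize occluded cases, and occlusion augmentation to improve robustness.

Result: TOTNet significantly outperforms prior methods across four datasets, reducing RMSE from 37.30 to 7.19 and raising accuracy on fully occluded frames from 0.63 to 0.80.

Insight: TOTNet's occlusion handling and spatio-temporal modeling offer a new direction for robust tracking in sports video analysis.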

Abstract: Robust ball tracking under occlusion remains a key challenge in sports video analysis, affecting tasks like event detection and officiating. We present TOTNet, a Temporal Occlusion Tracking Network that leverages 3D convolutions, visibility-weighted loss, and occlusion augmentation to improve performance under partial and full occlusions. Developed in collaboration with Paralympics Australia, TOTNet is designed for real-world sports analytics. We introduce TTA, a new occlusion-rich table tennis dataset collected from professional-level Paralympic matches, comprising 9,159 samples with 1,996 occlusion cases. Evaluated on four datasets across tennis, badminton, and table tennis, TOTNet significantly outperforms prior state-of-the-art methods, reducing RMSE from 37.30 to 7.19 and improving accuracy on fully occluded frames from 0.63 to 0.80. These results demonstrate TOTNet's effectiveness for offline sports analytics in fast-paced scenarios. Code and data access: https://github.com/AugustRushG/TOTNet.
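
One way to read "visibility-weighted loss" is a per-frame loss scaled by how occluded the ball is. The sketch below is an illustration under that assumption, not TOTNet's actual loss: the binary cross-entropy base loss, the three visibility labels, and the weight values in `VISIBILITY_WEIGHTS` are all hypothetical.

```python
import math

# Hypothetical weights: more heavily occluded frames count more.
VISIBILITY_WEIGHTS = {"visible": 1.0, "partial": 2.0, "full": 3.0}


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one heatmap cell, clamped for stability."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def visibility_weighted_loss(preds, targets, visibilities):
    """Weighted mean of per-frame losses.

    preds, targets: one flat list of heatmap probabilities per frame.
    visibilities: one label per frame, a key of VISIBILITY_WEIGHTS.
    """
    total, weight_sum = 0.0, 0.0
    for frame_pred, frame_target, vis in zip(preds, targets, visibilities):
        weight = VISIBILITY_WEIGHTS[vis]
        frame_loss = sum(
            bce(p, t) for p, t in zip(frame_pred, frame_target)
        ) / len(frame_pred)
        total += weight * frame_loss
        weight_sum += weight
    return total / weight_sum


preds = [[0.9, 0.1], [0.5, 0.5]]
targets = [[1.0, 0.0], [1.0, 0.0]]
print(visibility_weighted_loss(preds, targets, ["visible", "full"]))
```

With these weights, a poorly predicted fully occluded frame pulls the batch loss up more than the same error on a visible frame would.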

[85] NegFaceDiff: The Power of Negative Context in Identity-Conditioned Diffusion for Synthetic Face Generation

Eduarda Caldeira,Naser Damer,Fadi Boutros

Main category: cs.CV

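TL;DR: NegFaceDiff is a new sampling method that introduces negative conditions into the identity-conditioned diffusion process, significantly improving the identity consistency and separability of generated face data. Face recognition models trained on its data outperform models trained on data generated without negative conditions across multiple benchmarks.

Details Motivation: Identity-conditioned diffusion models generate identity-consistent face images but lack an explicit sampling mechanism to enforce inter-class separability. The result is identity overlap in the generated data, which hurts face recognition performance.

Contribution: Proposes NegFaceDiff, which incorporates negative conditions into the identity-conditioned diffusion process and significantly improves the identity separability and consistency of the generated data.

Method: During the sampling stage of the diffusion model, negative conditions explicitly guide the model away from unwanted features while preserving intra-class consistency.

Result: Identity separability, measured by the Fisher Discriminant Ratio (FDR), increases from 2.427 to 5.687; face recognition models trained on NegFaceDiff data perform better across multiple benchmarks.

Insight: Negative conditions can strengthen the separation between identities in generated data, suggesting a new direction for optimizing diffusion models on identity-related tasks.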

Abstract: The use of synthetic data as an alternative to authentic datasets in face recognition (FR) development has gained significant attention, addressing privacy, ethical, and practical concerns associated with collecting and using authentic data. Recent state-of-the-art approaches have proposed identity-conditioned diffusion models to generate identity-consistent face images, facilitating their use in training FR models. However, these methods often lack explicit sampling mechanisms to enforce inter-class separability, leading to identity overlap in the generated data and, consequently, suboptimal FR performance. In this work, we introduce NegFaceDiff, a novel sampling method that incorporates negative conditions into the identity-conditioned diffusion process. NegFaceDiff enhances identity separation by leveraging negative conditions that explicitly guide the model away from unwanted features while preserving intra-class consistency. Extensive experiments demonstrate that NegFaceDiff significantly improves the identity consistency and separability of data generated by identity-conditioned diffusion models. Specifically, identity separability, measured by the Fisher Discriminant Ratio (FDR), increases from 2.427 to 5.687. These improvements are reflected in FR systems trained on the NegFaceDiff dataset, which outperform models trained on data generated without negative conditions across multiple benchmarks.
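
The separability figure quoted above can be made concrete with a small sketch. It assumes the form of the Fisher Discriminant Ratio commonly used on comparison scores, the squared difference of the genuine and impostor score means divided by the sum of their variances; the abstract does not state the exact formula, and the toy scores are invented for illustration.

```python
from statistics import mean, pvariance


def fisher_discriminant_ratio(genuine_scores, impostor_scores):
    """FDR = (mu_g - mu_i)^2 / (var_g + var_i), using population variances."""
    mu_g, mu_i = mean(genuine_scores), mean(impostor_scores)
    var_g, var_i = pvariance(genuine_scores), pvariance(impostor_scores)
    return (mu_g - mu_i) ** 2 / (var_g + var_i)


# Toy similarity scores: same-identity pairs vs. different-identity pairs.
genuine = [0.8, 0.9, 1.0]
impostor = [0.0, 0.1, 0.2]
print(fisher_discriminant_ratio(genuine, impostor))
```

A higher value means the two score distributions overlap less, which is the sense in which the rise from 2.427 to 5.687 indicates better identity separation.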

[86] GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors

Xingyilang Yin,Qi Zhang,Jiahao Chang,Ying Feng,Qingnan Fan,Xi Yang,Chi-Man Pun,Huaqi Zhang,Xiaodong Cun

Main category: cs.CV

TL;DR: GSFixer improves 3D Gaussian Splatting (3DGS) reconstructions from sparse views by leveraging reference-guided video diffusion priors, enhancing semantic and 3D consistency.

Details Motivation: Sparse-view 3DGS reconstructions are ill-posed, leading to artifacts. Existing methods struggle to ensure content consistency with input observations.

Contribution: Proposes GSFixer, a framework combining 2D semantic and 3D geometric features from reference views to enhance 3DGS quality. Introduces DL3DV-Res benchmark for artifact restoration evaluation.

Method: Uses a DiT-based video diffusion model trained on paired artifact renders and clean frames, integrating reference-guided conditions for semantic and 3D coherence.

Result: GSFixer outperforms state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction.

Insight: Reference-guided priors and multi-modal feature integration are key to improving 3DGS reconstructions from sparse inputs.

Abstract: Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an ill-posed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic features and 3D geometric features of reference views extracted from the visual geometry foundation model, enhancing the semantic coherence and 3D consistency when fixing artifact novel views. Furthermore, considering the lack of suitable benchmarks for 3DGS artifact restoration evaluation, we present DL3DV-Res which contains artifact frames rendered using low-quality 3DGS. Extensive experiments demonstrate our GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. Project page: https://github.com/GVCLab/GSFixer.

[87] Surg-InvNeRF: Invertible NeRF for 3D tracking and reconstruction in surgical vision

Gerardo Loza,Junlei Hu,Dominic Jones,Sharib Ali,Pietro Valdastri

Main category: cs.CV

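TL;DR: Proposes Surg-InvNeRF, a new method based on an invertible NeRF for 3D tracking and reconstruction in surgical scenes, using multi-scale HexPlanes and a bidirectional deformable-canonical mapping.

Details Motivation: Current point tracking methods are limited in long-term 3D consistency and motion estimation, particularly in surgical vision.

Contribution: 1. Proposes an invertible NeRF architecture (InvNeRF); 2. Introduces multi-scale HexPlanes for fast inference; 3. Develops a new algorithm for efficient pixel sampling and convergence criteria.

Method: A test-time optimisation (TTO) framework built on a NeRF, supervised through the reprojection of pixel correspondences, with a bidirectional deformable-canonical mapping, HexPlanes, and a strategy for guiding ray density.

Result: Evaluated on the STIR and SCARE datasets. In 2D tracking, it surpasses TTO state-of-the-art methods by nearly 50% in average precision; in 3D tracking, it is the first TTO approach and surpasses feed-forward methods.

Insight: An invertible NeRF combined with rendering-based optimisation strategies offers a new route to 3D tracking in surgical scenes that balances accuracy and efficiency.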

Abstract: We propose a novel test-time optimisation (TTO) approach framed by a NeRF-based architecture for long-term 3D point tracking. Most current methods in point tracking struggle to obtain consistent motion or are limited to 2D motion. TTO approaches frame the solution for long-term tracking as optimising a function that aggregates correspondences from other specialised state-of-the-art methods. Unlike the state-of-the-art on TTO, we propose parametrising such a function with our new invertible Neural Radiance Field (InvNeRF) architecture to perform both 2D and 3D tracking in surgical scenarios. Our approach allows us to exploit the advantages of a rendering-based approach by supervising the reprojection of pixel correspondences. It adapts strategies from recent rendering-based methods to obtain a bidirectional deformable-canonical mapping, to efficiently handle a defined workspace, and to guide the rays’ density. It also presents our multi-scale HexPlanes for fast inference and a new algorithm for efficient pixel sampling and convergence criteria. We present results on the STIR and SCARE datasets, for evaluating point tracking and testing the integration of kinematic data in our pipeline, respectively. In 2D point tracking, our approach surpasses the precision and accuracy of the TTO state-of-the-art methods by nearly 50% on average precision, while competing with other approaches. In 3D point tracking, this is the first TTO approach, surpassing feed-forward methods while incorporating the benefits of a deformable NeRF-based reconstruction.

[88] Slot Attention-based Feature Filtering for Few-Shot Learning

Javier Rodenas,Eduardo Aguilar,Petia Radeva

Main category: cs.CV

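TL;DR: The paper proposes SAFF, a Slot Attention-based feature filtering method for few-shot learning that improves classification by filtering out irrelevant features.

Details Motivation: In few-shot learning, irrelevant features such as background elements easily cause confusion and misclassification, and existing methods struggle to filter them out.

Contribution: Proposes SAFF, which combines Slot Attention with patch embeddings and uses a similarity matrix to filter irrelevant features.

Method: Uses the slot attention mechanism to discriminate and filter weak features, and a similarity matrix computed across support and query images to quantify the relevance of the filtered embeddings for classification.

Result: On few-shot learning benchmarks (CIFAR-FS, FC100, miniImageNet, and tieredImageNet), SAFF outperforms several state-of-the-art methods.

Insight: Slot Attention captures discriminative features while reducing irrelevant information, offering a new direction for few-shot learning.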

Abstract: Irrelevant features can significantly degrade few-shot learning performance. This problem is used to match queries and support images based on meaningful similarities despite the limited data. However, in this process, non-relevant features such as background elements can easily lead to confusion and misclassification. To address this issue, we propose Slot Attention-based Feature Filtering for Few-Shot Learning (SAFF) that leverages slot attention mechanisms to discriminate and filter weak features, thereby improving few-shot classification performance. The key innovation of SAFF lies in its integration of slot attention with patch embeddings, unifying class-aware slots into a single attention mechanism to filter irrelevant features effectively. We introduce a similarity matrix that computes across support and query images to quantify the relevance of filtered embeddings for classification. Through experiments, we demonstrate that Slot Attention performs better than other attention mechanisms, capturing discriminative features while reducing irrelevant information. We validate our approach through extensive experiments on few-shot learning benchmarks: CIFAR-FS, FC100, miniImageNet and tieredImageNet, outperforming several state-of-the-art methods.

[89] NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation

Devvrat Joshi,Islem Rekik

Main category: cs.CV

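TL;DR: NEURAL is a semantics-guided compression framework for multimodal medical imaging. It uses cross-attention scores from a vision-language model to prune chest X-rays down to diagnostically critical regions, converts the result into a graph representation, and fuses it with the clinical report into a unified data structure. On pneumonia detection it greatly reduces data size while maintaining high diagnostic performance.

Details Motivation: The rapid growth of multimodal medical imaging data creates storage and transmission challenges in resource-constrained clinical settings.

Contribution: 1. Proposes NEURAL, a semantics-guided data compression framework; 2. Prunes images using cross-attention scores from a vision-language model and produces a unified graph representation; 3. Substantially reduces data size while maintaining diagnostic performance.

Method: Uses cross-attention scores between the image and its radiological report, taken from a fine-tuned generative vision-language model, to prune chest X-rays into a graph representation, then fuses that graph with a knowledge graph derived from the clinical report to form a unified data structure.

Result: On the MIMIC-CXR and CheXpert Plus datasets, NEURAL reduces image data size by 93.4-97.7% with an AUC of 0.88-0.95, outperforming baseline models that use uncompressed data.

Insight: Semantics-guided compression resolves the trade-off between data size and clinical utility, giving resource-constrained settings an efficient option.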

Abstract: The rapid growth of multimodal medical imaging data presents significant storage and transmission challenges, particularly in resource-constrained clinical settings. We propose NEURAL, a novel framework that addresses this by using semantics-guided data compression. Our approach repurposes cross-attention scores between the image and its radiological report from a fine-tuned generative vision-language model to structurally prune chest X-rays, preserving only diagnostically critical regions. This process transforms the image into a highly compressed graph representation. This unified graph-based representation fuses the pruned visual graph with a knowledge graph derived from the clinical report, creating a universal data structure that simplifies downstream modeling. Validated on the MIMIC-CXR and CheXpert Plus datasets for pneumonia detection, NEURAL achieves a 93.4-97.7% reduction in image data size while maintaining a high diagnostic performance of 0.88-0.95 AUC, outperforming other baseline models that use uncompressed data. By creating a persistent, task-agnostic data asset, NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows and teleradiology without sacrificing performance. Our NEURAL code is available at https://github.com/basiralab/NEURAL.

[90] Multimodal Sheaf-based Network for Glioblastoma Molecular Subtype Prediction

Shekhnaz Idrissova,Islem Rekik

Main category: cs.CV

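TL;DR: The paper proposes a sheaf-based multimodal network for glioblastoma molecular subtype prediction. By combining MRI and histopathology images, it addresses the shortcomings of existing methods in multimodal fusion and in handling missing data.

Details Motivation: Molecular subtype classification of glioblastoma is important for targeted therapy, but it currently requires invasive tissue extraction, and existing methods are limited in multimodal fusion and in preserving structural information.

Contribution: Proposes a novel sheaf-based framework for structure-aware and consistent fusion of MRI and histopathology data, improving classification performance and robustness when data are missing.

Method: Builds the model on sheaf theory, uses a structure-based fusion mechanism to preserve structural information shared across modalities, and addresses feature retention in heterogeneous graphs.

Result: The model outperforms baseline methods and is more robust when data are incomplete or missing.

Insight: Sheaf theory offers an effective way to model structure in multimodal data and can support the development of non-invasive diagnostic tools.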

Abstract: Glioblastoma is a highly invasive brain tumor with rapid progression rates. Recent studies have shown that glioblastoma molecular subtype classification serves as a significant biomarker for effective targeted therapy selection. However, this classification currently requires invasive tissue extraction for comprehensive histopathological analysis. Existing multimodal approaches combining MRI and histopathology images are limited and lack robust mechanisms for preserving shared structural information across modalities. In particular, graph-based models often fail to retain discriminative features within heterogeneous graphs, and structural reconstruction mechanisms for handling missing or incomplete modality data are largely underexplored. To address these limitations, we propose a novel sheaf-based framework for structure-aware and consistent fusion of MRI and histopathology data. Our model outperforms baseline methods and demonstrates robustness in incomplete or missing data scenarios, contributing to the development of virtual biopsy tools for rapid diagnostics. Our source code is available at https://github.com/basiralab/MMSN/.

[91] Predictive Uncertainty for Runtime Assurance of a Real-Time Computer Vision-Based Landing System

Romeo Valentin,Sydney M. Katz,Artur B. Carneiro,Don Walker,Mykel J. Kochenderfer

Main category: cs.CV

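TL;DR: The paper presents a real-time computer vision-based aircraft landing system that improves safety and robustness through probabilistic keypoint regression and calibrated uncertainty.

Details Motivation: Autonomous navigation in civil aviation must meet strict robustness and safety requirements, which current data-driven computer vision systems struggle to satisfy.

Contribution: 1. An efficient neural architecture based on a spatial Soft Argmax operator; 2. A loss function that produces calibrated predictive uncertainties; 3. An adaptation of residual-based RAIM for runtime fault detection.

Method: 1. Probabilistic keypoint regression with spatial Soft Argmax; 2. Evaluation of the uncertainties with sharpness and calibration metrics; 3. Rejection of faulty outputs with residual-based RAIM.

Result: On a dataset of runway images, the model outperforms baseline architectures in accuracy and produces well-calibrated uncertainty estimates with sub-pixel precision.

Insight: Calibrated predictive uncertainty and runtime fault detection are key to improving the safety of autonomous systems.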

Abstract: Recent advances in data-driven computer vision have enabled robust autonomous navigation capabilities for civil aviation, including automated landing and runway detection. However, ensuring that these systems meet the robustness and safety requirements for aviation applications remains a major challenge. In this work, we present a practical vision-based pipeline for aircraft pose estimation from runway images that represents a step toward the ability to certify these systems for use in safety-critical aviation applications. Our approach features three key innovations: (i) an efficient, flexible neural architecture based on a spatial Soft Argmax operator for probabilistic keypoint regression, supporting diverse vision backbones with real-time inference; (ii) a principled loss function producing calibrated predictive uncertainties, which are evaluated via sharpness and calibration metrics; and (iii) an adaptation of Residual-based Receiver Autonomous Integrity Monitoring (RAIM), enabling runtime detection and rejection of faulty model outputs. We implement and evaluate our pose estimation pipeline on a dataset of runway images. We show that our model outperforms baseline architectures in terms of accuracy while also producing well-calibrated uncertainty estimates with sub-pixel precision that can be used downstream for fault detection.
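
The spatial Soft Argmax operator in innovation (i) turns a keypoint heatmap into coordinates by taking the expectation of pixel positions under a softmax over the heatmap. The sketch below is a generic pure-Python version, not the authors' architecture; it also returns the per-axis variance of the softmax distribution as one possible spread measure, which is an assumption and not the paper's calibrated uncertainty.

```python
import math


def spatial_soft_argmax(logits, temperature=1.0):
    """Expected (x, y) position and per-axis variance under a softmax.

    logits: 2D list indexed as logits[row][col].
    Returns ((mean_x, mean_y), (var_x, var_y)) in pixel units.
    """
    peak = max(max(row) for row in logits)  # subtract max for stability
    probs = [[math.exp((v - peak) / temperature) for v in row] for row in logits]
    total = sum(sum(row) for row in probs)
    probs = [[p / total for p in row] for row in probs]

    mean_x = sum(p * x for row in probs for x, p in enumerate(row))
    mean_y = sum(p * y for y, row in enumerate(probs) for p in row)
    var_x = sum(p * (x - mean_x) ** 2 for row in probs for x, p in enumerate(row))
    var_y = sum(p * (y - mean_y) ** 2 for y, row in enumerate(probs) for p in row)
    return (mean_x, mean_y), (var_x, var_y)


heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(spatial_soft_argmax(heatmap))
```

Because the output is an expectation rather than a hard argmax, it is differentiable and can fall between pixel centers, which is what makes sub-pixel estimates possible.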

[92] Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Lin Long,Yichen He,Wentao Ye,Yiyuan Pan,Yuan Lin,Hang Li,Junbo Zhao,Wei Li

Main category: cs.CV

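TL;DR: M3-Agent is a multimodal agent framework with long-term memory. It processes visual and auditory inputs and is trained with reinforcement learning to carry out tasks effectively.

Details Motivation: Existing multimodal agents fall short in handling real-time inputs and long-term memory. M3-Agent aims to mimic human memory mechanisms to reach a deeper understanding of its environment.

Contribution: Proposes the M3-Agent framework, introduces an entity-centric organization of multimodal long-term memory, and develops a new evaluation benchmark, M3-Bench.

Method: Trains M3-Agent with reinforcement learning and uses an entity-centric, multimodal memory format to store and retrieve information.

Result: M3-Agent outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, with 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively.

Insight: How multimodal memory is organized strongly affects an agent's ability to complete tasks, and reinforcement learning performs well on complex tasks.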

Abstract: We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot’s perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances the multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent

[93] MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models

Dianyi Wang,Siyuan Wang,Zejun Li,Yikun Wang,Yitong Li,Duyu Tang,Xiaoyu Shen,Xuanjing Huang,Zhongyu Wei

Main category: cs.CV

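TL;DR: The paper proposes MoIIE, a sparse Mixture of Experts architecture for Large Vision-Language Models (LVLMs) that jointly learns intra-modal features and cross-modal interactions, improving parameter efficiency and performance.

Details Motivation: Dense LVLMs are computationally expensive, while conventional MoE architectures struggle to model modality-specific features and cross-modal associations at the same time in multimodal tasks.

Contribution: Proposes the MoIIE architecture, which combines intra-modality and inter-modality experts, and designs a two-stage training strategy, improving model efficiency and performance.

Method: A modality-guided routing mechanism sends each token to its intra-modality experts and to a shared pool of inter-modality experts; a two-stage training strategy activates both MoE and multimodal capabilities.

Result: MoIIE models with 5.5B and 11.3B activated parameters match or surpass open-source models that use more activated parameters.

Insight: The sparse expert design balances intra-modal and cross-modal learning efficiently, offering a new direction for efficient training of multimodal models.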

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs and motivate the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improves parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbones demonstrate the effectiveness, efficiency and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLM-based multi-modal models that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.
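
The modality-guided routing described above can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the expert names in `EXPERTS`, the pool sizes, and the choice to take a single top-k over the union of a token's intra-modality experts and the shared pool are all hypothetical.

```python
import math

# Hypothetical expert pools: two per modality plus a shared inter-modality pool.
EXPERTS = {
    "text": ["text_0", "text_1"],
    "image": ["image_0", "image_1"],
    "shared": ["inter_0", "inter_1"],
}


def route_token(modality, router_logits, top_k=2):
    """Pick top_k experts for one token and return (expert, gate) pairs.

    Only the token's own intra-modality experts and the shared pool are
    eligible; experts of the other modality are never candidates.
    """
    candidates = EXPERTS[modality] + EXPERTS["shared"]
    chosen = sorted(candidates, key=lambda name: router_logits[name], reverse=True)
    chosen = chosen[:top_k]
    # Softmax over the chosen experts only, so the gates sum to 1.
    peak = max(router_logits[name] for name in chosen)
    exps = {name: math.exp(router_logits[name] - peak) for name in chosen}
    norm = sum(exps.values())
    return [(name, exps[name] / norm) for name in chosen]


logits = {
    "text_0": 2.0, "text_1": 0.1,
    "image_0": 5.0, "image_1": 4.0,
    "inter_0": 1.0, "inter_1": -1.0,
}
print(route_token("text", logits))
```

In this toy example the image experts have the highest logits, yet a text token is still routed only to `text_0` and `inter_0`, which is the point of restricting candidates by modality.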

[94] DSS-Prompt: Dynamic-Static Synergistic Prompting for Few-Shot Class-Incremental Learning

Linpu He,Yanan Li,Bingze Li,Elvis Han Cui,Donghui Wang

Main category: cs.CV

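TL;DR: DSS-Prompt is a method for few-shot class-incremental learning (FSCIL). It uses static and dynamic prompts together to adapt a pre-trained Vision Transformer and outperforms existing methods without further training on the incremental tasks.

Details Motivation: Pre-trained models perform well on downstream tasks but remain underexplored in FSCIL. The challenge is to keep learning new classes with minimal changes to the pre-trained model while avoiding catastrophic forgetting.

Contribution: Proposes DSS-Prompt, which uses static and dynamic prompts synergistically. Static prompts bridge the domain gap between pre-training and downstream data, while dynamic prompts capture instance-aware semantics and enable flexible transfer from base to novel classes.

Method: Each Transformer block combines static and dynamic prompts: static prompts handle the domain gap; dynamic prompts are generated from input-related semantics extracted by a pre-trained multi-modal model, and their importance is adjusted adaptively across layers. A simple prototype classifier is then applied to the prompted embeddings.

Result: Experiments on four benchmarks show that DSS-Prompt consistently outperforms existing methods and alleviates catastrophic forgetting.

Insight: Using static and dynamic prompts together both improves adaptation and enables flexible transfer across classes, showing the potential of prompts in incremental learning.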

Abstract: Learning from large-scale pre-trained models with strong generalization ability has shown remarkable success in a wide range of downstream tasks recently, but it is still underexplored in the challenging few-shot class-incremental learning (FSCIL) task. It aims to continually learn new concepts from limited training samples without forgetting the old ones at the same time. In this paper, we introduce DSS-Prompt, a simple yet effective approach that transforms the pre-trained Vision Transformer with minimal modifications in the way of prompts into a strong FSCIL classifier. Concretely, we synergistically utilize two complementary types of prompts in each Transformer block: static prompts to bridge the domain gap between the pre-training and downstream datasets, thus enabling better adaptation; and dynamic prompts to capture instance-aware semantics, thus enabling easy transfer from base to novel classes. Specifically, to generate dynamic prompts, we leverage a pre-trained multi-modal model to extract input-related diverse semantics, thereby generating complementary input-aware prompts, and then adaptively adjust their importance across different layers. In this way, on top of the prompted visual embeddings, a simple prototype classifier can beat state-of-the-art methods without further training on the incremental tasks. We conduct extensive experiments on four benchmarks to validate the effectiveness of our DSS-Prompt and show that it consistently achieves better performance than existing approaches on all datasets and can alleviate the catastrophic forgetting issue as well.
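
The "simple prototype classifier" mentioned above can be sketched generically: each class is represented by the mean of its few embeddings, and a query is assigned to the nearest prototype. The sketch assumes cosine similarity and plain lists as embeddings; it does not reproduce DSS-Prompt's prompted Vision Transformer features, and the function names are hypothetical.

```python
import math


def class_prototypes(embeddings, labels):
    """Mean embedding per class, computed from a few labeled samples."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(query, prototypes):
    """Label of the prototype most similar to the query embedding."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Base session: two classes with two samples each.
prototypes = class_prototypes(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    ["cat", "cat", "dog", "dog"],
)
# Incremental session: a new class is added from one sample, no retraining.
prototypes.update(class_prototypes([[-1.0, 0.0]], ["bird"]))
print(classify([0.8, 0.2], prototypes), classify([-0.9, 0.1], prototypes))
```

Adding a class only adds a prototype and leaves the existing ones untouched, which is why this kind of classifier suits incremental sessions with no further training.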

[95] MeMoSORT: Memory-Assisted Filtering and Motion-Adaptive Association Metric for Multi-Person Tracking

Yingjie Wang,Zhixing Wang,Le Zheng,Tianxiao Liu,Roujing Li,Xueyao Hu

Main category: cs.CV

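TL;DR: MeMoSORT is a multi-person tracking method built on a memory-assisted Kalman filter and a motion-adaptive IoU, and it significantly improves tracking performance.

Details Motivation: Conventional tracking-by-detection methods rely on a Kalman filter and rigid IoU-based association, which cannot cope with complex motion and occlusion and lead to filtering errors and identity switches.

Contribution: 1. A Memory-assisted Kalman filter (MeKF) that uses memory-augmented neural networks to compensate for motion-model mismatch; 2. A Motion-adaptive IoU (Mo-IoU) that expands the matching space and incorporates height similarity to reduce the impact of detection errors and association failures.

Method: MeMoSORT combines MeKF and Mo-IoU: the former makes the motion model more adaptive, and the latter improves association, together making tracking more robust.

Result: HOTA scores of 67.9% on DanceTrack and 82.1% on SportsMOT, which is state-of-the-art performance.

Insight: A memory mechanism and a motion-adaptive association metric can handle multi-person tracking in complex scenes while keeping the tracker lightweight and real-time.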

Abstract: Multi-object tracking (MOT) in human-dominant scenarios, which involves continuously tracking multiple people within video sequences, remains a significant challenge in computer vision due to targets’ complex motion and severe occlusions. Conventional tracking-by-detection methods are fundamentally limited by their reliance on Kalman filter (KF) and rigid Intersection over Union (IoU)-based association. The motion model in KF often mismatches real-world object dynamics, causing filtering errors, while rigid association struggles under occlusions, leading to identity switches or target loss. To address these issues, we propose MeMoSORT, a simple, online, and real-time MOT tracker with two key innovations. First, the Memory-assisted Kalman filter (MeKF) uses memory-augmented neural networks to compensate for mismatches between assumed and actual object motion. Second, the Motion-adaptive IoU (Mo-IoU) adaptively expands the matching space and incorporates height similarity to reduce the influence of detection errors and association failures, while remaining lightweight. Experiments on DanceTrack and SportsMOT show that MeMoSORT achieves state-of-the-art performance, with HOTA scores of 67.9% and 82.1%, respectively.
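
The abstract describes Mo-IoU as expanding the matching space and adding height similarity, without giving a formula. The sketch below is one hypothetical reading of that description: boxes are enlarged by a ratio that grows with the track's speed, and the resulting IoU is scaled by the ratio of box heights. The `gain` and `max_ratio` parameters and the multiplicative combination are assumptions, not the authors' definition.

```python
def expand(box, ratio):
    """Grow an (x1, y1, x2, y2) box by `ratio` of its size, kept centered."""
    x1, y1, x2, y2 = box
    dw, dh = ratio * (x2 - x1) / 2, ratio * (y2 - y1) / 2
    return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)


def iou(a, b):
    """Standard Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def motion_adaptive_iou(track_box, det_box, speed, gain=0.01, max_ratio=0.5):
    """IoU on speed-dependent expanded boxes, scaled by height similarity."""
    ratio = min(max_ratio, gain * speed)  # faster targets get a wider match area
    h_track = track_box[3] - track_box[1]
    h_det = det_box[3] - det_box[1]
    height_sim = min(h_track, h_det) / max(h_track, h_det)
    return iou(expand(track_box, ratio), expand(det_box, ratio)) * height_sim


track = (0.0, 0.0, 10.0, 20.0)
detection = (11.0, 0.0, 21.0, 20.0)  # just missed by the predicted box
print(iou(track, detection), motion_adaptive_iou(track, detection, speed=50.0))
```

In the example the plain IoU is zero, so a rigid matcher would drop the pair, while the expanded boxes still overlap and the fast-moving target can be associated.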

[96] MUJICA: Reforming SISR Models for PBR Material Super-Resolution via Cross-Map Attention

Xin Du,Maoyuan Xu,Zhi Ying

Main category: cs.CV

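TL;DR: MUJICA is an adapter based on cross-map attention that reforms pre-trained Swin-transformer-based SISR models to improve PBR material super-resolution.

Details Motivation: Existing SISR methods applied to multi-map PBR material super-resolution suffer from cross-map inconsistency, inadequate modeling of modality-specific features, and limited generalization caused by data distribution shifts.

Contribution: Proposes the MUJICA adapter, which fuses features with cross-map attention to improve PBR material super-resolution while preserving the reconstruction ability of the pre-trained SISR model.

Method: Attaches MUJICA after the pre-trained and frozen backbone of a Swin-transformer-based SISR model (such as SwinIR, DRCT, or HMANet) and uses cross-map attention to fuse multi-modal features.

Result: MUJICA improves PSNR, SSIM, and LPIPS scores, preserves cross-map consistency, trains efficiently with limited resources, and reaches state-of-the-art performance on PBR material datasets.

Insight: Cross-map attention can fuse the multi-modal features of PBR materials effectively, improving both overall super-resolution quality and consistency.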

Abstract: Physically Based Rendering (PBR) materials are typically characterized by multiple 2D texture maps such as basecolor, normal, metallic, and roughness which encode spatially-varying bi-directional reflectance distribution function (SVBRDF) parameters to model surface reflectance properties and microfacet interactions. Upscaling SVBRDF material is valuable for modern 3D graphics applications. However, existing Single Image Super-Resolution (SISR) methods struggle with cross-map inconsistency, inadequate modeling of modality-specific features, and limited generalization due to data distribution shifts. In this work, we propose Multi-modal Upscaling Joint Inference via Cross-map Attention (MUJICA), a flexible adapter that reforms pre-trained Swin-transformer-based SISR models for PBR material super-resolution. MUJICA is seamlessly attached after the pre-trained and frozen SISR backbone. It leverages cross-map attention to fuse features while preserving remarkable reconstruction ability of the pre-trained SISR model. Applied to SISR models such as SwinIR, DRCT, and HMANet, MUJICA improves PSNR, SSIM, and LPIPS scores while preserving cross-map consistency. Experiments demonstrate that MUJICA enables efficient training even with limited resources and delivers state-of-the-art performance on PBR material datasets.

[97] TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos

Jinxi Li,Ziyang Song,Bo Yang

Main category: cs.CV

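TL;DR: TRACE is a new framework for learning 3D Gaussian physical dynamics from multi-view videos. It models each 3D point as a rigid particle, learns a translation-rotation dynamics system for it, and explicitly estimates physical parameters, capturing complex motion physics without human labels.

Details Motivation: Existing methods that rely on physics-informed constraints or simple physics models struggle to learn complex motion physics and often need extra labels such as object types or masks. This work aims to model 3D scene geometry, appearance, and physical information from dynamic multi-view videos alone.

Contribution: Proposes the TRACE framework, which models each 3D point as a rigid particle, directly learns its translation-rotation dynamics system, and explicitly estimates physical parameters.

Method: Treats each 3D point as a rigid particle with size and orientation in space, learns its dynamics system, and estimates the physical parameters that govern its motion over time.

Result: On three existing datasets and one new synthetic dataset, TRACE significantly outperforms baselines on future frame extrapolation.

Insight: Clustering the learned physical parameters is enough to segment multiple objects or parts.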

Abstract: In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle’s motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters.

[98] Poaching Hotspot Identification Using Satellite Imagery

Aryan Pandhi,Shrey Baid,Sanjali Jha

Main category: cs.CV

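TL;DR: The paper proposes using satellite imagery together with computer vision models to automatically identify African elephant poaching hotspots and to track poaching activity as it shifts.

Details Motivation: African elephant poaching is a long-standing and dynamic problem. Traditional anti-poaching efforts are concentrated near towns, while most poaching happens in remote areas. Analyzing satellite imagery with computer vision models can cover large areas efficiently and avoids both the limits of manual tracking and disturbance to the ecosystem.

Contribution: Proposes a poaching hotspot identification system based on satellite imagery and computer vision that tracks the geographic indicators of poaching activity and helps optimize resource deployment.

Method: Computer vision models analyze satellite imagery to identify geographic factors favorable to poaching (such as water sources and patrol blind spots) and generate dynamic hotspot maps.

Result: The system can cover large areas, reduce manual intervention, and adapt to shifts in poaching regions.

Insight: Combining computer vision with satellite imagery gives wildlife conservation a new tool, particularly for problems that are highly dynamic and geographically widespread.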

Abstract: Elephant poaching in African countries has been a decades-old problem, so much so that African Forest Elephants are now listed as an endangered species, and African Savannah Elephants as critically endangered, by the IUCN (International Union for Conservation of Nature). [1] Elephants are hunted primarily for their ivory tusks, which has caused many elephants to be born tuskless as a genetic modification for survival. [2] Data gathered by recent studies shows that although poaching methods remain the same, the poaching grounds are rather dynamic. Poachers have shifted to areas with fewer ranger patrols, and several other factors such as watering holes, seasons, and altitude cause constant shifts in poaching hotspot locations. [3] After a period of low poaching from 2000 to 2014, poaching numbers in African countries are now on the rise again; the WWF (World Wildlife Fund) says 20,000 elephants are poached annually [4]. In African countries, anti-poaching efforts are concentrated near towns, while a majority of poaching occurs in deserted regions. All of these factors create the need for a computer vision (CV) model that identifies poaching hotspots by locating the geographic indicators of favorable poaching regions. A CV model eliminates the need to manually track poachers and account for environmental factors when deploying resources, and its combination with satellite imagery allows us to survey large areas without disturbing local species or violating cross-border aviation restrictions.

[99] Evolution of Low-Level and Texture Human-CLIP Alignment

Pablo Hernández-Cámara,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Jesus Malo,Valero Laparra

Main category: cs.CV

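TL;DR: The paper studies why, during training of multimodal models such as CLIP, correlation with low-level human image quality assessments first rises and then falls, and identifies shape-texture bias alignment and the drop in classification accuracy under noise as the key factors.

Details Motivation: The authors observed that CLIP aligns best with low-level human perception early in training and then gradually loses that alignment. The cause of this behavior and its implications for model optimization merit investigation.

Contribution: Explains the mechanism behind the changing low-level perceptual alignment during CLIP training, relates shape-texture bias to noise sensitivity, and offers new insight into the trade-off between perceptual alignment and robustness in vision-language models.

Method: Analyzes how shape-texture bias alignment and classification accuracy under noise change over training in order to explain the evolution of low-level perceptual alignment in CLIP.

Result: Early in training, CLIP learns low-level visual features, which improves alignment with human perception but increases noise sensitivity. Later it shifts toward abstract shape-based representations, which improves robustness but reduces low-level alignment.

Insight: CLIP's learning dynamics involve a trade-off between perceptual alignment and robustness, and model optimization has to balance the two. The finding is a useful reference for training strategies in multimodal models.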

Abstract: During the training of multi-modal models like CLIP, we observed an intriguing phenomenon: the correlation with low-level human image quality assessments peaks in the early epochs before gradually declining. This study investigates this observation and seeks to understand its causes through two key factors: shape-texture bias alignment and classification accuracy drop under noise. Our findings suggest that CLIP initially learns low-level visual features, enhancing its alignment with low-level human perception but also increasing its sensitivity to noise and its texture bias. As training progresses, the model shifts toward more abstract shape-based representations, improving noise robustness but reducing alignment with low-level human perception. These results suggest that these factors share an underlying learning mechanism and provide new insights into optimizing the trade-off between perceptual alignment and robustness in vision-language models.
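
Correlation with human image quality assessments is commonly measured as a rank correlation between model-derived distances and human opinion scores. The sketch below computes Spearman's rank correlation in plain Python. The choice of Spearman and the helper names are assumptions made for illustration, since the abstract does not name the exact statistic used.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))
```

Tracking this value per checkpoint (model distances for a fixed set of distorted images against their human scores) would give the kind of alignment-over-epochs curve the abstract describes.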

[100] ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video

Rajan Das Gupta,Md Yeasin Rahat,Nafiz Fahad,Abir Ahmed,Liew Tze Hui

Main category: cs.CV

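TL;DR: ViMoNet is a multimodal vision-language framework that combines video and motion data to understand human behavior. A joint training strategy and the new VIMOS dataset improve its performance.

Details Motivation: Existing models focus on a single data type (motion or video) and cannot fully capture the nuances of human behavior, so the strengths of both need to be combined for a more complete understanding.

Contribution: Proposes the ViMoNet framework with a multimodal training strategy over video and motion data, and builds the VIMOS dataset and the ViMoNet-Bench benchmark.

Method: A joint training strategy combines detailed motion-text data with generic video-text data to exploit their complementary strengths.

Result: ViMoNet outperforms existing methods on behavior understanding, motion understanding, and caption generation.

Insight: Joint learning over multimodal data effectively improves a model's ability to capture spatiotemporal information about human behavior.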

Abstract: This study investigates how large language models (LLMs) can be used to understand human behavior using motion and video data. We argue that combining both types is essential to fully capture the nuanced movements and meanings of human actions, in contrast to recent models that concentrate only on motion data or only on videos. To address this, we provide ViMoNet, a straightforward yet effective framework for comprehending, characterizing, and deducing human action. ViMoNet employs a joint training strategy that leverages the advantages of two data types: detailed motion-text data, which is more exact, and generic video-text data, which is more comprehensive but less detailed. This helps the model acquire rich information about time and space in human behavior. Additionally, we provide a brand new dataset named VIMOS that contains a variety of videos, motion sequences, instructions, and subtitles. We developed ViMoNet-Bench, a standardized benchmark with carefully labeled samples, to evaluate how well models understand human behavior. Our tests show that ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.

[101] Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Zijian Song,Sihan Qin,Tianshui Chen,Liang Lin,Guangrun Wang

Main category: cs.CV

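TL;DR: Proposes the Physical Autoregressive Model (PAR), which uses physical tokens that combine video frames and actions to jointly model the evolution of the robot and its environment, achieving accurate video prediction and consistent action trajectories without action pretraining.

Details Motivation: The scarcity of robot manipulation data has pushed researchers to use large models pretrained on other modalities, but existing methods usually require action pretraining. This work aims to understand physical dynamics through the world knowledge embedded in autoregressive video generation models, with no additional action pretraining.

Contribution: 1. Proposes PAR, which jointly models video frames and actions as physical tokens; 2. Adopts a DiT-based de-tokenizer that reduces quantization error and strengthens the interaction between frames and actions; 3. Combines a causal mask, inverse kinematics, and a KV cache to improve performance and efficiency.

Method: PAR combines video frames and actions into physical tokens and uses an autoregressive model to predict future frames and action trajectories. A DiT de-tokenizer models continuous tokens, and a causal mask, inverse kinematics, and KV caching improve training and inference speed.

Result: On the ManiSkill benchmark, PAR reaches a 100% success rate on the PushCube task, matches action-pretrained baselines on other tasks, and accurately predicts video frames and action trajectories.

Insight: By transferring world knowledge from autoregressive video pretraining, PAR shows that physical dynamics can be modeled effectively without action pretraining, which suggests a new direction for robotic manipulation.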

Abstract: The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.

[102] KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging

Valentin Boussot,Jean-Louis Dillenseger

Main category: cs.CV

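TL;DR: KonfAI is a modular, configurable deep learning framework designed for medical imaging. Training, inference, and evaluation workflows are configured through YAML files, and the framework supports advanced strategies such as patch-based learning, test-time augmentation, and multi-model training.

Details Motivation: Medical imaging lacks a modular deep learning framework that supports complex tasks (such as segmentation, registration, and image synthesis) while also being highly configurable and extensible, which is needed to make experiments more transparent and efficient.

Contribution: KonfAI provides a modular, fully configurable framework that supports advanced strategies (such as patch-based learning and test-time augmentation) and complex multi-model training while keeping the code transparent and extensible.

Method: Workflows are defined in YAML configuration files, and the modular design supports custom models, loss functions, and data components, giving a flexible deep learning pipeline.

Result: KonfAI has contributed to strong results in several international medical imaging challenges and has been applied successfully to segmentation, registration, and image synthesis tasks.

Insight: Its configuration-driven design makes KonfAI an efficient, reproducible tool for medical imaging research, especially for complex tasks and multi-model training.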

Abstract: KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at https://github.com/vboussot/KonfAI.

[103] RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians

Shenxing Wei,Jinxi Li,Yafei Yang,Siyuan Zhou,Bo Yang

Main category: cs.CV

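TL;DR: RayletDF is a generalizable 3D surface reconstruction method that predicts surface points directly through a raylet distance field, combining efficiency with generalization.

Details Motivation: Existing coordinate-based methods are computationally expensive when rendering explicit surfaces, so a more efficient and generalizable 3D surface reconstruction method is needed.

Contribution: Proposes RayletDF, which predicts surface points directly through a raylet distance field and achieves efficient, generalizable 3D surface reconstruction.

Method: The method has three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. They extract local geometric features, predict distances, and aggregate the predictions from multiple raylets.

Result: The method performs strongly on multiple public datasets, and its generalization to unseen datasets in a single forward pass is especially notable.

Insight: The raylet distance field offers a new, efficient, and generalizable approach to 3D surface reconstruction that works with different input sources (point clouds or 3D Gaussians).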

Abstract: In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or from 3D Gaussians pre-estimated by 3DGS from RGB images. Unlike existing coordinate-based methods, which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single forward pass across unseen datasets in testing.

[104] Enhancing Diffusion Face Generation with Contrastive Embeddings and SegFormer Guidance

Dhruvraj Singh Rawat,Enggen Sherpa,Rishikesan Kirupanantha,Tin Hoang

Main category: cs.CV

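TL;DR: The paper studies diffusion-based face generation on the small-scale CelebAMask-HQ dataset. It evaluates unconditional and conditional pipelines, compares UNet and DiT architectures, and proposes combining attribute embeddings trained with an InfoNCE loss and a SegFormer segmentation encoder to improve semantic alignment and controllability.

Details Motivation: The goal is to improve semantic alignment and controllability of face generation on small datasets and to address the shortcomings of existing methods in attribute-guided generation.

Contribution: Proposes optimizing attribute embeddings with an InfoNCE loss and adding a SegFormer segmentation encoder to strengthen the semantic alignment and controllability of diffusion models.

Method: Compares UNet and DiT architectures for unconditional generation and fine-tunes Stable Diffusion models with LoRA; introduces an InfoNCE loss and a SegFormer encoder to improve multi-condition generation.

Result: Experiments show that contrastive embedding learning and advanced segmentation encoding significantly improve attribute-guided generation in small-data settings.

Insight: Contrastive learning is effective for optimizing attribute embeddings, and SegFormer's efficient segmentation features open a new path for controllable generation with limited data.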

Abstract: We present a benchmark of diffusion models for human face generation on a small-scale CelebAMask-HQ dataset, evaluating both unconditional and conditional pipelines. Our study compares UNet and DiT architectures for unconditional generation and explores LoRA-based fine-tuning of pretrained Stable Diffusion models as a separate experiment. Building on the multi-conditioning approach of Giambi and Lisanti, which uses both attribute vectors and segmentation masks, our main contribution is the integration of an InfoNCE loss for attribute embedding and the adoption of a SegFormer-based segmentation encoder. These enhancements improve the semantic alignment and controllability of attribute-guided synthesis. Our results highlight the effectiveness of contrastive embedding learning and advanced segmentation encoding for controlled face generation in limited data settings.
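
For readers unfamiliar with InfoNCE, the sketch below shows the standard form of the loss on a small batch: each anchor is pulled toward its own positive and pushed away from the other positives in the batch. How the paper pairs attribute vectors with embeddings is not stated in the abstract, so the pairing, the cosine similarity, and the temperature value are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the match for anchors[i]; the other rows of
    `positives` serve as in-batch negatives for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss approaches zero when every anchor is far more similar to its own positive than to the others, and equals the log of the batch size when all similarities are identical.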

[105] Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment

Pablo Hernández-Cámara,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Valero Laparra,Jesus Malo

Main category: cs.CV

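TL;DR: The paper studies the perceptual alignment of Vision Transformers (ViTs) and finds that larger model size, data augmentation, and regularization reduce alignment with human perception, while data diversity has little effect.

Details Motivation: ViTs perform very well on image recognition, but their alignment with human perception has not been studied in depth. The paper explores the perceptual alignment properties of ViTs to guide applications that need human-like visual understanding.

Contribution: Systematically analyzes how model size, dataset size, data augmentation, and regularization affect ViT perceptual alignment, and finds a trade-off between model complexity, training strategy, and alignment.

Method: Evaluates ViT perceptual alignment on the TID2013 dataset, examining the effects of model size, dataset diversity, number of training repetitions, data augmentation, and regularization.

Result: Larger models show lower alignment. Data diversity has minimal effect, but repeated exposure to the same images reduces alignment. Stronger data augmentation and regularization reduce alignment further.

Insight: Choices of model complexity and training strategy affect how well ViTs align with human perception, which matters for applications that need human-like visual understanding.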

Abstract: Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset. Our findings confirm that larger models exhibit lower perceptual alignment, consistent with previous works. Increasing dataset diversity has a minimal impact, but exposing models to the same images more times reduces alignment. Stronger data augmentation and regularization further decrease alignment, especially in models exposed to repeated training cycles. These results highlight a trade-off between model complexity, training strategies, and alignment with human perception, raising important considerations for applications requiring human-like visual understanding.

[106] OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better

Yupeng Zhou,Zhen Li,Ziheng Ouyang,Yuming Chen,Ruoyi Du,Daquan Zhou,Bin Fu,Yihao Liu,Peng Gao,Ming-Ming Cheng,Qibin Hou

Main category: cs.CV

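TL;DR: OneVAE proposes joint discrete and continuous optimization, using the priors of a continuous VAE to improve the training of discrete video VAEs and significantly improving reconstruction quality and training efficiency.

Details Motivation: Discrete video VAEs suffer from unstable training, long training times, and poor reconstruction quality, whereas continuous VAEs train more stably and perform better. The aim is to improve discrete video VAEs by borrowing the strengths of continuous VAEs.

Contribution: 1. Proposes a joint discrete-continuous optimization scheme (OneVAE) that, for the first time, achieves competitive performance on both discrete and continuous representations within a single network. 2. Introduces a multi-token quantization mechanism that significantly improves reconstruction quality. 3. Strengthens first-frame reconstruction, which improves the reconstruction ability of high-compression video VAEs.

Method: 1. Uses FSQ (finite scalar quantization) to preserve the priors of a pretrained continuous VAE. 2. Proposes multi-token quantization and a strengthened first-frame reconstruction strategy. 3. Jointly optimizes the losses for discrete and continuous representations.

Result: OneVAE converges several times faster than training from scratch, improves PSNR by nearly 1 dB, and markedly improves the performance of 4×16×16 high-compression discrete VAEs.

Insight: Discrete and continuous representations are intrinsically linked, and joint optimization can exploit the strengths of both, giving multimodal tasks an efficient, unified video encoding scheme.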

Abstract: Encoding videos into discrete tokens aligns them with text tokens and facilitates concise, unified multi-modal LLMs, yet it introduces significant spatiotemporal compression compared to continuous video representations. Previous discrete video VAEs experienced unstable training, long training time, and degraded reconstruction quality. Given the easier training and superior performance of continuous VAEs, an intuitive idea is to enhance discrete video VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between discrete and continuous representations, we found that FSQ could effectively preserve pre-trained continuous VAE priors compared to other quantization methods. By leveraging continuous VAE priors, it converges several times faster than training from scratch and achieves superior performance at convergence. Meanwhile, two structural improvements are proposed. First, inspired by how continuous VAEs enhance reconstruction via enlarged latent dimensions, we introduce a multi-token quantization mechanism, which achieves nearly a 1 dB improvement in PSNR without compromising the token compression ratio. Second, to tackle reconstruction challenges in high-compression video VAEs, we strengthen first-frame reconstruction, enabling the causal VAE to leverage this information in subsequent frames and markedly improving the performance of 4×16×16 discrete VAEs. Furthermore, we propose a joint discrete-continuous optimization scheme that unifies the two paradigms and, for the first time, achieves competitive performance on both continuous and discrete representations within a single network. We name our method OneVAE to reflect this connection.
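
FSQ here refers to finite scalar quantization, which bounds each latent channel and rounds it to a small fixed number of levels, so the codebook is implicit rather than learned. The following is a minimal sketch of the forward quantization only, assuming an odd number of levels per channel and a tanh bound. The function names and level counts are illustrative, training details such as the straight-through gradient are omitted, and nothing here is taken from OneVAE's implementation.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each channel of z to one of levels[i] integer values.

    Assumes every entry of `levels` is odd, so the grid is symmetric around 0.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        bounded = half * math.tanh(value)   # squashed into (-half, half)
        codes.append(int(round(bounded)))   # integer in [-half, half]
    return codes


def fsq_index(codes, levels):
    """Map per-channel codes to a single token id (mixed-radix encoding)."""
    index, base = 0, 1
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index += (code + half) * base       # shift code into [0, n_levels)
        base *= n_levels
    return index


def codebook_size(levels):
    """Number of distinct tokens implied by the per-channel level counts."""
    return math.prod(levels)
```

Because the quantizer is just a bounded rounding of the continuous latent, a pretrained continuous encoder can feed it directly, which is one plausible reading of why the abstract reports that FSQ preserves continuous VAE priors.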

[107] HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics

Weiqi Li,Zehao Zhang,Liang Lin,Guangrun Wang

Main category: cs.CV

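TL;DR: HumanGenesis proposes an agent-based framework that combines geometric and generative modeling to address geometric inconsistency and limited motion generalization in synthetic human dynamics. Four cooperating agents deliver high-quality 3D reconstruction and video synthesis.

Details Motivation: Existing methods for synthetic human dynamics suffer from geometric inconsistency and weak motion generalization, which leads to coarse reconstruction and poorly integrated scenes. HumanGenesis combines geometric and generative modeling to improve reconstruction quality and dynamic expressiveness.

Contribution: Proposes the HumanGenesis framework, in which four agents (Reconstructor, Critique Agent, Pose Guider, Video Harmonizer) cooperate to achieve 3D-consistent geometric reconstruction, motion generalization, and high-quality video synthesis.

Method: Combines 3D Gaussian Splatting, deformation decomposition, multi-round MLLM-based reflection, time-aware parametric encoders, and a hybrid rendering pipeline, and refines reconstruction and generation through a Back-to-4D feedback loop.

Result: Performs strongly on text-guided synthesis, video reenactment, and novel-pose generalization, with significant gains in expressiveness, geometric fidelity, and scene integration.

Insight: Geometric and generative modeling can be combined through agent collaboration to produce high-quality dynamic synthesis; feedback loops and multi-agent interaction are key to improving both reconstruction and generation.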

Abstract: Synthetic human dynamics aims to generate photorealistic videos of human subjects performing expressive, intention-driven motions. However, current approaches face two core challenges: (1) geometric inconsistency and coarse reconstruction, due to limited 3D modeling and detail preservation; and (2) motion generalization limitations and scene inharmonization, stemming from weak generative capabilities. To address these, we present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents: (1) Reconstructor builds 3D-consistent human-scene representations from monocular video using 3D Gaussian Splatting and deformation decomposition. (2) Critique Agent enhances reconstruction fidelity by identifying and refining poor regions via multi-round MLLM-based reflection. (3) Pose Guider enables motion generalization by generating expressive pose sequences using time-aware parametric encoders. (4) Video Harmonizer synthesizes photorealistic, coherent video via a hybrid rendering pipeline with diffusion, refining the Reconstructor through a Back-to-4D feedback loop. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization, significantly improving expressiveness, geometric fidelity, and scene integration.

[108] SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection

Yachao Liang,Min Yu,Gang Li,Jianguo Jiang,Boquan Li,Feng Yu,Ning Zhang,Xiang Meng,Weiqing Huang

Main category: cs.CV

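TL;DR: The paper proposes a face forgery video detection method based on audio-visual speech representation learning. It exploits the synergy between audio signals and facial movements, learns representations through a self-supervised masked prediction task, and shows strong generalization and robustness on unseen datasets.

Details Motivation: Detecting face forgery videos is challenging in digital forensics, especially for cross-dataset generalization and robustness to perturbations. The authors found that audio signals rich in speech content accurately reflect facial movements, which motivates a new method built on audio-visual synergy.

Contribution: 1. Proposes a novel self-supervised learning framework that learns speech representations through an audio-visual masked prediction task; 2. Achieves high cross-dataset detection performance and robustness without training on any forged video.

Method: 1. Learns audio-visual speech representations with local and global semantics on real videos through a self-supervised masked prediction task; 2. Transfers the learned model directly to the forgery detection task.

Result: Experiments show that the method outperforms existing methods in cross-dataset generalization and robustness, with no forged videos involved in training.

Insight: Learning audio-visual speech representations captures fine-grained information about facial movement and provides an effective approach to face forgery detection that needs no forged data.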

Abstract: Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements, embarking on a novel approach through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, can provide precise information effectively reflecting facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of cross-dataset generalization and robustness, without the participation of any fake video in model training. Code is available at https://github.com/Eleven4AI/SpeechForensics.

[109] Quo Vadis Handwritten Text Generation for Handwritten Text Recognition?

Vittorio Pippi,Konstantina Nikolaidou,Silvia Cascianelli,George Retsinas,Giorgos Sfikas,Rita Cucchiara,Marcus Liwicki

Main category: cs.CV

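TL;DR: The paper systematically evaluates the effect of three handwritten text generation (HTG) models on low-resource handwritten text recognition (HTR) and provides quantitative guidelines for choosing the most effective model.

Details Motivation: Addresses the difficulty of recognizing small, author-specific handwritten collections when digitizing historical manuscripts, especially when the training data distribution differs from the target data distribution.

Contribution: Presents the first systematic comparison of three HTG models (generative adversarial, diffusion, and autoregressive paradigms) for improving HTR performance, and analyzes how the visual and linguistic characteristics of synthetic data affect fine-tuning outcomes.

Method: Evaluates synthetic data generated by three HTG models (a GAN, a diffusion model, and an autoregressive model) on low-resource HTR tasks and proposes quantitative selection criteria.

Result: HTG models have a significant effect on low-resource HTR performance, but the benefit varies with the model type and the characteristics of the synthetic data.

Insight: How well the visual and linguistic characteristics of synthetic data match the target is critical to fine-tuning results in HTR; generative models need further improvement to better mimic real handwriting.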

Abstract: The digitization of historical manuscripts presents significant challenges for Handwritten Text Recognition (HTR) systems, particularly when dealing with small, author-specific collections that diverge from the training data distributions. Handwritten Text Generation (HTG) techniques, which generate synthetic data tailored to specific handwriting styles, offer a promising solution to address these challenges. However, the effectiveness of various HTG models in enhancing HTR performance, especially in low-resource transcription settings, has not been thoroughly evaluated. In this work, we systematically compare three state-of-the-art styled HTG models (representing the generative adversarial, diffusion, and autoregressive paradigms for HTG) to assess their impact on HTR fine-tuning. We analyze how visual and linguistic characteristics of synthetic data influence fine-tuning outcomes and provide quantitative guidelines for selecting the most effective HTG model. The results of our analysis provide insights into the current capabilities of HTG methods and highlight key areas for further improvement in their application to low-resource HTR.

[110] Stable Diffusion Models are Secretly Good at Visual In-Context Learning

Trevine Oorloff,Vishwanath Sindagi,Wele Gedara Chaminda Bandara,Ali Shafahi,Amin Ghiasi,Charan Prakash,Reza Ardekani

Main category: cs.CV

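TL;DR: The paper shows that off-the-shelf Stable Diffusion models can be used directly for visual in-context learning (V-ICL) through a simple attention re-computation mechanism, with no additional training, and outperform existing methods on several vision tasks.

Details Motivation: Large language models (LLMs) in natural language processing have shown the potential of in-context learning (ICL), but comparable approaches in vision usually need specialized training or extra data, which limits their generalizability. This paper explores whether an off-the-shelf Stable Diffusion model can perform visual in-context learning with a simple modification.

Contribution: The main contribution is an in-place attention re-computation added to the self-attention layers of Stable Diffusion, which enables visual in-context learning without fine-tuning and is validated on six different vision tasks.

Method: Formulates an in-place modification of the self-attention layers in the Stable Diffusion architecture that explicitly incorporates the context between the query and the example prompts.

Result: On the Pascal-5i dataset, the method improves mIoU on foreground segmentation by 8.9% over Visual Prompting and 3.2% over IMProv. Ensembling multiple prompts improves performance further.

Insight: Off-the-shelf generative models such as Stable Diffusion may implicitly possess visual in-context learning ability that a simple change to the attention mechanism can unlock, giving vision tasks an efficient and flexible solution.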

Abstract: Large language models (LLMs) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL): the ability to leverage a few sets of example prompts to adapt to various tasks without having to explicitly update the model weights. ICL has recently been explored for computer vision tasks with promising early outcomes. These approaches involve specialized training and/or additional data that complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this repurposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on the Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task better and further improve the performance.
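
The mIoU figure quoted above is an average of intersection-over-union scores between predicted and ground-truth masks. The sketch below computes it for binary masks stored as nested lists. The per-image averaging and the handling of two empty masks are conventions assumed for this example; benchmark protocols such as the one used for Pascal-5i may aggregate differently.

```python
def iou(pred, target):
    """IoU of two binary masks given as equally shaped nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(preds, targets):
    """Average the per-image IoU over a list of mask pairs."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```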

[111] LIA-X: Interpretable Latent Portrait Animator

Yaohui Wang,Di Yang,Xinyuan Chen,Francois Bremond,Yu Qiao,Antitza Dantcheva

Main category: cs.CV

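TL;DR: LIA-X is a new interpretable portrait animator that uses a sparse motion dictionary for fine-grained control of facial dynamics, supports linear navigation in latent space, and outperforms existing methods.

Details Motivation: Existing portrait animation methods lack interpretability and fine-grained control. LIA-X addresses this by using a sparse motion dictionary to disentangle facial dynamics and make them editable.

Contribution: 1. Proposes a sparse motion dictionary that disentangles facial dynamics into interpretable factors; 2. Supports an "edit-warp-render" strategy that improves fine-grained control; 3. Trains a large-scale model with about 1 billion parameters, demonstrating scalability.

Method: LIA-X uses an autoencoder architecture that performs motion transfer by linearly navigating motion codes in latent space, and introduces a sparse motion dictionary that decomposes facial dynamics into interpretable factors.

Result: Experiments show that LIA-X outperforms existing methods on self-reenactment and cross-reenactment tasks and supports fine-grained user-guided editing and 3D-aware video manipulation.

Insight: Sparsity and interpretability are key to better control in portrait animation, and linear navigation in latent space offers a new way to model complex dynamics.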

Abstract: We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous ‘warp-render’ approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable ‘edit-warp-render’ strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation.
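
Linear navigation in latent space can be pictured as adding a weighted sum of dictionary directions to the source code, with most weights equal to zero. The sketch below shows only that arithmetic. The function name, the toy dictionary, and the way coefficients are supplied are hypothetical and do not reflect LIA-X's actual architecture or how it learns its dictionary.

```python
def navigate(source_code, dictionary, coefficients):
    """Return source_code + sum_i coefficients[i] * dictionary[i].

    `dictionary` is a list of direction vectors; a sparse edit sets most
    coefficients to 0.0 so that only a few factors change the code.
    """
    target = list(source_code)
    for direction, weight in zip(dictionary, coefficients):
        if weight == 0.0:
            continue  # inactive factor: leaves the code unchanged
        for k, component in enumerate(direction):
            target[k] += weight * component
    return target
```

Under this reading, an "edit" step would adjust one or two coefficients (for example a direction associated with head pose) before the warp and render stages consume the resulting code.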

[112] January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis

Amir Hosseinian,Ashkan Dehghani Zahedani,Umer Mansoor,Noosheen Hashemi,Mark Woodward

Main category: cs.CV

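TL;DR: The paper introduces the January Food Benchmark (JFB), a public dataset of 1,000 food images, together with a standardized evaluation framework and baseline results for multimodal food analysis.

Details Motivation: AI research on automated nutritional analysis lacks standardized evaluation methods and high-quality benchmark datasets, which holds back progress.

Contribution: 1. The public JFB dataset; 2. A complete evaluation framework and scoring method; 3. A baseline comparison between a specialized model and general-purpose models.

Method: Builds a dataset with human-validated annotations, designs a composite scoring metric, and compares general-purpose vision-language models with a specialized model.

Result: The specialized model reaches an Overall Score of 86.2, 12.1 points higher than the best general-purpose configuration.

Insight: The specialized model clearly outperforms general-purpose models on this task, which shows the value of task-specific modeling.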

Abstract: Progress in AI for automated nutritional analysis is critically hampered by the lack of standardized evaluation methodologies and high-quality, real-world benchmark datasets. To address this, we introduce three primary contributions. First, we present the January Food Benchmark (JFB), a publicly available collection of 1,000 food images with human-validated annotations. Second, we detail a comprehensive benchmarking framework, including robust metrics and a novel, application-oriented overall score designed to assess model performance holistically. Third, we provide baseline results from both general-purpose Vision-Language Models (VLMs) and our own specialized model, january/food-vision-v1. Our evaluation demonstrates that the specialized model achieves an Overall Score of 86.2, a 12.1-point improvement over the best-performing general-purpose configuration. This work offers the research community a valuable new evaluation dataset and a rigorous framework to guide and benchmark future developments in automated nutritional analysis.

[113] MOC: Meta-Optimized Classifier for Few-Shot Whole Slide Image Classification

Tianqi Xiang,Yi Li,Qixiang Zhang,Xiaomeng Li

Main category: cs.CV

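TL;DR: The paper proposes a Meta-Optimized Classifier (MOC) to address data scarcity in few-shot whole slide image (WSI) classification. By combining a meta-learner with a diverse bank of candidate classifiers, it significantly improves classification performance.

Details Motivation: Existing few-shot methods improve diagnostic accuracy but still rely on conventional classifier designs that are sensitive to data scarcity. The authors propose a more robust solution for clinical settings where data is limited.

Contribution: The main contribution is the MOC framework, which uses a meta-learner to optimize the classifier configuration and a diverse bank of candidate classifiers, significantly improving few-shot WSI classification.

Method: The method has two core components: 1) a meta-learner that automatically optimizes the classifier configuration; 2) a classifier bank that holds diverse candidate classifiers to enable holistic pathological interpretation.

Result: MOC outperforms existing methods on multiple few-shot benchmarks. On TCGA-NSCLC it improves AUC by 10.4%, with gains of up to 26.25% in the 1-shot setting.

Insight: The MOC framework is an important step toward clinical deployment under data scarcity and shows that combining meta-learning with diverse classifiers effectively improves few-shot learning.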

Abstract: Recent advances in histopathology vision-language foundation models (VLFMs) have shown promise in addressing data scarcity for whole slide image (WSI) classification via zero-shot adaptation. However, these methods remain outperformed by conventional multiple instance learning (MIL) approaches trained on large datasets, motivating recent efforts to enhance VLFM-based WSI classification through few-shot learning paradigms. While existing few-shot methods improve diagnostic accuracy with limited annotations, their reliance on conventional classifier designs introduces critical vulnerabilities to data scarcity. To address this problem, we propose a Meta-Optimized Classifier (MOC) comprising two core components: (1) a meta-learner that automatically optimizes a classifier configuration from a mixture of candidate classifiers and (2) a classifier bank housing diverse candidate classifiers to enable a holistic pathological interpretation. Extensive experiments demonstrate that MOC outperforms prior art in multiple few-shot benchmarks. Notably, on the TCGA-NSCLC benchmark, MOC improves AUC by 10.4% over the state-of-the-art few-shot VLFM-based methods, with gains up to 26.25% under 1-shot conditions, offering a critical advancement for clinical deployments where diagnostic training data is severely limited. Code is available at https://github.com/xmed-lab/MOC.
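
One simple way to picture "a classifier configuration from a mixture of candidate classifiers" is a softmax-weighted blend of the candidates' class probabilities, where the weights are what a meta-learner would tune. The sketch below shows only that blending step. The mixing logits, function names, and the softmax weighting are assumptions for illustration; the abstract does not specify how MOC parameterizes or optimizes its mixture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_predict(candidate_probs, mixing_logits):
    """Blend class-probability vectors from several candidate classifiers.

    candidate_probs[c][k] is classifier c's probability for class k;
    mixing_logits[c] is a hypothetical learnable score for classifier c.
    """
    weights = softmax(mixing_logits)
    n_classes = len(candidate_probs[0])
    return [
        sum(w * probs[k] for w, probs in zip(weights, candidate_probs))
        for k in range(n_classes)
    ]
```

With equal logits the result is a plain average of the candidates; raising one logit shifts the prediction toward that classifier, which is the degree of freedom a meta-learner could exploit when only a few labeled slides are available.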

[114] PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image

Geonhee Sim,Gyeongsik Moon

Main category: cs.CV

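TL;DR: PERSONA proposes a framework that combines 3D-based and diffusion-based approaches to create a personalized 3D human avatar with pose-driven deformations from a single image. It uses a diffusion model to generate pose-rich videos and relies on balanced sampling and geometry-weighted optimization to keep renderings high quality.

Details Motivation: Existing 3D-based methods need many pose-rich videos to model pose-driven deformations, while diffusion-based methods struggle with identity preservation and suffer from pose-dependent identity entanglement. PERSONA aims to combine the strengths of both and overcome these limitations.

Contribution: PERSONA's main contributions are: 1) combining 3D-based and diffusion-based approaches to create a personalized 3D avatar from a single image; 2) balanced sampling and geometry-weighted optimization, which improve rendering quality and identity consistency.

Method: The method has two steps: 1) a diffusion model generates pose-rich videos from the input image; 2) a 3D avatar is optimized on the generated videos. Balanced sampling reduces identity shift, and geometry-weighted optimization prioritizes geometry constraints to preserve rendering quality.

Result: From a single input image, PERSONA produces a 3D avatar with pose-driven deformations while maintaining high authenticity and sharp renderings.

Insight: Combining generative models with 3D optimization resolves the tension between data requirements and identity preservation and offers a new approach to personalized 3D modeling.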

Abstract: Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses.
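
The two training tricks named in the abstract can be illustrated with a few lines of plain Python: a batch builder that reserves a fixed share of slots for the real input image, and a loss that weights a geometry term more heavily than an image term. The ratio, the weights, and all function names are hypothetical values chosen for this sketch; the abstract gives no specifics on how PERSONA sets them.

```python
def build_balanced_batch(input_image, generated_frames, batch_size, input_ratio, start=0):
    """Fill a batch with a fixed share of the input image plus generated frames.

    `input_ratio` (a hypothetical knob) is the fraction of slots given to the
    single real input image; the rest cycle through the generated frames.
    """
    n_input = round(batch_size * input_ratio)
    batch = [input_image] * n_input
    for i in range(batch_size - n_input):
        batch.append(generated_frames[(start + i) % len(generated_frames)])
    return batch


def weighted_loss(geometry_loss, image_loss, geometry_weight=10.0, image_weight=1.0):
    """Weighted sum that emphasizes the geometry term (weights are assumed values)."""
    return geometry_weight * geometry_loss + image_weight * image_loss
```

Oversampling the one trustworthy frame keeps the optimizer anchored to the true identity, while the heavier geometry weight limits how much blurry or drifting generated frames can degrade the rendered avatar.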

[115] A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation

Shuting He,Peilin Ji,Yitong Yang,Changshuo Wang,Jiayi Ji,Yinglin Wang,Henghui Ding

Main category: cs.CV

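TL;DR: This survey covers applications of 3D Gaussian Splatting (3DGS) in segmentation, editing, and generation. It examines 3DGS as an efficient alternative to NeRF and reviews its use across a range of downstream tasks.

Details Motivation: 3DGS is an emerging 3D scene representation whose explicit and compact nature enables high-fidelity rendering with real-time performance. The survey summarizes recent progress in 3DGS applications to support further research.

Contribution: 1. A comprehensive overview of 3DGS applications. 2. An introduction to the 2D foundation models that support semantic understanding and control in 3DGS. 3. A categorization of 3DGS applications into segmentation, editing, generation, and other functional tasks, with representative methods and supervision strategies summarized for each.

Method: The survey first introduces 2D foundation models and the NeRF-based methods that inform their 3DGS counterparts, then systematically categorizes and summarizes 3DGS applications by task, covering methods, supervision strategies, and learning paradigms.

Result: Commonly used datasets and evaluation protocols are summarized, recent methods are compared on public benchmarks, and a continually updated repository of resources is maintained.

Insight: The explicit and efficient nature of 3DGS suits a wide range of 3D tasks; future work could further explore its potential for geometric and semantic understanding.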

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, offering high-fidelity photorealistic rendering with real-time performance. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into segmentation, editing, generation, and other functional tasks. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.

[116] LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

Chengtao Lv,Bilang Zhang,Yang Yong,Ruihao Gong,Yushi Huang,Shiqiao Gu,Jiajun Wu,Yumeng Shi,Jinyang Guo,Wenya Wang

Main category: cs.CV

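TL;DR: LLMC+ is a comprehensive benchmark and plug-and-play toolkit for vision-language model (VLM) compression that supports over 20 algorithms. It shows that spatial and temporal redundancy call for different technical strategies and that combining compression methods achieves extreme compression with minimal performance loss.

Details Motivation: VLMs have prohibitive compute and memory demands, and existing compression work has three gaps: techniques are not decomposed into comparable modules, evaluation is limited to simple single-turn tasks, and the joint potential of combined techniques is unexplored.

Contribution: 1. The LLMC+ toolkit, which supports a wide range of compression algorithms. 2. A systematic study of token-level and model-level compression. 3. Findings on how strategies for spatial and temporal redundancy differ and on the advantage of combined compression.

Method: A plug-and-play toolkit supporting over 20 algorithms is used to study token-level and model-level compression systematically across five representative VLM families.

Result: Spatial and temporal redundancies demand distinct strategies; token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks; combining token and model compression achieves extreme compression with minimal performance loss.

Insight: Spatial and temporal redundancy need targeted strategies, and combined compression is a promising direction for efficient VLMs.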

Abstract: Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation is confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Individual compression techniques are used in isolation, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLMs. Our code is available at https://github.com/ModelTC/LightCompress.
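
To make the distinction between token-level and model-level compression concrete, here is a small sketch that applies one technique of each kind. It is not the LLMC+ toolkit API; the function names, the score-based top-k pruning rule, and the symmetric int8 scheme are all assumptions chosen as simple representatives of each family.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Token-level compression: keep the highest-scoring visual tokens,
    preserving their original order. `scores` is a hypothetical per-token
    importance (for example, attention received from text tokens)."""
    keep = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def quantize_int8(weights):
    """Model-level compression: symmetric per-tensor int8 quantization.
    Returns the integer codes and the scale needed to dequantize."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


# Combining both: fewer tokens flow through a model with smaller weights.
kept = prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.4, 0.8], keep_ratio=0.5)
codes, scale = quantize_int8([0.5, -1.0, 0.25])
```

The two steps are independent, which is why they compose: pruning shortens the sequence, while quantization shrinks the parameters that process it.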

q-bio.NC

[117] Perceptual Reality Transformer: Neural Architectures for Simulating Neurological Perception Conditions

Baihan Lin

Main category: q-bio.NC

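TL;DR: The Perceptual Reality Transformer is a framework that uses six neural architectures to simulate eight neurological perception conditions, improving understanding of atypical human perception, with applications in medical education and related areas.

Details Motivation: Neurological conditions that affect perception create large experiential gaps between affected individuals and those around them. A scientifically grounded way to simulate these perceptual states could foster understanding and empathy and support medical education and technology development.

Contribution: 1. The first systematic benchmark for neurological perception simulation. 2. Condition-specific perturbation functions grounded in clinical literature. 3. Quantitative metrics for evaluating simulation fidelity.

Method: The architectures learn mappings from natural images to condition-specific perceptual states; Vision Transformer architectures outperform traditional CNN and generative approaches.

Result: Evaluation on ImageNet and CIFAR-10 confirms the performance advantage of Vision Transformer architectures.

Insight: Neural networks can model atypical human perception effectively, offering a new tool for medicine and assistive technology.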

Abstract: Neurological conditions affecting visual perception create profound experiential divides between affected individuals and their caregivers, families, and medical professionals. We present the Perceptual Reality Transformer, a comprehensive framework employing six distinct neural architectures to simulate eight neurological perception conditions with scientifically-grounded visual transformations. Our system learns mappings from natural images to condition-specific perceptual states, enabling others to experience approximations of simultanagnosia, prosopagnosia, ADHD attention deficits, visual agnosia, depression-related changes, anxiety tunnel vision, and Alzheimer’s memory effects. Through systematic evaluation across ImageNet and CIFAR-10 datasets, we demonstrate that Vision Transformer architectures achieve optimal performance, outperforming traditional CNN and generative approaches. Our work establishes the first systematic benchmark for neurological perception simulation, contributes novel condition-specific perturbation functions grounded in clinical literature, and provides quantitative metrics for evaluating simulation fidelity. The framework has immediate applications in medical education, empathy training, and assistive technology development, while advancing our fundamental understanding of how neural networks can model atypical human perception.

cs.MM

[118] AI Blob! LLM-Driven Recontextualization of Italian Television Archives

Roberto Balestri

Main category: cs.MM

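TL;DR: AI Blob! is a system that uses large language models and semantic technologies to catalog and recontextualize Italian television archives, building narratives automatically through automatic speech recognition and retrieval-augmented generation.

Details Motivation: The work explores how LLMs and semantic technologies can improve the retrieval and reuse of television archives and overcome the limits of traditional static metadata.

Contribution: The AI Blob! system, which combines automatic speech recognition, semantic embeddings, and retrieval-augmented generation to enable dynamic, content-aware archival retrieval and narrative construction.

Method: The system transcribes audio with automatic speech recognition, segments it into sentence-level units, and embeds the segments in a vector database for semantic retrieval. Given a thematic prompt from the user, it generates related queries and recombines the retrieved audiovisual fragments into narrative sequences.

Result: The system processes a curated dataset of 1,547 Italian television videos and demonstrates how semantic technologies can support automated narrative construction and cultural analysis.

Insight: Combining semantic technologies with LLMs gives archival research a new, dynamic form of content retrieval and encourages work at the intersection of media historiography and AI-driven research.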

Abstract: This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing methodological inspiration from Italian television programs such as Blob (RAI Tre, 1989-), AI Blob! integrates automatic speech recognition (ASR), semantic embeddings, and retrieval-augmented generation (RAG) to organize and reinterpret archival content. The system processes a curated dataset of 1,547 Italian television videos by transcribing audio, segmenting it into sentence-level units, and embedding these segments into a vector database for semantic querying. Upon user input of a thematic prompt, the LLM generates a range of linguistically and conceptually related queries, guiding the retrieval and recombination of audiovisual fragments. These fragments are algorithmically selected and structured into narrative sequences producing montages that emulate editorial practices of ironic juxtaposition and thematic coherence. By foregrounding dynamic, content-aware retrieval over static metadata schemas, AI Blob! demonstrates how semantic technologies can facilitate new approaches to archival engagement, enabling novel forms of automated narrative construction and cultural analysis. The project contributes to ongoing debates in media historiography and AI-driven archival research, offering both a conceptual framework and a publicly available dataset to support further interdisciplinary experimentation.
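
The retrieval half of the pipeline described above (sentence-level segmentation, embedding, and query-driven retrieval) can be sketched in a few lines. This is a toy illustration only: the bag-of-words `embed` function stands in for a learned semantic embedding, the in-memory list stands in for a vector database, and the expanded queries are hard-coded where the real system would ask an LLM to generate them.

```python
import math
import re
from collections import Counter


def split_sentences(transcript):
    """Segment an ASR transcript into sentence-level units."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a semantic embedding)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(queries, segments, top_k=2):
    """Score each segment by its best match over all expanded queries."""
    scored = []
    for seg in segments:
        vec = embed(seg)
        scored.append((max(cosine(embed(q), vec) for q in queries), seg))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for score, seg in scored[:top_k] if score > 0]


transcript = ("The election results arrived late. "
              "Rain is expected in Milan tomorrow. "
              "Voters queued for the election all day.")
segments = split_sentences(transcript)
# Hypothetical LLM-expanded queries for the theme "politics".
montage = retrieve(["election results", "voters"], segments)
```

Taking the maximum over the expanded queries lets a segment be selected if it matches any one facet of the theme, which mirrors the abstract's idea of guiding retrieval with a range of related queries.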

cs.CR

[119] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach

Shuang Liang,Zhihao Xu,Jialing Tao,Hui Xue,Xiting Wang

Main category: cs.CR

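TL;DR: LoD is an unsupervised framework that detects jailbreak attacks on large vision-language models using multi-modal safety concept activation vectors and an auto-encoder, achieving accurate and unified detection.

Details Motivation: Despite extensive alignment efforts, large vision-language models (LVLMs) remain vulnerable to jailbreak attacks, and existing detection methods rely on heuristic rules, which leads to suboptimal performance.

Contribution: The LoD framework, which uses unsupervised learning to formulate jailbreak detection as anomaly detection and introduces Multi-modal Safety Concept Activation Vectors (MSCAV) and the Safety Pattern Auto-Encoder (SPAE).

Method: LoD uses MSCAV to extract cross-modal safety features, models the distribution of safe inputs with an auto-encoder, and flags jailbreak inputs as anomalies by their reconstruction error.

Result: Across three LVLMs and five benchmarks, LoD reaches an average AUROC of 0.9951 and improves the minimum AUROC over the strongest baselines by up to 38.89%.

Insight: Combining unsupervised learning with cross-modal safety representations yields a unified and effective solution for jailbreak detection.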

Abstract: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. Although recent detection works have shifted to internal representations due to their rich cross-modal information, most methods rely on heuristic rules rather than principled objectives, resulting in suboptimal performance. To address these limitations, we propose Learning to Detect (LoD), a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD introduces two key components: Multi-modal Safety Concept Activation Vectors (MSCAV), which capture layer-wise safety-related representations across modalities, and the Safety Pattern Auto-Encoder, which models the distribution of MSCAV derived from safe inputs and detects anomalies via reconstruction errors. By training the auto-encoder (AE) solely on safe samples without attack labels, LoD naturally identifies jailbreak inputs as distributional anomalies, enabling accurate and unified detection of jailbreak attacks. Comprehensive experiments on three different LVLMs and five benchmarks demonstrate that LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.
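
The detection principle, training an auto-encoder on safe samples only and flagging inputs with high reconstruction error, can be shown on toy data. The sketch below swaps the paper's Safety Pattern Auto-Encoder for the simplest possible stand-in, a linear auto-encoder with a one-dimensional bottleneck (equivalent to projecting onto the top principal direction), and uses 2-D points in place of MSCAV features. The `margin` threshold rule is also an assumption.

```python
import math


def fit_linear_autoencoder(safe_rows, iters=50):
    """Fit a 1-D linear bottleneck on safe samples via power iteration.
    Assumes the safe samples are not all identical (non-zero variance)."""
    n, d = len(safe_rows), len(safe_rows[0])
    mu = [sum(col) / n for col in zip(*safe_rows)]
    centred = [[x - m for x, m in zip(row, mu)] for row in safe_rows]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # w = (X^T X) v, computed without forming the covariance matrix.
        proj = [sum(a * b for a, b in zip(row, v)) for row in centred]
        w = [sum(p * row[j] for p, row in zip(proj, centred)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mu, v


def reconstruction_error(x, mu, v):
    c = [a - m for a, m in zip(x, mu)]
    code = sum(a * b for a, b in zip(c, v))   # encode to the bottleneck
    recon = [code * b for b in v]             # decode back to input space
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, recon)))


def make_detector(safe_rows, margin=1.5):
    """Threshold = margin * worst reconstruction error seen on safe data."""
    mu, v = fit_linear_autoencoder(safe_rows)
    threshold = margin * max(reconstruction_error(r, mu, v) for r in safe_rows)
    return lambda x: reconstruction_error(x, mu, v) > threshold


safe = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.0)]
is_anomalous = make_detector(safe)
```

No attack labels are used anywhere: a point such as `(2.0, -2.0)` is flagged simply because it lies far from the structure learned from safe samples, while `(1.5, 1.5)` is reconstructed well and passes.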

cs.LG

[120] MoLAN: A Unified Modality-Aware Noise Dynamic Editing Framework for Multimodal Sentiment Analysis

Xingle Xu,Yongkang Liu,Dexian Cai,Shi Feng,Xiaocui Yang,Daling Wang,Yifei Zhang

Main category: cs.LG

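TL;DR: MoLAN is a unified modality-aware noise dynamic editing framework for multimodal sentiment analysis. It suppresses noise at a fine granularity and assigns denoising strength dynamically, preserving critical information.

Details Motivation: Multimodal sentiment analysis often suffers from irrelevant or misleading visual and auditory information. Existing methods usually denoise an entire modality as a single unit, which risks discarding critical information.

Contribution: The MoLAN framework, which achieves fine-grained noise suppression through modality-aware blocking and dynamic assignment of denoising strength, and MoLAN+, a sentiment analysis approach built on it and validated across multiple models and datasets.

Method: The features of each modality are divided into multiple blocks, and each block is dynamically assigned a denoising strength based on its noise level and semantic relevance.

Result: Experiments show that the MoLAN framework is broadly effective and that MoLAN+ achieves state-of-the-art performance on multiple datasets.

Insight: Fine-grained noise handling and dynamic preservation of multimodal information are key to improving multimodal sentiment analysis.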

Abstract: Multimodal Sentiment Analysis aims to integrate information from various modalities, such as audio, visual, and text, to make complementary predictions. However, it often struggles with irrelevant or misleading visual and auditory information. Most existing approaches treat the entire modality information (e.g., a whole image, audio segment, or text paragraph) as an independent unit for feature enhancement or denoising. They often suppress redundant and noisy information at the risk of losing critical information. To address this challenge, we propose MoLAN, a unified ModaLity-aware noise dynAmic editiNg framework. Specifically, MoLAN performs modality-aware blocking by dividing the features of each modality into multiple blocks. Each block is then dynamically assigned a distinct denoising strength based on its noise level and semantic relevance, enabling fine-grained noise suppression while preserving essential multimodal information. Notably, MoLAN is a unified and flexible framework that can be seamlessly integrated into a wide range of multimodal models. Building upon this framework, we further introduce MoLAN+, a new multimodal sentiment analysis approach. Experiments across five models and four datasets demonstrate the broad effectiveness of the MoLAN framework. Extensive evaluations show that MoLAN+ achieves state-of-the-art performance. The code is publicly available at https://github.com/betterfly123/MoLAN-Framework.
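
The block-wise idea can be sketched as follows. The abstract states that each block receives its own denoising strength based on noise level and semantic relevance, but not how the two are combined; the rule `noise * (1 - relevance)` and the multiplicative attenuation below are assumptions made for illustration, and the noise and relevance scores are given as inputs rather than estimated.

```python
def split_blocks(features, block_size):
    """Modality-aware blocking: cut one modality's feature vector into blocks."""
    return [features[i:i + block_size] for i in range(0, len(features), block_size)]


def denoise_strength(noise_level, relevance):
    """Hypothetical rule: suppress strongly only when a block is both noisy
    and semantically irrelevant. Inputs are expected in [0, 1]."""
    return min(1.0, max(0.0, noise_level * (1.0 - relevance)))


def edit_blocks(features, block_size, noise_levels, relevances):
    """Attenuate each block by its own strength instead of one global factor."""
    edited = []
    blocks = split_blocks(features, block_size)
    for block, noise, rel in zip(blocks, noise_levels, relevances):
        keep = 1.0 - denoise_strength(noise, rel)
        edited.extend(x * keep for x in block)
    return edited


# Two blocks with equal noise: the relevant one survives, the irrelevant one is removed.
out = edit_blocks([1.0, 2.0, 3.0, 4.0], 2, noise_levels=[1.0, 1.0], relevances=[1.0, 0.0])
```

Treating the whole vector as one unit would force a single strength on both blocks, which is exactly the loss of critical information the abstract warns about.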

[121] NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Birong Pan,Mayi Xu,Qiankun Pi,Jianhao Chen,Yuanyuan Zhu,Ming Zhong,Tieyun Qian

Main category: cs.LG

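TL;DR: NeuronTune is a fine-grained neuron modulation framework that dynamically adjusts sparse neurons to balance and jointly optimize safety and utility in LLMs.

Details Motivation: Existing methods struggle to balance safety and utility in LLMs: they show insufficient safety, frequently refuse benign queries, and degrade generation quality. The root cause is that their interventions operate at a coarse, layer-wise granularity.

Contribution: NeuronTune, which optimizes safety and utility simultaneously through fine-grained neuron modulation and allows the intervention scope to be tuned flexibly for different scenarios.

Method: Attribution identifies safety-critical and utility-preserving neurons, meta-learning adjusts neuron activations dynamically, and neuron-count thresholds make the intervention scope tunable.

Result: Experiments show that NeuronTune significantly outperforms existing techniques on both safety and utility.

Insight: Fine-grained neuron modulation is an effective direction for balancing safety and utility in LLMs and adapts flexibly to different scenarios.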

Abstract: Ensuring robust safety alignment while preserving utility is critical for the reliable deployment of Large Language Models (LLMs). However, current techniques fundamentally suffer from intertwined deficiencies: insufficient robustness against malicious attacks, frequent refusal of benign queries, and degradation in generated text quality and general task performance; the former two reflect deficits in robust safety, and the latter constitutes utility impairment. We trace these limitations to the coarse-grained layer-wise interventions in existing methods. To resolve this, we propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to achieve simultaneous safety-utility optimization. Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify safety-neuron activations and suppress utility-neuron activations. Crucially, NeuronTune enables tunable adjustment of intervention scope via neuron-count thresholds, supporting flexible adaptation to security-critical or utility-priority scenarios. Extensive experimental results demonstrate that our method significantly outperforms existing state-of-the-art technologies, achieving superior model safety while maintaining excellent utility.
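
A rough sketch of the selection-and-scaling step may help make "tunable intervention scope via neuron-count thresholds" concrete. Everything below is illustrative: the attribution scores are given as inputs, the scaling factors `amplify` and `suppress` are fixed constants where the paper learns them with meta-learning, and letting safety neurons take precedence on overlap is an assumption.

```python
def top_k_neurons(attributions, k):
    """Indices of the k neurons with the largest attribution scores."""
    order = sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)
    return set(order[:k])


def modulate(activations, safety_attr, utility_attr, k, amplify=1.5, suppress=0.5):
    """Scale only the selected sparse neurons; leave all others untouched.

    `k` is the neuron-count threshold: a larger k widens the intervention.
    """
    safety = top_k_neurons(safety_attr, k)
    utility = top_k_neurons(utility_attr, k) - safety  # assumed tie-break
    modulated = []
    for i, a in enumerate(activations):
        if i in safety:
            modulated.append(a * amplify)
        elif i in utility:
            modulated.append(a * suppress)
        else:
            modulated.append(a)
    return modulated


acts = [1.0, 2.0, 3.0, 4.0]
out = modulate(acts, safety_attr=[0.9, 0.1, 0.2, 0.3],
               utility_attr=[0.1, 0.8, 0.2, 0.3], k=1)
```

With `k=0` the function returns the activations unchanged, and raising `k` widens the set of modulated neurons, which is the tuning knob the abstract describes for security-critical versus utility-priority scenarios.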

[122] Masked Training for Robust Arrhythmia Detection from Digitalized Multiple Layout ECG Images

Shanwei Zhang,Deyun Zhang,Yirao Tao,Kexin Wang,Shijia Geng,Jun Li,Qinghao Zhao,Xingpeng Liu,Yuxi Zhou,Shenda Hong

Main category: cs.LG

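TL;DR: PatchECG is a framework that uses a masked training strategy to handle asynchronous lead times and partial blackout loss in multi-layout ECG images, improving the robustness of arrhythmia detection.

Details Motivation: Differences in the ECG layouts used by hospitals cause digitized signals to have asynchronous lead times and partially missing segments. Existing models handle this poorly, so a more adaptable method is needed.

Contribution: The PatchECG framework, which uses masked training to learn representations under an adaptive, variable number of missing blocks and automatically attends to key patches with collaborative dependencies between leads.

Method: A masked training strategy learns representations of key patches in multi-layout ECG images; experiments are run on the PTB-XL dataset and on generated asynchronous ECG images.

Result: AUROC reaches 0.835 on PTB-XL and 0.778 on real ECG data, outperforming classic methods and the ECGFounder model.

Insight: Masked training captures the key information in multi-layout ECGs and makes the model more robust to asynchronous leads and missing data.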

Abstract: The electrocardiogram (ECG) is an important tool for diagnosing cardiovascular diseases such as arrhythmia. Because different hospitals use different ECG layouts, the digitized signals exhibit asynchronous lead times and partial blackout loss, which poses a serious challenge to existing models. To address this challenge, the study introduces PatchECG, a framework for adaptive variable block count missing representation learning based on a masking training strategy, which automatically focuses on key patches with collaborative dependencies between leads, thereby achieving key recognition of arrhythmia in ECGs with different layouts. Experiments were conducted on the PTB-XL dataset and 21388 asynchronous ECG images generated using the ECG image kit tool, using the 23 subclasses as labels. The proposed method demonstrated strong robustness under different layouts, with an average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.835 that remained stable across layout changes. In external validation on 400 real ECG images from Chaoyang Hospital, the AUROC for atrial fibrillation diagnosis reached 0.778; on 12 x 1 layout ECGs, the AUROC reached 0.893. This result is superior to various classic interpolation and baseline methods, and compared to the current optimal large-scale pre-training model ECGFounder, it has improved by 0.111 and 0.19.

[123] SVGen: Interpretable Vector Graphics Generation with Large Language Models

Feiyu Wang,Zhiyuan Zhao,Yuandong Liu,Da Zhang,Junyu Gao,Hao Sun,Xuelong Li

Main category: cs.LG

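TL;DR: SVGen is an end-to-end model built on large language models that generates SVG code from natural language input and outperforms traditional methods and general-purpose large models.

Details Motivation: Turning creative ideas into precise vector graphics is time-consuming and difficult, so an efficient and semantically accurate generation method is needed.

Contribution: The SVG-1M dataset of high-quality SVGs paired with natural language descriptions, and the SVGen model, which uses curriculum learning and reinforcement learning to improve generation.

Method: An end-to-end model is trained on the large-scale dataset, with curriculum learning and reinforcement learning optimization used to ensure semantic accuracy and structural completeness.

Result: Experiments show that SVGen outperforms traditional methods and general-purpose large models in both the effectiveness and the efficiency of SVG code generation.

Insight: SVG-1M and SVGen offer a new approach to semantic generation of vector graphics and show the potential of language-guided graphics generation.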

Abstract: Scalable Vector Graphics (SVG) is widely used in front-end development and UI/UX design due to its scalability, editability, and rendering efficiency. However, turning creative ideas into precise vector graphics remains a time-consuming challenge. To address this, we introduce SVG-1M, a large-scale dataset of high-quality SVGs paired with natural language descriptions. Through advanced data augmentation and annotation, we create well-aligned Text to SVG training pairs, including a subset with Chain of Thought annotations for enhanced semantic guidance. Based on this dataset, we propose SVGen, an end-to-end model that generates SVG code from natural language inputs. Our approach ensures semantic accuracy and structural completeness, supported by curriculum learning and reinforcement learning optimization. Experiments show that SVGen outperforms general large models and traditional rendering methods in both effectiveness and efficiency. Code, model, and dataset are available on GitHub.

[124] Multimodal RAG Enhanced Visual Description

Amit Kumar Jaiswal,Haiming Liu,Ingo Frommholz

Main category: cs.LG

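TL;DR: The paper proposes a lightweight, training-free method that uses retrieval-augmented generation (RAG) with a cross-modal mapping to address the modality gap in pre-trained large multimodal models (LMMs).

Details Motivation: Pre-trained LMMs suffer from a modality gap, a misalignment between textual and visual representations in the shared embedding space. Fine-tuning can reduce the gap but is expensive and needs large amounts of domain data, so a training-free or low-cost method is needed.

Contribution: A training-free method that uses RAG and a linear mapping to bridge modalities and refines the mapping with generated synthetic descriptions, significantly improving textual descriptions of multimodal inputs.

Method: 1. RAG and a linear mapping provide cross-modal alignment. 2. Synthetic descriptions are generated to refine the mapping. 3. At inference, the nearest textual descriptions are retrieved and combined with an instruction to generate a new description.

Result: Experiments on two multimodal benchmark datasets show significant improvements.

Insight: The modality gap can be reduced without fine-tuning, giving a low-cost and efficient route to cross-modal alignment.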

Abstract: Textual descriptions for multimodal inputs entail recurrent refinement of queries to produce relevant output images. Despite efforts to address challenges such as scaling model size and data volume, the cost associated with pre-training and fine-tuning remains substantial. However, pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations within a common embedding space. Although fine-tuning can potentially mitigate this gap, it is typically expensive and impractical due to the requirement for extensive domain-driven data. To overcome this challenge, we propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality using a linear mapping, which can be computed efficiently. During inference, this mapping is applied to images embedded by an LMM, enabling retrieval of the closest textual descriptions from the training set. These textual descriptions, in conjunction with an instruction, serve as an input prompt for the language model to generate new textual descriptions. In addition, we introduce an iterative technique for distilling the mapping by generating synthetic descriptions via the language model, facilitating optimisation for commonly used image description metrics. Experimental results on two benchmark multimodal datasets demonstrate significant improvements.
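
The inference path described in the abstract (map the image embedding linearly, retrieve the closest training descriptions, build a prompt) is short enough to sketch directly. The matrix `W`, the 2-D embeddings, and the caption bank below are toy values invented for illustration; in the paper the embeddings come from an LMM and the mapping is computed from training data.

```python
import math


def apply_linear_map(W, x):
    """Map an image embedding into the text embedding space: y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_captions(image_emb, W, caption_bank, top_k=2):
    """Return the top_k captions whose text embeddings are closest to W x."""
    mapped = apply_linear_map(W, image_emb)
    ranked = sorted(caption_bank, key=lambda item: cosine(mapped, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(instruction, captions):
    """Combine the instruction with retrieved captions for the language model."""
    lines = [instruction, "Similar descriptions:"] + [f"- {c}" for c in captions]
    return "\n".join(lines)


W = [[0.0, 1.0], [1.0, 0.0]]  # toy mapping that swaps the two axes
bank = [("a dog on grass", [0.0, 1.0]),
        ("a city at night", [1.0, 0.0]),
        ("a puppy in a park", [0.2, 0.9])]
captions = retrieve_captions([1.0, 0.0], W, bank)
prompt = build_prompt("Describe the image in one sentence.", captions)
```

Because the only learned component is the matrix `W`, nothing in the LMM or the language model needs to be fine-tuned, which is the source of the method's low cost.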

[125] Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models

Luca Eyring,Shyamgopal Karthik,Alexey Dosovitskiy,Nataniel Ruiz,Zeynep Akata

Main category: cs.LG

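TL;DR: The paper proposes Noise Hypernetworks, which replace test-time noise optimization in diffusion models and thereby reduce computational overhead.

Details Motivation: Test-time scaling improves model performance but adds substantial computational overhead. The paper aims to keep its benefits while reducing the cost at inference.

Contribution: Noise Hypernetworks, which integrate test-time scaling knowledge into the model during post-training and significantly reduce the test-time compute cost of diffusion models.

Method: A Noise Hypernetwork replaces reward-guided test-time noise optimization, supported by a theoretical framework that learns the reward-tilted distribution to optimize noise generation.

Result: Experiments show that the method recovers a substantial portion of the quality gains of test-time optimization at a much lower computational cost.

Insight: A noise-space objective can improve generation quality while preserving fidelity to the base model.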

Abstract: The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e.g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost. Code is available at https://github.com/ExplainableML/HyperNoise
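
The abstract refers to a "reward-tilted distribution" without writing it out. A common form in KL-regularized reward optimization, assumed here rather than taken from the paper, is p*(x) ∝ p(x) · exp(r(x) / β), where p is the base distribution, r the reward, and β a temperature. The sketch below applies that tilt to a small discrete distribution; the paper works in the continuous noise space of a diffusion model, so this only illustrates the target, not the hypernetwork that learns it.

```python
import math


def reward_tilt(base_probs, rewards, beta=1.0):
    """Exponentially tilt a discrete base distribution toward high reward.

    Assumed form: p*(x) is proportional to p(x) * exp(r(x) / beta).
    Small beta follows the reward aggressively; large beta stays close
    to the base distribution.
    """
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.5, 0.5]
tilted = reward_tilt(base, rewards=[1.0, 0.0], beta=1.0)
gentle = reward_tilt(base, rewards=[1.0, 0.0], beta=100.0)
```

The temperature makes the trade-off explicit: the tilted distribution favors high-reward outcomes but never assigns mass where the base model assigns none, which matches the abstract's goal of optimizing for desired characteristics while maintaining fidelity to the base model.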

eess.IV

[126] Generative Artificial Intelligence in Medical Imaging: Foundations, Progress, and Clinical Translation

Xuanru Zhou,Cheng Li,Shuqiang Wang,Ye Li,Tao Tan,Hairong Zheng,Shanshan Wang

Main category: eess.IV

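TL;DR: This review covers generative AI in medical imaging, including progress in generative adversarial networks, variational autoencoders, diffusion models, and multimodal foundation architectures. It proposes a three-tiered evaluation framework and discusses the challenges and prospects of clinical translation.

Details Motivation: Medical imaging faces data scarcity and difficulties with standardization and integration across modalities. Generative AI offers data synthesis, image enhancement, and related capabilities, opening new ways to improve clinical imaging workflows.

Contribution: 1. A systematic summary of recent advances in generative AI (such as GANs, VAEs, and diffusion models) for medical imaging. 2. A three-tiered evaluation framework covering pixel-level fidelity, feature-level realism, and task-level clinical relevance. 3. A discussion of key obstacles to clinical translation, such as generalization under domain shift and privacy concerns.

Method: Through literature review and systematic analysis, the paper assesses how generative models are applied across the imaging workflow (for example, reconstruction and cross-modality synthesis) and proposes an evaluation framework.

Result: Generative AI shows promise in medical imaging, but challenges such as generalization under domain shift and hallucination remain. Convergence with large-scale foundation models may advance clinical use.

Insight: Combining generative AI with multimodal foundation models is a likely future direction, but interdisciplinary collaboration is needed to resolve technical, ethical, and regulatory issues.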

Abstract: Generative artificial intelligence (AI) is rapidly transforming medical imaging by enabling capabilities such as data synthesis, image enhancement, modality translation, and spatiotemporal modeling. This review presents a comprehensive and forward-looking synthesis of recent advances in generative modeling including generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models, and emerging multimodal foundation architectures and evaluates their expanding roles across the clinical imaging continuum. We systematically examine how generative AI contributes to key stages of the imaging workflow, from acquisition and reconstruction to cross-modality synthesis, diagnostic support, and treatment planning. Emphasis is placed on both retrospective and prospective clinical scenarios, where generative models help address longstanding challenges such as data scarcity, standardization, and integration across modalities. To promote rigorous benchmarking and translational readiness, we propose a three-tiered evaluation framework encompassing pixel-level fidelity, feature-level realism, and task-level clinical relevance. We also identify critical obstacles to real-world deployment, including generalization under domain shift, hallucination risk, data privacy concerns, and regulatory hurdles. Finally, we explore the convergence of generative AI with large-scale foundation models, highlighting how this synergy may enable the next generation of scalable, reliable, and clinically integrated imaging systems. By charting technical progress and translational pathways, this review aims to guide future research and foster interdisciplinary collaboration at the intersection of AI, medicine, and biomedical engineering.

[127] HiFi-Mamba: Dual-Stream W-Laplacian Enhanced Mamba for High-Fidelity MRI Reconstruction

Hongli Chen,Pengcheng Fang,Yuxia Chen,Yingxuan Ren,Jing Hao,Fangfang Tang,Xiaohao Cai,Shanshan Shan,Feng Liu

Main category: eess.IV

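TL;DR: HiFi-Mamba is a dual-stream Mamba architecture for high-fidelity MRI reconstruction. Its W-Laplacian and HiFi-Mamba blocks address the insensitivity of existing methods to high-frequency details and their redundant scanning, significantly improving reconstruction accuracy.

Details Motivation: In MRI reconstruction, existing Mamba variants are insensitive to high-frequency anatomical details and rely on redundant multi-directional scanning, which limits reconstruction quality.

Contribution: The HiFi-Mamba architecture, which combines a W-Laplacian block (fidelity-preserving spectral decoupling) with a HiFi-Mamba block (selective integration of high-frequency features) and adopts a unidirectional traversal strategy for better computational efficiency.

Method: A dual-stream architecture (W-Laplacian and HiFi-Mamba blocks), spectral decoupling, adaptive state-space modulation, and a unidirectional traversal strategy.

Result: On standard MRI reconstruction benchmarks, HiFi-Mamba outperforms CNN-based, Transformer-based, and other Mamba-based models while keeping a compact and efficient design.

Insight: Processing frequency bands separately and integrating features selectively improves the reconstruction of high-frequency details; unidirectional traversal preserves long-range modeling capability while improving computational efficiency.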

Abstract: Reconstructing high-fidelity MR images from undersampled k-space data remains a challenging problem in MRI. While Mamba variants for vision tasks offer promising long-range modeling capabilities with linear-time complexity, their direct application to MRI reconstruction inherits two key limitations: (1) insensitivity to high-frequency anatomical details; and (2) reliance on redundant multi-directional scanning. To address these limitations, we introduce High-Fidelity Mamba (HiFi-Mamba), a novel dual-stream Mamba-based architecture comprising stacked W-Laplacian (WL) and HiFi-Mamba blocks. Specifically, the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams. This separation enables the HiFi-Mamba block to focus on low-frequency structures, enhancing global feature modeling. Concurrently, the HiFi-Mamba block selectively integrates high-frequency features through adaptive state-space modulation, preserving comprehensive spectral details. To eliminate the scanning redundancy, the HiFi-Mamba block adopts a streamlined unidirectional traversal strategy that preserves long-range modeling capability with improved computational efficiency. Extensive experiments on standard MRI reconstruction benchmarks demonstrate that HiFi-Mamba consistently outperforms state-of-the-art CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while maintaining a compact and efficient model design.
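
The notion of "fidelity-preserving spectral decoupling" into complementary low- and high-frequency streams can be illustrated with a very simple stand-in. The sketch below is not the W-Laplacian block: it uses a 1-D moving average as the low-pass filter and defines the high-frequency stream as the residual, in the spirit of a Laplacian pyramid. The property it demonstrates is that the two streams sum back to the input exactly, so no information is lost by the split.

```python
def low_pass(signal, radius=1):
    """Moving average with windows truncated at the edges (assumed filter)."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def split_frequencies(signal, radius=1):
    """Return complementary (low, high) streams with low + high == signal."""
    low = low_pass(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


signal = [0.0, 0.0, 4.0, 0.0, 0.0, 1.0, 1.0, 1.0]
low, high = split_frequencies(signal)
reconstructed = [l + h for l, h in zip(low, high)]
```

Sharp features such as the isolated spike end up mostly in `high`, while flat regions contribute almost nothing to it, which is why a downstream block can treat the two streams differently.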

[128] MedPatch: Confidence-Guided Multi-Stage Fusion for Multimodal Clinical Data

Baraa Al Jorf,Farah Shamout

Main category: eess.IV

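TL;DR: MedPatch is a multi-stage multimodal fusion architecture that uses confidence-guided patching to handle heterogeneity and missing modalities in clinical data, reaching state-of-the-art performance on in-hospital mortality prediction and clinical condition classification.

Details Motivation: Clinical data is heterogeneous, limited in size, and often missing modalities, so existing methods perform poorly on multimodal fusion. Inspired by clinical workflows, the paper aims to improve performance with a multi-stage fusion strategy.

Contribution: 1) A multi-stage fusion strategy that uses joint and late fusion simultaneously. 2) A missingness-aware module that handles sparse samples with missing modalities. 3) A joint fusion module that clusters latent token patches based on calibrated unimodal token-level confidence.

Method: MedPatch consists of the multi-stage fusion strategy, the missingness-aware module, and the joint fusion module, and integrates multimodal data through confidence-guided patching.

Result: On in-hospital mortality prediction and clinical condition classification with MIMIC data, MedPatch outperforms existing baselines and achieves state-of-the-art performance.

Insight: Confidence-guided multi-stage fusion copes well with heterogeneity and missing modalities in clinical data and offers a new approach to multimodal medical data modeling.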

Abstract: Clinical decision-making relies on the integration of information across various data modalities, such as clinical time-series, medical images and textual reports. Compared to other domains, real-world medical data is heterogeneous in nature, limited in size, and sparse due to missing modalities. This significantly limits model performance in clinical prediction tasks. Inspired by clinical workflows, we introduce MedPatch, a multi-stage multimodal fusion architecture, which seamlessly integrates multiple modalities via confidence-guided patching. MedPatch comprises three main components: (i) a multi-stage fusion strategy that leverages joint and late fusion simultaneously, (ii) a missingness-aware module that handles sparse samples with missing modalities, (iii) a joint fusion module that clusters latent token patches based on calibrated unimodal token-level confidence. We evaluated MedPatch using real-world data consisting of clinical time-series data, chest X-ray images, radiology reports, and discharge notes extracted from the MIMIC-IV, MIMIC-CXR, and MIMIC-Notes datasets on two benchmark tasks, namely in-hospital mortality prediction and clinical condition classification. Compared to existing baselines, MedPatch achieves state-of-the-art performance. Our work highlights the effectiveness of confidence-guided multi-stage fusion in addressing the heterogeneity of multimodal data, and establishes new state-of-the-art benchmark results for clinical prediction tasks.
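
Two of the components listed above lend themselves to a small sketch: grouping tokens by confidence, and fusing unimodal predictions when some modalities are missing. The abstract does not specify the clustering rule or the fusion formula, so the fixed threshold, the two-group split, and the plain average below are assumptions, and the function names are hypothetical.

```python
def split_by_confidence(tokens, confidences, threshold=0.5):
    """Group latent tokens into high- and low-confidence patches.

    `confidences` are assumed to be calibrated per-token confidences from a
    unimodal model; the threshold value is illustrative.
    """
    high = [t for t, c in zip(tokens, confidences) if c >= threshold]
    low = [t for t, c in zip(tokens, confidences) if c < threshold]
    return high, low


def late_fusion(predictions):
    """Average unimodal predictions, skipping missing modalities (None)."""
    available = [p for p in predictions.values() if p is not None]
    if not available:
        raise ValueError("at least one modality must be present")
    return sum(available) / len(available)


high, low = split_by_confidence(["t0", "t1", "t2"], [0.9, 0.2, 0.6])
risk = late_fusion({"time_series": 0.8, "chest_xray": None, "notes": 0.4})
```

Skipping absent modalities rather than imputing them keeps a sample usable even when, for example, no chest X-ray was taken, which is the situation the missingness-aware design targets.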

[129] impuTMAE: Multi-modal Transformer with Masked Pre-training for Missing Modalities Imputation in Cancer Survival Prediction

Maria Boyko,Aleksandra Beliaeva,Dmitriy Kornilov,Alexander Bernstein,Maxim Sharaev

Main category: eess.IV

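TL;DR: impuTMAE is a Transformer-based multimodal pre-training method that learns cross-modal interactions through masked reconstruction and imputes missing modalities, significantly improving glioma survival prediction.

Details Motivation: Medical data is complex and often missing modalities, which makes multimodal models hard to train. A method is needed that imputes missing modalities effectively and improves predictive performance.

Contribution: impuTMAE, an end-to-end Transformer architecture that imputes missing modalities through pre-training and learns efficiently from multimodal data such as genetic, imaging, and clinical data.

Method: A masked pre-training strategy learns intra- and inter-modal interactions while reconstructing masked data to impute missing modalities. The model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction.

Result: On the TCGA-GBM/LGG and BraTS datasets, impuTMAE surpasses prior methods and achieves state-of-the-art performance in glioma patient survival prediction.

Insight: Imputing missing modalities during pre-training improves robustness and makes better use of multimodal data, offering a new approach for medical prognosis tasks.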

Abstract: The use of diverse modalities, such as omics, medical images, and clinical data, can not only improve the performance of prognostic models but also deepen the understanding of disease mechanisms and facilitate the development of novel treatment approaches. However, medical data are complex, often incomplete, and frequently missing entire modalities, so handling missing data effectively is crucial for training multimodal models. We introduce impuTMAE, a novel transformer-based end-to-end approach with an efficient multimodal pre-training strategy. It learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches. Our model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction using TCGA-GBM/LGG and BraTS datasets, integrating five modalities: genetic (DNAm, RNA-seq), imaging (MRI, WSI), and clinical data. By addressing missing data during pre-training and enabling efficient resource utilization, impuTMAE surpasses prior multimodal approaches, achieving state-of-the-art performance in glioma patient survival prediction. Our code is available at https://github.com/maryjis/mtcp

[130] Zero-shot self-supervised learning of single breath-hold magnetic resonance cholangiopancreatography (MRCP) reconstruction

Jinho Kim,Marcel Dominik Nickel,Florian Knoll

Main category: eess.IV

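TL;DR: The paper investigates whether zero-shot self-supervised learning can shorten breath-hold times in magnetic resonance cholangiopancreatography (MRCP) and uses shallow training to make reconstruction much more efficient.

Details Motivation: Long breath-holds are a clinical bottleneck in MRCP, and conventional methods such as parallel imaging and compressed sensing reconstruction have limitations, so a more efficient reconstruction method is needed.

Contribution: A zero-shot self-supervised learning method for MRCP reconstruction, together with a shallow training strategy that greatly shortens training time.

Method: An incoherent k-space sampling pattern is combined with shallow training that leverages a pretrained network to reduce backpropagation depth.

Result: Zero-shot learning significantly improves image quality, approaching that of respiratory-triggered acquisitions, and training time drops from 271 minutes to 11 minutes.

Insight: Zero-shot learning and shallow training offer a practical route to time-constrained clinical use and could make MRCP examinations more widely accessible.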

Abstract: Purpose: To investigate the feasibility of applying zero-shot self-supervised learning reconstruction to reduce breath-hold times in magnetic resonance cholangiopancreatography (MRCP). Methods: Breath-hold MRCP was acquired from 11 healthy volunteers on a 3T scanner using an incoherent k-space sampling pattern leading to a breath-hold duration of 14s. We evaluated zero-shot reconstruction of breath-hold MRCP against parallel imaging of respiratory-triggered MRCP acquired in 338s on average and compressed sensing reconstruction of breath-hold MRCP. To address the long computation times of zero-shot training, we used a training approach that leverages a pretrained network to reduce backpropagation depth during training. Results: Zero-shot learning reconstruction significantly improved visual image quality compared to compressed sensing reconstruction, particularly in terms of signal-to-noise ratio and ductal delineation, and reached a level of quality comparable to that of successful respiratory-triggered acquisitions with regular breathing patterns. Shallow training provided nearly equivalent reconstruction performance with a training time of 11 minutes in comparison to 271 minutes for a conventional zero-shot training. Conclusion: Zero-shot learning delivers high-fidelity MRCP reconstructions with reduced breath-hold times, and shallow training offers a practical solution for translation to time-constrained clinical workflows.
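
The timings reported in the abstract imply two reduction factors that are easy to check. The short calculation below uses only the four numbers stated above (14 s, 338 s, 11 min, 271 min); the variable names are ours.

```python
def reduction_factor(before, after):
    """How many times shorter `after` is than `before`."""
    return before / after


# Acquisition: respiratory-triggered (338 s on average) vs. breath-hold (14 s).
acquisition_factor = reduction_factor(338, 14)   # about 24.1x shorter
acquisition_saved_s = 338 - 14                   # 324 seconds saved

# Training: conventional zero-shot (271 min) vs. shallow training (11 min).
training_factor = reduction_factor(271, 11)      # about 24.6x shorter
training_saved_min = 271 - 11                    # 260 minutes saved
```

Both the acquisition and the per-scan training are therefore shortened by roughly a factor of 24, and the 11-minute training time is what brings the zero-shot approach within reach of a time-constrained clinical workflow.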

[131] From Explainable to Explained AI: Ideas for Falsifying and Quantifying Explanations

Yoni Schirris,Eric Marcus,Jonas Teuwen,Hugo Horlings,Efstratios Gavves

Main category: eess.IV

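TL;DR: The paper aims to move from explainable AI (XAI) to "explained AI", proposing a human-machine-VLM (vision-language model) interaction system for testing and quantifying explanations of classifiers in computational pathology.

Details Motivation: Clinical use of deep learning models for medical image analysis requires trustworthy explanations. Existing techniques such as GradCAM can identify influential features but do not themselves form an explanation. The paper aims to close this gap so that reliance on spurious features can be detected and clinically meaningful features confirmed.

Contribution: (1) An interaction system combining an AI-integrated slide viewer with vision-language models for testing the claims of an explanation; (2) a method for quantifying an explanation's predictiveness, which can distinguish competing explanations.

Method: (1) An AI-integrated slide viewer runs sliding-window experiments to test the claims of an explanation; (2) general-purpose vision-language models quantify the explanation's predictiveness. The system covers multi-instance learning for whole-slide images.

Result: The approach allows claims of explanations to be tested qualitatively and competing explanations to be distinguished quantitatively, offering a practical path to "explained AI" in digital pathology and beyond.

Insight: The paper emphasizes the shift from "explainable" to "explained": testing whether an explanation holds and quantifying its value are the key steps. The approach may improve the trustworthiness and clinical applicability of models in medical image analysis.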

Abstract: Explaining deep learning models is essential for clinical integration of medical image analysis systems. A good explanation highlights if a model depends on spurious features that undermine generalization and harm a subset of patients or, conversely, may present novel biological insights. Although techniques like GradCAM can identify influential features, they are measurement tools that do not themselves form an explanation. We propose a human-machine-VLM interaction system tailored to explaining classifiers in computational pathology, including multi-instance learning for whole-slide images. Our proof of concept comprises (1) an AI-integrated slide viewer to run sliding-window experiments to test claims of an explanation, and (2) quantification of an explanation’s predictiveness using general-purpose vision-language models. The results demonstrate that this allows us to qualitatively test claims of explanations and can quantifiably distinguish competing explanations. This offers a practical path from explainable AI to explained AI in digital pathology and beyond. Code and prompts are available at https://github.com/nki-ai/x2x.

[132] AMRG: Extend Vision Language Models for Automatic Mammography Report Generation

Nak-Jun Sung,Donghyun Lee,Bo Hwa Choi,Chae Jung Park

Main category: eess.IV

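TL;DR: The paper proposes AMRG, a framework that uses large vision-language models (VLMs) to generate mammography reports automatically, adapting the model through parameter-efficient fine-tuning (PEFT) and achieving strong results on a public dataset.

Details Motivation: Mammography report generation is an important but underexplored task in medical AI, with challenges including multiview image reasoning, high-resolution visual cues, and unstructured radiologic language.

Contribution: Proposes AMRG, the first end-to-end framework for the task, built on MedGemma-4B-it with parameter-efficient fine-tuning; it addresses a gap in multimodal clinical AI and establishes a reproducible benchmark.

Method: Parameter-efficient fine-tuning via Low-Rank Adaptation (LoRA), with a systematic exploration of LoRA hyperparameter configurations and comparative experiments across multiple VLM backbones.

Result: Strong performance on both language generation and clinical metrics: ROUGE-L 0.5691, METEOR 0.6152, CIDEr 0.5818, and BI-RADS accuracy 0.5582.

Insight: AMRG offers a scalable and adaptable approach to radiology report generation and lays a foundation for future research in multimodal medical AI.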

Abstract: Mammography report generation is a critical yet underexplored task in medical AI, characterized by challenges such as multiview image reasoning, high-resolution visual cues, and unstructured radiologic language. In this work, we introduce AMRG (Automatic Mammography Report Generation), the first end-to-end framework for generating narrative mammography reports using large vision-language models (VLMs). Building upon MedGemma-4B-it, a domain-specialized, instruction-tuned VLM, we employ a parameter-efficient fine-tuning (PEFT) strategy via Low-Rank Adaptation (LoRA), enabling lightweight adaptation with minimal computational overhead. We train and evaluate AMRG on DMID, a publicly available dataset of paired high-resolution mammograms and diagnostic reports. This work establishes the first reproducible benchmark for mammography report generation, addressing a longstanding gap in multimodal clinical AI. We systematically explore LoRA hyperparameter configurations and conduct comparative experiments across multiple VLM backbones, including both domain-specific and general-purpose models under a unified tuning protocol. Our framework demonstrates strong performance across both language generation and clinical metrics, achieving a ROUGE-L score of 0.5691, METEOR of 0.6152, CIDEr of 0.5818, and BI-RADS accuracy of 0.5582. Qualitative analysis further highlights improved diagnostic consistency and reduced hallucinations. AMRG offers a scalable and adaptable foundation for radiology report generation and paves the way for future research in multimodal medical AI.
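
For readers unfamiliar with LoRA, the low-rank update mentioned in this entry can be sketched in a few lines. This is a generic, minimal illustration of the standard formulation W + (alpha / r) * B A, not the AMRG training code; the matrix sizes, rank, and alpha below are hypothetical values chosen only for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, rank):
    """Return the effective weight W + (alpha / rank) * B @ A."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical sizes for illustration only (not taken from the paper).
d_out, d_in, rank, alpha = 6, 8, 2, 4.0
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero-initialised

W_eff = lora_weight(W, A, B, alpha, rank)  # equals W before any training step
full_params = d_out * d_in                 # parameters touched by full fine-tuning
lora_params = rank * (d_in + d_out)        # parameters trained by LoRA
print(full_params, lora_params)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small factors are updated during fine-tuning.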

[133] Dynamic Survival Prediction using Longitudinal Images based on Transformer

Bingfan Liu,Haolun Shi,Jiguo Cao

Main category: eess.IV

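TL;DR: The paper proposes SurLonFormer, a Transformer-based model for dynamic survival prediction from longitudinal medical images and structured data, addressing existing methods' inadequate use of censored data, neglect of temporal correlations, and lack of interpretability.

Details Motivation: Current methods make inadequate use of censored data, overlook temporal correlations among longitudinal images, and lack interpretability, so a model that integrates multimodal data effectively is needed.

Contribution: Proposes SurLonFormer, a Transformer-based neural network that integrates longitudinal images with structured data and improves both predictive performance and interpretability.

Method: The model has three key components: a Vision Encoder (extracts spatial features), a Sequence Encoder (aggregates temporal information), and a Survival Encoder (based on the Cox proportional hazards model). Interpretability comes from occlusion sensitivity analysis and dynamic survival prediction.

Result: Simulations and a real-world Alzheimer's disease analysis show that SurLonFormer achieves superior predictive performance and identifies disease-related imaging biomarkers.

Insight: Modeling longitudinal images across multiple time points is important for survival prediction; a Transformer can capture the spatial and temporal dependencies while also supporting interpretability.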

Abstract: Survival analysis utilizing multiple longitudinal medical images plays a pivotal role in the early detection and prognosis of diseases by providing insight beyond single-image evaluations. However, current methodologies often inadequately utilize censored data, overlook correlations among longitudinal images measured over multiple time points, and lack interpretability. We introduce SurLonFormer, a novel Transformer-based neural network that integrates longitudinal medical imaging with structured data for survival prediction. Our architecture comprises three key components: a Vision Encoder for extracting spatial features, a Sequence Encoder for aggregating temporal information, and a Survival Encoder based on the Cox proportional hazards model. This framework effectively incorporates censored data, addresses scalability issues, and enhances interpretability through occlusion sensitivity analysis and dynamic survival prediction. Extensive simulations and a real-world application in Alzheimer’s disease analysis demonstrate that SurLonFormer achieves superior predictive performance and successfully identifies disease-related imaging biomarkers.
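
The Survival Encoder is described as being based on the Cox proportional hazards model. As a rough illustration of how such a model uses censored data, the sketch below computes the standard negative log partial likelihood in plain Python. It assumes no tied event times, and the function name and toy inputs are illustrative; it is not SurLonFormer's actual loss code.

```python
import math

def cox_neg_log_partial_likelihood(times, events, risks):
    """Negative log partial likelihood, assuming no tied event times.

    times  : observed follow-up time per subject
    events : 1 if the event was observed, 0 if the subject was censored
    risks  : model output (log-hazard ratio) per subject
    """
    loss = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set = [r for t, r in zip(times, risks) if t >= t_i]
        loss -= risks[i] - math.log(sum(math.exp(r) for r in risk_set))
    return loss

# Toy data (hypothetical): the second subject is censored at t = 2.
times, events = [1.0, 2.0, 3.0], [1, 0, 1]
uninformative = cox_neg_log_partial_likelihood(times, events, [0.0, 0.0, 0.0])
informative = cox_neg_log_partial_likelihood(times, events, [1.0, 0.0, 0.0])
print(uninformative, informative)
```

The censored subject never adds its own term to the loss, but it still appears in the risk set of the earlier event, which is how the partial likelihood makes use of incomplete follow-up.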

[134] T-CACE: A Time-Conditioned Autoregressive Contrast Enhancement Multi-Task Framework for Contrast-Free Liver MRI Synthesis, Segmentation, and Diagnosis

Xiaojiao Xiao,Jianfeng Zhao,Qinmin Vivian Hu,Guanghui Wang

Main category: eess.IV

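TL;DR: The paper proposes T-CACE, a framework that synthesizes multi-phase contrast-enhanced MRI from non-contrast MRI, addressing the contrast-agent risks and limited annotated data of traditional MRI. It unifies anatomical priors with temporal phase information and uses a dynamic time-aware attention mask to improve synthesis quality and diagnostic reliability.

Details Motivation: Traditional MRI for liver cancer diagnosis faces contrast-agent risks, time-consuming manual assessment, and limited annotated data, so an efficient and safe alternative is needed.

Contribution: 1. A conditional token encoding (CTE) mechanism that unifies anatomical priors and temporal phase information; 2. a dynamic time-aware attention mask (DTAM) that modulates inter-phase information flow; 3. a temporal classification consistency (TCC) constraint that improves diagnostic reliability.

Method: CTE encodes anatomical and phase information into latent representations, DTAM adaptively modulates information flow between phases with Gaussian-decayed attention, and TCC aligns the lesion classification output with the evolution of the physiological signal.

Result: On two independent liver MRI datasets, T-CACE outperforms state-of-the-art methods in image synthesis, segmentation, and lesion classification.

Insight: T-CACE offers a clinically relevant alternative to contrast-enhanced imaging that improves diagnostic safety and efficiency.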

Abstract: Magnetic resonance imaging (MRI) is a leading modality for the diagnosis of liver cancer, significantly improving lesion classification and patient outcomes. However, traditional MRI faces challenges including risks from contrast agent (CA) administration, time-consuming manual assessment, and limited annotated datasets. To address these limitations, we propose a Time-Conditioned Autoregressive Contrast Enhancement (T-CACE) framework for synthesizing multi-phase contrast-enhanced MRI (CEMRI) directly from non-contrast MRI (NCMRI). T-CACE introduces three core innovations: a conditional token encoding (CTE) mechanism that unifies anatomical priors and temporal phase information into latent representations; a dynamic time-aware attention mask (DTAM) that adaptively modulates inter-phase information flow using a Gaussian-decayed attention mechanism, ensuring smooth and physiologically plausible transitions across phases; and a temporal classification consistency (TCC) constraint that aligns the lesion classification output with the evolution of the physiological signal, further enhancing diagnostic reliability. Extensive experiments on two independent liver MRI datasets demonstrate that T-CACE outperforms state-of-the-art methods in image synthesis, segmentation, and lesion classification. This framework offers a clinically relevant and efficient alternative to traditional contrast-enhanced imaging, improving safety, diagnostic efficiency, and reliability for the assessment of liver lesions. The implementation of T-CACE is publicly available at: https://github.com/xiaojiao929/T-CACE.
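
The abstract does not give the exact form of the Gaussian-decayed attention mask, so the following is only one plausible reading, sketched under stated assumptions: attention between phases i and j is scaled by exp(-(i - j)^2 / (2 sigma^2)), later phases are hidden from earlier ones (autoregressive), and sigma is a fixed hypothetical parameter, whereas the paper describes an adaptive mechanism. The function names are illustrative and not taken from the T-CACE repository.

```python
import math

def gaussian_phase_mask(num_phases, sigma=1.0, causal=True):
    """Build a phase-by-phase mask that decays with temporal distance."""
    mask = []
    for i in range(num_phases):
        row = []
        for j in range(num_phases):
            if causal and j > i:
                row.append(0.0)  # a phase cannot attend to later phases
            else:
                row.append(math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2)))
        mask.append(row)
    return mask

def apply_mask(attn_row, mask_row):
    """Reweight one row of attention weights by the mask and renormalise."""
    weighted = [a * m for a, m in zip(attn_row, mask_row)]
    total = sum(weighted)
    return [w / total for w in weighted]

mask = gaussian_phase_mask(4, sigma=1.0)
# Uniform attention from the last phase, reweighted toward nearby phases.
last_phase = apply_mask([0.25] * 4, mask[3])
print([round(w, 3) for w in last_phase])
```

Under these assumptions, neighbouring phases exchange more information than distant ones, which matches the stated goal of smooth transitions across phases.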

cs.RO

[135] DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation

Haoxiang Shi,Xiang Deng,Zaijing Li,Gongwei Chen,Yaowei Wang,Liqiang Nie

Main category: cs.RO

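TL;DR: The paper proposes DAgger Diffusion Navigation (DifNav), an end-to-end method that unifies the traditional two stages of waypoint generation and planning into a single diffusion policy, addressing the global sub-optimization and performance bottleneck of existing methods. DAgger training adds robustness and the ability to recover from errors.

Details Motivation: Existing approaches to Vision-Language Navigation in Continuous Environments (VLN-CE) use a two-stage framework that suffers from global sub-optimization and a strong reliance on the quality of predicted waypoints. The authors propose an end-to-end solution to overcome these limitations.

Contribution: Proposes DifNav, which brings a diffusion policy to VLN-CE and unifies waypoint generation and planning; introduces DAgger training to improve robustness and error recovery; and reports performance above existing methods.

Method: A conditional diffusion policy directly models multi-modal distributions over future actions in continuous navigation space; DAgger provides online policy training and expert trajectory augmentation, and the aggregated data is used to further fine-tune the policy.

Result: Experiments on benchmark datasets show that DifNav, without a waypoint predictor, substantially outperforms previous state-of-the-art two-stage waypoint-based models.

Insight: Diffusion policies suit multi-modal action modeling, and end-to-end training optimizes the overall objective more directly; DAgger improves robustness, particularly in long-horizon navigation tasks.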

Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework, where a high-level waypoint predictor generates the navigable waypoints, and then a navigation planner suggests the intermediate goals in the high-level action space. However, this two-stage decomposition framework suffers from: (1) global sub-optimization due to the proxy objective in each stage, and (2) a performance bottleneck caused by the strong reliance on the quality of the first-stage predicted waypoints. To address these limitations, we propose DAgger Diffusion Navigation (DifNav), an end-to-end optimized VLN-CE policy that unifies the traditional two stages, i.e. waypoint generation and planning, into a single diffusion policy. Notably, DifNav employs a conditional diffusion policy to directly model multi-modal action distributions over future actions in continuous navigation space, eliminating the need for a waypoint predictor while enabling the agent to capture multiple possible instruction-following behaviors. To address the issues of compounding error in imitation learning and enhance spatial reasoning in long-horizon navigation tasks, we employ DAgger for online policy training and expert trajectory augmentation, and use the aggregated data to further fine-tune the policy. This approach significantly improves the policy’s robustness and its ability to recover from error states. Extensive experiments on benchmark datasets demonstrate that, even without a waypoint predictor, the proposed method substantially outperforms previous state-of-the-art two-stage waypoint-based models in terms of navigation performance. Our code is available at: https://github.com/Tokishx/DifNav.
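
To make the DAgger component concrete, here is a toy sketch of the generic data-aggregation loop: roll out the learner's own policy, relabel the visited states with expert actions, and retrain on the aggregated data. Everything in it is a stand-in chosen for illustration (a 1-D line world, a hand-written expert, a tabular policy fitted by majority vote); it is not DifNav's diffusion policy or its training code.

```python
from collections import Counter, defaultdict

GOAL = 0  # hypothetical goal position on a 1-D line from -5 to 5

def expert_action(state):
    """Hypothetical expert: step toward the goal position."""
    if state == GOAL:
        return 0
    return -1 if state > GOAL else 1

def fit_policy(dataset):
    """Fit a tabular policy by majority vote over (state, action) pairs."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    # Unseen states fall back to a fixed default action, which may be wrong.
    return lambda state: table.get(state, -1)

def rollout(policy, start, horizon=10):
    """Run a policy from a start state and return the states it visits."""
    state, visited = start, []
    for _ in range(horizon):
        visited.append(state)
        state = max(-5, min(5, state + policy(state)))
    return visited

def dagger(start_states, iterations=5):
    # Iteration 0: behaviour cloning on a single expert demonstration.
    dataset = [(s, expert_action(s)) for s in rollout(expert_action, start_states[0])]
    policy = fit_policy(dataset)
    for _ in range(iterations):
        for start in start_states:
            # Visit states under the learner, relabel them with expert actions.
            dataset += [(s, expert_action(s)) for s in rollout(policy, start)]
        policy = fit_policy(dataset)  # retrain on the aggregated data
    return policy

cloned = dagger([3, -3], iterations=0)   # behaviour cloning only
refined = dagger([3, -3], iterations=5)  # with DAgger aggregation
print(cloned(-3), refined(-3))
```

In this toy setting the cloned policy walks away from the goal when started at -3, a state the demonstration never covered, while the DAgger-trained policy has collected expert labels for exactly those error states and recovers.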

astro-ph.IM

[136] Robustness analysis of Deep Sky Objects detection models on HPC

Olivier Parisot,Diogo Ramalho Fernandes

Main category: astro-ph.IM

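TL;DR: The paper uses High-Performance Computing (HPC) to train and compare detection models such as YOLO and RET-DETR, with a focus on testing the robustness of automated Deep Sky Object detection.

Details Motivation: Astronomical surveys and growing numbers of amateur astronomers are producing ever more sky images, which calls for automated processing; faint signals and complex backgrounds make Deep Sky Object detection challenging.

Contribution: Trains and compares different detection models (YOLO, RET-DETR) on smart telescope images, using HPC to parallelise computations, and tests their robustness.

Method: YOLO and RET-DETR models are trained and tested in parallel on HPC, with the analysis focused on their robustness in Deep Sky Object detection.

Result: The paper reports performance and robustness test results for the different detection models, supporting automated processing of astronomical images.

Insight: HPC acceleration combined with modern computer vision and deep learning models can make Deep Sky Object detection more efficient and accurate, particularly for large-scale data processing.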

Abstract: Astronomical surveys and the growing involvement of amateur astronomers are producing more sky images than ever before, and this calls for automated processing methods that are accurate and robust. Detecting Deep Sky Objects – such as galaxies, nebulae, and star clusters – remains challenging because of their faint signals and complex backgrounds. Advances in Computer Vision and Deep Learning now make it possible to improve and automate this process. In this paper, we present the training and comparison of different detection models (YOLO, RET-DETR) on smart telescope images, using High-Performance Computing (HPC) to parallelise computations, in particular for robustness testing.

cs.IR

[137] Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Marco De Nadai,Andreas Damianou,Mounia Lalmas

Main category: cs.IR

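TL;DR: The paper proposes using a Multimodal Large Language Model (MLLM) to generate rich natural-language descriptions of videos, strengthening the semantic understanding of video recommender systems.

Details Motivation: Existing video recommender systems rely mainly on low-level visual and acoustic features or on user-defined metadata, and miss deeper semantics such as intent and humour that are critical to personalised recommendation.

Contribution: A zero-finetuning framework that prompts an MLLM to produce detailed descriptions of each video, injecting high-level semantics into the recommendation pipeline and improving recommendation quality.

Method: An off-the-shelf MLLM generates a natural-language description of each clip; the description is encoded with a state-of-the-art text encoder and fed into standard recommendation models.

Result: On the MicroLens-100K dataset, the framework consistently surpasses conventional video, audio, and metadata features across five representative models, supporting the use of MLLMs as knowledge extractors.

Insight: MLLMs can extract deeper semantic information from videos on the fly, giving recommenders content descriptions that sit closer to user intent and thereby improving personalised recommendation.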

Abstract: Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system-agnostic zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. “a superhero parody with slapstick fights and orchestral stabs”), bridging the gap between raw content and user intent. We use MLLM output with a state-of-the-art text encoder and feed it into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features in five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to build more intent-aware video recommenders.