Table of Contents

cs.CL

[1] GMTRouter: Personalized LLM Router over Multi-turn User Interactions

Encheng Xie,Yihang Sun,Tao Feng,Jiaxuan You

Main category: cs.CL

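TL;DR: GMTRouter is a personalized LLM routing method that represents multi-turn user interactions as a heterogeneous graph and uses lightweight inductive graph learning to capture user preferences, consistently outperforming strong baselines.

Motivation: Existing LLM routing methods are not fully personalized and adapt poorly to evolving user preferences. User preference data is also scarce, noisy, and inconsistent in format, which limits their effectiveness.

Contribution: Proposes GMTRouter, which models multi-turn interactions as a heterogeneous graph and uses a tailored message-passing mechanism to capture user preferences for efficient personalized routing.

Method: Users, LLMs, queries, and responses are represented as nodes of a heterogeneous graph, and preferences are learned from few-shot data within a lightweight inductive graph learning framework.

Result: Across multiple datasets, accuracy improves by 0.9 to 21.6 percent and AUC by 0.006 to 0.309 over the baselines. The method also adapts to new users and evolving preferences.

Insight: The effectiveness of heterogeneous graphs and message passing suggests a new direction for personalized LLM routing, with a particular advantage when data is scarce.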

Abstract: Large Language Model (LLM) routing has demonstrated strong capability in balancing response quality with computational cost. As users exhibit diverse preferences, personalization has attracted increasing attention in LLM routing, since even identical queries may require different models to generate responses tailored to individual needs. However, existing approaches are not fully personalized and often fail to capture the complex interactions between specific users and LLMs. Moreover, user preference data is typically scarce, noisy, and inconsistent in format, which limits the effectiveness of methods that rely solely on user-specific data. To address these challenges, we propose GMTRouter, which represents multi-turn user-LLM interactions as a heterogeneous graph with four node types: user, LLM, query, and response, thereby preserving the rich relational structure of the interaction. Through a tailored message-passing mechanism, GMTRouter learns to capture user preferences from few-shot data within a lightweight inductive graph learning framework, enabling effective personalization. Extensive experiments demonstrate that GMTRouter consistently outperforms strong baselines, achieving 0.9 to 21.6 percent higher accuracy and 0.006 to 0.309 higher AUC across multiple datasets. More importantly, we demonstrate that GMTRouter can adapt to new users and evolving preferences using only few-shot data, without extensive fine-tuning. The code for GMTRouter is publicly available at https://github.com/ulab-uiuc/GMTRouter.
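As a rough illustration of the data structure described above, the following minimal sketch stores typed nodes (user, LLM, query, response) and typed edges, then averages neighbor features along one relation. The class name `HeteroGraph`, the relation name `asked`, and the mean aggregation are assumptions made for this example; the abstract does not specify GMTRouter's tailored message-passing mechanism, and this is not the authors' implementation.

```python
from collections import defaultdict


class HeteroGraph:
    """Toy heterogeneous graph: nodes are (type, id) pairs, edges carry a relation name."""

    def __init__(self):
        self.features = {}               # (node_type, node_id) -> feature vector
        self.edges = defaultdict(list)   # (source_node, relation) -> list of target nodes

    def add_node(self, node_type, node_id, feature):
        self.features[(node_type, node_id)] = feature

    def add_edge(self, source, relation, target):
        self.edges[(source, relation)].append(target)

    def aggregate(self, node, relation):
        """Mean of neighbor features along one relation (a generic choice, assumed here)."""
        neighbors = self.edges.get((node, relation), [])
        if not neighbors:
            return None
        dim = len(self.features[neighbors[0]])
        return [
            sum(self.features[n][i] for n in neighbors) / len(neighbors)
            for i in range(dim)
        ]


graph = HeteroGraph()
graph.add_node("user", "u1", [0.0, 0.0])
graph.add_node("query", "q1", [1.0, 0.0])
graph.add_node("query", "q2", [0.0, 1.0])
graph.add_edge(("user", "u1"), "asked", ("query", "q1"))
graph.add_edge(("user", "u1"), "asked", ("query", "q2"))

# One aggregation step summarizes what this user has asked so far.
user_context = graph.aggregate(("user", "u1"), "asked")
```

Keeping the relation name in the edge key lets each relation type be aggregated separately, which is the usual starting point for message passing on heterogeneous graphs.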

Abha Jha,Abel Salinas,Fred Morstatter

Main category: cs.CL

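TL;DR: This paper studies how well large language models (LLMs) understand the law and where they are vulnerable, focusing on their interpretation of Title 18 Section 175 of the US Code, which governs biological weapons. LLMs show potential for legal analysis but also risk generating unsafe outputs. The authors propose a methodology combining knowledge graphs with Retrieval-Augmented Generation (RAG) to evaluate LLMs' legal understanding and potential unsafe behavior. The experiments reveal significant limitations in LLMs' legal reasoning and safety mechanisms and point to directions for improvement.

Motivation: As LLMs are increasingly applied in sensitive legal domains, their ability to understand and comply with the law, and the risks they pose, have become important questions. The paper aims to expose weaknesses in LLMs' legal interpretation and to propose improvements that reduce the risk of legal violations.

Contribution: 1) A systematic evaluation methodology combining knowledge graph construction with RAG; 2) evidence of limitations in LLMs' legal reasoning and safety mechanisms; 3) future directions based on enhanced safety protocols and more robust legal reasoning frameworks.

Method: The authors combine knowledge graph construction with RAG and run structured experiments to evaluate LLMs' legal understanding and potential for unsafe behavior. The experiments cover identifying legal violations, generating prohibited instructions, and detecting unlawful intent.

Result: The experiments show significant shortcomings in LLMs' legal reasoning and safety mechanisms, for example the possibility of generating instructional content that violates the law. The findings also indicate directions for future improvement.

Insight: Although LLMs show potential for legal analysis, their safety and legal compliance still need to be improved through technical means, such as enhanced safety protocols and better reasoning frameworks, before they can be used ethically in sensitive domains.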

Abstract: The rise of Large Language Models (LLMs) offers transformative potential for interpreting complex legal frameworks, such as Title 18 Section 175 of the US Code, which governs biological weapons. These systems hold promise for advancing legal analysis and compliance monitoring in sensitive domains. However, this capability comes with a troubling contradiction: while LLMs can analyze and interpret laws, they also demonstrate alarming vulnerabilities in generating unsafe outputs, such as actionable steps for bioweapon creation, despite their safeguards. To address this challenge, we propose a methodology that integrates knowledge graph construction with Retrieval-Augmented Generation (RAG) to systematically evaluate LLMs’ understanding of this law, their capacity to assess legal intent (mens rea), and their potential for unsafe applications. Through structured experiments, we assess their accuracy in identifying legal violations, generating prohibited instructions, and detecting unlawful intent in bioweapons-related scenarios. Our findings reveal significant limitations in LLMs’ reasoning and safety mechanisms, but they also point the way forward. By combining enhanced safety protocols with more robust legal reasoning frameworks, this research lays the groundwork for developing LLMs that can ethically and securely assist in sensitive legal domains - ensuring they act as protectors of the law rather than inadvertent enablers of its violation.

[3] Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning

Joongho Kim,Xirui Huang,Zarreen Reza,Gabriel Grand,Kevin Zhu,Ryan Lagasse

Main category: cs.CL

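TL;DR: SSDP is a semantic-similarity-based dynamic pruning method for Tree-of-Thought (ToT) reasoning. It clusters and prunes redundant reasoning paths in real time, substantially reducing computation while maintaining competitive accuracy.

Motivation: ToT reasoning improves the problem-solving ability of LLMs, but semantic redundancy makes it computationally expensive, so a more efficient way to run the search is needed.

Contribution: Proposes SSDP, which the authors describe as the first framework to integrate online semantic merging into parallelized tree search, enabling real-time clustering and pruning of redundant paths.

Method: SSDP dynamically clusters and prunes redundant reasoning steps by semantic similarity, reducing the number of explored nodes by 85-90% while maintaining competitive accuracy.

Result: On benchmarks including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup while accuracy typically stays within 5% of the strongest baseline.

Insight: SSDP shows that dynamic pruning can reduce the computational cost of LLM reasoning without a large impact on accuracy, offering a scalable approach to efficient inference.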

Abstract: Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), a lightweight method that, to the best of our knowledge, is the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%, demonstrating a practical approach to efficient, scalable LLM reasoning. The implementation of SSDP is publicly available at https://github.com/kimjoonghokim/SSDP.
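The idea of merging semantically equivalent branches can be sketched as follows. This is a minimal illustration, assuming each candidate step already has an embedding and a score; the greedy keep-the-best-representative rule, the `threshold` value, and the toy vectors are assumptions for this example and are not taken from the SSDP implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_redundant(candidates, threshold=0.9):
    """Keep one representative per group of near-duplicate steps.

    candidates: list of (text, embedding, score) tuples.
    A candidate is dropped when it is at least `threshold` similar to a
    higher-scoring candidate that has already been kept.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all(cosine(cand[1], k[1]) < threshold for k in kept):
            kept.append(cand)
    return kept


# Hypothetical frontier of reasoning steps with hand-made embeddings.
frontier = [
    ("add 3 and 4 to get 7", [1.0, 0.0, 0.1], 0.8),
    ("3 + 4 = 7", [0.98, 0.05, 0.1], 0.9),
    ("multiply 3 by 4", [0.0, 1.0, 0.0], 0.5),
]
survivors = prune_redundant(frontier)
```

Here the two phrasings of the same addition step collapse into the higher-scoring one, while the distinct multiplication step survives, so only two of the three branches would be expanded further.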

[4] What About the Scene with the Hitler Reference? HAUNT: A Framework to Probe LLMs’ Self-consistency Via Adversarial Nudge

Arka Dutta,Sujan Dutta,Rijul Magu,Soumyajit Datta,Munmun De Choudhury,Ashiqur R. KhudaBukhsh

Main category: cs.CL

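TL;DR: This paper proposes the HAUNT framework, which uses adversarial nudges to test the self-consistency of large language models (LLMs) within closed domains, and reveals that LLMs differ widely in their resilience to such nudges.

Motivation: Hallucination is a critical challenge for deploying LLMs in high-stakes domains, so a systematic method is needed to evaluate their robustness to adversarial nudges.

Contribution: Proposes HAUNT, which tests LLM self-consistency by having the model generate and verify true and false statements, and reports how resilient different LLMs are.

Method: HAUNT has three steps: 1) the LLM generates truths and lies within a closed domain; 2) the LLM verifies these statements; 3) the LLM is tested for robustness against the lies it generated itself.

Result: Claude shows strong resilience, GPT and Grok moderate resilience, and Gemini and DeepSeek weak resilience.

Insight: Because LLMs are widely used for information seeking, their vulnerability to adversarial nudges is a serious concern that calls for further improvement.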

Abstract: Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudge. Our framework consists of three steps. In the first step, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. In the next step, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. In the final step, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: Claude exhibits strong resilience, GPT and Grok demonstrate moderate resilience, while Gemini and DeepSeek show weak resilience. Considering that a large population is increasingly using LLMs for information seeking, our findings raise alarm.

[5] Self-HarmLLM: Can Large Language Model Harm Itself?

Heehwan Kim,Sungjune Park,Daeseon Choi

Main category: cs.CL

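TL;DR: This paper examines whether a large language model (LLM) can attack itself through a Mitigated Harmful Query (MHQ) that it generated (the Self-HarmLLM scenario). Experiments show notable attack success rates under zero-shot and few-shot conditions and reveal that automated evaluation judges harmfulness inaccurately.

Motivation: Existing defenses assume that harmful queries come from an external attacker and overlook the possibility that a model's own output becomes the attack vector. The paper tests whether an LLM can be attacked through an MHQ it generated itself.

Contribution: Introduces the Self-HarmLLM scenario, in which a model-generated MHQ is fed back into the same model, and verifies that such self-attacks are feasible. It also exposes limitations of automated evaluation and of conventional defenses.

Method: MHQs serve as the attack vector. Attack success rates are measured on several mainstream models (such as GPT-3.5-turbo) under zero-shot and few-shot conditions, and human evaluation is compared with automated evaluation.

Result: In the zero-shot condition, the transformation success rate reaches up to 52% and the jailbreak success rate up to 33%; in the few-shot condition, the rates reach up to 65% and 41%. Automated evaluation consistently overestimated jailbreak success, with an average difference of 52%.

Insight: Guardrail design needs to be reconsidered to handle self-attacks. Relying on automated evaluation alone can lead to misjudgment, so a more robust evaluation methodology that includes human evaluation is needed.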

Abstract: Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model’s own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.

[6] Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study

Yilan Liu

Main category: cs.CL

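TL;DR: This paper presents a system that combines retrieval-augmented generation (RAG) with curated domain knowledge bases to generate pediatric speech-language pathology (SLP) clinical vignettes, demonstrating technical feasibility.

Motivation: Clinical vignettes are essential teaching tools in SLP, but creating them by hand is time-intensive. General-purpose large language models (LLMs) lack domain knowledge, so their output is inaccurate and requires extensive expert revision.

Contribution: A prototype multi-model RAG system that integrates curated domain knowledge bases and supports both commercial and open-source LLMs for generating SLP cases.

Method: A multi-model RAG system with engineered prompt templates and five different LLMs generates cases covering diverse disorder types and grade levels.

Result: Commercial models show a marginal quality advantage, but open-source models perform acceptably, which suggests potential for privacy-preserving institutional deployment. Integrating the knowledge bases keeps generated content aligned with professional guidelines.

Insight: Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training, but further expert validation and student testing are required first.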

Abstract: Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped integrating curated domain knowledge with engineered prompt templates, supporting five commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and open-source (Llama 3.2, Qwen 2.5-7B) LLMs. Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.

Azmine Toushik Wasi,Wahid Faisal,Mst Rafia Islam

Main category: cs.CL

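TL;DR: Mina is a multilingual LLM-based legal assistant designed for Bangladesh's low-income population, addressing complex legal language, opaque procedures, and high costs.

Motivation: Low-income people in Bangladesh struggle to obtain affordable legal advice because of complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation.

Contribution: Develops Mina, a multilingual LLM legal assistant that provides retrieval, reasoning, translation, and document generation, and delivers legal drafts, citations, and plain-language explanations through an interactive chat interface.

Method: Uses multilingual embeddings and a RAG-based chain-of-tools framework to support retrieval, reasoning, translation, and document generation.

Result: On the Preliminary MCQ, Written, and simulated Viva Voce stages of the Bangladesh Bar Council Exams, Mina scored 75-80%, matching or surpassing average human performance and demonstrating contextual understanding and sound legal reasoning.

Insight: Mina shows the potential of building domain-specific AI systems in multilingual, low-resource settings and offers a case study for sustainable public-service AI deployment.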

Abstract: Bangladesh’s low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.

[8] Structured Uncertainty guided Clarification for LLM Agents

Manan Suri,Puneet Mathur,Nedim Lipka,Franck Dernoncourt,Ryan A. Rossi,Dinesh Manocha

Main category: cs.CL

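TL;DR: This paper proposes SAGE-Agent, a POMDP-based approach built on structured uncertainty that selects which clarification questions an LLM agent should ask during tool calling, substantially improving task coverage and interaction efficiency.

Motivation: LLM agents handling ambiguous user instructions often make incorrect tool calls and fail tasks. Existing uncertainty-based methods lack structured modeling and do not optimize for question cost.

Contribution: 1. A POMDP framework over structured uncertainty with an EVPI objective; 2. the ClarifyBench benchmark; 3. evidence that structured uncertainty is an effective training signal for reinforcement learning.

Method: Uncertainty over tool-call parameters is modeled as a POMDP, EVPI is used to select the best clarification question, and an aspect-based cost model prevents redundant questions.

Result: On ClarifyBench, SAGE-Agent increases coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7x. In the reinforcement learning experiments, When2Call accuracy rises substantially (from 36.5% to 65.2% for the 3B model).

Insight: Structured uncertainty not only improves the interaction efficiency of LLM agents but also provides a high-quality training signal for reinforcement learning.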

Abstract: LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7x compared to strong prompting and uncertainty-based baselines. We present ClarifyBench, the first multi-turn tool-augmented disambiguation benchmark with realistic LLM-based user simulation across diverse domains including document editing, vehicle control, and travel booking. Additionally, we demonstrate that structured uncertainty provides effective training signals for reinforcement learning, boosting When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model) through uncertainty-weighted GRPO training. These results establish structured uncertainty as a principled, efficient approach for tool-augmented agents, improving both task success and interaction efficiency in real-world scenarios.
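For readers unfamiliar with EVPI, the textbook definition is the expected utility of acting after the uncertainty is resolved minus the expected utility of the best action taken now. The sketch below computes it for a toy tool-argument ambiguity; the scenario, the utilities, and the `question_cost` threshold are hypothetical, and the code does not reproduce SAGE-Agent's POMDP formulation or its aspect-based cost model.

```python
def evpi(belief, utility):
    """Expected Value of Perfect Information.

    belief:  {state: probability}
    utility: {action: {state: value}}
    """
    # Best expected utility if the agent must act under the current belief.
    best_without_info = max(
        sum(belief[s] * values[s] for s in belief) for values in utility.values()
    )
    # Expected utility if the true state were revealed before acting.
    best_with_info = sum(
        p * max(values[s] for values in utility.values()) for s, p in belief.items()
    )
    return best_with_info - best_without_info


# Hypothetical ambiguity: "book a flight to Springfield" could mean two cities.
belief = {"Springfield_IL": 0.6, "Springfield_MO": 0.4}
utility = {
    "book_IL": {"Springfield_IL": 1.0, "Springfield_MO": 0.0},
    "book_MO": {"Springfield_IL": 0.0, "Springfield_MO": 1.0},
}

value_of_asking = evpi(belief, utility)
question_cost = 0.1  # assumed cost of interrupting the user
should_ask = value_of_asking > question_cost
```

With these numbers, acting immediately yields an expected utility of 0.6, knowing the answer yields 1.0, and the difference of 0.4 exceeds the assumed question cost, so asking is worthwhile.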

[9] Toward Automated Cognitive Assessment in Parkinson’s Disease Using Pretrained Language Models

Varada Khanna,Nilay Bhatt,Ikgyu Shin,Sule Tinaz,Yang Ren,Hua Xu,Vipina K. Keloth

Main category: cs.CL

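TL;DR: This paper studies how pretrained language models can automate cognitive assessment in Parkinson's disease. Comparing several model families, it finds that a fine-tuned Meta-Llama-3-8B-Instruct model performs best at extracting complex cognitive categories.

Motivation: Cognitive and emotional changes in people with Parkinson's disease can be understood through their accounts of daily life, but these subtle, overlapping cognitive constructs are hard to extract from unstructured narratives.

Contribution: Develops and evaluates three NLP model families (Bio_ClinicalBERT, fine-tuned Meta-Llama-3-8B-Instruct, and GPT-4o mini) for automatically identifying seven categories in patient narratives, and shows the advantage of the fine-tuned model on complex categories.

Method: Bio_ClinicalBERT is used for span categorization with nested entity recognition, Meta-Llama-3-8B-Instruct is fine-tuned with QLoRA, and GPT-4o mini is evaluated in zero-shot and few-shot settings.

Result: The fine-tuned Meta-Llama-3-8B-Instruct model achieves the highest overall F1-scores (0.74 micro-average and 0.59 macro-average) and performs especially well on context-dependent categories such as thought and social interaction.

Insight: The abstract nature of cognitive narratives makes the task harder than conventional information extraction, but with continued refinement NLP models could become a useful tool for monitoring cognitive function in Parkinson's disease and complement formal neuropsychological assessment.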

Abstract: Understanding how individuals with Parkinson’s disease (PD) describe cognitive experiences in their daily lives can offer valuable insights into disease-related cognitive and emotional changes. However, extracting such information from unstructured patient narratives is challenging due to the subtle, overlapping nature of cognitive constructs. This study developed and evaluated natural language processing (NLP) models to automatically identify categories that reflect various cognitive processes from de-identified first-person narratives. Three model families, a Bio_ClinicalBERT-based span categorization model for nested entity recognition, a fine-tuned Meta-Llama-3-8B-Instruct model using QLoRA for instruction following, and GPT-4o mini evaluated under zero- and few-shot settings, were compared on their performance in extracting seven categories. Our findings indicated that model performance varied substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the highest overall F1-scores (0.74 micro-average and 0.59 macro-average), particularly excelling in context-dependent categories such as thought and social interaction. Bio_ClinicalBERT exhibited high precision but low recall and performed comparably to Llama for some category types such as location and time but failed on other categories such as thought, emotion and social interaction. Compared to conventional information extraction tasks, this task presents a greater challenge due to the abstract and overlapping nature of narrative accounts of complex cognitive processes. Nonetheless, with continued refinement, these NLP systems hold promise for enabling low-burden, longitudinal monitoring of cognitive function and serving as a valuable complement to formal neuropsychological assessments in PD.
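The gap between the micro-averaged and macro-averaged F1-scores reported above is typical when some categories are frequent and easy while others are rare and hard. The sketch below shows how the two averages are computed; the category names are borrowed from the abstract, but the counts are invented for illustration and are not the study's data.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: {category: (tp, fp, fn)} -> (micro_f1, macro_f1)."""
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(
        sum(c[0] for c in counts.values()),
        sum(c[1] for c in counts.values()),
        sum(c[2] for c in counts.values()),
    )
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Invented counts: one frequent, easy category and one rare, hard category.
counts = {
    "location": (90, 10, 10),
    "thought": (5, 10, 15),
}
micro, macro = micro_macro_f1(counts)
```

Because the micro average is dominated by the frequent category, it stays high, while the macro average is pulled down by the weak score on the rare category.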

cs.CV

[10] Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants

I. Bailo,F. Buonora,G. Ciarfaglia,L. T. Consoli,A. Evangelista,M. Gabusi,M. Ghiani,C. Petracca Ciavarella,F. Picariello,F. Sarcina,F. Tuosto,V. Zullo,L. Airoldi,G. Bruno,D. D. Gobbo,S. Pezzenati,G. A. Tona

Main category: cs.CV

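TL;DR: The paper presents a Transformer-based solution for automating the digitization of gas plants, combining OCR, Vision LLM, and related techniques to extract data and reconstruct plant topology with high accuracy.

Motivation: Digitizing a gas plant is complex and the documentation lacks standardization, so manual approaches are inefficient. The work aims to automate data extraction and structure reconstruction with AI techniques, including a Transformer architecture, to improve efficiency and accuracy.

Contribution: 1. A new Transformer architecture for deeper analysis of the complex relations between plant components; 2. a pipeline combining OCR, Vision LLM, and other techniques that reaches 91% accuracy on textual data extraction, identifies 93% of components, and extracts the hierarchical structure with about 80% accuracy.

Method: 1. OCR and Vision LLM extract textual information; 2. Object Detection, Relational Reasoning, and optimization algorithms are combined; 3. a Scene Graph Generation model is extended with a new Transformer architecture to analyze complex relations.

Result: Textual data extraction reaches 91% accuracy, 93% of components are correctly identified, and the hierarchical structure is extracted with about 80% accuracy.

Insight: The Transformer architecture handles complex relations well, and combining several techniques helps overcome the variety of data caused by the lack of standardization, providing an efficient approach to industrial digitization.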

Abstract: The energy transition is a key theme of the last decades to determine a future of eco-sustainability, and an area of such importance cannot disregard digitization, innovation and the new technological tools available. This is the context in which the Generative Artificial Intelligence models described in this paper are positioned, developed by Engineering Ingegneria Informatica SpA in order to automate the plant structures acquisition of SNAM energy infrastructure, a leading gas transportation company in Italy and Europe. The digitization of a gas plant consists in registering all its relevant information through the interpretation of the related documentation. The aim of this work is therefore to design an effective solution based on Artificial Intelligence techniques to automate the extraction of the information necessary for the digitization of a plant, in order to streamline the daily work of MGM users. The solution receives the P&IDs of the plant as input, each in PDF format, and uses OCR, Vision LLM, Object Detection, Relational Reasoning and optimization algorithms to return an output consisting of two sets of information: a structured overview of the relevant design data and the hierarchical framework of the plant. To achieve convincing results, we extend a state-of-the-art model for Scene Graph Generation, introducing a brand new Transformer architecture with the aim of deepening the analysis of the complex relations between the plant’s components. The synergistic use of the listed AI-based technologies made it possible to overcome many obstacles arising from the high variety of data, due to the lack of standardization. An accuracy of 91% has been achieved in the extraction of textual information relating to design data. Regarding the plant topology, 93% of components are correctly identified and the hierarchical structure is extracted with an accuracy around 80%.

[11] Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

Dogucan Yaman,Fevziye Irem Eyiokur,Hazım Kemal Ekenel,Alexander Waibel

Main category: cs.CV

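TL;DR: This paper proposes a systematic evaluation methodology to analyze and quantify lip leakage in audio-driven talking face generation, and provides a more reliable benchmark through complementary test setups and derived metrics.

Motivation: Existing audio-driven talking face generation methods can introduce lip leakage, where the generated lips are influenced by the reference image rather than solely by the driving audio. Standard metrics and conventional test setups struggle to detect this problem.

Contribution: 1. A systematic evaluation framework with three complementary test setups; 2. derived metrics based on lip-sync discrepancy and silent audio; 3. a study of how different identity reference selections affect leakage.

Method: The paper proposes three test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. It also introduces new metrics, including lip-sync discrepancy and silent-audio-based lip-sync scores.

Result: The methodology detects and quantifies lip leakage and provides a model-agnostic, more reliable benchmark.

Insight: The choice of identity reference affects leakage, which offers guidance for future reference design.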

Abstract: Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leaking, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.

[12] Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Assaf Singer,Noam Rotstein,Amir Mann,Ron Kimmel,Or Litany

Main category: cs.CV

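TL;DR: The paper proposes Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Its dual-clock denoising strategy enforces strong alignment in motion-specified regions while allowing flexibility elsewhere.

Motivation: Existing image- and text-conditioned diffusion video generation lacks precise motion control, and prior motion-conditioned methods require model-specific fine-tuning, which is computationally expensive and restrictive. TTM aims to provide motion-controlled video generation without training.

Contribution: 1. The TTM framework, which achieves motion- and appearance-controlled video generation without training; 2. dual-clock denoising, which balances fidelity to user intent with natural dynamics; 3. compatibility with any backbone at no additional training or runtime cost.

Method: Crude reference animations serve as coarse motion cues, and image conditioning preserves appearance. Dual-clock denoising enforces strong alignment in motion-specified regions and allows flexibility in the remaining regions.

Result: On object and camera motion benchmarks, TTM matches or exceeds existing training-based baselines in realism and motion control.

Insight: By combining crude animations with image conditioning, TTM shows that precise motion control is possible without training, broadening the range of applications for diffusion models.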

Abstract: Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit’s use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.

[13] CADIC: Continual Anomaly Detection Based on Incremental Coreset

Gen Yang,Zhipeng Deng,Junfeng Man

Main category: cs.CV

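TL;DR: The paper proposes CADIC, a continual anomaly detection method based on an incremental coreset. A unified memory bank with incremental updates to a fixed-size coreset avoids the task-specific memory fragmentation of existing methods, and the method achieves strong detection accuracy on several datasets.

Motivation: Existing continual anomaly detection methods build a separate sub-memory bank for each task, which limits flexibility and scalability. The paper proposes a shared, unified memory bank to avoid this fragmentation.

Contribution: A CAD framework in which all tasks share a unified memory bank; incremental coreset updates that enable continuous knowledge acquisition; and an inference mechanism based on nearest-neighbor matching that reaches state-of-the-art detection accuracy.

Method: During training, embeddings in a fixed-size coreset are updated incrementally. At inference, anomaly scores are computed via nearest-neighbor matching. Experiments validate the method.

Result: On the MVTec AD and Visa datasets, CADIC reaches average image-level AUROC scores of 0.972 and 0.891, respectively. On a real-world electronic paper dataset, it achieves 100% accuracy in detecting anomalous samples.

Insight: Coreset techniques can avoid memory fragmentation in continual learning, and the unified memory bank improves the method's flexibility and scalability.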

Abstract: The primary objective of Continual Anomaly Detection (CAD) is to learn the normal patterns of new tasks under dynamic data distribution assumptions while mitigating catastrophic forgetting. Existing embedding-based CAD approaches continuously update a memory bank with new embeddings to adapt to sequential tasks. However, these methods require constructing class-specific sub-memory banks for each task, which restricts their flexibility and scalability. To address this limitation, we propose a novel CAD framework where all tasks share a unified memory bank. During training, the method incrementally updates embeddings within a fixed-size coreset, enabling continuous knowledge acquisition from sequential tasks without task-specific memory fragmentation. In the inference phase, anomaly scores are computed via a nearest-neighbor matching mechanism, achieving state-of-the-art detection accuracy. We validate the method through comprehensive experiments on MVTec AD and Visa datasets. Results show that our approach outperforms existing baselines, achieving average image-level AUROC scores of 0.972 (MVTec AD) and 0.891 (Visa). Notably, on a real-world electronic paper dataset, it demonstrates 100% accuracy in anomaly sample detection, confirming its robustness in practical scenarios. The implementation will be open-sourced on GitHub.
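The combination of a fixed-size shared memory and nearest-neighbor scoring can be sketched as follows. This is only an illustration: the farthest-point (k-center greedy) selection used to keep the memory at a fixed size is a common coreset heuristic swapped in here as an assumption, the function names are hypothetical, and the 2-D points stand in for real image embeddings. It is not the CADIC update rule.

```python
import math


def greedy_coreset(points, size):
    """Farthest-point selection: repeatedly add the point farthest from the current set.

    Assumes `points` is non-empty.
    """
    selected = [points[0]]
    while len(selected) < min(size, len(points)):
        farthest = max(
            points, key=lambda p: min(math.dist(p, s) for s in selected)
        )
        selected.append(farthest)
    return selected


def update_memory(memory, new_embeddings, size):
    """Merge embeddings from a new task into the shared memory, keeping a fixed budget."""
    return greedy_coreset(memory + new_embeddings, size)


def anomaly_score(memory, embedding):
    """Distance to the nearest stored normal embedding; larger means more anomalous."""
    return min(math.dist(embedding, m) for m in memory)


# Two sequential "tasks" with toy normal embeddings, sharing one memory of size 3.
memory = update_memory([], [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)], size=2)
memory = update_memory(memory, [(5.0, 5.0), (5.0, 6.0)], size=3)

normal_score = anomaly_score(memory, (0.0, 0.1))
anomalous_score = anomaly_score(memory, (10.0, 10.0))
```

After the second update the memory still holds only three embeddings but covers both tasks, so a sample near either task's normal data scores low while a distant sample scores high.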

[14] Predict and Resist: Long-Term Accident Anticipation under Sensor Noise

Xingcheng Liu,Bin Rao,Yanchen Guan,Chengyue Wang,Haicheng Liao,Jiaxun Zhang,Chengyu Lin,Meixin Zhu,Zhenning Li

Main category: cs.CV

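TL;DR: This paper proposes a framework that combines diffusion-based denoising with a time-aware actor-critic model for long-term accident anticipation, addressing sensor noise and the difficulty of issuing early alerts.

Motivation: Accident anticipation in autonomous driving depends on timely, reliable warnings, but sensor noise and the trade-off between early prediction and reliability hinder real-world deployment.

Contribution: A unified framework that integrates diffusion-based denoising with a time-aware actor-critic model, improving prediction accuracy and timeliness under noise.

Method: The diffusion module refines features iteratively while preserving motion and interaction cues. The time-aware actor-critic model balances early prediction against reliability.

Result: The method achieves state-of-the-art accuracy on several benchmark datasets, with significant gains in mean time-to-accident and robust performance under noise.

Insight: Combining diffusion denoising with time-aware reinforcement learning offers a new approach to long-horizon prediction under noise, producing predictions that are more stable and better aligned with human judgment.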

Abstract: Accident anticipation is essential for proactive and safe autonomous driving, where even a brief advance warning can enable critical evasive actions. However, two key challenges hinder real-world deployment: (1) noisy or degraded sensory inputs from weather, motion blur, or hardware limitations, and (2) the need to issue timely yet reliable predictions that balance early alerts with false-alarm suppression. We propose a unified framework that integrates diffusion-based denoising with a time-aware actor-critic model to address these challenges. The diffusion module reconstructs noise-resilient image and object features through iterative refinement, preserving critical motion and interaction cues under sensor degradation. In parallel, the actor-critic architecture leverages long-horizon temporal reasoning and time-weighted rewards to determine the optimal moment to raise an alert, aligning early detection with reliability. Experiments on three benchmark datasets (DAD, CCD, A3D) demonstrate state-of-the-art accuracy and significant gains in mean time-to-accident, while maintaining robust performance under Gaussian and impulse noise. Qualitative analyses further show that our model produces earlier, more stable, and human-aligned predictions in both routine and highly complex traffic scenarios, highlighting its potential for real-world, safety-critical deployment.

[15] RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation

Hae-Won Jo,Yeong-Jun Cho

Main category: cs.CV

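TL;DR: RS-Net is a modular framework for dynamic scene graph generation that scores the importance of object pairs using spatial interactions and long-range temporal context, improving both Precision and Recall of relation prediction.

Motivation: Existing Dynamic Scene Graph Generation (DSGG) methods are trained only on annotated object pairs and lack guidance for non-related pairs, which makes it hard to identify meaningful relations at inference time.

Contribution: Proposes the Relation Scoring Network (RS-Net), which uses a spatial context encoder and a temporal encoder to score the contextual importance of object pairs and integrates the scores into a unified triplet scoring mechanism.

Method: RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores enhance relation prediction.

Result: On the Action Genome dataset, RS-Net improves both Recall and Precision, with notable gains on the long-tailed distribution of relations, while maintaining competitive efficiency.

Insight: RS-Net can be integrated into existing DSGG models without architectural changes and provides richer contextual modeling of how relations evolve in dynamic scene graphs.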

Abstract: Dynamic Scene Graph Generation (DSGG) models how object relations evolve over time in videos. However, existing methods are trained only on annotated object pairs and lack guidance for non-related pairs, making it difficult to identify meaningful relations during inference. In this paper, we propose Relation Scoring Network (RS-Net), a modular framework that scores the contextual importance of object pairs using both spatial interactions and long-range temporal context. RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores are integrated into a unified triplet scoring mechanism to enhance relation prediction. RS-Net can be easily integrated into existing DSGG models without architectural changes. Experiments on the Action Genome dataset show that RS-Net consistently improves both Recall and Precision across diverse baselines, with notable gains in mean Recall, highlighting its ability to address the long-tailed distribution of relations. Despite the increased number of parameters, RS-Net maintains competitive efficiency, achieving superior performance over state-of-the-art methods.

[16] Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding

Joseph Fioresi,Ishan Rajendrakumar Dave,Mubarak Shah

Main category: cs.CV

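TL;DR: The paper proposes a new approach to video privacy preservation that operates in the latent space, avoiding the limitations of pixel-level anonymization while retaining utility for downstream tasks.

Motivation: Existing video privacy preservation methods focus on pixel-level anonymization, which requires retraining the entire model and is task-specific, making it unsuitable for video foundation models. These methods also do not address privacy leakage in the latent space.

Contribution: 1. A lightweight Anonymizing Adapter Module (AAM) that removes private information in the latent space while retaining task utility. 2. Three new training objectives that balance privacy protection and task performance. 3. New protocols for assessing gender bias, showing that the method effectively reduces bias.

Method: 1. AAM is applied in a plug-and-play fashion to frozen video encoders. 2. Three training objectives are used: a clip-level self-supervised privacy objective that reduces mutual information between static clips, a co-training objective that retains utility on seen tasks, and a latent consistency loss for generalization to unseen tasks.

Result: Experiments show a 35% reduction in privacy leakage while maintaining near-baseline performance across several tasks (action recognition, temporal action detection, and anomaly detection). The method also effectively reduces gender bias.

Insight: Latent-space anonymization is an effective route to privacy preservation, especially for video foundation models. Well-designed multi-task training objectives can balance privacy protection and task performance.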

Abstract: We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundational models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis on anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.

[17] Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?

Xinchen Yan,Chen Liang,Lijun Yu,Adams Wei Yu,Yifeng Lu,Quoc V. Le

Main category: cs.CV

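TL;DR: This paper studies the scaling properties of autoregressive next-pixel prediction. It finds that the optimal scaling strategies for classification and generation differ markedly, and that compute, rather than data volume, is the main bottleneck.

Details Motivation: To explore how autoregressive next-pixel prediction, a simple but under-explored framework for unified vision models, scales across different tasks (classification and generation).

Contribution: Reveals that classification and generation call for different scaling strategies, and identifies compute as the main bottleneck for future pixel-by-pixel modeling.

Method: Trains a family of Transformer models using IsoFlops profiles at compute budgets up to 7e19 FLOPs, and evaluates the next-pixel prediction objective, ImageNet classification accuracy, and generation quality.

Result: The optimal scaling strategies for classification and generation differ: the generation-optimal setup needs data size to grow three to five times faster than the classification-optimal one. Compute is the limiting factor, and pixel-by-pixel modeling of images is forecast to become feasible within five years.

Insight: Future research should focus on improving compute efficiency rather than simply increasing the amount of data.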

Abstract: This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation quality measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. At a fixed 32x32 resolution alone, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup does. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
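
The forecast in this abstract rests on compounding compute growth. As a rough illustration only, the sketch below computes the cumulative multiplier implied by four- to five-fold annual growth over five years, and shows how a fixed FLOP budget trades model size against training tokens. The function names and the example model sizes are hypothetical, and the cost rule FLOPs ≈ 6 × parameters × tokens is a common Transformer approximation assumed here; the abstract does not state which cost model the authors use.

```python
def compute_multiplier(annual_growth: float, years: int) -> float:
    """Cumulative compute growth after `years` of compounding annual growth."""
    return annual_growth ** years


def tokens_for_budget(flops: float, params: float) -> float:
    """Training tokens affordable at a fixed budget.

    ASSUMPTION: flops ~= 6 * params * tokens (a common rule of thumb,
    not taken from the abstract).
    """
    return flops / (6.0 * params)


# Growth range quoted in the abstract: four to five times per year, five years.
for growth in (4.0, 5.0):
    print(f"{growth:.0f}x per year for 5 years -> {compute_multiplier(growth, 5):.0f}x")

# One iso-FLOP profile at the largest budget mentioned in the abstract (7e19 FLOPs).
# The model sizes below are hypothetical examples.
budget = 7e19
for params in (1e7, 1e8, 1e9):
    print(f"params={params:.0e} -> tokens={tokens_for_budget(budget, params):.2e}")
```

Under these assumptions, five years of growth corresponds to roughly 1024x to 3125x more compute, and along a single iso-FLOP profile every tenfold increase in parameters cuts the affordable training tokens tenfold.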

[18] Harnessing Diffusion-Generated Synthetic Images for Fair Image Classification

Abhipsa Basu,Aviral Gupta,Abhijnya Bhat,R. Venkatesh Babu

Main category: cs.CV

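TL;DR: This paper explores how diffusion models can generate balanced training data to reduce bias in image classification. It fine-tunes diffusion models with techniques such as LoRA and DreamBooth and combines them with clustering to make the generated data more faithful to each group. Experiments show that the approach outperforms vanilla Stable Diffusion and surpasses state-of-the-art methods such as Group-DRO on highly biased datasets.

Details Motivation: Image classification systems often become biased because different groups are unevenly represented in the training data, for example when blond hair is over-associated with women. How to generate more balanced data to reduce this bias is an important question.

Contribution: The paper studies several diffusion fine-tuning techniques (such as LoRA and DreamBooth) and combines them with clustering to generate more accurate balanced data. It also demonstrates the approach's advantage on highly biased datasets.

Method: The approach has three parts: (1) fine-tune diffusion models with LoRA and DreamBooth; (2) cluster the images within each group and train one DreamBooth model per cluster to limit intra-group variation; (3) pretrain on the generated balanced data, then fine-tune on real data.

Result: Experiments show that the fine-tuning approaches outperform vanilla Stable Diffusion and achieve results comparable to state-of-the-art methods such as Group-DRO across multiple benchmarks, surpassing them as dataset bias becomes more severe.

Insight: With appropriate fine-tuning and clustering, diffusion models can generate more balanced training data and thereby reduce group bias in image classification, especially on highly biased datasets.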

Abstract: Image classification systems often inherit biases from uneven group representation in training data. For example, in face datasets for hair color classification, blond hair may be disproportionately associated with females, reinforcing stereotypes. A recent approach leverages the Stable Diffusion model to generate balanced training data, but these models often struggle to preserve the original data distribution. In this work, we explore multiple diffusion-finetuning techniques, e.g., LoRA and DreamBooth, to generate images that more accurately represent each training group by learning directly from their samples. Additionally, in order to prevent a single DreamBooth model from being overwhelmed by excessive intra-group variations, we explore a technique of clustering images within each group and train a DreamBooth model per cluster. These models are then used to generate group-balanced data for pretraining, followed by fine-tuning on real data. Experiments on multiple benchmarks demonstrate that the studied finetuning approaches outperform vanilla Stable Diffusion on average and achieve results comparable to SOTA debiasing techniques like Group-DRO, while surpassing them as the dataset bias severity increases.
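
To make the "one DreamBooth model per cluster, group-balanced output" idea concrete, here is a minimal sketch of how an equal per-group image budget might be divided among that group's cluster models. This is an illustration under stated assumptions, not the authors' procedure: the abstract does not say how generation counts are allocated, and the function name, group names, and cluster counts below are all hypothetical.

```python
from typing import Dict, List


def per_cluster_quota(clusters_per_group: Dict[str, int],
                      images_per_group: int) -> Dict[str, List[int]]:
    """Split an equal per-group image budget as evenly as possible across clusters.

    ASSUMPTION: every group receives the same total number of synthetic images,
    and each cluster-specific model contributes a near-equal share.
    """
    quotas: Dict[str, List[int]] = {}
    for group, n_clusters in clusters_per_group.items():
        base, extra = divmod(images_per_group, n_clusters)
        # The first `extra` clusters generate one more image so the total is exact.
        quotas[group] = [base + (1 if c < extra else 0) for c in range(n_clusters)]
    return quotas


# Hypothetical groups (hair color x gender) with different numbers of clusters.
quotas = per_cluster_quota({"blond_female": 3, "blond_male": 2}, images_per_group=100)
for group, counts in quotas.items():
    print(group, counts, "total:", sum(counts))
```

Every group ends up with the same total regardless of how many clusters it was split into, which is the balancing property the generated pretraining set is meant to have.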

[19] WiCV at CVPR 2025: The Women in Computer Vision Workshop

Estefania Talavera,Deblina Bhattacharjee,Himangi Mittal,Mengwei Ren,Karen Sanchez,Carla Muntean,JungEun Kim,Mona Jalal

Main category: cs.CV

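TL;DR: This document is a summary report on the Women in Computer Vision Workshop (WiCV) held at CVPR 2025, covering the workshop overview, participation statistics, mentoring program outcomes, and historical trends.

Details Motivation: To document how WiCV, across its 16 editions, has advanced the visibility, inclusion, and professional growth of women and underrepresented minorities in computer vision, and to serve as a reference for similar future initiatives.

Contribution: Provides a detailed record of WiCV activities and data, showing the workshop's impact and evolution as a reference for other initiatives that promote diversity and inclusion.

Method: Uses statistics and comparisons with historical trends to summarize WiCV 2025 paper submissions and acceptances, mentoring program matches, and sponsorship support.

Result: WiCV 2025 accepted 14 full papers and 36 extended-abstract posters, matched 80 mentees with 37 mentors through its mentoring program, attracted over 100 onsite participants, and was supported by 10 sponsors.

Insight: WiCV's experience suggests that a sustained mentoring program, sponsorship support, and community participation are key to advancing diversity and inclusion.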

Abstract: The Women in Computer Vision Workshop (WiCV@CVPR 2025) was held in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) in Nashville, Tennessee, United States. This report presents an overview of the workshop program, participation statistics, mentorship outcomes, and historical trends from previous WiCV editions. The goal is to document the impact and evolution of WiCV as a reference for future editions and for other initiatives aimed at advancing diversity, equity, and inclusion within the AI and computer vision communities. WiCV@CVPR 2025 marked the 16th edition of this long-standing event dedicated to increasing the visibility, inclusion, and professional growth of women and underrepresented minorities in the computer vision community. This year’s workshop featured 14 accepted papers in the CVPR Workshop Proceedings out of 32 full-paper submissions. Five of these were selected for oral presentations, while all 14 were also presented as posters, along with 36 extended abstract posters accepted from 62 short-paper submissions, which are not included in the proceedings. The mentoring program matched 80 mentees with 37 mentors from both academia and industry. The 2025 edition attracted over 100 onsite participants, fostering rich technical and networking interactions across all career stages. Supported by 10 sponsors and approximately $44,000 USD in travel grants and diversity awards, WiCV continued its mission to empower emerging researchers and amplify diverse voices in computer vision.

[20] SIFT-Graph: Benchmarking Multimodal Defense Against Image Adversarial Attacks With Robust Feature Graph

Jingjie He,Weijie Liang,Zihan Shan,Matthew Caesar

Main category: cs.CV

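TL;DR: The paper proposes SIFT-Graph, a multimodal defense framework that combines handcrafted and learned features to make traditional vision models more robust to adversarial attacks.

Details Motivation: Modern deep vision models depend on dense, pixel-level representations that are highly sensitive to imperceptible perturbations, and traditional defense strategies lack a mechanism for incorporating robust visual features. A new approach is needed to address this vulnerability.

Contribution: Proposes SIFT-Graph, a multimodal defense framework that combines handcrafted features (SIFT keypoints) with learned features (a Graph Attention Network) to improve model robustness to adversarial attacks.

Method: Integrates Scale-Invariant Feature Transform (SIFT) keypoints with a Graph Attention Network (GAT) to capture local structural features that remain stable under adversarial perturbation, then fuses these features with traditional vision models such as Vision Transformers and CNNs.

Result: Preliminary experiments show that SIFT-Graph improves the robustness of vision models to gradient-based white-box adversarial attacks while only marginally lowering accuracy on clean data.

Insight: Combining handcrafted and learned features can strengthen robustness to adversarial attacks while preserving clean-data performance. Multimodal fusion may be an important direction for future defense strategies.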

Abstract: Adversarial attacks expose a fundamental vulnerability in modern deep vision models by exploiting their dependence on dense, pixel-level representations that are highly sensitive to imperceptible perturbations. Traditional defense strategies typically operate within this fragile pixel domain, lacking mechanisms to incorporate inherently robust visual features. In this work, we introduce SIFT-Graph, a multimodal defense framework that enhances the robustness of traditional vision models by aggregating structurally meaningful features extracted from raw images using both handcrafted and learned modalities. Specifically, we integrate Scale-Invariant Feature Transform keypoints with a Graph Attention Network to capture scale- and rotation-invariant local structures that are resilient to perturbations. These robust feature embeddings are then fused with traditional vision models, such as Vision Transformers and Convolutional Neural Networks, to form a unified, structure-aware, and perturbation-defensive model. Preliminary results demonstrate that our method effectively improves the visual model robustness against gradient-based white-box adversarial attacks, while incurring only a marginal drop in clean accuracy.
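
Before a Graph Attention Network can operate on SIFT keypoints, the keypoints have to be connected into a graph. The abstract does not say how SIFT-Graph builds its edges, so the sketch below assumes one plausible choice, a k-nearest-neighbour graph over keypoint (x, y) locations. The function name and the toy coordinates are hypothetical, and keypoint detection itself (for example with an image library) is left out.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def knn_edges(points: List[Point], k: int) -> List[Tuple[int, int]]:
    """Directed edges (i, j) where j is one of the k nearest keypoints to i.

    ASSUMPTION: k-NN over pixel coordinates; the paper's actual graph
    construction is not described in the abstract.
    """
    edges: List[Tuple[int, int]] = []
    for i, p in enumerate(points):
        # Rank every other keypoint by Euclidean distance to keypoint i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        edges.extend((i, j) for j in others[:k])
    return edges


# Toy keypoint locations standing in for detected SIFT keypoints.
keypoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
edges = knn_edges(keypoints, k=2)
print(edges)
```

An edge list in this (source, target) form is the kind of input graph-attention layers typically consume, with each node carrying its keypoint descriptor as the feature vector.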

[21] DT-NVS: Diffusion Transformers for Novel View Synthesis

Wonbong Jang,Jonathan Tremblay,Lourdes Agapito

Main category: cs.CV

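TL;DR: The paper proposes DT-NVS, a diffusion-transformer method for novel view synthesis. It moves beyond existing methods' restriction to small camera movements or object-centric scenes and can generate diverse views of a 3D scene from a single image.

Details Motivation: Existing diffusion-based methods handle only small camera movements or unnatural object-centric scenes, which limits their potential in real applications. This paper aims to address that with a more general novel view synthesis method.

Contribution: 1. Proposes DT-NVS, a 3D-aware diffusion model that uses a transformer architecture to generate novel views; 2. Modifies the transformer and self-attention architectures to translate images into 3D representations; 3. Designs new camera conditioning strategies and a new training paradigm.

Method: Trains a 3D diffusion model with image-only losses on a large-scale dataset of real-world videos; uses a transformer architecture to translate images into 3D representations; introduces new camera conditioning strategies and a training paradigm that swaps the reference-frame role between the conditioning image and the sampled noisy input.

Result: On 3D novel view synthesis from a single input image, DT-NVS improves over existing 3D-aware diffusion models and deterministic approaches while generating diverse outputs.

Insight: Transformer architectures show strong potential for 3D vision tasks, particularly on unaligned datasets, and innovations in the training paradigm can improve model performance.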

Abstract: Generating novel views of a natural scene, e.g., everyday scenes both indoors and outdoors, from a single view is an under-explored problem, even though it is an organic extension to the object-centric novel view synthesis. Existing diffusion-based approaches focus rather on small camera movements in real scenes or only consider unnatural object-centric scenes, limiting their potential applications in real-world settings. In this paper, we move away from these constrained regimes and propose a 3D diffusion model trained with image-only losses on a large-scale dataset of real-world, multi-category, unaligned, and casually acquired videos of everyday scenes. We propose DT-NVS, a 3D-aware diffusion model for generalized novel view synthesis that exploits a transformer-based architecture backbone. We make significant contributions to transformer and self-attention architectures to translate images to 3D representations, and novel camera conditioning strategies to allow training on real-world unaligned datasets. In addition, we introduce a novel training paradigm swapping the role of reference frame between the conditioning image and the sampled noisy input. We evaluate our approach on the 3D task of generalized novel view synthesis from a single input image and show improvements over state-of-the-art 3D-aware diffusion models and deterministic approaches, while generating diverse outputs.