Table of Contents
- cs.CL [Total: 17]
- cs.CV [Total: 66]
- cs.IR [Total: 1]
- q-bio.QM [Total: 2]
- eess.IV [Total: 7]
- eess.AS [Total: 1]
- cs.GR [Total: 1]
- cs.RO [Total: 1]
- cs.HC [Total: 2]
- cs.CR [Total: 1]
- cs.AI [Total: 2]
- cs.LG [Total: 4]
cs.CL
[1] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gheorghe Comanici,Eric Bieber,Mike Schaekermann,Ice Pasupat,Noveen Sachdeva,Inderjit Dhillon,Marcel Blistein,Ori Ram,Dan Zhang,Evan Rosen,Luke Marris,Sam Petulla,Colin Gaffney,Asaf Aharoni,Nathan Lintz,Tiago Cardal Pais,Henrik Jacobsson,Idan Szpektor,Nan-Jiang Jiang,Krishna Haridasan,Ahmed Omran,Nikunj Saunshi,Dara Bahri,Gaurav Mishra,Eric Chu,Toby Boyd,Brad Hekman,Aaron Parisi,Chaoyi Zhang,Kornraphop Kawintiranon,Tania Bedrax-Weiss,Oliver Wang,Ya Xu,Ollie Purkiss,Uri Mendlovic,Ilaï Deutel,Nam Nguyen,Adam Langley,Flip Korn,Lucia Rossazza,Alexandre Ramé,Sagar Waghmare,Helen Miller,Vaishakh Keshava,Ying Jian,Xiaofan Zhang,Raluca Ada Popa,Kedar Dhamdhere,Blaž Bratanič,Kyuyeun Kim,Terry Koo,Ferran Alet,Yi-ting Chen,Arsha Nagrani,Hannah Muckenhirn,Zhiyuan Zhang,Corbin Quick,Filip Pavetić,Duc Dung Nguyen,Joao Carreira,Michael Elabd,Haroon Qureshi,Fabian Mentzer,Yao-Yuan Yang,Danielle Eisenbud,Anmol Gulati,Ellie Talius,Eric Ni,Sahra Ghalebikesabi,Edouard Yvinec,Alaa Saade,Thatcher Ulrich,Lorenzo Blanco,Dan A. 
Calian,Muhuan Huang,Aäron van den Oord,Naman Goyal,Terry Chen,Praynaa Rawlani,Christian Schallhart,Swachhand Lokhande,Xianghong Luo,Jyn Shan,Ceslee Montgomery,Victoria Krakovna,Federico Piccinini,Omer Barak,Jingyu Cui,Yiling Jia,Mikhail Dektiarev,Alexey Kolganov,Shiyu Huang,Zhe Chen,Xingyu Wang,Jessica Austin,Peter de Boursac,Evgeny Sluzhaev,Frank Ding,Huijian Li,Surya Bhupatiraju,Mohit Agarwal,Sławek Kwasiborski,Paramjit Sandhu,Patrick Siegler,Ahmet Iscen,Eyal Ben-David,Shiraz Butt,Miltos Allamanis,Seth Benjamin,Robert Busa-Fekete,Felix Hernandez-Campos,Sasha Goldshtein,Matt Dibb,Weiyang Zhang,Annie Marsden,Carey Radebaugh,Stephen Roller,Abhishek Nayyar,Jacob Austin,Tayfun Terzi,Bhargav Kanagal Shamanna,Pete Shaw,Aayush Singh,Florian Luisier,Artur Mendonça,Vaibhav Aggarwal,Larisa Markeeva,Claudio Fantacci,Sergey Brin,HyunJeong Choe,Guanyu Wang,Hartwig Adam,Avigail Dabush,Tatsuya Kiyono,Eyal Marcus,Jeremy Cole,Theophane Weber,Hongrae Lee,Ronny Huang,Alex Muzio,Leandro Kieliger,Maigo Le,Courtney Biles,Long Le,Archit Sharma,Chengrun Yang,Avery Lamp,Dave Dopson,Nate Hurley,Katrina,Xu,Zhihao Shan,Shuang Song,Jiewen Tan,Alexandre Senges,George Zhang,Chong You,Yennie Jun,David Raposo,Susanna Ricco,Xuan Yang,Weijie Chen,Prakhar Gupta,Arthur Szlam,Kevin Villela,Chun-Sung Ferng,Daniel Kasenberg,Chen Liang,Rui Zhu,Arunachalam Narayanaswamy,Florence Perot,Paul Pucciarelli,Anna Shekhawat,Alexey Stern,Rishikesh Ingale,Stefani Karp,Sanaz Bahargam,Adrian Goedeckemeyer,Jie Han,Sicheng Li,Andrea Tacchetti,Dian Yu,Abhishek Chakladar,Zhiying Zhang,Mona El Mahdy,Xu Gao,Dale Johnson,Samrat Phatale,AJ Piergiovanni,Hyeontaek Lim,Clement Farabet,Carl Lebsack,Theo Guidroz,John Blitzer,Nico Duduta,David Madras,Steve Li,Daniel von Dincklage,Xin Li,Mahdis Mahdieh,George Tucker,Ganesh Jawahar,Owen Xiao,Danny Tarlow,Robert Geirhos,Noam Velan,Daniel Vlasic,Kalesha Bullard,SK Park,Nishesh Gupta,Kellie Webster,Ayal Hitron,Jieming Mao,Julian Eisenschlos,Laurel Prince,Nina D’Souza,Kelvin Zheng,Sara 
Nasso,Gabriela Botea,Carl Doersch,Caglar Unlu,Chris Alberti,Alexey Svyatkovskiy,Ankita Goel,Krzysztof Choromanski,Pan-Pan Jiang,Richard Nguyen,Four Flynn,Daria Ćurko,Peter Chen,Nicholas Roth,Kieran Milan,Caleb Habtegebriel,Shashi Narayan,Michael Moffitt,Jake Marcus,Thomas Anthony,Brendan McMahan,Gowoon Cheon,Ruibo Liu,Megan Barnes,Lukasz Lew,Rebeca Santamaria-Fernandez,Mayank Upadhyay,Arjun Akula,Arnar Mar Hrafnkelsson,Alvaro Caceres,Andrew Bunner,Michal Sokolik,Subha Puttagunta,Lawrence Moore,Berivan Isik,Weilun Chen,Jay Hartford,Lawrence Chan,Pradeep Shenoy,Dan Holtmann-Rice,Jane Park,Fabio Viola,Alex Salcianu,Sujeevan Rajayogam,Ian Stewart-Binks,Zelin Wu,Richard Everett,Xi Xiong,Pierre-Antoine Manzagol,Gary Leung,Carl Saroufim,Bo Pang,Dawid Wegner,George Papamakarios,Jennimaria Palomaki,Helena Pankov,Guangda Lai,Guilherme Tubone,Shubin Zhao,Theofilos Strinopoulos,Seth Neel,Mingqiu Wang,Joe Kelley,Li Li,Pingmei Xu,Anitha Vijayakumar,Andrea D’olimpio,Omer Levy,Massimo Nicosia,Grigory Rozhdestvenskiy,Ni Lao,Sirui Xie,Yash Katariya,Jon Simon,Sanjiv Kumar,Florian Hartmann,Michael Kilgore,Jinhyuk Lee,Aroma Mahendru,Roman Ring,Tom Hennigan,Fiona Lang,Colin Cherry,David Steiner,Dawsen Hwang,Ray Smith,Pidong Wang,Jeremy Chen,Ming-Hsuan Yang,Sam Kwei,Philippe Schlattner,Donnie Kim,Ganesh Poomal Girirajan,Nikola Momchev,Ayushi Agarwal,Xingyi Zhou,Ilkin Safarli,Zachary Garrett,AJ Pierigiovanni,Sarthak Jauhari,Alif Raditya Rochman,Shikhar Vashishth,Quan Yuan,Christof Angermueller,Jon Blanton,Xinying Song,Nitesh Bharadwaj Gundavarapu,Thi Avrahami,Maxine Deines,Subhrajit Roy,Manish Gupta,Christopher Semturs,Shobha Vasudevan,Aditya Srikanth Veerubhotla,Shriya Sharma,Josh Jacob,Zhen Yang,Andreas Terzis,Dan Karliner,Auriel Wright,Tania Rojas-Esponda,Ashley Brown,Abhijit Guha Roy,Pawan Dogra,Andrei Kapishnikov,Peter Young,Wendy Kan,Vinodh Kumar Rajendran,Maria Ivanova,Salil Deshmukh,Chia-Hua Ho,Mike Kwong,Stav Ginzburg,Annie Louis,KP Sawhney,Slav Petrov,Jing Xie,Yunfei Bai,Georgi 
Stoyanov,Alex Fabrikant,Rajesh Jayaram,Yuqi Li,Joe Heyward,Justin Gilmer,Yaqing Wang,Radu Soricut,Luyang Liu,Qingnan Duan,Jamie Hayes,Maura O’Brien,Gaurav Singh Tomar,Sivan Eiger,Bahar Fatemi,Jeffrey Hui,Catarina Barros,Adaeze Chukwuka,Alena Butryna,Saksham Thakur,Austin Huang,Zhufeng Pan,Haotian Tang,Serkan Cabi,Tulsee Doshi,Michiel Bakker,Sumit Bagri,Ruy Ley-Wild,Adam Lelkes,Jennie Lees,Patrick Kane,David Greene,Shimu Wu,Jörg Bornschein,Gabriela Surita,Sarah Hodkinson,Fangtao Li,Chris Hidey,Sébastien Pereira,Sean Ammirati,Phillip Lippe,Adam Kraft,Pu Han,Sebastian Gerlach,Zifeng Wang,Liviu Panait,Feng Han,Brian Farris,Yingying Bi,Hannah DeBalsi,Miaosen Wang,Gladys Tyen,James Cohan,Susan Zhang,Jarred Barber,Da-Woon Chung,Jaeyoun Kim,Markus Kunesch,Steven Pecht,Nami Akazawa,Abe Friesen,James Lyon,Ali Eslami,Junru Wu,Jie Tan,Yue Song,Ravi Kumar,Chris Welty,Ilia Akolzin,Gena Gibson,Sean Augenstein,Arjun Pillai,Nancy Yuen,Du Phan,Xin Wang,Iain Barr,Heiga Zen,Nan Hua,Casper Liu,Jilei,Wang,Tanuj Bhatia,Hao Xu,Oded Elyada,Pushmeet Kohli,Mirek Olšák,Ke Chen,Azalia Mirhoseini,Noam Shazeer,Shoshana Jakobovits,Maggie Tran,Nolan Ramsden,Tarun Bharti,Fred Alcober,Yunjie Li,Shilpa Shetty,Jing Chen,Dmitry Kalashnikov,Megha Nawhal,Sercan Arik,Hanwen Chen,Michiel Blokzijl,Shubham Gupta,James Rubin,Rigel Swavely,Sophie Bridgers,Ian Gemp,Chen Su,Arun Suggala,Juliette Pluto,Mary Cassin,Alain Vaucher,Kaiyang Ji,Jiahao Cai,Andrew Audibert,Animesh Sinha,David Tian,Efrat Farkash,Amy Hua,Jilin Chen,Duc-Hieu Tran,Edward Loper,Nicole Brichtova,Lara McConnaughey,Ballie Sandhu,Robert Leland,Doug DeCarlo,Andrew Over,James Huang,Xing Wu,Connie Fan,Eric Li,Yun Lei,Deepak Sharma,Cosmin Paduraru,Luo Yu,Matko Bošnjak,Phuong Dao,Min Choi,Sneha Kudugunta,Jakub Adamek,Carlos Guía,Ali Khodaei,Jie Feng,Wenjun Zeng,David Welling,Sandeep Tata,Christina Butterfield,Andrey Vlasov,Seliem El-Sayed,Swaroop Mishra,Tara Sainath,Shentao Yang,RJ Skerry-Ryan,Jeremy Shar,Robert Berry,Arunkumar Rajendran,Arun 
Kandoor,Andrea Burns,Deepali Jain,Tom Stone,Wonpyo Park,Shibo Wang,Albin Cassirer,Guohui Wang,Hayato Kobayashi,Sergey Rogulenko,Vineetha Govindaraj,Mikołaj Rybiński,Nadav Olmert,Colin Evans,Po-Sen Huang,Kelvin Xu,Premal Shah,Terry Thurk,Caitlin Sikora,Mu Cai,Jin Xie,Elahe Dabir,Saloni Shah,Norbert Kalb,Carrie Zhang,Shruthi Prabhakara,Amit Sabne,Artiom Myaskovsky,Vikas Raunak,Blanca Huergo,Behnam Neyshabur,Jon Clark,Ye Zhang,Shankar Krishnan,Eden Cohen,Dinesh Tewari,James Lottes,Yumeya Yamamori,Hui,Li,Mohamed Elhawaty,Ada Maksutaj Oflazer,Adrià Recasens,Sheryl Luo,Duy Nguyen,Taylor Bos,Kalyan Andra,Ana Salazar,Ed Chi,Jeongwoo Ko,Matt Ginsberg,Anders Andreassen,Anian Ruoss,Todor Davchev,Elnaz Davoodi,Chenxi Liu,Min Kim,Santiago Ontanon,Chi Ming To,Dawei Jia,Rosemary Ke,Jing Wang,Anna Korsun,Moran Ambar,Ilya Kornakov,Irene Giannoumis,Toni Creswell,Denny Zhou,Yi Su,Ishaan Watts,Aleksandr Zaks,Evgenii Eltyshev,Ziqiang Feng,Sidharth Mudgal,Alex Kaskasoli,Juliette Love,Kingshuk Dasgupta,Sam Shleifer,Richard Green,Sungyong Seo,Chansoo Lee,Dale Webster,Prakash Shroff,Ganna Raboshchuk,Isabel Leal,James Manyika,Sofia Erell,Daniel Murphy,Zhisheng Xiao,Anton Bulyenov,Julian Walker,Mark Collier,Matej Kastelic,Nelson George,Sushant Prakash,Sailesh Sidhwani,Alexey Frolov,Steven Hansen,Petko Georgiev,Tiberiu Sosea,Chris Apps,Aishwarya Kamath,David Reid,Emma Cooney,Charlotte Magister,Oriana Riva,Alec Go,Pu-Chin Chen,Sebastian Krause,Nir Levine,Marco Fornoni,Ilya Figotin,Nick Roy,Parsa Mahmoudieh,Vladimir Magay,Mukundan Madhavan,Jin Miao,Jianmo Ni,Yasuhisa Fujii,Ian Chou,George Scrivener,Zak Tsai,Siobhan Mcloughlin,Jeremy Selier,Sandra Lefdal,Jeffrey Zhao,Abhijit Karmarkar,Kushal Chauhan,Shivanker Goel,Zhaoyi Zhang,Vihan Jain,Parisa Haghani,Mostafa Dehghani,Jacob Scott,Erin Farnese,Anastasija Ilić,Steven Baker,Julia Pawar,Li Zhong,Josh Camp,Yoel Zeldes,Shravya Shetty,Anand Iyer,Vít Listík,Jiaxian Guo,Luming Tang,Mark Geller,Simon Bucher,Yifan Ding,Hongzhi Shi,Carrie Muir,Dominik 
Grewe,Ramy Eskander,Octavio Ponce,Boqing Gong,Derek Gasaway,Samira Khan,Umang Gupta,Angelos Filos,Weicheng Kuo,Klemen Kloboves,Jennifer Beattie,Christian Wright,Leon Li,Alicia Jin,Sandeep Mariserla,Miteyan Patel,Jens Heitkaemper,Dilip Krishnan,Vivek Sharma,David Bieber,Christian Frank,John Lambert,Paul Caron,Martin Polacek,Mai Giménez,Himadri Choudhury,Xing Yu,Sasan Tavakkol,Arun Ahuja,Franz Och,Rodolphe Jenatton,Wojtek Skut,Bryan Richter,David Gaddy,Andy Ly,Misha Bilenko,Megh Umekar,Ethan Liang,Martin Sevenich,Mandar Joshi,Hassan Mansoor,Rebecca Lin,Sumit Sanghai,Abhimanyu Singh,Xiaowei Li,Sudheendra Vijayanarasimhan,Zaheer Abbas,Yonatan Bitton,Hansa Srinivasan,Manish Reddy Vuyyuru,Alexander Frömmgen,Yanhua Sun,Ralph Leith,Alfonso Castaño,DJ Strouse,Le Yan,Austin Kyker,Satish Kambala,Mary Jasarevic,Thibault Sellam,Chao Jia,Alexander Pritzel,Raghavender R,Huizhong Chen,Natalie Clay,Sudeep Gandhe,Sean Kirmani,Sayna Ebrahimi,Hannah Kirkwood,Jonathan Mallinson,Chao Wang,Adnan Ozturel,Kuo Lin,Shyam Upadhyay,Vincent Cohen-Addad,Sean Purser-haskell,Yichong Xu,Ebrahim Songhori,Babi Seal,Alberto Magni,Almog Gueta,Tingting Zou,Guru Guruganesh,Thais Kagohara,Hung Nguyen,Khalid Salama,Alejandro Cruzado Ruiz,Justin Frye,Zhenkai Zhu,Matthias Lochbrunner,Simon Osindero,Wentao Yuan,Lisa Lee,Aman Prasad,Lam Nguyen Thiet,Daniele Calandriello,Victor Stone,Qixuan Feng,Han Ke,Maria Voitovich,Geta Sampemane,Lewis Chiang,Ling Wu,Alexander Bykovsky,Matt Young,Luke Vilnis,Ishita Dasgupta,Aditya Chawla,Qin Cao,Bowen Liang,Daniel Toyama,Szabolcs Payrits,Anca Stefanoiu,Dimitrios Vytiniotis,Ankesh Anand,Tianxiao Shen,Blagoj Mitrevski,Michael Tschannen,Sreenivas Gollapudi,Aishwarya P S,José Leal,Zhe Shen,Han Fu,Wei Wang,Arvind Kannan,Doron Kukliansky,Sergey Yaroshenko,Svetlana Grant,Umesh Telang,David Wood,Alexandra Chronopoulou,Alexandru Ţifrea,Tao Zhou,Tony,Nguy~ên,Muge Ersoy,Anima Singh,Meiyan Xie,Emanuel Taropa,Woohyun Han,Eirikur Agustsson,Andrei Sozanschi,Hui Peng,Alex Chen,Yoel 
Drori,Efren Robles,Yang Gao,Xerxes Dotiwalla,Ying Chen,Anudhyan Boral,Alexei Bendebury,John Nham,Chris Tar,Luis Castro,Jiepu Jiang,Canoee Liu,Felix Halim,Jinoo Baek,Andy Wan,Jeremiah Liu,Yuan Cao,Shengyang Dai,Trilok Acharya,Ruoxi Sun,Fuzhao Xue,Saket Joshi,Morgane Lustman,Yongqin Xian,Rishabh Joshi,Deep Karkhanis,Nora Kassner,Jamie Hall,Xiangzhuo Ding,Gan Song,Gang Li,Chen Zhu,Yana Kulizhskaya,Bin Ni,Alexey Vlaskin,Solomon Demmessie,Lucio Dery,Salah Zaiem,Yanping Huang,Cindy Fan,Felix Gimeno,Ananth Balashankar,Koji Kojima,Hagai Taitelbaum,Maya Meng,Dero Gharibian,Sahil Singla,Wei Chen,Ambrose Slone,Guanjie Chen,Sujee Rajayogam,Max Schumacher,Suyog Kotecha,Rory Blevins,Qifei Wang,Mor Hazan Taege,Alex Morris,Xin Liu,Fayaz Jamil,Richard Zhang,Pratik Joshi,Ben Ingram,Tyler Liechty,Ahmed Eleryan,Scott Baird,Alex Grills,Gagan Bansal,Shan Han,Kiran Yalasangi,Shawn Xu,Majd Al Merey,Isabel Gao,Felix Weissenberger,Igor Karpov,Robert Riachi,Ankit Anand,Gautam Prasad,Kay Lamerigts,Reid Hayes,Jamie Rogers,Mandy Guo,Ashish Shenoy,Qiong,Hu,Kyle He,Yuchen Liu,Polina Zablotskaia,Sagar Gubbi,Yifan Chang,Jay Pavagadhi,Kristian Kjems,Archita Vadali,Diego Machado,Yeqing Li,Renshen Wang,Dipankar Ghosh,Aahil Mehta,Dana Alon,George Polovets,Alessio Tonioni,Nate Kushman,Joel D’sa,Lin Zhuo,Allen Wu,Rohin Shah,John Youssef,Jiayu Ye,Justin Snyder,Karel Lenc,Senaka Buthpitiya,Matthew Tung,Jichuan Chang,Tao Chen,David Saxton,Jenny Lee,Lydia Lihui Zhang,James Qin,Prabakar Radhakrishnan,Maxwell Chen,Piotr Ambroszczyk,Metin Toksoz-Exley,Yan Zhong,Nitzan Katz,Brendan O’Donoghue,Tamara von Glehn,Adi Gerzi Rosenthal,Aga Świetlik,Xiaokai Zhao,Nick Fernando,Jinliang Wei,Jieru Mei,Sergei Vassilvitskii,Diego Cedillo,Pranjal Awasthi,Hui Zheng,Koray Kavukcuoglu,Itay Laish,Joseph Pagadora,Marc Brockschmidt,Christopher A. 
Choquette-Choo,Arunkumar Byravan,Yifeng Lu,Xu Chen,Mia Chen,Kenton Lee,Rama Pasumarthi,Sijal Bhatnagar,Aditya Shah,Qiyin Wu,Zhuoyuan Chen,Zack Nado,Bartek Perz,Zixuan Jiang,David Kao,Ganesh Mallya,Nino Vieillard,Lantao Mei,Sertan Girgin,Mandy Jordan,Yeongil Ko,Alekh Agarwal,Yaxin Liu,Yasemin Altun,Raoul de Liedekerke,Anastasios Kementsietsidis,Daiyi Peng,Dangyi Liu,Utku Evci,Peter Humphreys,Austin Tarango,Xiang Deng,Yoad Lewenberg,Kevin Aydin,Chengda Wu,Bhavishya Mittal,Tsendsuren Munkhdalai,Kleopatra Chatziprimou,Rodrigo Benenson,Uri First,Xiao Ma,Jinning Li,Armand Joulin,Hamish Tomlinson,Tingnan Zhang,Milad Nasr,Zhi Hong,Michaël Sander,Lisa Anne Hendricks,Anuj Sharma,Andrew Bolt,Eszter Vértes,Jiri Simsa,Tomer Levinboim,Olcan Sercinoglu,Divyansh Shukla,Austin Wu,Craig Swanson,Danny Vainstein,Fan Bu,Bo Wang,Ryan Julian,Charles Yoon,Sergei Lebedev,Antonious Girgis,Bernd Bandemer,David Du,Todd Wang,Xi Chen,Ying Xiao,Peggy Lu,Natalie Ha,Vlad Ionescu,Simon Rowe,Josip Matak,Federico Lebron,Andreas Steiner,Lalit Jain,Manaal Faruqui,Nicolas Lacasse,Georgie Evans,Neesha Subramaniam,Dean Reich,Giulia Vezzani,Aditya Pandey,Joe Stanton,Tianhao Zhou,Liam McCafferty,Henry Griffiths,Verena Rieser,Soheil Hassas Yeganeh,Eleftheria Briakou,Lu Huang,Zichuan Wei,Liangchen Luo,Erik Jue,Gabby Wang,Victor Cotruta,Myriam Khan,Jongbin Park,Qiuchen Guo,Peiran Li,Rong Rong,Diego Antognini,Anastasia Petrushkina,Chetan Tekur,Eli Collins,Parul Bhatia,Chester Kwak,Wenhu Chen,Arvind Neelakantan,Immanuel Odisho,Sheng Peng,Vincent Nallatamby,Vaibhav Tulsyan,Fabian Pedregosa,Peng Xu,Raymond Lin,Yulong Wang,Emma Wang,Sholto Douglas,Reut Tsarfaty,Elena Gribovskaya,Renga Aravamudhan,Manu Agarwal,Mara Finkelstein,Qiao Zhang,Elizabeth Cole,Phil Crone,Sarmishta Velury,Anil Das,Chris Sauer,Luyao Xu,Danfeng Qin,Chenjie Gu,Dror Marcus,CJ Zheng,Wouter Van Gansbeke,Sobhan Miryoosefi,Haitian Sun,YaGuang Li,Charlie Chen,Jae Yoo,Pavel Dubov,Alex Tomala,Adams Yu,Paweł Wesołowski,Alok Gunjan,Eddie Cao,Jiaming 
Luo,Nikhil Sethi,Arkadiusz Socala,Laura Graesser,Tomas Kocisky,Arturo BC,Minmin Chen,Edward Lee,Sophie Wang,Weize Kong,Qiantong Xu,Nilesh Tripuraneni,Yiming Li,Xinxin Yu,Allen Porter,Paul Voigtlaender,Biao Zhang,Arpi Vezer,Sarah York,Qing Wei,Geoffrey Cideron,Mark Kurzeja,Seungyeon Kim,Benny Li,Angéline Pouget,Hyo Lee,Kaspar Daugaard,Yang Li,Dave Uthus,Aditya Siddhant,Paul Cavallaro,Sriram Ganapathy,Maulik Shah,Rolf Jagerman,Jeff Stanway,Piermaria Mendolicchio,Li Xiao,Kayi Lee,Tara Thompson,Shubham Milind Phal,Jason Chase,Sun Jae Lee,Adrian N Reyes,Disha Shrivastava,Zhen Qin,Roykrong Sukkerd,Seth Odoom,Lior Madmoni,John Aslanides,Jonathan Herzig,Elena Pochernina,Sheng Zhang,Parker Barnes,Daisuke Ikeda,Qiujia Li,Shuo-yiin Chang,Shakir Mohamed,Jim Sproch,Richard Powell,Bidisha Samanta,Domagoj Ćevid,Anton Kovsharov,Shrestha Basu Mallick,Srinivas Tadepalli,Anne Zheng,Kareem Ayoub,Andreas Noever,Christian Reisswig,Zhuo Xu,Junhyuk Oh,Martin Matysiak,Tim Blyth,Shereen Ashraf,Julien Amelot,Boone Severson,Michele Bevilacqua,Motoki Sano,Ethan Dyer,Ofir Roval,Anu Sinha,Yin Zhong,Sagi Perel,Tea Sabolić,Johannes Mauerer,Willi Gierke,Mauro Verzetti,Rodrigo Cabrera,Alvin Abdagic,Steven Hemingray,Austin Stone,Jong Lee,Farooq Ahmad,Karthik Raman,Lior Shani,Jonathan Lai,Orhan Firat,Nathan Waters,Eric Ge,Mo Shomrat,Himanshu Gupta,Rajeev Aggarwal,Tom Hudson,Bill Jia,Simon Baumgartner,Palak Jain,Joe Kovac,Junehyuk Jung,Ante Žužul,Will Truong,Morteza Zadimoghaddam,Songyou Peng,Marco Liang,Rachel Sterneck,Balaji Lakshminarayanan,Machel Reid,Oliver Woodman,Tong Zhou,Jianling Wang,Vincent Coriou,Arjun Narayanan,Jay Hoover,Yenai Ma,Apoorv Jindal,Clayton Sanford,Doug Reid,Swaroop Ramaswamy,Alex Kurakin,Roland Zimmermann,Yana Lunts,Dragos Dena,Zalán Borsos,Vered Cohen,Shujian Zhang,Will Grathwohl,Robert Dadashi,Morgan Redshaw,Joshua Kessinger,Julian Odell,Silvano Bonacina,Zihang Dai,Grace Chen,Ayush Dubey,Pablo Sprechmann,Mantas Pajarskas,Wenxuan Zhou,Niharika Ahuja,Tara Thomas,Martin 
Nikoltchev,Matija Kecman,Bharath Mankalale,Andrey Ryabtsev,Jennifer She,Christian Walder,Jiaming Shen,Lu Li,Carolina Parada,Sheena Panthaplackel,Okwan Kwon,Matt Lawlor,Utsav Prabhu,Yannick Schroecker,Marc’aurelio Ranzato,Pete Blois,Iurii Kemaev,Ting Yu,Dmitry,Lepikhin,Hao Xiong,Sahand Sharifzadeh,Oleaser Johnson,Jeremiah Willcock,Rui Yao,Greg Farquhar,Sujoy Basu,Hidetoshi Shimokawa,Nina Anderson,Haiguang Li,Khiem Pham,Yizhong Liang,Sebastian Borgeaud,Alexandre Moufarek,Hideto Kazawa,Blair Kutzman,Marcin Sieniek,Sara Smoot,Ruth Wang,Natalie Axelsson,Nova Fallen,Prasha Sundaram,Yuexiang Zhai,Varun Godbole,Petros Maniatis,Alek Wang,Ilia Shumailov,Santhosh Thangaraj,Remi Crocker,Nikita Gupta,Gang Wu,Phil Chen,Gellért Weisz,Celine Smith,Mojtaba Seyedhosseini,Boya Fang,Xiyang Luo,Roey Yogev,Zeynep Cankara,Andrew Hard,Helen Ran,Rahul Sukthankar,George Necula,Gaël Liu,Honglong Cai,Praseem Banzal,Daniel Keysers,Sanjay Ghemawat,Connie Tao,Emma Dunleavy,Aditi Chaudhary,Wei Li,Maciej Mikuła,Chen-Yu Lee,Tiziana Refice,Krishna Somandepalli,Alexandre Fréchette,Dan Bahir,John Karro,Keith Rush,Sarah Perrin,Bill Rosgen,Xiaomeng Yang,Clara Huiyi Hu,Mahmoud Alnahlawi,Justin Mao-Jones,Roopal Garg,Hoang Nguyen,Bat-Orgil Batsaikhan,Iñaki Iturrate,Anselm Levskaya,Avi Singh,Ashyana Kachra,Tony Lu,Denis Petek,Zheng Xu,Mark Graham,Lukas Zilka,Yael Karov,Marija Kostelac,Fangyu Liu,Yaohui Guo,Weiyue Wang,Bernd Bohnet,Emily Pitler,Tony Bruguier,Keisuke Kinoshita,Chrysovalantis Anastasiou,Nilpa Jha,Ting Liu,Jerome Connor,Phil Wallis,Philip Pham,Eric Bailey,Shixin Li,Heng-Tze Cheng,Sally Ma,Haiqiong Li,Akanksha Maurya,Kate Olszewska,Manfred Warmuth,Christy Koh,Dominik Paulus,Siddhartha Reddy Jonnalagadda,Enrique Piqueras,Ali Elqursh,Geoff Brown,Hadar Shemtov,Loren Maggiore,Fei Xia,Ryan Foley,Beka Westberg,George van den Driessche,Livio Baldini Soares,Arjun Kar,Michael Quinn,Siqi Zuo,Jialin Wu,Kyle Kastner,Anna Bortsova,Aijun Bai,Ales Mikhalap,Luowei Zhou,Jennifer Brennan,Vinay Ramasesh,Honglei 
Zhuang,John Maggs,Johan Schalkwyk,Yuntao Xu,Hui Huang,Andrew Howard,Sasha Brown,Linting Xue,Gloria Shen,Brian Albert,Neha Jha,Daniel Zheng,Varvara Krayvanova,Spurthi Amba Hombaiah,Olivier Lacombe,Gautam Vasudevan,Dan Graur,Tian Xie,Meet Gandhi,Bangju Wang,Dustin Zelle,Harman Singh,Dahun Kim,Sébastien Cevey,Victor Ungureanu,Natasha Noy,Fei Liu,Annie Xie,Fangxiaoyu Feng,Katerina Tsihlas,Daniel Formoso,Neera Vats,Quentin Wellens,Yinan Wang,Niket Kumar Bhumihar,Samrat Ghosh,Matt Hoffman,Tom Lieber,Oran Lang,Kush Bhatia,Tom Paine,Aroonalok Pyne,Ronny Votel,Madeleine Clare Elish,Benoit Schillings,Alex Panagopoulos,Haichuan Yang,Adam Raveret,Zohar Yahav,Shuang Liu,Warren Chen,Dalia El Badawy,Nishant Agrawal,Mohammed Badawi,Mahdi Mirzazadeh,Carla Bromberg,Fan Ye,Chang Liu,Tatiana Sholokhova,George-Cristian Muraru,Gargi Balasubramaniam,Jonathan Malmaud,Alen Carin,Danilo Martins,Irina Jurenka,Pankil Botadra,Dave Lacey,Richa Singh,Mariano Schain,Dan Zheng,Isabelle Guyon,Victor Lavrenko,Seungji Lee,Xiang Zhou,Demis Hassabis,Jeshwanth Challagundla,Derek Cheng,Nikhil Mehta,Matthew Mauger,Michela Paganini,Pushkar Mishra,Kate Lee,Zhang Li,Lexi Baugher,Ondrej Skopek,Max Chang,Amir Zait,Gaurav Menghani,Lizzetth Bellot,Guangxing Han,Jean-Michel Sarr,Sharat Chikkerur,Himanshu Sahni,Rohan Anil,Arun Narayanan,Chandu Thekkath,Daniele Pighin,Hana Strejček,Marko Velic,Fred Bertsch,Manuel Tragut,Keran Rong,Alicia Parrish,Kai Bailey,Jiho Park,Isabela Albuquerque,Abhishek Bapna,Rajesh Venkataraman,Alec Kosik,Johannes Griesser,Zhiwei Deng,Alek Andreev,Qingyun Dou,Kevin Hui,Fanny Wei,Xiaobin Yu,Lei Shu,Avia Aharon,David Barker,Badih Ghazi,Sebastian Flennerhag,Chris Breaux,Yuchuan Liu,Matthew Bilotti,Josh Woodward,Uri Alon,Stephanie Winkler,Tzu-Kuo Huang,Kostas Andriopoulos,João Gabriel Oliveira,Penporn Koanantakool,Berkin Akin,Michael Wunder,Cicero Nogueira dos Santos,Mohammad Hossein Bateni,Lin Yang,Dan Horgan,Beer Changpinyo,Keyvan Amiri,Min Ma,Dayeong Lee,Lihao Liang,Anirudh Baddepudi,Tejasi 
Latkar,Raia Hadsell,Jun Xu,Hairong Mu,Michael Han,Aedan Pope,Snchit Grover,Frank Kim,Ankit Bhagatwala,Guan Sun,Yamini Bansal,Amir Globerson,Alireza Nazari,Samira Daruki,Hagen Soltau,Jane Labanowski,Laurent El Shafey,Matt Harvey,Yanif Ahmad,Elan Rosenfeld,William Kong,Etienne Pot,Yi-Xuan Tan,Aurora Wei,Victoria Langston,Marcel Prasetya,Petar Veličković,Richard Killam,Robin Strudel,Darren Ni,Zhenhai Zhu,Aaron Archer,Kavya Kopparapu,Lynn Nguyen,Emilio Parisotto,Hussain Masoom,Sravanti Addepalli,Jordan Grimstad,Hexiang Hu,Joss Moore,Avinatan Hassidim,Le Hou,Mukund Raghavachari,Jared Lichtarge,Adam R. Brown,Hilal Dib,Natalia Ponomareva,Justin Fu,Yujing Zhang,Altaf Rahman,Joana Iljazi,Edouard Leurent,Gabriel Dulac-Arnold,Cosmo Du,Chulayuth Asawaroengchai,Larry Jin,Ela Gruzewska,Ziwei Ji,Benigno Uria,Daniel De Freitas,Paul Barham,Lauren Beltrone,Víctor Campos,Jun Yan,Neel Kovelamudi,Arthur Nguyen,Elinor Davies,Zhichun Wu,Zoltan Egyed,Kristina Toutanova,Nithya Attaluri,Hongliang Fei,Peter Stys,Siddhartha Brahma,Martin Izzard,Siva Velusamy,Scott Lundberg,Vincent Zhuang,Kevin Sequeira,Adam Santoro,Ehsan Amid,Ophir Aharoni,Shuai Ye,Mukund Sundararajan,Lijun Yu,Yu-Cheng Ling,Stephen Spencer,Hugo Song,Josip Djolonga,Christo Kirov,Sonal Gupta,Alessandro Bissacco,Clemens Meyer,Mukul Bhutani,Andrew Dai,Weiyi Wang,Siqi Liu,Ashwin Sreevatsa,Qijun Tan,Maria Wang,Lucy Kim,Yicheng Wang,Alex Irpan,Yang Xiao,Stanislav Fort,Yifan He,Alex Gurney,Bryan Gale,Yue Ma,Monica Roy,Viorica Patraucean,Taylan Bilal,Golnaz Ghiasi,Anahita Hosseini,Melvin Johnson,Zhuowan Li,Yi Tay,Benjamin Beyret,Katie Millican,Josef Broder,Mayank Lunayach,Danny Swisher,Eugen Vušak,David Parkinson,MH Tessler,Adi Mayrav Gilady,Richard Song,Allan Dafoe,Yves Raimond,Masa Yamaguchi,Itay Karo,Elizabeth Nielsen,Kevin Kilgour,Mike Dusenberry,Rajiv Mathews,Jiho Choi,Siyuan Qiao,Harsh Mehta,Sahitya Potluri,Chris Knutsen,Jialu Liu,Tat Tan,Kuntal Sengupta,Keerthana Gopalakrishnan,Abodunrinwa Toki,Mencher Chiang,Mike Burrows,Grace 
Vesom,Zafarali Ahmed,Ilia Labzovsky,Siddharth Vashishtha,Preeti Singh,Ankur Sharma,Ada Ma,Jinyu Xie,Pranav Talluri,Hannah Forbes-Pollard,Aarush Selvan,Joel Wee,Loic Matthey,Tom Funkhouser,Parthasarathy Gopavarapu,Lev Proleev,Cheng Li,Matt Thomas,Kashyap Kolipaka,Zhipeng Jia,Ashwin Kakarla,Srinivas Sunkara,Joan Puigcerver,Suraj Satishkumar Sheth,Emily Graves,Chen Wang,Sadh MNM Khan,Kai Kang,Shyamal Buch,Fred Zhang,Omkar Savant,David Soergel,Kevin Lee,Linda Friso,Xuanyi Dong,Rahul Arya,Shreyas Chandrakaladharan,Connor Schenck,Greg Billock,Tejas Iyer,Anton Bakalov,Leslie Baker,Alex Ruiz,Angad Chandorkar,Trieu Trinh,Matt Miecnikowski,Yanqi Zhou,Yangsibo Huang,Jiazhong Nie,Ali Shah,Ashish Thapliyal,Sam Haves,Lun Wang,Uri Shaham,Patrick Morris-Suzuki,Soroush Radpour,Leonard Berrada,Thomas Strohmann,Chaochao Yan,Jingwei Shen,Sonam Goenka,Tris Warkentin,Petar Dević,Dan Belov,Albert Webson,Madhavi Yenugula,Puranjay Datta,Jerry Chang,Nimesh Ghelani,Aviral Kumar,Vincent Perot,Jessica Lo,Yang Song,Herman Schmit,Jianmin Chen,Vasilisa Bashlovkina,Xiaoyue Pan,Diana Mincu,Paul Roit,Isabel Edkins,Andy Davis,Yujia Li,Ben Horn,Xinjian Li,Pradeep Kumar S,Eric Doi,Wanzheng Zhu,Sri Gayatri Sundara Padmanabhan,Siddharth Verma,Jasmine Liu,Heng Chen,Mihajlo Velimirović,Malcolm Reynolds,Priyanka Agrawal,Nick Sukhanov,Abhinit Modi,Siddharth Goyal,John Palowitch,Nima Khajehnouri,Wing Lowe,David Klinghoffer,Sharon Silver,Vinh Tran,Candice Schumann,Francesco Piccinno,Xi Liu,Mario Lučić,Xiaochen Yang,Sandeep Kumar,Ajay Kannan,Ragha Kotikalapudi,Mudit Bansal,Fabian Fuchs,Javad Hosseini,Abdelrahman Abdelhamed,Dawn Bloxwich,Tianhe Yu,Ruoxin Sang,Gregory Thornton,Karan Gill,Yuchi Liu,Virat Shejwalkar,Jason Lin,Zhipeng Yan,Kehang Han,Thomas Buschmann,Michael Pliskin,Zhi Xing,Susheel Tatineni,Junlin Zhang,Sissie Hsiao,Gavin Buttimore,Marcus Wu,Zefei Li,Geza Kovacs,Legg Yeung,Tao Huang,Aaron Cohen,Bethanie Brownfield,Averi Nowak,Mikel Rodriguez,Tianze Shi,Hado van Hasselt,Kevin Cen,Deepanway 
Ghoshal,Kushal Majmundar,Weiren Yu,Warren,Chen,Danila Sinopalnikov,Hao Zhang,Vlado Galić,Di Lu,Zeyu Zheng,Maggie Song,Gary Wang,Gui Citovsky,Swapnil Gawde,Isaac Galatzer-Levy,David Silver,Ivana Balazevic,Dipanjan Das,Kingshuk Majumder,Yale Cong,Praneet Dutta,Dustin Tran,Hui Wan,Junwei Yuan,Daniel Eppens,Alanna Walton,Been Kim,Harry Ragan,James Cobon-Kerr,Lu Liu,Weijun Wang,Bryce Petrini,Jack Rae,Rakesh Shivanna,Yan Xiong,Chace Lee,Pauline Coquinot,Yiming Gu,Lisa Patel,Blake Hechtman,Aviel Boag,Orion Jankowski,Alex Wertheim,Alex Lee,Paul Covington,Hila Noga,Sam Sobell,Shanthal Vasanth,William Bono,Chirag Nagpal,Wei Fan,Xavier Garcia,Kedar Soparkar,Aybuke Turker,Nathan Howard,Sachit Menon,Yuankai Chen,Vikas Verma,Vladimir Pchelin,Harish Rajamani,Valentin Dalibard,Ana Ramalho,Yang Guo,Kartikeya Badola,Seojin Bang,Nathalie Rauschmayr,Julia Proskurnia,Sudeep Dasari,Xinyun Chen,Mikhail Sushkov,Anja Hauth,Pauline Sho,Abhinav Singh,Bilva Chandra,Allie Culp,Max Dylla,Olivier Bachem,James Besley,Heri Zhao,Timothy Lillicrap,Wei Wei,Wael Al Jishi,Ning Niu,Alban Rrustemi,Raphaël Lopez Kaufman,Ryan Poplin,Jewel Zhao,Minh Truong,Shikhar Bharadwaj,Ester Hlavnova,Eli Stickgold,Cordelia Schmid,Georgi Stephanov,Zhaoqi Leng,Frederick Liu,Léonard Hussenot,Shenil Dodhia,Juliana Vicente Franco,Lesley Katzen,Abhanshu Sharma,Sarah Cogan,Zuguang Yang,Aniket Ray,Sergi Caelles,Shen Yan,Ravin Kumar,Daniel Gillick,Renee Wong,Joshua Ainslie,Jonathan Hoech,Séb Arnold,Dan Abolafia,Anca Dragan,Ben Hora,Grace Hu,Alexey Guseynov,Yang Lu,Chas Leichner,Jinmeng Rao,Abhimanyu Goyal,Nagabhushan Baddi,Daniel Hernandez Diaz,Tim McConnell,Max Bain,Jake Abernethy,Qiqi Yan,Rylan Schaeffer,Paul Vicol,Will Thompson,Montse Gonzalez Arenas,Mathias Bellaiche,Pablo Barrio,Stefan Zinke,Riccardo Patana,Pulkit Mehta,JK Kearns,Avraham Ruderman,Scott Pollom,David D’Ambrosio,Cath Hope,Yang Yu,Andrea Gesmundo,Kuang-Huei Lee,Aviv Rosenberg,Yiqian Zhou,Yaoyiran Li,Drew Garmon,Yonghui Wu,Safeen Huda,Gil Fidel,Martin 
Baeuml,Jian Li,Phoebe Kirk,Rhys May,Tao Tu,Sara Mc Carthy,Toshiyuki Fukuzawa,Miranda Aperghis,Chih-Kuan Yeh,Toshihiro Yoshino,Bo Li,Austin Myers,Kaisheng Yao,Ben Limonchik,Changwan Ryu,Rohun Saxena,Alex Goldin,Ruizhe Zhao,Rocky Rhodes,Tao Zhu,Divya Tyam,Heidi Howard,Nathan Byrd,Hongxu Ma,Yan Wu,Ryan Mullins,Qingze Wang,Aida Amini,Sebastien Baur,Yiran Mao,Subhashini Venugopalan,Will Song,Wen Ding,Paul Collins,Sashank Reddi,Megan Shum,Andrei Rusu,Luisa Zintgraf,Kelvin Chan,Sheela Goenka,Mathieu Blondel,Michael Collins,Renke Pan,Marissa Giustina,Nikolai Chinaev,Christian Schuler,Ce Zheng,Jonas Valfridsson,Alyssa Loo,Alex Yakubovich,Jamie Smith,Tao Jiang,Rich Munoz,Gabriel Barcik,Rishabh Bansal,Mingyao Yang,Yilun Du,Pablo Duque,Mary Phuong,Alexandra Belias,Kunal Lad,Zeyu Liu,Tal Schuster,Karthik Duddu,Jieru Hu,Paige Kunkle,Matthew Watson,Jackson Tolins,Josh Smith,Denis Teplyashin,Garrett Bingham,Marvin Ritter,Marco Andreetto,Divya Pitta,Mohak Patel,Shashank Viswanadha,Trevor Strohman,Catalin Ionescu,Jincheng Luo,Yogesh Kalley,Jeremy Wiesner,Dan Deutsch,Derek Lockhart,Peter Choy,Rumen Dangovski,Chawin Sitawarin,Cat Graves,Tanya Lando,Joost van Amersfoort,Ndidi Elue,Zhouyuan Huo,Pooya Moradi,Jean Tarbouriech,Henryk Michalewski,Wenting Ye,Eunyoung Kim,Alex Druinsky,Florent Altché,Xinyi Chen,Artur Dwornik,Da-Cheng Juan,Rivka Moroshko,Horia Toma,Jarrod Kahn,Hai Qian,Maximilian Sieb,Irene Cai,Roman Goldenberg,Praneeth Netrapalli,Sindhu Raghuram,Yuan Gong,Lijie Fan,Evan Palmer,Yossi Matias,Valentin Gabeur,Shreya Pathak,Tom Ouyang,Don Metzler,Geoff Bacon,Srinivasan Venkatachary,Sridhar Thiagarajan,Alex Cullum,Eran Ofek,Vytenis Sakenas,Mohamed Hammad,Cesar Magalhaes,Mayank Daswani,Oscar Chang,Ashok Popat,Ruichao Li,Komal Jalan,Yanhan Hou,Josh Lipschultz,Antoine He,Wenhao Jia,Pier Giuseppe Sessa,Prateek Kolhar,William Wong,Sumeet Singh,Lukas Haas,Jay Whang,Hanna Klimczak-Plucińska,Georges Rotival,Grace Chung,Yiqing Hua,Anfal Siddiqui,Nicolas Serrano,Dongkai Chen,Billy 
Porter,Libin Bai,Keshav Shivam,Sho Arora,Partha Talukdar,Tom Cobley,Sangnie Bhardwaj,Evgeny Gladchenko,Simon Green,Kelvin Guu,Felix Fischer,Xiao Wu,Eric Wang,Achintya Singhal,Tatiana Matejovicova,James Martens,Hongji Li,Roma Patel,Elizabeth Kemp,Jiaqi Pan,Lily Wang,Blake JianHang Chen,Jean-Baptiste Alayrac,Navneet Potti,Erika Gemzer,Eugene Ie,Kay McKinney,Takaaki Saeki,Edward Chou,Pascal Lamblin,SQ Mah,Zach Fisher,Martin Chadwick,Jon Stritar,Obaid Sarvana,Andrew Hogue,Artem Shtefan,Hadi Hashemi,Yang Xu,Jindong Gu,Sharad Vikram,Chung-Ching Chang,Sabela Ramos,Logan Kilpatrick,Weijuan Xi,Jenny Brennan,Yinghao Sun,Abhishek Jindal,Ionel Gog,Dawn Chen,Felix Wu,Jason Lee,Sudhindra Kopalle,Srinadh Bhojanapalli,Oriol Vinyals,Natan Potikha,Burcu Karagol Ayan,Yuan Yuan,Michael Riley,Piotr Stanczyk,Sergey Kishchenko,Bing Wang,Dan Garrette,Antoine Yang,Vlad Feinberg,CJ Carey,Javad Azizi,Viral Shah,Erica Moreira,Chongyang Shi,Josh Feldman,Elizabeth Salesky,Thomas Lampe,Aneesh Pappu,Duhyeon Kim,Jonas Adler,Avi Caciularu,Brian Walker,Yunhan Xu,Yochai Blau,Dylan Scandinaro,Terry Huang,Sam El-Husseini,Abhishek Sinha,Lijie Ren,Taylor Tobin,Patrik Sundberg,Tim Sohn,Vikas Yadav,Mimi Ly,Emily Xue,Jing Xiong,Afzal Shama Soudagar,Sneha Mondal,Nikhil Khadke,Qingchun Ren,Ben Vargas,Stan Bileschi,Sarah Chakera,Cindy Wang,Boyu Wang,Yoni Halpern,Joe Jiang,Vikas Sindhwani,Petre Petrov,Pranavaraj Ponnuramu,Sanket Vaibhav Mehta,Yu Watanabe,Betty Chan,Matheus Wisniewski,Trang Pham,Jingwei Zhang,Conglong Li,Dario de Cesare,Art Khurshudov,Alex Vasiloff,Melissa Tan,Zoe Ashwood,Bobak Shahriari,Maryam Majzoubi,Garrett Tanzer,Olga Kozlova,Robin Alazard,James Lee-Thorp,Nguyet Minh Phu,Isaac Tian,Junwhan Ahn,Andy Crawford,Lauren Lax,Yuan,Shangguan,Iftekhar Naim,David Ross,Oleksandr Ferludin,Tongfei Guo,Andrea Banino,Hubert Soyer,Xiaoen Ju,Dominika Rogozińska,Ishaan Malhi,Marcella Valentine,Daniel Balle,Apoorv Kulshreshtha,Maciej Kula,Yiwen Song,Sophia Austin,John Schultz,Roy Hirsch,Arthur Douillard,Apoorv 
Reddy,Michael Fink,Summer Yue,Khyatti Gupta,Adam Zhang,Norman Rink,Daniel McDuff,Lei Meng,András György,Yasaman Razeghi,Ricky Liang,Kazuki Osawa,Aviel Atias,Matan Eyal,Tyrone Hill,Nikolai Grigorev,Zhengdong Wang,Nitish Kulkarni,Rachel Soh,Ivan Lobov,Zachary Charles,Sid Lall,Kazuma Hashimoto,Ido Kessler,Victor Gomes,Zelda Mariet,Danny Driess,Alessandro Agostini,Canfer Akbulut,Jingcao Hu,Marissa Ikonomidis,Emily Caveness,Kartik Audhkhasi,Saurabh Agrawal,Ioana Bica,Evan Senter,Jayaram Mudigonda,Kelly Chen,Jingchen Ye,Xuanhui Wang,James Svensson,Philipp Fränken,Josh Newlan,Li Lao,Eva Schnider,Sami Alabed,Joseph Kready,Jesse Emond,Afief Halumi,Tim Zaman,Chengxi Ye,Naina Raisinghani,Vilobh Meshram,Bo Chang,Ankit Singh Rawat,Axel Stjerngren,Sergey Levi,Rui Wang,Xiangzhu Long,Mitchelle Rasquinha,Steven Hand,Aditi Mavalankar,Lauren Agubuzu,Sudeshna Roy,Junquan Chen,Jarek Wilkiewicz,Hao Zhou,Michal Jastrzebski,Qiong Hu,Agustin Dal Lago,Ramya Sree Boppana,Wei-Jen Ko,Jennifer Prendki,Yao Su,Zhi Li,Eliza Rutherford,Girish Ramchandra Rao,Ramona Comanescu,Adrià Puigdomènech,Qihang Chen,Dessie Petrova,Christine Chan,Vedrana Milutinovic,Felipe Tiengo Ferreira,Chin-Yi Cheng,Ming Zhang,Tapomay Dey,Sherry Yang,Ramesh Sampath,Quoc Le,Howard Zhou,Chu-Cheng Lin,Hoi Lam,Christine Kaeser-Chen,Kai Hui,Dean Hirsch,Tom Eccles,Basil Mustafa,Shruti Rijhwani,Morgane Rivière,Yuanzhong Xu,Junjie Wang,Xinyang Geng,Xiance Si,Arjun Khare,Cheolmin Kim,Vahab Mirrokni,Kamyu Lee,Khuslen Baatarsukh,Nathaniel Braun,Lisa Wang,Pallavi LV,Richard Tanburn,Yuvein,Zhu,Fangda Li,Setareh Ariafar,Dan Goldberg,Ken Burke,Daniil Mirylenka,Meiqi Guo,Olaf Ronneberger,Hadas Natalie Vogel,Liqun Cheng,Nishita Shetty,Johnson Jia,Thomas Jimma,Corey Fry,Ted Xiao,Martin Sundermeyer,Ryan Burnell,Yannis Assael,Mario Pinto,JD Chen,Rohit Sathyanarayana,Donghyun Cho,Jing Lu,Rishabh Agarwal,Sugato Basu,Lucas Gonzalez,Dhruv Shah,Meng Wei,Dre Mahaarachchi,Rohan Agrawal,Tero Rissa,Yani Donchev,Ramiro Leal-Cavazos,Adrian Hutter,Markus 
Mircea,Alon Jacovi,Faruk Ahmed,Jiageng Zhang,Shuguang Hu,Bo-Juen Chen,Jonni Kanerva,Guillaume Desjardins,Andrew Lee,Nikos Parotsidis,Asier Mujika,Tobias Weyand,Jasper Snoek,Jo Chick,Kai Chen,Paul Chang,Ethan Mahintorabi,Zi Wang,Tolly Powell,Orgad Keller,Abhirut Gupta,Claire Sha,Kanav Garg,Nicolas Heess,Ágoston Weisz,Cassidy Hardin,Bartek Wydrowski,Ben Coleman,Karina Zainullina,Pankaj Joshi,Alessandro Epasto,Terry Spitz,Binbin Xiong,Kai Zhao,Arseniy Klimovskiy,Ivy Zheng,Johan Ferret,Itay Yona,Waleed Khawaja,Jean-Baptiste Lespiau,Maxim Krikun,Siamak Shakeri,Timothee Cour,Bonnie Li,Igor Krivokon,Dan Suh,Alex Hofer,Jad Al Abdallah,Nikita Putikhin,Oscar Akerlund,Silvio Lattanzi,Anurag Kumar,Shane Settle,Himanshu Srivastava,Folawiyo Campbell-Ajala,Edouard Rosseel,Mihai Dorin Istin,Nishanth Dikkala,Anand Rao,Nick Young,Kate Lin,Dhruva Bhaswar,Yiming Wang,Jaume Sanchez Elias,Kritika Muralidharan,James Keeling,Dayou Du,Siddharth Gopal,Gregory Dibb,Charles Blundell,Manolis Delakis,Jacky Liang,Marco Tulio Ribeiro,Georgi Karadzhov,Guillermo Garrido,Ankur Bapna,Jiawei Cao,Adam Sadovsky,Pouya Tafti,Arthur Guez,Coline Devin,Yixian Di,Jinwei Xing,Chuqiao,Xu,Hanzhao Lin,Chun-Te Chu,Sameera Ponda,Wesley Helmholz,Fan Yang,Yue Gao,Sara Javanmardi,Wael Farhan,Alex Ramirez,Ricardo Figueira,Khe Chai Sim,Yuval Bahat,Ashwin Vaswani,Liangzhe Yuan,Gufeng Zhang,Leland Rechis,Hanjun Dai,Tayo Oguntebi,Alexandra Cordell,Eugénie Rives,Kaan Tekelioglu,Naveen Kumar,Bing Zhang,Aurick Zhou,Nikolay Savinov,Andrew Leach,Alex Tudor,Sanjay Ganapathy,Yanyan Zheng,Mirko Rossini,Vera Axelrod,Arnaud Autef,Yukun Zhu,Zheng Zheng,Mingda Zhang,Baochen Sun,Jie Ren,Nenad Tomasev,Nithish Kannan,Amer Sinha,Charles Chen,Louis O’Bryan,Alex Pak,Aditya Kusupati,Weel Yang,Deepak Ramachandran,Patrick Griffin,Seokhwan Kim,Philipp Neubeck,Craig Schiff,Tammo Spalink,Mingyang Ling,Arun Nair,Ga-Young Joung,Linda Deng,Avishkar Bhoopchand,Lora Aroyo,Tom Duerig,Jordan Griffith,Gabe Barth-Maron,Jake Ades,Alex Haig,Ankur 
Taly,Yunting Song,Paul Michel,Dave Orr,Dean Weesner,Corentin Tallec,Carrie Grimes Bostock,Paul Niemczyk,Andy Twigg,Mudit Verma,Rohith Vallu,Henry Wang,Marco Gelmi,Kiranbir Sodhia,Aleksandr Chuklin,Omer Goldman,Jasmine George,Liang Bai,Kelvin Zhang,Petar Sirkovic,Efrat Nehoran,Golan Pundak,Jiaqi Mu,Alice Chen,Alex Greve,Paulo Zacchello,David Amos,Heming Ge,Eric Noland,Colton Bishop,Jeffrey Dudek,Youhei Namiki,Elena Buchatskaya,Jing Li,Dorsa Sadigh,Masha Samsikova,Dan Malkin,Damien Vincent,Robert David,Rob Willoughby,Phoenix Meadowlark,Shawn Gao,Yan Li,Raj Apte,Amit Jhindal,Stein Xudong Lin,Alex Polozov,Zhicheng Wang,Tomas Mery,Anirudh GP,Varun Yerram,Sage Stevens,Tianqi Liu,Noah Fiedel,Charles Sutton,Matthew Johnson,Xiaodan Song,Kate Baumli,Nir Shabat,Muqthar Mohammad,Hao Liu,Marco Selvi,Yichao Zhou,Mehdi Hafezi Manshadi,Chu-ling Ko,Anthony Chen,Michael Bendersky,Jorge Gonzalez Mendez,Nisarg Kothari,Amir Zandieh,Yiling Huang,Daniel Andor,Ellie Pavlick,Idan Brusilovsky,Jitendra Harlalka,Sally Goldman,Andrew Lampinen,Guowang Li,Asahi Ushio,Somit Gupta,Lei Zhang,Chuyuan Kelly Fu,Madhavi Sewak,Timo Denk,Jed Borovik,Brendan Jou,Avital Zipori,Prateek Jain,Junwen Bai,Thang Luong,Jonathan Tompson,Alice Li,Li Liu,George Powell,Jiajun Shen,Alex Feng,Grishma Chole,Da Yu,Yinlam Chow,Tongxin Yin,Eric Malmi,Kefan Xiao,Yash Pande,Shachi Paul,Niccolò Dal Santo,Adil Dostmohamed,Sergio Guadarrama,Aaron Phillips,Thanumalayan Sankaranarayana Pillai,Gal Yona,Amin Ghafouri,Preethi Lahoti,Benjamin Lee,Dhruv Madeka,Eren Sezener,Simon Tokumine,Adrian Collister,Nicola De Cao,Richard Shin,Uday Kalra,Parker Beak,Emily Nottage,Ryo Nakashima,Ivan Jurin,Vikash Sehwag,Meenu Gaba,Junhao Zeng,Kevin R. 
McKee,Fernando Pereira,Tamar Yakar,Amayika Panda,Arka Dhar,Peilin Zhong,Daniel Sohn,Mark Brand,Lars Lowe Sjoesund,Viral Carpenter,Sharon Lin,Shantanu Thakoor,Marcus Wainwright,Ashwin Chaugule,Pranesh Srinivasan,Muye Zhu,Bernett Orlando,Jack Weber,Ayzaan Wahid,Gilles Baechler,Apurv Suman,Jovana Mitrović,Gabe Taubman,Honglin Yu,Helen King,Josh Dillon,Cathy Yip,Dhriti Varma,Tomas Izo,Levent Bolelli,Borja De Balle Pigem,Julia Di Trapani,Fotis Iliopoulos,Adam Paszke,Nishant Ranka,Joe Zou,Francesco Pongetti,Jed McGiffin,Alex Siegman,Rich Galt,Ross Hemsley,Goran Žužić,Victor Carbune,Tao Li,Myle Ott,Félix de Chaumont Quitry,David Vilar Torres,Yuri Chervonyi,Tomy Tsai,Prem Eruvbetine,Samuel Yang,Matthew Denton,Jake Walker,Slavica Andačić,Idan Heimlich Shtacher,Vittal Premachandran,Harshal Tushar Lehri,Cip Baetu,Damion Yates,Lampros Lamprou,Mariko Iinuma,Ioana Mihailescu,Ben Albrecht,Shachi Dave,Susie Sargsyan,Bryan Perozzi,Lucas Manning,Chiyuan Zhang,Denis Vnukov,Igor Mordatch,Raia Hadsell Wolfgang Macherey,Ryan Kappedal,Jim Stephan,Aditya Tripathi,Klaus Macherey,Jun Qian,Abhishek Bhowmick,Shekoofeh Azizi,Rémi Leblond,Shiva Mohan Reddy Garlapati,Timothy Knight,Matthew Wiethoff,Wei-Chih Hung,Anelia Angelova,Georgios Evangelopoulos,Pawel Janus,Dimitris Paparas,Matthew Rahtz,Ken Caluwaerts,Vivek Sampathkumar,Daniel Jarrett,Shadi Noghabi,Antoine Miech,Chak Yeung,Geoff Clark,Henry Prior,Fei Zheng,Jean Pouget-Abadie,Indro Bhattacharya,Kalpesh Krishna,Will Bishop,Zhe Yuan,Yunxiao Deng,Ashutosh Sathe,Kacper Krasowiak,Ciprian Chelba,Cho-Jui Hsieh,Kiran Vodrahalli,Buhuang Liu,Thomas Köppe,Amr Khalifa,Lubo Litchev,Pichi Charoenpanit,Reed Roberts,Sachin Yadav,Yasumasa Onoe,Desi Ivanov,Megha Mohabey,Vighnesh Birodkar,Nemanja Rakićević,Pierre Sermanet,Vaibhav Mehta,Krishan Subudhi,Travis Choma,Will Ng,Luheng He,Kathie Wang,Tasos Kementsietsidis,Shane Gu,Mansi Gupta,Andrew Nystrom,Mehran Kazemi,Timothy Chung,Nacho Cano,Nikhil Dhawan,Yufei Wang,Jiawei Xia,Trevor Yacovone,Eric Jia,Mingqing 
Chen,Simeon Ivanov,Ashrith Sheshan,Sid Dalmia,Paweł Stradomski,Pengcheng Yin,Salem Haykal,Congchao Wang,Dennis Duan,Neslihan Bulut,Greg Kochanski,Liam MacDermed,Namrata Godbole,Shitao Weng,Jingjing Chen,Rachana Fellinger,Ramin Mehran,Daniel Suo,Hisham Husain,Tong He,Kaushal Patel,Joshua Howland,Randall Parker,Kelvin Nguyen,Sharath Maddineni,Chris Rawles,Mina Khan,Shlomi Cohen-Ganor,Amol Mandhane,Xinyi Wu,Chenkai Kuang,Iulia Comşa,Ramya Ganeshan,Hanie Sedghi,Adam Bloniarz,Nuo Wang Pierse,Anton Briukhov,Petr Mitrichev,Anita Gergely,Serena Zhan,Allan Zhou,Nikita Saxena,Eva Lu,Josef Dean,Ashish Gupta,Nicolas Perez-Nieves,Renjie Wu,Cory McLean,Wei Liang,Disha Jindal,Anton Tsitsulin,Wenhao Yu,Kaiz Alarakyia,Tom Schaul,Piyush Patil,Peter Sung,Elijah Peake,Hongkun Yu,Feryal Behbahani,JD Co-Reyes,Alan Ansell,Sean Sun,Clara Barbu,Jonathan Lee,Seb Noury,James Allingham,Bilal Piot,Mohit Sharma,Christopher Yew,Ivan Korotkov,Bibo Xu,Demetra Brady,Goran Petrovic,Shibl Mourad,Claire Cui,Aditya Gupta,Parker Schuh,Saarthak Khanna,Anna Goldie,Abhinav Arora,Vadim Zubov,Amy Stuart,Mark Epstein,Yun Zhu,Jianqiao Liu,Yury Stuken,Ziyue Wang,Karolis Misiunas,Dee Guo,Ashleah Gill,Ale Hartman,Zaid Nabulsi,Aurko Roy,Aleksandra Faust,Jason Riesa,Ben Withbroe,Mengchao Wang,Marco Tagliasacchi,Andreea Marzoca,James Noraky,Serge Toropov,Malika Mehrotra,Bahram Raad,Sanja Deur,Steve Xu,Marianne Monteiro,Zhongru Wu,Yi Luan,Sam Ritter,Nick Li,Håvard Garnes,Yanzhang He,Martin Zlocha,Jifan Zhu,Matteo Hessel,Will Wu,Spandana Raj Babbula,Chizu Kawamoto,Yuanzhen Li,Mehadi Hassen,Yan Wang,Brian Wieder,James Freedman,Yin Zhang,Xinyi Bai,Tianli Yu,David Reitter,XiangHai Sheng,Mateo Wirth,Aditya Kini,Dima Damen,Mingcen Gao,Rachel Hornung,Michael Voznesensky,Brian Roark,Adhi Kuncoro,Yuxiang Zhou,Rushin Shah,Anthony Brohan,Kuangyuan Chen,James Wendt,David Rim,Paul Kishan Rubenstein,Jonathan Halcrow,Michelle Liu,Ty Geri,Yunhsuan Sung,Jane Shapiro,Shaan Bijwadia,Chris Duvarney,Christina Sorokin,Paul Natsev,Reeve 
Ingle,Pramod Gupta,Young Maeng,Ndaba Ndebele,Kexin Zhu,Valentin Anklin,Katherine Lee,Yuan Liu,Yaroslav Akulov,Shaleen Gupta,Guolong Su,Flavien Prost,Tianlin Liu,Vitaly Kovalev,Pol Moreno,Martin Scholz,Sam Redmond,Zongwei Zhou,Alex Castro-Ros,André Susano Pinto,Dia Kharrat,Michal Yarom,Rachel Saputro,Jannis Bulian,Ben Caine,Ji Liu,Abbas Abdolmaleki,Shariq Iqbal,Tautvydas Misiunas,Mikhail Sirotenko,Shefali Garg,Guy Bensky,Huan Gui,Xuezhi Wang,Raphael Koster,Mike Bernico,Da Huang,Romal Thoppilan,Trevor Cohn,Ben Golan,Wenlei Zhou,Andrew Rosenberg,Markus Freitag,Tynan Gangwani,Vincent Tsang,Anand Shukla,Xiaoqi Ren,Minh Giang,Chi Zou,Andre Elisseeff,Charline Le Lan,Dheeru Dua,Shuba Lall,Pranav Shyam,Frankie Garcia,Sarah Nguyen,Michael Guzman,AJ Maschinot,Marcello Maggioni,Ming-Wei Chang,Karol Gregor,Lotte Weerts,Kumaran Venkatesan,Bogdan Damoc,Leon Liu,Jan Wassenberg,Lewis Ho,Becca Roelofs,Majid Hadian,François-Xavier Aubet,Yu Liang,Sami Lachgar,Danny Karmon,Yong Cheng,Amelio Vázquez-Reina,Angie Chen,Zhuyun Dai,Andy Brock,Shubham Agrawal,Chenxi Pang,Peter Garst,Mariella Sanchez-Vargas,Ivor Rendulic,Aditya Ayyar,Andrija Ražnatović,Olivia Ma,Roopali Vij,Neha Sharma,Ashwin Balakrishna,Bingyuan Liu,Ian Mackinnon,Sorin Baltateanu,Petra Poklukar,Gabriel Ibagon,Colin Ji,Hongyang Jiao,Isaac Noble,Wojciech Stokowiec,Zhihao Li,Jeff Dean,David Lindner,Mark Omernick,Kristen Chiafullo,Mason Dimarco,Vitor Rodrigues,Vittorio Selo,Garrett Honke,Xintian,Wu,Wei He,Adam Hillier,Anhad Mohananey,Vihari Piratla,Chang Ye,Chase Malik,Sebastian Riedel,Samuel Albanie,Zi Yang,Kenny Vassigh,Maria Bauza,Sheng Li,Yiqing Tao,Nevan Wichers,Andrii Maksai,Abe Ittycheriah,Ross Mcilroy,Bryan Seybold,Noah Goodman,Romina Datta,Steven M. 
Hernandez,Tian Shi,Yony Kochinski,Anna Bulanova,Ken Franko,Mikita Sazanovich,Nicholas FitzGerald,Praneeth Kacham,Shubha Srinivas Raghvendra,Vincent Hellendoorn,Alexander Grushetsky,Julian Salazar,Angeliki Lazaridou,Jason Chang,Jan-Thorsten Peter,Sushant Kafle,Yann Dauphin,Abhishek Rao,Filippo Graziano,Izhak Shafran,Yuguo Liao,Tianli Ding,Geng Yan,Grace Chu,Zhao Fu,Vincent Roulet,Gabriel Rasskin,Duncan Williams,Shahar Drath,Alex Mossin,Raphael Hoffmann,Jordi Orbay,Francesco Bertolini,Hila Sheftel,Justin Chiu,Siyang Xue,Yuheng Kuang,Ferjad Naeem,Swaroop Nath,Nana Nti,Phil Culliton,Kashyap Krishnakumar,Michael Isard,Pei Sun,Ayan Chakrabarti,Nathan Clement,Regev Cohen,Arissa Wongpanich,GS Oh,Ashwin Murthy,Hao Zheng,Jessica Hamrick,Oskar Bunyan,Suhas Ganesh,Nitish Gupta,Roy Frostig,John Wieting,Yury Malkov,Pierre Marcenac,Zhixin,Lai,Xiaodan Tang,Mohammad Saleh,Fedir Zubach,Chinmay Kulkarni,Huanjie Zhou,Vicky Zayats,Nan Ding,Anshuman Tripathi,Arijit Pramanik,Patrik Zochbauer,Harish Ganapathy,Vedant Misra,Zach Behrman,Hugo Vallet,Mingyang Zhang,Mukund Sridhar,Ye Jin,Mohammad Babaeizadeh,Siim Põder,Megha Goel,Divya Jain,Tajwar Nasir,Shubham Mittal,Tim Dozat,Diego Ardila,Aliaksei Severyn,Fabio Pardo,Sammy Jerome,Siyang Qin,Louis Rouillard,Amir Yazdanbakhsh,Zizhao Zhang,Shivani Agrawal,Kaushik Shivakumar,Caden Lu,Praveen Kallakuri,Rachita Chhaparia,Kanishka Rao,Charles Kwong,Asya Fadeeva,Shitij Nigam,Yan Virin,Yuan Zhang,Balaji Venkatraman,Beliz Gunel,Marc Wilson,Huiyu Wang,Abhinav Gupta,Xiaowei Xu,Adrien Ali Taïga,Kareem Mohamed,Doug Fritz,Daniel Rodriguez,Zoubin Ghahramani,Harry Askham,Lior Belenki,James Zhao,Rahul Gupta,Krzysztof Jastrzębski,Takahiro Kosakai,Kaan Katircioglu,Jon Schneider,Rina Panigrahy,Konstantinos Bousmalis,Peter Grabowski,Prajit Ramachandran,Chaitra Hegde,Mihaela Rosca,Angelo Scorza Scarpati,Kyriakos Axiotis,Ying Xu,Zach Gleicher,Assaf Hurwitz Michaely,Mandar Sharma,Sanil Jain,Christoph Hirnschall,Tal Marian,Xuhui Jia,Kevin Mather,Kilol Gupta,Linhai 
Qiu,Nigamaa Nayakanti,Lucian Ionita,Steven Zheng,Lucia Loher,Kurt Shuster,Igor Petrovski,Roshan Sharma,Rahma Chaabouni,Angel Yeh,James An,Arushi Gupta,Steven Schwarcz,Seher Ellis,Sam Conway-Rahman,Javier Snaider,Alex Zhai,James Atwood,Daniel Golovin,Liqian Peng,Te I,Vivian Xia,Salvatore Scellato,Mahan Malihi,Arthur Bražinskas,Vlad-Doru Ion,Younghoon Jun,James Swirhun,Soroosh Mariooryad,Jiao Sun,Steve Chien,Rey Coaguila,Ariel Brand,Yi Gao,Tom Kwiatkowski,Roee Aharoni,Cheng-Chun Lee,Mislav Žanić,Yichi Zhang,Dan Ethier,Vitaly Nikolaev,Pranav Nair,Yoav Ben Shalom,Hen Fitoussi,Jai Gupta,Hongbin Liu,Dee Cattle,Tolga Bolukbasi,Ben Murdoch,Fantine Huot,Yin Li,Chris Hahn
Main category: cs.CL
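TL;DR: The Gemini 2.X model family comprises Gemini 2.5 Pro and Gemini 2.5 Flash, together with the earlier Gemini 2.0 Flash and Flash-Lite models, and pushes the frontier of reasoning, multimodality, long context, and agentic capabilities.
Details
Motivation: To push the limits of what models can do in complex agentic problem solving while offering options that range from high capability to low cost.
Contribution: Gemini 2.5 Pro achieves SoTA performance on frontier coding and reasoning benchmarks and can process up to 3 hours of video content; Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements.
Method: Long context, multimodal understanding, and advanced reasoning are combined to unlock new agentic workflows.
Result: Taken together, the Gemini 2.X models span the full Pareto frontier of model capability versus cost and support complex agentic problem solving.
Insight: The report suggests that models combining several capabilities can considerably broaden the range of agentic tasks they can handle while improving cost-effectiveness.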
Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
[2] ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time
Kiarash Zahirnia,Zahra Golpayegani,Walid Ahmad,Yang Liu
Main category: cs.CL
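TL;DR: ETT extends the context length of short-context Transformer models through efficient fine-tuning at test time, improving long-text understanding with linear computation overhead and a constant memory requirement.
Details
Motivation: The computation and memory overhead of Transformer models grows quadratically with sequence length, which limits their use on long-context tasks. ETT aims to address this.
Contribution: Proposes ETT, which extends the context length of short-context models with linear computation overhead and a constant memory requirement, and validates it on multiple models.
Method: The input context is chunked into overlapping subsequences, and the model's parameters (in particular the second layer of the FFNs) are efficiently fine-tuned on them, which extends the context at test time.
Result: On GPT-Large and Phi-2, the context length is extended from 1k to 32k tokens, with up to a 30 percent improvement in accuracy; fine-tuning only selected modules outperforms full fine-tuning.
Insight: Fine-tuning only the second layer of the FFNs is more effective than full fine-tuning, which points to a new direction for efficient long-context extension.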
Abstract: Transformer-based Language Models’ computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short context Transformer-based LLMs, with constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test-time by efficiently fine-tuning the model’s parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model’s accuracy. We also study how context can be stored in LLM’s weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test-time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models’ accuracy.
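The chunking step mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `chunk_with_overlap` and the `chunk_len` and `overlap` values are assumptions chosen for the example, and the test-time fine-tuning on each chunk is omitted.

```python
def chunk_with_overlap(tokens, chunk_len, overlap):
    """Split a token sequence into fixed-length chunks that overlap.

    Consecutive chunks share `overlap` tokens; the final chunk may be shorter.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # this chunk already reaches the end of the input
    return chunks


# Hypothetical input: 10 token ids, chunks of 4 with 2 tokens of overlap.
tokens = list(range(10))
chunks = chunk_with_overlap(tokens, chunk_len=4, overlap=2)
```

Because each chunk has a bounded length, the memory needed to process one chunk does not depend on the total input length, while the number of chunks grows linearly with it. This is consistent with the constant-memory, linear-compute claim in the abstract.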
[3] PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning
Zeming Chen,Angelika Romanou,Gail Weiss,Antoine Bosselut
Main category: cs.CL
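TL;DR: PERK is a parameter-efficient test-time learning method for long-context reasoning: it encodes long input contexts into a lightweight model adapter at test time and significantly improves reasoning performance.
Details
Motivation: Meta-learning methods that enable test-time learning are too memory-intensive for long-context settings, so a more parameter-efficient approach is needed.
Contribution: Proposes PERK, which uses two nested optimization loops and a low-rank adapter (LoRA) to encode long contexts efficiently.
Method: Two nested optimization loops are used in meta-training: the inner loop encodes the context into a LoRA, and the outer loop learns to use the updated adapter to recall and reason over the encoded context. The context is encoded through gradient updates to the lightweight adapter.
Result: PERK significantly outperforms the standard prompt-based long-context baseline, with average absolute gains of up to 90% for GPT-2 and up to 27% for Qwen-2.5-0.5B, the largest model evaluated. It is also more robust to reasoning complexity, length extrapolation, and the location of relevant information.
Insight: PERK is memory-intensive during training, but at inference time it scales more efficiently than prompt-based long-context inference.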
Abstract: Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.
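To see why a low-rank adapter can act as a parameter-efficient memory module, it helps to count trainable parameters. The sketch below uses the standard LoRA formulation, in which a frozen weight matrix of shape `d_out x d_in` receives an update `B @ A` with `B` of shape `d_out x rank` and `A` of shape `rank x d_in`. The layer size and rank are hypothetical values, not taken from the paper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, adapter) trainable-parameter counts for one linear layer.

    full    : parameters updated when the whole weight matrix is trained
    adapter : parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)
    """
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter


# Hypothetical layer: a 1024 x 1024 projection with a rank-8 adapter.
full, adapter = lora_param_counts(d_in=1024, d_out=1024, rank=8)
reduction = full // adapter  # how many times fewer parameters the adapter trains
```

For this hypothetical layer the adapter trains 16,384 parameters instead of 1,048,576, a 64-fold reduction.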
[4] Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders
Shun Wang,Tyler Loakman,Youbo Lei,Yi Liu,Bohao Yang,Yuting Zhao,Dong Yang,Chenghua Lin
Main category: cs.CL
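TL;DR: The paper applies a dictionary-learning approach with sparse autoencoders to decompose large language models (LLMs), extracting monosemantic features and exposing model-internal misunderstanding, which is then used to reformulate prompts and improve downstream task performance.
Details
Motivation: LLMs are traditionally viewed as black-box algorithms, which reduces trustworthiness and obscures ways to improve performance. By decomposing model neurons, the paper aims to improve both interpretability and task performance.
Contribution: 1. Applies a sparse-autoencoder method to decompose LLMs; 2. Identifies model-internal misunderstanding and reformulates prompts accordingly; 3. Reports significant improvements on mathematical reasoning and metaphor detection.
Method: Dictionary learning with sparse autoencoders extracts monosemantic features from polysemantic LLM neurons; the resulting analysis drives automatic reformulation of prompts with additional annotations.
Result: The approach significantly improves performance on downstream tasks such as mathematical reasoning and metaphor detection, while also improving interpretability.
Insight: Sparse autoencoders can decompose the entangled features of LLMs and reveal the internal mechanisms behind model behavior, which offers a new route to improving model performance.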
Abstract: Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.
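The abstract does not spell out the autoencoder, so the following is only a generic sketch of a sparse autoencoder: a ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty to the squared reconstruction error. All names, weights, and the `l1_coeff` value are assumptions for illustration and are not taken from the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Run one sparse-autoencoder forward pass on a single activation vector.

    Returns (features, reconstruction, loss), where
    loss = squared reconstruction error + l1_coeff * L1 norm of the features.
    """
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    sq_err = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return features, recon, sq_err + l1_coeff * sparsity


# Hypothetical 2-dimensional activation expanded into 3 dictionary features.
x = [1.0, -1.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]
features, recon, loss = sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.1)
```

The dictionary is wider than the input (3 features for 2 dimensions), and the L1 term pushes most features to zero for any given input, which is what makes individual features easier to interpret.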
[5] Temporal Analysis of Climate Policy Discourse: Insights from Dynamic Embedded Topic Modeling
Rafiu Adekoya Badekale,Adewale Akinfaderin
Main category: cs.CL
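TL;DR: The paper applies the dynamic embedded topic model (DETM) to analyze how global climate policy discourse has evolved, revealing a shift from an early focus on greenhouse gases to a recent emphasis on implementation and technical collaboration, among other themes.
Details
Motivation: Manual thematic coding is time-consuming and limited in capturing the complexity and dynamics of global policy discourse, so an automated, machine-learning-based method is needed to analyze how policy language evolves.
Contribution: The main contribution is applying DETM to analyze the temporal dynamics of climate policy texts and showing that it is effective and scalable on high-dimensional data.
Method: DETM is applied to UNFCCC policy decisions from 1995 to 2023 (excluding 2020), with a pipeline of preprocessing, model training, and visualization of temporal word distributions.
Result: DETM effectively captures the evolution of climate policy themes, for example the shift from greenhouse gases to technical collaboration and global agreements.
Insight: Dynamic topic models can be a powerful tool for analyzing the evolution of policy discourse, helping policymakers and researchers identify trends and design responses.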
Abstract: Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM), a probabilistic model designed to capture the temporal dynamics of topics, to analyze the evolution of global climate policy discourse. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic. The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings, and we conclude with future directions and refinements to extend this approach to other policy domains.
[6] Perception-Aware Policy Optimization for Multimodal Reasoning
Zhenhailong Wang,Xuehang Guo,Sofia Stoica,Haiyang Xu,Hongru Wang,Hyeonjeong Ha,Xiusi Chen,Yangyi Chen,Ming Yan,Fei Huang,Heng Ji
Main category: cs.CL
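TL;DR: The paper proposes PAPO, which adds a perception-aware supervision signal to training for multimodal reasoning and significantly improves performance on vision-dependent tasks.
Details
Motivation: Existing RLVR methods are tailored to purely textual domains and perform suboptimally on multimodal reasoning tasks; a major source of error is the perception of visual inputs.
Contribution: Proposes PAPO, which uses an Implicit Perception Loss (a KL divergence term) so that the model learns to perceive while learning to reason, without additional data curation, external reward models, or proprietary models.
Method: Extends the GRPO objective with an Implicit Perception Loss (a KL divergence term) and introduces a Double Entropy Loss to mitigate a loss hacking issue.
Result: Overall improvement of 4.4% on diverse multimodal benchmarks, approaching 8.0% on tasks with high vision dependency, and a 30.5% reduction in perception errors.
Insight: Perception-aware supervision can significantly improve multimodal reasoning without relying on external resources.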
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
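The abstract states that the Implicit Perception Loss takes the form of a KL divergence term but does not say which distributions are compared. The sketch below therefore only shows how a KL divergence between two discrete distributions is computed; the two example next-token distributions and the reading attached to them are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Compute KL(p || q) in nats for two discrete distributions of equal length.

    `eps` guards against division by zero; terms with p_i == 0 contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a 2-token vocabulary,
# e.g. a model's prediction under two different visual inputs (an assumption).
p = [0.5, 0.5]
q = [0.9, 0.1]
divergence = kl_divergence(p, q)
```

A divergence near zero means the two predictions barely differ, and larger values mean the prediction changes more between the two conditions.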
[7] Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis
Srihari K B,Pushpak Bhattacharyya
Main category: cs.CL
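TL;DR: The paper proposes a unified food-domain QA framework that combines a multimodal knowledge graph (MMKG) with generative AI, improving the reliability and diversity of answers.
Details
Motivation: Food-domain QA needs to combine multimodal information (such as recipes, ingredients, and images) to give complete answers; conventional methods fall short on reliability and diversity.
Contribution: 1. Builds a large-scale multimodal knowledge graph (MMKG); 2. Proposes a hybrid QA generation method; 3. Improves QA and image generation quality through joint fine-tuning.
Method: 1. Builds an MMKG linking 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images; 2. Generates 40,000 QA pairs using 40 templates with LLaVA/DeepSeek augmentation; 3. Jointly fine-tunes Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large.
Result: BERTScore improves by 16.2%, FID falls by 37.8%, and CLIP alignment rises by 31.1%. In the diagnostic analyses, the CLIP-based mismatch rate falls from 35.2% to 7.3%, and the hybrid retrieval-generation strategy achieves 94.1% accurate image reuse.
Insight: Combining structured knowledge with multimodal generation improves the reliability and diversity of food-domain QA and may carry over to similar tasks.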
Abstract: We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses, namely CLIP-based mismatch detection (35.2% to 7.3%) and LLaVA-driven hallucination checks, ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.
[8] Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
Liliang Ren,Congcong Chen,Haoran Xu,Young Jin Kim,Adam Atkinson,Zheng Zhan,Jiankai Sun,Baolin Peng,Liyuan Liu,Shuohang Wang,Hao Cheng,Jianfeng Gao,Weizhu Chen,Yelong Shen
Main category: cs.CL
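TL;DR: The paper proposes SambaY, a decoder-hybrid-decoder architecture that uses a Gated Memory Unit (GMU) to share representations efficiently across layers, improving decoding efficiency and performance for long generation.
Details
Motivation: Hybrid architectures such as Samba and YOCO perform well in sequence modeling, but prior work has not investigated the efficiency potential of representation sharing between State Space Model (SSM) layers. The paper aims to fill this gap.
Contribution: 1. Proposes the Gated Memory Unit (GMU) for efficient memory sharing across layers; 2. Designs the SambaY architecture, which combines GMUs with a Samba-based self-decoder to improve decoding efficiency and long-context performance; 3. Demonstrates significant gains in reasoning performance and decoding throughput.
Method: 1. The GMU mechanism shares memory readout states across layers; 2. SambaY places GMUs in the cross-decoder so that it can reuse memory readout states from a Samba-based self-decoder; 3. The largest model is additionally enhanced with Differential Attention.
Result: Phi4-mini-Flash-Reasoning, the largest model (enhanced with Differential Attention), outperforms Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, and delivers up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework.
Insight: Cross-layer memory sharing through GMUs significantly improves efficiency, which suggests that representation sharing has considerable potential for long-sequence modeling.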
Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
[9] FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation
Boshko Koloski,Senja Pollak,Roberto Navigli,Blaž Škrlj
Main category: cs.CL
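TL;DR: FuDoBa is a Bayesian optimisation-based method that fuses LLM-based embeddings with domain-specific structured knowledge to produce low-dimensional, task-relevant representations, improving classification performance while reducing training complexity.
Details
Motivation: LLM-generated embeddings perform strongly but tend to be too generic or too computationally expensive for domain-specific applications. FuDoBa aims to address this.
Contribution: Proposes a method that fuses document and knowledge graph-based representations and uses Bayesian optimisation to produce low-dimensional, efficient embeddings for enhanced classification performance.
Method: LLM embeddings are combined with domain knowledge sourced both locally and from external repositories such as WikiData; Bayesian optimisation drives the early fusion and yields interpretable fusion weights.
Result: Experiments on six datasets in two domains show that the method performs on par with, or surpasses, baselines that rely solely on proprietary LLM-based embeddings.
Insight: Fusing domain-specific knowledge can make embeddings more task-relevant while reducing training complexity, which suggests a new direction for document representation.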
Abstract: Building on the success of Large Language Models (LLMs), LLM-based representations have dominated the document representation landscape, achieving great performance on the document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings from LLMs tend to be either too generic or inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories like WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for enhanced classification performance. We demonstrate the effectiveness of our approach on six datasets in two domains, showing that when paired with robust AutoML-based classifiers, our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.
[10] Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework
Zenan Xu,Zexuan Qiu,Guanhua Huang,Kun Li,Siheng Li,Chenchen Zhang,Kejiao Li,Qi Yi,Yuhao Jiang,Bo Zhou,Fengzong Lian,Zhanhui Kang
Main category: cs.CL
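TL;DR: The paper proposes a semantic entropy-guided adaptive termination framework that combines the strengths of sequential and parallel reasoning, using dynamic control and early termination to improve inference efficiency and quality.
Details
Motivation: Current inference-time scaling approaches for LLMs have fundamental limits: sequential reasoning typically relies on arbitrary token budgets, leading to inefficiency or premature cutoff, while parallel reasoning often lacks coordination among branches. A flexible collaborative inference framework is needed to overcome these limits.
Contribution: 1) Introduces semantic entropy (SE) as an indicator of reasoning quality; 2) Designs a flexible adaptive termination framework that combines the strengths of sequential and parallel reasoning.
Method: Semantic entropy quantifies the semantic diversity of parallel responses; it is used to assess reasoning quality dynamically and to terminate early, avoiding both wasted computation and premature cutoff.
Result: Semantic entropy shows a strong negative correlation with accuracy and can efficiently guide termination, improving inference efficiency and quality.
Insight: Combining sequential and parallel reasoning, with semantic entropy controlling the reasoning process dynamically, is an effective way to improve a model's reasoning.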
Abstract: Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy…
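The idea of measuring the semantic diversity of parallel responses can be sketched with a toy entropy computation. This is not the paper's implementation: answers are grouped here by exact match after whitespace and case normalization, which stands in for a real semantic-equivalence check, and the `threshold` used for termination is a hypothetical value.

```python
import math
from collections import Counter


def semantic_entropy(answers):
    """Shannon entropy (in nats) over clusters of equivalent answers.

    Clustering is exact match after stripping and lowercasing, a stand-in
    for a proper semantic-equivalence test.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    clusters = Counter(a.strip().lower() for a in answers)
    total = sum(clusters.values())
    entropy = 0.0
    for count in clusters.values():
        prob = count / total
        entropy -= prob * math.log(prob)
    return entropy


def should_terminate(answers, threshold=0.5):
    """Stop reasoning once the parallel branches mostly agree (low entropy)."""
    return semantic_entropy(answers) <= threshold


# Hypothetical final answers from four parallel reasoning branches.
agreeing = ["42", " 42", "42", "42 "]
scattered = ["42", "41", "40", "39"]
```

When all branches agree the entropy is zero and the loop can stop. When every branch gives a different answer the entropy reaches its maximum, `log(4)` for four branches, which signals that further reasoning rounds may be worthwhile.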
[11] Shifting from Ranking to Set Selection for Retrieval Augmented Generation
Dahyun Lee,Yongrae Jo,Haeju Park,Moontae Lee
Main category: cs.CL
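TL;DR: The paper proposes SETR, a retrieval method for Retrieval-Augmented Generation (RAG) that shifts from ranking to set selection: it explicitly identifies a query's information requirements and selects an optimal set of passages, improving retrieval quality in multi-hop question answering.
Details
Motivation: Existing RAG methods mainly rerank passages by individual relevance, which cannot ensure that the passage set is comprehensive for complex queries such as multi-hop questions, and this degrades generation quality.
Contribution: Proposes SETR, a set-wise passage selection approach that explicitly identifies a query's information requirements and selects an optimal passage set, improving retrieval and generation in multi-hop question answering.
Method: SETR uses Chain-of-Thought reasoning to identify the information requirements of a query, then selects a set of passages that collectively satisfy them rather than relying on individual relevance alone.
Result: On multi-hop RAG benchmarks, SETR outperforms both proprietary LLM-based rerankers and open-source baselines in answer correctness and retrieval quality.
Insight: Selecting a set, rather than ranking passages individually, meets the information needs of complex queries more effectively and offers an efficient alternative to traditional rerankers in RAG systems.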
Abstract: Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR
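The difference between ranking by individual relevance and selecting a set that covers a query's information requirements can be shown with a toy example. According to the abstract, SETR identifies requirements and selects passages through Chain-of-Thought reasoning; the greedy coverage heuristic below is a simple stand-in introduced here, and the `score` and `covers` annotations on each passage are hypothetical.

```python
def top_k_by_score(passages, k):
    """Baseline: keep the k passages with the highest individual relevance."""
    return sorted(passages, key=lambda p: p["score"], reverse=True)[:k]


def greedy_set_selection(passages, requirements, k):
    """Pick up to k passages that together cover as many requirements as possible.

    At each step, choose the passage covering the most still-unmet
    requirements, breaking ties by individual relevance score.
    """
    remaining = set(requirements)
    candidates = list(passages)
    chosen = []
    while remaining and candidates and len(chosen) < k:
        best = max(candidates, key=lambda p: (len(remaining & p["covers"]), p["score"]))
        if not remaining & best["covers"]:
            break  # no candidate adds new information
        chosen.append(best)
        candidates.remove(best)
        remaining -= best["covers"]
    return chosen


# Hypothetical two-hop query: "Where was the director of film X born?"
requirements = {"director", "birthplace"}
passages = [
    {"id": "A", "score": 0.90, "covers": {"director"}},
    {"id": "B", "score": 0.85, "covers": {"director"}},
    {"id": "C", "score": 0.60, "covers": {"birthplace"}},
]
ranked_ids = [p["id"] for p in top_k_by_score(passages, k=2)]
selected_ids = [p["id"] for p in greedy_set_selection(passages, requirements, k=2)]
```

Here the top-2 ranking returns two redundant passages about the director, while set selection keeps one of them and adds the lower-scored passage that supplies the missing hop.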
[12] SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN
Luca Mariotti,Veronica Guidetti,Federica Mandreoli
Main category: cs.CL
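TL;DR: SCoRE is an efficient, modular relation extraction system that combines multi-label contrastive learning with a Bayesian kNN classifier. It targets low-supervision settings, matches or surpasses existing methods, and consumes less energy.
Details
Motivation: Expanding knowledge graphs (KGs) calls for efficient, adaptable relation extraction (RE), especially under low supervision and noisy annotations. SCoRE aims to provide a solution that needs no fine-tuning and lets pre-trained language models (PLMs) be swapped freely.
Contribution: 1. Proposes the SCoRE system, combining contrastive learning with a Bayesian kNN classifier; 2. Designs two new evaluation metrics, CSD and P@R; 3. Releases the Wiki20d benchmark dataset; 4. Matches or surpasses existing methods while using less energy.
Method: Uses multi-label contrastive learning to refine feature representations and a Bayesian kNN classifier for relation classification, without fine-tuning the pre-trained model.
Result: On five benchmarks, SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analysis indicates that more complex model designs can degrade performance.
Insight: Simple, efficient designs such as SCoRE are better suited to real-world use, while more complex models may perform poorly on noisy data.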
Abstract: The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE’s minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.
[13] VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation
Ziang Ye,Yang Zhang,Wentao Shi,Xiaoyu You,Fuli Feng,Tat-Seng Chua
Main category: cs.CL
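TL;DR: The paper exposes a vulnerability in the visual grounding of GUI agents and proposes VisualTrap, a stealthy backdoor attack that misleads the agent into mapping task plans to trigger locations rather than the intended targets.
Details
Motivation: The tight integration of GUI agents with personal devices creates security risks, and backdoor attacks in particular remain understudied. This work studies the visual grounding vulnerability to show that this new class of attack is feasible.
Contribution: 1. Proposes VisualTrap, the first backdoor attack targeting the visual grounding of GUI agents; 2. Shows that the attack is stealthy and generalizes, remaining effective even after fine-tuning; 3. Demonstrates that the attack transfers across GUI environments.
Method: Injects poisoned data during visual grounding pre-training so that the agent maps task plans to trigger locations instead of target locations. The attack uses stealthy visual triggers (invisible to the human eye) and needs only a small share of poisoned data (5%).
Result: Experiments show that VisualTrap attacks effectively with as little as 5% poisoned data and generalizes to downstream tasks and to different GUI environments (for example, from mobile/web to desktop).
Insight: The visual grounding mechanism of GUI agents carries a serious security risk, and defenses need further study to counter the threat of backdoor attacks.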
Abstract: Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agents (mapping textual plans to GUI elements) can introduce vulnerabilities, enabling new types of backdoor attacks. With a backdoor attack targeting visual grounding, the agent's behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to map textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.
[14] Rethinking Verification for LLM Code Generation: From Generation to Testing
Zihan Ma,Taolin Zhang,Maosong Cao,Wenwei Zhang,Minnan Luo,Songyang Zhang,Kai Chen
Main category: cs.CL
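TL;DR: This paper examines the limitations of how LLM code generation is evaluated, proposes multi-dimensional metrics that quantify test-suite thoroughness, and introduces a human-LLM collaborative method (SAGA) that improves test-case quality and coverage, with results that exceed existing benchmarks.
Details
Motivation: Existing code generation benchmarks such as HumanEval and LiveCodeBench contain only a limited number of homogeneous test cases, so subtle faults go undetected. This overestimates model performance and distorts reward estimation in reinforcement learning.
Contribution: Proposes multi-dimensional metrics that quantify test-suite thoroughness; designs the human-LLM collaborative method SAGA to improve test-case quality and coverage; develops TCGBench as a research platform for the test-case generation task.
Method: Human-LLM collaboration (SAGA) combines human programming expertise with LLM reasoning to raise the coverage and quality of generated test cases.
Result: SAGA reaches a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench; the verifier accuracy of the code generation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6.
Insight: Human-LLM collaboration can markedly improve the thoroughness and quality of test cases, laying a foundation for reliable LLM code evaluation and supporting further progress in reinforcement learning for code generation.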
Abstract: Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
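The point about homogeneous test cases can be shown with a small sketch. It assumes a simple definition of detection rate, namely the fraction of known-faulty programs that fail at least one test in the suite; the paper's exact metric definitions are not given in this excerpt, and `detection_rate` is a hypothetical helper.

```python
def detection_rate(faulty_programs, tests):
    """Fraction of faulty programs caught by at least one test.

    faulty_programs: list of callables known to be wrong (assumed given).
    tests: list of (input, expected_output) pairs.
    """
    detected = sum(
        1
        for prog in faulty_programs
        if any(prog(x) != expected for x, expected in tests)
    )
    return detected / len(faulty_programs)


# Two buggy implementations of abs().
faulty = [lambda x: x, lambda x: -x]

homogeneous = [(1, 1), (2, 2)]   # only positive inputs
diverse = [(1, 1), (-1, 1)]      # covers both signs

print(detection_rate(faulty, homogeneous), detection_rate(faulty, diverse))
```

The homogeneous suite lets the first buggy program pass every test, which is the kind of inflated score the abstract attributes to limited test suites.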
[15] Investigating the Robustness of Retrieval-Augmented Generation at the Query Level
Sezen Perçin,Xin Su,Qutub Sha Syed,Phillip Howard,Aleksei Kuvshinov,Leo Schwinn,Kay-Ulrich Scholl
Main category: cs.CL
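TL;DR: This paper studies the query-level robustness of retrieval-augmented generation (RAG) systems, finds that their performance is easily affected by small changes to the query, and proposes an evaluation framework along with practical recommendations.
Details
Motivation: Large language models (LLMs) are hard to update efficiently with new information. RAG addresses this by dynamically incorporating external knowledge, but its performance depends heavily on the quality of the input query. This paper analyzes how sensitive RAG is to query perturbations.
Contribution: Shows that the retrieval module in RAG systems is highly sensitive to small query changes; proposes a framework for systematically evaluating query-level RAG robustness; gives practical recommendations based on extensive experiments.
Method: Analyzes the sensitivity of each module in the RAG pipeline to query perturbations, both in isolation and in end-to-end question answering, using general-domain and domain-specific datasets across more than 1092 experiments.
Result: Experiments show that the performance of commonly used retrievers drops significantly under minor query variations, and that the evaluation framework can quantify RAG robustness.
Insight: Robustness issues may become a bottleneck for deploying RAG systems in practice, which calls for better retrieval modules or more robust query handling.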
Abstract: Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges, most notably a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed.
[16] FRaN-X: FRaming and Narratives-eXplorer
Artur Muratov,Hana Fatima Shaikh,Vanshikaa Jani,Tarek Mahmoud,Zhuohan Xie,Daniil Orel,Aaryamonvikram Singh,Yuxia Wang,Aadi Joshi,Hasan Iqbal,Ming Shan Hee,Dhruv Sahnan,Nikolaos Nikolaidis,Purificação Silvano,Dimitar Dimitrov,Roman Yangarber,Ricardo Campos,Alípio Jorge,Nuno Guimarães,Elisa Sartori,Nicolas Stefanovitch,Giovanni Da San Martino,Jakub Piskorski,Preslav Nakov
Main category: cs.CL
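TL;DR: FRaN-X is a tool that automatically detects entities and classifies their narrative roles. It supports five languages and two domains and provides interactive visualizations for analyzing narrative framing in the media.
Details
Motivation: In media analysis, automatically detecting and labeling the narrative roles of entities (such as protagonist, antagonist, or innocent) is a challenge. FRaN-X addresses this problem to help analysts understand how different sources frame narratives.
Contribution: 1. Proposes a two-stage system that combines sequence labeling with fine-grained role classification. 2. Supports five languages and two domains. 3. Provides an interactive visualization tool for analyzing narrative framing.
Method: A two-stage approach: the first stage detects entities through sequence labeling, and the second stage assigns fine-grained narrative roles (22 in total) with a classification model. The system supports multiple languages and domains.
Result: Delivers a publicly available tool (FRaN-X) that supports multilingual, multi-domain analysis and offers an intuitive graph visualization.
Insight: The tool helps media analysts quickly identify and compare narrative framing across sources and reveals how an entity's role changes, making it suitable for cross-cultural and cross-domain narrative analysis.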
Abstract: We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate level analysis including an intuitive graph visualization that highlights the narrative a group of articles are pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity’s role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at https://fran-x.streamlit.app/ and a video demonstration is available at https://youtu.be/VZVi-1B6yYk.
[17] Discrete Diffusion Models for Language Generation
Ashen Weligalle
Main category: cs.CL
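TL;DR: The thesis examines the feasibility and performance of the discrete diffusion model D3PM for natural language generation and compares it with autoregressive (AR) models. D3PM has an advantage in batch processing speed, but its compression performance trails that of the AR model.
Details
Motivation: Diffusion models excel at generating continuous data such as images and video, but applying them to discrete data such as natural language remains challenging. This work studies the potential of discrete diffusion models for language generation.
Contribution: Implements the discrete diffusion model D3PM for natural language generation and evaluates its performance systematically, showing its advantage in batch processing speed.
Method: Uses the Discrete Denoising Diffusion Probabilistic Model (D3PM) and compares it with conventional autoregressive models, measuring generative performance with BPT, NLL, PPL, and batch processing speed.
Result: The best D3PM model reaches a BPT of 5.72 (mean 8.05), behind the AR model's mean BPT of 4.59, but D3PM reaches up to 3.97 batches per second, indicating potential for parallel generation.
Insight: Diffusion models offer an efficiency advantage for discrete data generation but still lag AR models in generative quality (for example, compression), which points to directions for future work on non-autoregressive language generation.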
Abstract: Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in continuous data domains such as image and video generation. Their core mechanism involves a forward diffusion process that gradually transforms structured data into a Gaussian-like distribution, followed by a learned reverse process to reconstruct the data. While the framework is successful in continuous modalities, applying it to discrete data, particularly natural language, remains challenging due to token dependency complexities and the lack of a defined generation order. This thesis investigates the feasibility and performance of discrete diffusion models for natural language generation. Specifically, we evaluate the Discrete Denoising Diffusion Probabilistic Model (D3PM) and compare it with traditional autoregressive (AR) language models. To assess generative performance, we use Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and Batch Processing Speed. Results show the best-performing D3PM model achieves a BPT of 5.72, with a mean of 8.05. The AR model outperforms in compression with a lower mean BPT of 4.59, but D3PM achieves higher processing speed, reaching up to 3.97 batches per second, indicating potential for parallel generation. All evaluations were conducted under consistent conditions (generating 100,000 tokens per model with a fixed batch size of four) for fair comparison. This research presents a detailed analysis of diffusion-based vs. autoregressive models, highlighting trade-offs in generative quality and efficiency. Findings emphasize both the promise and limitations of diffusion models for discrete data, supporting future work in non-autoregressive language generation.
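The three quality metrics named above are related by standard identities, sketched below. The sketch assumes NLL is the mean per-token negative log-likelihood in nats; the thesis may use a different unit or averaging convention, and the function names are hypothetical.

```python
import math


def nll_nats_to_bpt(nll_nats):
    """Convert mean per-token NLL in nats to bits per token."""
    return nll_nats / math.log(2)


def bpt_to_perplexity(bpt):
    """Per-token perplexity implied by a bits-per-token value."""
    return 2.0 ** bpt


# Lower BPT means better compression, hence lower perplexity.
for label, bpt in [("D3PM best", 5.72), ("AR mean", 4.59)]:
    print(label, round(bpt_to_perplexity(bpt), 1))
```

Because perplexity grows exponentially in BPT, the gap of roughly one bit per token between the two reported figures corresponds to about a twofold difference in perplexity.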
cs.CV
[18] Unveiling the Underwater World: CLIP Perception Model-Guided Underwater Image Enhancement
Jiangzhong Cao,Zekai Zeng,Xu Zhang,Huan Zhang,Chunling Fan,Gangyi Jiang,Weisi Lin
Main category: cs.CV
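TL;DR: The paper proposes an underwater image enhancement method guided by a CLIP perception model and combined with curriculum contrastive regularization, improving both the perceptual quality and the content restoration of enhanced images.
Details
Motivation: Existing underwater image enhancement methods often overlook human perception and do not constrain the solution space sufficiently, so enhanced images suffer from reduced quality or poor content restoration.
Contribution: (1) Uses the visual semantic feature extraction capability of CLIP to learn a prompt pair suited to assessing underwater image quality; (2) integrates the CLIP perception model into the enhancement network as a perception loss module; (3) adds curriculum contrastive regularization to further constrain enhanced images in the CLIP perceptual space.
Method: (1) Extracts visual semantic features with CLIP and learns a prompt pair; (2) uses the CLIP perception model as a perception loss module; (3) optimizes the enhancement process with curriculum contrastive regularization.
Result: Experiments show that the method outperforms existing methods in visual quality and generalization ability.
Insight: Combining a CLIP perception model with curriculum contrastive learning balances perceptual quality against content restoration more effectively and avoids both over-enhancement and under-enhancement.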
Abstract: High-quality underwater images are essential for both machine vision tasks and human viewers, given their aesthetic appeal. However, the quality of underwater images is severely affected by light absorption and scattering. Deep learning-based methods for Underwater Image Enhancement (UIE) have achieved good performance. However, these methods often overlook human perception and lack sufficient constraints within the solution space. Consequently, the enhanced images often suffer from diminished perceptual quality or poor content restoration. To address these issues, we propose a UIE method with a Contrastive Language-Image Pre-Training (CLIP) perception loss module and curriculum contrastive regularization. Above all, to develop a perception model for underwater images that aligns more closely with human visual perception, the visual semantic feature extraction capability of the CLIP model is leveraged to learn an appropriate prompt pair to map and evaluate the quality of underwater images. This CLIP perception model is then incorporated as a perception loss module into the enhancement network to improve the perceptual quality of enhanced images. Furthermore, the CLIP perception model is integrated with the curriculum contrastive regularization to enhance the constraints imposed on the enhanced images within the CLIP perceptual space, mitigating the risk of both under-enhancement and over-enhancement. Specifically, the CLIP perception model is employed to assess and categorize the learning difficulty level of negatives in the regularization process, ensuring comprehensive and nuanced utilization of distorted images and negatives with varied quality levels. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability.
[19] SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability
Ali Nasiri-Sarvi,Hassan Rivaz,Mahdi S. Hosseini
Main category: cs.CV
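TL;DR: SPARC is a new framework that uses a Global TopK sparsity mechanism and a Cross-Reconstruction Loss to learn a unified latent space across models and modalities, substantially improving concept alignment.
Details
Motivation: Existing interpretability methods such as sparse autoencoders produce latent concepts separately for each model, which yields incompatible concept spaces and limits cross-model interpretability. SPARC is designed to address this.
Contribution: 1. Proposes the SPARC framework, which learns a unified latent space across models and modalities; 2. Introduces a Global TopK sparsity mechanism and a Cross-Reconstruction Loss that substantially improve concept alignment.
Method: 1. Uses a Global TopK sparsity mechanism so that different input streams activate aligned latent dimensions; 2. Enforces semantic consistency through a Cross-Reconstruction Loss.
Result: On Open Images, SPARC reaches a Jaccard similarity of 0.80, more than three times the alignment of previous methods.
Insight: Beyond concept alignment, the SPARC framework also supports applications such as text-guided spatial localization and cross-modal retrieval.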
Abstract: Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC’s alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval. Code and models are available at https://github.com/AtlasAnalyticsLab/SPARC.
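The alignment idea can be sketched in a few lines. The sketch assumes that Jaccard similarity is computed over the sets of active latent indices of two streams, and that a "global" TopK picks one shared index set from activations summed across streams. Both are illustrative assumptions; the paper's exact formulation is not given in this excerpt, and all names below are hypothetical.

```python
def topk_indices(values, k):
    """Indices of the k largest entries, as a set."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def global_topk(streams, k):
    """One shared index set, chosen from activations summed over streams."""
    summed = [sum(column) for column in zip(*streams)]
    return topk_indices(summed, k)


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy latent activations for the same input from two models.
dino = [0.9, 0.1, 0.8, 0.0]
clip = [0.2, 0.7, 0.9, 0.1]

# Independent TopK: each stream picks its own dimensions.
print(jaccard(topk_indices(dino, 2), topk_indices(clip, 2)))

# Global TopK: both streams share one index set, so alignment is exact.
shared = global_topk([dino, clip], 2)
print(shared, jaccard(shared, shared))
```

With independent selection the two streams agree on only one of three distinct dimensions, while a shared selection makes the active sets identical by construction.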
[20] A Probabilistic Approach to Uncertainty Quantification Leveraging 3D Geometry
Rushil Desai,Frederik Warburg,Trevor Darrell,Marissa Ramirez de Chanlatte
Main category: cs.CV
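TL;DR: The paper proposes BayesSDF, a probabilistic framework for uncertainty quantification in neural implicit SDF models that addresses the shortcomings of existing methods in geometric consistency and computational efficiency.
Details
Motivation: Scientific simulation applications, such as modeling fluid flow through forests, need precise 3D geometry together with uncertainty quantification, but existing methods usually neglect geometric consistency and produce poorly calibrated uncertainty.
Contribution: Proposes the BayesSDF framework, which achieves efficient, geometry-aware uncertainty quantification through a Laplace approximation and Hessian-based metrics.
Method: Uses a Laplace approximation to quantify local surface instability with Hessian-based metrics, yielding surface-aware uncertainty estimates.
Result: On synthetic and real-world datasets, BayesSDF outperforms existing methods in calibration and geometric consistency and provides reliable uncertainty measures for downstream tasks.
Insight: Because SDFs are continuous and differentiable, they suit physical modeling better than radiance field models, and geometry-aware uncertainty quantification is essential for scientific simulation and robotic decision-making.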
Abstract: Quantifying uncertainty in neural implicit 3D representations, particularly those utilizing Signed Distance Functions (SDFs), remains a substantial challenge due to computational inefficiencies, scalability issues, and geometric inconsistencies. Existing methods typically neglect direct geometric integration, leading to poorly calibrated uncertainty maps. We introduce BayesSDF, a novel probabilistic framework for uncertainty quantification in neural implicit SDF models, motivated by scientific simulation applications in 3D environments, such as modeling fluid flow through forests, where precise surface geometry and awareness of the uncertainty in that geometry are essential. Unlike radiance-based models such as NeRF or 3D Gaussian splatting, which lack explicit surface formulations, SDFs define continuous and differentiable geometry, making them better suited for physical modeling and analysis. BayesSDF leverages a Laplace approximation to quantify local surface instability via Hessian-based metrics, enabling computationally efficient, surface-aware uncertainty estimation. Our method shows that uncertainty predictions correspond closely with poorly reconstructed geometry, providing actionable confidence measures for downstream use. Extensive evaluations on synthetic and real-world datasets demonstrate that BayesSDF outperforms existing methods in both calibration and geometric consistency, establishing a strong foundation for uncertainty-aware 3D scene reconstruction, simulation, and robotic decision-making.
[21] LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
Zhang Li,Biao Yang,Qiang Liu,Shuo Zhang,Zhiyin Ma,Shuo Zhang,Liang Yin,Linger Deng,Yabo Sun,Yuliang Liu,Xiang Bai
Main category: cs.CV
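TL;DR: LIRA combines a semantic-enhanced feature extractor with interleaved local visual coupling to address the limitations of large multi-modal models in segmentation and visual comprehension, improving segmentation accuracy and reducing hallucinations.
Details
Motivation: Large multi-modal models (LMMs) suffer from inaccurate segmentation and hallucinated comprehension, mainly because of weak visual comprehension and a lack of fine-grained perception.
Contribution: 1. Proposes the LIRA framework, which improves segmentation and comprehension accuracy through a Semantic-Enhanced Feature Extractor (SEFE) and Interleaved Local Visual Coupling (ILVC). 2. Introduces the Attributes Evaluation (AttrEval) dataset to quantify semantic inference ability.
Method: 1. SEFE fuses semantic and pixel-level features to improve object attribute inference. 2. ILVC generates local descriptions based on segmentation masks, providing fine-grained supervision. 3. Studies the correlation between segmentation precision and semantic relations.
Result: LIRA achieves state-of-the-art performance on segmentation and comprehension tasks.
Insight: Segmentation precision is positively correlated with latent semantic relations, and fine-grained supervision effectively reduces hallucinations.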
Abstract: While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the
[22] Advancing Offline Handwritten Text Recognition: A Systematic Review of Data Augmentation and Generation Techniques
Yassin Hussein Rassul,Aram M. Ahmed,Polla Fattah,Bryar A. Hassan,Arwaa W. Abdulkareem,Tarik A. Rashid,Joan Lu
Main category: cs.CV
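TL;DR: This paper surveys data augmentation and generation techniques for offline handwritten text recognition, discusses the strengths and weaknesses of traditional methods and deep learning methods (such as GANs, diffusion models, and transformer-based approaches), and proposes directions for future research.
Details
Motivation: Offline handwritten text recognition (HTR) is widely used in areas such as historical document digitization, but scarce annotated data limits its performance, especially for low-resource languages and complex scripts. This paper addresses that problem.
Contribution: 1. Systematically reviews data augmentation and generation techniques for offline handwriting; 2. Analyzes the strengths and weaknesses of traditional and deep learning methods; 3. Identifies challenges in data scarcity and script authenticity and proposes future directions.
Method: Follows the PRISMA methodology, narrowing 1,302 primary studies to 848 after removing duplicates, and evaluates traditional augmentation techniques (such as geometric transformations) and deep learning methods (such as GANs and diffusion models).
Result: Summarizes the limitations of existing datasets, points out shortcomings in evaluation metrics, and suggests ways to improve the diversity of generative models.
Insight: Generating realistic, diverse handwriting samples is essential for improving HTR performance, and future work should focus on multilingual support and the generation of complex scripts.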
Abstract: Offline Handwritten Text Recognition (HTR) systems play a crucial role in applications such as historical document digitization, automatic form processing, and biometric authentication. However, their performance is often hindered by the limited availability of annotated training data, particularly for low-resource languages and complex scripts. This paper presents a comprehensive survey of offline handwritten data augmentation and generation techniques designed to improve the accuracy and robustness of HTR systems. We systematically examine traditional augmentation methods alongside recent advances in deep learning, including Generative Adversarial Networks (GANs), diffusion models, and transformer-based approaches. Furthermore, we explore the challenges associated with generating diverse and realistic handwriting samples, particularly in preserving script authenticity and addressing data scarcity. This survey follows the PRISMA methodology, ensuring a structured and rigorous selection process. Our analysis began with 1,302 primary studies, which were filtered down to 848 after removing duplicates, drawing from key academic sources such as IEEE Digital Library, Springer Link, Science Direct, and ACM Digital Library. By evaluating existing datasets, assessment metrics, and state-of-the-art methodologies, this survey identifies key research gaps and proposes future directions to advance the field of handwritten text generation across diverse linguistic and stylistic landscapes.
[23] When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking
Weiran Li,Yeqiang Liu,Qiannan Guo,Yijie Wei,Hwa Liang Leo,Zhenbo Li
Main category: cs.CV
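TL;DR: The paper presents MFT25, the first dataset designed specifically for underwater multiple fish tracking, and develops SU-T, a tracking framework based on an Unscented Kalman Filter and FishIoU matching that substantially improves underwater fish tracking.
Details
Motivation: Multiple object tracking on land is mature, but underwater scenes remain underexplored because of complex environments and the distinctive motion patterns of fish, even though they matter for marine ecology and aquaculture.
Contribution: 1. Releases MFT25, the first dataset for underwater multiple fish tracking; 2. Proposes the Scale-aware and Unscented Tracker (SU-T) framework, optimized for tracking non-linear fish motion; 3. Proposes the FishIoU matching method, adapted to the morphological characteristics of fish.
Method: SU-T uses an Unscented Kalman Filter to handle non-linear motion and a FishIoU matching method designed to cope with fish morphology and occlusion.
Result: SU-T reaches 34.1 HOTA and 44.6 IDF1 on MFT25, the best reported performance.
Insight: Underwater fish tracking differs markedly from terrestrial object tracking and needs methods designed around fish morphology and motion patterns.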
Abstract: Multiple object tracking (MOT) technology has made significant progress in terrestrial applications, but underwater tracking scenarios remain underexplored despite their importance to marine ecology and aquaculture. We present Multiple Fish Tracking Dataset 2025 (MFT25), the first comprehensive dataset specifically designed for underwater multiple fish tracking, featuring 15 diverse video sequences with 408,578 meticulously annotated bounding boxes across 48,066 frames. Our dataset captures various underwater environments, fish species, and challenging conditions including occlusions, similar appearances, and erratic motion patterns. Additionally, we introduce Scale-aware and Unscented Tracker (SU-T), a specialized tracking framework featuring an Unscented Kalman Filter (UKF) optimized for non-linear fish swimming patterns and a novel Fish-Intersection-over-Union (FishIoU) matching that accounts for the unique morphological characteristics of aquatic species. Extensive experiments demonstrate that our SU-T baseline achieves state-of-the-art performance on MFT25, with 34.1 HOTA and 44.6 IDF1, while revealing fundamental differences between fish tracking and terrestrial object tracking scenarios. MFT25 establishes a robust foundation for advancing research in underwater tracking systems with important applications in marine biology, aquaculture monitoring, and ecological conservation. The dataset and codes are released at https://vranlee.github.io/SU-T/.
[24] SImpHAR: Advancing impedance-based human activity recognition using 3D simulation and text-to-motion models
Lala Shakti Swarup Ray,Mengxi Liu,Deepika Gurung,Bo Zhou,Sungho Suh,Paul Lukowicz
Main category: cs.CV
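TL;DR: SImpHAR is a new framework for human activity recognition based on bio-impedance sensing. It generates synthetic data through 3D simulation and text-to-motion models to overcome the scarcity of labeled data, and it clearly outperforms existing methods.
Details
Motivation: Bio-impedance sensing offers unique advantages for fine-grained motion capture, but the lack of labeled data limits its use. SImpHAR aims to overcome this limitation through simulated data and modular training.
Contribution: 1. Proposes a simulation pipeline that generates realistic bio-impedance signals from 3D human meshes; 2. Designs a two-stage training strategy that broadens activity coverage without requiring label-aligned synthetic data.
Method: 1. Uses shortest-path estimation, soft-body physics, and text-to-motion generation as a digital twin; 2. Applies a two-stage training strategy that decouples the use of synthetic data.
Result: On the ImpAct dataset and two public benchmarks, SImpHAR improves accuracy and macro F1 score by up to 22.3% and 21.8%, respectively.
Insight: Simulation-driven data augmentation and modular training show strong potential for impedance-based human activity recognition and suggest a new approach for data-scarce domains.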
Abstract: Human Activity Recognition (HAR) with wearable sensors is essential for applications in healthcare, fitness, and human-computer interaction. Bio-impedance sensing offers unique advantages for fine-grained motion capture but remains underutilized due to the scarcity of labeled data. We introduce SImpHAR, a novel framework addressing this limitation through two core contributions. First, we propose a simulation pipeline that generates realistic bio-impedance signals from 3D human meshes using shortest-path estimation, soft-body physics, and text-to-motion generation, serving as a digital twin for data augmentation. Second, we design a two-stage training strategy with a decoupled approach that enables broader activity coverage without requiring label-aligned synthetic data. We evaluate SImpHAR on our collected ImpAct dataset and two public benchmarks, showing consistent improvements over state-of-the-art methods, with gains of up to 22.3% and 21.8% in terms of accuracy and macro F1 score, respectively. Our results highlight the promise of simulation-driven augmentation and modular training for impedance-based HAR.
[25] Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization
Hayat Ullah,Arslan Munir,Oliver Nina
Main category: cs.CV
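TL;DR: The paper proposes PCL-Former, a hierarchical multi-stage transformer architecture for context-aware temporal action localization. Three dedicated transformer modules handle candidate segment identification, action classification, and temporal boundary prediction, yielding a clear performance gain.
Details
Motivation: Inspired by the success of transformers and multi-stage architectures in video recognition and object detection, the paper explores their potential for temporal action localization.
Contribution: Proposes PCL-Former, a hierarchical multi-stage transformer architecture in which dedicated modules handle each subtask of action localization, each trained with a specialized loss function.
Method: Uses three transformer modules: the Proposal-Former identifies candidate action segments, the Classification-Former classifies action categories, and the Localization-Former predicts the temporal boundaries of actions.
Result: Experiments on THUMOS14, ActivityNet-1.3, and HACS show that PCL-Former outperforms the previous best methods by 2.8%, 1.2%, and 4.8%, respectively.
Insight: With a modular design and specialized loss functions, transformer architectures in a multi-stage setup can markedly improve the precision and generalization of temporal action localization.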
Abstract: Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we have conducted extensive experiments on three challenging benchmark datasets: THUMOS14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, outperforming state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on THUMOS14, ActivityNet-1.3, and HACS datasets, respectively.
[26] THOR: Thermal-guided Hand-Object Reasoning via Adaptive Vision Sampling
Soroush Shahi,Farzad Shahabi,Rama Nabulsi,Glenn Fernandes,Aggelos Katsaggelos,Nabil Alshurafa
Main category: cs.CV
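TL;DR: THOR is an adaptive RGB frame sampling method guided by thermal sensing. It adjusts the RGB sampling rate dynamically from thermal data, cutting energy use and data volume while keeping hand-object activity recognition accurate.
Details
Motivation: Continuously processing RGB images from wearable cameras brings high power consumption, large data volumes, and privacy concerns. THOR uses thermal sensing to regulate RGB sampling and monitor activity in real time more efficiently.
Contribution: 1. Proposes an adaptive spatio-temporal RGB frame sampling method that uses thermal sensing to detect activity switches; 2. Localizes the hand-object interaction region from thermal cues so that only the relevant part of the image is processed; 3. Validates the method in the wild, greatly reducing data volume (only 3% of the RGB data is needed) with almost no loss in recognition accuracy.
Method: Uses low-resolution thermal data to adjust the RGB sampling rate dynamically, sampling more during activity transitions; uses thermal cues to localize the hand-object interaction region, then crops and processes only that part of the image.
Result: Using only 3% of the original RGB data, THOR captures all activity segments and reaches a hand-related activity recognition F1-score of 95%, comparable to using the entire RGB video (94%).
Insight: Thermal sensing can serve as an efficient trigger that greatly reduces the data processing load on wearable devices while keeping activity recognition accurate, offering a practical path to long-term use of wearable cameras.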
Abstract: Wearable cameras are increasingly used as an observational and interventional tool for human behaviors by providing detailed visual data of hand-related activities. This data can be leveraged to facilitate memory recall for logging of behavior or timely interventions aimed at improving health. However, continuous processing of RGB images from these cameras consumes significant power impacting battery lifetime, generates a large volume of unnecessary video data for post-processing, raises privacy concerns, and requires substantial computational resources for real-time analysis. We introduce THOR, a real-time adaptive spatio-temporal RGB frame sampling method that leverages thermal sensing to capture hand-object patches and classify them in real-time. We use low-resolution thermal camera data to identify moments when a person switches from one hand-related activity to another, and adjust the RGB frame sampling rate by increasing it during activity transitions and reducing it during periods of sustained activity. Additionally, we use the thermal cues from the hand to localize the region of interest (i.e., the hand-object interaction) in each RGB frame, allowing the system to crop and process only the necessary part of the image for activity recognition. We develop a wearable device to validate our method through an in-the-wild study with 14 participants and over 30 activities, and further evaluate it on Ego4D (923 participants across 9 countries, totaling 3,670 hours of video). Our results show that using only 3% of the original RGB video data, our method captures all the activity segments, and achieves hand-related activity recognition F1-score (95%) comparable to using the entire RGB video (94%). Our work provides a more practical path for the longitudinal use of wearable cameras to monitor hand-related activities and health-risk behaviors in real time.
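The temporal half of the sampling scheme described above can be sketched as a simple rule: compare consecutive low-resolution thermal frames and raise the RGB frame rate when they differ enough to suggest an activity transition. The change measure, threshold, and frame rates below are all hypothetical values chosen for illustration; the abstract does not specify how THOR detects transitions.

```python
def mean_abs_change(prev_frame, curr_frame):
    """Mean absolute per-pixel difference between two thermal frames,
    each given as a flat list of temperature readings."""
    return sum(abs(a - b) for a, b in zip(prev_frame, curr_frame)) / len(curr_frame)


def rgb_sampling_rate(prev_frame, curr_frame,
                      low_fps=1.0, high_fps=10.0, threshold=2.0):
    """Pick an RGB frame rate from thermal change.

    Assumption: a large thermal change marks an activity transition,
    so sample densely; otherwise sample sparsely.
    """
    if mean_abs_change(prev_frame, curr_frame) > threshold:
        return high_fps
    return low_fps


steady = [30.0, 30.5, 31.0, 30.0]
moved = [30.0, 36.0, 25.0, 30.0]

print(rgb_sampling_rate(steady, steady))  # sustained activity
print(rgb_sampling_rate(steady, moved))   # likely transition
```

The spatial half of the method, cropping the RGB frame to the hand-object region found from thermal cues, is not covered by this sketch.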
[27] EA: An Event Autoencoder for High-Speed Vision Sensing
Riadul Islam,Joey Mulé,Dhandeep Challagundla,Shahmir Rizvi,Sean Carson
Main category: cs.CV
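TL;DR: This paper proposes an Event Autoencoder (EA) that efficiently compresses and reconstructs event-camera data while preserving key spatial and temporal features, addressing object detection on sparse, noisy event streams in highly dynamic environments. It matches YOLO-v4 accuracy with far fewer parameters and runs at high frame rates on embedded hardware.
Details
Motivation: Conventional frame-based vision systems suffer from motion blur, high latency, and redundant data in dynamic environments. Event cameras capture brightness changes asynchronously, but their sparse, noisy event streams make object detection difficult. Contribution: An event autoencoder architecture that combines adaptive threshold selection with a lightweight classifier to improve both the processing efficiency and the recognition accuracy of event data.
Method: A convolutional encoder, with adaptive threshold selection and a lightweight classifier, compresses and reconstructs event data while reducing computational complexity.
Result: On the SEFD dataset, EA achieves accuracy comparable to YOLO-v4 with up to 35.5× fewer parameters. On embedded platforms it runs at 8 to 44.8 FPS, and the classifier delivers up to 87.84× higher FPS than the state of the art.
Insight: The event autoencoder offers an efficient solution for low-power, highly dynamic edge-computing scenarios, balancing accuracy against efficiency.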
Abstract: High-speed vision sensing is essential for real-time perception in applications such as robotics, autonomous vehicles, and industrial automation. Traditional frame-based vision systems suffer from motion blur, high latency, and redundant data processing, limiting their performance in dynamic environments. Event cameras, which capture asynchronous brightness changes at the pixel level, offer a promising alternative but pose challenges in object detection due to sparse and noisy event streams. To address this, we propose an event autoencoder architecture that efficiently compresses and reconstructs event data while preserving critical spatial and temporal features. The proposed model employs convolutional encoding and incorporates adaptive threshold selection and a lightweight classifier to enhance recognition accuracy while reducing computational complexity. Experimental results on the existing Smart Event Face Dataset (SEFD) demonstrate that our approach achieves comparable accuracy to the YOLO-v4 model while utilizing up to 35.5× fewer parameters. Implementations on embedded platforms, including Raspberry Pi 4B and NVIDIA Jetson Nano, show high frame rates ranging from 8 FPS up to 44.8 FPS. The proposed classifier exhibits up to 87.84× better FPS than the state-of-the-art and significantly improves event-based vision performance, making it ideal for low-power, high-speed applications in real-time edge computing.
[28] Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
Ziyang Wang,Jaehong Yoon,Shoubin Yu,Md Mohaiminul Islam,Gedas Bertasius,Mohit Bansal
Main category: cs.CV
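TL;DR: Video-RTS combines data-efficient RL training with a video-adaptive test-time scaling (TTS) strategy to improve video reasoning, achieving high data efficiency without a resource-intensive SFT step.
Details
Motivation: Existing RL-based video reasoning methods rely on large-scale supervised fine-tuning and long chain-of-thought annotations, which makes them costly and hard to scale. Contribution: 1. A data-efficient pure-RL training method; 2. A sparse-to-dense video test-time scaling strategy; 3. Improved reasoning performance with far less training data.
Method: 1. Pure-RL training with output-based rewards; 2. An iterative sparse-to-dense TTS strategy that adds frames dynamically based on output consistency.
Result: Across multiple video reasoning benchmarks, Video-RTS improves average accuracy by 2.4% while using only 3.6% of the training samples, with gains of 4.2% on Video-Holmes and 2.6% on MMVU.
Insight: Pure-RL training and adaptive TTS are complementary, suggesting a new route to efficient video reasoning.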
Abstract: Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy using only 3.6% training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS’s strong reasoning performance.
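The sparse-to-dense test-time scaling idea (start from a few frames and add more only when the model's sampled answers disagree) might look something like the sketch below. The abstract does not spell out the consistency criterion or the frame schedule, so the unanimity check, the doubling schedule, the majority-vote fallback, and the `answer_fn` callable are all assumptions made for illustration.

```python
from collections import Counter


def sparse_to_dense_tts(answer_fn, start_frames=8, max_frames=64, num_samples=5):
    """Densify the frame sampling until sampled answers agree.

    answer_fn(num_frames, sample_index) is a hypothetical stand-in for one
    stochastic forward pass of a video reasoning model.
    Returns (answer, frames_used).
    """
    num_frames = start_frames
    while True:
        answers = [answer_fn(num_frames, i) for i in range(num_samples)]
        top_answer, votes = Counter(answers).most_common(1)[0]
        # Stop when every sample agrees, or when the frame budget is exhausted
        # (in which case the majority answer is returned).
        if votes == num_samples or num_frames >= max_frames:
            return top_answer, num_frames
        num_frames = min(num_frames * 2, max_frames)


def toy_model(num_frames, sample_index):
    """Toy model: answers are inconsistent until it sees at least 32 frames."""
    if num_frames >= 32:
        return "A"
    return "A" if sample_index % 2 == 0 else "B"


answer, frames_used = sparse_to_dense_tts(toy_model)
```

Easy questions exit early with few frames, so extra inference compute is spent only on inputs where the sparse view is ambiguous.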
[29] Mask6D: Masked Pose Priors For 6D Object Pose Estimation
Yuechen Xie,Haobo Jiang,Jin Xie
Main category: cs.CV
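TL;DR: Mask6D introduces a pre-training strategy for 6D object pose estimation that combines 2D-3D correspondence maps with visible mask maps, improving pose estimation in occluded or cluttered scenes.
Details
Motivation: Current 6D pose estimation methods based on monocular RGB images perform poorly when the target is occluded or the scene is cluttered, because 2D feature backbones struggle to extract discriminative pose features. Mask6D addresses this by introducing additional modal information. Contribution: 1. A pre-training strategy that combines 2D-3D correspondence maps and visible mask maps; 2. An object-focused pre-training loss that reduces background interference; 3. Reliable pose prediction via fine-tuning.
Method: Mask6D uses 2D-3D correspondence maps (which map the 3D model to 2D pixels) and visible mask maps as additional inputs alongside RGB images for reconstruction-based pre-training, and optimizes the network with an object-focused loss.
Result: Experiments show that Mask6D outperforms existing end-to-end methods on 6D pose estimation, particularly in occluded and cluttered scenes.
Insight: Pose-aware multimodal information such as 2D-3D correspondence maps can substantially improve the robustness of pose estimation, especially in complex scenes.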
Abstract: Robust 6D object pose estimation in cluttered or occluded conditions using monocular RGB images remains a challenging task. One reason is that current pose estimation networks struggle to extract discriminative, pose-aware features using 2D feature backbones, especially when the available RGB information is limited due to target occlusion in cluttered scenes. To mitigate this, we propose a novel pose estimation-specific pre-training strategy named Mask6D. Our approach incorporates pose-aware 2D-3D correspondence maps and visible mask maps as additional modal information, which is combined with RGB images for the reconstruction-based model pre-training. Essentially, this 2D-3D correspondence maps a transformed 3D object model to 2D pixels, reflecting the pose information of the target in camera coordinate system. Meanwhile, the integrated visible mask map can effectively guide our model to disregard cluttered background information. In addition, an object-focused pre-training loss function is designed to further facilitate our network to remove the background interference. Finally, we fine-tune our pre-trained pose prior-aware network via conventional pose training strategy to realize the reliable pose prediction. Extensive experiments verify that our method outperforms previous end-to-end pose estimation methods.
[30] Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
Yupeng Hu,Changxing Ding,Chang Sun,Shaoli Huang,Xiangmin Xu
Main category: cs.CV
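TL;DR: This paper proposes a bilateral collaboration framework (BC-HOI) that uses Attention Bias Guidance (ABG) and LLM-based Supervision Guidance (LSG) for open vocabulary human-object interaction detection, addressing the overly coarse-grained visual features used by existing methods.
Details
Motivation: Open vocabulary HOI detection must detect every <human, verb, object> triplet of interest in an image, but existing methods rely on visual features from large vision-language models that are holistic and coarse-grained, at odds with the fine-grained needs of detection. Contribution: 1. The BC-HOI framework, which combines ABG and LSG to produce fine-grained instance-level interaction features; 2. Fine-grained token-level supervision from the LLM, which improves interaction detection.
Method: 1. ABG uses attention bias to guide the vision-language model toward fine-grained features; 2. LSG uses the large language model to provide fine-grained supervision signals for the HOI detector.
Result: On the HICO-DET and V-COCO benchmarks, BC-HOI performs strongly in both open vocabulary and closed settings.
Insight: Collaboration between the vision-language model and the large language model can improve open vocabulary interaction detection, particularly through fine-grained feature generation.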
Abstract: Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which provides fine-grained token-level supervision for the HOI detector by the LLM component of the VLM. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks, HICO-DET and V-COCO, consistently achieving superior performance in the open vocabulary and closed settings. The code will be released on GitHub.
[31] What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies
Yaoqi Huang,Julie Stephany Berrio,Mao Shan,Stewart Worrall
Main category: cs.CV
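TL;DR: This survey systematically summarizes the visual elements that demand attention in traffic scenes, proposes a new taxonomy, and analyzes 35 vision tasks and 73 datasets, with the aim of advancing road safety research.
Details
Motivation: To bring advances in computer vision to bear on road safety, the survey integrates key elements from multiple domains into a unified taxonomy and analysis that help researchers select resources more efficiently. Contribution: A novel taxonomy that divides attention-worthy traffic entities into two groups, anomalies and normal-but-critical entities, spanning 10 categories and 20 subclasses; and a comprehensive analysis of 35 vision tasks and 73 datasets.
Method: A systematic literature survey and taxonomy construction that classifies the visual elements in traffic scenes and analyzes the related tasks and datasets across domains.
Result: The survey summarizes weaknesses in existing research, stresses the need for unified standards and resource optimization, and points out future research directions.
Insight: The integrated taxonomy and cross-domain analysis give road safety research a unified framework that helps fill research gaps and optimize resource allocation.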
Abstract: Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups that are anomalies and normal but critical entities, integrating ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. Our survey highlights the analysis of 35 vision-driven tasks and comprehensive examinations and visualizations of 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.
[32] FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation
Liqiang Jing,Viet Lai,Seunghyun Yoon,Trung Bui,Xinya Du
Main category: cs.CV
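TL;DR: This paper proposes FIFA, a framework for unified faithfulness evaluation of text-to-video and video-to-text generation, addressing the limitations of existing methods, which target a single task and cannot assess hallucinations in open-ended responses.
Details
Motivation: Video multimodal large language models (VideoMLLMs) perform well on video-to-text and text-to-video tasks but often hallucinate, generating content that contradicts the visual input. Existing evaluation methods cover only a single task and cannot assess hallucinations in open-ended responses. Contribution: 1. FIFA, a unified faithfulness evaluation framework covering video-to-text and text-to-video tasks; 2. A Spatio-Temporal Semantic Dependency Graph that models semantic relationships; 3. Post-Correction, a tool-based framework that revises hallucinated content.
Method: FIFA extracts comprehensive descriptive facts, models their semantic relationships with a Spatio-Temporal Semantic Dependency Graph, and verifies their faithfulness with video question answering (VideoQA) models. Post-Correction then uses tools to revise hallucinated content.
Result: Experiments show that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction improves factual consistency in both text and video generation.
Insight: FIFA offers a unified way to evaluate and correct multimodal generation, highlighting the importance of semantic dependency modeling and tool-assisted correction.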
Abstract: Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.
[33] Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation
Kazi Mahathir Rahman,Naveed Imtiaz Nafis,Md. Farhan Sadik,Mohammad Al Rafi,Mehedi Hasan Shahed
Main category: cs.CV
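TL;DR: This paper proposes Speak2Sign3D, a multimodal pipeline that converts English speech into fluent 3D American Sign Language animation by combining speech recognition, text-to-gloss translation, and animation generation, addressing the largely overlooked speech-to-sign direction.
Details
Motivation: The goal is to help deaf and hard-of-hearing people communicate more easily, filling the gap in speech-to-sign-animation research and tackling the technical challenges of a multi-step conversion. Contribution: 1. A complete speech-to-sign-animation pipeline; 2. Two new datasets, BookGlossCorpus-CG and Sign3D-WLASL; 3. Accurate speech-to-text with Whisper and text-to-gloss translation with MarianMT; 4. Keypoint-based 3D animation generation.
Method: 1. Whisper converts speech to text; 2. MarianMT translates the text into ASL gloss; 3. Word2Vec and FastText embeddings refine the gloss translation; 4. 3D keypoint animation is generated using the Sign3D-WLASL dataset.
Result: The translation model reaches BLEU scores of 0.7714 and 0.8923, and the system generates fluent 3D sign language animation.
Insight: Combining modalities (speech, text, animation) with high-quality datasets is key to improving sign language translation systems; keypoint-based animation yields more natural motion for sign generation.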
Abstract: Helping deaf and hard-of-hearing people communicate more easily is the main goal of Automatic Sign Language Translation. Although most past research has focused on turning sign language into text, doing the reverse, turning spoken English into sign language animations, has been largely overlooked. That’s because it involves multiple steps, such as understanding speech, translating it into sign-friendly grammar, and generating natural human motion. In this work, we introduce a complete pipeline that converts English speech into smooth, realistic 3D sign language animations. Our system starts with Whisper to translate spoken English into text. Then, we use a MarianMT machine translation model to translate that text into American Sign Language (ASL) gloss, a simplified version of sign language that captures meaning without grammar. This model performs well, reaching BLEU scores of 0.7714 and 0.8923. To make the gloss translation more accurate, we also use word embeddings such as Word2Vec and FastText to understand word meanings. Finally, we animate the translated gloss using a 3D keypoint-based motion system trained on Sign3D-WLASL, a dataset we created by extracting body, hand, and face key points from real ASL videos in the WLASL dataset. To support the gloss translation stage, we also built a new dataset called BookGlossCorpus-CG, which turns everyday English sentences from the BookCorpus dataset into ASL gloss using grammar rules. Our system stitches everything together by smoothly interpolating between signs to create natural, continuous animations. Unlike previous works like How2Sign and Phoenix-2014T that focus on recognition or use only one type of data, our pipeline brings together audio, text, and motion in a single framework that goes all the way from spoken English to lifelike 3D sign language animation.
[34] ILNet: Trajectory Prediction with Inverse Learning Attention for Enhancing Intention Capture
Mingjin Zeng,Nan Ouyang,Wenkang Wan,Lei Ao,Qing Cai,Kai Sheng
Main category: cs.CV
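TL;DR: ILNet is a multi-agent trajectory prediction method that combines inverse learning attention with a dynamic anchor selection module, improving intention capture and prediction accuracy and reaching state-of-the-art performance on multiple datasets.
Details
Motivation: Existing methods lack dynamic modeling of spatio-temporal coordination when capturing interaction intentions, and fixed-anchor strategies adapt poorly to different future environments. Inspired by human driving behavior, the authors use inverse learning attention and dynamic anchor selection to improve trajectory prediction. Contribution: 1. Inverse Learning (IL) Attention, which dynamically encodes the spatio-temporal coordination of interactions; 2. A Dynamic Anchor Selection (DAS) module that efficiently extracts key trajectory anchors.
Method: 1. IL Attention models interactions at neighboring moments through inverse learning and dynamically encodes intentions; 2. The learnable DAS module extracts trajectory change keypoints as anchors.
Result: ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse datasets, with higher accuracy and more multimodal trajectory distributions in complex interaction scenarios.
Insight: Dynamically modeling interaction intentions and selecting anchors flexibly are key to better trajectory prediction; inverse learning strengthens the model's ability to capture intentions.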
Abstract: Trajectory prediction for multi-agent interaction scenarios is a crucial challenge. Most advanced methods model agent interactions by efficiently factorized attention based on the temporal and agent axes. However, this static and forward modeling lacks explicit interactive spatio-temporal coordination, capturing only obvious and immediate behavioral intentions. Alternatively, the modern trajectory prediction framework refines the successive predictions by a fixed-anchor selection strategy, which is difficult to adapt to different future environments. It is acknowledged that human drivers dynamically adjust initial driving decisions based on further assumptions about the intentions of surrounding vehicles. Motivated by human driving behaviors, this paper proposes ILNet, a multi-agent trajectory prediction method with Inverse Learning (IL) attention and Dynamic Anchor Selection (DAS) module. IL Attention employs an inverse learning paradigm to model interactions at neighboring moments, introducing proposed intentions to dynamically encode the spatio-temporal coordination of interactions, thereby enhancing the model’s ability to capture complex interaction patterns. Then, the learnable DAS module is proposed to extract multiple trajectory change keypoints as anchors in parallel with almost no increase in parameters. Experimental results show that ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse motion forecasting datasets. In particular, in challenging interaction scenarios, ILNet achieves higher accuracy and more multimodal distributions of trajectories with fewer parameters. Our code is available at https://github.com/mjZeng11/ILNet.
[35] A model-agnostic active learning approach for animal detection from camera traps
Thi Thu Thuy Nguyen,Duc Thanh Nguyen
Main category: cs.CV
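TL;DR: The paper proposes a model-agnostic active learning method for animal detection in camera-trap data that combines uncertainty and diversity measures to substantially reduce the amount of labelled data required.
Details
Motivation: Camera traps capture huge volumes of wildlife data, making labelling and model training costly. Existing active learning methods require full access to the model, which limits their applicability. Contribution: A model-agnostic active learning method that selects samples using uncertainty and diversity measures at both the object and image levels, substantially reducing labelling requirements.
Method: Uncertainty and diversity measures at the object and image levels are integrated into the sample selection strategy, so that selection does not depend on model internals.
Result: Experiments show that training on only 30% of the training data, selected by the method, matches or exceeds the detection performance obtained with the full training set.
Insight: Model-agnostic active learning reduces labelling cost while preserving performance, offering an efficient solution for wildlife monitoring.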
Abstract: Smart data selection is becoming increasingly important in data-driven machine learning. Active learning offers a promising solution by allowing machine learning models to be effectively trained with optimal data including the most informative samples from large datasets. Wildlife data captured by camera traps are excessive in volume, requiring tremendous effort in labelling data and training animal detection models. Therefore, applying active learning to optimise the amount of labelled data would be a great aid in enabling automated wildlife monitoring and conservation. However, existing active learning techniques require that a machine learning model (i.e., an object detector) be fully accessible, limiting the applicability of the techniques. In this paper, we propose a model-agnostic active learning approach for detection of animals captured by camera traps. Our approach integrates uncertainty and diversity quantities of samples at both the object-based and image-based levels into the active learning sample selection process. We validate our approach on a benchmark animal dataset. Experimental results demonstrate that, using only 30% of the training data selected by our approach, a state-of-the-art animal detector can achieve performance equal to or greater than that obtained with the complete training dataset.
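A selection rule that mixes uncertainty and diversity while touching only a detector's outputs could be sketched as below. The abstract does not define the actual quantities, so everything here is an assumption for illustration: uncertainty is taken as the largest binary entropy among an image's detection confidences, diversity as the distance to the nearest already-selected image in some externally computed feature space, and `alpha` as a hypothetical trade-off weight.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a detection confidence p; 0 for certain outputs."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def image_uncertainty(confidences):
    """Image-level uncertainty: the most uncertain object in the image."""
    return max((binary_entropy(p) for p in confidences), default=0.0)


def select_batch(pool, budget, alpha=0.5):
    """Greedily pick `budget` image ids to label.

    pool maps image_id -> (detection_confidences, feature_vector). Only
    detector outputs are used, never its weights or gradients.
    """
    selected = []
    remaining = dict(pool)
    while remaining and len(selected) < budget:
        def score(item):
            _, (confs, feat) = item
            unc = image_uncertainty(confs)
            # Diversity: distance to the closest image already chosen.
            div = min((math.dist(feat, pool[s][1]) for s in selected), default=0.0)
            return alpha * unc + (1 - alpha) * div
        best_id, _ = max(remaining.items(), key=score)
        selected.append(best_id)
        del remaining[best_id]
    return selected


pool = {
    "a": ([0.50], (0.0, 0.0)),   # very uncertain
    "b": ([0.55], (0.0, 0.1)),   # uncertain, but nearly identical to "a"
    "c": ([0.90], (3.0, 4.0)),   # confident, but far from "a" in feature space
}
chosen = select_batch(pool, budget=2)
```

In the toy pool, the second pick goes to the distant image rather than the near-duplicate, which is the behaviour a diversity term is meant to produce.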
[36] Token Bottleneck: One Token to Remember Dynamics
Taekyung Kim,Dongyoon Han,Byeongho Heo,Jeongeun Park,Sangdoo Yun
Main category: cs.CV
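TL;DR: The paper proposes Token Bottleneck (ToBo), a self-supervised learning framework that compresses a dynamic scene into a bottleneck token and predicts the subsequent scene, yielding compact, temporally aware visual representations.
Details
Motivation: Visual representations of dynamic scenes need to be compact and temporally aware to support tasks such as visual tracking and robotic manipulation. Existing methods often lack efficient temporal modeling. Contribution: The ToBo framework, which learns temporal representations through squeeze and expansion steps; it outperforms baselines on several tasks and is shown to be robust in real-world environments and scalable across model sizes.
Method: ToBo consists of a squeeze step (encoding the reference scene into a bottleneck token) and an expansion step (predicting the target scene). Given a few target patches as hints, the model learns dynamic transitions.
Result: ToBo outperforms baselines on tasks such as video label propagation and robot manipulation, and the pre-trained model also proves effective on physical robots in real-world environments.
Insight: ToBo achieves efficient temporal modeling with a simple design, showing the potential of self-supervised learning for dynamic scene understanding across model scales.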
Abstract: Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.
[37] Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution
Yonghyun Park,Chieh-Hsin Lai,Satoshi Hayakawa,Yuhta Takida,Naoki Murata,Wei-Hsiang Liao,Woosung Choi,Kin Wai Cheuk,Junghyun Koo,Yuki Mitsufuji
Main category: cs.CV
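TL;DR: The paper proposes Concept-TRAK, which uses concept-level attribution to understand how diffusion models learn concepts, improving on existing attribution methods by focusing on specific elements such as styles or objects.
Details
Motivation: As diffusion models are widely adopted for image generation, copyright and model transparency have become key challenges. Existing attribution methods can only identify training samples that influence an entire image and cannot focus on specific elements. Contribution: Concept-TRAK, which achieves concept-level attribution through diffusion posterior sampling and a concept-aware reward function and outperforms existing methods.
Method: 1. A training loss reformulated on the basis of diffusion posterior sampling, enabling robust sample-specific attribution; 2. A concept-aware reward function that emphasizes semantic relevance.
Result: Concept-TRAK substantially outperforms prior methods on the AbC benchmark, and case studies demonstrate its practical value for copyright protection, unsafe-content analysis, and compositional learning.
Insight: Concept-level attribution yields actionable insights for the responsible development and governance of generative AI, helping to address transparency and copyright concerns.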
Abstract: While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that matter most to stakeholders. To bridge this gap, we introduce concept-level attribution via a novel method called Concept-TRAK. Concept-TRAK extends influence functions with two key innovations: (1) a reformulated diffusion training loss based on diffusion posterior sampling, enabling robust, sample-specific attribution; and (2) a concept-aware reward function that emphasizes semantic relevance. We evaluate Concept-TRAK on the AbC benchmark, showing substantial improvements over prior methods. Through diverse case studies, ranging from identifying IP-protected and unsafe content to analyzing prompt engineering and compositional learning, we demonstrate how concept-level attribution yields actionable insights for responsible generative AI development and governance.
[38] Divergence-Based Similarity Function for Multi-View Contrastive Learning
Jae Hyoung Jeon,Cheolsu Lim,Myungjoo Kang
Main category: cs.CV
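TL;DR: This paper proposes a divergence-based similarity function (DSF) for multi-view contrastive learning that explicitly captures the joint structure across all views by representing each set of augmented views as a distribution and measuring similarity as the divergence between distributions. Experiments show that DSF performs well and efficiently across tasks and needs no temperature hyperparameter.
Details
Motivation: Existing multi-view contrastive learning methods incorporate multiple views at the loss or feature level but capture only pairwise relationships, failing to model the joint structure across all views. Contribution: DSF, which explicitly models the joint distributional structure of multiple views; a theoretical connection between DSF and cosine similarity; and evidence that DSF works effectively without a temperature hyperparameter.
Method: Each set of augmented views is represented as a distribution, and similarity is measured by the divergence between distributions, capturing the joint structure across views.
Result: DSF performs well and efficiently on tasks such as kNN classification and linear evaluation, outperforming other multi-view methods.
Insight: A distribution-level similarity function models multi-view relationships more completely and removes the need for temperature tuning.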
Abstract: Recent success in contrastive learning has sparked growing interest in more effectively leveraging multiple augmented views of an instance. While prior methods incorporate multiple views at the loss or feature level, they primarily capture pairwise relationships and fail to model the joint structure across all views. In this work, we propose a divergence-based similarity function (DSF) that explicitly captures the joint structure by representing each set of augmented views as a distribution and measuring similarity as the divergence between distributions. Extensive experiments demonstrate that DSF consistently improves performance across various tasks, including kNN classification and linear evaluation, while also offering greater efficiency compared to other multi-view methods. Furthermore, we establish a theoretical connection between DSF and cosine similarity, and show that, unlike cosine similarity, DSF operates effectively without requiring a temperature hyperparameter.
[39] Edge-Boundary-Texture Loss: A Tri-Class Generalization of Weighted Binary Cross-Entropy for Enhanced Edge Detection
Hao Shu
Main category: cs.CV
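TL;DR: The paper proposes EBT loss, a loss function that divides pixels into edge, boundary, and texture classes and assigns each a distinct weight to improve edge detection precision and boundary localization. Experiments show it outperforms the widely used WBCE loss without extensive hyperparameter tuning.
Details
Motivation: The conventional WBCE loss treats all non-edge pixels alike, ignoring structural differences near edges and producing blurred predictions. A finer-grained form of supervision is needed. Contribution: 1. EBT loss, which divides pixels into edge, boundary, and texture classes with separate weights; 2. A theoretical result that EBT loss generalizes WBCE loss; 3. Experiments validating its performance and ease of deployment.
Method: EBT loss redefines pixel weights through tri-class supervision: edge pixels receive the highest priority, boundary pixels the next, and texture pixels the lowest. Unified hyperparameters simplify tuning.
Result: Across multiple benchmarks, EBT loss outperforms WBCE loss both quantitatively and qualitatively, is robust to hyperparameter variation, and is easy to apply in practice.
Insight: Loss design should reflect the characteristics of the task, and differentiated supervision can improve model performance. EBT loss offers a more flexible and efficient optimization objective for edge detection.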
Abstract: Edge detection (ED) remains a fundamental task in computer vision, yet its performance is often hindered by the ambiguous nature of non-edge pixels near object boundaries. The widely adopted Weighted Binary Cross-Entropy (WBCE) loss treats all non-edge pixels uniformly, overlooking the structural nuances around edges and often resulting in blurred predictions. In this paper, we propose the Edge-Boundary-Texture (EBT) loss, a novel objective that explicitly divides pixels into three categories, edge, boundary, and texture, and assigns each a distinct supervisory weight. This tri-class formulation enables more structured learning by guiding the model to focus on both edge precision and contextual boundary localization. We theoretically show that the EBT loss generalizes the WBCE loss, with the latter becoming a limit case. Extensive experiments across multiple benchmarks demonstrate the superiority of the EBT loss both quantitatively and perceptually. Furthermore, the consistent use of unified hyperparameters across all models and datasets, along with robustness to their moderate variations, indicates that the EBT loss requires minimal fine-tuning and is easily deployable in practice.
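The tri-class weighting idea can be illustrated with a small per-pixel weighted cross-entropy. The abstract does not give the exact class definitions or weights, so this sketch assumes that a "boundary" pixel is a non-edge pixel within a fixed radius of an edge pixel and that "texture" covers all remaining non-edge pixels; the radius and the weight values are hypothetical. Under these assumptions, giving boundary and texture pixels the same weight reduces the loss to an ordinary two-weight WBCE, which is one simple way to see the generalization the abstract refers to.

```python
import math


def pixel_classes(labels, radius=1):
    """Label each pixel 'edge', 'boundary' (non-edge near an edge), or 'texture'.

    The neighbourhood-based boundary definition is an assumption of this sketch.
    """
    h, w = len(labels), len(labels[0])
    classes = [["texture"] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if labels[y][x] == 1:
                classes[y][x] = "edge"
                continue
            near_edge = any(
                labels[ny][nx] == 1
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            )
            if near_edge:
                classes[y][x] = "boundary"
    return classes


def tri_class_bce(pred, labels, weights, radius=1, eps=1e-7):
    """Mean binary cross-entropy with one weight per pixel class."""
    classes = pixel_classes(labels, radius)
    total, count = 0.0, 0
    for y, row in enumerate(labels):
        for x, target in enumerate(row):
            p = min(max(pred[y][x], eps), 1 - eps)  # clamp for numerical safety
            bce = -(target * math.log(p) + (1 - target) * math.log(1 - p))
            total += weights[classes[y][x]] * bce
            count += 1
    return total / count


labels = [[0, 0, 1, 0, 0]]
pred = [[0.5, 0.5, 0.5, 0.5, 0.5]]
# Hypothetical weights: edge highest, boundary next, texture lowest.
ebt_like = tri_class_bce(pred, labels, {"edge": 2.0, "boundary": 1.0, "texture": 0.5})
# Equal boundary and texture weights collapse to a two-weight WBCE.
wbce_like = tri_class_bce(pred, labels, {"edge": 2.0, "boundary": 1.0, "texture": 1.0})
```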
[40] MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction
Yin Wang,Mu li,Zhiying Leng,Frederick W. B. Li,Xiaohui Liang
Main category: cs.CV
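TL;DR: MOST is a motion diffusion model that uses temporal clip Banzhaf interaction to tackle human motion generation from rare text prompts, achieving fine-grained text-motion matching and eliminating redundancy.
Details
Motivation: Existing methods suffer from coarse-grained matching and overlooked semantic cues when generating human motion for rare text prompts; MOST addresses these problems with fine-grained temporal clip interaction. Contribution: 1) Temporal clip Banzhaf interaction, the first formulation to quantify text-motion coherence at the clip level in the retrieval stage; 2) A motion prompt module that uses retrieved motion clips to generate semantically consistent motion.
Method: 1) In the retrieval stage, temporal clip Banzhaf interaction enables fine-grained text-to-motion clip matching; 2) In the generation stage, the motion prompt module uses the retrieved clips to generate motion.
Result: MOST achieves state-of-the-art performance on text-to-motion retrieval and generation, and is especially effective for rare text prompts.
Insight: Fine-grained clip interaction mitigates motion redundancy and improves motion generation quality for rare text prompts.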
Abstract: We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST’s retrieval stage presents the first formulation of its kind - temporal clip Banzhaf interaction - which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art text-to-motion retrieval and generation performance by comprehensively addressing previous challenges, as demonstrated through quantitative and qualitative results highlighting its effectiveness, especially for rare prompts.
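For readers unfamiliar with the term, the generic game-theoretic Banzhaf interaction index between two players i and j averages the synergy v(S ∪ {i, j}) − v(S ∪ {i}) − v(S ∪ {j}) + v(S) over all coalitions S of the remaining players. The sketch below computes that textbook quantity on a toy game. How MOST defines the players (for example, text tokens and motion clips) and the value function is not stated in the abstract, so the player names and the toy `value` function here are purely hypothetical.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Banzhaf interaction index of players i and j under the set function `value`.

    Averages the joint-minus-individual contribution of {i, j} over every
    coalition S drawn from the other players (2 ** (n - 2) coalitions).
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    return total / 2 ** len(others)


def value(coalition):
    """Toy game: a coalition scores 1 only if it pairs the word with its clip."""
    return 1.0 if {"word", "clip"} <= coalition else 0.0


players = ["word", "clip", "filler"]
matched = banzhaf_interaction(players, value, "word", "clip")
unrelated = banzhaf_interaction(players, value, "word", "filler")
```

A large positive index means the two players are worth more together than apart, which is the sense in which such an index can score how well a text element and a motion clip cohere. Exact enumeration is exponential in the number of players, so practical systems typically approximate it.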
[41] Ambiguity-aware Point Cloud Segmentation by Adaptive Margin Contrastive Learning
Yang Chen,Yueqi Duan,Haowen Sun,Jiwen Lu,Yap-Peng Tan
Main category: cs.CV
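TL;DR: This paper proposes a point cloud semantic segmentation method based on adaptive margin contrastive learning, addressing the performance loss caused by existing methods ignoring ambiguous regions in point clouds.
Details
Motivation: Existing methods apply the same penalty objective to every point, ignoring per-point ambiguity and weakly discriminative features in transition regions, which limits model performance. Contribution: 1. AMContrast3D, which uses contrastive learning to tailor objectives adaptively to points with different ambiguity levels; 2. AMContrast3D++, which improves the reliability of ambiguous point features through parallel branch training and a masked refinement mechanism.
Method: 1. An adaptive contrastive learning framework built on ambiguity estimation; 2. An ambiguity prediction module trained in parallel, together with a masked refinement mechanism.
Result: Experiments on the S3DIS and ScanNet datasets show that the method improves both the performance and the robustness of point cloud segmentation.
Insight: Estimating point ambiguity and optimizing adaptively can improve semantic segmentation models, especially in regions with ambiguous boundaries.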
Abstract: This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore the per-point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method that integrates contrastive learning into an ambiguity estimation framework, with adaptive objectives tailored to individual points based on their ambiguity levels. As a result, our method promotes model training that ensures the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. As ambiguities are formulated based on position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++, which integrates two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from generated embeddings. To this end, we design a masked refinement mechanism that leverages predicted ambiguities to make the ambiguous embeddings more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on 3D indoor scene datasets, S3DIS and ScanNet, demonstrate the effectiveness of the proposed method. Code is available at https://github.com/YangChenApril/AMContrast3D.
[42] Capturing Stable HDR Videos Using a Dual-Camera System
Qianyu Zhang,Bolun Zheng,Hangjia Pan,Lingyu Zhu,Zunjie Zhu,Zongpeng Li,Shiqi Wang
Main category: cs.CV
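TL;DR: The paper proposes a dual-camera system (DCS) for HDR video reconstruction in which one camera captures a stable reference sequence and the other supplies complementary information, combined with an exposure-adaptive fusion network (EAFNet) to address the flickering caused by exposure fluctuations.
Details
Motivation: In HDR video reconstruction, exposure fluctuations in alternating-exposure methods cause flickering, so a more stable solution is needed. Contribution: 1. The dual-camera system (DCS) design; 2. The exposure-adaptive fusion network (EAFNet); 3. Experiments showing that it outperforms existing methods.
Method: 1. The two cameras capture sequences with separate roles; 2. EAFNet comprises a pre-alignment subnetwork, an asymmetric cross-feature fusion subnetwork, and a multiscale reconstruction subnetwork for feature enhancement and fusion.
Result: The method achieves state-of-the-art performance on multiple datasets and reduces flickering and ghosting artifacts.
Insight: Dividing roles between two cameras and fusing exposure-adaptively can improve the stability and quality of HDR video.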
Abstract: In HDR video reconstruction, exposure fluctuations in reference images from alternating exposure methods often result in flickering. To address this issue, we propose a dual-camera system (DCS) for HDR video acquisition, where one camera is assigned to capture consistent reference sequences, while the other is assigned to capture non-reference sequences for information supplementation. To tackle the challenges posed by video data, we introduce an exposure-adaptive fusion network (EAFNet) to achieve more robust results. EAFNet introduces a pre-alignment subnetwork to explore the influence of exposure, selectively emphasizing the valuable features across different exposure levels. Then, the enhanced features are fused by the asymmetric cross-feature fusion subnetwork, which explores reference-dominated attention maps to improve image fusion by aligning cross-scale features and performing cross-feature fusion. Finally, the reconstruction subnetwork adopts a DWT-based multiscale architecture to reduce ghosting artifacts and refine features at different resolutions. Extensive experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance on different datasets, validating the great potential of the DCS in HDR video reconstruction. The code and data captured by DCS will be available at https://github.com/zqqqyu/DCS.
[43] Cross-Modal Dual-Causal Learning for Long-Term Action Recognition
Xu Shaowu,Jia Xibin,Gao Junyu,Sun Qianmei,Chang Jing,Fan Chao
Main category: cs.CV
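TL;DR: The paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which uses a structural causal model to uncover causal relationships between videos and label texts, addressing vision-language models' reliance on statistical correlations and cross-modal biases in long-term action recognition (LTAR).
Details
Motivation: Long-term action recognition is challenging because of long temporal spans, complex action correlations, and visual confounders. Existing vision-language models rely on statistical correlations rather than causal mechanisms and lack cross-modal causal modeling. The paper proposes CMDCL to address these issues. Contribution: CMDCL, which models causal relationships between videos and text labels with a structural causal model and uses textual and visual causal interventions to remove modal biases and visual confounders.
Method: CMDCL builds robust action representations by 1) applying textual causal intervention to address cross-modal biases in text embeddings and 2) applying visual causal intervention, guided by the debiased text, to remove confounders in the visual modality.
Result: The effectiveness of CMDCL is validated on three benchmarks: Charades, Breakfast, and COIN.
Insight: Causal mechanisms matter in cross-modal tasks, and dual causal interventions can improve the robustness of long-term action recognition.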
Abstract: Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable robust action representations to address LTAR challenges. Experimental results on three benchmarks, Charades, Breakfast, and COIN, demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.
[44] Omni-Fusion of Spatial and Spectral for Hyperspectral Image Segmentation
Qing Zhang,Guoquan Pei,Yan Wang
Main category: cs.CV
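TL;DR: Proposes Omni-Fuse, a method for medical hyperspectral image segmentation that uses cross-dimensional feature fusion and bidirectional attention to improve segmentation performance over existing methods.
Details
Motivation: Medical hyperspectral imaging (MHSI) is promising for disease diagnosis, but its high dimensionality and spectral redundancy make effective fusion of spatial and spectral information difficult.
Contribution: 1. The Omni-Fuse network, which fuses spatial and spectral information through a cross-dimensional enhancement module, spectral-guided spatial query selection, and a two-stage cross-dimensional decoder. 2. A clear gain on image segmentation (over 5.73% improvement in DSC).
Method: 1. Refine spatial and spectral features with bidirectional attention. 2. Select spatial queries under spectral guidance. 3. Use a two-stage cross-dimensional decoder that dynamically guides the model to focus on the selected spatial query.
Result: On two microscopic hyperspectral image datasets, Omni-Fuse outperforms existing methods, improving DSC by over 5.73%.
Insight: Cross-dimensional feature fusion and bidirectional attention can address the spatial-spectral fusion challenge in medical hyperspectral images while remaining computationally efficient.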
Abstract: Medical Hyperspectral Imaging (MHSI) has emerged as a promising tool for enhanced disease diagnosis, particularly in computational pathology, offering rich spectral information that aids in identifying subtle biochemical properties of tissues. Despite these advantages, effectively fusing spatial-dimensional and spectral-dimensional information from MHSIs remains challenging due to their inherent high dimensionality and spectral redundancy. To address these challenges, we propose a novel spatial-spectral omni-fusion network for hyperspectral image segmentation, named Omni-Fuse. We introduce a range of cross-dimensional feature fusion operations, including a cross-dimensional enhancement module that refines both spatial and spectral features through bidirectional attention mechanisms, a spectral-guided spatial query selection that selects the most spectral-related spatial feature as the query, and a two-stage cross-dimensional decoder that dynamically guides the model to focus on the selected spatial query. Despite its numerous attention blocks, Omni-Fuse remains efficient in execution. Experiments on two microscopic hyperspectral image datasets show that our approach significantly improves segmentation performance compared with state-of-the-art methods, with over 5.73 percent improvement in DSC. Code available at: https://github.com/DeepMed-Lab-ECNU/Omni-Fuse.
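The DSC figure reported above refers to the Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), computed between a predicted mask A and a ground-truth mask B. A minimal sketch for flat binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two flat binary masks (0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total
```

For example, a prediction `[1, 1, 0, 0]` against a target `[1, 0, 1, 0]` shares one foreground pixel out of four foreground pixels in total, giving a DSC of 0.5.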
[45] EXAONE Path 2.0: Pathology Foundation Model with End-to-End Supervision
Myungjang Pyeon,Janghyeon Lee,Minsoo Lee,Juseung Yun,Hwanil Choi,Jonghyun Kim,Jiwon Kim,Yi Hu,Jongseong Jang,Soonyoung Lee
Main category: cs.CV
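TL;DR: EXAONE Path 2.0 is a pathology foundation model that learns patch-level representations through end-to-end supervision, improving data efficiency and task performance.
Details
Motivation: In digital pathology, most methods train patch encoders with self-supervised learning (SSL), but SSL can overlook complex domain-specific features and is less data efficient. EXAONE Path 2.0 aims to overcome these limitations.
Contribution: Presents EXAONE Path 2.0, a pathology foundation model that learns patch representations under direct slide-level supervision.
Method: Patch-level representations are trained end to end under slide-level supervision rather than with SSL, using only 37k WSIs.
Result: Achieves state-of-the-art average performance across 10 biomarker prediction tasks, demonstrating strong data efficiency.
Insight: End-to-end supervision can outperform self-supervised learning in pathology, particularly for capturing complex features and improving data efficiency.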
Abstract: In digital pathology, whole-slide images (WSIs) are often difficult to handle due to their gigapixel scale, so most approaches train patch encoders via self-supervised learning (SSL) and then aggregate the patch-level embeddings via multiple instance learning (MIL) or slide encoders for downstream tasks. However, patch-level SSL may overlook complex domain-specific features that are essential for biomarker prediction, such as mutation status and molecular characteristics, as SSL methods rely only on basic augmentations selected for natural image domains on small patch-level area. Moreover, SSL methods remain less data efficient than fully supervised approaches, requiring extensive computational resources and datasets to achieve competitive performance. To address these limitations, we present EXAONE Path 2.0, a pathology foundation model that learns patch-level representations under direct slide-level supervision. Using only 37k WSIs for training, EXAONE Path 2.0 achieves state-of-the-art average performance across 10 biomarker prediction tasks, demonstrating remarkable data efficiency.
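The conventional pipeline described above aggregates patch-level embeddings into a slide-level representation, for example via multiple instance learning (MIL). The sketch below shows one common form of that aggregation, softmax attention pooling, with the attention scores supplied directly. In a real MIL model the scores would come from a learned scoring network; that part is omitted here, and the function name is hypothetical. This illustrates the baseline pipeline, not EXAONE Path 2.0 itself.

```python
import math

def attention_pool(patch_embeddings, scores):
    """Softmax-weighted average of patch embeddings.

    patch_embeddings: list of equal-length float vectors, one per patch.
    scores: one unnormalized attention score per patch (assumed given).
    """
    if len(patch_embeddings) != len(scores) or not scores:
        raise ValueError("need one score per patch and at least one patch")
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(patch_embeddings[0])
    return [sum(w * e[k] for w, e in zip(weights, patch_embeddings))
            for k in range(dim)]
```

With equal scores this reduces to mean pooling; with one dominant score the slide embedding approaches that single patch's embedding.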
[46] Learning from Sparse Point Labels for Dense Carcinosis Localization in Advanced Ovarian Cancer Assessment
Farahdiba Zarin,Riccardo Oliva,Vinkle Srivastav,Armine Vardazaryan,Andrea Rosati,Alice Zampolini Faustini,Giovanni Scambia,Anna Fagotti,Pietro Mascagni,Nicolas Padoy
Main category: cs.CV
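TL;DR: Proposes a method for learning dense localization from sparse point labels, applied to 2D carcinosis keypoint localization in advanced ovarian cancer assessment, together with a new loss function (Crag and Tail loss) to improve learning.
Details
Motivation: In medicine, dense pixel-level annotation is costly and often impractical, especially for new tasks. Learning dense prediction from sparse pixel-level labels can advance research where annotation resources are limited.
Contribution: A method for learning a dense prediction task from sparse point annotations, and the Crag and Tail loss, which makes effective use of sparse positive labels while reducing the impact of false negatives or missed annotations.
Method: The problem is formulated as sparse heatmap regression, and the proposed loss improves the model's ability to learn from sparse annotations.
Result: An extensive ablation study supports the effectiveness of the method in accurately localizing carcinosis keypoints densely.
Insight: With a suitable loss function and task formulation, high-quality dense prediction remains achievable even when annotation resources are limited.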
Abstract: Learning from sparse labels is a challenge commonplace in the medical domain. This is due to numerous factors, such as annotation cost, and is especially true for newly introduced tasks. When dense pixel-level annotations are needed, this becomes even more infeasible. However, being able to learn from just a few annotations at the pixel level, while extremely difficult and underutilized, can drive progress in studies where perfect annotations are not immediately available. This work tackles the challenge of learning the dense prediction task of keypoint localization from a few point annotations in the context of 2D carcinosis keypoint localization from laparoscopic video frames for diagnostic planning of advanced ovarian cancer patients. To enable this, we formulate the problem as a sparse heatmap regression from a few point annotations per image and propose a new loss function, called Crag and Tail loss, for efficient learning. Our proposed loss function effectively leverages positive sparse labels while minimizing the impact of false negatives or missed annotations. Through an extensive ablation study, we demonstrate the effectiveness of our approach in achieving accurate dense localization of carcinosis keypoints, highlighting its potential to advance research in scenarios where dense annotations are challenging to obtain.
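Heatmap regression from point annotations usually starts by rendering each annotated point as a small Gaussian blob in a target map. The sketch below shows that standard target construction only. The Crag and Tail loss itself is not defined in the abstract and is not reproduced here; the function name, the `(x, y)` point convention, and the `sigma` default are assumptions.

```python
import math

def point_heatmap(height, width, points, sigma=2.0):
    """Render (x, y) point annotations as a Gaussian heatmap target.

    Each pixel takes the maximum response over all annotated points,
    so overlapping blobs do not sum above 1.0.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                dist_sq = (x - px) ** 2 + (y - py) ** 2
                value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                if value > heatmap[y][x]:
                    heatmap[y][x] = value
    return heatmap
```

Pixels far from every annotated point get values near zero, which is exactly where missed annotations would be penalized as false negatives by a plain regression loss; that is the failure mode the abstract says the proposed loss is designed to soften.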
[47] ClipGS: Clippable Gaussian Splatting for Interactive Cinematic Visualization of Volumetric Medical Data
Chengkun Li,Yuqi Tong,Kai Chen,Zhenya Yang,Ruiyang Li,Shi Qiu,Jason Ying-Kuen Chan,Pheng-Ann Heng,Qi Dou
Main category: cs.CV
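TL;DR: ClipGS is a Gaussian splatting framework with clipping-plane support for interactive cinematic visualization of volumetric medical data; a learnable truncation scheme and an adaptive adjustment model improve rendering quality and efficiency.
Details
Motivation: Visualizing volumetric medical data is important for diagnosis and surgical planning, but the high computing cost and low rendering speed of existing techniques limit interactive use.
Contribution: Proposes the ClipGS framework, which achieves high-quality, efficient interactive rendering through a learnable truncation scheme and an adaptive adjustment model.
Method: Builds on Gaussian splatting with clipping-plane support, and learns to dynamically adjust the visibility and shape of Gaussian primitives.
Result: On five volumetric medical datasets (CT and anatomical slice data), reaches an average PSNR of 36.635 at 156 FPS with a 16.1 MB model, outperforming existing methods.
Insight: Dynamically adjusting the visibility and shape of Gaussian primitives allows efficient interaction while preserving rendering quality.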
Abstract: The visualization of volumetric medical data is crucial for enhancing diagnostic accuracy and improving surgical planning and education. Cinematic rendering techniques significantly enrich this process by providing high-quality visualizations that convey intricate anatomical details, thereby facilitating better understanding and decision-making in medical contexts. However, the high computing cost and low rendering speed make it difficult to meet the requirement of interactive visualization in practical applications. In this paper, we introduce ClipGS, an innovative Gaussian splatting framework with clipping plane support, for interactive cinematic visualization of volumetric medical data. To address the challenges posed by dynamic interactions, we propose a learnable truncation scheme that automatically adjusts the visibility of Gaussian primitives in response to the clipping plane. We also design an adaptive adjustment model to dynamically adjust the deformation of Gaussians and refine the rendering performance. We validate our method on five volumetric medical datasets (including CT and anatomical slice data), and reach an average 36.635 PSNR rendering quality with 156 FPS and 16.1 MB model size, outperforming state-of-the-art methods in rendering quality and efficiency.
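To make the clipping-plane idea concrete, the sketch below computes the signed distance of a Gaussian's center to a plane and maps it to a visibility weight in (0, 1) with a fixed sigmoid. This is a hand-written stand-in for illustration only: ClipGS learns its truncation scheme, whereas the `sharpness` parameter and both function names here are assumptions.

```python
import math

def signed_distance(point, plane_normal, plane_offset):
    """Signed distance from a 3D point to the plane n.x + d = 0."""
    norm = math.sqrt(sum(n * n for n in plane_normal))
    if norm == 0.0:
        raise ValueError("plane normal must be non-zero")
    return (sum(p * n for p, n in zip(point, plane_normal)) + plane_offset) / norm

def soft_visibility(point, plane_normal, plane_offset, sharpness=10.0):
    """Visibility weight: near 1 on the kept side, near 0 on the clipped side."""
    z = sharpness * signed_distance(point, plane_normal, plane_offset)
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() from overflowing
    return 1.0 / (1.0 + math.exp(-z))
```

A hard cut would simply threshold the signed distance at zero; the soft version avoids popping artifacts for primitives that straddle the plane, which is one plausible reason to adjust visibility smoothly.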
[48] Diff$^2$I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior
Juncheng Mu,Chengwei Ren,Weixiang Zhang,Liang Pan,Xiao-Ping Zhang,Yue Gao
Main category: cs.CV
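TL;DR: Diff²I2P uses a diffusion prior and a differentiable correspondence tuning module to improve cross-modal image-to-point cloud registration.
Details
Motivation: Existing methods enforce cross-modal feature alignment through metric learning but disregard the inherent modality gap between image and point cloud data, which leads to poor registration.
Contribution: 1. Proposes Diff²I2P, which introduces a diffusion model prior to bridge the modality gap; 2. proposes the Control-Side Score Distillation (CSD) technique; 3. designs a differentiable Deformable Correspondence Tuning (DCT) module.
Method: 1. Use a depth-conditioned diffusion model as the prior; 2. use CSD to optimize the predicted transformation directly; 3. achieve end-to-end learning through DCT and a differentiable PnP solver.
Result: On the 7-Scenes benchmark, registration recall improves by over 7%, outperforming existing methods.
Insight: A diffusion model can serve as a strong prior for cross-modal feature learning and markedly improve registration between images and point clouds.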
Abstract: Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. To this end, inspired by the cross-modal generation success of recent large diffusion models, we propose Diff$^2$I2P, a fully Differentiable I2P registration framework, leveraging a novel and effective Diffusion prior for bridging the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and PnP solver. To this end, we further propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by the transformation estimation using a differentiable PnP solver. With these two designs, the Diffusion model serves as a strong prior to guide the cross-modal feature learning of image and point cloud for forming robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that Diff$^2$I2P consistently outperforms SoTA I2P registration methods, achieving over 7% improvement in registration recall on the 7-Scenes benchmark.
[49] MK-Pose: Category-Level Object Pose Estimation via Multimodal-Based Keypoint Learning
Yifan Yang,Peili Song,Enfan Lan,Dong Liu,Jingtai Liu
Main category: cs.CV
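TL;DR: MK-Pose is a multimodal keypoint learning framework for category-level object pose estimation that combines RGB images, point clouds, and textual descriptions, improving performance through self-supervised keypoint detection and a graph-enhanced feature fusion module.
Details
Motivation: Existing methods rely on a single modality (such as RGB or point clouds) and struggle with object occlusion and cross-category generalization, which motivates a multimodal fusion approach.
Contribution: Proposes the MK-Pose framework, which integrates RGB, point clouds, and textual descriptions, and designs a self-supervised keypoint detection module and a graph-based feature fusion module.
Method: A self-supervised keypoint detection module (based on attention-based query generation and soft heatmap matching) and a graph-enhanced feature fusion module that integrates local and global information.
Result: MK-Pose outperforms existing methods on the CAMERA25 and REAL275 datasets, and its cross-dataset capability is tested on HouseCat6D.
Insight: Multimodal fusion and graph modeling can improve the robustness of pose estimation, especially under occlusion and across categories.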
Abstract: Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further tested for cross-dataset capability on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Code will be released at https://github.com/yangyifanYYF/MK-Pose.
[50] FlexGaussian: Flexible and Cost-Effective Training-Free Compression for 3D Gaussian Splatting
Boyuan Tian,Qizhe Gao,Siran Xianyu,Xiaotong Cui,Minjia Zhang
Main category: cs.CV
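TL;DR: FlexGaussian is a training-free compression method for 3D Gaussian splatting that combines mixed-precision quantization with attribute-discriminative pruning, achieving efficient compression suitable for mobile devices.
Details
Motivation: Demand for large-scale 3D models on resource-constrained devices is growing and calls for flexible, efficient compression; current methods lack flexibility and require retraining.
Contribution: Proposes FlexGaussian, which needs no retraining and achieves flexible, efficient compression through mixed-precision quantization and attribute-discriminative pruning.
Method: Combines mixed-precision quantization with attribute-discriminative pruning, without retraining, and adapts to diverse compression targets.
Result: Achieves up to 96.4% compression with a PSNR drop below 1 dB, running 1.7-2.1x faster than state-of-the-art training-free methods and 10-100x faster than training-involved approaches.
Insight: Efficient training-free compression can considerably ease the deployment of 3D models on mobile devices.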
Abstract: 3D Gaussian splatting has become a prominent technique for representing and rendering complex 3D scenes, due to its high fidelity and speed advantages. However, the growing demand for large-scale models calls for effective compression to reduce memory and computation costs, especially on mobile and edge devices with limited resources. Existing compression methods effectively reduce 3D Gaussian parameters but often require extensive retraining or fine-tuning, lacking flexibility under varying compression constraints. In this paper, we introduce FlexGaussian, a flexible and cost-effective method that combines mixed-precision quantization with attribute-discriminative pruning for training-free 3D Gaussian compression. FlexGaussian eliminates the need for retraining and adapts easily to diverse compression targets. Evaluation results show that FlexGaussian achieves up to 96.4% compression while maintaining high rendering quality (<1 dB drop in PSNR), and is deployable on mobile devices. FlexGaussian delivers high compression ratios within seconds, being 1.7-2.1x faster than state-of-the-art training-free methods and 10-100x faster than training-involved approaches. The code is being prepared and will be released soon at: https://github.com/Supercomputing-System-AI-Lab/FlexGaussian
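As a rough illustration of the quantization half of this recipe, the sketch below applies uniform min-max quantization to one attribute at a chosen bit width. In a mixed-precision setting, different attributes would simply call it with different `num_bits`. The function names and the min-max scheme are assumptions; the abstract does not specify FlexGaussian's quantizer or its pruning criterion, and pruning is not shown.

```python
def quantize_uniform(values, num_bits):
    """Uniform min-max quantization of a list of floats.

    Returns (codes, lo, scale) where codes are integers in
    [0, 2**num_bits - 1] and value ~= lo + code * scale.
    """
    lo, hi = min(values), max(values)
    levels = (1 << num_bits) - 1
    if hi == lo:
        # Constant attribute: every value maps to code 0.
        return [0] * len(values), lo, 1.0
    scale = (hi - lo) / levels
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]
```

The reconstruction error per value is bounded by half the step size, so halving the bit width from 8 to 4 enlarges the step (and the worst-case error) by a factor of 17, which is why precision-sensitive attributes would be given more bits.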
[51] Text-promptable Object Counting via Quantity Awareness Enhancement
Miaojing Shi,Xiaowen Zhang,Zijie Yue,Yong Luo,Cairong Zhao,Li Li
Main category: cs.CV
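TL;DR: Proposes QUANet, which strengthens a model's counting ability with quantity-oriented text prompts and a vision-text quantity alignment loss, and adds a dual-stream adaptive counting decoder and a cross-stream quantity ranking loss, performing strongly on several benchmarks.
Details
Motivation: Existing text-promptable counting methods attend only to object category and ignore quantity information, which limits counting accuracy.
Contribution: 1. Quantity-oriented text prompts and a vision-text quantity alignment loss; 2. a dual-stream adaptive counting decoder (Transformer and CNN streams) with T2C-adapters; 3. a cross-stream quantity ranking loss.
Method: 1. Use quantity-oriented text prompts; 2. combine Transformer and CNN streams in a dual-stream decoder, with T2C-adapters for knowledge aggregation; 3. optimize the ranking order of predictions with a cross-stream ranking loss.
Result: Shows strong zero-shot class-agnostic counting performance on FSC-147, CARPK, PUCPR+, and ShanghaiTech.
Insight: Quantity information is important in text-promptable counting, and the dual-stream structure with adapters can improve generalization.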
Abstract: Recent advances in large vision-language models (VLMs) have shown remarkable progress in solving the text-promptable object counting problem. Representative methods typically specify text prompts with object category information in images. This however is insufficient for training the model to accurately distinguish the number of objects in the counting task. To this end, we propose QUANet, which introduces novel quantity-oriented text prompts with a vision-text quantity alignment loss to enhance the model’s quantity awareness. Moreover, we propose a dual-stream adaptive counting decoder consisting of a Transformer stream, a CNN stream, and a number of Transformer-to-CNN enhancement adapters (T2C-adapters) for density map prediction. The T2C-adapters facilitate the effective knowledge communication and aggregation between the Transformer and CNN streams. A cross-stream quantity ranking loss is proposed in the end to optimize the ranking orders of predictions from the two streams. Extensive experiments on standard benchmarks such as FSC-147, CARPK, PUCPR+, and ShanghaiTech demonstrate our model’s strong generalizability for zero-shot class-agnostic counting. Code is available at https://github.com/viscom-tongji/QUANet
[52] Spatial-Temporal Graph Mamba for Music-Guided Dance Video Synthesis
Hao Tang,Ling Shao,Zhenyu Zhang,Luc Van Gool,Nicu Sebe
Main category: cs.CV
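TL;DR: STG-Mamba is a spatial-temporal graph Mamba method for music-guided dance video synthesis that works in two steps, music-to-skeleton and skeleton-to-video, and produces better results than existing methods.
Details
Motivation: Existing methods for music-guided dance video synthesis struggle to capture the spatial and temporal dependencies of joints at the same time, so a more effective way to generate natural, fluid dance videos is needed.
Contribution: 1. A spatial-temporal graph Mamba (STGM) block for music-to-skeleton translation; 2. a self-supervised regularization network for skeleton-to-video generation; 3. a new dataset of 54,944 video clips.
Method: 1. Music-to-skeleton translation: STGM blocks capture spatial and temporal dependencies between joints; 2. skeleton-to-video translation: a self-supervised regularization network generates the video from the skeletons and a conditional image.
Result: Experiments show that STG-Mamba outperforms existing methods on music-guided dance video synthesis.
Insight: Modeling spatial and temporal dependencies jointly is key to generating fluid dance videos, and a self-supervised regularization network can improve video generation quality.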
Abstract: We propose a novel spatial-temporal graph Mamba (STG-Mamba) for the music-guided dance video synthesis task, i.e., to translate the input music to a dance video. STG-Mamba consists of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. In the music-to-skeleton translation, we introduce a novel spatial-temporal graph Mamba (STGM) block to effectively construct skeleton sequences from the input music, capturing dependencies between joints in both the spatial and temporal dimensions. For the skeleton-to-video translation, we propose a novel self-supervised regularization network to translate the generated skeletons, along with a conditional image, into a dance video. Lastly, we collect a new skeleton-to-video translation dataset from the Internet, containing 54,944 video clips. Extensive experiments demonstrate that STG-Mamba achieves significantly better results than existing methods.
[53] A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding
Zhenyang Liu,Sixiao Zheng,Siyu Chen,Cairong Zhao,Longfei Liang,Xiangyang Xue,Yanwei Fu
Main category: cs.CV
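TL;DR: Proposes SpatialReasoner, a neural representation framework that uses large language model (LLM)-driven spatial reasoning to address weak reasoning about spatial relations in open-vocabulary 3D visual grounding.
Details
Motivation: Open-vocabulary 3D visual grounding is important for applications such as autonomous navigation and robotics, but existing methods reason poorly about spatial relations in language queries (for example, "the book on the chair"), so spatial reasoning over both language and 3D scenes needs to improve.
Contribution: 1) The SpatialReasoner framework, which combines an LLM with a visual properties-enhanced hierarchical feature field; 2) fine-tuning an LLM to capture spatial relations and infer the target, anchor, and spatial relation; 3) using visual properties to build a hierarchical feature field that improves localization accuracy.
Method: 1) Fine-tune an LLM to reason about spatial relations in language queries; 2) build a hierarchical feature field from visual properties (opacity and color); 3) combine distilled CLIP features with masks extracted by SAM and query the field hierarchically to localize the target.
Result: Experiments show that the method integrates into different neural representations, outperforms baseline models, and improves their spatial reasoning capability.
Insight: LLM-driven spatial reasoning combined with a visual properties-enhanced feature field enables more accurate grounding for complex open-vocabulary 3D queries.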
Abstract: Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.
[54] Hierarchical Feature Alignment for Gloss-Free Sign Language Translation
Sobhan Asasi,Mohamed Ilyes Lakhal,Richard Bowden
Main category: cs.CV
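TL;DR: Proposes a gloss-free hierarchical pre-training strategy that improves sign language translation using pseudo-glosses from video and contrastive learning.
Details
Motivation: Existing sign language translation methods suffer from a mismatch between visual and textual representations during end-to-end learning. Gloss-free methods are more flexible and avoid annotation, but they need effective alignment strategies.
Contribution: Proposes a hierarchical feature alignment method that combines pseudo-glosses with contrastive learning to improve translation quality.
Method: Extracts features hierarchically at the frame, segment, and video levels and aligns them with pseudo-glosses and the spoken sentence.
Result: Experiments show improved BLEU-4 and ROUGE scores while maintaining efficiency.
Insight: A hierarchical alignment strategy captures the structure of sign language more effectively and improves translation performance.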
Abstract: Sign Language Translation (SLT) attempts to convert sign language videos into spoken sentences. However, many existing methods struggle with the disparity between visual and textual representations during end-to-end learning. Gloss-based approaches help to bridge this gap by leveraging structured linguistic information. While gloss-free methods offer greater flexibility and remove the burden of annotation, they require effective alignment strategies. Recent advances in Large Language Models (LLMs) have enabled gloss-free SLT by generating text-like representations from sign videos. In this work, we introduce a novel hierarchical pre-training strategy inspired by the structure of sign language, incorporating pseudo-glosses and contrastive video-language alignment. Our method hierarchically extracts features at frame, segment, and video levels, aligning them with pseudo-glosses and the spoken sentence to enhance translation quality. Experiments demonstrate that our approach improves BLEU-4 and ROUGE scores while maintaining efficiency.
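Contrastive video-language alignment is commonly implemented with an InfoNCE-style loss, in which each video embedding should be most similar to its own text embedding within a batch. The sketch below is a one-directional (video-to-text) version in plain Python. The loss form, the temperature value, and the function name are assumptions; the abstract does not state which contrastive objective the authors use.

```python
import math

def info_nce(video_embs, text_embs, temperature=0.07):
    """Mean video-to-text InfoNCE loss over a batch of paired embeddings.

    video_embs[i] and text_embs[i] are assumed to be a matching pair.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in text_embs]
    loss = 0.0
    for i, vi in enumerate(v):
        # Cosine similarity to every text in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(v)
```

The same function can be applied at each level of a hierarchy (frame, segment, video) by passing the features pooled at that level together with the text targets they should align to.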
[55] Residual Prior-driven Frequency-aware Network for Image Fusion
Guan Zheng,Xue Wang,Wenhua Qian,Peng Liu,Runzhuo Ma
Main category: cs.CV
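TL;DR: Proposes RPFNet, a residual prior-driven frequency-aware network that addresses global feature modeling and complementary information capture in image fusion. A dual-branch feature extraction framework and several loss functions deliver efficient fusion performance.
Details
Motivation: In image fusion, conventional global spatial modeling is computationally expensive, and the lack of ground truth makes it hard to capture complementary information effectively. RPFNet addresses both problems with frequency-domain modeling and residual prior extraction.
Contribution: 1. The RPFNet network, which combines a Residual Prior Module (RPM) and a Frequency Domain Fusion Module (FDFM) for efficient feature extraction and fusion; 2. a Cross Promotion Module (CPM) that strengthens the joint perception of local and global features; 3. several loss functions (such as a frequency contrastive loss and an SSIM loss) that constrain the solution space.
Method: 1. Use the RPM to extract modality-specific difference information; 2. model global features in the frequency domain with the FDFM; 3. use the CPM for bidirectional feature interaction; 4. add an auxiliary decoder and a saliency structure loss during training to raise the model's sensitivity.
Result: Experiments show that RPFNet effectively integrates discriminative features, enhances texture details and salient objects, and improves the performance of high-level vision tasks.
Insight: Combining frequency-domain modeling with a residual prior offers a new way to handle computational cost and complementary information capture in image fusion, and the combined losses constrain training more tightly.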
Abstract: Image fusion aims to integrate complementary information across modalities to generate high-quality fused images, thereby enhancing the performance of high-level vision tasks. While global spatial modeling mechanisms show promising results, constructing long-range feature dependencies in the spatial domain incurs substantial computational costs. Additionally, the absence of ground truth exacerbates the difficulty of capturing complementary features effectively. To tackle these challenges, we propose a Residual Prior-driven Frequency-aware Network, termed RPFNet. Specifically, RPFNet employs a dual-branch feature extraction framework: the Residual Prior Module (RPM) extracts modality-specific difference information from residual maps, thereby providing complementary priors for fusion; the Frequency Domain Fusion Module (FDFM) achieves efficient global feature modeling and integration through frequency-domain convolution. Additionally, the Cross Promotion Module (CPM) enhances the synergistic perception of local details and global structures through bidirectional feature interaction. During training, we incorporate an auxiliary decoder and saliency structure loss to strengthen the model's sensitivity to modality-specific differences. Furthermore, a combination of adaptive weight-based frequency contrastive loss and SSIM loss effectively constrains the solution space, facilitating the joint capture of local details and global features while ensuring the retention of complementary information. Extensive experiments validate the fusion performance of RPFNet, which effectively integrates discriminative features, enhances texture details and salient objects, and can effectively facilitate the deployment of high-level vision tasks.
[56] DIFFUMA: High-Fidelity Spatio-Temporal Video Prediction via Dual-Path Mamba and Diffusion Enhancement
Xinyu Xie,Weifeng Cao,Jun Shi,Yangyang Hu,Hui Liang,Wanyong Liang,Xiaoliang Qian
Main category: cs.CV
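TL;DR: Proposes the DIFFUMA model and the CHDL dataset for high-fidelity spatio-temporal video prediction, with strong results in semiconductor manufacturing.
Details
Motivation: Addresses the challenge of high-precision spatio-temporal video prediction in industrial scenarios, in particular the lack of dedicated datasets for semiconductor manufacturing.
Contribution: 1. Releases CHDL, the first public dataset of the semiconductor wafer dicing process; 2. proposes DIFFUMA, which combines a Mamba module with a diffusion module to improve prediction performance.
Method: A dual-path architecture: the Mamba module captures global long-range temporal dependencies, while the diffusion module restores spatial detail.
Result: On CHDL, MSE drops by 39% and SSIM rises from 0.926 to 0.988; the model also performs well on natural phenomena datasets.
Insight: Industrial AI needs dedicated datasets and purpose-built models, and DIFFUMA offers a new approach to high-precision dynamic modeling.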
Abstract: Spatio-temporal video prediction plays a pivotal role in critical domains, ranging from weather forecasting to industrial automation. However, in high-precision industrial scenarios such as semiconductor manufacturing, the absence of specialized benchmark datasets severely hampers research on modeling and predicting complex processes. To address this challenge, we make a twofold contribution. First, we construct and release the Chip Dicing Lane Dataset (CHDL), the first public temporal image dataset dedicated to the semiconductor wafer dicing process. Captured via an industrial-grade vision system, CHDL provides a much-needed and challenging benchmark for high-fidelity process modeling, defect detection, and digital twin development. Second, we propose DIFFUMA, an innovative dual-path prediction architecture specifically designed for such fine-grained dynamics. The model captures global long-range temporal context through a parallel Mamba module, while simultaneously leveraging a diffusion module, guided by temporal features, to restore and enhance fine-grained spatial details, effectively combating feature degradation. Experiments demonstrate that on our CHDL benchmark, DIFFUMA significantly outperforms existing methods, reducing the Mean Squared Error (MSE) by 39% and improving the Structural Similarity (SSIM) from 0.926 to a near-perfect 0.988. This superior performance also generalizes to natural phenomena datasets. Our work not only delivers a new state-of-the-art (SOTA) model but, more importantly, provides the community with an invaluable data resource to drive future research in industrial AI.
[57] PromptTea: Let Prompts Tell TeaCache the Optimal Threshold
Zishen Huang,Chunyu Yang,Mengyuan Ren
Main category: cs.CV
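TL;DR: Proposes Prompt-Complexity-Aware (PCA) caching and a dynamic CFGCache mechanism that speed up inference for video generation models while preserving visual fidelity.
Details
Motivation: Despite progress in video generation, inference speed remains a bottleneck. Fixed-interval caching degrades quality in complex scenes, and manually tuning thresholds is inefficient and not robust.
Contribution: 1. PCA caching, which adjusts the reuse threshold automatically from the input prompt; 2. a dynamic CFGCache mechanism that selectively reuses classifier-free guidance (CFG) outputs; 3. improved modeling of the input-output relationship for better predictive accuracy.
Method: 1. PCA caching: adjust the threshold dynamically using semantic cues from the input prompt; 2. decouple the noisy input and strengthen the contribution of textual information; 3. apply multivariate polynomial feature expansion; 4. use DynCFGCache to select CFG outputs for reuse dynamically.
Result: Achieves a 2.79x speedup on the Wan2.1 model while maintaining high visual fidelity.
Insight: Semantic information in the input prompt matters for caching decisions, and dynamic mechanisms balance speed and generation quality better than fixed ones.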
Abstract: Despite recent progress in video generation, inference speed remains a major bottleneck. A common acceleration strategy involves reusing model outputs via caching mechanisms at fixed intervals. However, we find that such fixed-frequency reuse significantly degrades quality in complex scenes, while manually tuning reuse thresholds is inefficient and lacks robustness. To address this, we propose Prompt-Complexity-Aware (PCA) caching, a method that automatically adjusts reuse thresholds based on scene complexity estimated directly from the input prompt. By incorporating prompt-derived semantic cues, PCA enables more adaptive and informed reuse decisions than conventional caching methods. We also revisit the assumptions behind TeaCache and identify a key limitation: it suffers from poor input-output relationship modeling due to an oversimplified prior. To overcome this, we decouple the noisy input, enhance the contribution of meaningful textual information, and improve the model's predictive accuracy through multivariate polynomial feature expansion. To further reduce computational cost, we replace the static CFGCache with DynCFGCache, a dynamic mechanism that selectively reuses classifier-free guidance (CFG) outputs based on estimated output variations. This allows for more flexible reuse without compromising output quality. Extensive experiments demonstrate that our approach achieves significant acceleration (for example, a 2.79x speedup on the Wan2.1 model) while maintaining high visual fidelity across a range of scenes.
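Threshold-based output reuse of the kind discussed above can be sketched as follows: an estimated per-step change is accumulated, the cached output is reused while the running total stays below a threshold, and the model is recomputed (and the total reset) once it is exceeded. A second helper maps a prompt-complexity score to a threshold. Both functions, the linear mapping, and the `low`/`high` bounds are hypothetical illustrations; how PCA caching actually estimates complexity and sets thresholds is not given in the abstract.

```python
def plan_cache_reuse(relative_changes, threshold):
    """Decide, per denoising step, whether to recompute or reuse the cache.

    relative_changes[i] is an estimate of how much the output changes at
    step i. Returns a list of booleans: True = recompute, False = reuse.
    """
    decisions = []
    accumulated = 0.0
    for step, change in enumerate(relative_changes):
        if step == 0:
            decisions.append(True)  # nothing cached yet
            continue
        accumulated += change
        if accumulated >= threshold:
            decisions.append(True)
            accumulated = 0.0  # the cache is fresh again
        else:
            decisions.append(False)
    return decisions

def threshold_from_complexity(complexity, low=0.05, high=0.30):
    """Hypothetical mapping: more complex prompts get a stricter threshold."""
    c = min(1.0, max(0.0, complexity))
    return high - c * (high - low)
```

With per-step changes of 0.1, a simple prompt (threshold 0.30) recomputes roughly every third step, while a complex prompt (threshold 0.05) recomputes at every step, trading speed for fidelity where the scene demands it.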
[58] Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu
Yan Hon Michael Chung,Donghyeok Choi
Main category: cs.CV
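TL;DR: Fine-tunes vision-language models (VLMs) to build an effective OCR system for Manchu, an endangered language, substantially improving recognition on real historical documents.
Details
Motivation: Manchu is an endangered language without effective OCR systems for real historical documents, which hinders research on early modern East Asian history.
Contribution: 1. A VLM-based OCR framework that reaches high accuracy on Manchu (98.3% on synthetic data, 93.1% on real data).
2. Evidence, obtained with parameter-efficient training, that models trained on synthetic data transfer effectively to real data.
3. A low-cost, deployable OCR solution for endangered languages.
Method: 1. Fine-tune three open-source VLMs (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B).
2. Train on 60,000 synthetic Manchu word images.
3. Evaluate on both synthetic data and real handwritten documents.
Result: 1. LLaMA-3.2-11B reaches 98.3% word accuracy and a character error rate of 0.0024 on synthetic data.
2. It maintains 93.1% accuracy on real documents, well above the traditional CRNN baseline (72.5%).
Insight: 1. VLMs show promise for OCR in low-resource languages.
2. Training on synthetic data can transfer efficiently to real-world settings.
3. The framework can extend to other endangered languages and lowers the technical barrier.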
Abstract: Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and 0.0024 character error rate on synthetic data, while crucially maintaining 93.1% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8% synthetic accuracy, it suffered severe degradation to 72.5% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a cost-effective solution deployable on accessible infrastructure. This work establishes a transferable framework for endangered language OCR that removes technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Code and model weights are available at https://github.com/mic7ch1/ManchuAI-OCR.
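The two metrics quoted above, word accuracy and character error rate (CER), can be computed as in the sketch below. CER is taken here as the total Levenshtein edit distance divided by the total number of reference characters, which is the usual definition; whether the paper aggregates per corpus or per word is an assumption, as are the function names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(references, hypotheses):
    """Corpus-level CER: total edits over total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars

def word_accuracy(references, hypotheses):
    """Fraction of words recognized exactly."""
    correct = sum(1 for r, h in zip(references, hypotheses) if r == h)
    return correct / len(references)
```

Because a single wrong character makes the whole word count as an error, word accuracy is the stricter of the two metrics: one substitution in a pair of four-letter words gives a CER of 0.125 but a word accuracy of only 0.5.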
[59] FOLC-Net: A Federated-Optimized Lightweight Architecture for Enhanced MRI Disease Diagnosis across Axial, Coronal, and Sagittal Views
Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel
Main category: cs.CV
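TL;DR: FOLC-Net is a federated-optimized lightweight architecture for MRI disease diagnosis, targeting performance on axial, coronal, and sagittal views.
Details
Motivation: Existing MRI disease diagnosis models degrade when handling multi-view data, especially on the sagittal view. FOLC-Net aims to address this and improve adaptability and accuracy.
Contribution: 1) Introduces the MRFO mechanism, global model cloning, and a ConvNeXt module; 2) proposes a lightweight architecture (only 1.217M parameters and 0.9 MB of storage); 3) performs well on both multi-view and single-view data.
Method: FOLC-Net combines MRFO for model structure generation, global model cloning for scalable training, and ConvNeXt for better client adaptability. The model is tested on axial, coronal, and sagittal views.
Result: On the sagittal view, FOLC-Net reaches 92.44% accuracy, clearly above the compared methods (88.37% and 88.95%). It also shows higher accuracy and robustness on all single-view and multi-view data.
Insight: FOLC-Net illustrates the potential of lightweight architectures and federated learning in medical image analysis, particularly for multi-view data, and offers a reliable option for medical applications in decentralized environments.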
Abstract: The framework is designed to improve performance in the analysis of combined as well as single anatomical perspectives for MRI disease diagnosis. It specifically addresses the performance degradation observed in state-of-the-art (SOTA) models, particularly when processing axial, coronal, and sagittal anatomical planes. The paper introduces the FOLC-Net framework, which incorporates a novel federated-optimized lightweight architecture with approximately 1.217 million parameters and a storage requirement of only 0.9 MB. FOLC-Net integrates Manta-ray foraging optimization (MRFO) mechanisms for efficient model structure generation, global model cloning for scalable training, and ConvNeXt for enhanced client adaptability. The model was evaluated on combined multi-view data as well as individual views, such as axial, coronal, and sagittal, to assess its robustness in various medical imaging scenarios. Moreover, FOLC-Net tests a ShallowFed model on different data to evaluate its ability to generalize beyond the training dataset. The results show that FOLC-Net outperforms existing models, particularly in the challenging sagittal view. For instance, FOLC-Net achieved an accuracy of 92.44% on the sagittal view, significantly higher than the 88.37% accuracy of study method (DL + Residual Learning) and 88.95% of DL models. Additionally, FOLC-Net demonstrated improved accuracy across all individual views, providing a more reliable and robust solution for medical image analysis in decentralized environments. FOLC-Net addresses the limitations of existing SOTA models by providing a framework that ensures better adaptability to individual views while maintaining strong performance in multi-view settings. The incorporation of MRFO, global model cloning, and ConvNeXt ensures that FOLC-Net performs better in real-world medical applications.
[60] Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs
Yahan Yu,Yuyang Dong,Masafumi Oyamada
Main category: cs.CV
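TL;DR: This paper proposes D2I, a framework that strengthens the understanding and reasoning of multimodal LLMs through a rule-based format reward during training, without extra annotations or complex rewards, and switches to intuitive reasoning at evaluation time. The method performs well on in-domain and out-of-domain benchmarks and highlights the role of the format reward in building transferable reasoning in MLLMs.
Details
Motivation: Multimodal reasoning research faces challenges in modality alignment and high training cost. Existing methods rely on additional data annotation and rule-based rewards, which limits scalability. The paper aims to improve MLLM reasoning with a framework that needs no extra resources.
Contribution: Proposes the D2I framework, which uses a rule-based format reward during training to strengthen modality alignment and reasoning and switches to intuitive reasoning at evaluation time, giving a high-performing, low-cost solution.
Method: During training, D2I uses explicit (deliberate) reasoning strategies driven by a rule-based format reward; at test time it switches to implicit, intuitive reasoning, avoiding complex rewards and annotation requirements.
Result: D2I significantly outperforms baseline methods on in-domain and out-of-domain benchmarks, supporting its effectiveness and transferability.
Insight: The format reward is key to improving MLLM reasoning, and decoupling explicit training-time reasoning from implicit test-time reasoning opens a new direction for future work.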
Abstract: Reasoning is a key capability for large language models (LLMs), particularly when applied to complex tasks such as mathematical problem solving. However, multimodal reasoning research still requires further exploration of modality alignment and training costs. Many of these approaches rely on additional data annotation and relevant rule-based rewards to enhance the understanding and reasoning ability, which significantly increases training costs and limits scalability. To address these challenges, we propose the Deliberate-to-Intuitive reasoning framework (D2I) that improves the understanding and reasoning ability of multimodal LLMs (MLLMs) without extra annotations and complex rewards. Specifically, our method sets deliberate reasoning strategies to enhance modality alignment only through the rule-based format reward during training. During evaluation, the reasoning style shifts to intuitive: the deliberate reasoning strategies used during training are removed, and the model’s acquired abilities are reflected implicitly in the response. D2I outperforms baselines across both in-domain and out-of-domain benchmarks. Our findings highlight the role of format reward in fostering transferable reasoning skills in MLLMs, and inspire directions for decoupling training-time reasoning depth from test-time response flexibility.
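A rule-based format reward of the kind the abstract describes can be sketched as a simple pattern check. The abstract does not give the actual template, so the `<think>` and `<answer>` tags, the binary 0/1 scoring, and the name `format_reward` are all assumptions made for illustration.

```python
import re

# Assumed template (not from the paper): reasoning inside <think>...</think>,
# followed by the final answer inside <answer>...</answer>.
_FORMAT = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if _FORMAT.fullmatch(response) else 0.0
```

The reward looks only at structure, never at whether the answer is correct, which is what makes it usable without extra annotation.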
[61] Democratizing High-Fidelity Co-Speech Gesture Video Generation
Xu Yang,Shaoli Huang,Shenbo Xie,Xuelin Chen,Yifei Liu,Changxing Ding
Main category: cs.CV
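TL;DR: The paper proposes a lightweight framework that uses 2D full-body skeletons as an auxiliary condition, together with a diffusion model and skeleton-audio feature fusion, to generate high-quality, audio-synchronized videos of speakers, and releases CSG-405, the first large public dataset for the task.
Details
Motivation: Research on co-speech gesture video generation is held back by the scarcity of large public datasets and by high computational demands, and the audio-to-visual mapping is one-to-many. The paper aims to address these problems and democratize research.
Contribution: 1. Proposes a lightweight framework that combines a diffusion model with skeleton conditioning for high-quality co-speech video generation. 2. Releases CSG-405, the first large public dataset. 3. Surpasses existing methods in visual quality and synchronization.
Method: 1. Uses 2D skeletons as an auxiliary condition linking audio signals to motion. 2. Uses a diffusion model to predict skeletal motion, ensuring audio synchronization and body consistency. 3. Uses an existing human video generation model to synthesize the final video.
Result: Experiments show that the method outperforms existing techniques in visual quality and synchronization and generalizes across speakers and scenes.
Insight: Using the skeleton as an intermediate representation connects audio to visual output efficiently, lowering the computational burden while improving the diversity and synchronization of the generated videos.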
Abstract: Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker’s reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker’s reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts.
[62] HVI-CIDNet+: Beyond Extreme Darkness for Low-Light Image Enhancement
Qingsen Yan,Kangbiao Shi,Yixu Feng,Tao Hu,Peng Wu,Guansong Pang,Yanning Zhang
Main category: cs.CV
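TL;DR: The paper proposes a new color space, HVI, and an improved network, HVI-CIDNet+, for enhancing low-light images in extremely dark conditions, addressing the color bias and brightness artifacts of existing methods.
Details
Motivation: Existing low-light image enhancement methods based on the sRGB and HSV color spaces suffer from color bias, brightness artifacts, and noise, so a more effective solution is needed.
Contribution: 1. Proposes the HVI color space, which reduces red and black noise artifacts through the HV color map and a learnable intensity; 2. designs the HVI-CIDNet+ network, which combines vision-language models with a Prior-guided Attention Block (PAB) to restore content and color in extremely dark regions.
Method: 1. Uses the HVI color space to decouple brightness and color; 2. integrates contextual and degradation knowledge from vision-language models through the PAB; 3. applies a region refinement module that uses convolution for information-rich regions and self-attention for information-scarce regions.
Result: HVI-CIDNet+ outperforms existing methods on 10 benchmark datasets.
Insight: 1. Priors from vision-language models can markedly improve restoration in extremely dark regions; 2. choosing the processing strategy per region (convolution or self-attention) helps overall performance.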
Abstract: Low-Light Image Enhancement (LLIE) aims to restore vivid content and details from corrupted low-light images. However, existing standard RGB (sRGB) color space-based LLIE methods often produce color bias and brightness artifacts due to the inherent high color sensitivity. While Hue, Saturation, and Value (HSV) color space can decouple brightness and color, it introduces significant red and black noise artifacts. To address this problem, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by the HV color map and learnable intensity. The HV color map enforces small distances for the red coordinates to remove red noise artifacts, while the learnable intensity compresses the low-light regions to remove black noise artifacts. Additionally, we introduce the Color and Intensity Decoupling Network+ (HVI-CIDNet+), built upon the HVI color space, to restore damaged content and mitigate color distortion in extremely dark regions. Specifically, HVI-CIDNet+ leverages abundant contextual and degraded knowledge extracted from low-light images using pre-trained vision-language models, integrated via a novel Prior-guided Attention Block (PAB). Within the PAB, latent semantic priors can promote content restoration, while degraded representations guide precise color correction, both particularly in extremely dark regions through the meticulously designed cross-attention fusion mechanism. Furthermore, we construct a Region Refinement Block that employs convolution for information-rich regions and self-attention for information-scarce regions, ensuring accurate brightness adjustments. Comprehensive results from benchmark experiments demonstrate that the proposed HVI-CIDNet+ outperforms the state-of-the-art methods on 10 datasets.
[63] Physics-Grounded Motion Forecasting via Equation Discovery for Trajectory-Guided Image-to-Video Generation
Tao Feng,Xianbing Zhao,Zhenhua Chen,Tien Tsin Wong,Hamid Rezatofighi,Gholamreza Haffari,Lizhen Qu
Main category: cs.CV
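TL;DR: The paper proposes a framework that combines symbolic regression with trajectory-guided image-to-video generation, using physics-grounded motion forecasting to improve the physical alignment of generated videos.
Details
Motivation: Existing diffusion and autoregressive video generation models achieve strong visual realism but lack physical alignment and cannot accurately simulate real-world object motion, mainly because they rely on statistical correlations rather than physical laws.
Contribution: The main contribution is a new framework that extracts motion trajectories from video and discovers equations of motion through symbolic regression, guiding physically accurate video generation without fine-tuning existing models.
Method: The method extracts motion trajectories from input videos, uses retrieval-based pre-training to improve symbolic regression, discovers equations of motion to forecast future trajectories, and uses those trajectories to guide video generation.
Result: In classical mechanics scenarios (spring-mass, pendulum, and projectile motion), the method recovers the ground-truth analytical equations and significantly improves the physical alignment of the generated videos.
Insight: By combining symbolic regression with video generation, the paper shows that physical laws can substantially improve the realism of generated content, suggesting a direction for future physics-based generative models.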
Abstract: Recent advances in diffusion-based and autoregressive video generation models have achieved remarkable visual realism. However, these models typically lack accurate physical alignment, failing to replicate real-world dynamics in object motion. This limitation arises primarily from their reliance on learned statistical correlations rather than capturing mechanisms adhering to physical laws. To address this issue, we introduce a novel framework that integrates symbolic regression (SR) and trajectory-guided image-to-video (I2V) models for physics-grounded video forecasting. Our approach extracts motion trajectories from input videos, uses a retrieval-based pre-training mechanism to enhance symbolic regression, and discovers equations of motion to forecast physically accurate future trajectories. These trajectories then guide video generation without requiring fine-tuning of existing models. Evaluated on scenarios in Classical Mechanics, including spring-mass, pendulums, and projectile motions, our method successfully recovers ground-truth analytical equations and improves the physical alignment of generated videos over baseline methods.
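The idea of recovering an equation of motion from an observed trajectory and then forecasting with it can be illustrated on projectile motion. The sketch below is not symbolic regression: it swaps in a plain least-squares fit of a fixed quadratic form, y(t) = c0 + c1·t + c2·t², to show the "fit, then extrapolate" step only. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_quadratic(ts, ys):
    """Least-squares coefficients [c0, c1, c2] via the normal equations."""
    powers = [[t ** k for k in range(3)] for t in ts]
    ata = [[sum(p[i] * p[j] for p in powers) for j in range(3)] for i in range(3)]
    aty = [sum(p[i] * y for p, y in zip(powers, ys)) for i in range(3)]
    return solve_linear(ata, aty)


def forecast(coeffs, t):
    """Evaluate the fitted polynomial at a (possibly future) time t."""
    return sum(c * t ** k for k, c in enumerate(coeffs))
```

For a height trajectory sampled from y = 1 + 5t − 4.905t², the fitted c2 approximates −g/2, so the recovered coefficients carry physical meaning in a way a purely statistical predictor's weights do not.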
[64] Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation
Joelle Hanna,Damian Borth
Main category: cs.CV
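TL;DR: This paper proposes a weakly supervised semantic segmentation method based on the Vision Transformer (ViT) that uses multiple [CLS] tokens and a random masking strategy to make attention maps more interpretable and to generate accurate pseudo segmentation masks.
Details
Motivation: Traditional weakly supervised semantic segmentation (WSSS) methods rely on external modules such as Class Activation Maps to generate pseudo-masks. This paper aims to solve WSSS directly with ViT attention maps, reducing the dependence on fine-grained labeled data.
Contribution: 1. Proposes an end-to-end WSSS method that generates pseudo-masks directly from ViT attention maps. 2. Designs a training strategy with a sparse ViT and multiple [CLS] tokens to improve class assignment. 3. Validates the method on several standard datasets, with results close to fully supervised models.
Method: 1. Train a sparse ViT with one [CLS] token per class; 2. use a random masking strategy to promote the association between [CLS] tokens and classes; 3. at inference, aggregate the self-attention maps of the different [CLS] tokens to generate pseudo-masks.
Result: Experiments on multiple benchmark datasets show that the generated pseudo-masks are better than those of existing methods, and the segmentation model trained on them performs close to fully supervised models.
Insight: ViT self-attention maps can be used directly for WSSS, and random masking effectively improves the accuracy of class assignment.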
Abstract: Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token - class assignment. At inference time, we aggregate the different self-attention maps of each [CLS] token corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo-masks, outperforming related works. Those pseudo-masks can be used to train a segmentation model which achieves results comparable to fully-supervised models, significantly reducing the need for fine-grained labeled data.
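The inference step described above, aggregating the attention maps of the [CLS] tokens for the predicted labels into a pseudo-mask, might look like the following. This is a sketch under stated assumptions: attention is averaged over heads, each patch takes the predicted class with the highest averaged attention, and a fixed threshold marks background. The threshold, the `-1` background label, and the name `pseudo_mask` are illustrative and not taken from the paper.

```python
def pseudo_mask(attn, predicted, background_thresh=0.2):
    """Build a per-patch pseudo label map from class-token attention.

    attn[c][h][p] is the attention of class token c, head h, on patch p.
    predicted lists the class indices predicted for the image.
    Returns one label per patch: a predicted class index, or -1 for background.
    """
    num_patches = len(attn[0][0])
    # Average each predicted class token's attention over its heads.
    mean_attn = {
        c: [sum(head[p] for head in attn[c]) / len(attn[c])
            for p in range(num_patches)]
        for c in predicted
    }
    mask = []
    for p in range(num_patches):
        best = max(predicted, key=lambda c: mean_attn[c][p])
        mask.append(best if mean_attn[best][p] >= background_thresh else -1)
    return mask
```

Restricting the argmax to predicted classes keeps strongly attending tokens of absent classes from leaking into the mask.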
[65] IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization and Perturbation Optimization
Subrat Kishore Dutta,Xiao Zhang
Main category: cs.CV
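TL;DR: The paper proposes IAP, a new adversarial patch attack framework that generates highly invisible adversarial patches through perceptibility-aware localization and perturbation optimization. IAP outperforms existing baselines in attack success rate and stealth and bypasses existing patch defenses.
Details
Motivation: Existing adversarial patch attacks either perform poorly in targeted attack scenarios or produce patches that do not fit their context, so they are easily noticed by human examiners or fail to bypass automatic patch defenses. A stealthier and more effective patch attack is needed.
Contribution: The main contribution is the IAP framework, which combines perceptibility-aware localization with perturbation optimization to generate highly invisible adversarial patches. The method outperforms existing baselines in both attack success rate and stealth.
Method: IAP first uses classwise localization and sensitivity maps to choose the patch location, balancing attack effectiveness against the perception of the human visual system. It then optimizes the perturbation with a perceptibility-regularized adversarial loss and a gradient update rule that prioritizes color constancy.
Result: Experiments on multiple image benchmarks and model architectures show that IAP performs well in targeted attack settings, with high attack success rates and markedly improved patch invisibility, while bypassing several existing patch defenses.
Insight: Balancing adversarial effectiveness against human visual perception yields stealthier adversarial patches that still evade defenses, which points to new directions for research on adversarial attacks and defenses.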
Abstract: Despite modifying only a small localized input region, adversarial patches can drastically change the prediction of computer vision models. However, prior methods either cannot perform satisfactorily under targeted attack scenarios or fail to produce contextually coherent adversarial patches, causing them to be easily noticeable by human examiners and insufficiently stealthy against automatic patch defenses. In this paper, we introduce IAP, a novel attack framework that generates highly invisible adversarial patches based on perceptibility-aware localization and perturbation optimization schemes. Specifically, IAP first searches for a proper location to place the patch by leveraging classwise localization and sensitivity maps, balancing the susceptibility of patch location to both victim model prediction and human visual system, then employs a perceptibility-regularized adversarial loss and a gradient update rule that prioritizes color constancy for optimizing invisible perturbations. Comprehensive experiments across various image benchmarks and model architectures demonstrate that IAP consistently achieves competitive attack success rates in targeted settings with significantly improved patch invisibility compared to existing baselines. In addition to being highly imperceptible to humans, IAP is shown to be stealthy enough to render several state-of-the-art patch defenses ineffective.
[66] SemRaFiner: Panoptic Segmentation in Sparse and Noisy Radar Point Clouds
Matthias Zeller,Daniel Casado Herraez,Bengisu Ayan,Jens Behley,Michael Heidingsfeld,Cyrill Stachniss
Main category: cs.CV
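TL;DR: The paper proposes SemRaFiner, a new method for panoptic segmentation in sparse and noisy radar point clouds that improves segmentation accuracy by optimizing feature extraction and the training procedure.
Details
Motivation: Existing camera- and LiDAR-based semantic scene understanding methods perform poorly in adverse weather and usually provide no motion information. Radar sensors overcome these limitations, but their point clouds are sparse and noisy, so segmentation methods need to be adapted.
Contribution: Proposes SemRaFiner, which optimizes feature extraction and the training procedure for sparse radar point clouds and uses data augmentation to refine instance assignments, improving panoptic segmentation accuracy.
Method: Adapts feature extraction to the changing density of sparse radar point clouds and optimizes the training procedure with a dedicated data augmentation to achieve more precise instance segmentation.
Result: Experiments suggest that SemRaFiner outperforms state-of-the-art methods for panoptic segmentation of radar point clouds.
Insight: Radar's robustness in adverse weather makes it an important complement for scene understanding in autonomous driving, but its sparsity and noise call for dedicated methods.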
Abstract: Semantic scene understanding, including the perception and classification of moving agents, is essential to enabling safe and robust driving behaviours of autonomous vehicles. Cameras and LiDARs are commonly used for semantic scene understanding. However, both sensor modalities face limitations in adverse weather and usually do not provide motion information. Radar sensors overcome these limitations and directly offer information about moving agents by measuring the Doppler velocity, but the measurements are comparably sparse and noisy. In this paper, we address the problem of panoptic segmentation in sparse radar point clouds to enhance scene understanding. Our approach, called SemRaFiner, accounts for changing density in sparse radar point clouds and optimizes the feature extraction to improve accuracy. Furthermore, we propose an optimized training procedure to refine instance assignments by incorporating a dedicated data augmentation. Our experiments suggest that our approach outperforms state-of-the-art methods for radar-based panoptic segmentation.
[67] Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement
Qiyuan Dai,Hanzhuo Huang,Yu Wu,Sibei Yang
Main category: cs.CV
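TL;DR: Proposes an adaptive part learning (APL) method that uses shared learnable part queries and DINO part priors to generate consistent object parts and their correspondences without additional annotations, significantly improving performance on fine-grained classification tasks.
Details
Motivation: Existing GCD methods rely on DINO's global representation, which creates an inherent trade-off between discriminability and generalization and falls short for fine-grained classification.
Contribution: 1) Proposes APL, which automatically discovers and generates consistent object parts; 2) designs a novel all-min contrastive loss to learn part representations that are both discriminative and generalizable.
Method: Generates object parts from shared learnable part queries and DINO part priors, optimizes part representations with the all-min contrastive loss, and plugs into existing frameworks by replacing the CLS token feature.
Result: Significantly improves the performance of GCD frameworks on fine-grained datasets.
Insight: Part learning can resolve the tension between discriminability and generalization in global representations and suits fine-grained classification tasks.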
Abstract: Generalized Category Discovery (GCD) aims to recognize unlabeled images from known and novel classes by distinguishing novel classes from known ones, while also transferring knowledge from another set of labeled images with known classes. Existing GCD methods rely on self-supervised vision transformers such as DINO for representation learning. However, focusing solely on the global representation of the DINO CLS token introduces an inherent trade-off between discriminability and generalization. In this paper, we introduce an adaptive part discovery and learning method, called APL, which generates consistent object parts and their correspondences across different similar images using a set of shared learnable part queries and DINO part priors, without requiring any additional annotations. More importantly, we propose a novel all-min contrastive loss to learn discriminative yet generalizable part representation, which adaptively highlights discriminative object parts to distinguish similar categories for enhanced discriminability while simultaneously sharing other parts to facilitate knowledge transfer for improved generalization. Our APL can easily be incorporated into different GCD frameworks by replacing their CLS token feature with our part representations, showing significant enhancements on fine-grained datasets.
[68] Pre-Columbian Settlements Shaped Palm Clusters in the Sierra Nevada de Santa Marta, Colombia
Sebastian Fajardo,Sina Mohammadi,Jonas Gregorio de Souza,César Ardila,Alan Tapscott Baltar,Shaddai Heidgen,Maria Isabel Mayorga Hernández,Sylvia Mota de Oliveira,Fernando Montejo,Marco Moderato,Vinicius Peripato,Katy Puche,Carlos Reina,Juan Carlos Vargas,Frank W. Takes,Marco Madella
Main category: cs.CV
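TL;DR: The paper proposes a method that combines deep learning with a clustering algorithm to identify palm tree distribution from satellite imagery, infer areas managed by ancient populations, and reveal their long-term effect on vegetation.
Details
Motivation: The study addresses the long-term effects of ancient human activity on tropical forests, in particular the identification of managed areas at high-resolution scales, and offers a new perspective for ecology and archaeology.
Contribution: 1. Proposes a deep learning and clustering approach that infers ancient management areas from vegetation signatures. 2. Releases a manually annotated palm tree dataset and archaeological site location data. 3. Finds that ancient human-managed areas may be much larger than archaeological evidence indicates.
Method: 1. Uses a deep learning model to identify palm trees in satellite imagery. 2. Uses a clustering algorithm to analyze the spatial distribution of palms and infer areas of ancient human activity.
Result: Palms are significantly more abundant near archaeological sites, and the managed areas around major sites may be two orders of magnitude larger than archaeological evidence alone indicates.
Insight: Ancient human activity left a lasting ecological footprint by fostering palm proliferation, which may have lowered the cost of building infrastructure in hard-to-reach locations. The study shows the potential of combining AI with ecological and archaeological data to reveal human-environment interactions.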
Abstract: Ancient populations markedly transformed Neotropical forests, yet understanding the long-term effects of ancient human management, particularly at high-resolution scales, remains challenging. In this work we propose a new approach to investigate archaeological areas of influence based on vegetation signatures. It consists of a deep learning model trained on satellite imagery to identify palm trees, followed by a clustering algorithm to identify palm clusters, which are then used to estimate ancient management areas. To assess the palm distribution in relation to past human activity, we applied the proposed approach to unique high-resolution satellite imagery data covering 765 km² of the Sierra Nevada de Santa Marta, Colombia. With this work, we also release a manually annotated palm tree dataset along with estimated locations of archaeological sites from ground surveys and legacy records. Results demonstrate how palms were significantly more abundant near archaeological sites showing large infrastructure investment. The extent of the largest palm cluster indicates that ancient human-managed areas linked to major infrastructure sites may be up to two orders of magnitude bigger than indicated by archaeological evidence alone. Our findings suggest that pre-Columbian populations influenced local vegetation fostering conditions conducive to palm proliferation, leaving a lasting ecological footprint. This may have lowered the logistical costs of establishing infrastructure-heavy settlements in otherwise less accessible locations. Overall, this study demonstrates the potential of integrating artificial intelligence approaches with new ecological and archaeological data to identify archaeological areas of interest through vegetation patterns, revealing fine-scale human-environment interactions.
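The second stage, grouping detected palm locations into clusters, can be sketched as follows. The abstract does not name the clustering algorithm, so this example swaps in a simple distance-threshold grouping (connected components of the "within `eps`" graph, built with union-find). The function name `cluster_points` and the `eps` parameter are hypothetical.

```python
def cluster_points(points, eps):
    """Group 2-D points into connected components of the 'within eps' graph.

    Returns clusters as lists of point indices, largest cluster first.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pairwise check; fine for a sketch, too slow for large surveys.
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)
```

The extent of the largest returned cluster is the kind of quantity the abstract compares against the mapped extent of archaeological sites.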
[69] CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale
Xiao Liang,Jiawei Hu,Di Wang,Zhi Ma,Lin Zhao,Ronghan Li,Bo Wan,Quan Wang
Main category: cs.CV
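TL;DR: CheXPO proposes a chest X-ray preference optimization strategy that combines confidence-similarity joint mining with counterfactual rationale, mitigating hallucination in vision-language models for medical applications while sharply reducing the need for expert annotation.
Details
Motivation: Vision-language models (VLMs) are prone to hallucination in medical applications, which undermines reliability. Conventional preference optimization faces clinically irrelevant samples, imbalanced data distributions, and high expert annotation costs, so a more efficient and scalable solution is needed.
Contribution: 1. Proposes the CheXPO strategy, combining confidence-similarity joint mining with counterfactual rationale; 2. builds a multi-task chest X-ray visual instruction dataset for SFT; 3. provides fine-grained clinical preferences through synthetic counterfactual rationales, with no additional expert input.
Method: 1. Build a unified multi-task chest X-ray visual instruction dataset for SFT; 2. identify hard examples through confidence analysis and expand them by similarity-based retrieval to balance the distribution; 3. generate fine-grained clinical preferences with synthetic counterfactual rationales.
Result: CheXPO achieves an 8.93% relative performance gain using only 5% of the SFT samples and reaches state-of-the-art performance across diverse clinical tasks.
Insight: By jointly mining confidence and similarity and adding counterfactual rationales, CheXPO offers a scalable and interpretable solution well suited to real-world radiology applications.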
Abstract: Vision-language models (VLMs) are prone to hallucinations that critically compromise reliability in medical applications. While preference optimization can mitigate these hallucinations through clinical feedback, its implementation faces challenges such as clinically irrelevant training samples, imbalanced data distributions, and prohibitive expert annotation costs. To address these challenges, we introduce CheXPO, a Chest X-ray Preference Optimization strategy that combines confidence-similarity joint mining with counterfactual rationale. Our approach begins by synthesizing a unified, fine-grained multi-task chest X-ray visual instruction dataset across different question types for supervised fine-tuning (SFT). We then identify hard examples through token-level confidence analysis of SFT failures and use similarity-based retrieval to expand hard examples for balancing preference sample distributions, while synthetic counterfactual rationales provide fine-grained clinical preferences, eliminating the need for additional expert input. Experiments show that CheXPO achieves 8.93% relative performance gain using only 5% of SFT samples, reaching state-of-the-art performance across diverse clinical tasks and providing a scalable, interpretable solution for real-world radiology applications.
[70] Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting
Fei Teng,Kai Luo,Sheng Wu,Siyu Li,Pujun Guo,Jiale Wei,Kunyu Peng,Jiaming Zhang,Kailun Yang
Main category: cs.CV
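TL;DR: Percep360 proposes a panoramic street-view generation method based on local scenes diffusion and probabilistic prompting, addressing coherence and controllability in panoramic data generation.
Details
Motivation: Panoramic perception matters for autonomous driving, but acquiring panoramic data is complex and time-consuming. Existing methods cannot generate panoramic data with both high quality and control.
Contribution: Proposes Percep360, the first panoramic generation method for autonomous driving, comprising the Local Scenes Diffusion Method (LSDM) and the Probabilistic Prompting Method (PPM) for coherent and controllable panoramic generation.
Method: LSDM models panorama generation as a spatially continuous diffusion process, and PPM dynamically selects the most relevant control cues to enable controllable generation.
Result: The generated images outperform the original stitched images on no-reference quality metrics and improve downstream perception models.
Insight: A diffusion process combined with dynamic prompting yields high-quality panoramic data and offers a new route to data augmentation for autonomous driving.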
Abstract: Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose Percep360, the first panoramic generation method for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird’s Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at https://github.com/Bryant-Teng/Percep360.
[71] A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level
Johanna Orsholm,John Quinto,Hannu Autto,Gaia Banelyte,Nicolas Chazot,Jeremy deWaard,Stephanie deWaard,Arielle Farrell,Brendan Furneaux,Bess Hardwick,Nao Ito,Amlan Kar,Oula Kalttopää,Deirdre Kerdraon,Erik Kristensen,Jaclyn McKeown,Tommi Mononen,Ellen Nein,Hanna Rogers,Tomas Roslin,Paula Schmitz,Jayme Sones,Maija Sujala,Amy Thompson,Evgeny V. Zakharov,Iuliia Zarubiieva,Akshita Gupta,Scott C. Lowe,Graham W. Taylor
Main category: cs.CV
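TL;DR: This paper introduces the MassID45 dataset, which combines molecular and imaging data to support automatic classification of bulk insect samples and to advance tiny object detection and instance segmentation.
Details
Motivation: Insects are highly diverse and many populations are declining, so efficient methods for studying their diversity are needed. Existing techniques mostly rely on individual specimen data and do not meet the needs of bulk-sample ecological surveys.
Contribution: Presents the MassID45 dataset, which combines molecular data (DNA barcodes) with imaging data at both the bulk-sample and individual-specimen level to support classification of bulk insect samples, filling a gap in the field.
Method: Human annotators, supported by an AI-assisted tool, created segmentation masks on bulk images and assigned taxonomic labels to more than 17,000 specimens.
Result: The dataset combines the taxonomic resolution of DNA barcodes with the abundance estimates of bulk images, providing a new tool for rapid, large-scale study of insect communities.
Insight: The dataset pushes tiny object detection and instance segmentation forward and opens new directions for both ecological and machine learning research.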
Abstract: Insects comprise millions of species, many experiencing severe population declines under environmental and habitat changes. High-throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high-resolution imaging showing strong potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large-scale ecological surveys. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with precise abundance estimates of bulk images holds great potential for rapid, large-scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.
[72] Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM
Qiyuan Dai,Sibei Yang
Main category: cs.CV
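TL;DR: FreeTTA proposes a training-free, broadly applicable test-time adaptation method that uses an online EM algorithm with the zero-shot predictions of vision-language models as priors, significantly improving performance in cross-domain and out-of-distribution settings.
Details
Motivation: Vision-language models are limited in practice by domain shift and distributional change, and conventional test-time adaptation methods depend on costly training or on assumptions about stored data. FreeTTA aims to remove these constraints and improve flexibility.
Contribution: 1. Proposes FreeTTA, the first method to explicitly model the test data distribution; 2. introduces an online EM algorithm that uses zero-shot predictions as priors; 3. requires no training and no assumptions about historical data, making it broadly applicable.
Method: Uses an online EM algorithm that takes the zero-shot predictions of the vision-language model as initial priors, iteratively computes the posterior probabilities of each online test sample, and updates the parameters to refine predictions.
Result: Across 15 datasets in cross-domain and out-of-distribution settings, FreeTTA delivers stable and significant improvements over existing methods.
Insight: Explicitly modeling the test data distribution and exploiting relationships among samples is a previously unexplored and effective direction, and online EM is an efficient tool for real-time adaptation.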
Abstract: Vision-Language Models (VLMs) have become prominent in open-world image recognition for their strong generalization abilities. Yet, their effectiveness in practical applications is compromised by domain shifts and distributional changes, especially when test data distributions diverge from training data. Therefore, the paradigm of test-time adaptation (TTA) has emerged, enabling the use of online off-the-shelf data at test time, supporting independent sample predictions, and eliminating reliance on test annotations. Traditional TTA methods, however, often rely on costly training or optimization processes, or make unrealistic assumptions about accessing or storing historical training and test data. Instead, this study proposes FreeTTA, a training-free and universally available method that makes no assumptions, to enhance the flexibility of TTA. More importantly, FreeTTA is the first to explicitly model the test data distribution, enabling the use of intrinsic relationships among test samples to enhance predictions of individual samples without simultaneous access, a direction not previously explored. FreeTTA achieves these advantages by introducing an online EM algorithm that utilizes zero-shot predictions from VLMs as priors to iteratively compute the posterior probabilities of each online test sample and update parameters. Experiments demonstrate that FreeTTA achieves stable and significant improvements compared to state-of-the-art methods across 15 datasets in both cross-domain and out-of-distribution settings.
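The online EM loop described above can be illustrated with a deliberately small model. This sketch assumes each class is an isotropic Gaussian with a fixed variance over feature vectors, that the zero-shot class probabilities are passed in as the prior for each sample, and that only the class means are updated. The abstract does not specify the distribution FreeTTA models, so the class `OnlineGaussianEM` and its parameters are illustrative only.

```python
import math


class OnlineGaussianEM:
    """Online EM over class means with isotropic, fixed-variance Gaussians."""

    def __init__(self, init_means, sigma=1.0):
        self.means = [list(m) for m in init_means]
        self.counts = [1.0] * len(init_means)  # pseudo-count per initial mean
        self.sigma = sigma

    def step(self, x, prior):
        """Process one test sample; return its posterior over classes."""
        # E-step: posterior is proportional to prior * Gaussian likelihood.
        logits = []
        for mu, p in zip(self.means, prior):
            sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
            logits.append(math.log(max(p, 1e-12)) - sq / (2 * self.sigma ** 2))
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        post = [w / total for w in weights]
        # M-step: responsibility-weighted running update of each class mean.
        for k, gamma in enumerate(post):
            self.counts[k] += gamma
            rate = gamma / self.counts[k]
            self.means[k] = [
                mi + rate * (xi - mi) for xi, mi in zip(x, self.means[k])
            ]
        return post
```

Each sample is seen once and then discarded, so the only state carried between samples is the per-class means and counts, which matches the abstract's point about not storing historical test data.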
[73] MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation
Qilong Xing,Zikai Song,Youjia Zhang,Na Feng,Junqing Yu,Wei Yang
Main category: cs.CV
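TL;DR: MCA-RG proposes a knowledge-driven approach that improves radiology report generation by explicitly aligning visual features with medical concepts, using pathology and anatomy concept banks to enhance feature extraction and optimizing the model with contrastive learning and a matching loss.
Details
Motivation: Large language models (LLMs) for radiology report generation (RRG) still face barriers to clinical adoption. The main problems are the difficulty of accurately mapping pathological and anatomical features to text descriptions and the semantic-agnostic nature of feature extraction.
Contribution: 1. Proposes the MCA-RG framework, which explicitly aligns visual features with medical concepts; 2. designs pathology and anatomy concept banks; 3. introduces anatomy-based contrastive learning and a matching loss for pathological features; 4. proposes a feature gating mechanism to filter out low-quality concept features.
Method: 1. Align visual features using the pathology bank and the anatomy bank; 2. improve anatomical features through contrastive learning; 3. prioritize clinically relevant pathological regions through the matching loss; 4. select high-quality features with the feature gating mechanism; 5. use the aligned features to guide report generation.
Result: Experiments on MIMIC-CXR and CheXpert Plus show that MCA-RG outperforms existing methods, supporting its effectiveness.
Insight: Explicitly aligning medical concepts with visual features is key to better radiology report generation, and knowledge-driven feature enhancement and filtering can markedly improve model performance.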
Abstract: Despite significant advancements in adapting Large Language Models (LLMs) for radiology report generation (RRG), clinical adoption remains challenging due to difficulties in accurately mapping pathological and anatomical features to their corresponding text descriptions. Additionally, semantic-agnostic feature extraction further hampers the generation of accurate diagnostic reports. To address these challenges, we introduce Medical Concept Aligned Radiology Report Generation (MCA-RG), a knowledge-driven framework that explicitly aligns visual features with distinct medical concepts to enhance the report generation process. MCA-RG utilizes two curated concept banks: a pathology bank containing lesion-related knowledge, and an anatomy bank with anatomical descriptions. The visual features are aligned with these medical concepts and undergo tailored enhancement. We further propose an anatomy-based contrastive learning procedure to improve the generalization of anatomical features, coupled with a matching loss for pathological features to prioritize clinically relevant regions. Additionally, a feature gating mechanism is employed to filter out low-quality concept features. Finally, the visual features correspond to individual medical concepts and are leveraged to guide the report generation process. Experiments on two public benchmarks (MIMIC-CXR and CheXpert Plus) demonstrate that MCA-RG achieves superior performance, highlighting its effectiveness in radiology report generation.
[74] Cross-Modality Masked Learning for Survival Prediction in ICI Treated NSCLC Patients
Qilong Xing,Zikai Song,Bingxin Gong,Lian Yang,Junqing Yu,Wei Yang
Main category: cs.CV
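TL;DR: The paper proposes a cross-modality masked learning framework for survival prediction in non-small cell lung cancer (NSCLC) patients, combining 3D CT images and clinical data for more accurate multi-modal feature fusion.
Details
Motivation: Survival prediction for NSCLC patients undergoing immunotherapy is essential for personalized treatment, but the lack of large relevant datasets and of effective multi-modal feature fusion methods has held the field back.
Contribution: 1. Provides a large-scale dataset of 3D CT images and clinical records; 2. proposes a novel cross-modality masked learning framework that improves the accuracy of survival prediction.
Method: Uses a Slice-Depth Transformer to extract 3D features from CT images and a graph-based Transformer to process clinical data, with a masked modality learning strategy guiding feature fusion.
Result: The method performs strongly in multi-modal integration, surpassing existing methods and setting a new benchmark for NSCLC survival prediction.
Insight: By using the intact modality to reconstruct missing components, the masked learning strategy integrates multi-modal features more effectively and improves model performance.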
Abstract: Accurate prognosis of non-small cell lung cancer (NSCLC) patients undergoing immunotherapy is essential for personalized treatment planning, enabling informed patient decisions, and improving both treatment outcomes and quality of life. However, the lack of large, relevant datasets and effective multi-modal feature fusion strategies pose significant challenges in this domain. To address these challenges, we present a large-scale dataset and introduce a novel framework for multi-modal feature fusion aimed at enhancing the accuracy of survival prediction. The dataset comprises 3D CT images and corresponding clinical records from NSCLC patients treated with immune checkpoint inhibitors (ICI), along with progression-free survival (PFS) and overall survival (OS) data. We further propose a cross-modality masked learning approach for medical feature fusion, consisting of two distinct branches, each tailored to its respective modality: a Slice-Depth Transformer for extracting 3D features from CT images and a graph-based Transformer for learning node features and relationships among clinical variables in tabular data. The fusion process is guided by a masked modality learning strategy, wherein the model utilizes the intact modality to reconstruct missing components. This mechanism improves the integration of modality-specific features, fostering more effective inter-modality relationships and feature interactions. Our approach demonstrates superior performance in multi-modal integration for NSCLC survival prediction, surpassing existing methods and setting a new benchmark for prognostic models in this context.
[75] GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning
S M Taslim Uddin Raju,Md. Milon Islam,Md Rezwanul Haque,Hamdi Altaheri,Fakhri Karray
Main category: cs.CV
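TL;DR: GNN-ViTCap combines a GNN with a vision Transformer and uses dynamic clustering and an attention mechanism to improve WSI classification and captioning, with substantial performance gains.
Details
Motivation: WSI classification and captioning face challenges such as redundant patches and unknown patch positions, and existing methods struggle to combine local and global information effectively.
Contribution: The GNN-ViTCap framework, which integrates dynamic clustering, an attention mechanism, and a GNN to handle both classification and captioning, outperforming SOTA methods.
Method: Dynamic clustering selects representative patches, a GNN models local and global relationships, a vision Transformer extracts features, and a large language model generates captions.
Result: Classification reaches an F1 of 0.934 and an AUC of 0.963; captioning reaches a BLEU-4 of 0.811 and a METEOR of 0.569, surpassing existing methods.
Insight: Combining a GNN with a vision Transformer addresses patch redundancy and context modeling in WSIs and can make pathology diagnosis more efficient.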
Abstract: Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs face challenges such as redundant patches and unknown patch positions due to subjective pathologist captures. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce a novel GNN-ViTCap framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model’s input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state-of-the-art approaches, offering a reliable and efficient solution for microscopy-based patient diagnosis.
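The graph-construction step in the abstract above (connect each node to its nearest neighbors in the similarity matrix) can be sketched as follows. This is a minimal illustration, not the authors' code: the cosine similarity measure, the function names `cosine` and `knn_graph`, and the toy 2-D embeddings are all assumptions.

```python
import math
from typing import Dict, List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_graph(embeddings: List[List[float]], k: int) -> Dict[int, List[int]]:
    """Connect every patch to its k most similar patches (self excluded)."""
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    edges = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sim[i][j], reverse=True)
        edges[i] = others[:k]
    return edges

# Hypothetical patch embeddings: two near-duplicate pairs.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_graph(patches, k=1)
```

The resulting adjacency lists would then be handed to a graph neural network; that part is omitted here.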
[76] MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation
Hui Li,Pengfei Yang,Juanyang Chen,Le Dong,Yanxin Chen,Quan Wang
Main category: cs.CV
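TL;DR: MST-Distill is a new cross-modal knowledge distillation framework that mixes multiple specialized teacher models with a dynamic routing network to address distillation path selection and knowledge drift in conventional methods.
Details
Motivation: In cross-modal settings, conventional knowledge distillation struggles to exploit the complementary knowledge of multimodal teacher models because of data and statistical heterogeneity. The paper empirically exposes the limitations of existing methods and proposes a more effective solution.
Contribution: 1. The MST-Distill framework, which introduces a mixture of specialized teachers and a dynamic routing network. 2. A plug-in masking module that suppresses modality discrepancies and reconstructs teacher representations.
Method: 1. Use an ensemble of multimodal teacher models. 2. Distill adaptively through a dynamic routing network. 3. Train the masking module independently to reduce modality discrepancies.
Result: Experiments on five multimodal datasets show that MST-Distill significantly outperforms existing methods.
Insight: A dynamic mixture of teacher models and a masking module that suppresses modality discrepancies are the keys to better cross-modal knowledge distillation.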
Abstract: Knowledge distillation, as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models. Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and text, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.
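To make the idea of instance-level routing over several teachers concrete, here is a minimal sketch for a single sample. It assumes soft routing (a softmax over router logits) and a KL-divergence distillation term; the abstract does not specify either choice, and the names `routed_distill_loss`, `teachers`, and `router_logits` are hypothetical rather than taken from the MST-Distill repository.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def routed_distill_loss(student_logits: List[float],
                        teacher_logits_list: List[List[float]],
                        router_logits: List[float]) -> float:
    """Per-sample distillation loss: router-weighted sum of KL(teacher || student)."""
    student = softmax(student_logits)
    weights = softmax(router_logits)  # assumption: soft, not top-k, routing
    return sum(w * kl_div(softmax(t), student)
               for w, t in zip(weights, teacher_logits_list))

# Hypothetical sample: teacher 0 agrees with the student, teacher 1 does not.
student_logits = [2.0, 0.5, 0.1]
teachers = [[2.0, 0.5, 0.1], [0.1, 0.5, 2.0]]
router_logits = [1.0, -1.0]
loss = routed_distill_loss(student_logits, teachers, router_logits)
```

Because the router produces weights per sample, different inputs can lean on different teachers, which is the "adaptive and dynamic" aspect the abstract describes.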
[77] Evaluating Large Multimodal Models for Nutrition Analysis: A Benchmark Enriched with Contextual Metadata
Bruce Coburn,Jiangpeng He,Megan E. Rollo,Satvinder S. Dhaliwal,Deborah A. Kerr,Fengqing Zhu
Main category: cs.CV
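TL;DR: This paper studies how integrating contextual metadata (such as location, time, and food type) improves the performance of large multimodal models (LMMs) in nutrition analysis, and introduces a new dataset, ACETADA.
Details
Motivation: Existing work mainly evaluates proprietary models such as GPT-4 and overlooks the potential of the broad range of open-weight models. The effect of contextual metadata and its interaction with reasoning modifiers has also not been studied.
Contribution: 1) A study of how contextual metadata improves nutrition analysis; 2) The ACETADA dataset; 3) Evidence that metadata makes reasoning modifiers more effective.
Method: GPS coordinates (location/venue type), timestamps (meal/day type), and food items are supplied as metadata. Eight LMMs (four open-weight and four closed-weight) are evaluated, comparing image-only prompting with metadata-augmented prompting.
Result: Integrating metadata intelligently significantly reduces the mean absolute error (MAE) and mean absolute percentage error (MAPE) of predicted nutritional values.
Insight: Context-aware LMMs hold great potential for nutrition analysis, and the performance of open-weight LMMs has not yet been fully explored.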
Abstract: Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
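The two error metrics named in the abstract above, MAE and MAPE, are computed as in the sketch below. The calorie values are made-up toy numbers, not results from the paper, and the MAPE helper assumes no ground-truth value is zero.

```python
from typing import Sequence

def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (assumes no zero targets)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical calorie estimates for three meals.
true_kcal = [500.0, 250.0, 800.0]
pred_kcal = [450.0, 300.0, 800.0]
calorie_mae = mae(true_kcal, pred_kcal)    # (50 + 50 + 0) / 3
calorie_mape = mape(true_kcal, pred_kcal)  # 100 * (0.1 + 0.2 + 0) / 3
```

MAPE normalizes each error by the true value, so a 50 kcal miss on a 250 kcal snack counts twice as much as the same miss on a 500 kcal meal.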
[78] Reading a Ruler in the Wild
Yimu Pan,Manas Mehta,Gwen Sincerbeaux,Jeffery A. Goldstein,Alison D. Gernand,James Z. Wang
Main category: cs.CV
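TL;DR: The paper proposes RulerNet, a deep learning framework that recasts ruler reading as a unified keypoint-detection problem and represents the ruler with geometric-progression parameters, addressing the challenge of converting pixel measurements to real-world scale in complex environments.
Details
Motivation: Traditional methods rely on handcrafted thresholds or fixed, ruler-specific pipelines and generalize poorly across ruler types and imaging conditions. The work aims at a general method that robustly infers real-world scale.
Contribution: 1. RulerNet, which unifies ruler reading as a keypoint-detection problem; 2. A geometric-progression parameter representation that is robust to perspective transformations; 3. A synthetic-data training strategy and a lightweight network, DeepGP.
Method: 1. Represent the ruler with geometric-progression parameters so that perspective transformations do not affect it; 2. Increase training diversity with synthetic data (graphics-based generation plus ControlNet); 3. Regress the geometric-progression parameters directly with DeepGP, with no iterative optimization.
Result: Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions, across multiple ruler types and imaging conditions.
Insight: Geometric-progression parameterization is an effective way to handle perspective transformations, synthetic data can markedly improve generalization, and the lightweight network design makes real-time use possible.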
Abstract: Accurately converting pixel measurements into absolute real-world dimensions remains a fundamental challenge in computer vision and limits progress in key applications such as biomedicine, forensics, nutritional analysis, and e-commerce. We introduce RulerNet, a deep learning framework that robustly infers scale “in the wild” by reformulating ruler reading as a unified keypoint-detection problem and by representing the ruler with geometric-progression parameters that are invariant to perspective transformations. Unlike traditional methods that rely on handcrafted thresholds or rigid, ruler-specific pipelines, RulerNet directly localizes centimeter marks using a distortion-invariant annotation and training strategy, enabling strong generalization across diverse ruler types and imaging conditions while mitigating data scarcity. We also present a scalable synthetic-data pipeline that combines graphics-based ruler generation with ControlNet to add photorealistic context, greatly increasing training diversity and improving performance. To further enhance robustness and efficiency, we propose DeepGP, a lightweight feed-forward network that regresses geometric-progression parameters from noisy marks and eliminates iterative optimization, enabling real-time scale estimation on mobile or edge devices. Experiments show that RulerNet delivers accurate, consistent, and efficient scale estimates under challenging real-world conditions. These results underscore its utility as a generalizable measurement tool and its potential for integration with other vision components for automated, scale-aware analysis in high-impact domains. A live demo is available at https://huggingface.co/spaces/ymp5078/RulerNet-Demo.
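As a rough illustration of what a geometric-progression description of ruler marks could look like, the sketch below generates mark positions whose spacings shrink by a constant ratio and then recovers the parameters from the positions. The exact parametrization used by RulerNet and DeepGP is not given in the abstract, so the three parameters `(p0, gap, ratio)`, the closed-form fit, and the function names are assumptions. The fit is a simple analytic stand-in, not the learned DeepGP regressor.

```python
import math
from typing import List, Tuple

def gp_marks(p0: float, gap: float, ratio: float, n: int) -> List[float]:
    """Pixel positions of n marks whose successive spacings are gap, gap*ratio, ..."""
    marks, pos, step = [], p0, gap
    for _ in range(n):
        marks.append(pos)
        pos += step
        step *= ratio
    return marks

def fit_gp(marks: List[float]) -> Tuple[float, float, float]:
    """Recover (p0, first gap, ratio) from at least three ordered mark positions."""
    gaps = [b - a for a, b in zip(marks, marks[1:])]
    log_ratios = [math.log(g2 / g1) for g1, g2 in zip(gaps, gaps[1:])]
    ratio = math.exp(sum(log_ratios) / len(log_ratios))  # geometric mean of gap ratios
    return marks[0], gaps[0], ratio

# Hypothetical ruler seen at an angle: each centimeter appears 10% shorter than the last.
marks = gp_marks(p0=10.0, gap=20.0, ratio=0.9, n=5)
p0, gap, ratio = fit_gp(marks)
```

Assuming adjacent marks are one centimeter apart, `gap` is the local pixels-per-centimeter scale at the first mark and `gap * ratio**k` the scale at mark `k`.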
[79] Evaluating Attribute Confusion in Fashion Text-to-Image Generation
Ziyue Liu,Federico Girella,Yiming Wang,Davide Talon
Main category: cs.CV
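TL;DR: The paper proposes L-VQAScore, a new metric based on visual question answering (VQA), for evaluating attribute confusion in fashion text-to-image generation models; it outperforms existing methods.
Details
Motivation: Current automatic evaluation methods for text-to-image (T2I) generation fall short in the fashion domain, particularly in capturing complex entity-attribute associations (such as attribute confusion).
Contribution: 1. L-VQAScore, a new metric based on a localized VQA strategy; 2. A new dataset targeting compositional alignment; 3. Experiments showing that L-VQAScore agrees more closely with human evaluation.
Method: L-VQAScore combines visual localization with VQA probing, accounting for both correctly generated attributes (reflection) and mislocalized ones (leakage).
Result: On a new dataset with challenging alignment tasks, L-VQAScore outperforms existing T2I evaluation methods and correlates more strongly with human judgments.
Insight: A localized VQA strategy captures fine-grained entity-attribute associations and offers a new direction for T2I evaluation.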
Abstract: Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, involving complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges in attribute confusion, i.e., when attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting one single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA probing both correct (reflection) and mislocalized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can be a reliable and scalable alternative to subjective evaluations.
[80] Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Ke Fan,Shunlin Lu,Minyue Dai,Runyi Yu,Lixing Xiao,Zhiyang Dou,Junting Dong,Lizhuang Ma,Jingbo Wang
Main category: cs.CV
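TL;DR: The paper introduces the MotionMillion dataset and the MotionMillion-Eval benchmark to advance zero-shot text-to-motion generation, and achieves strong generalization by scaling both data and model parameters.
Details
Motivation: The zero-shot generalization of current text-to-motion methods is limited, mainly because training datasets are too small and a comprehensive evaluation framework is missing.
Contribution: 1) MotionMillion, the largest human motion dataset to date; 2) MotionMillion-Eval, a benchmark for evaluating zero-shot motion generation; 3) A model scaled to 7B parameters, with its generalization validated.
Method: 1) Build a large dataset with an efficient annotation pipeline; 2) Propose a scalable model architecture; 3) Run zero-shot evaluation on MotionMillion-Eval.
Result: The model generalizes strongly to out-of-domain and complex compositional motions.
Insight: Large-scale data and a comprehensive evaluation framework are key to zero-shot motion generation.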
Abstract: Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
[81] Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Tiezheng Zhang,Yitong Li,Yu-cheng Chou,Jieneng Chen,Alan Yuille,Chen Wei,Junfei Xiao
Main category: cs.CV
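TL;DR: The paper proposes the Vision-Language-Vision (VLV) auto-encoder framework, which builds an information bottleneck from pretrained components (a vision encoder, a T2I diffusion model, and an LLM) to distill knowledge from diffusion models efficiently, reducing training cost and data requirements.
Details
Motivation: Conventional vision-language models (VLMs) need large numbers of high-quality image-text pairs and GPU resources, which makes training expensive. This paper aims to reduce training cost and data requirements through knowledge distillation and pretrained components.
Contribution: 1. The VLV framework, which builds an information bottleneck by freezing the pretrained T2I diffusion decoder; 2. High-quality semantic reconstructions; 3. State-of-the-art image captioning obtained by fine-tuning an LLM, at a cost under $1,000.
Method: 1. A vision encoder extracts image features; 2. The frozen T2I diffusion decoder regularizes the language representation; 3. A fine-tuned LLM generates descriptions.
Result: With reduced data requirements and low cost, VLV achieves captioning performance comparable to GPT-4o and Gemini 2.0 Flash.
Insight: Combining pretrained components with an information-bottleneck design can substantially cut VLM training cost and data requirements while maintaining high performance.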
Abstract: Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.
[82] 4KAgent: Agentic Any Image to 4K Super-Resolution
Yushen Zuo,Qi Zheng,Mingyang Wu,Xinrui Jiang,Renjie Li,Jian Wang,Yide Zhang,Gengchen Mai,Lihong V. Wang,James Zou,Xiaoyu Wang,Ming-Hsuan Yang,Zhengzhong Tu
Main category: cs.CV
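TL;DR: 4KAgent is a general-purpose agentic upscaling system that can bring any image to 4K resolution. It performs super-resolution through a profiling module, a Perception Agent, and a Restoration Agent, and performs strongly across many task categories.
Details
Motivation: Existing super-resolution methods are usually designed for specific tasks and lack generality and flexibility. 4KAgent aims for unified cross-domain super-resolution through an agentic system, especially for extremely low-resolution and severely degraded images.
Contribution: 1. A unified agentic generalist super-resolution system; 2. A dynamic planning-and-execution framework that combines a Perception Agent and a Restoration Agent; 3. State-of-the-art performance across 11 task categories.
Method: 1. A Profiling module customizes the pipeline; 2. The Perception Agent uses vision-language models and image quality assessment to make a restoration plan; 3. The Restoration Agent follows a recursive execution-reflection paradigm with a mixture-of-experts policy.
Result: Across 26 benchmarks, 4KAgent shows superior performance on both perceptual and fidelity metrics (e.g., NIQE, PSNR), covering natural images, medical imaging, and other domains.
Insight: The agentic approach offers a new paradigm for low-level vision tasks, and its dynamic planning and execution mechanism may inspire broader research on vision-centric autonomous agents.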
Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.
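The Restoration Agent's loop (run several expert tools per step, keep the output a quality score prefers, reflect before moving on) can be sketched in a few lines. Everything concrete below is a stand-in: images are lists of floats, the experts are toy lambdas, `quality` is a plain mean, and the rule "accept a step only if quality improves" is an assumed form of reflection, not the policy described in the paper.

```python
from typing import Callable, Dict, List, Tuple

Image = List[float]                 # stand-in for real image data
Expert = Callable[[Image], Image]   # a restoration tool

def run_plan(image: Image,
             plan: List[str],
             experts: Dict[str, Dict[str, Expert]],
             quality: Callable[[Image], float]) -> Tuple[Image, List[Tuple[str, str]]]:
    """Execute a restoration plan, choosing the best-scoring expert at each step."""
    history = []
    for task in plan:
        candidates = [(name, tool(image)) for name, tool in experts[task].items()]
        best_name, best_out = max(candidates, key=lambda c: quality(c[1]))
        if quality(best_out) > quality(image):  # reflection: keep only improvements
            image = best_out
            history.append((task, best_name))
        else:
            history.append((task, "skipped"))   # roll back to the previous image
    return image, history

# Hypothetical experts and quality score.
experts = {
    "denoise": {"mild": lambda img: [v + 0.1 for v in img],
                "strong": lambda img: [v + 0.2 for v in img]},
    "upscale": {"blurry": lambda img: [v - 0.1 for v in img]},
}
quality = lambda img: sum(img) / len(img)
result, history = run_plan([0.2, 0.4], ["denoise", "upscale"], experts, quality)
```

In this toy run the "strong" denoiser wins the first step and the only upscaler is rejected because it lowers the score.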
[83] Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor
Vatsal Agarwal,Matthew Gwilliam,Gefen Kohavi,Eshan Verma,Daniel Ulbricht,Abhinav Shrivastava
Main category: cs.CV
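TL;DR: This paper explores using a pretrained text-to-image diffusion model (Stable Diffusion) as a task-aware feature extractor to compensate for CLIP's inability to capture fine-grained information in visual encoding. It analyzes the semantic richness and image-text alignment of diffusion features and proposes a strategy that fuses CLIP and diffusion features, improving the visual understanding of multimodal models.
Details
Motivation: Existing multimodal large language models (MLLMs) rely on CLIP as the visual encoder, which cannot fully capture fine-grained, task-relevant visual information. The paper asks whether diffusion models can serve as better visual encoders.
Contribution: 1. Shows that diffusion features are semantically rich and encode strong image-text alignment; 2. Uncovers an information leakage problem when aligning diffusion features with LLMs and proposes a mitigation strategy; 3. Proposes a simple strategy fusing CLIP and diffusion features that improves visual understanding, especially on tasks requiring spatial and compositional reasoning.
Method: 1. Analyze the internal feature representations of diffusion models; 2. Use text conditioning to focus the model on regions relevant to the input query; 3. Introduce a fusion strategy that combines CLIP and diffusion features.
Result: On general VQA and specialized MLLM benchmarks, the approach that fuses diffusion features shows promise, especially on spatial and compositional reasoning tasks.
Insight: Beyond generating high-quality images, diffusion models have internal features that can serve as strong visual encoders, supplying richer fine-grained information for multimodal understanding.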
Abstract: Recent advances in multimodal large language models (MLLMs) have enabled image-based question-answering capabilities. However, a key limitation is the use of CLIP as the visual encoder; while it can capture coarse global information, it can often miss fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find diffusion features are both rich in semantics and can encode strong image-text alignment. Moreover, we find that we can leverage text conditioning to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, where the LLM can inadvertently recover information from the original diffusion prompt. We analyze the causes of this leakage and propose a mitigation strategy. Based on these insights, we explore a simple fusion strategy that utilizes both CLIP and conditional diffusion features. We evaluate our approach on both general VQA and specialized MLLM benchmarks, demonstrating the promise of diffusion models for visual understanding, particularly in vision-centric tasks that require spatial and compositional reasoning. Our project page can be found at https://vatsalag99.github.io/mustafar/.
cs.IR [Back]
[84] DS@GT at CheckThat! 2025: Exploring Retrieval and Reranking Pipelines for Scientific Claim Source Retrieval on Social Media Discourse
Jeanette Schofield,Shuyu Tian,Hoang Thanh Thanh Truong,Maximilian Heil
Main category: cs.IR
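TL;DR: For CLEF 2025 CheckThat! Lab Task 4b, the DS@GT team explored several retrieval and reranking methods for finding the sources of scientific claims made in tweets, reaching an MRR@5 of 0.58, an improvement of 0.15 over the baseline.
Details
Motivation: Social media users often make scientific claims without giving sources, which creates a need to verify those claims.
Contribution: Six data augmentation techniques, seven retrieval and reranking pipelines, and a fine-tuned bi-encoder.
Method: Combines multiple data augmentation, retrieval, and reranking techniques and fine-tunes a bi-encoder model.
Result: An MRR@5 of 0.58 on CLEF 2025 CheckThat! Lab Task 4b, ranking 16th out of 30 teams.
Insight: Data augmentation and retrieval-pipeline optimization are crucial for scientific claim source retrieval.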
Abstract: Social media users often make scientific claims without citing where these claims come from, generating a need to verify these claims. This paper details work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b Scientific Claim Source Retrieval which seeks to find relevant scientific papers based on implicit references in tweets. Our team explored 6 different data augmentation techniques, 7 different retrieval and reranking pipelines, and finetuned a bi-encoder. Achieving an MRR@5 of 0.58, our team ranked 16th out of 30 teams for the CLEF 2025 CheckThat! Lab Task 4b, an improvement of 0.15 over the BM25 baseline of 0.43. Our code is available on GitHub at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4b.
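For readers unfamiliar with the metric, MRR@5 averages the reciprocal rank of the correct paper over all queries, counting a query as zero when the correct paper is not in the top five. A minimal sketch follows; the function name `mrr_at_k` and the toy document IDs are illustrative and not taken from the team's repository.

```python
from typing import Sequence

def mrr_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 5) -> float:
    """Mean reciprocal rank, counting only hits within the top-k results."""
    total = 0.0
    for candidates, target in zip(ranked, gold):
        for rank, doc_id in enumerate(candidates[:k], start=1):
            if doc_id == target:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(gold)

# Hypothetical toy data: three tweets, each with one gold paper ID.
ranked = [["p1", "p2", "p3"], ["p9", "p4", "p5"], ["p6", "p7", "p8"]]
gold = ["p1", "p4", "p0"]
score = mrr_at_k(ranked, gold)  # (1 + 1/2 + 0) / 3
```

Under this definition, the reported move from 0.43 (BM25) to 0.58 means the correct paper sits noticeably closer to the top of the list on average.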
q-bio.QM [Back]
[85] DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning
Shreyas Vinaya Sathyanarayana,Rahil Shah,Sharanabasava D. Hiremath,Rishikesh Panda,Rahul Jana,Riya Singh,Rida Irfan,Ashwin Murali,Bharath Ramsundar
Main category: q-bio.QM
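TL;DR: DeepRetro is an iterative retrosynthesis framework that combines an LLM with conventional template-based/MCTS tools and explores novel synthesis routes through dynamic feedback and correction.
Details
Motivation: Existing retrosynthesis methods mostly depend on predefined templates and struggle to find new routes, while LLM-based methods still fall short in multi-step planning.
Contribution: The DeepRetro framework, which combines the generative power of LLMs with the validation mechanisms of conventional tools to enable dynamic pathway exploration and human expert feedback.
Method: An iterative hybrid framework: when the template engine fails, the LLM proposes single-step disconnections, which are rigorously validated and then fed back recursively into the pipeline.
Result: The framework identifies viable and potentially novel synthesis routes, supports human expert feedback through an interactive interface, and is applied to the synthesis of complex natural products.
Insight: Combining iterative LLM reasoning with human feedback can markedly improve the novelty and practicality of retrosynthesis.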
Abstract: Retrosynthesis, the identification of precursor molecules for a target compound, is pivotal for synthesizing complex molecules, but faces challenges in discovering novel pathways beyond predefined templates. Recent large language model (LLM) approaches to retrosynthesis have shown promise, but harnessing LLM reasoning capabilities for effective multi-step planning remains an open question. To address this challenge, we introduce DeepRetro, an open-source, iterative, hybrid LLM-based retrosynthetic framework. Our approach integrates the strengths of conventional template-based/Monte Carlo tree search tools with the generative power of LLMs in a step-wise, feedback-driven loop. Initially, synthesis planning is attempted with a template-based engine. If this fails, the LLM subsequently proposes single-step retrosynthetic disconnections. Crucially, these suggestions undergo rigorous validity, stability, and hallucination checks before the resulting precursors are recursively fed back into the pipeline for further evaluation. This iterative refinement allows for dynamic pathway exploration and correction. We demonstrate the potential of this pipeline through benchmark evaluations and case studies, showcasing its ability to identify viable and potentially novel retrosynthetic routes. In particular, we develop an interactive graphical user interface that allows expert human chemists to provide human-in-the-loop feedback to the reasoning algorithm. This approach successfully generates novel pathways for complex natural product compounds, demonstrating the potential for iterative LLM reasoning to advance the state of the art in complex chemical syntheses.
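The control flow described in the abstract above (template engine first, LLM fallback, checks, then recursion on the precursors) can be sketched as follows. Molecules are plain strings and every callable (`template_engine`, `llm_propose`, `passes_checks`, `is_purchasable`) is a hypothetical stub supplied by the caller; none of this reflects the actual DeepRetro API.

```python
from typing import Callable, Dict, List, Optional

Route = Dict[str, object]

def plan_route(target: str,
               template_engine: Callable[[str], Optional[List[str]]],
               llm_propose: Callable[[str], List[List[str]]],
               passes_checks: Callable[[str, List[str]], bool],
               is_purchasable: Callable[[str], bool],
               depth: int = 5) -> Optional[Route]:
    """Return a nested route for target, or None if no route is found."""
    if is_purchasable(target):
        return {"molecule": target, "precursors": []}
    if depth == 0:
        return None
    templated = template_engine(target)
    if templated is not None:
        options = [templated]
    else:
        # Fallback: LLM suggestions, kept only if they pass the checks.
        options = [c for c in llm_propose(target) if passes_checks(target, c)]
    for precursors in options:
        subroutes = [plan_route(p, template_engine, llm_propose,
                                passes_checks, is_purchasable, depth - 1)
                     for p in precursors]
        if all(r is not None for r in subroutes):
            return {"molecule": target, "precursors": subroutes}
    return None

# Toy stand-ins: letters are building blocks, "X" marks a hallucinated precursor.
PURCHASABLE = {"A", "B", "C"}
TEMPLATES = {"AB": ["A", "B"]}
LLM_IDEAS = {"ABC": [["X", "Y"], ["AB", "C"]]}
route = plan_route("ABC", TEMPLATES.get, lambda t: LLM_IDEAS.get(t, []),
                   lambda t, c: "X" not in c, PURCHASABLE.__contains__)
```

Here the template engine has no rule for "ABC", the first LLM idea is rejected by the check, and the second is accepted because "AB" is then solved by a template.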
[86] PAST: A multimodal single-cell foundation model for histopathology and spatial transcriptomics in cancer
Changchun Yang,Haoyang Li,Yushuai Wu,Yilan Zhang,Yifeng Jiao,Yu Zhang,Rihan Huang,Yuan Cheng,Yuan Qi,Xin Guo,Xin Gao
Main category: q-bio.QM
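TL;DR: PAST is a multimodal single-cell foundation model that jointly learns from histopathology images and single-cell transcriptomes, producing unified cross-modal representations and substantially improving prediction and analysis in cancer research.
Details
Motivation: Current pathology foundation models often lack integration with molecular data at single-cell resolution, which limits their use in precision oncology.
Contribution: PAST, a cross-modal single-cell foundation model that jointly captures morphology and gene expression at the single-cell level.
Method: Train on 20 million paired histopathology images and single-cell transcriptomes to learn unified cross-modal representations.
Result: Across multiple cancers and tasks, PAST outperforms existing methods and shows strong generalizability and scalability.
Insight: PAST provides a versatile tool for high-resolution spatial omics, mechanistic discovery, and precision cancer research.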
Abstract: While pathology foundation models have transformed cancer image analysis, they often lack integration with molecular data at single-cell resolution, limiting their utility for precision oncology. Here, we present PAST, a pan-cancer single-cell foundation model trained on 20 million paired histopathology images and single-cell transcriptomes spanning multiple tumor types and tissue contexts. By jointly encoding cellular morphology and gene expression, PAST learns unified cross-modal representations that capture both spatial and molecular heterogeneity at the cellular level. This approach enables accurate prediction of single-cell gene expression, virtual molecular staining, and multimodal survival analysis directly from routine pathology slides. Across diverse cancers and downstream tasks, PAST consistently exceeds the performance of existing approaches, demonstrating robust generalizability and scalability. Our work establishes a new paradigm for pathology foundation models, providing a versatile tool for high-resolution spatial omics, mechanistic discovery, and precision cancer research.
eess.IV [Back]
[87] Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation
Szymon Płotka,Maciej Chrabaszcz,Gizem Mert,Ewa Szczurek,Arkadiusz Sitek
Main category: eess.IV
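TL;DR: The paper proposes HoME, a hierarchical soft mixture-of-experts model for 3D medical image segmentation, which uses a two-level token-routing layer for efficient long-context modeling and outperforms existing methods.
Details
Motivation: 3D medical image segmentation faces challenges such as handling diverse modalities and data variability, and needs efficient long-context modeling.
Contribution: HoME, a hierarchical soft mixture-of-experts model built on the Mamba SSM, which improves segmentation performance through local and global expert routing.
Method: Two levels of SMoE layers: local experts extract features and global experts fuse information, combined with the Mamba SSM for efficient long-sequence modeling.
Result: SOTA results on multi-modality 3D medical image datasets, with strong generalization.
Insight: Hierarchical expert routing effectively combines local feature extraction with global context fusion and suits complex medical image segmentation tasks.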
Abstract: In recent years, artificial intelligence has significantly advanced medical image segmentation. However, challenges remain, including efficient 3D medical image processing across diverse modalities and handling data variability. In this work, we introduce Hierarchical Soft Mixture-of-Experts (HoME), a two-level token-routing layer for efficient long-context modeling, specifically designed for 3D medical image segmentation. Built on the Mamba state-space model (SSM) backbone, HoME enhances sequential modeling through sparse, adaptive expert routing. The first stage employs a Soft Mixture-of-Experts (SMoE) layer to partition input sequences into local groups, routing tokens to specialized per-group experts for localized feature extraction. The second stage aggregates these outputs via a global SMoE layer, enabling cross-group information fusion and global context refinement. This hierarchical design, combining local expert routing with global expert refinement, improves generalizability and segmentation performance, surpassing state-of-the-art results across datasets from the three most commonly used 3D medical imaging modalities and data quality.
[88] Mitigating Multi-Sequence 3D Prostate MRI Data Scarcity through Domain Adaptation using Locally-Trained Latent Diffusion Models for Prostate Cancer Detection
Emerson P. Grabke,Babak Taati,Masoom A. Haider
Main category: eess.IV
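TL;DR: The paper proposes CCELLA++, an improved latent diffusion model (LDM) that generates multi-sequence prostate MRI to address data scarcity and improve classifier performance for prostate cancer detection.
Details
Motivation: The existing CCELLA LDM was limited to the axial T2-weighted sequence, did not study inter-institutional domain shift, and prioritized radiology over histopathology outcomes. CCELLA++ aims to address these limitations and improve clinical utility.
Contribution: 1. Extension to multi-sequence bpMRI generation; 2. A study of domain adaptation; 3. Evidence that synthetic data is advantageous for classifier training.
Method: CCELLA++ generates synthetic bpMRI (AxT2, HighB, ADC), and domain adaptation experiments validate its performance.
Result: CCELLA++ improves FID for the HighB and ADC sequences (but not AxT2), and classifiers pretrained on its synthetic data outperform those pretrained on real data in the domain adaptation tasks.
Insight: Synthetic data may outperform real data for domain adaptation with few target samples, and multi-sequence generation benefits medical image analysis.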
Abstract: Objective: Latent diffusion models (LDMs) could mitigate data scarcity challenges affecting machine learning development for medical image interpretation. The recent CCELLA LDM improved prostate cancer detection performance using synthetic MRI for classifier training but was limited to the axial T2-weighted (AxT2) sequence, did not investigate inter-institutional domain shift, and prioritized radiology over histopathology outcomes. We propose CCELLA++ to address these limitations and improve clinical utility. Methods: CCELLA++ expands CCELLA for simultaneous biparametric prostate MRI (bpMRI) generation, including the AxT2, high b-value diffusion series (HighB) and apparent diffusion coefficient map (ADC). Domain adaptation was investigated by pretraining classifiers on real or LDM-generated synthetic data from an internal institution, followed with fine-tuning on progressively smaller fractions of an out-of-distribution, external dataset. Results: CCELLA++ improved 3D FID for HighB and ADC but not AxT2 (0.013, 0.012, 0.063 respectively) sequences compared to CCELLA (0.060). Classifier pretraining with CCELLA++ bpMRI outperformed real bpMRI in AP and AUC for all domain adaptation scenarios. CCELLA++ pretraining achieved highest classifier performance below 50% (n=665) external dataset volume. Conclusion: Synthetic bpMRI generated by our method can improve downstream classifier generalization and performance beyond real bpMRI or CCELLA-generated AxT2-only images. Future work should seek to quantify medical image sample quality, balance multi-sequence LDM training, and condition the LDM with additional information. Significance: The proposed CCELLA++ LDM can generate synthetic bpMRI that outperforms real data for domain adaptation with a limited target institution dataset. Our code is available at https://github.com/grabkeem/CCELLA-plus-plus
[89] Capsule-ConvKAN: A Hybrid Neural Approach to Medical Image Classification
Laura Pituková,Peter Sinčák,László József Kovács
Main category: eess.IV
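TL;DR: The study proposes Capsule-ConvKAN, a new hybrid neural network architecture that combines the strengths of the Capsule Network and the Convolutional Kolmogorov–Arnold Network and achieved the best performance (91.21% accuracy) on a medical image classification task.
Details
Motivation: Conventional convolutional neural networks struggle to capture complex spatial features in medical image classification. To improve feature representation and classification accuracy, the authors explore a hybrid architecture that combines the dynamic routing of the Capsule Network with the flexibility of the Convolutional Kolmogorov–Arnold Network.
Contribution: Capsule-ConvKAN, a new hybrid neural network that merges dynamic routing with flexible function approximation and improves medical image classification performance.
Method: A new hybrid model combines the spatial hierarchy and dynamic routing of the Capsule Network with the flexible function approximation of the Convolutional Kolmogorov–Arnold Network.
Result: On a histopathological image dataset, Capsule-ConvKAN outperforms the other compared architectures with an accuracy of 91.21%.
Insight: Capsule-ConvKAN captures spatial patterns and handles complex features more effectively, offering a new way around the limitations of conventional convolutional models in medical image classification.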
Abstract: This study conducts a comprehensive comparison of four neural network architectures: Convolutional Neural Network, Capsule Network, Convolutional Kolmogorov–Arnold Network, and the newly proposed Capsule–Convolutional Kolmogorov–Arnold Network. The proposed Capsule-ConvKAN architecture combines the dynamic routing and spatial hierarchy capabilities of Capsule Network with the flexible and interpretable function approximation of Convolutional Kolmogorov–Arnold Networks. This novel hybrid model was developed to improve feature representation and classification accuracy, particularly in challenging real-world biomedical image data. The architectures were evaluated on a histopathological image dataset, where Capsule-ConvKAN achieved the highest classification performance with an accuracy of 91.21%. The results demonstrate the potential of the newly introduced Capsule-ConvKAN in capturing spatial patterns, managing complex features, and addressing the limitations of traditional convolutional models in medical image classification.
[90] Fast Equivariant Imaging: Acceleration for Unsupervised Learning via Augmented Lagrangian and Auxiliary PnP Denoisers
Guixian Xu,Jinglai Li,Junqi Tang
Main category: eess.IV
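TL;DR: The paper proposes Fast Equivariant Imaging (FEI), a framework that accelerates unsupervised learning with Lagrange multipliers and plug-and-play denoisers, substantially improving training efficiency and performance.
Details
Motivation: Unsupervised learning for imaging tasks needs methods that are both efficient and accurate. The standard Equivariant Imaging (EI) approach is inefficient and needs improvement.
Contribution: The FEI framework, which combines Lagrange multipliers with plug-and-play denoisers and achieves a 10x speedup and better generalization.
Method: Reformulates the optimization problem using Lagrange multipliers and introduces plug-and-play denoisers.
Result: On the CT100 dataset, training a U-Net is 10x faster, with better performance than standard EI.
Insight: Combining Lagrange multipliers with plug-and-play denoisers gives unsupervised learning an efficient, high-performing solution.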
Abstract: We propose Fast Equivariant Imaging (FEI), a novel unsupervised learning framework to efficiently train deep imaging networks without ground-truth data. By reformulating the Equivariant Imaging-based optimization problem via the method of Lagrange multipliers and utilizing plug-and-play denoisers, this novel unsupervised scheme shows superior efficiency and performance compared to the vanilla Equivariant Imaging paradigm. In particular, our PnP-FEI scheme achieves an order-of-magnitude (10x) acceleration over standard EI when training a U-Net on the CT100 dataset for X-ray CT reconstruction, with improved generalization performance.
[91] Speckle2Self: Self-Supervised Ultrasound Speckle Reduction Without Clean Data
Xuesong Li,Nassir Navab,Zhongliang Jiang
Main category: eess.IV
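TL;DR: Speckle2Self is a self-supervised method for ultrasound speckle reduction that needs no clean data; it models and removes speckle noise through a multi-scale perturbation operation.
Details
Motivation: Speckle noise in ultrasound images is tissue-dependent, so existing denoising approaches such as Noise2Noise or blind-spot networks cannot be applied directly. A self-supervised method that relies only on single noisy images is therefore needed.
Contribution: Proposes Speckle2Self, a novel algorithm for self-supervised ultrasound speckle reduction from single noisy images only, which uses a multi-scale perturbation operation to separate noise from anatomical structure.
Method: Applies a multi-scale perturbation (MSP) operation to produce variants whose speckle pattern differs across scales, then separates the noise component through low-rank modeling while preserving the shared anatomical structure.
Result: Effectiveness is validated on simulated and real ultrasound images, where the method is reported to outperform conventional filters and SOTA learning-based methods and to generalize across devices.
Insight: The high spatial dependency of ultrasound speckle can be modeled through multi-scale perturbation, and the low-rank nature of the anatomical structure is the key to denoising.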
Abstract: Image denoising is a fundamental task in computer vision, particularly in medical ultrasound (US) imaging, where speckle noise significantly degrades image quality. Although recent advancements in deep neural networks have led to substantial improvements in denoising for natural images, these methods cannot be directly applied to US speckle noise, as it is not purely random. Instead, US speckle arises from complex wave interference within the body microstructure, making it tissue-dependent. This dependency means that obtaining two independent noisy observations of the same scene, as required by pioneering Noise2Noise, is not feasible. Additionally, blind-spot networks also cannot handle US speckle noise due to its high spatial dependency. To address this challenge, we introduce Speckle2Self, a novel self-supervised algorithm for speckle reduction using only single noisy observations. The key insight is that applying a multi-scale perturbation (MSP) operation introduces tissue-dependent variations in the speckle pattern across different scales, while preserving the shared anatomical structure. This enables effective speckle suppression by modeling the clean image as a low-rank signal and isolating the sparse noise component. To demonstrate its effectiveness, Speckle2Self is comprehensively compared with conventional filter-based denoising algorithms and SOTA learning-based methods, using both realistic simulated US images and human carotid US images. Additionally, data from multiple US machines are employed to evaluate model generalization and adaptability to images from unseen domains. Code and datasets will be released upon acceptance.
[92] SimCortex: Collision-free Simultaneous Cortical Surfaces Reconstruction
Kaveh Moradkhani,R Jarrett Rushmore,Sylvain Bouix
Main category: eess.IV
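TL;DR: SimCortex is a deep learning framework that simultaneously reconstructs collision-free cortical surfaces from T1-weighted MRI data, addressing the overlaps, self-intersections, and topological defects common in existing methods.
Details
Motivation: Existing cortical surface reconstruction methods, faced with complex geometry and strict topological requirements, often produce overlapping surfaces and topological defects, which undermines reliable neuroanatomical analysis.
Contribution: Proposes the SimCortex framework, which reconstructs all brain surfaces (left/right white-matter and pial) simultaneously and markedly reduces collisions and self-intersections while preserving topological properties.
Method: 1) Segment the T1w image into a nine-class tissue label map with deep learning; 2) generate collision-free initial surface meshes; 3) refine the surfaces with multiscale diffeomorphic deformations based on stationary velocity fields (SVFs).
Result: On standard datasets, SimCortex substantially reduces surface overlaps and self-intersections while maintaining geometric accuracy, surpassing existing methods.
Insight: Combining deep learning with diffeomorphic deformation can handle the difficult topological constraints of cortical surface reconstruction efficiently, giving neuroanatomical analysis a more reliable tool.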
Abstract: Accurate cortical surface reconstruction from magnetic resonance imaging (MRI) data is crucial for reliable neuroanatomical analyses. Current methods have to contend with complex cortical geometries and strict topological requirements, and often produce surfaces with overlaps, self-intersections, and topological defects. To overcome these shortcomings, we introduce SimCortex, a deep learning framework that simultaneously reconstructs all brain surfaces (left/right white-matter and pial) from T1-weighted (T1w) MRI volumes while preserving topological properties. Our method first segments the T1w image into a nine-class tissue label map. From these segmentations, we generate subject-specific, collision-free initial surface meshes. These surfaces serve as precise initializations for subsequent multiscale diffeomorphic deformations. Employing stationary velocity fields (SVFs) integrated via scaling-and-squaring, our approach ensures smooth, topology-preserving transformations with significantly reduced surface collisions and self-intersections. Evaluations on standard datasets demonstrate that SimCortex dramatically reduces surface overlaps and self-intersections, surpassing current methods while maintaining state-of-the-art geometric accuracy.
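The scaling-and-squaring integration of a stationary velocity field mentioned in this abstract can be illustrated with a minimal 1D sketch. This is not SimCortex's implementation, which deforms 3D surface meshes; the function name `integrate_svf_1d`, the grid, the example velocity field, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

def integrate_svf_1d(velocity, grid, num_steps=6):
    """Approximate the flow map phi = exp(v) of a stationary velocity field.

    velocity: samples of v on `grid` (1D, increasing).
    num_steps: number of squaring steps N (hypothetical default).
    """
    # Scaling: a displacement of v / 2**N is small enough that
    # id + v / 2**N approximates the flow over time 1 / 2**N.
    disp = velocity / (2 ** num_steps)
    # Squaring: compose the map with itself N times, phi <- phi o phi.
    for _ in range(num_steps):
        warped = grid + disp  # phi(x) on the grid
        # Displacement of phi o phi: disp(x) + disp(phi(x)),
        # with disp(phi(x)) obtained by linear interpolation.
        disp = disp + np.interp(warped, grid, disp)
    return grid + disp

# Illustrative field that vanishes at both ends of [0, 1].
grid = np.linspace(0.0, 1.0, 201)
velocity = 0.1 * np.sin(np.pi * grid)
phi = integrate_svf_1d(velocity, grid)
```

In 1D, a strictly increasing `phi` is the analogue of a topology-preserving map: no two points cross, which is the property that rules out self-intersections in the surface setting.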
[93] Deep Brain Net: An Optimized Deep Learning Model for Brain tumor Detection in MRI Images Using EfficientNetB0 and ResNet50 with Transfer Learning
Daniel Onah,Ravish Desai
Main category: eess.IV
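TL;DR: The paper proposes Deep Brain Net, a deep learning model that combines the EfficientNetB0 and ResNet50 architectures with transfer learning to detect brain tumors in MRI images, reporting high accuracy and computational efficiency.
Details
Motivation: Existing deep learning models show promise for brain tumor detection but still fall short on accuracy and computational efficiency. This work aims to improve performance by combining efficient architectures with transfer learning.
Contribution: 1. Proposes the Deep Brain Net model, which combines the strengths of EfficientNetB0 and ResNet50; 2. uses transfer learning to improve generalization and reduce training time; 3. validates the model on publicly available datasets.
Method: 1. Uses EfficientNetB0 (depthwise separable convolutions reduce parameters and computational cost) and ResNet50 (residual connections mitigate the vanishing gradient problem); 2. fine-tunes the pre-trained models via transfer learning.
Result: On MRI datasets, the model reaches 88% accuracy, a weighted F1 score of 88.75%, and a macro AUC-ROC score of 98.17%, which the authors report as better than existing methods.
Insight: Combining efficient architectures with transfer learning can improve both the accuracy and the computational efficiency of brain tumor detection, offering a useful aid for clinical diagnosis.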
Abstract: In recent years, deep learning has shown great promise in the automated detection and classification of brain tumors from MRI images. However, achieving high accuracy and computational efficiency remains a challenge. In this research, we propose Deep Brain Net, a novel deep learning system designed to optimize performance in the detection of brain tumors. The model integrates the strengths of two advanced neural network architectures, EfficientNetB0 and ResNet50, combined with transfer learning to improve generalization and reduce training time. The EfficientNetB0 architecture enhances model efficiency by utilizing mobile inverted bottleneck blocks, which incorporate depthwise separable convolutions. This design significantly reduces the number of parameters and computational cost while preserving the model's ability to learn complex feature representations. The ResNet50 architecture, pre-trained on large-scale datasets like ImageNet, is fine-tuned for brain tumor classification. Its use of residual connections allows for training deeper networks by mitigating the vanishing gradient problem and avoiding performance degradation. The integration of these components ensures that the proposed system is both computationally efficient and highly accurate. Extensive experiments performed on publicly available MRI datasets demonstrate that Deep Brain Net consistently outperforms existing state-of-the-art methods in terms of classification accuracy, precision, recall, and computational efficiency. The result is an accuracy of 88 percent, a weighted F1 score of 88.75 percent, and a macro AUC ROC score of 98.17 percent, which demonstrates the robustness and clinical potential of Deep Brain Net in assisting radiologists with brain tumor diagnosis.
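To make the parameter saving from depthwise separable convolutions concrete, the following sketch counts weights for one layer. The kernel size and channel counts are illustrative assumptions, not values taken from the paper or from EfficientNetB0, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution (one filter per input
    channel) followed by a 1 x 1 pointwise convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out

# Hypothetical layer shape, chosen only for illustration.
k, c_in, c_out = 3, 32, 64
std = standard_conv_params(k, c_in, c_out)        # 3*3*32*64 = 18432
sep = depthwise_separable_params(k, c_in, c_out)  # 3*3*32 + 32*64 = 2336
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

For this shape the separable layer needs roughly one eighth of the weights, and the ratio works out to 1/c_out + 1/k^2 in general.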
eess.AS [Back]
[94] Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation
Saierdaer Yusuyin,Te Ma,Hao Huang,Zhijian Ou
Main category: eess.AS
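TL;DR: The paper proposes a pronunciation-lexicon-free approach to phoneme-based crosslingual speech recognition, using the joint stochastic approximation (JSA) algorithm to train speech-to-phoneme (S2P), phoneme-to-grapheme (P2G), and grapheme-to-phoneme (G2P) models, with reported gains in performance.
Details
Motivation: Existing phoneme-based crosslingual speech recognition methods require a pronunciation lexicon, which limits their applicability. This study aims to remove that requirement with a more data-efficient method.
Contribution: Proposes a lexicon-free method for phoneme-based crosslingual speech recognition that jointly trains S2P, P2G, and G2P models with the JSA algorithm, improving performance and data efficiency.
Method: 1. A latent variable model that treats phonemes as discrete latent variables; 2. a G2P model introduced as an auxiliary inference model; 3. joint training of the models with the joint stochastic approximation (JSA) algorithm.
Result: In experiments on Polish and Indonesian, only 10 minutes of phoneme supervision yields a 5% error rate reduction over the best crosslingual fine-tuning approach; in adaptation with cross-domain text-only data, the method achieves a 9% error rate reduction over language model fusion.
Insight: Lexicon-free training is promising for crosslingual speech recognition, and the JSA algorithm performs well for discrete latent variable models.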
Abstract: Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, it is found that in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG outperforms the standard practice of language model fusion via the auxiliary support of the G2P model by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.
cs.GR [Back]
[95] 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds
Fan-Yun Sun,Shengguang Wu,Christian Jacobsen,Thomas Yim,Haoming Zou,Alex Zook,Shangru Li,Yu-Hsin Chou,Ethem Can,Xunlei Wu,Clemens Eppner,Valts Blukis,Jonathan Tremblay,Jiajun Wu,Stan Birchfield,Nick Haber
Main category: cs.GR
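TL;DR: The paper proposes a scalable method, 3D-Generalist, that uses vision-language models (VLMs) fine-tuned through self-improvement to generate high-quality 3D environments as training data for foundation models. It performs well at producing simulation-ready 3D environments and in the quality and scalability of the resulting synthetic data.
Details
Motivation: Large-scale pretraining gives models language and vision reasoning capabilities, but their spatial reasoning remains limited by the lack of data grounded in the 3D world. Manually creating immersive 3D worlds (for VR, gaming, and robotics) is highly labor-intensive, so an automated method is needed.
Contribution: 1. Recasts 3D environment building as a sequential decision-making problem with a VLM as the policy that outputs actions; 2. proposes the 3D-Generalist framework, which generates prompt-aligned, high-quality 3D environments through self-improvement fine-tuning; 3. shows the benefit of the generated synthetic data for pretraining a vision foundation model.
Method: 1. Use a VLM as the policy, outputting actions that jointly craft a 3D environment's layout, materials, lighting, and assets; 2. improve the VLM through self-improvement fine-tuning; 3. pretrain a vision foundation model on the generated data for downstream tasks.
Result: The generated 3D environments are simulation-ready, and after fine-tuning, the model pretrained on them surpasses models pretrained on human-crafted synthetic data and approaches results obtained with much larger real datasets.
Insight: Automatically generated 3D synthetic data can replace or complement human-crafted data, providing an efficient, high-quality data source for pretraining foundation models.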
Abstract: Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning capability remains challenging due to the lack of data grounded in the 3D world. While it is possible for humans to manually create immersive and interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, this process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing Vision-Language-Models (VLMs) as policies that output actions to jointly craft a 3D environment’s layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability in synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pre-trained model on downstream tasks, we show that it surpasses models pre-trained on meticulously human-crafted synthetic data and approaches results achieved with real data orders of magnitude larger.
cs.RO [Back]
[96] Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction
Matteo Tiezzi,Tommaso Apicella,Carlos Cardenas-Perez,Giovanni Fregonese,Stefano Dafarra,Pietro Morerio,Daniele Pucci,Alessio Del Bue
Main category: cs.RO
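TL;DR: The paper proposes NeME, a framework for evaluating the autonomous behaviour of humanoid robots; a deep learning model classifies actions from joint trajectories, enabling policy evaluation without human involvement.
Details
Motivation: Conventional evaluation metrics are hard to reproduce and fail to capture the complexity of robot trajectories, so a new way is needed to measure how imitation learning performs in complex human-robot interaction tasks.
Contribution: Proposes a general evaluation framework built on NeME, a deep learning model that classifies actions, supporting performance comparison of multimodal imitation learning methods.
Method: Designs the deep-learning-based NeME model, which classifies actions from joint trajectories and serves as a meta-evaluator for comparing control policies.
Result: Validated on the ergoCub humanoid robot; the experiments indicate that NeME aligns more closely with the success rate measured on the robot than baselines, and that it is reproducible and systematic.
Insight: NeME offers an automated, scalable way to evaluate policies in complex HRI tasks, reducing the need for human involvement.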
Abstract: Evaluating and comparing the performance of autonomous Humanoid Robots is challenging, as success rate metrics are difficult to reproduce and fail to capture the complexity of robot movement trajectories, critical in Human-Robot Interaction and Collaboration (HRIC). To address these challenges, we propose a general evaluation framework that measures the quality of Imitation Learning (IL) methods by focusing on trajectory performance. We devise the Neural Meta Evaluator (NeME), a deep learning model trained to classify actions from robot joint trajectories. NeME serves as a meta-evaluator to compare the performance of robot control policies, enabling policy evaluation without requiring human involvement in the loop. We validate our framework on ergoCub, a humanoid robot, using teleoperation data and comparing IL methods tailored to the available platform. The experimental results indicate that our method is more aligned with the success rate obtained on the robot than baselines, offering a reproducible, systematic, and insightful means for comparing the performance of multimodal imitation learning approaches in complex HRI tasks.
cs.HC [Back]
[97] Super Kawaii Vocalics: Amplifying the “Cute” Factor in Computer Voice
Yuto Mandai,Katie Seaborn,Tomoyasu Nakano,Xin Sun,Yijia Wang,Jun Kato
Main category: cs.HC
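TL;DR: The paper explores how adjusting fundamental and formant frequencies can amplify the perceived "cuteness" of computer voices, providing a preliminary model and method for the study of kawaii vocalics.
Details
Motivation: Existing research focuses mainly on the visual side of kawaii and neglects voice. This paper aims to fill that gap by studying how vocal characteristics can make computer voices sound cuter.
Contribution: Proposes a preliminary model of kawaii vocalics and a method for manipulating the perceived cuteness of computer voices.
Method: A four-phase study (N=512) examines how adjusting the fundamental and formant frequencies of text-to-speech (TTS) and game character voices affects kawaii perception.
Result: Certain voices reach a kawaii "sweet spot" under specific frequency adjustments, while some voices show a ceiling effect.
Insight: Perceived cuteness depends not only on frequency characteristics but also on the type of voice itself, suggesting a nonlinear relationship.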
Abstract: “Kawaii” is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii “sweet spots” through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.
[98] Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents
Zackary Rackauckas,Julia Hirschberg
Main category: cs.HC
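TL;DR: The study examines how anime-inspired, voiced virtual agents shape user interaction in a multimodal language learning environment, finding that agent design (especially voice, persona, and linguistic style) substantially affects user experience and learning motivation.
Details
Motivation: To explore how culturally and affectively stylized virtual agents can improve user interaction and learning outcomes in language learning environments.
Contribution: Shows how virtual agent design (voice, persona, linguistic style) influences user interaction and learning strategy, and offers guidance for designing more engaging, socially responsive systems.
Method: A mixed-methods evaluation in which 54 participants hold asynchronous, semi-structured conversations with anime-style agents powered by large language models and text-to-speech, followed by analysis of engagement patterns, perceived usability, emotional responses, and learning behaviors.
Result: Stylized agent design, especially voice and linguistic style, substantially affects user experience, learning motivation, and strategy.
Insight: Affective, culturally stylized agents have potential to improve user interaction and learning outcomes, providing a reference for the design of similar systems.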
Abstract: This study investigates how stylized, voiced agents shape user interaction in a multimodal language learning environment. We conducted a mixed-methods evaluation of 54 participants interacting with anime-inspired characters powered by large language models and expressive text-to-speech synthesis. These agents responded in Japanese character language, offering users asynchronous, semi-structured conversation in varying speech styles and emotional tones. We analyzed user engagement patterns, perceived usability, emotional responses, and learning behaviors, with particular attention to how agent stylization influenced interaction across language proficiency levels and cultural backgrounds. Our findings reveal that agent design, especially voice, persona, and linguistic style, substantially affected user experience, motivation, and strategy. This work contributes to the understanding of affective, culturally stylized agents in human-agent interaction and offers guidance for designing more engaging, socially responsive systems.
cs.CR [Back]
[99] The bitter lesson of misuse detection
Hadrien Mariaccia,Charbel-Raphaël Segerie,Diego Dorn
Main category: cs.CR
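TL;DR: The paper introduces the BELLS benchmark for evaluating LLM supervision systems and finds that existing specialized supervisors perform poorly under diverse adversarial attacks, while simple detection with generalist LLMs is more effective.
Details
Motivation: Existing LLM supervision systems lack a comprehensive public benchmark, so their performance under diverse adversarial attacks cannot be assessed.
Contribution: 1. Introduces the BELLS benchmark, covering multiple harm categories and adversarial attacks; 2. exposes the limitations of specialized supervision systems; 3. shows the advantage of generalist LLMs for detection; 4. identifies metacognitive incoherence in LLMs.
Method: Uses the BELLS framework to evaluate supervision systems along two dimensions, harm severity and adversarial sophistication, covering 3 jailbreak families and 11 harm categories.
Result: Specialized supervision systems show low detection rates, while simply prompting generalist LLMs is more effective. Frontier LLMs nonetheless suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and more than 50 percent for Mistral Large).
Insight: General LLM capabilities are essential for detecting diverse misuse; simple scaffolding could improve robustness, but the tradeoffs need further study.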
Abstract: Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two dimensional: harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak) and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when asking a harmful question directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs if the user question is “harmful or not” largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the “bitter lesson” of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.
cs.AI [Back]
[100] Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report
Li Du,Hanyu Zhao,Yiming Ju,Tengfei Pan
Main category: cs.AI
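TL;DR: The paper proposes a systematic instruction data construction framework and the resulting InfinityInstruct-Subject dataset; hierarchical labeling and targeted data generation address the limited coverage and depth of existing instruction datasets and improve instruction-following ability.
Details
Motivation: Although existing instruction datasets have reached tens of millions of samples, models still struggle with complex instructions and tasks in rare domains, mainly because the instruction sets have not expanded enough in coverage and depth.
Contribution: Proposes a systematic instruction data construction framework comprising a hierarchical labeling system, a seed selection algorithm, a data synthesis process, and targeted data generation, and uses it to build the high-quality InfinityInstruct-Subject dataset.
Method: Combines a hierarchical labeling system, an informative seed selection algorithm, evolutionary data synthesis, and model deficiency diagnosis with targeted data generation, forming an iterative closed loop that increases the coverage and depth of instruction data.
Result: Experiments on multiple foundation models and benchmark tasks show that InfinityInstruct-Subject improves instruction-following capabilities, with larger coverage and depth than comparable synthesized instruction datasets.
Insight: Systematically improving the quality and diversity of instruction data, rather than only adding volume, strengthens generalization and the handling of complex tasks more effectively.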
Abstract: Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both “coverage” (coverage of task types and knowledge areas) and “depth” (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
[101] The User-Centric Geo-Experience: An LLM-Powered Framework for Enhanced Planning, Navigation, and Dynamic Adaptation
Jieren Deng,Aleksandar Cvetkovic,Pak Kiu Chung,Dragomir Yankov,Chiqun Zhang
Main category: cs.AI
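TL;DR: The paper proposes an LLM-powered, user-centric geo-experience framework in which three cooperative agents (travel planning, destination assistance, and local discovery) address the static, fragmented nature of traditional travel-planning systems, improving query interpretation, navigation accuracy, and adaptability.
Details
Motivation: Traditional travel-planning systems cope poorly with real-world complexity such as changing conditions and itinerary disruptions, which leads to a poor user experience. The paper aims to close gaps in intelligent trip planning, precise navigation, and dynamic adaptation.
Contribution: 1) Proposes three cooperative agents (Travel Planning Agent, Destination Assistant Agent, Local Discovery Agent); 2) combines grid-based spatial grounding, image embeddings, and RAG to improve flexibility and accuracy.
Method: 1) The Travel Planning Agent uses grid-based spatial grounding and map analysis; 2) the Destination Assistant Agent provides fine-grained navigation guidance; 3) the Local Discovery Agent uses image embeddings and RAG to detect and respond to trip disruptions.
Result: Experiments show strong results in query interpretation, navigation accuracy, and disruption adaptation, with applications ranging from urban exploration to emergency response.
Insight: LLMs combined with multi-agent cooperation enable dynamic, user-centric design and suggest a direction for future intelligent geo-services.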
Abstract: Traditional travel-planning systems are often static and fragmented, leaving them ill-equipped to handle real-world complexities such as evolving environmental conditions and unexpected itinerary disruptions. In this paper, we identify three gaps between existing service providers causing frustrating user experience: intelligent trip planning, precision “last-100-meter” navigation, and dynamic itinerary adaptation. We propose three cooperative agents: a Travel Planning Agent that employs grid-based spatial grounding and map analysis to help resolve complex multi-modal user queries; a Destination Assistant Agent that provides fine-grained guidance for the final navigation leg of each journey; and a Local Discovery Agent that leverages image embeddings and Retrieval-Augmented Generation (RAG) to detect and respond to trip plan disruptions. With evaluations and experiments, our system demonstrates substantial improvements in query interpretation, navigation accuracy, and disruption resilience, underscoring its promise for applications from urban exploration to emergency response.
cs.LG [Back]
[102] Can Interpretation Predict Behavior on Unseen Data?
Victoria R. Li,Jenny Kaufmann,Martin Wattenberg,David Alvarez-Melis,Naomi Saphra
Main category: cs.LG
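TL;DR: The paper asks whether interpretability can predict model behavior on unseen data, and shows experimentally that simple interpretability tools can predict out-of-distribution (OOD) performance.
Details
Motivation: To test whether interpretability can predict how a model responds to unseen input data, rather than only the effect of targeted interventions.
Contribution: Demonstrates that, on a specific task, observing attention patterns can predict how a model generalizes on OOD data.
Method: Independently trains hundreds of Transformer models on a synthetic classification task and analyzes the correlation between their attention patterns and OOD generalization.
Result: When a model's in-distribution attention exhibits hierarchical patterns, the model tends to generalize hierarchically on OOD data as well.
Insight: The findings offer a proof of concept for interpretability as a tool for predicting unseen model behavior, motivating further work.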
Abstract: Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD data – even when the rule’s implementation does not rely on these hierarchical patterns, according to ablation tests. Our findings offer a proof-of-concept to motivate further interpretability work on predicting unseen model behavior.
[103] Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
Jing Liang,Hongyao Tang,Yi Ma,Jinyi Liu,Yan Zheng,Shuyue Hu,Lei Bai,Jianye Hao
Main category: cs.LG
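TL;DR: The paper proposes ReMix, an efficient off-policy reinforcement finetuning method for large language models (LLMs) that sharply reduces training cost while improving reasoning ability.
Details
Motivation: Most existing reinforcement finetuning (RFT) methods are on-policy RL and do not fully use data generated earlier in training, which makes them costly in compute and time. Off-policy RL can improve efficiency and scalability.
Contribution: Proposes ReMix, which lets on-policy RFT methods such as PPO and GRPO leverage off-policy data. It has three components: mix-policy proximal policy gradient, a KL-Convex policy constraint, and policy reincarnation.
Method: ReMix raises the Update-To-Data (UTD) ratio through mix-policy proximal policy gradient, balances stability and flexibility with the KL-Convex policy constraint, and uses policy reincarnation to move from efficient early-stage learning to steady asymptotic improvement.
Result: Across five math reasoning benchmarks, ReMix reaches an average Pass@1 accuracy of 52.10% with the 1.5B model and 63.27%/64.39% with the 7B model. Compared with 15 recent advanced models, it attains SOTA-level performance with a 30x to 450x reduction in training cost, measured in rollout data volume.
Insight: Reveals an implicit preference for shorter responses caused by the Whipping Effect of off-policy discrepancy, and a collapse mode of self-reflection behavior under severe off-policyness.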
Abstract: Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME’24, AMC’23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.
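As background for why proximal methods can reuse rollouts from an older policy, here is a minimal sketch of the standard PPO clipped surrogate with importance ratios. It is not the ReMix objective; the mix-policy gradient, the KL-Convex constraint, and policy reincarnation are not reproduced, and the function name and inputs are assumptions for illustration.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    logp_new: log-probabilities of the sampled actions under the policy
        being updated.
    logp_behavior: log-probabilities under the policy that generated the
        data (possibly an older one).
    """
    # Importance ratio corrects for the mismatch between the two policies.
    ratio = np.exp(logp_new - logp_behavior)
    unclipped = ratio * advantages
    # Clipping bounds how far a single update can move from the data policy.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# When both policies agree, the objective is the mean advantage.
adv = np.array([1.0, -0.5, 2.0])
logp = np.log(np.array([0.2, 0.5, 0.3]))
on_policy_value = clipped_surrogate(logp, logp, adv)
```

The farther the data drifts from the current policy, the more samples fall into the clipped region; this is one reason reusing old rollouts at a high update-to-data ratio needs extra care.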
[104] Denoising Multi-Beta VAE: Representation Learning for Disentanglement and Generation
Anshuk Uppal,Yuhta Takida,Chieh-Hsin Lai,Yuki Mitsufuji
Main category: cs.LG
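TL;DR: The paper proposes Denoising Multi-Beta VAE, a framework that learns distinct latent representations for multiple β values to address the trade-off between disentanglement and generation quality, and uses a non-linear diffusion model to transition smoothly between them.
Details
Motivation: The standard β-VAE framework trades disentanglement against generation quality: raising β improves disentanglement at the expense of generation quality. The paper aims to resolve this by learning latent representations for several β values, combining disentanglement with high-quality generation.
Contribution: 1. Proposes a generative modeling framework that uses multiple β values to learn latent representations spanning disentanglement and high-fidelity reconstruction; 2. uses a non-linear diffusion model to transition smoothly between the latent representations; 3. demonstrates sample generation without input images, so the model can act as a standalone generative model.
Method: 1. A new loss function that lets a single VAE training run produce latent representations for multiple β values; 2. a non-linear diffusion model that transitions smoothly from highly disentangled representations to high-fidelity ones; 3. support for consistent outputs under latent-space manipulation.
Result: Experiments show improvements in both disentanglement and generation quality, with smooth latent-space transitions and consistent, controllable generation.
Insight: Varying β makes it possible to balance disentanglement and generation quality, suggesting a new approach to latent-space design in generative models.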
Abstract: Disentangled and interpretable latent representations in generative models typically come at the cost of generation quality. The $\beta$-VAE framework introduces a hyperparameter $\beta$ to balance disentanglement and reconstruction quality, where setting $\beta > 1$ introduces an information bottleneck that favors disentanglement over sharp, accurate reconstructions. To address this trade-off, we propose a novel generative modeling framework that leverages a range of $\beta$ values to learn multiple corresponding latent representations. First, we obtain a slew of representations by training a single variational autoencoder (VAE), with a new loss function that controls the information retained in each latent representation such that higher $\beta$ values prioritize disentanglement over reconstruction fidelity. We then introduce a non-linear diffusion model that smoothly transitions between latent representations corresponding to different $\beta$ values. This model denoises towards less disentangled and more informative representations, ultimately leading to (almost) lossless representations, enabling sharp reconstructions. Furthermore, our model supports sample generation without input images, functioning as a standalone generative model. We evaluate our framework in terms of both disentanglement and generation quality. Additionally, we observe smooth transitions in the latent spaces with respect to changes in $\beta$, facilitating consistent manipulation of generated outputs.
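The trade-off that $\beta$ controls can be seen in a minimal sketch of the standard $\beta$-VAE objective, evaluated for several $\beta$ values. This is the textbook objective with a squared-error reconstruction term and a standard normal prior, not the paper's new multi-$\beta$ loss; the array shapes and example values are assumptions.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), per sample."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def beta_vae_loss(x, x_recon, mu, log_var, beta):
    """Squared-error reconstruction plus beta-weighted KL, batch mean."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return np.mean(recon + beta * gaussian_kl(mu, log_var))

# Illustrative batch: 4 samples, 8 features, 2 latent dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x_recon = x + 0.1                      # stand-in for a decoder output
mu = rng.normal(size=(4, 2))           # stand-in for encoder means
log_var = np.zeros((4, 2))             # unit posterior variance
losses = {b: beta_vae_loss(x, x_recon, mu, log_var, b) for b in (1.0, 4.0, 16.0)}
```

With the encoder outputs held fixed, a larger $\beta$ puts more weight on the KL term, which pushes the posterior toward the prior and tightens the information bottleneck described above.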
[105] A Principled Framework for Multi-View Contrastive Learning
Panagiotis Koromilas,Efthymios Georgiou,Giorgos Bouritsas,Theodoros Giannakopoulos,Mihalis A. Nicolaou,Yannis Panagakis
Main category: cs.LG
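TL;DR: Addressing the limitations of existing multi-view contrastive learning methods, the paper proposes two new loss functions, MV-InfoNCE and MV-DHEL, and shows through theory and experiments that they perform better on multi-view and multimodal data.
Details
Motivation: Current multi-view contrastive learning methods have several limitations (such as conflicting objectives and incomplete modeling of cross-view interactions) and cannot fully exploit additional views. The paper aims to resolve these issues with a more principled framework.
Contribution: 1. Proposes two new loss functions, MV-InfoNCE and MV-DHEL, that extend standard contrastive learning; 2. proves that the losses asymptotically optimize multi-view alignment and uniformity; 3. shows experimentally that they outperform existing approaches on multi-view and multimodal data.
Method: 1. MV-InfoNCE models all cross-view interactions simultaneously in a single optimization term; 2. MV-DHEL decouples the alignment and uniformity objectives and scales interaction complexity with the number of views. Experiments use ImageNet1K and other datasets.
Result: The proposed methods outperform existing methods on multi-view and multimodal data and, with five or more views, MV-DHEL mitigates dimensionality collapse.
Insight: 1. Better-designed loss functions make learning from multiple views more effective; 2. decoupling alignment from uniformity is key to the performance gain; 3. multi-view contrastive learning extends to multimodal settings.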
Abstract: Contrastive Learning (CL), a leading paradigm in Self-Supervised Learning (SSL), typically relies on pairs of data views generated through augmentation. While multiple augmentations per instance (more than two) improve generalization in supervised learning, current CL methods handle additional views suboptimally by simply aggregating different pairwise objectives. This approach suffers from four critical limitations: (L1) it utilizes multiple optimization terms per data point, resulting in conflicting objectives, (L2) it fails to model all interactions across views and data points, (L3) it inherits fundamental limitations (e.g. alignment-uniformity coupling) from pairwise CL losses, and (L4) it prevents fully realizing the benefits of increased view multiplicity observed in supervised settings. We address these limitations through two novel loss functions: MV-InfoNCE, which extends InfoNCE to incorporate all possible view interactions simultaneously in one term per data point, and MV-DHEL, which decouples alignment from uniformity across views while scaling interaction complexity with view multiplicity. Both approaches are theoretically grounded: we prove they asymptotically optimize for alignment of all views and uniformity, providing principled extensions to multi-view contrastive learning. Our empirical results on ImageNet1K and three other datasets demonstrate that our methods consistently outperform existing multi-view approaches and effectively scale with increasing view multiplicity. We also apply our objectives to multimodal data and show that, in contrast to other contrastive objectives, they can scale beyond just two modalities. Most significantly, ablation studies reveal that MV-DHEL with five or more views effectively mitigates dimensionality collapse by fully utilizing the embedding space, thereby delivering multi-view benefits observed in supervised learning.
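The baseline this abstract criticizes, aggregating pairwise objectives over views, can be sketched as follows. This illustrates only the pairwise-InfoNCE aggregation, not MV-InfoNCE or MV-DHEL; the function names, the temperature, and the one-directional form of InfoNCE used here are assumptions for illustration.

```python
import itertools
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """One-directional InfoNCE between two views of the same batch.

    Row i of z_a and row i of z_b are a positive pair; all other rows of
    z_b act as negatives for row i of z_a.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def pairwise_multiview_loss(views, temperature=0.1):
    """Average a separate InfoNCE term over every unordered pair of views."""
    pairs = list(itertools.combinations(range(len(views)), 2))
    total = sum(info_nce(views[i], views[j], temperature) for i, j in pairs)
    return total / len(pairs)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = pairwise_multiview_loss([z, z, z])                # identical views
mismatched = pairwise_multiview_loss(
    [rng.normal(size=(8, 16)) for _ in range(3)])           # unrelated views
```

With V views this produces V(V-1)/2 separate terms per data point, which is the "multiple optimization terms" structure that limitation (L1) refers to.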