Awards

Best Paper Award
presented to
Wenhui Tan, Bei Liu, Junbo Zhang, Ruihua Song, and Jianlong Fu
RoLD: Robot Latent Diffusion for Multi-task Policy Modeling
Best Student Paper Award
presented to
Yizhou Li, Zihua Liu, Yusuke Monno, and Masatoshi Okutomi
TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration
Best Demonstration Award
presented to
Hung-Yao Peng, Zi-Heng Zhong, Cheng-Chih Tsai, Ching-Yeh Chiang, and Tse-Yu Pan
FencBuddy: Action-aware Depth Perception Training for Fencing Attacks
Best Demo Honorable Mention Award
presented to
Yasutomo Kawanishi, Yutaka Nakamura, Taiken Shintani, Carlos T. Ishi, Seiya Kawano, Koichiro Yoshino, Takashi Minato, and Michihiko Minoh
RoboDJ: Live Commentary Robots System Driven by Physical- and Cyber-world Observations
MMM Community Leadership Award
awarded to
Noboru Babaguchi
MMM Community Leadership Award
awarded to
Kiyoharu Aizawa
Best Overall VBS System
presented to
Bao Tran Gia, Tuong Bui Cong Khanh, Tam Le Thi Thanh, Thuyen Tran Doan, Khiem Le, Tien Do, Tien-Dung Mai, Thanh Duc Ngo, Duy-Dinh Le, and Shin’ichi Satoh
NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search
Best Expert VBS System
presented to
Bao Tran Gia, Tuong Bui Cong Khanh, Tam Le Thi Thanh, Thuyen Tran Doan, Khiem Le, Tien Do, Tien-Dung Mai, Thanh Duc Ngo, Duy-Dinh Le, and Shin’ichi Satoh
NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search
Best Novice VBS System
presented to
Thang-Long Nguyen-Ho, Viet-Tham Huynh, Onanong Kongmeesub, Minh-Triet Tran, Dongyun Nie, Graham Healy, and Cathal Gurrin
VEAGLE: Eye Gaze-Assisted Guidance for Video Browser Showdown

Program Booklet

Keynote Talks

Schedule: Jan 8, 9:45 – 10:45. Chair: Chong-Wah NGO

Multimodal, Multilingual Generative AI: From Multicultural Contextualization to Empathetic Reasoning
Dr. Nancy F. Chen

We will present MeraLion (Multimodal Empathetic Reasoning and Learning In One Network), our generative AI effort within Singapore’s National Multimodal Large Language Model Programme. Speech and audio provide a more comprehensive understanding of spatial and temporal reasoning, as well as of social dynamics, than semantics derived from text alone. Cultural nuances and multilingual peculiarities add another layer of complexity to understanding human interactions. In addition, we will draw on use cases in education to highlight research endeavors, technology deployment experience, and application opportunities.
Biography: Dr. Nancy F. Chen is an A*STAR fellow who leads the Multimodal Generative AI group, heads the Artificial Intelligence for Education (AI4EDU) programme at I2R (Institute for Infocomm Research), and is a principal investigator at CFAR (Centre for Frontier AI Research), A*STAR. Dr. Chen’s recent work on large language models has won honors at ACL 2024, including an Area Chair Award and the Best Paper Award for Cross-Cultural Considerations in Natural Language Processing. Dr. Chen consistently garners best paper awards for her AI research across diverse applications; examples include IEEE ICASSP 2011 (forensics), APSIPA 2016 (education), SIGDIAL 2021 (social media), MICCAI 2021 (neuroscience), and EMNLP 2023 (healthcare). Multilingual spoken-language technology from her team has led to commercial spin-offs and has been deployed at Singapore’s Ministry of Education to support home-based learning. Dr. Chen has supervised 100+ students and staff. She has won professional awards from the U.S. National Institutes of Health, IEEE, Microsoft, P&G, UNESCO, and L’Oréal. She serves as Program Chair of NeurIPS 2025 and as a member of the APSIPA Board of Governors (2024-2026), previously served as an IEEE SPS Distinguished Lecturer (2023-2024), Program Chair of ICLR 2023, and Board Member of ISCA (2021-2024), and was honoured in the Singapore 100 Women in Tech list (2021). Prior to A*STAR, she worked at MIT Lincoln Laboratory while pursuing her PhD at MIT and Harvard. For more info: http://alum.mit.edu/www/nancychen.
Schedule: Jan 9, 16:00 – 17:00. Chair: Keiji YANAI

Manga109 and MangaUB: How Far Can Large Multimodal Models (LMMs) Go in Understanding Manga?
Prof. Kiyoharu Aizawa

Manga is a form of Japanese content that has gained global recognition, and it is a unique multimedia format that combines images and text. We created a dataset called Manga109, composed of 109 manga comic books. In 2015, we released a version containing approximately 20,000 manga pages, and in 2018, we published an extended version with annotations for more than 500,000 objects, including characters and speech balloons on each page. It is the largest manga dataset in the world with such detailed manual annotations. Manga109 is available for academic use, and we have distributed over 2,000 copies of the dataset to date. Various research efforts, both domestic and international, have built on this dataset; for example, different groups have tackled tasks such as character recognition, expression recognition, dialogue recognition, speaker identification, and onomatopoeia recognition. In this talk, I will trace the journey of Manga109 from its beginning to the present, and introduce MangaUB, a benchmark for the rapidly advancing large multimodal models (LMMs), to assess the current state of LMMs’ manga comprehension.
Biography: Prof. Kiyoharu Aizawa received the B.E., M.E., and Dr. Eng. degrees in Electrical Engineering from the University of Tokyo in 1983, 1985, and 1988, respectively. He is a professor in the Department of Information and Communication Engineering and Director of the VR Center at the University of Tokyo. He was a visiting assistant professor at the University of Illinois from 1990 to 1992. His research fields are multimedia, image processing, and computer vision, with a particular interest in interdisciplinary and cross-disciplinary issues. He received the 1990 and 1998 Best Paper Awards, the 1991 Achievement Award, and the 1999 Electronics Society Award from IEICE Japan; the 1998 Fujio Frontier Award, the 2002 and 2009 Best Paper Awards, and the 2013 and 2021 Achievement Awards from ITE Japan; and the IBM Japan Science Prize in 2002. He is on the Editorial Board of ACM TOMM. He served as Editor-in-Chief of the Journal of ITE Japan and as an Associate Editor of IEEE TIP, TCSVT, TMM, and MultiMedia. He has also played key roles in numerous international and domestic conferences, serving as General Co-Chair of MMM 2008, ACM Multimedia 2012, and ACM ICMR 2018. He is a Fellow of IEEE, IEICE, and ITE, and a member of the Science Council of Japan.
Schedule: Jan 10, 9:15 – 10:15. Chair: Ichiro IDE

Multi-modal foundation models in the automotive industry
Dr. Andrei Bursuc

The tremendous progress of deep-learning-based approaches to image understanding has inspired new advanced perception functionalities for autonomous systems. However, real-world perception systems often require models that can learn from large volumes of unlabeled and uncurated data with few labeled samples, which are usually costly to select and annotate. In contrast, typical supervised methods require extensive collections of carefully selected labeled data, a condition that is seldom fulfilled in practical applications. Self-supervised learning (SSL) arises as a promising line of research to mitigate this gap by training foundation models using various supervision signals extracted from the data itself, without any human-generated labels. While most popular SSL methods revolve around web image datasets, new and diverse forms of self-supervision are starting to be investigated for autonomous driving (AD). AD represents a unique sandbox for SSL methods, as it offers some of the largest public data collections in the community, with different paired sensors (multiple cameras, Lidar, radar, ultrasonics), and provides some of the most challenging computer vision tasks: object detection, depth estimation, image-based odometry and localization, etc. Here, the canonical SSL pipeline (i.e., self-supervised pre-training of a model followed by fine-tuning on a downstream task) is revisited and extended to entirely new SSL approaches for computer vision and robotics (e.g., world models), as well as to new downstream usages of pre-trained foundation models, such as cross-sensor distillation, auto-labelling, data mining, and architecture re-purposing. This talk will provide a tour of different forms of foundation models across the multiple sensor types equipping today's and tomorrow's vehicles, in a quest towards annotation-efficient and reliable perception systems.
Biography: Dr. Andrei Bursuc is a senior research scientist and deputy scientific director at valeo.ai, and a research associate in the Astra Inria project team in Paris, working on perception for assisted and autonomous driving. His research interests concern the reliability of deep neural networks, learning with limited supervision, and multi-modal multi-sensor perception. Andrei also teaches at Ecole Polytechnique and at Ecole Normale Supérieure in Paris. Previously, he was a research scientist at Safran Tech in the aerospace industry. Prior to that, he was a postdoctoral researcher at Inria Paris, within the Willow project team working with Josef Sivic and Ivan Laptev, and at Inria Rennes with Hervé Jégou. He did his PhD at Ecole des Mines Paris and Alcatel-Lucent Bell Labs France with Francoise Preteux and Titus Zaharia, on visual content indexing and retrieval. Andrei is a member of the ELLIS society and is regularly part of the technical program committees of CVPR, ICCV, ECCV, and NeurIPS. He co-organized the CVPR’20-’21 and ECCV’22 tutorials on self-supervised learning, and the ICCV’23 and ECCV’24 tutorials on reliability and uncertainty estimation.

Oral Sessions

Day 1: 8 January

11:00 – 12:00
Best Paper Session
Chair: Toshihiko Yamasaki (The University of Tokyo)
Paper ID Paper Title Authors
196 RoLD: Robot Latent Diffusion for Multi-task Policy Modeling Tan, Wenhui; Liu, Bei; Zhang, Junbo; Song, Ruihua; Fu, Jianlong
379 TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration Li, Yizhou; Liu, Zihua; Monno, Yusuke; Okutomi, Masatoshi
451 ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing Zhang, Zhihui; Pang, Jinhui; Li, Jianan; Hao, Xiaoshuai
462 Flat Local Minima for Continual learning on Semantic Segmentation Huang, Zhongzhan; Liang, Mingfu; Liang, Senwei; Zhong, Shanshan
15:30 – 16:30
Oral Session 1: Content Generation
Chair: Luwei Zhang (The University of Tokyo)
Paper ID Paper Title Authors
268 AD2AT: Audio Description to Alternative Text, a Dataset of Alternative Text from Movies Lincker, Elise; Guinaudeau, Camille; Satoh, Shin’ichi
310 KuzushijiDiffuser: Japanese Kuzushiji Font Generation with FontDiffuser Yuan, Honghui; Yanai, Keiji
167 Saliency Guided Optimization Of Diffusion Latents Wang, Xiwen; Zhou, Jizhe; Li, Mao; Zhu, Xuekang; Li, Cheng
308 Skin-Adapter: Fine-Grained Skin-Color Preservation for Text-to-Image Generation Chen, Zhuowei; Huang, Mengqi; Chen, Nan; Mao, Zhendong
16:45 – 17:45
Oral Session 2: Audio Analysis
Chair: Ling Xiao (The University of Tokyo)
Paper ID Paper Title Authors
273 Operatic Singing Voice Synthesis From Inexperienced Voice Considering Tempo and Vowel Change Sugahara, Aoto; Kishimoto, Soma; Adachi, Yuji; Tai, Kiyoto; Takashima, Ryoichi; Takiguchi, Tetsuya
129 Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation Lv, Yishan; Luo, Jing; Ju, Boyuan; Yang, Xinyu
430 WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition Li, Feng; Luo, Jiusong; Xia, Wanjun
374 SPLGAN-TTS: Learning Semantic and Prosody to Enhance the Text-to-Speech Quality of Lightweight GAN Models Chang, Ding-Chi; Li, Shiou-Chi; Huang, Jen-Wei

Day 2: 9 January

9:30 – 10:30
Oral Session 3: Object Detection, Recognition, and Tracking
Chair: Wei-Ta Chu (National Cheng Kung University)
Paper ID Paper Title Authors
236 MineTinyNet-YOLO: An Efficient Small Object Detection Method for Complex Underground Coal Mine Scenarios Yaling, Hao; Wei, Wu
436 Mix-YOLONet: Deep Image Dehazing for Improving Object Detection Lim, Xin; Wong, Lai-Kuan; Loh, Yuen Peng; Gu, Ke; Lin, Weisi
411 Counting Unique Objects in Geo-Tagged Street Images: A Case Study Of Homeless Encampments in Los Angeles Ghasemi, Narges; Kim, Seon Ho; Alfarrarjeh, Abdullah; Shahabi, Cyrus
181 HCV: Lightweight Hybrid CNN-Vision Transformer for Visual Object Tracking Chen, Liang-Chia; Chu, Wei-Ta
10:45 – 11:30
Oral Session 4: Trusted and Explainable AI
Chair: Kazuaki Nakamura (Tokyo University of Science)
Paper ID Paper Title Authors
174 Detoxification of Unlabeled Dataset: Reducing Implicit Class Imbalance Using Pseudo-Jacobian of GAN’s Generator Suyama, Kosei; Nakamura, Kazuaki
244 Making Strides in Security in Multimodal Fake News Detection Models: A Comprehensive Analysis of Adversarial Attacks Si, Jiahua; Wang, Youze; Hu, Wenbo; Liu, Qiang; Hong, Richang
415 AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection Xu, Xiaoman; Li, Xiangrun; Wang, Taihang; Jiang, Ye
15:00 – 15:45
Oral Session 5: Signal Processing
Chair: Masahiro Toyoura (University of Yamanashi)
Paper ID Paper Title Authors
297 Uncertainty-guided Joint Semi-supervised Segmentation and Registration of Cardiac Images Chen, Junjian; Yang, Xuan
337 Wavelet Integrated Convolutional Neural Network for ECG Signal Denoising Terada, Takamasa; Toyoura, Masahiro
392 MPPQNet: A Moment-Preserving Product Quantization Neural Network for Progressive 3D Point Cloud Transmission Cheng, Shyi-Chyi; Chen, Yen-Lin; Li, Shih-Yu

Day 3: 10 January

10:30 – 11:30
Oral Session 6: Recognition and Reasoning
Chair: Satoshi Yamasaki (NEC)
Paper ID Paper Title Authors
218 A Multi-Expert Collaborative Framework for Multimodal Named Entity Recognition Xu, Bo; Jiang, Haiqi; Wei, Shouang; Du, Ming; Song, Hui; Wang, Hongya
266 SSDL: Sensor-to-Skeleton Diffusion Model with Lipschitz Regularization for Human Activity Recognition Sharma, Nikhil; Sun, Changchang; Zhao, Zhenghao; Ngu, Anne Hee Hiong; Latapie, Hugo; Yan, Yan
395 Open-vocabulary Scene Graph Generation via Synonym-based Predicate Descriptor Goto, Yuta; Yamazaki, Satoshi; Shibata, Takashi; Liu, Jianquan
274 Grounding Deliberate Reasoning in Multimodal Large Language Models Chen, Jiaxing; Liu, Yuxuan; Li, Dehu; An, Xiang; Deng, Weimo; Feng, Ziyong; Zhao, Yongle; Xie, Yin
15:00 – 16:00
Special Session: MLLMA
Chair: Rajiv Ratn Shah (IIIT-Delhi)
Paper ID Paper Title Authors
193 Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models Huang, Jia-Hong; Zhu, Hongyi; Shen, Yixian; Rudinac, Stevan; Kanoulas, Evangelos
288 Enhanced Anomaly Detection in 3D Motion through Language-Inspired Occlusion-Aware Modeling Li, Su; Wang, Liang; Wang, Jianye; Zhang, Ziheng; Zhang, Junjun; Zhang, Lei
364 Evaluating VQA Models' Consistency in the Scientific Domain C. Quan, Khanh-An; Guinaudeau, Camille; Satoh, Shin’ichi
Panel Discussion
16:15 – 17:00
Oral Session 7: Search and Retrieval
Chair: Nicolas Michel (The University of Tokyo)
Paper ID Paper Title Authors
346 RobSparse: Automatic Search for GPU-Friendly Robust and Sparse Vision Transformers Su, Yulan; Zhang, Sisi; Wang, Yan; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui
232 Image-Generation AI Model Retrieval by Contrastive Learning-based Style Distance Calculation Vu, Thi Ngoc Anh; Shoji, Yoshiyuki; Oe, Yuma; Pham, Huu Long; Ohshima, Hiroaki
414 Dynamic Exploration Graph: A Novel Approach for Efficient Nearest Neighbor Search in Evolving Multimedia Datasets Hezel, Nico; Barthel, Kai Uwe; Schilling, Bruno; Schall, Konstantin; Jung, Klaus

Poster Sessions

To Presenters

  • Please set up your poster after 1:00 PM and before your poster session starts.

Day 1: 8 January 14:00 – 15:30

Poster ID Paper ID Paper Title Authors
PS1-1 120 Quantized-ViT Efficient Training via Fisher Matrix Regularization Shang, Yuzhang; Liu, Gaowen; Kompella, Ramana; Yan, Yan
PS1-2 121 Saliency based data augmentation for few-shot video action recognition Kong, Yongqiang; Wang, Yunhong; Li, Annan
PS1-3 128 Hybrid Scalable Video Coding with Neural Compression and Enhancement for Streaming Media Ye, Yuyao; Yang, Jiayu; Zhao, Yang; Gao, Mengping; Cao, Hongbin; Wang, Ronggang
PS1-4 130 Pubic Symphysis-Fetal Head Segmentation Network Using BiFormer Attention Mechanism and Multipath Dilated Convolution Cai, Pengzhou; Jiang, Lu; Li, Yanxin; Liu, Xiaojuan; Lan, Libin
PS1-5 131 DART: Depth-Enhanced Accurate and Real-Time Background Matting Li, Guofeng; Li, Hanxi; Li, Bo; Wu, Lin; Cheng, Yan
PS1-6 141 MLP-AMDC: A MLP Architecture for Adaptive-Mask-based Dual-Camera snapshot hyperspectral imaging Cai, Zeyu; Chen, Xunhao; Zhang, Can; Chen, Yuchong; Yang, Jiming; Shi, Wubin; Jin, Chengqian; Da, Feipeng
PS1-7 144 Kiite World: Socializing Map-Based Music Exploration Through Playlist Sharing and Synchronized Listening Tsukuda, Kosetsu; Takahashi, Takumi; Ishida, Keisuke; Hamasaki, Masahiro; Goto, Masataka
PS1-8 146 Enhancing Environmental Monitoring through Multispectral Imaging: The WasteMS Dataset for Semantic Segmentation of Lakeside Waste Zhu, Qinfeng; Weng, Ningxin; Fan, Lei; Cai, Yuanzhi
PS1-9 158 Frequency-aware Convolution for Sound Event Detection Song, Tao; Zhang, Wenwen
PS1-10 163 MSD-YOLO: An efficient algorithm for small target detection Liu, Dongyu; Zhu, Yuan; Liu, Rui; Xing, Zhecong; Geng, Weiyang; Wang, Yanqiang
PS1-11 166 Robust Active Speaker Detection in Challenging Environments Using GNN-Fused Multi-Modal Cues and Body Language Li, Yongqian; Luo, Yong; Zhou, Xin
PS1-12 172 Intra-Class Compact Facial Expression Recognition Based on Amplitude Phase Separation Tian, Xiang; Zhang, Yuan; Mu, Chang; Zhang, Ziyang
PS1-13 176 PA2Net: Pyramid Attention Aggregation Network for Saliency detection Yu, Jizhe; Liu, Yu; Wu, Xiaoshuai; Xu, Kaiping; Li, Jiangquan
PS1-14 188 LIESA: Low-light Image Enhancement with Semantic Awareness Zhang, Jingyao; Hao, Shijie; Sun, Fuming; Rao, Yuan
PS1-15 195 Deep Dual Internal Learning for Hyperspectral Image Super-Resolution Sun, Yongqing; Liu, Hong; Chang, Qiong; Han, Xianhua
PS1-16 198 Zero-shot sketch-based image retrieval with hybrid information fusion and sample relationship modeling Wu, Weijie; Li, Jun; Wu, Zhijian; Xu, Jianhua
PS1-17 206 The Right to an Explanation under the GDPR and the AI Act Juliussen, Bjørn Aslak
PS1-18 221 Improving singing voice transcription generalization with AI generated accompaniments Perez, Miguel; Kirchhoff, Holger; Grosche, Peter; Serra, Xavier
PS1-19 228 LITA: LMM-guided Image-Text Alignment for Art Assessment Sunada, Tatsumi; Shiohara, Kaede; Xiao, Ling; Yamasaki, Toshihiko
PS1-20 229 Towards Inclusive Education: Multimodal Classification of Textbook Images for Accessibility Yadav, Saumya; Lincker, Élise; Huron, Caroline; Martin, Stéphanie; Guinaudeau, Camille; Satoh, Shin’ichi; Shukla, Jainendra
PS1-21 296 GWUNet: A UNet with Gated Attention and Improved Wavelet Transform for Thyroid Nodules Segmentation Zheng, Shuijing; Yu, Suxi; Wang, Yi; Wen, Jing
PS1-22 111 SCLSTE: Semi-Supervised Contrastive Learning-Guided Scene Text Editing Yin, Min; Xie, Liang; Liang, HaoRan; Zhao, Xing; Chen, Ben; Liang, RongHua

Day 2: 9 January 13:30 – 15:00

Poster ID Paper ID Paper Title Authors
PS2-1 192 Comparative Analysis of Relevance Feedback Techniques for Image Retrieval Vadicamo, Lucia; Scotti, Francesca; Dearle, Alan; Connor, Richard
PS2-2 241 Understanding the Roles of Visual Modality in Multimodal Dialogue: An Empirical Study Cao, Qian; Song, Ruihua; Chen, Xu
PS2-3 242 DistillSleep: Leverage Self-Distillation to Improve Performance After Representation Learning for Sleep Staging Yu, Le; Zhang, Xianchao; Qian, Shuxia; Sun, Hong
PS2-4 246 Temporal Closeness for Enhanced Cross-Modal Retrieval of Sensor and Image Data Yamamoto, Shuhei; Kando, Noriko
PS2-5 247 An Analytical Method for Rendering Plenoptic Cameras 2.0 on 3D Multi-Layer Displays Losfeld, Armand; Seznec, Nicolas; Van Bogaert, Laurie; Lafruit, Gauthier; Teratani, Mehrdad
PS2-6 251 QRALadder: QoE and Resource Consumption-Aware Encoding Ladder Optimization for Live Video Streaming Zhu, Yingqian; Gao, Guanyu
PS2-7 256 Boosting Human Pose Estimation via Heatmap Refinement Jiang, Ling; Liu, Zhuocheng; Li, Kaige; Wu, Wei
PS2-8 265 FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation Imajuku, Yuki; Yamakata, Yoko; Aizawa, Kiyoharu
PS2-9 283 LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets Wang, Qing; Ngo, Chong Wah; Lim, Ee-Peng; Sun, Qianru
PS2-10 292 Music2MIDI: Pop Music to MIDI Piano Cover Generation Yip, Tin Yui; Chau, Chuck-jee
PS2-11 293 Balancing Efficiency and Accuracy: An Analysis of Sampling for Video Copy Detection Chen, Xiangyu; Satoh, Shinichi
PS2-12 295 One-Shot Generative Domain Adaptation by Constructing Self-Amplifying Datasets Xiang, Yanru; Li, Yi
PS2-13 306 Visual Anomaly Detection on Topological Connectivity under Improved YOLOv8 Li, Yu; Xie, Zhenping
PS2-14 315 HierArtEx: Hierarchical Representations and Art Experts Supporting the Retrieval of Museums in the Metaverse Falcon, Alex; Abdari, Ali; Serra, Giuseppe
PS2-15 317 DocMamba: Robust Document Image Dewarping via Selective State Space Sequence Modeling Han, Miaolin; Li, Huibin
PS2-16 326 Real-Time Action Detection in Volleyball Matches Using DETR Architecture Shih, Mu-Jan; Hsu, Yi-Yu
PS2-17 332 Select and Order: Enhancing Few-Shot Image Classification through In-Context Learning Huang, Hujiang; Xie, Yu; Gao, Jun; Fan, Chuanliu; Cao, Ziqiang
PS2-18 336 SMG-Diff: Adversarial Attack Method Based on Semantic Mask-Guided Diffusion Zhang, Yongliang; Liu, Jing
PS2-19 344 Dual-Task Feedback Learning for Tongue Detection via Super-Resolution Integration Sun, Ying; Wei, Meiyi; Chen, Gang
PS2-20 354 Towards Visual Storytelling by Understanding Narrative Context through Scene-Graphs Phueaksri, Itthisak; Kastner, Marc A.; Kawanishi, Yasutomo; Komamizu, Takahiro; Ide, Ichiro
PS2-21 456 AMFT-YOLO: An Adaptive Multi-Scale YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes Wang, Tiebiao; Li, Xiaoyang; Cui, Zhenchao
PS2-22 276 Lightweight Dual Grouped Large-Kernel Convolutions for Salient Object Detection Network Liu, Jiajie; Zhang, Zhibin
PS2-23 312 Modeling High-order Relationships between Human and Video for Emotion Recognition Ai, Hanxu; Tao, Xiaomei; Li, Xingbing; Gan, Yanling
DP 117 EIA: Edge-aware Imperceptible Adversarial Attacks on 3D Point Clouds Wang, Zhensu; Peng, Weilong; Wang, Le; Wu, Zhizhe; Zhu, Peican; Tang, Keke
DP 127 MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms Zhang, Jiahao; Gao, Guangyu; Zhao, Xiao
DP 140 Infrared Small Target Detection with Feature Refinement and Context Enhancement Li, Xiuhong; Zhu, Xinyue; Li, Boyuan; Li, Songlin; Wang, Luyao; Jia, Zhenhong
DP 173 Modality-Specific Hashing: Transform Cross-Modal Retrieval into Single-Modal Retrieval Ding, Guohui; Li, Zhonghua; Ren, Yongqiang
DP 178 Multimodal Prompt Learning for Audio Visual Scene-aware Dialog Xu, Feifei; Jia, Fumiaoyue; Zhou, Wang
DP 182 MSA-Former: Multi-Scale Adaptive Transformer for Image Snow Removal Wang, Bin; Chen, Zekun; Zhang, Lei; Liang, Shili; Guo, Sijia; Kang, Xinyu; Li, Huajing
DP 184 SES-Net: Multi-dimensional Spot-Edge-Surface Network for Nuclei Segmentation Lu, Congjian; Zhou, Shuwang; Shan, Ke; Zhang, Hongkuan; Liu, Zhaoyang
DP 189 PianoPal: A Robotic Multimedia System for Interactive Piano Instruction Based on Q-learning and Real-time Feedback Wang, Yufei; Yao, Junfeng; Wang, Zefeng
DP 199 CLIP Multi-modal Hashing for Multimedia Retrieval Zhu, Jian; Sheng, Mingkai; Huang, Zhangmin; Chang, Jingfei; Long, Jian; Jiang, Jinling; Liu, Lei; Luo, Cheng
DP 223 Integrating S1&S2 Framework for Enhanced Semantic Match in Person Re-identification Yang, Xiukang; Ge, Jingguo; Li, Hui; Li, Liangxiong; Wu, Bingzhen
DP 237 Hyper-NeuS: Hypernetworks for Neural SDF Implicit Surface Reconstruction by Volume Rendering Li, Jingkun; Qi, Na; Zhu, Qing
DP 253 Structural Information-guided Fine-grained Texture Image Inpainting Fang, Zhiyi; Qian, Yi; Dai, Xiyue
DP 272 GFA-UDIS: Global-to-Flow Alignment for Unsupervised Deep Image Stitching Han, Sijia; Zhang, Zhibin
DP 275 Joint Decision Network with Modality-Specific and Dual Interactive Features for Fake News Detection Wu, Fei; Zhou, Ruixuan; Ji, Yimu; Jing, Xiao-Yuan
DP 277 MS-SAM: Multi-Scale SAM based on Dynamic Weighted Agent Attention Yang, Enhui; Zhang, Zhibin
DP 281 Multi-Modal Information Multi-Angle Mining For Multimedia Recommendation Zhu, Yijie; Li, MingYong
DP 305 MambaTalk: Speech-driven 3D Facial Animation with Mamba Zhu, Deli; Xu, Zhao; Yang, Yunong

Day 3: 10 January 13:30 – 15:00

Poster ID Paper ID Paper Title Authors
PS3-1 356 Rotation Methods for 360-degree Videos in Virtual Reality - A Comparative Study Hürst, Wolfgang; Zeches, Leo
PS3-2 360 Camouflaged Object Detection Based on Localization Guidance and Multi-Scale Refinement Wang, JinYang; Wu, Wei
PS3-3 362 Poseidon: A NAS-Based Ensemble Defense Method against Multiple Perturbations Su, Yulan; Zhang, Sisi; Lin, Zechao; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui
PS3-4 363 MM-CARP: Multimodal Model with Cross-modal retrieval-Augmented and visual Region Perception Guo, Junhao; Fu, Chenhan; Wang, Guoming; Lu, Rongxing; Chen, Dong; Tang, Siliang
PS3-5 365 Revisit Data Association in Semantic SLAM Systems for Autonomous Parking Shao, Xuan; Huang, Leming; Liu, Xinghua
PS3-6 368 Lightweight Motion-Aware Video Super-Resolution for Compressed Videos Kwon, Ilhwan; Li, Jun; Shah, Rajiv Ratn; Prasad, Mukesh
PS3-7 373 Vision-Language Pretraining for Variable-shot Image Classification Papadopoulos, Sotirios; Ioannidis, Konstantinos; Vrochidis, Stefanos; Kompatsiaris, Ioannis; Patras, Ioannis
PS3-8 377 A Multi-Aspect Multi-Granularity Pronunciation Assessment Method Based on Branchformer Encoder and Hierarchical Aggregation Du, Wenxu; Wumaier, Aishan; Shi, Yahui; Yi, Nian; Liu, Dehua
PS3-9 386 SCANet: Semantic Coherence Attention Network for Clothing Change Person Re-identification Yang, Dajiang; Wu, Wei; Lee, Yuxing
PS3-10 417 Toward A Full Pipeline Approach to Autonomous Drone Landing Site Identification: From Terrain Survey to Embedded Classifier Springer, Joshua David; Guðmundsson, Gylfi Þór; Kyas, Marcel
PS3-11 429 Innovative Lifelog Visualization and Exploration in Virtual Reality - A Comparative Study Hürst, Wolfgang; Visser, Yannick
PS3-12 435 Synchronization and Calibration of Video Sequences acquired using Multiple Plenoptic 2.0 Cameras Bonatto, Daniele; Fernandes Pinto Fachada, Sarah; Sancho, Jaime; Juarez, Eduardo; Lafruit, Gauthier; Teratani, Mehrdad
PS3-13 444 A Dual-Branch Model for Color Constancy Chen, Zhaoxin; Ma, Bo
PS3-14 445 Data-free Functional Projection of Large Language Models onto Social Media Tagging Domain Mu, Wenchuan; Lim, Kwan Hui
PS3-15 455 MDT-Net: a mask decoder tuning strategy for CLIP-based zero-shot 3D Classification Yan, Hao; Bai, Jing
PS3-16 458 Optimally Planning Drone Trajectory to Capture a 3D Gaussian Splatting Object Wu, Cheng-Yuan; Sun, Yuan-Chun; Lee, Cheng-Tse; Hsu, Cheng-Hsin
PS3-17 230 Quantifying Image-Adjective Associations by Leveraging Large-Scale Pretrained Models Matsuhira, Chihaya; Kastner, Marc A.; Komamizu, Takahiro; Hirayama, Takatsugu; Ide, Ichiro
PS3-18 137 Can masking background and object reduce static bias for zero-shot action recognition? Fukuzawa, Takumi; Hara, Kensho; Kataoka, Hirokatsu; Tamaki, Toru
PS3-19 355 CalorieVoL: Integrating Volumetric Context into Multimodal Large Language Models for Image-based Calorie Estimation Tanabe, Hikaru; Yanai, Keiji
PS3-20 416 Multimodal Engagement Prediction in Human-Robot Interaction using Transformer Neural Networks Lim, Jia Yap; See, John; Dondrup, Christian
PS3-21 431 What Should Autonomous Robots Verbalize and What Should They Not? Yoshihara, Daichi; Yuguchi, Akishige; Kawano, Seiya; Iio, Takamasa; Yoshino, Koichiro
PS3-22 438 BiCA-YOLO: Bidirectional Feature Enhancement and Cross Coordinate Attention for Small Object Detection Lv, Jinyan; Xiao, Guoqiang
DP 307 Frequency-Based Unsupervised Low-Light Image Enhancement Framework Wang, Haodian
DP 309 Target-Oriented Dynamic Denoising Curriculum Learning for Multimodal Stance Detection Suo, Zihao; Pan, Shanliang
DP 316 Noise-robust Separating Multi-source Aliased Vibration Signal Based on Transformer Demucs Jiang, Wanchang; Jiang, Yuxin
DP 321 gFlow: Distributed Real-Time Reverse Remote Rendering System Model Xu, Yixiao; Li, Yubo; Xu, Wanzhao; Gu, Yicheng; Wang, Yun; Ma, Jiangyuan; Qi, Zhengwei
DP 331 BLCC: A Benchmark for Multi-LiDAR and Multi-Camera Calibration Hou, Minghui; Wang, Gang; Wang, Zhiyang; Zhang, Tongzhou; Ma, Baorui
DP 342 MC-YOLO: Multi-scale Transmission Line Defect Target Recognition Network Wang, Jingdong; Ding, Xu; Meng, Fanqi
DP 350 A Novel Human Abnormal Posture Detection Method Based on Spatial-Topological Feature Fusion of Skeleton Ma, Yuefeng; Cheng, Zhiqi; Liu, Deheng; Tang, Shiying
DP 359 SSCDUF: Spatial-Spectral Correlation Transformer Based on Deep Unfolding Framework for Hyperspectral Image Reconstruction Zhao, Hui; Qi, Na; Zhu, Qing; Lin, Xiumin
DP 383 Cross-View Geo-Localization via Learning Correspondence Semantic Similarity Knowledge Chen, Guanli; Huang, Guoheng; Yuan, Xiaochen; Chen, Xuhang; Zhong, Guo; Pun, Chi-Man
DP 385 Style Separation and Content Recovery for Generalizable Sketch Re-identification and A New Benchmark Lu, Lingyi; Xu, Xin; Wang, Xiao
DP 387 Chain of Thought Guided Few-shot Fine-tuning of LLMs for Multimodal Aspect-based Sentiment Classification Wu, Hao; Yang, Danping; Liu, Peng; Li, Xianxian
DP 393 Progressive Neural Architecture Generation with Weaker Predictors Zhang, Zhengzhuo; Zhuang, Liansheng
DP 420 Self-Supervised Reference-based Image Super-Resolution with Conditional Diffusion Model Shi, Shuai; Qi, Na; Li, Yezi; Zhu, Qing
DP 447 TPS-YOLO: The Efficient Tiny Person Detection Network Based on Improved YOLOv8 and Model Pruning Yao, Li; Huang, Qianni; Wan, Yan
DP 460 MICAN: Multi-modal Inconsistency-based Cooperation Attention Network for fake news detection Yi, Zepu; Lu, Songfeng; Tang, Xueming; Zhu, Jianxin; Wu, Junjun
DP 214 TACST: Time-Aware Transformer for Robust Speech Emotion Recognition Wei, Wei; Zhang, Bingkun; Wang, Yibing
DP 215 TS-MEFM: A New Multimodal Speech Emotion Recognition Network Based on Speech and Text Fusion Wei, Wei; Zhang, Bingkun; Wang, Yibing

Demonstrations: Day 2 & 3 (9 and 10 January 13:30 – 15:00)

Demo ID Paper ID Paper Title Authors
D01 468 SelectSum: Topic-Based Selective Summarization of Speech-Based Videos Wattasseril, Jobin Idiculla; Döllner, Jürgen
D02 469 Real-time Visualizer for Turntablist Performance Hamanaka, Masatoshi
D03 494 Multi-Dimensional Exploration of Media Collection Metadata Khan, Omar Shahbaz ; Duane, Aaron ; Hasnan, Hariz ; Blavec, Noé Le ; Ouvrard, Pierre ; Verdon, Johan ; d’Orazio, Laurent ; Thierry, Constance ; Jónsson, Björn Þór
D04 470 DriveCoach: Smart Driving Assistance with Multimodal Risk Prediction and Risk Adaptive Behavior Recommendation Gan, Wenbin; Dao, Minh-Son; Zettsu, Koji
D05 472 System Demo of Modeling Smart University Campus Virtual Environments Fernandez Roblero, Jaime Boanerjes ; Ali, Muhammad Intizar
D06 473 AMDA: Advancing Multimedia Data Annotation for human-centric situations Mohamed Serouis, Ibrahim; Sèdes, Florence
D07 475 FencBuddy: Action-aware Depth Perception Training for Fencing Attacks Peng, Hung-Yao; Zhong, Zi-Heng; Tsai, Cheng-Chih; Chiang, Ching-Yeh; Pan, Tse-Yu
D08 477 WaveFontStyler: Font Style Transfer Based on Sound Izumi, Kota; Yanai, Keiji
D09 479 Training a Segmentation-based Visual Anonymization Service for Street Scenes Korb, Martin; Bailer, Werner
D10 481 CleverFox: Integrating Visual Mnemonics with AI for Enhanced Language Learning Chiang, Yung-Chu ; Tang, Zi-Xian ; Luo, Yi-Ching ; Chang, Jason S.
D11 482 Fingering Prediction for Classical Guitar: Dataset Creation and Model Development Iino, Nami ; Iino, Akinaru
D12 483 An Implementation of Networked JamSketch Kitahara, Tetsuro ; Tsutsumi, Takuya ; Nagoshi, Takaaki ; Suzuki, Taizan
D13 485 Using Language Models to Generate and Forget the Narrative Memories of an Assistive Robot Garcia Contreras, Angel Fernando ; Chang, Wen-Yu ; Kawano, Seiya ; Chen, Yun-Nung ; Yoshino, Koichiro
D14 486 Better Image Segmentation with Classification: Guiding Zero-Shot Models Using Class Activation Maps Borgli, Hanna ; Stensland, Håkon Kvale ; Halvorsen, Pål
D15 488 Transformer-Based Audio Generation Conditioned by 2D Latent Maps: A Demonstration Limberg, Christian ; Zhang, Zhe ; Kastner, Marc A.
D16 489 KuzushijiFontDiff: Diffusion Model for Japanese Kuzushiji Font Generation Yuan, Honghui; Yanai, Keiji
D17 490 SceneTextStyler: Editing Text with Style Transformation Yuan, Honghui; Yanai, Keiji
D18 492 Multimodal Interoperability with the CLAMS Platform Lynch, Kelley ; Rim, Kyeongmin ; King, Owen ; Pustejovsky, James
D19 493 Enhancing User Control in AI-Based Video Summarization for Social Media Kontostathis, Ioannis; Apostolidis, Evlampios; Apostolidis, Konstantinos; Mezaris, Vasileios
D20 496 Movie Retrieval Systems Using Genre-guided Multimodal Learning Techniques Huang, Wei-Lun ; Hidayati, Shintami Chusnul ; Pan, Tse-Yu
D21 497 A User Identification and Reading Style Detection System Based on Eye Movement Patterns During Reading Kongmeesub, Onanong; Gurrin, Cathal; Nie, Dongyun
D22 484 Federated Learning with Multimodal-Sensing and Knowledge Distillation: An application on real-world benchmark dataset Le, Duy-Dong ; Huynh, Duy-Thanh ; Bao, Pham The
D23 499 Efficient Deployment of Multimodal AI Models: Leveraging Pruning, Quantization and Multi-Objective Optimization for Edge Computing Vu, Dang ; Dang, Tien ; Nguyen, Quoc-Trung ; Pham, Tan
D24 466 Badminton Footwork Practice via an Immersive Virtual Reality System Jheng, Duen-Chian ; Harchan, Bill Louis ; Kostka de Sztemberg, Berenika Nawoja ; Hsu, Jen-Hao ; Hu, Min-Chun
D25 480 RoboDJ: Live Commentary Robots System Driven by Physical- and Cyber-world Observations Kawanishi, Yasutomo; Nakamura, Yutaka; Shintani, Taiken; Ishi, Carlos T.; Kawano, Seiya; Yoshino, Koichiro; Minato, Takashi; Minoh, Michihiko
D26 487 Leveraging Latent Diffusion in 3D Gaussian Splatting for Novel View Synthesis Li, Bohan ; Li, Xingyi ; Liang, Yangwen ; Wang, Shuangquan ; Song, Kee-Bong

VBS: Video Browser Showdown: Day 1 (8 January)

Paper ID Authors Title
406 Nguyen-Ho, Thang-Long; Huynh, Viet-Tham; Kongmeesub, Onanong; Tran, Minh-Triet; Nie, Dongyun; Healy, Graham; Gurrin, Cathal VEAGLE: Eye Gaze-Assisted Guidance for Video Browser Showdown
501 Tran, Quang-Linh; Nguyen, Binh; Jones, Gareth J. F.; Gurrin, Cathal VideoEase at VBS2025: An Interactive Video Retrieval System
502 Rossetto, Luca; Gasser, Ralph Feature-driven Video Segmentation and Advanced Querying with vitrivr-engine
503 Nguyen, Tai; Vo, Anh Ngoc Minh; Pham, Dat Duc; Tran, Vinh Quang; Duong, Nhu Thi Quynh; Le, Tien Anh; Le, Tan Duy; Nguyen, Binh T. HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025
504 CHENG, Yu Tong; WU, Jiaxin; MA, Zhixin; HE, Jiangshan; WEI, Xiao-Yong; NGO, Chong Wah Interactive Video Search with Multi-modal LLM Video Captioning
505 Le, Huy M.; Nguyen Tien, Dat; Le Duy, Khang; Nguyen Dang Quang, Tuan; Nguyen Khanh, Toan; Nguyen, Binh T. FUSIONISTA: Fusion of 3-D Information of Video in Retrieval System
506 C. Quan, Khanh-An; Ngoc Nguyen, Qui; Tran, Minh-Triet ViFi: A Video Finding System at Video Browser Showdown 2025
507 Vuong, Gia-Huy; Ho, Van-Son; Nguyen-Dang, Tien-Thanh; Thai, Xuan-Dang; Ho-Le, Minh-Quan; Le, Tu-Khiem; Pham, Minh-Khoi; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet ViewsInsight2.0: Enhancing Video Retrieval for VBS 2025 with an Automatic Query Generator Powered by Large Language Models
508 Pantelidis, Nick; Georgalis, Dimitris; Pegia, Maria; Galanopoulos, Damianos; Apostolidis, Konstantinos; Stavrothanasopoulos, Klearchos; Moumtzidou, Anastasia; Gkountakos, Konstantinos; Gialampoukidis, Ilias; Vrochidis, Stefanos; Mezaris, Vasileios; Kompatsiaris, Ioannis VERGE in VBS 2025
509 Sharma, Ujjwal; Khan, Omar Shahbaz; Rudinac, Stevan; Jónsson, Björn Þór Exquisitor at the Video Browser Showdown 2025: Unifying Conversational Search and User Relevance Feedback
510 Spiess, Florian; Rossetto, Luca; Schuldt, Heiko Simplified Video Retrieval in Virtual Reality with vitrivr-VR
511 Leopold, Mario; Schöffmann, Klaus diveXplore at the Video Browser Showdown 2025
512 Tran Gia, Bao; Bui Cong Khanh, Tuong; Le Thi Thanh, Tam; Tran Doan, Thuyen; Le Tran Trong, Khiem; Do, Tien; Mai, Tien-Dung; Duc Ngo, Thanh; Le, Duy-Dinh; Satoh, Shin’ichi NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search
513 Stroh, Michael; Kloda, Vojtěch; Verner, Benjamin; Vopálková, Zuzana; Buchmüller, Raphael; Jäckl, Bastian; Lokoč, Jakub; Hajko, Jakob PraK Tool V3: Enhancing Video Item Search Using Localized Text and Texture Queries
514 Arnold, Rahel; Kempf, Rahel; Waltenspül, Raphael; Schuldt, Heiko MediaMix: Multimedia Retrieval in Mixed Reality
515 Ho-Le, Minh-Quan; Ho, Duy-Khang; Do-Huu, Huy-Hoang; Le-Hinh, Nhut-Thanh; Vo-Hoang, Hoa-Vien; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet SnapSeek 2.0 at Video Browser Showdown 2025
517 Luu, Duc-Tuan; C. Quan, Khanh-An; Nguyen, Duy-Ngoc; Bui-Le, Khanh-Linh; Doan, Nhat-Sang; Le-Ngo, Minh-Duc; Nguyen, Vinh-Tiep; Tran, Minh-Triet IMSearch 2.0: Toward User-centric and Efficient Interactive Multimedia Retrieval System

Social Events

Welcome Reception (Day 1: 8 January)

We warmly invite all attendees to the Welcome Reception.

  • Time: 6:00 PM ~ 8:00 PM (tentative)
  • Location: Reception Hall 1
  • Refreshments including a variety of foods and drinks will be provided.

Banquet (Day 2: 9 January)

  • Time: Starts at 6:00 PM (tentative)
  • Location: KOTOWA Nara-Koen Premium View
    • Address: 15 Imamikadocho, Nara, 630-8374, Japan (〒630-8374 奈良県奈良市今御門町15)
  • Foods and drinks will be provided.
    • Highlight: Kiki-sake (利き酒) will be held as part of the banquet.
      • “Kiki-sake” is the Japanese tradition of sake tasting. It involves sampling and evaluating different types of sake to appreciate their flavors, aromas, and characteristics, much like wine tasting in Western cultures. The word ‘kiki’ refers to discerning or distinguishing, and ‘sake’ is Japan’s traditional rice wine. It is often done in a formal setting or as an enjoyable activity to explore the rich variety of sake styles.