Program at a Glance

Day Time Event
Jan. 7 Afternoon Preparation for VBS (* Registration desk won’t be available.)
Jan. 8 Morning Registration opens.
Jan. 8 Morning - Evening Main conference
Jan. 8 Night Reception
Jan. 9 Morning - Evening Main conference
Jan. 9 Night Banquet
Jan. 10 Morning - Evening Main conference

Keynote Talks

Multimodal, Multilingual Generative AI: From Multicultural Contextualization to Empathetic Reasoning
Nancy F. Chen

We will present MeraLion (Multimodal Empathetic Reasoning and Learning In One Network), our generative AI effort within Singapore’s National Multimodal Large Language Model Programme. Speech and audio carry rich information that supports a more comprehensive understanding of spatial and temporal reasoning, as well as of social dynamics, going beyond the semantics derived from text alone. Cultural nuances and multilingual peculiarities add another layer of complexity to understanding human interactions. In addition, we will draw on use cases in education to highlight research endeavors, technology deployment experience, and application opportunities.
Biography: Nancy F. Chen is an A*STAR fellow who leads the Multimodal Generative AI group, heads the Artificial Intelligence for Education (AI4EDU) programme at I2R (Institute for Infocomm Research), and is a principal investigator at CFAR (Centre for Frontier AI Research), A*STAR. Dr. Chen’s recent work on large language models has won honors at ACL 2024, including the Area Chair Award and the Best Paper Award for Cross-Cultural Considerations in Natural Language Processing. Dr. Chen consistently garners best paper awards for her AI research applied to diverse applications. Examples include IEEE ICASSP 2011 (forensics), APSIPA 2016 (education), SIGDIAL 2021 (social media), MICCAI 2021 (neuroscience), and EMNLP 2023 (healthcare). Multilingual spoken technology from her team has led to commercial spin-offs and has been deployed at Singapore’s Ministry of Education to support home-based learning. Dr. Chen has supervised 100+ students and staff. She has won professional awards from the USA National Institutes of Health, IEEE, Microsoft, P&G, UNESCO, and L’Oréal. Her professional service includes Program Chair of NeurIPS 2025, APSIPA Board of Governors (2024-2026), IEEE SPS Distinguished Lecturer (2023-2024), Program Chair of ICLR 2023, and Board Member of ISCA (2021-2024), and she was honoured as one of the Singapore 100 Women in Tech (2021). Prior to A*STAR, she worked at MIT Lincoln Lab while pursuing a PhD at MIT and Harvard. For more info: http://alum.mit.edu/www/nancychen.

Manga109 and MangaUB: How Far Can Large Multimodal Models (LMMs) Go in Understanding Manga?
Kiyoharu Aizawa

Manga is a form of Japanese content that has gained global recognition. It is a unique multimedia format that combines both images and text. We created a dataset called Manga109, composed of 109 manga comic books. In 2015, we released a dataset containing approximately 20,000 manga pages, and in 2018, we published an extended version with annotations for more than 500,000 objects, including characters and speech balloons on each page. It is the largest manga dataset in the world with such detailed manual annotations. Manga109 is available for academic use, and we have distributed over 2,000 copies of the dataset to date. Various research efforts have been made using this dataset, both domestically and internationally. For example, different groups have tackled tasks such as character recognition, expression recognition, dialogue recognition, speaker identification, and onomatopoeia recognition. In this talk, I will describe the journey of Manga109, from its beginning to the present, and present the MangaUB benchmark dataset, which assesses the current state of manga comprehension in rapidly advancing large multimodal models (LMMs).
Biography: Kiyoharu Aizawa received the B.E., M.E., and Dr. Eng. degrees in Electrical Engineering, all from the University of Tokyo, in 1983, 1985, and 1988, respectively. He is a professor with the Department of Information and Communication Engineering and Director of the VR center, University of Tokyo. He was a visiting assistant professor with the University of Illinois from 1990 to 1992. His research fields are multimedia, image processing, and computer vision, with a particular interest in interdisciplinary and cross-disciplinary issues. He received the 1990 and 1998 Best Paper Awards, the 1991 Achievement Award, and the 1999 Electronics Society Award from IEICE Japan; the 1998 Fujio Frontier Award, the 2002 and 2009 Best Paper Awards, and the 2013 and 2021 Achievement Awards from ITE Japan; and the IBM Japan Science Prize in 2002. He is on the Editorial Board of ACM TOMM. He served as the Editor-in-Chief of the Journal of ITE Japan and as an Associate Editor of IEEE TIP, TCSVT, TMM, and MultiMedia. He has also played key roles in numerous international and domestic conferences, serving as General Co-Chair of MMM 2008, ACM Multimedia 2012, and ACM ICMR 2018. He is a Fellow of IEEE, IEICE, and ITE, and a member of the Science Council of Japan.

Multi-modal foundation models in the automotive industry
Andrei Bursuc

The tremendous progress of deep-learning-based approaches to image understanding problems has inspired new advanced perception functionalities for autonomous systems. However, real-world perception systems often require models that can learn from large bulks of unlabeled and uncurated data with few labeled samples, which are usually costly to select and annotate. In contrast, typical supervised methods require extensive collections of carefully selected labeled data, a condition that is seldom fulfilled in practical applications. Self-supervised learning (SSL) arises as a promising line of research to mitigate this gap by training foundation models using various supervision signals extracted from the data itself, without any human-generated labels. While most popular SSL methods revolve around web image datasets, new diverse forms of self-supervision are starting to be investigated for autonomous driving (AD). AD represents a unique sandbox for SSL methods, as it offers some of the largest public data collections in the community with different paired sensors (multiple cameras, Lidar, radar, ultrasonics) and provides some of the most challenging computer vision tasks: object detection, depth estimation, image-based odometry and localization, etc. Here, the canonical SSL pipeline (i.e., self-supervised pre-training of a model and fine-tuning it on a downstream task) is revisited and extended to entirely new SSL approaches for computer vision and robotics (e.g., world models), but also to new downstream usages of pre-trained foundation models, such as cross-sensor distillation, auto-labelling, data mining, and architecture re-purposing. This talk will provide a tour of different forms of foundation models across multiple sensor types equipping today's and tomorrow's vehicles in a quest towards annotation-efficient and reliable perception systems.
Biography: Andrei Bursuc is a senior research scientist and deputy scientific director at valeo.ai and a research associate at the Astra Inria project team in Paris, working on perception for assisted and autonomous driving. His research interests concern the reliability of deep neural networks, learning with limited supervision, and multi-modal multi-sensor perception. Andrei also teaches at Ecole Polytechnique and at Ecole Normale Supérieure in Paris. Previously he was a research scientist at Safran Tech in the aerospace industry. Prior to that he was a postdoctoral researcher at Inria Paris, within the Willow project team working with Josef Sivic and Ivan Laptev, and at Inria Rennes with Hervé Jégou. He did his PhD at Ecole des Mines Paris and Alcatel-Lucent Bell Labs France with Francoise Preteux and Titus Zaharia on visual content indexing and retrieval. Andrei is a member of the ELLIS society and is regularly part of the technical program committee for CVPR, ICCV, ECCV and NeurIPS. Previously he co-organized the CVPR’20-’21 and ECCV’22 tutorials on self-supervised learning, and the ICCV’23 and ECCV’24 tutorials on reliability and uncertainty estimation.

Accepted Papers

Regular Paper

paperID authors title
111 Yin, Min ; Xie, Liang ; Liang, HaoRan ; Zhao, Xing ; Chen, Ben ; Liang, RongHua SCLSTE: Semi-Supervised Contrastive Learning-Guided Scene Text Editing
117 Wang, Zhensu ; Peng, Weilong ; Wang, Le ; Wu, Zhizhe ; Zhu, Peican ; Tang, Keke EIA: Edge-aware Imperceptible Adversarial Attacks on 3D Point Clouds
120 Shang, Yuzhang ; Liu, Gaowen ; Kompella, Ramana ; Yan, Yan Quantized-ViT Efficient Training via Fisher Matrix Regularization
121 Kong, Yongqiang; Wang, Yunhong; Li, Annan Saliency based data augmentation for few-shot video action recognition
127 Zhang, Jiahao ; Gao, Guangyu ; Zhao, Xiao MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms
128 Ye, Yuyao ; Yang, Jiayu ; Zhao, Yang ; Gao, Mengping ; Cao, Hongbin ; Wang, Ronggang Hybrid Scalable Video Coding with Neural Compression and Enhancement for Streaming Media
129 Lv, Yishan; Luo, Jing; Ju, Boyuan; Yang, Xinyu Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation
130 Cai, Pengzhou ; Jiang, Lu ; Li, Yanxin ; Liu, Xiaojuan ; Lan, Libin Pubic Symphysis-Fetal Head Segmentation Network Using BiFormer Attention Mechanism and Multipath Dilated Convolution
131 Li, Guofeng ; Li, Hanxi ; Li, Bo ; Wu, Lin ; Cheng, Yan DART: Depth-Enhanced Accurate and Real-Time Background Matting
140 Li, Xiuhong; Zhu, Xinyue; Li, Boyuan; Li, Songlin; Wang, Luyao; Jia, Zhenhong Infrared Small Target Detection with Feature Refinement and Context Enhancement
141 Cai, Zeyu ; Chen, Xunhao ; Zhang, Can ; Chen, Yuchong ; Yang, Jiming ; Shi, Wubin ; Jin, Chengqian ; Da, Feipeng MLP-AMDC: A MLP Architecture for Adaptive-Mask-based Dual-Camera snapshot hyperspectral imaging
144 Tsukuda, Kosetsu; Takahashi, Takumi; Ishida, Keisuke; Hamasaki, Masahiro; Goto, Masataka Kiite World: Socializing Map-Based Music Exploration Through Playlist Sharing and Synchronized Listening
146 Zhu, Qinfeng ; Weng, Ningxin ; Fan, Lei ; Cai, Yuanzhi Enhancing Environmental Monitoring through Multispectral Imaging: The WasteMS Dataset for Semantic Segmentation of Lakeside Waste
158 Song, Tao ; Zhang, Wenwen Frequency-aware Convolution for Sound Event Detection
163 Liu, Dongyu; Zhu, Yuan; Liu, Rui; Xing, Zhecong; Geng, Weiyang; Wang, Yanqiang MSD-YOLO: An efficient algorithm for small target detection
166 Li, Yongqian ; Luo, Yong ; Zhou, Xin Robust Active Speaker Detection in Challenging Environments Using GNN-Fused Multi-Modal Cues and Body Language
167 Wang, Xiwen; Zhou, Jizhe; Li, Mao; Zhu, Xuekang; Li, Cheng Saliency Guided Optimization Of Diffusion Latents
172 Tian, Xiang; Zhang, Yuan; Mu, Chang; Zhang, Ziyang Intra-Class Compact Facial Expression Recognition Based on Amplitude Phase Separation
173 Ding, Guohui; Li, Zhonghua; Ren, Yongqiang Modality-Specific Hashing: Transform Cross-Modal Retrieval into Single-Modal Retrieval
174 Suyama, Kosei; Nakamura, Kazuaki Detoxification of Unlabeled Dataset: Reducing Implicit Class Imbalance Using Pseudo-Jacobian of GAN’s Generator
176 Yu, Jizhe; Liu, Yu; Wu, Xiaoshuai; Xu, Kaiping; Li, Jiangquan PA2Net: Pyramid Attention Aggregation Network for Saliency detection
178 Xu, Feifei; Jia, Fumiaoyue; Zhou, Wang Multimodal Prompt Learning for Audio Visual Scene-aware Dialog
181 Chen, Liang-Chia; Chu, Wei-Ta HCV: Lightweight Hybrid CNN-Vision Transformer for Visual Object Tracking
182 Wang, Bin ; Chen, Zekun ; Zhang, Lei ; Liang, Shili ; Guo, Sijia ; Kang, Xinyu ; Li, Huajing MSA-Former: Multi-Scale Adaptive Transformer for Image Snow Removal
184 Lu, Congjian ; Zhou, Shuwang ; Shan, Ke ; Zhang, Hongkuan ; Liu, Zhaoyang SES-Net: Multi-dimensional Spot-Edge-Surface Network for Nuclei Segmentation
188 Zhang, Jingyao ; Hao, Shijie ; Sun, Fuming ; Rao, Yuan LIESA: Low-light Image Enhancement with Semantic Awareness
189 Wang, Yufei ; Yao, Junfeng ; Wang, Zefeng PianoPal: A Robotic Multimedia System for Interactive Piano Instruction Based on Q-learning and Real-time Feedback
192 Vadicamo, Lucia ; Scotti, Francesca ; Dearle, Alan ; Connor, Richard Comparative Analysis of Relevance Feedback Techniques for Image Retrieval
195 Sun, Yongqing ; Liu, Hong ; Chang, Qiong ; Han, Xianhua Deep Dual Internal Learning for Hyperspectral Image Super-Resolution
196 Tan, Wenhui ; Liu, Bei ; Zhang, Junbo ; Song, Ruihua ; Fu, Jianlong RoLD: Robot Latent Diffusion for Multi-task Policy Modeling
198 Wu, Weijie ; Li, Jun ; Wu, Zhijian ; Xu, Jianhua Zero-shot sketch-based image retrieval with hybrid information fusion and sample relationship modeling
199 Zhu, Jian ; Sheng, Mingkai ; Huang, Zhangmin ; Chang, Jingfei ; Long, Jian ; Jiang, Jinling ; Liu, Lei ; Luo, Cheng CLIP Multi-modal Hashing for Multimedia Retrieval
206 Juliussen, Bjørn Aslak The Right to an Explanation under the GDPR and the AI Act
209 Gu, Tao; Zhang, Chongyang An Enhanced Vision-Language Pre-Training Approach for Scene Text Detection
218 Xu, Bo; Jiang, Haiqi; Wei, Shouang; Du, Ming; Song, Hui; Wang, Hongya A Multi-Expert Collaborative Framework for Multimodal Named Entity Recognition
221 Perez, Miguel ; Kirchhoff, Holger ; Grosche, Peter ; Serra, Xavier Improving singing voice transcription generalization with AI generated accompaniments
223 Yang, Xiukang ; Ge, Jingguo ; Li, Hui ; Li, Liangxiong ; Wu, Bingzhen Integrating S1&S2 Framework for Enhanced Semantic Match in Person Re-identification
228 Sunada, Tatsumi; Shiohara, Kaede; Xiao, Ling; Yamasaki, Toshihiko LITA: LMM-guided Image-Text Alignment for Art Assessment
229 Yadav, Saumya ; Lincker, Élise ; Huron, Caroline ; Martin, Stéphanie ; Guinaudeau, Camille ; Satoh, Shin’ichi ; Shukla, Jainendra Towards Inclusive Education: Multimodal Classification of Textbook Images for Accessibility
232 Vu, Thi Ngoc Anh ; Shoji, Yoshiyuki ; Oe, Yuma ; PHAM, Huu Long ; Ohshima, Hiroaki Image-Generation AI Model Retrieval by Contrastive Learning-based Style Distance Calculation
236 Yaling, Hao; Wei, Wu MineTinyNet-YOLO: An Efficient Small Object Detection Method for Complex Underground Coal Mine Scenarios
237 Li, Jingkun; Qi, Na; Zhu, Qing Hyper-NeuS: Hypernetworks for Neural SDF Implicit Surface Reconstruction by Volume Rendering
241 Cao, Qian; Song, Ruihua; Chen, Xu Understanding the Roles of Visual Modality in Multimodal Dialogue: An Empirical Study
242 Yu, Le ; Zhang, Xianchao ; Qian, Shuxia ; Sun, Hong DistillSleep: Leverage Self-Distillation to Improve Performance After Representation Learning for Sleep Staging
244 Si, Jiahua ; Wang, Youze ; Hu, Wenbo ; Liu, Qiang ; Hong, Richang Making strides Security in Multimodal Fake News Detection Models: A Comprehensive Analysis of Adversarial Attacks
246 Yamamoto, Shuhei ; Kando, Noriko Temporal Closeness for Enhanced Cross-Modal Retrieval of Sensor and Image Data
247 Losfeld, Armand; Seznec, Nicolas; Van Bogaert, Laurie; Lafruit, Gauthier; Teratani, Mehrdad An Analytical Method for Rendering Plenoptic Cameras 2.0 on 3D Multi-Layer Displays
251 Zhu, Yingqian; Gao, Guanyu QRALadder: QoE and Resource Consumption-Aware Encoding Ladder Optimization for Live Video Streaming
253 Fang, Zhiyi ; Qian, Yi ; Dai, Xiyue Structural Information-guided Fine-grained Texture Image Inpainting
256 Jiang, Ling; Liu, Zhuocheng; Li, Kaige; Wu, Wei Boosting Human Pose Estimation via Heatmap Refinement
265 Imajuku, Yuki; Yamakata, Yoko; Aizawa, Kiyoharu FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation
266 Sharma, Nikhil ; Sun, Changchang ; Zhao, Zhenghao ; Ngu, Anne Hee Hiong ; Latapie, Hugo ; Yan, Yan SSDL: Sensor-to-Skeleton Diffusion Model with Lipschitz Regularization for Human Activity Recognition
268 Lincker, Elise ; Guinaudeau, Camille ; Satoh, Shin’ichi AD2AT: Audio Description to Alternative Text, a Dataset of Alternative Text from Movies
272 Han, Sijia; Zhang, Zhibin GFA-UDIS: Global-to-Flow Alignment for Unsupervised Deep Image Stitching
273 Sugahara, Aoto ; Kishimoto, Soma ; Adachi, Yuji ; Tai, Kiyoto ; Takashima, Ryoichi ; Takiguchi, Tetsuya Operatic Singing Voice Synthesis From Inexperienced Voice Considering Tempo and Vowel Change
274 Chen, Jiaxing ; Liu, Yuxuan ; Li, Dehu ; An, Xiang ; Deng, Weimo ; Feng, Ziyong ; Zhao, Yongle ; Xie, Yin Grounding Deliberate Reasoning in Multimodal Large Language Models
275 Wu, Fei ; Zhou, Ruixuan ; Ji, Yimu ; Jing, Xiao-Yuan Joint Decision Network with Modality-Specific and Dual Interactive Features for Fake News Detection
276 Liu, Jiajie; Zhang, Zhibin Lightweight Dual Grouped Large-Kernel Convolutions for Salient Object Detection Network
277 Yang, Enhui; Zhang, Zhibin MS-SAM: Multi-Scale SAM based on Dynamic Weighted Agent Attention
281 ZHU, YIJIE; Li, MingYong Multi-Modal Information Multi-Angle Mining For Multimedia Recommendation
283 Wang, Qing; Ngo, Chong Wah; Lim, Ee-Peng; Sun, Qianru LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets
292 Yip, Tin Yui; Chau, Chuck-jee Music2MIDI: Pop Music to MIDI Piano Cover Generation
293 Chen, Xiangyu; Satoh, Shinichi Balancing Efficiency and Accuracy: An Analysis of Sampling for Video Copy Detection
295 Xiang, Yanru; Li, Yi One-Shot Generative Domain Adaptation by Constructing Self-Amplifying Datasets
296 Zheng, Shuijing; Yu, Suxi; Wang, Yi; Wen, Jing GWUNet: A UNet with Gated Attention and Improved Wavelet Transform for Thyroid Nodules Segmentation
297 Chen, Junjian ; Yang, Xuan Uncertainty-guided Joint Semi-supervised Segmentation and Registration of Cardiac Images
305 Zhu, Deli ; Xu, Zhao ; Yang, Yunong MambaTalk: Speech-driven 3D Facial Animation with Mamba
306 Li, Yu ; Xie, Zhenping Visual Anomaly Detection on Topological Connectivity under Improved YOLOv8
307 Wang, Haodian Frequency-Based Unsupervised Low-Light Image Enhancement Framework
308 Chen, Zhuowei; Huang, Mengqi; Chen, Nan; Mao, Zhendong Skin-Adapter: Fine-Grained Skin-Color Preservation for Text-to-Image Generation
309 Suo, Zihao; Pan, Shanliang Target-Oriented Dynamic Denoising Curriculum Learning for Multimodal Stance Detection
310 YUAN, HONGHUI; YANAI, KEIJI KuzushijiDiffuser: Japanese Kuzushiji Font Generation with FontDiffuser
312 Ai, Hanxu ; Tao, Xiaomei ; Li, Xingbing ; Gan, Yanling Modeling High-order Relationships between Human and Video for Emotion Recognition
315 Falcon, Alex ; Abdari, Ali ; Serra, Giuseppe HierArtEx: Hierarchical Representations and Art Experts Supporting the Retrieval of Museums in the Metaverse
316 Jiang, Wanchang; Jiang, Yuxin Noise-robust Separating Multi-source Aliased Vibration Signal Based on Transformer Demucs
317 Han, Miaolin; Li, Huibin DocMamba: Robust Document Image Dewarping via Selective State Space Sequence Modeling
320 Yu, Le A Self-supervised Multiview Joint Pre-training Framework for Representation Learning in Sleep Staging
321 Xu, Yixiao ; Li, Yubo ; Xu, Wanzhao ; Gu, Yicheng ; Wang, Yun ; Ma, Jiangyuan ; Qi, Zhengwei gFlow: Distributed Real-Time Reverse Remote Rendering System Model
326 Shih, Mu-Jan ; Hsu, Yi-Yu Real-Time Action Detection in Volleyball Matches Using DETR Architecture
327 Jiang, Jiacheng ; Zhang, Shuo ; Zhang, Yiting ; Liu, Jing Deep Vision Transformer with Association Divergence for Image Anomaly Detection and Localization
328 Tao, Ran; Lu, Hailun; Lu, Xiaohui SMSF-Net: A Semantics-Driven Multiscale Selective Fusion Network for Object Detection in Remote Sensing Images
331 Minghui, Hou ; Gang, Wang ; Zhiyang, Wang ; Tongzhou, Zhang ; Baorui, Ma BLCC: A Benchmark for Multi-LiDAR and Multi-Camera Calibration
332 Huang, Hujiang; Xie, Yu; Gao, Jun; Fan, Chuanliu; Cao, Ziqiang Select and Order: Enhancing Few-Shot Image Classification through In-Context Learning
334 Tao, Ran; Lu, Xiaohui; Luo, Xin; Lu, Hailun HENet: High-level Semantic Guidance and Edge Feature Fusion Network for Prohibited Item Detection in X-ray Images
336 Zhang, Yongliang; Liu, Jing SMG-Diff: Adversarial Attack Method Based on Semantic Mask-Guided Diffusion
337 Terada, Takamasa; Toyoura, Masahiro Wavelet Integrated Convolutional Neural Network for ECG Signal Denoising
342 Wang, Jingdong; Ding, XU; Meng, Fanqi MC-YOLO: Multi-scale Transmission Line Defect Target Recognition Network
344 Sun, Ying; Wei, Meiyi; Chen, Gang Dual-Task Feedback Learning for Tongue Detection via Super-Resolution Integration
346 Su, Yulan; Zhang, Sisi; Wang, Yan; Wang, Xingbin; Zhao, Lutan; Dan, Meng; Hou, Rui RobSparse: Automatic Search for GPU-Friendly Robust and Sparse Vision Transformers
350 Ma, Yuefeng ; Cheng, Zhiqi ; Liu, Deheng ; Tang, Shiying A Novel Human Abnormal Posture Detection Method Based on Spatial-Topological Feature Fusion of Skeleton
354 Phueaksri, Itthisak ; Kastner, Marc A. ; Kawanishi, Yasutomo ; Komamizu, Takahiro ; Ide, Ichiro Towards Visual Storytelling by Understanding Narrative Context through Scene-Graphs
356 Hürst, Wolfgang; Zeches, Leo Rotation Methods for 360-degree Videos in Virtual Reality - A Comparative Study
359 Zhao, Hui; Qi, Na; Zhu, Qing; Lin, Xiumin SSCDUF: Spatial-Spectral Correlation Transformer Based on Deep Unfolding Framework for Hyperspectral Image Reconstruction
360 Wang, JinYang; Wu, Wei Camouflaged Object Detection Based on Localization Guidance and Multi-Scale Refinement
362 Su, Yulan; Zhang, Sisi; Lin, Zechao; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui Poseidon: A NAS-Based Ensemble Defense Method against Multiple Perturbations
363 Guo, Junhao ; Fu, Chenhan ; Wang, Guoming ; Lu, Rongxing ; Chen, Dong ; Tang, Siliang MM-CARP: Multimodal Model with Cross-modal retrieval-Augmented and visual Region Perception
365 Shao, Xuan ; Huang, Leming ; Liu, Xinghua Revisit Data Association in Semantic SLAM Systems for Autonomous Parking
368 KWON, ILHWAN ; Li, Jun ; Shah, Rajiv Ratn ; Prasad, Mukesh Lightweight Motion-Aware Video Super-Resolution for Compressed Videos
373 Papadopoulos, Sotirios ; Ioannidis, Konstantinos ; Vrochidis, Stefanos ; Kompatsiaris, Ioannis ; Patras, Ioannis Vision-Language Pretraining for Variable-shot Image Classification
374 Chang, Ding-Chi; Li, Shiou-Chi; Huang, Jen-Wei SPLGAN-TTS: Learning Semantic and Prosody to Enhance the Text-to-Speech Quality of Lightweight GAN Models
376 Zou, Yangbin; Liu, Tao YOLO-PCDM: An Enhanced Model for Small Object Detection and Multiscale Fusion in Remote Sensing Imagery
377 Du, Wenxu; Wumaier, Aishan; Shi, Yahui; Yi, Nian; Liu, Dehua A Multi-Aspect Multi-Granularity Pronunciation Assessment Method Based on Branchformer Encoder and Hierarchical Aggregation
379 Li, Yizhou; Liu, Zihua; Monno, Yusuke; Okutomi, Masatoshi TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration
383 Chen, Guanli ; Huang, Guoheng ; Yuan, Xiaochen ; Chen, Xuhang ; Zhong, Guo ; Pun, Chi-Man Cross-View Geo-Localization via Learning Correspondence Semantic Similarity Knowledge
385 Lu, Lingyi; Xu, Xin; Wang, Xiao Style Separation and Content Recovery for Generalizable Sketch Re-identification and A New Benchmark
386 Yang, Dajiang; Wu, Wei; Lee, Yuxing SCANet: Semantic Coherence Attention Network for Clothing Change Person Re-identification
387 Wu, Hao; Yang, Danping; Liu, Peng; Li, Xianxian Chain of Thought Guided Few-shot Fine-tuning of LLMs for Multimodal Aspect-based Sentiment Classification
392 Cheng, Shyi-Chyi ; CHEN, YEN-LIN ; Li, Shih-Yu MPPQNet: A Moment-Preserving Product Quantization Neural Network for Progressive 3D Point Cloud Transmission
393 Zhang, Zhengzhuo; Zhuang, Liansheng Progressive Neural Architecture Generation with Weaker Predictors
395 Goto, Yuta ; Yamazaki, Satoshi ; Shibata, Takashi ; Liu, Jianquan Open-vocabulary Scene Graph Generation via Synonym-based Predicate Descriptor
404 Ma, Yuefeng; Liu, Deheng; Cheng, Zhiqi Semantic- and Pose-Guided Feature Fusion for Multi-Pose Person Re-Identification
414 Hezel, Nico; Barthel, Kai Uwe; Schilling, Bruno; Schall, Konstantin; Jung, Klaus Dynamic Exploration Graph: A Novel Approach for Efficient Nearest Neighbor Search in Evolving Multimedia Datasets
415 Xu, Xiaoman; Li, Xiangrun; Wang, Taihang; Jiang, Ye AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection
417 Springer, Joshua David; Guðmundsson, Gylfi Þór; Kyas, Marcel Toward A Full Pipeline Approach to Autonomous Drone Landing Site Identification: From Terrain Survey to Embedded Classifier
420 Shi, Shuai; Qi, Na; Li, Yezi; Zhu, Qing Self-Supervised Reference-based Image Super-Resolution with Conditional Diffusion Model
429 Hürst, Wolfgang; Visser, Yannick Innovative Lifelog Visualization and Exploration in Virtual Reality - A Comparative Study
430 Li, Feng; Luo, Jiusong; Xia, Wanjun WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition
435 Bonatto, Daniele ; Fernandes Pinto Fachada, Sarah ; Sancho, Jaime ; Juarez, Eduardo ; Lafruit, Gauthier ; Teratani, Mehrdad Synchronization and Calibration of Video Sequences acquired using Multiple Plenoptic 2.0 Cameras
436 Lim, Xin ; Wong, Lai-Kuan ; Loh, Yuen Peng ; Gu, Ke ; Lin, Weisi Mix-YOLONet: Deep Image Dehazing for Improving Object Detection
438 Lv, Jinyan; Xiao, Guoqiang BiCA-YOLO: Bidirectional Feature Enhancement and Cross Coordinate Attention for Small Object Detection
444 Chen, Zhaoxin; Ma, Bo A Dual-Branch Model for Color Constancy
445 Mu, Wenchuan; Lim, Kwan Hui Data-free Functional Projection of Large Language Models onto Social Media Tagging Domain
447 Yao, Li; Huang, Qianni; Wan, Yan TPS-YOLO: The Efficient Tiny Person Detection Network Based on Improved YOLOv8 and Model Pruning
451 Zhang, Zhihui ; Pang, Jinhui ; Li, Jianan ; Hao, Xiaoshuai ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing
455 Yan, Hao; Bai, Jing MDT-Net: a mask decoder tuning strategy for CLIP-based zero-shot 3D Classification
456 Wang, Tiebiao; Li, Xiaoyang; Cui, Zhenchao AMFT-YOLO: A Adaptive Multi-Scale YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes
458 Wu, Cheng-Yuan; Sun, Yuan-Chun; Lee, Cheng-Tse; Hsu, Cheng-Hsin Optimally Planning Drone Trajectory to Capture a 3D Gaussian Splatting Object
460 Yi, Zepu; Lu, Songfeng; Tang, Xueming; Zhu, Jianxin; Wu, Junjun MICAN: Multi-modal Inconsistency-based Cooperation Attention Network for fake news detection
462 Huang, Zhongzhan; Liang, Mingfu; Liang, Senwei; Zhong, Shanshan Flat Local Minima for Continual learning on Semantic Segmentation

ExpertSUM: Special Session on Expert-Level Text Summarization from Fine-Grained Multimedia Analytics

paperID authors title
137 Fukuzawa, Takumi ; Hara, Kensho ; Kataoka, Hirokatsu ; Tamaki, Toru Can masking background and object reduce static bias for zero-shot action recognition?
355 Tanabe, Hikaru; Yanai, Keiji CalorieVoL: Integrating Volumetric Context into Multimodal Large Language Models for Image-based Calorie Estimation

MLLMA: Special Session on Multimodal Large Language Models and Applications

paperID authors title
193 Huang, Jia-Hong; Zhu, Hongyi; Shen, Yixian; Rudinac, Stevan; Kanoulas, Evangelos Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models
214 Wei, Wei; Zhang, Bingkun; Wang, Yibing TFCST: An Efficient Emotion Recognition Model Based on Deep Speech Analysis and Hierarchical Progressive Structure
215 Wei, Wei; Zhang, Bingkun; Wang, Yibing TS-MEFM: A New Multimodal Speech Emotion Recognition Network Based on Speech and Text Fusion
230 Matsuhira, Chihaya ; Kastner, Marc A. ; Komamizu, Takahiro ; Hirayama, Takatsugu ; Ide, Ichiro Quantifying Image-Adjective Associations by Leveraging Large-Scale Pretrained Models
288 Li, Su ; Wang, Liang ; Wang, Jianye ; Zhang, Ziheng ; Zhang, Junjun ; Zhang, Lei Enhanced Anomaly Detection in 3D Motion through Language-Inspired Occlusion-Aware Modeling
364 C. Quan, Khanh-An ; Guinaudeau, Camille ; Satoh, Shin’ichi Evaluating VQA Models' Consistency in the Scientific Domain

Special Session on Multimedia Research in Robotics

paperID authors title
416 Lim, Jia Yap ; See, John ; Dondrup, Christian Multimodal Engagement Prediction in Human-Robot Interaction using Transformer Neural Networks
431 Yoshihara, Daichi ; Yuguchi, Akishige ; Kawano, Seiya ; Iio, Takamasa ; Yoshino, Koichiro What Should Autonomous Robots Verbalize and What Should They Not?

SpIMA: Special Session on Spatial Intelligence in Multimedia Analytics

paperID authors title
411 Ghasemi, Narges ; Kim, Seon Ho ; Alfarrarjeh, Abdullah ; Shahabi, Cyrus Counting Unique Objects in Geo-Tagged Street Images: A Case Study Of Homeless Encampments in Los Angeles

Special Session on Simulating Edge Computing and Multimodal AI: A Benchmark for Real-World Applications

paperID authors title
484 Le, Duy-Dong ; Huynh, Duy-Thanh ; Bao, Pham The Federated Learning with Multimodal-Sensing and Knowledge Distillation: An application on real-world benchmark dataset
498 Phan, Mai ; Nguyen, Dang Hieu ; Nguyen, Thuc ; Nguyen, Quang Sang An Integrated Multimodal Sensing Model Combining S3D, HBCO, and Transformer Encoder for Efficient Data Processing
499 Vu, Dang ; Dang, Tien ; Nguyen, Quoc-Trung ; Pham, Tan Efficient Deployment of Multimodal AI Models: Leveraging Pruning, Quantization and Multi-Objective Optimization for Edge Computing

Demo Paper

paperID authors title
466 Jheng, Duen-Chian ; Harchan, Bill Louis ; Kostka de Sztemberg, Berenika Nawoja ; Hsu, Jen-Hao ; Hu, Min-Chun Badminton Footwork Practice via an Immersive Virtual Reality System
468 Wattasseril, Jobin Idiculla; Döllner, Jürgen SelectSum: Topic-Based Selective Summarization of Speech-Based Videos
469 Hamanaka, Masatoshi Real-time Visualizer for Turntablist Performance
470 Gan, Wenbin; Dao, Minh-Son; Zettsu, Koji DriveCoach: Smart Driving Assistance with Multimodal Risk Prediction and Risk Adaptive Behavior Recommendation
472 Fernandez Roblero, Jaime Boanerjes ; Ali, Muhammad Intizar System Demo of Modeling Smart University Campus Virtual Environments
473 Mohamed Serouis, Ibrahim; Sèdes, Florence AMDA: Advancing Multimedia Data Annotation for human-centric situations
475 HUNG-YAO, PENG; ZI-HENG, ZHONG; CHENG-CHIH, TSAI; CHING-YEH, CHIANG; TSE-YU, PAN FencBuddy: Action-aware Depth Perception Training for Fencing Attacks
477 Izumi, Kota; Yanai, Keiji WaveFontStyler: Font Style Transfer Based on Sound
479 Korb, Martin; Bailer, Werner Training a Segmentation-based Visual Anonymization Service for Street Scenes
480 Kawanishi, Yasutomo; Nakamura, Yutaka; Shintani, Taiken; Ishi, Carlos T.; Kawano, Seiya; Yoshino, Koichiro; Minato, Takashi; Minoh, Michihiko RoboDJ: Live Commentary Robots System Driven by Physical- and Cyber-world Observations
481 Chiang, Yung-Chu ; Tang, Zi-Xian ; Luo, Yi-Ching ; Chang, Jason S. CleverFox: Integrating Visual Mnemonics with AI for Enhanced Language Learning
482 Iino, Nami ; Iino, Akinaru Fingering Prediction for Classical Guitar: Dataset Creation and Model Development
483 Kitahara, Tetsuro ; Tsutsumi, Takuya ; Nagoshi, Takaaki ; Suzuki, Taizan An Implementation of Networked JamSketch
485 Garcia Contreras, Angel Fernando ; Chang, Wen-Yu ; Kawano, Seiya ; Chen, Yun-Nung ; Yoshino, Koichiro Using Language Models to Generate and Forget the Narrative Memories of an Assistive Robot
486 Borgli, Hanna ; Stensland, Håkon Kvale ; Halvorsen, Pål Better Image Segmentation with Classification: Guiding Zero-Shot Models Using Class Activation Maps
487 Li, Bohan ; Li, Xingyi ; Liang, Yangwen ; Wang, Shuangquan ; Song, Kee-Bong Leveraging Latent Diffusion in 3D Gaussian Splatting for Novel View Synthesis
488 Limberg, Christian ; Zhang, Zhe ; Kastner, Marc A. Transformer-Based Audio Generation Conditioned by 2D Latent Maps: A Demonstration
489 YUAN, HONGHUI; YANAI, KEIJI KuzushijiFontDiff: Diffusion Model for Japanese Kuzushiji Font Generation
490 YUAN, HONGHUI; YANAI, KEIJI SceneTextStyler: Editing Text with Style Transformation
492 Lynch, Kelley ; Rim, Kyeongmin ; King, Owen ; Pustejovsky, James Multimodal Interoperability with the CLAMS Platform
493 Kontostathis, Ioannis; Apostolidis, Evlampios; Apostolidis, Konstantinos; Mezaris, Vasileios Enhancing User Control in AI-Based Video Summarization for Social Media
494 Khan, Omar Shahbaz ; Duane, Aaron ; Hasnan, Hariz ; Blavec, Noé Le ; Ouvrard, Pierre ; Verdon, Johan ; d’Orazio, Laurent ; Thierry, Constance ; Jónsson, Björn Þór Multi-Dimensional Exploration of Media Collection Metadata
496 Huang, Wei-Lun ; Hidayati, Shintami Chusnul ; Pan, Tse-Yu Movie Retrieval Systems Using Genre-guided Multimodal Learning Techniques
497 Kongmeesub, Onanong; Gurrin, Cathal; Nie, Dongyun A User Identification and Reading Style Detection System Based on Eye Movement Patterns During Reading

VBS: Video Browser Showdown

paperID authors title
406 Nguyen-Ho, Thang-Long; Huynh, Viet-Tham; Kongmeesub, Onanong; Tran, Minh-Triet; Nie, Dongyun; Healy, Graham; Gurrin, Cathal VEAGLE: Eye Gaze-Assisted Guidance for Video Browser Showdown
501 Tran, Quang-Linh; Nguyen, Binh; Jones, Gareth J. F.; Gurrin, Cathal VideoEase at VBS2025: An Interactive Video Retrieval System
502 Rossetto, Luca; Gasser, Ralph Feature-driven Video Segmentation and Advanced Querying with vitrivr-engine
503 Nguyen, Tai; Vo, Anh Ngoc Minh; Pham, Dat Duc; Tran, Vinh Quang; Duong, Nhu Thi Quynh; Le, Tien Anh; Le, Tan Duy; Nguyen, Binh T. HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025
504 CHENG, Yu Tong; WU, Jiaxin; MA, Zhixin; HE, Jiangshan; WEI, Xiao-Yong; NGO, Chong Wah Interactive Video Search with Multi-modal LLM Video Captioning
505 Le, Huy M.; Nguyen Tien, Dat; Le Duy, Khang; Nguyen Dang Quang, Tuan; Nguyen Khanh, Toan; Nguyen, Binh T. FUSIONISTA: Fusion of 3-D Information of Video in Retrieval System
506 C. Quan, Khanh-An; Ngoc Nguyen, Qui; Tran, Minh-Triet ViFi: A Video Finding System at Video Browser Showdown 2025
507 Vuong, Gia-Huy; Ho, Van-Son; Nguyen-Dang, Tien-Thanh; Thai, Xuan-Dang; Ho-Le, Minh-Quan; Le, Tu-Khiem; Pham, Minh-Khoi; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet ViewsInsight2.0: Enhancing Video Retrieval for VBS 2025 with an Automatic Query Generator Powered by Large Language Models
508 Pantelidis, Nick; Georgalis, Dimitris; Pegia, Maria; Galanopoulos, Damianos; Apostolidis, Konstantinos; Stavrothanasopoulos, Klearchos; Moumtzidou, Anastasia; Gkountakos, Konstantinos; Gialampoukidis, Ilias; Vrochidis, Stefanos; Mezaris, Vasileios; Kompatsiaris, Ioannis VERGE in VBS 2025
509 Sharma, Ujjwal; Khan, Omar Shahbaz; Rudinac, Stevan; Jónsson, Björn Þór Exquisitor at the Video Browser Showdown 2025: Unifying Conversational Search and User Relevance Feedback
510 Spiess, Florian; Rossetto, Luca; Schuldt, Heiko Simplified Video Retrieval in Virtual Reality with vitrivr-VR
511 Leopold, Mario; Schöffmann, Klaus diveXplore at the Video Browser Showdown 2025
512 Tran Gia, Bao; Bui Cong Khanh, Tuong; Le Thi Thanh, Tam; Tran Doan, Thuyen; Le Tran Trong, Khiem; Do, Tien; Mai, Tien-Dung; Duc Ngo, Thanh; Le, Duy-Dinh; Satoh, Shin’ichi NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search
513 Stroh, Michael; Kloda, Vojtěch; Verner, Benjamin; Vopálková, Zuzana; Buchmüller, Raphael; Jäckl, Bastian; Lokoč, Jakub; Hajko, Jakob PraK Tool V3: Enhancing Video Item Search Using Localized Text and Texture Queries
514 Arnold, Rahel; Kempf, Rahel; Waltenspül, Raphael; Schuldt, Heiko MediaMix: Multimedia Retrieval in Mixed Reality
515 Ho-Le, Minh-Quan; Ho, Duy-Khang; Do-Huu, Huy-Hoang; Le-Hinh, Nhut-Thanh; Vo-Hoang, Hoa-Vien; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet SnapSeek 2.0 at Video Browser Showdown 2025
517 Luu, Duc-Tuan; C. Quan, Khanh-An; Nguyen, Duy-Ngoc; Bui-Le, Khanh-Linh; Doan, Nhat-Sang; Le-Ngo, Minh-Duc; Nguyen, Vinh-Tiep; Tran, Minh-Triet IMSearch 2.0: Toward User-centric and Efficient Interactive Multimedia Retrieval System