Awards

Best Paper Award
presented to
Wenhui Tan, Bei Liu, Junbo Zhang, Ruihua Song, and Jianlong Fu
RoLD: Robot Latent Diffusion for Multi-task Policy Modeling
Best Student Paper Award
presented to
Yizhou Li, Zihua Liu, Yusuke Monno, and Masatoshi Okutomi
TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration
Best Demonstration Award
presented to
Hung-Yao Peng, Zi-Heng Zhong, Cheng-Chih Tsai, Ching-Yeh Chiang, and Tse-Yu Pan
FencBuddy: Action-aware Depth Perception Training for Fencing Attacks
Best Demo Honorable Mention Award
presented to
Yasutomo Kawanishi, Yutaka Nakamura, Taiken Shintani, Carlos T. Ishi, Seiya Kawano, Koichiro Yoshino, Takashi Minato, and Michihiko Minoh
RoboDJ: Live Commentary Robots System Driven by Physical- and Cyber-world Observations
MMM Community Leadership Award
awarded to
Noboru Babaguchi
MMM Community Leadership Award
awarded to
Kiyoharu Aizawa
Best Overall VBS System
presented to
Bao Tran Gia, Tuong Bui Cong Khanh, Tam Le Thi Thanh, Thuyen Tran Doan, Khiem Le, Tien Do, Tien-Dung Mai, Thanh Duc Ngo, Duy-Dinh Le, and Shin’ichi Satoh
NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search
Best Expert VBS System
presented to
Bao Tran Gia, Tuong Bui Cong Khanh, Tam Le Thi Thanh, Thuyen Tran Doan, Khiem Le, Tien Do, Tien-Dung Mai, Thanh Duc Ngo, Duy-Dinh Le, and Shin’ichi Satoh
NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search
Best Novice VBS System
presented to
Thang-Long Nguyen-Ho, Viet-Tham Huynh, Onanong Kongmeesub, Minh-Triet Tran, Dongyun Nie, Graham Healy, and Cathal Gurrin
VEAGLE: Eye Gaze-Assisted Guidance for Video Browser Showdown

Program Booklet

Keynote Talks

Schedule: Jan 8, 9:45 – 10:45. Chair: Chong-Wah NGO

Multimodal, Multilingual Generative AI: From Multicultural Contextualization to Empathetic Reasoning
Dr. Nancy F. Chen

We will present MeraLion (Multimodal Empathetic Reasoning and Learning In One Network), our generative AI effort within Singapore’s National Multimodal Large Language Model Programme. Speech and audio provide a more comprehensive understanding of spatial and temporal reasoning, as well as of social dynamics, than semantics derived from text alone. Cultural nuances and multilingual peculiarities add another layer of complexity to understanding human interactions. In addition, we will draw on use cases in education to highlight research endeavors, technology deployment experience, and application opportunities.
Biography: Dr. Nancy F. Chen is an A*STAR fellow who leads the Multimodal Generative AI group, heads the Artificial Intelligence for Education (AI4EDU) programme at I2R (Institute for Infocomm Research), and is a principal investigator at CFAR (Centre for Frontier AI Research), A*STAR. Dr. Chen’s recent work on large language models has won honors at ACL 2024, including an Area Chair Award and the Best Paper Award for Cross-Cultural Considerations in Natural Language Processing. Dr. Chen consistently garners best paper awards for her AI research across diverse applications; examples include IEEE ICASSP 2011 (forensics), APSIPA 2016 (education), SIGDIAL 2021 (social media), MICCAI 2021 (neuroscience), and EMNLP 2023 (healthcare). Multilingual spoken-language technology from her team has led to commercial spin-offs and has been deployed at Singapore’s Ministry of Education to support home-based learning. Dr. Chen has supervised 100+ students and staff. She has won professional awards from the U.S. National Institutes of Health, IEEE, Microsoft, P&G, UNESCO, and L’Oréal. She serves as Program Chair of NeurIPS 2025 and as a member of the APSIPA Board of Governors (2024-2026), previously served as an IEEE SPS Distinguished Lecturer (2023-2024), Program Chair of ICLR 2023, and Board Member of ISCA (2021-2024), and was honoured in the Singapore 100 Women in Tech list (2021). Prior to A*STAR, she worked at MIT Lincoln Laboratory while pursuing her PhD at MIT and Harvard. For more info: http://alum.mit.edu/www/nancychen.
Schedule: Jan 9, 16:00 – 17:00. Chair: Keiji YANAI

Manga109 and MangaUB: How Far Can Large Multimodal Models (LMMs) Go in Understanding Manga?
Prof. Kiyoharu Aizawa

Manga is a form of Japanese content that has gained global recognition, and it is a unique multimedia format that combines images and text. We created a dataset called Manga109, composed of 109 manga comic books. In 2015, we released a version containing approximately 20,000 manga pages, and in 2018, we published an extended version with annotations for more than 500,000 objects, including characters and speech balloons on each page. It is the largest manga dataset in the world with such detailed manual annotations. Manga109 is available for academic use, and we have distributed over 2,000 copies of the dataset to date. Various research efforts, both domestic and international, have built on this dataset; for example, different groups have tackled tasks such as character recognition, expression recognition, dialogue recognition, speaker identification, and onomatopoeia recognition. In this talk, I will trace the journey of Manga109 from its beginning to the present, and introduce MangaUB, a benchmark for the rapidly advancing large multimodal models (LMMs), to assess the current state of LMMs’ manga comprehension.
Biography: Prof. Kiyoharu Aizawa received the B.E., M.E., and Dr. Eng. degrees in Electrical Engineering from the University of Tokyo in 1983, 1985, and 1988, respectively. He is a professor in the Department of Information and Communication Engineering and Director of the VR Center at the University of Tokyo. He was a visiting assistant professor at the University of Illinois from 1990 to 1992. His research fields are multimedia, image processing, and computer vision, with a particular interest in interdisciplinary and cross-disciplinary issues. He received the 1990 and 1998 Best Paper Awards, the 1991 Achievement Award, and the 1999 Electronics Society Award from IEICE Japan; the 1998 Fujio Frontier Award, the 2002 and 2009 Best Paper Awards, and the 2013 and 2021 Achievement Awards from ITE Japan; and the IBM Japan Science Prize in 2002. He is on the Editorial Board of ACM TOMM. He served as Editor-in-Chief of the Journal of ITE Japan and as an Associate Editor of IEEE TIP, TCSVT, TMM, and MultiMedia. He has also played key roles in numerous international and domestic conferences, serving as General Co-Chair of MMM 2008, ACM Multimedia 2012, and ACM ICMR 2018. He is a Fellow of IEEE, IEICE, and ITE, and a member of the Science Council of Japan.
Schedule: Jan 10, 9:15 – 10:15. Chair: Ichiro IDE

Multi-modal foundation models in the automotive industry
Dr. Andrei Bursuc

The tremendous progress of deep-learning-based approaches to image understanding has inspired new advanced perception functionalities for autonomous systems. However, real-world perception systems often require models that can learn from large volumes of unlabeled and uncurated data with few labeled samples, which are usually costly to select and annotate. In contrast, typical supervised methods require extensive collections of carefully selected labeled data, a condition that is seldom fulfilled in practical applications. Self-supervised learning (SSL) arises as a promising line of research to mitigate this gap by training foundation models using various supervision signals extracted from the data itself, without any human-generated labels. While most popular SSL methods revolve around web image datasets, new and diverse forms of self-supervision are starting to be investigated for autonomous driving (AD). AD represents a unique sandbox for SSL methods, as it offers some of the largest public data collections in the community, with different paired sensors (multiple cameras, Lidar, radar, ultrasonics), and provides some of the most challenging computer vision tasks: object detection, depth estimation, image-based odometry and localization, etc. Here, the canonical SSL pipeline (i.e., self-supervised pre-training of a model followed by fine-tuning on a downstream task) is revisited and extended to entirely new SSL approaches for computer vision and robotics (e.g., world models), as well as to new downstream usages of pre-trained foundation models, such as cross-sensor distillation, auto-labelling, data mining, and architecture re-purposing. This talk will provide a tour of different forms of foundation models across the multiple sensor types equipping today's and tomorrow's vehicles, in a quest towards annotation-efficient and reliable perception systems.
Biography: Dr. Andrei Bursuc is a senior research scientist and deputy scientific director at valeo.ai, and a research associate in the Astra Inria project team in Paris, working on perception for assisted and autonomous driving. His research interests concern the reliability of deep neural networks, learning with limited supervision, and multi-modal multi-sensor perception. Andrei also teaches at Ecole Polytechnique and at Ecole Normale Supérieure in Paris. Previously, he was a research scientist at Safran Tech in the aerospace industry. Prior to that, he was a postdoctoral researcher at Inria Paris, within the Willow project team working with Josef Sivic and Ivan Laptev, and at Inria Rennes with Hervé Jégou. He did his PhD at Ecole des Mines Paris and Alcatel-Lucent Bell Labs France with Francoise Preteux and Titus Zaharia, on visual content indexing and retrieval. Andrei is a member of the ELLIS society and is regularly part of the technical program committees of CVPR, ICCV, ECCV, and NeurIPS. He co-organized the CVPR’20-’21 and ECCV’22 tutorials on self-supervised learning, and the ICCV’23 and ECCV’24 tutorials on reliability and uncertainty estimation.

Oral Sessions

Day 1: 8 January

11:00 – 12:00
Best Paper Session
Chair: Toshihiko Yamasaki (The University of Tokyo)
Paper ID Paper Title Authors
196 RoLD: Robot Latent Diffusion for Multi-task Policy Modeling Tan, Wenhui; Liu, Bei; Zhang, Junbo; Song, Ruihua; Fu, Jianlong
379 TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration Li, Yizhou; Liu, Zihua; Monno, Yusuke; Okutomi, Masatoshi
451 ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing Zhang, Zhihui; Pang, Jinhui; Li, Jianan; Hao, Xiaoshuai
462 Flat Local Minima for Continual learning on Semantic Segmentation Huang, Zhongzhan; Liang, Mingfu; Liang, Senwei; Zhong, Shanshan
15:30 – 16:30
Oral Session 1: Content Generation
Chair: Luwei Zhang (The University of Tokyo)
Paper ID Paper Title Authors
268 AD2AT: Audio Description to Alternative Text, a Dataset of Alternative Text from Movies Lincker, Elise; Guinaudeau, Camille; Satoh, Shin’ichi
310 KuzushijiDiffuser: Japanese Kuzushiji Font Generation with FontDiffuser Yuan, Honghui; Yanai, Keiji
167 Saliency Guided Optimization Of Diffusion Latents Wang, Xiwen; Zhou, Jizhe; Li, Mao; Zhu, Xuekang; Li, Cheng
308 Skin-Adapter: Fine-Grained Skin-Color Preservation for Text-to-Image Generation Chen, Zhuowei; Huang, Mengqi; Chen, Nan; Mao, Zhendong
16:45 – 17:45
Oral Session 2: Audio Analysis
Chair: Ling Xiao (The University of Tokyo)
Paper ID Paper Title Authors
273 Operatic Singing Voice Synthesis From Inexperienced Voice Considering Tempo and Vowel Change Sugahara, Aoto; Kishimoto, Soma; Adachi, Yuji; Tai, Kiyoto; Takashima, Ryoichi; Takiguchi, Tetsuya
129 Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation Lv, Yishan; Luo, Jing; Ju, Boyuan; Yang, Xinyu
430 WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition Li, Feng; Luo, Jiusong; Xia, Wanjun
374 SPLGAN-TTS: Learning Semantic and Prosody to Enhance the Text-to-Speech Quality of Lightweight GAN Models Chang, Ding-Chi; Li, Shiou-Chi; Huang, Jen-Wei

Day 2: 9 January

9:30 – 10:30
Oral Session 3: Object Detection, Recognition, and Tracking
Chair: Wei-Ta Chu (National Cheng Kung University)
Paper ID Paper Title Authors
236 MineTinyNet-YOLO: An Efficient Small Object Detection Method for Complex Underground Coal Mine Scenarios Yaling, Hao; Wei, Wu
436 Mix-YOLONet: Deep Image Dehazing for Improving Object Detection Lim, Xin; Wong, Lai-Kuan; Loh, Yuen Peng; Gu, Ke; Lin, Weisi
411 Counting Unique Objects in Geo-Tagged Street Images: A Case Study Of Homeless Encampments in Los Angeles Ghasemi, Narges; Kim, Seon Ho; Alfarrarjeh, Abdullah; Shahabi, Cyrus
181 HCV: Lightweight Hybrid CNN-Vision Transformer for Visual Object Tracking Chen, Liang-Chia; Chu, Wei-Ta
10:45 – 11:30
Oral Session 4: Trusted and Explainable AI
Chair: Kazuaki Nakamura (Tokyo University of Science)
Paper ID Paper Title Authors
174 Detoxification of Unlabeled Dataset: Reducing Implicit Class Imbalance Using Pseudo-Jacobian of GAN’s Generator Suyama, Kosei; Nakamura, Kazuaki
244 Making Strides in Security in Multimodal Fake News Detection Models: A Comprehensive Analysis of Adversarial Attacks Si, Jiahua; Wang, Youze; Hu, Wenbo; Liu, Qiang; Hong, Richang
415 AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection Xu, Xiaoman; Li, Xiangrun; Wang, Taihang; Jiang, Ye
15:00 – 15:45
Oral Session 5: Signal Processing
Chair: Masahiro Toyoura (University of Yamanashi)
Paper ID Paper Title Authors
297 Uncertainty-guided Joint Semi-supervised Segmentation and Registration of Cardiac Images Chen, Junjian; Yang, Xuan
337 Wavelet Integrated Convolutional Neural Network for ECG Signal Denoising Terada, Takamasa; Toyoura, Masahiro
392 MPPQNet: A Moment-Preserving Product Quantization Neural Network for Progressive 3D Point Cloud Transmission Cheng, Shyi-Chyi; Chen, Yen-Lin; Li, Shih-Yu

Day 3: 10 January

10:30 – 11:30
Oral Session 6: Recognition and Reasoning
Chair: Satoshi Yamasaki (NEC)
Paper ID Paper Title Authors
218 A Multi-Expert Collaborative Framework for Multimodal Named Entity Recognition Xu, Bo; Jiang, Haiqi; Wei, Shouang; Du, Ming; Song, Hui; Wang, Hongya
266 SSDL: Sensor-to-Skeleton Diffusion Model with Lipschitz Regularization for Human Activity Recognition Sharma, Nikhil; Sun, Changchang; Zhao, Zhenghao; Ngu, Anne Hee Hiong; Latapie, Hugo; Yan, Yan
395 Open-vocabulary Scene Graph Generation via Synonym-based Predicate Descriptor Goto, Yuta; Yamazaki, Satoshi; Shibata, Takashi; Liu, Jianquan
274 Grounding Deliberate Reasoning in Multimodal Large Language Models Chen, Jiaxing; Liu, Yuxuan; Li, Dehu; An, Xiang; Deng, Weimo; Feng, Ziyong; Zhao, Yongle; Xie, Yin
15:00 – 16:00
Special Session: MLLMA
Chair: Rajiv Ratn Shah (IIIT-Delhi)
Paper ID Paper Title Authors
193 Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models Huang, Jia-Hong; Zhu, Hongyi; Shen, Yixian; Rudinac, Stevan; Kanoulas, Evangelos
288 Enhanced Anomaly Detection in 3D Motion through Language-Inspired Occlusion-Aware Modeling Li, Su; Wang, Liang; Wang, Jianye; Zhang, Ziheng; Zhang, Junjun; Zhang, Lei
364 Evaluating VQA Models' Consistency in the Scientific Domain C. Quan, Khanh-An; Guinaudeau, Camille; Satoh, Shin’ichi
Panel Discussion
16:15 – 17:00
Oral Session 7: Search and Retrieval
Chair: Nicolas Michel (The University of Tokyo)
Paper ID Paper Title Authors
346 RobSparse: Automatic Search for GPU-Friendly Robust and Sparse Vision Transformers Su, Yulan; Zhang, Sisi; Wang, Yan; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui
232 Image-Generation AI Model Retrieval by Contrastive Learning-based Style Distance Calculation Vu, Thi Ngoc Anh; Shoji, Yoshiyuki; Oe, Yuma; Pham, Huu Long; Ohshima, Hiroaki
414 Dynamic Exploration Graph: A Novel Approach for Efficient Nearest Neighbor Search in Evolving Multimedia Datasets Hezel, Nico; Barthel, Kai Uwe; Schilling, Bruno; Schall, Konstantin; Jung, Klaus

Poster Sessions

To Presenters

  • Please set up your poster after 1:00 PM and before your poster session starts.

Day 1: 8 January 14:00 – 15:30

Poster ID Paper ID Paper Title Authors
PS1-1 120 Quantized-ViT Efficient Training via Fisher Matrix Regularization Shang, Yuzhang; Liu, Gaowen; Kompella, Ramana; Yan, Yan
PS1-2 121 Saliency based data augmentation for few-shot video action recognition Kong, Yongqiang; Wang, Yunhong; Li, Annan
PS1-3 128 Hybrid Scalable Video Coding with Neural Compression and Enhancement for Streaming Media Ye, Yuyao; Yang, Jiayu; Zhao, Yang; Gao, Mengping; Cao, Hongbin; Wang, Ronggang
PS1-4 130 Pubic Symphysis-Fetal Head Segmentation Network Using BiFormer Attention Mechanism and Multipath Dilated Convolution Cai, Pengzhou; Jiang, Lu; Li, Yanxin; Liu, Xiaojuan; Lan, Libin
PS1-5 131 DART: Depth-Enhanced Accurate and Real-Time Background Matting Li, Guofeng; Li, Hanxi; Li, Bo; Wu, Lin; Cheng, Yan
PS1-6 141 MLP-AMDC: A MLP Architecture for Adaptive-Mask-based Dual-Camera snapshot hyperspectral imaging Cai, Zeyu; Chen, Xunhao; Zhang, Can; Chen, Yuchong; Yang, Jiming; Shi, Wubin; Jin, Chengqian; Da, Feipeng
PS1-7 144 Kiite World: Socializing Map-Based Music Exploration Through Playlist Sharing and Synchronized Listening Tsukuda, Kosetsu; Takahashi, Takumi; Ishida, Keisuke; Hamasaki, Masahiro; Goto, Masataka
PS1-8 146 Enhancing Environmental Monitoring through Multispectral Imaging: The WasteMS Dataset for Semantic Segmentation of Lakeside Waste Zhu, Qinfeng; Weng, Ningxin; Fan, Lei; Cai, Yuanzhi
PS1-9 158 Frequency-aware Convolution for Sound Event Detection Song, Tao; Zhang, Wenwen
PS1-10 163 MSD-YOLO: An efficient algorithm for small target detection Liu, Dongyu; Zhu, Yuan; Liu, Rui; Xing, Zhecong; Geng, Weiyang; Wang, Yanqiang
PS1-11 166 Robust Active Speaker Detection in Challenging Environments Using GNN-Fused Multi-Modal Cues and Body Language Li, Yongqian; Luo, Yong; Zhou, Xin
PS1-12 172 Intra-Class Compact Facial Expression Recognition Based on Amplitude Phase Separation Tian, Xiang; Zhang, Yuan; Mu, Chang; Zhang, Ziyang
PS1-13 176 PA2Net: Pyramid Attention Aggregation Network for Saliency detection Yu, Jizhe; Liu, Yu; Wu, Xiaoshuai; Xu, Kaiping; Li, Jiangquan
PS1-14 188 LIESA: Low-light Image Enhancement with Semantic Awareness Zhang, Jingyao; Hao, Shijie; Sun, Fuming; Rao, Yuan
PS1-15 195 Deep Dual Internal Learning for Hyperspectral Image Super-Resolution Sun, Yongqing; Liu, Hong; Chang, Qiong; Han, Xianhua
PS1-16 198 Zero-shot sketch-based image retrieval with hybrid information fusion and sample relationship modeling Wu, Weijie; Li, Jun; Wu, Zhijian; Xu, Jianhua
PS1-17 206 The Right to an Explanation under the GDPR and the AI Act Juliussen, Bjørn Aslak
PS1-18 221 Improving singing voice transcription generalization with AI generated accompaniments Perez, Miguel; Kirchhoff, Holger; Grosche, Peter; Serra, Xavier
PS1-19 228 LITA: LMM-guided Image-Text Alignment for Art Assessment Sunada, Tatsumi; Shiohara, Kaede; Xiao, Ling; Yamasaki, Toshihiko
PS1-20 229 Towards Inclusive Education: Multimodal Classification of Textbook Images for Accessibility Yadav, Saumya; Lincker, Élise; Huron, Caroline; Martin, Stéphanie; Guinaudeau, Camille; Satoh, Shin’ichi; Shukla, Jainendra
PS1-21 296 GWUNet: A UNet with Gated Attention and Improved Wavelet Transform for Thyroid Nodules Segmentation Zheng, Shuijing; Yu, Suxi; Wang, Yi; Wen, Jing
PS1-22 111 SCLSTE: Semi-Supervised Contrastive Learning-Guided Scene Text Editing Yin, Min; Xie, Liang; Liang, HaoRan; Zhao, Xing; Chen, Ben; Liang, RongHua

Day 2: 9 January 13:30 – 15:00

Poster ID Paper ID Paper Title Authors
PS2-1 192 Comparative Analysis of Relevance Feedback Techniques for Image Retrieval Vadicamo, Lucia; Scotti, Francesca; Dearle, Alan; Connor, Richard
PS2-2 241 Understanding the Roles of Visual Modality in Multimodal Dialogue: An Empirical Study Cao, Qian; Song, Ruihua; Chen, Xu
PS2-3 242 DistillSleep: Leverage Self-Distillation to Improve Performance After Representation Learning for Sleep Staging Yu, Le; Zhang, Xianchao; Qian, Shuxia; Sun, Hong
PS2-4 246 Temporal Closeness for Enhanced Cross-Modal Retrieval of Sensor and Image Data Yamamoto, Shuhei; Kando, Noriko
PS2-5 247 An Analytical Method for Rendering Plenoptic Cameras 2.0 on 3D Multi-Layer Displays Losfeld, Armand; Seznec, Nicolas; Van Bogaert, Laurie; Lafruit, Gauthier; Teratani, Mehrdad
PS2-6 251 QRALadder: QoE and Resource Consumption-Aware Encoding Ladder Optimization for Live Video Streaming Zhu, Yingqian; Gao, Guanyu
PS2-7 256 Boosting Human Pose Estimation via Heatmap Refinement Jiang, Ling; Liu, Zhuocheng; Li, Kaige; Wu, Wei
PS2-8 265 FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation Imajuku, Yuki; Yamakata, Yoko; Aizawa, Kiyoharu
PS2-9 283 LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets Wang, Qing; Ngo, Chong Wah; Lim, Ee-Peng; Sun, Qianru
PS2-10 292 Music2MIDI: Pop Music to MIDI Piano Cover Generation Yip, Tin Yui; Chau, Chuck-jee
PS2-11 293 Balancing Efficiency and Accuracy: An Analysis of Sampling for Video Copy Detection Chen, Xiangyu; Satoh, Shinichi
PS2-12 295 One-Shot Generative Domain Adaptation by Constructing Self-Amplifying Datasets Xiang, Yanru; Li, Yi
PS2-13 306 Visual Anomaly Detection on Topological Connectivity under Improved YOLOv8 Li, Yu; Xie, Zhenping
PS2-14 315 HierArtEx: Hierarchical Representations and Art Experts Supporting the Retrieval of Museums in the Metaverse Falcon, Alex; Abdari, Ali; Serra, Giuseppe
PS2-15 317 DocMamba: Robust Document Image Dewarping via Selective State Space Sequence Modeling Han, Miaolin; Li, Huibin
PS2-16 326 Real-Time Action Detection in Volleyball Matches Using DETR Architecture Shih, Mu-Jan; Hsu, Yi-Yu
PS2-17 332 Select and Order: Enhancing Few-Shot Image Classification through In-Context Learning Huang, Hujiang; Xie, Yu; Gao, Jun; Fan, Chuanliu; Cao, Ziqiang
PS2-18 336 SMG-Diff: Adversarial Attack Method Based on Semantic Mask-Guided Diffusion Zhang, Yongliang; Liu, Jing
PS2-19 344 Dual-Task Feedback Learning for Tongue Detection via Super-Resolution Integration Sun, Ying; Wei, Meiyi; Chen, Gang
PS2-20 354 Towards Visual Storytelling by Understanding Narrative Context through Scene-Graphs Phueaksri, Itthisak; Kastner, Marc A.; Kawanishi, Yasutomo; Komamizu, Takahiro; Ide, Ichiro
PS2-21 456 AMFT-YOLO: An Adaptive Multi-Scale YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes Wang, Tiebiao; Li, Xiaoyang; Cui, Zhenchao
PS2-22 276 Lightweight Dual Grouped Large-Kernel Convolutions for Salient Object Detection Network Liu, Jiajie; Zhang, Zhibin
PS2-23 312 Modeling High-order Relationships between Human and Video for Emotion Recognition Ai, Hanxu; Tao, Xiaomei; Li, Xingbing; Gan, Yanling
DP 117 EIA: Edge-aware Imperceptible Adversarial Attacks on 3D Point Clouds Wang, Zhensu; Peng, Weilong; Wang, Le; Wu, Zhizhe; Zhu, Peican; Tang, Keke
DP 127 MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms Zhang, Jiahao; Gao, Guangyu; Zhao, Xiao
DP 140 Infrared Small Target Detection with Feature Refinement and Context Enhancement Li, Xiuhong; Zhu, Xinyue; Li, Boyuan; Li, Songlin; Wang, Luyao; Jia, Zhenhong
DP 173 Modality-Specific Hashing: Transform Cross-Modal Retrieval into Single-Modal Retrieval Ding, Guohui; Li, Zhonghua; Ren, Yongqiang
DP 178 Multimodal Prompt Learning for Audio Visual Scene-aware Dialog Xu, Feifei; Jia, Fumiaoyue; Zhou, Wang
DP 182 MSA-Former: Multi-Scale Adaptive Transformer for Image Snow Removal Wang, Bin; Chen, Zekun; Zhang, Lei; Liang, Shili; Guo, Sijia; Kang, Xinyu; Li, Huajing
DP 184 SES-Net: Multi-dimensional Spot-Edge-Surface Network for Nuclei Segmentation Lu, Congjian; Zhou, Shuwang; Shan, Ke; Zhang, Hongkuan; Liu, Zhaoyang
DP 189 PianoPal: A Robotic Multimedia System for Interactive Piano Instruction Based on Q-learning and Real-time Feedback Wang, Yufei; Yao, Junfeng; Wang, Zefeng
DP 199 CLIP Multi-modal Hashing for Multimedia Retrieval Zhu, Jian; Sheng, Mingkai; Huang, Zhangmin; Chang, Jingfei; Long, Jian; Jiang, Jinling; Liu, Lei; Luo, Cheng
DP 223 Integrating S1&S2 Framework for Enhanced Semantic Match in Person Re-identification Yang, Xiukang; Ge, Jingguo; Li, Hui; Li, Liangxiong; Wu, Bingzhen
DP 237 Hyper-NeuS: Hypernetworks for Neural SDF Implicit Surface Reconstruction by Volume Rendering Li, Jingkun; Qi, Na; Zhu, Qing
DP 253 Structural Information-guided Fine-grained Texture Image Inpainting Fang, Zhiyi; Qian, Yi; Dai, Xiyue
DP 272 GFA-UDIS: Global-to-Flow Alignment for Unsupervised Deep Image Stitching Han, Sijia; Zhang, Zhibin
DP 275 Joint Decision Network with Modality-Specific and Dual Interactive Features for Fake News Detection Wu, Fei; Zhou, Ruixuan; Ji, Yimu; Jing, Xiao-Yuan
DP 277 MS-SAM: Multi-Scale SAM based on Dynamic Weighted Agent Attention Yang, Enhui; Zhang, Zhibin
DP 281 Multi-Modal Information Multi-Angle Mining For Multimedia Recommendation Zhu, Yijie; Li, MingYong
DP 305 MambaTalk: Speech-driven 3D Facial Animation with Mamba Zhu, Deli; Xu, Zhao; Yang, Yunong

Day 3: 10 January 13:30 – 15:00

Poster ID Paper ID Paper Title Authors
PS3-1 356 Rotation Methods for 360-degree Videos in Virtual Reality - A Comparative Study Hürst, Wolfgang; Zeches, Leo
PS3-2 360 Camouflaged Object Detection Based on Localization Guidance and Multi-Scale Refinement Wang, JinYang; Wu, Wei
PS3-3 362 Poseidon: A NAS-Based Ensemble Defense Method against Multiple Perturbations Su, Yulan; Zhang, Sisi; Lin, Zechao; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui
PS3-4 363 MM-CARP: Multimodal Model with Cross-modal retrieval-Augmented and visual Region Perception Guo, Junhao; Fu, Chenhan; Wang, Guoming; Lu, Rongxing; Chen, Dong; Tang, Siliang
PS3-5 365 Revisit Data Association in Semantic SLAM Systems for Autonomous Parking Shao, Xuan; Huang, Leming; Liu, Xinghua
PS3-6 368 Lightweight Motion-Aware Video Super-Resolution for Compressed Videos Kwon, Ilhwan; Li, Jun; Shah, Rajiv Ratn; Prasad, Mukesh
PS3-7 373 Vision-Language Pretraining for Variable-shot Image Classification Papadopoulos, Sotirios; Ioannidis, Konstantinos; Vrochidis, Stefanos; Kompatsiaris, Ioannis; Patras, Ioannis
PS3-8 377 A Multi-Aspect Multi-Granularity Pronunciation Assessment Method Based on Branchformer Encoder and Hierarchical Aggregation Du, Wenxu; Wumaier, Aishan; Shi, Yahui; Yi, Nian; Liu, Dehua
PS3-9 386 SCANet: Semantic Coherence Attention Network for Clothing Change Person Re-identification Yang, Dajiang; Wu, Wei; Lee, Yuxing
PS3-10 417 Toward A Full Pipeline Approach to Autonomous Drone Landing Site Identification: From Terrain Survey to Embedded Classifier Springer, Joshua David; Guðmundsson, Gylfi Þór; Kyas, Marcel
PS3-11 429 Innovative Lifelog Visualization and Exploration in Virtual Reality - A Comparative Study Hürst, Wolfgang; Visser, Yannick
PS3-12 435 Synchronization and Calibration of Video Sequences acquired using Multiple Plenoptic 2.0 Cameras Bonatto, Daniele; Fernandes Pinto Fachada, Sarah; Sancho, Jaime; Juarez, Eduardo; Lafruit, Gauthier; Teratani, Mehrdad
PS3-13 444 A Dual-Branch Model for Color Constancy Chen, Zhaoxin; Ma, Bo
PS3-14 445 Data-free Functional Projection of Large Language Models onto Social Media Tagging Domain Mu, Wenchuan; Lim, Kwan Hui
PS3-15 455 MDT-Net: a mask decoder tuning strategy for CLIP-based zero-shot 3D Classification Yan, Hao; Bai, Jing
PS3-16 458 Optimally Planning Drone Trajectory to Capture a 3D Gaussian Splatting Object Wu, Cheng-Yuan; Sun, Yuan-Chun; Lee, Cheng-Tse; Hsu, Cheng-Hsin
PS3-17 230 Quantifying Image-Adjective Associations by Leveraging Large-Scale Pretrained Models Matsuhira, Chihaya; Kastner, Marc A.; Komamizu, Takahiro; Hirayama, Takatsugu; Ide, Ichiro
PS3-18 137 Can masking background and object reduce static bias for zero-shot action recognition? Fukuzawa, Takumi; Hara, Kensho; Kataoka, Hirokatsu; Tamaki, Toru
PS3-19 355 CalorieVoL: Integrating Volumetric Context into Multimodal Large Language Models for Image-based Calorie Estimation Tanabe, Hikaru; Yanai, Keiji
PS3-20 416 Multimodal Engagement Prediction in Human-Robot Interaction using Transformer Neural Networks Lim, Jia Yap; See, John; Dondrup, Christian
PS3-21 431 What Should Autonomous Robots Verbalize and What Should They Not? Yoshihara, Daichi; Yuguchi, Akishige; Kawano, Seiya; Iio, Takamasa; Yoshino, Koichiro
PS3-22 438 BiCA-YOLO: Bidirectional Feature Enhancement and Cross Coordinate Attention for Small Object Detection Lv, Jinyan; Xiao, Guoqiang
DP 307 Frequency-Based Unsupervised Low-Light Image Enhancement Framework Wang, Haodian
DP 309 Target-Oriented Dynamic Denoising Curriculum Learning for Multimodal Stance Detection Suo, Zihao; Pan, Shanliang
DP 316 Noise-robust Separating Multi-source Aliased Vibration Signal Based on Transformer Demucs Jiang, Wanchang; Jiang, Yuxin
DP 321 gFlow: Distributed Real-Time Reverse Remote Rendering System Model Xu, Yixiao; Li, Yubo; Xu, Wanzhao; Gu, Yicheng; Wang, Yun; Ma, Jiangyuan; Qi, Zhengwei
DP 331 BLCC: A Benchmark for Multi-LiDAR and Multi-Camera Calibration Hou, Minghui; Wang, Gang; Wang, Zhiyang; Zhang, Tongzhou; Ma, Baorui
DP 342 MC-YOLO: Multi-scale Transmission Line Defect Target Recognition Network Wang, Jingdong; Ding, Xu; Meng, Fanqi
DP 350 A Novel Human Abnormal Posture Detection Method Based on Spatial-Topological Feature Fusion of Skeleton Ma, Yuefeng; Cheng, Zhiqi; Liu, Deheng; Tang, Shiying
DP 359 SSCDUF: Spatial-Spectral Correlation Transformer Based on Deep Unfolding Framework for Hyperspectral Image Reconstruction Zhao, Hui; Qi, Na; Zhu, Qing; Lin, Xiumin
DP 383 Cross-View Geo-Localization via Learning Correspondence Semantic Similarity Knowledge Chen, Guanli; Huang, Guoheng; Yuan, Xiaochen; Chen, Xuhang; Zhong, Guo; Pun, Chi-Man
DP 385 Style Separation and Content Recovery for Generalizable Sketch Re-identification and A New Benchmark Lu, Lingyi; Xu, Xin; Wang, Xiao
DP 387 Chain of Thought Guided Few-shot Fine-tuning of LLMs for Multimodal Aspect-based Sentiment Classification Wu, Hao; Yang, Danping; Liu, Peng; Li, Xianxian
DP 393 Progressive Neural Architecture Generation with Weaker Predictors Zhang, Zhengzhuo; Zhuang, Liansheng
DP 420 Self-Supervised Reference-based Image Super-Resolution with Conditional Diffusion Model Shi, Shuai; Qi, Na; Li, Yezi; Zhu, Qing
DP 447 TPS-YOLO: The Efficient Tiny Person Detection Network Based on Improved YOLOv8 and Model Pruning Yao, Li; Huang, Qianni; Wan, Yan
DP 460 MICAN: Multi-modal Inconsistency-based Cooperation Attention Network for fake news detection Yi, Zepu; Lu, Songfeng; Tang, Xueming; Zhu, Jianxin; Wu, Junjun
DP 214 TACST: Time-Aware Transformer for Robust Speech Emotion Recognition Wei, Wei; Zhang, Bingkun; Wang, Yibing
DP 215 TS-MEFM: A New Multimodal Speech Emotion Recognition Network Based on Speech and Text Fusion Wei, Wei; Zhang, Bingkun; Wang, Yibing

Demonstrations: Day 2 & 3 (9 and 10 January 13:30 – 15:00)

Demo ID Paper ID Paper Title Authors
D01 468 SelectSum: Topic-Based Selective Summarization of Speech-Based Videos Wattasseril, Jobin Idiculla; Döllner, Jürgen
D02 469 Real-time Visualizer for Turntablist Performance Hamanaka, Masatoshi
D03 494 Multi-Dimensional Exploration of Media Collection Metadata Khan, Omar Shahbaz ; Duane, Aaron ; Hasnan, Hariz ; Blavec, Noé Le ; Ouvrard, Pierre ; Verdon, Johan ; d’Orazio, Laurent ; Thierry, Constance ; Jónsson, Björn Þór
D04 470 DriveCoach: Smart Driving Assistance with Multimodal Risk Prediction and Risk Adaptive Behavior Recommendation Gan, Wenbin; Dao, Minh-Son; Zettsu, Koji
D05 472 System Demo of Modeling Smart University Campus Virtual Environments Fernandez Roblero, Jaime Boanerjes ; Ali, Muhammad Intizar
D06 473 AMDA: Advancing Multimedia Data Annotation for human-centric situations Mohamed Serouis, Ibrahim; Sèdes, Florence
D07 475 FencBuddy: Action-aware Depth Perception Training for Fencing Attacks Peng, Hung-Yao; Zhong, Zi-Heng; Tsai, Cheng-Chih; Chiang, Ching-Yeh; Pan, Tse-Yu
D08 477 WaveFontStyler: Font Style Transfer Based on Sound Izumi, Kota; Yanai, Keiji
D09 479 Training a Segmentation-based Visual Anonymization Service for Street Scenes Korb, Martin; Bailer, Werner
D10 481 CleverFox: Integrating Visual Mnemonics with AI for Enhanced Language Learning Chiang, Yung-Chu ; Tang, Zi-Xian ; Luo, Yi-Ching ; Chang, Jason S.
D11 482 Fingering Prediction for Classical Guitar: Dataset Creation and Model Development Iino, Nami ; Iino, Akinaru
D12 483 An Implementation of Networked JamSketch Kitahara, Tetsuro ; Tsutsumi, Takuya ; Nagoshi, Takaaki ; Suzuki, Taizan
D13 485 Using Language Models to Generate and Forget the Narrative Memories of an Assistive Robot Garcia Contreras, Angel Fernando ; Chang, Wen-Yu ; Kawano, Seiya ; Chen, Yun-Nung ; Yoshino, Koichiro
D14 486 Better Image Segmentation with Classification: Guiding Zero-Shot Models Using Class Activation Maps Borgli, Hanna ; Stensland, Håkon Kvale ; Halvorsen, Pål
D15 488 Transformer-Based Audio Generation Conditioned by 2D Latent Maps: A Demonstration Limberg, Christian ; Zhang, Zhe ; Kastner, Marc A.
D16 489 KuzushijiFontDiff: Diffusion Model for Japanese Kuzushiji Font Generation Yuan, Honghui; Yanai, Keiji
D17 490 SceneTextStyler: Editing Text with Style Transformation Yuan, Honghui; Yanai, Keiji
D18 492 Multimodal Interoperability with the CLAMS Platform Lynch, Kelley ; Rim, Kyeongmin ; King, Owen ; Pustejovsky, James
D19 493 Enhancing User Control in AI-Based Video Summarization for Social Media Kontostathis, Ioannis; Apostolidis, Evlampios; Apostolidis, Konstantinos; Mezaris, Vasileios
D20 496 Movie Retrieval Systems Using Genre-guided Multimodal Learning Techniques Huang, Wei-Lun ; Hidayati, Shintami Chusnul ; Pan, Tse-Yu
D21 497 A User Identification and Reading Style Detection System Based on Eye Movement Patterns During Reading Kongmeesub, Onanong; Gurrin, Cathal; Nie, Dongyun
D22 484 Federated Learning with Multimodal-Sensing and Knowledge Distillation: An application on real-world benchmark dataset Le, Duy-Dong ; Huynh, Duy-Thanh ; Bao, Pham The
D23 499 Efficient Deployment of Multimodal AI Models: Leveraging Pruning, Quantization and Multi-Objective Optimization for Edge Computing Vu, Dang ; Dang, Tien ; Nguyen, Quoc-Trung ; Pham, Tan
D24 466 Badminton Footwork Practice via an Immersive Virtual Reality System Jheng, Duen-Chian ; Harchan, Bill Louis ; Kostka de Sztemberg, Berenika Nawoja ; Hsu, Jen-Hao ; Hu, Min-Chun
D25 480 RoboDJ: Live Commentary Robots System Driven by Physical- and Cyber-world Observations Kawanishi, Yasutomo; Nakamura, Yutaka; Shintani, Taiken; Ishi, Carlos T.; Kawano, Seiya; Yoshino, Koichiro; Minato, Takashi; Minoh, Michihiko
D26 487 Leveraging Latent Diffusion in 3D Gaussian Splatting for Novel View Synthesis Li, Bohan ; Li, Xingyi ; Liang, Yangwen ; Wang, Shuangquan ; Song, Kee-Bong

VBS: Video Browser Showdown: Day 1 (8 January)

Paper ID Authors Title
406 Nguyen-Ho, Thang-Long; Huynh, Viet-Tham; Kongmeesub, Onanong; Tran, Minh-Triet; Nie, Dongyun; Healy, Graham; Gurrin, Cathal VEAGLE: Eye Gaze-Assisted Guidance for Video Browser Showdown
501 Tran, Quang-Linh; Nguyen, Binh; Jones, Gareth J. F.; Gurrin, Cathal VideoEase at VBS2025: An Interactive Video Retrieval System
502 Rossetto, Luca; Gasser, Ralph Feature-driven Video Segmentation and Advanced Querying with vitrivr-engine
503 Nguyen, Tai; Vo, Anh Ngoc Minh; Pham, Dat Duc; Tran, Vinh Quang; Duong, Nhu Thi Quynh; Le, Tien Anh; Le, Tan Duy; Nguyen, Binh T. HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025
504 CHENG, Yu Tong; WU, Jiaxin; MA, Zhixin; HE, Jiangshan; WEI, Xiao-Yong; NGO, Chong Wah Interactive Video Search with Multi-modal LLM Video Captioning
505 Le, Huy M.; Nguyen Tien, Dat; Le Duy, Khang; Nguyen Dang Quang, Tuan; Nguyen Khanh, Toan; Nguyen, Binh T. FUSIONISTA: Fusion of 3-D Information of Video in Retrieval System
506 C. Quan, Khanh-An; Ngoc Nguyen, Qui; Tran, Minh-Triet ViFi: A Video Finding System at Video Browser Showdown 2025
507 Vuong, Gia-Huy; Ho, Van-Son; Nguyen-Dang, Tien-Thanh; Thai, Xuan-Dang; Ho-Le, Minh-Quan; Le, Tu-Khiem; Pham, Minh-Khoi; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet ViewsInsight2.0: Enhancing Video Retrieval for VBS 2025 with an Automatic Query Generator Powered by Large Language Models
508 Pantelidis, Nick; Georgalis, Dimitris; Pegia, Maria; Galanopoulos, Damianos; Apostolidis, Konstantinos; Stavrothanasopoulos, Klearchos; Moumtzidou, Anastasia; Gkountakos, Konstantinos; Gialampoukidis, Ilias; Vrochidis, Stefanos; Mezaris, Vasileios; Kompatsiaris, Ioannis VERGE in VBS 2025
509 Sharma, Ujjwal; Khan, Omar Shahbaz; Rudinac, Stevan; Jónsson, Björn Þór Exquisitor at the Video Browser Showdown 2025: Unifying Conversational Search and User Relevance Feedback
510 Spiess, Florian; Rossetto, Luca; Schuldt, Heiko Simplified Video Retrieval in Virtual Reality with vitrivr-VR
511 Leopold, Mario; Schöffmann, Klaus diveXplore at the Video Browser Showdown 2025
512 Tran Gia, Bao; Bui Cong Khanh, Tuong; Le Thi Thanh, Tam; Tran Doan, Thuyen; Le Tran Trong, Khiem; Do, Tien; Mai, Tien-Dung; Duc Ngo, Thanh; Le, Duy-Dinh; Satoh, Shin’ichi NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search
513 Stroh, Michael; Kloda, Vojtěch; Verner, Benjamin; Vopálková, Zuzana; Buchmüller, Raphael; Jäckl, Bastian; Lokoč, Jakub; Hajko, Jakob PraK Tool V3: Enhancing Video Item Search Using Localized Text and Texture Queries
514 Arnold, Rahel; Kempf, Rahel; Waltenspül, Raphael; Schuldt, Heiko MediaMix: Multimedia Retrieval in Mixed Reality
515 Ho-Le, Minh-Quan; Ho, Duy-Khang; Do-Huu, Huy-Hoang; Le-Hinh, Nhut-Thanh; Vo-Hoang, Hoa-Vien; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet SnapSeek 2.0 at Video Browser Showdown 2025
517 Luu, Duc-Tuan; C. Quan, Khanh-An; Nguyen, Duy-Ngoc; Bui-Le, Khanh-Linh; Doan, Nhat-Sang; Le-Ngo, Minh-Duc; Nguyen, Vinh-Tiep; Tran, Minh-Triet IMSearch 2.0: Toward User-centric and Efficient Interactive Multimedia Retrieval System

Social Events

Welcome Reception (Day 1: 8 January)

We warmly invite all attendees to the Welcome Reception.

  • Time: 6:00 PM ~ 8:00 PM (tentative)
  • Location: Reception Hall 1
  • Refreshments including a variety of foods and drinks will be provided.

Banquet (Day 2: 9 January)

  • Time: Starts at 6:00 PM (tentative)
  • Location: KOTOWA Nara-Koen Premium View
    • Address: 15 Imamikadocho, Nara, 630-8374, Japan (〒630-8374 奈良県奈良市今御門町15)
  • Foods and drinks will be provided.
    • Highlight: Kiki-sake (利き酒) will be held as part of the banquet.
      • “Kiki-sake” is the Japanese tradition of sake tasting. It involves sampling and evaluating different types of sake to appreciate their flavors, aromas, and characteristics, much like wine tasting in Western cultures. The word ‘kiki’ refers to discerning or distinguishing, and ‘sake’ is Japan’s traditional rice wine. It is often done in a formal setting or as an enjoyable activity to explore the rich variety of sake styles.