Program at a Glance
Tentative Program (updated: 29 November 2024)
The current program is tentative and may change.
Keynote Talks
Multimodal, Multilingual Generative AI: From Multicultural Contextualization to Empathetic Reasoning
Dr. Nancy F. Chen
Manga109 and MangaUB: How Far Can Large Multimodal Models (LMMs) Go in Understanding Manga?
Prof. Kiyoharu Aizawa
Multi-modal foundation models in the automotive industry
Dr. Andrei Bursuc
Oral Presentations
paperID | authors | title |
---|---|---|
117 | Wang, Zhensu ; Peng, Weilong ; Wang, Le ; Wu, Zhizhe ; Zhu, Peican ; Tang, Keke | EIA: Edge-aware Imperceptible Adversarial Attacks on 3D Point Clouds |
127 | Zhang, Jiahao ; Gao, Guangyu ; Zhao, Xiao | MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms |
129 | Lv, Yishan; Luo, Jing; Ju, Boyuan; Yang, Xinyu | Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation |
140 | Li, Xiuhong; Zhu, Xinyue; Li, Boyuan; Li, Songlin; Wang, Luyao; Jia, Zhenhong | Infrared Small Target Detection with Feature Refinement and Context Enhancement |
167 | Wang, Xiwen; Zhou, Jizhe; Li, Mao; Zhu, Xuekang; Li, Cheng | Saliency Guided Optimization Of Diffusion Latents |
174 | Suyama, Kosei; Nakamura, Kazuaki | Detoxification of Unlabeled Dataset: Reducing Implicit Class Imbalance Using Pseudo-Jacobian of GAN’s Generator |
196 | Tan, Wenhui ; Liu, Bei ; Zhang, Junbo ; Song, Ruihua ; Fu, Jianlong | RoLD: Robot Latent Diffusion for Multi-task Policy Modeling |
199 | Zhu, Jian ; Sheng, Mingkai ; Huang, Zhangmin ; Chang, Jingfei ; Long, Jian ; Jiang, Jinling ; Liu, Lei ; Luo, Cheng | CLIP Multi-modal Hashing for Multimedia Retrieval |
218 | Xu, Bo; Jiang, Haiqi; Wei, Shouang; Du, Ming; Song, Hui; Wang, Hongya | A Multi-Expert Collaborative Framework for Multimodal Named Entity Recognition |
223 | Yang, Xiukang ; Ge, Jingguo ; Li, Hui ; Li, Liangxiong ; Wu, Bingzhen | Integrating S1&S2 Framework for Enhanced Semantic Match in Person Re-identification |
232 | Vu, Thi Ngoc Anh ; Shoji, Yoshiyuki ; Oe, Yuma ; PHAM, Huu Long ; Ohshima, Hiroaki | Image-Generation AI Model Retrieval by Contrastive Learning-based Style Distance Calculation |
236 | Yaling, Hao; Wei, Wu | MineTinyNet-YOLO: An Efficient Small Object Detection Method for Complex Underground Coal Mine Scenarios |
244 | Si, Jiahua ; Wang, Youze ; Hu, Wenbo ; Liu, Qiang ; Hong, Richang | Making strides Security in Multimodal Fake News Detection Models: A Comprehensive Analysis of Adversarial Attacks |
266 | Sharma, Nikhil ; Sun, Changchang ; Zhao, Zhenghao ; Ngu, Anne Hee Hiong ; Latapie, Hugo ; Yan, Yan | SSDL: Sensor-to-Skeleton Diffusion Model with Lipschitz Regularization for Human Activity Recognition |
268 | Lincker, Elise ; Guinaudeau, Camille ; Satoh, Shin’ichi | AD2AT: Audio Description to Alternative Text, a Dataset of Alternative Text from Movies |
273 | Sugahara, Aoto ; Kishimoto, Soma ; Adachi, Yuji ; Tai, Kiyoto ; Takashima, Ryoichi ; Takiguchi, Tetsuya | Operatic Singing Voice Synthesis From Inexperienced Voice Considering Tempo and Vowel Change |
274 | Chen, Jiaxing ; Liu, Yuxuan ; Li, Dehu ; An, Xiang ; Deng, Weimo ; Feng, Ziyong ; Zhao, Yongle ; Xie, Yin | Grounding Deliberate Reasoning in Multimodal Large Language Models |
305 | Zhu, Deli ; Xu, Zhao ; Yang, Yunong | MambaTalk: Speech-driven 3D Facial Animation with Mamba |
308 | Chen, Zhuowei; Huang, Mengqi; Chen, Nan; Mao, Zhendong | Skin-Adapter: Fine-Grained Skin-Color Preservation for Text-to-Image Generation |
310 | YUAN, HONGHUI; YANAI, KEIJI | KuzushijiDiffuser: Japanese Kuzushiji Font Generation with FontDiffuser |
331 | Minghui, Hou ; Gang, Wang ; Zhiyang, Wang ; Tongzhou, Zhang ; Baorui, Ma | BLCC: A Benchmark for Multi-LiDAR and Multi-Camera Calibration |
346 | Su, Yulan; Zhang, Sisi; Wang, Yan; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui | RobSparse: Automatic Search for GPU-Friendly Robust and Sparse Vision Transformers |
359 | Zhao, Hui; Qi, Na; Zhu, Qing; Lin, Xiumin | SSCDUF: Spatial-Spectral Correlation Transformer Based on Deep Unfolding Framework for Hyperspectral Image Reconstruction |
374 | Chang, Ding-Chi; Li, Shiou-Chi; Huang, Jen-Wei | SPLGAN-TTS: Learning Semantic and Prosody to Enhance the Text-to-Speech Quality of Lightweight GAN Models |
377 | Du, Wenxu; Wumaier, Aishan; Shi, Yahui; Yi, Nian; Liu, Dehua | A Multi-Aspect Multi-Granularity Pronunciation Assessment Method Based on Branchformer Encoder and Hierarchical Aggregation |
379 | Li, Yizhou; Liu, Zihua; Monno, Yusuke; Okutomi, Masatoshi | TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration |
385 | Lu, Lingyi; Xu, Xin; Wang, Xiao | Style Separation and Content Recovery for Generalizable Sketch Re-identification and A New Benchmark |
393 | Zhang, Zhengzhuo; Zhuang, Liansheng | Progressive Neural Architecture Generation with Weaker Predictors |
395 | Goto, Yuta ; Yamazaki, Satoshi ; Shibata, Takashi ; Liu, Jianquan | Open-vocabulary Scene Graph Generation via Synonym-based Predicate Descriptor |
415 | Xu, Xiaoman; Li, Xiangrun; Wang, Taihang; Jiang, Ye | AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection |
430 | Li, Feng; Luo, Jiusong; Xia, Wanjun | WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition |
451 | Zhang, Zhihui ; Pang, Jinhui ; Li, Jianan ; Hao, Xiaoshuai | ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing |
462 | Huang, Zhongzhan; Liang, Mingfu; Liang, Senwei; Zhong, Shanshan | Flat Local Minima for Continual learning on Semantic Segmentation |
214 | Wei, Wei; Zhang, Bingkun; Wang, Yibing | TACST: Time-Aware Transformer for Robust Speech Emotion Recognition |
215 | Wei, Wei; Zhang, Bingkun; Wang, Yibing | TS-MEFM: A New Multimodal Speech Emotion Recognition Network Based on Speech and Text Fusion |
288 | Li, Su ; Wang, Liang ; Wang, Jianye ; Zhang, Ziheng ; Zhang, Junjun ; Zhang, Lei | Enhanced Anomaly Detection in 3D Motion through Language-Inspired Occlusion-Aware Modeling |
411 | Ghasemi, Narges ; Kim, Seon Ho ; Alfarrarjeh, Abdullah ; Shahabi, Cyrus | Counting Unique Objects in Geo-Tagged Street Images: A Case Study Of Homeless Encampments in Los Angeles |
Poster Presentations
paperID | authors | title |
---|---|---|
111 | Yin, Min ; Xie, Liang ; Liang, HaoRan ; Zhao, Xing ; Chen, Ben ; Liang, RongHua | SCLSTE: Semi-Supervised Contrastive Learning-Guided Scene Text Editing |
120 | Shang, Yuzhang ; Liu, Gaowen ; Kompella, Ramana ; Yan, Yan | Quantized-ViT Efficient Training via Fisher Matrix Regularization |
121 | Kong, Yongqiang; Wang, Yunhong; Li, Annan | Saliency based data augmentation for few-shot video action recognition |
128 | Ye, Yuyao ; Yang, Jiayu ; Zhao, Yang ; Gao, Mengping ; Cao, Hongbin ; Wang, Ronggang | Hybrid Scalable Video Coding with Neural Compression and Enhancement for Streaming Media |
130 | Cai, Pengzhou ; Jiang, Lu ; Li, Yanxin ; Liu, Xiaojuan ; Lan, Libin | Pubic Symphysis-Fetal Head Segmentation Network Using BiFormer Attention Mechanism and Multipath Dilated Convolution |
131 | Li, Guofeng ; Li, Hanxi ; Li, Bo ; Wu, Lin ; Cheng, Yan | DART: Depth-Enhanced Accurate and Real-Time Background Matting |
141 | Cai, Zeyu ; Chen, Xunhao ; Zhang, Can ; Chen, Yuchong ; Yang, Jiming ; Shi, Wubin ; Jin, Chengqian ; Da, Feipeng | MLP-AMDC: A MLP Architecture for Adaptive-Mask-based Dual-Camera snapshot hyperspectral imaging |
144 | Tsukuda, Kosetsu; Takahashi, Takumi; Ishida, Keisuke; Hamasaki, Masahiro; Goto, Masataka | Kiite World: Socializing Map-Based Music Exploration Through Playlist Sharing and Synchronized Listening |
146 | Zhu, Qinfeng ; Weng, Ningxin ; Fan, Lei ; Cai, Yuanzhi | Enhancing Environmental Monitoring through Multispectral Imaging: The WasteMS Dataset for Semantic Segmentation of Lakeside Waste |
158 | Song, Tao ; Zhang, Wenwen | Frequency-aware Convolution for Sound Event Detection |
163 | Liu, Dongyu; Zhu, Yuan; Liu, Rui; Xing, Zhecong; Geng, Weiyang; Wang, Yanqiang | MSD-YOLO: An efficient algorithm for small target detection |
166 | Li, Yongqian ; Luo, Yong ; Zhou, Xin | Robust Active Speaker Detection in Challenging Environments Using GNN-Fused Multi-Modal Cues and Body Language |
172 | Tian, Xiang; Zhang, Yuan; Mu, Chang; Zhang, Ziyang | Intra-Class Compact Facial Expression Recognition Based on Amplitude Phase Separation |
173 | Ding, Guohui; Li, Zhonghua; Ren, Yongqiang | Modality-Specific Hashing: Transform Cross-Modal Retrieval into Single-Modal Retrieval |
176 | Yu, Jizhe; Liu, Yu; Wu, Xiaoshuai; Xu, Kaiping; Li, Jiangquan | PA2Net: Pyramid Attention Aggregation Network for Saliency detection |
178 | Xu, Feifei; Jia, Fumiaoyue; Zhou, Wang | Multimodal Prompt Learning for Audio Visual Scene-aware Dialog |
181 | Chen, Liang-Chia; Chu, Wei-Ta | HCV: Lightweight Hybrid CNN-Vision Transformer for Visual Object Tracking |
182 | Wang, Bin ; Chen, Zekun ; Zhang, Lei ; Liang, Shili ; Guo, Sijia ; Kang, Xinyu ; Li, Huajing | MSA-Former: Multi-Scale Adaptive Transformer for Image Snow Removal |
184 | Lu, Congjian ; Zhou, Shuwang ; Shan, Ke ; Zhang, Hongkuan ; Liu, Zhaoyang | SES-Net: Multi-dimensional Spot-Edge-Surface Network for Nuclei Segmentation |
188 | Zhang, Jingyao ; Hao, Shijie ; Sun, Fuming ; Rao, Yuan | LIESA: Low-light Image Enhancement with Semantic Awareness |
189 | Wang, Yufei ; Yao, Junfeng ; Wang, Zefeng | PianoPal: A Robotic Multimedia System for Interactive Piano Instruction Based on Q-learning and Real-time Feedback |
192 | Vadicamo, Lucia ; Scotti, Francesca ; Dearle, Alan ; Connor, Richard | Comparative Analysis of Relevance Feedback Techniques for Image Retrieval |
195 | Sun, Yongqing ; Liu, Hong ; Chang, Qiong ; Han, Xianhua | Deep Dual Internal Learning for Hyperspectral Image Super-Resolution |
198 | Wu, Weijie ; Li, Jun ; Wu, Zhijian ; Xu, Jianhua | Zero-shot sketch-based image retrieval with hybrid information fusion and sample relationship modeling |
206 | Juliussen, Bjørn Aslak | The Right to an Explanation under the GDPR and the AI Act |
221 | Perez, Miguel ; Kirchhoff, Holger ; Grosche, Peter ; Serra, Xavier | Improving singing voice transcription generalization with AI generated accompaniments |
228 | Sunada, Tatsumi; Shiohara, Kaede; Xiao, Ling; Yamasaki, Toshihiko | LITA: LMM-guided Image-Text Alignment for Art Assessment |
229 | Yadav, Saumya ; Lincker, Élise ; Huron, Caroline ; Martin, Stéphanie ; Guinaudeau, Camille ; Satoh, Shin’ichi ; Shukla, Jainendra | Towards Inclusive Education: Multimodal Classification of Textbook Images for Accessibility |
237 | Li, Jingkun; Qi, Na; Zhu, Qing | Hyper-NeuS: Hypernetworks for Neural SDF Implicit Surface Reconstruction by Volume Rendering |
241 | Cao, Qian; Song, Ruihua; Chen, Xu | Understanding the Roles of Visual Modality in Multimodal Dialogue: An Empirical Study |
242 | Yu, Le ; Zhang, Xianchao ; Qian, Shuxia ; Sun, Hong | DistillSleep: Leverage Self-Distillation to Improve Performance After Representation Learning for Sleep Staging |
246 | Yamamoto, Shuhei ; Kando, Noriko | Temporal Closeness for Enhanced Cross-Modal Retrieval of Sensor and Image Data |
247 | Losfeld, Armand; Seznec, Nicolas; Van Bogaert, Laurie; Lafruit, Gauthier; Teratani, Mehrdad | An Analytical Method for Rendering Plenoptic Cameras 2.0 on 3D Multi-Layer Displays |
251 | Zhu, Yingqian; Gao, Guanyu | QRALadder: QoE and Resource Consumption-Aware Encoding Ladder Optimization for Live Video Streaming |
253 | Fang, Zhiyi ; Qian, Yi ; Dai, Xiyue | Structural Information-guided Fine-grained Texture Image Inpainting |
256 | Jiang, Ling; Liu, Zhuocheng; Li, Kaige; Wu, Wei | Boosting Human Pose Estimation via Heatmap Refinement |
265 | Imajuku, Yuki; Yamakata, Yoko; Aizawa, Kiyoharu | FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation |
272 | Han, Sijia; Zhang, Zhibin | GFA-UDIS: Global-to-Flow Alignment for Unsupervised Deep Image Stitching |
275 | Wu, Fei ; Zhou, Ruixuan ; Ji, Yimu ; Jing, Xiao-Yuan | Joint Decision Network with Modality-Specific and Dual Interactive Features for Fake News Detection |
276 | Liu, Jiajie; Zhang, Zhibin | Lightweight Dual Grouped Large-Kernel Convolutions for Salient Object Detection Network |
277 | Yang, Enhui; Zhang, Zhibin | MS-SAM: Multi-Scale SAM based on Dynamic Weighted Agent Attention |
281 | ZHU, YIJIE; Li, MingYong | Multi-Modal Information Multi-Angle Mining For Multimedia Recommendation |
283 | Wang, Qing; Ngo, Chong Wah; Lim, Ee-Peng; Sun, Qianru | LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets |
292 | Yip, Tin Yui; Chau, Chuck-jee | Music2MIDI: Pop Music to MIDI Piano Cover Generation |
293 | Chen, Xiangyu; Satoh, Shinichi | Balancing Efficiency and Accuracy: An Analysis of Sampling for Video Copy Detection |
295 | Xiang, Yanru; Li, Yi | One-Shot Generative Domain Adaptation by Constructing Self-Amplifying Datasets |
296 | Zheng, Shuijing; Yu, Suxi; Wang, Yi; Wen, Jing | GWUNet: A UNet with Gated Attention and Improved Wavelet Transform for Thyroid Nodules Segmentation |
297 | Chen, Junjian ; Yang, Xuan | Uncertainty-guided Joint Semi-supervised Segmentation and Registration of Cardiac Images |
306 | Li, Yu ; Xie, Zhenping | Visual Anomaly Detection on Topological Connectivity under Improved YOLOv8 |
307 | Wang, Haodian | Frequency-Based Unsupervised Low-Light Image Enhancement Framework |
309 | Suo, Zihao; Pan, Shanliang | Target-Oriented Dynamic Denoising Curriculum Learning for Multimodal Stance Detection |
312 | Ai, Hanxu ; Tao, Xiaomei ; Li, Xingbing ; Gan, Yanling | Modeling High-order Relationships between Human and Video for Emotion Recognition |
315 | Falcon, Alex ; Abdari, Ali ; Serra, Giuseppe | HierArtEx: Hierarchical Representations and Art Experts Supporting the Retrieval of Museums in the Metaverse |
316 | Jiang, Wanchang; Jiang, Yuxin | Noise-robust Separating Multi-source Aliased Vibration Signal Based on Transformer Demucs |
317 | Han, Miaolin; Li, Huibin | DocMamba: Robust Document Image Dewarping via Selective State Space Sequence Modeling |
321 | Xu, Yixiao ; Li, Yubo ; Xu, Wanzhao ; Gu, Yicheng ; Wang, Yun ; Ma, Jiangyuan ; Qi, Zhengwei | gFlow: Distributed Real-Time Reverse Remote Rendering System Model |
326 | Shih, Mu-Jan ; Hsu, Yi-Yu | Real-Time Action Detection in Volleyball Matches Using DETR Architecture |
332 | Huang, Hujiang; Xie, Yu; Gao, Jun; Fan, Chuanliu; Cao, Ziqiang | Select and Order: Enhancing Few-Shot Image Classification through In-Context Learning |
336 | Zhang, Yongliang; Liu, Jing | SMG-Diff: Adversarial Attack Method Based on Semantic Mask-Guided Diffusion |
337 | Terada, Takamasa; Toyoura, Masahiro | Wavelet Integrated Convolutional Neural Network for ECG Signal Denoising |
342 | Wang, Jingdong; Ding, XU; Meng, Fanqi | MC-YOLO: Multi-scale Transmission Line Defect Target Recognition Network |
344 | Sun, Ying; Wei, Meiyi; Chen, Gang | Dual-Task Feedback Learning for Tongue Detection via Super-Resolution Integration |
350 | Ma, Yuefeng ; Cheng, Zhiqi ; Liu, Deheng ; Tang, Shiying | A Novel Human Abnormal Posture Detection Method Based on Spatial-Topological Feature Fusion of Skeleton |
354 | Phueaksri, Itthisak ; Kastner, Marc A. ; Kawanishi, Yasutomo ; Komamizu, Takahiro ; Ide, Ichiro | Towards Visual Storytelling by Understanding Narrative Context through Scene-Graphs |
356 | Hürst, Wolfgang; Zeches, Leo | Rotation Methods for 360-degree Videos in Virtual Reality - A Comparative Study |
360 | Wang, JinYang; Wu, Wei | Camouflaged Object Detection Based on Localization Guidance and Multi-Scale Refinement |
362 | Su, Yulan; Zhang, Sisi; Lin, Zechao; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui | Poseidon: A NAS-Based Ensemble Defense Method against Multiple Perturbations |
363 | Guo, Junhao ; Fu, Chenhan ; Wang, Guoming ; Lu, Rongxing ; Chen, Dong ; Tang, Siliang | MM-CARP: Multimodal Model with Cross-modal retrieval-Augmented and visual Region Perception |
365 | Shao, Xuan ; Huang, Leming ; Liu, Xinghua | Revisit Data Association in Semantic SLAM Systems for Autonomous Parking |
368 | KWON, ILHWAN ; Li, Jun ; Shah, Rajiv Ratn ; Prasad, Mukesh | Lightweight Motion-Aware Video Super-Resolution for Compressed Videos |
373 | Papadopoulos, Sotirios ; Ioannidis, Konstantinos ; Vrochidis, Stefanos ; Kompatsiaris, Ioannis ; Patras, Ioannis | Vision-Language Pretraining for Variable-shot Image Classification |
383 | Chen, Guanli ; Huang, Guoheng ; Yuan, Xiaochen ; Chen, Xuhang ; Zhong, Guo ; Pun, Chi-Man | Cross-View Geo-Localization via Learning Correspondence Semantic Similarity Knowledge |
386 | Yang, Dajiang; Wu, Wei; Lee, Yuxing | SCANet: Semantic Coherence Attention Network for Clothing Change Person Re-identification |
387 | Wu, Hao; Yang, Danping; Liu, Peng; Li, Xianxian | Chain of Thought Guided Few-shot Fine-tuning of LLMs for Multimodal Aspect-based Sentiment Classification |
392 | Cheng, Shyi-Chyi ; CHEN, YEN-LIN ; Li, Shih-Yu | MPPQNet: A Moment-Preserving Product Quantization Neural Network for Progressive 3D Point Cloud Transmission |
414 | Hezel, Nico; Barthel, Kai Uwe; Schilling, Bruno; Schall, Konstantin; Jung, Klaus | Dynamic Exploration Graph: A Novel Approach for Efficient Nearest Neighbor Search in Evolving Multimedia Datasets |
417 | Springer, Joshua David; Guðmundsson, Gylfi Þór; Kyas, Marcel | Toward A Full Pipeline Approach to Autonomous Drone Landing Site Identification: From Terrain Survey to Embedded Classifier |
420 | Shi, Shuai; Qi, Na; Li, Yezi; Zhu, Qing | Self-Supervised Reference-based Image Super-Resolution with Conditional Diffusion Model |
429 | Hürst, Wolfgang; Visser, Yannick | Innovative Lifelog Visualization and Exploration in Virtual Reality - A Comparative Study |
435 | Bonatto, Daniele ; Fernandes Pinto Fachada, Sarah ; Sancho, Jaime ; Juarez, Eduardo ; Lafruit, Gauthier ; Teratani, Mehrdad | Synchronization and Calibration of Video Sequences acquired using Multiple Plenoptic 2.0 Cameras |
436 | Lim, Xin ; Wong, Lai-Kuan ; Loh, Yuen Peng ; Gu, Ke ; Lin, Weisi | Mix-YOLONet: Deep Image Dehazing for Improving Object Detection |
438 | Lv, Jinyan; Xiao, Guoqiang | BiCA-YOLO: Bidirectional Feature Enhancement and Cross Coordinate Attention for Small Object Detection |
444 | Chen, Zhaoxin; Ma, Bo | A Dual-Branch Model for Color Constancy |
445 | Mu, Wenchuan; Lim, Kwan Hui | Data-free Functional Projection of Large Language Models onto Social Media Tagging Domain |
447 | Yao, Li; Huang, Qianni; Wan, Yan | TPS-YOLO: The Efficient Tiny Person Detection Network Based on Improved YOLOv8 and Model Pruning |
455 | Yan, Hao; Bai, Jing | MDT-Net: a mask decoder tuning strategy for CLIP-based zero-shot 3D Classification |
456 | Wang, Tiebiao; Li, Xiaoyang; Cui, Zhenchao | AMFT-YOLO: An Adaptive Multi-Scale YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes |
458 | Wu, Cheng-Yuan; Sun, Yuan-Chun; Lee, Cheng-Tse; Hsu, Cheng-Hsin | Optimally Planning Drone Trajectory to Capture a 3D Gaussian Splatting Object |
460 | Yi, Zepu; Lu, Songfeng; Tang, Xueming; Zhu, Jianxin; Wu, Junjun | MICAN: Multi-modal Inconsistency-based Cooperation Attention Network for fake news detection |
193 | Huang, Jia-Hong; Zhu, Hongyi; Shen, Yixian; Rudinac, Stevan; Kanoulas, Evangelos | Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models |
230 | Matsuhira, Chihaya ; Kastner, Marc A. ; Komamizu, Takahiro ; Hirayama, Takatsugu ; Ide, Ichiro | Quantifying Image-Adjective Associations by Leveraging Large-Scale Pretrained Models |
364 | C. Quan, Khanh-An ; Guinaudeau, Camille ; Satoh, Shin’ichi | Evaluating VQA Models' Consistency in the Scientific Domain |
137 | Fukuzawa, Takumi ; Hara, Kensho ; Kataoka, Hirokatsu ; Tamaki, Toru | Can masking background and object reduce static bias for zero-shot action recognition? |
355 | Tanabe, Hikaru; Yanai, Keiji | CalorieVoL: Integrating Volumetric Context into Multimodal Large Language Models for Image-based Calorie Estimation |
416 | Lim, Jia Yap ; See, John ; Dondrup, Christian | Multimodal Engagement Prediction in Human-Robot Interaction using Transformer Neural Networks |
431 | Yoshihara, Daichi ; Yuguchi, Akishige ; Kawano, Seiya ; Iio, Takamasa ; Yoshino, Koichiro | What Should Autonomous Robots Verbalize and What Should They Not? |
Demonstrations
paperID | authors | title |
---|---|---|
466 | Jheng, Duen-Chian ; Harchan, Bill Louis ; Kostka de Sztemberg, Berenika Nawoja ; Hsu, Jen-Hao ; Hu, Min-Chun | Badminton Footwork Practice via an Immersive Virtual Reality System |
468 | Wattasseril, Jobin Idiculla; Döllner, Jürgen | SelectSum: Topic-Based Selective Summarization of Speech-Based Videos |
469 | Hamanaka, Masatoshi | Real-time Visualizer for Turntablist Performance |
470 | Gan, Wenbin; Dao, Minh-Son; Zettsu, Koji | DriveCoach: Smart Driving Assistance with Multimodal Risk Prediction and Risk Adaptive Behavior Recommendation |
472 | Fernandez Roblero, Jaime Boanerjes ; Ali, Muhammad Intizar | System Demo of Modeling Smart University Campus Virtual Environments |
473 | Mohamed Serouis, Ibrahim; Sèdes, Florence | AMDA: Advancing Multimedia Data Annotation for human-centric situations |
475 | HUNG-YAO, PENG; ZI-HENG, ZHONG; CHENG-CHIH, TSAI; CHING-YEH, CHIANG; TSE-YU, PAN | FencBuddy: Action-aware Depth Perception Training for Fencing Attacks |
477 | Izumi, Kota; Yanai, Keiji | WaveFontStyler: Font Style Transfer Based on Sound |
479 | Korb, Martin; Bailer, Werner | Training a Segmentation-based Visual Anonymization Service for Street Scenes |
480 | Kawanishi, Yasutomo; Nakamura, Yutaka; Shintani, Taiken; Ishi, Carlos T.; Kawano, Seiya; Yoshino, Koichiro; Minato, Takashi; Minoh, Michihiko | RoboDJ: Live Commentary Robots System Driven by Physical- and Cyber-world Observations |
481 | Chiang, Yung-Chu ; Tang, Zi-Xian ; Luo, Yi-Ching ; Chang, Jason S. | CleverFox: Integrating Visual Mnemonics with AI for Enhanced Language Learning |
482 | Iino, Nami ; Iino, Akinaru | Fingering Prediction for Classical Guitar: Dataset Creation and Model Development |
483 | Kitahara, Tetsuro ; Tsutsumi, Takuya ; Nagoshi, Takaaki ; Suzuki, Taizan | An Implementation of Networked JamSketch |
485 | Garcia Contreras, Angel Fernando ; Chang, Wen-Yu ; Kawano, Seiya ; Chen, Yun-Nung ; Yoshino, Koichiro | Using Language Models to Generate and Forget the Narrative Memories of an Assistive Robot |
486 | Borgli, Hanna ; Stensland, Håkon Kvale ; Halvorsen, Pål | Better Image Segmentation with Classification: Guiding Zero-Shot Models Using Class Activation Maps |
487 | Li, Bohan ; Li, Xingyi ; Liang, Yangwen ; Wang, Shuangquan ; Song, Kee-Bong | Leveraging Latent Diffusion in 3D Gaussian Splatting for Novel View Synthesis |
488 | Limberg, Christian ; Zhang, Zhe ; Kastner, Marc A. | Transformer-Based Audio Generation Conditioned by 2D Latent Maps: A Demonstration |
489 | YUAN, HONGHUI; YANAI, KEIJI | KuzushijiFontDiff: Diffusion Model for Japanese Kuzushiji Font Generation |
490 | YUAN, HONGHUI; YANAI, KEIJI | SceneTextStyler: Editing Text with Style Transformation |
492 | Lynch, Kelley ; Rim, Kyeongmin ; King, Owen ; Pustejovsky, James | Multimodal Interoperability with the CLAMS Platform |
493 | Kontostathis, Ioannis; Apostolidis, Evlampios; Apostolidis, Konstantinos; Mezaris, Vasileios | Enhancing User Control in AI-Based Video Summarization for Social Media |
494 | Khan, Omar Shahbaz ; Duane, Aaron ; Hasnan, Hariz ; Blavec, Noé Le ; Ouvrard, Pierre ; Verdon, Johan ; d’Orazio, Laurent ; Thierry, Constance ; Jónsson, Björn Þór | Multi-Dimensional Exploration of Media Collection Metadata |
496 | Huang, Wei-Lun ; Hidayati, Shintami Chusnul ; Pan, Tse-Yu | Movie Retrieval Systems Using Genre-guided Multimodal Learning Techniques |
497 | Kongmeesub, Onanong; Gurrin, Cathal; Nie, Dongyun | A User Identification and Reading Style Detection System Based on Eye Movement Patterns During Reading |
484 | Le, Duy-Dong ; Huynh, Duy-Thanh ; Bao, Pham The | Federated Learning with Multimodal-Sensing and Knowledge Distillation: An application on real-world benchmark dataset |
499 | Vu, Dang ; Dang, Tien ; Nguyen, Quoc-Trung ; Pham, Tan | Efficient Deployment of Multimodal AI Models: Leveraging Pruning, Quantization and Multi-Objective Optimization for Edge Computing |
VBS: Video Browser Showdown
paperID | authors | title |
---|---|---|
406 | Nguyen-Ho, Thang-Long; Huynh, Viet-Tham; Kongmeesub, Onanong; Tran, Minh-Triet; Nie, Dongyun; Healy, Graham; Gurrin, Cathal | VEAGLE: Eye Gaze-Assisted Guidance for Video Browser Showdown |
501 | Tran, Quang-Linh; Nguyen, Binh; Jones, Gareth J. F.; Gurrin, Cathal | VideoEase at VBS2025: An Interactive Video Retrieval System |
502 | Rossetto, Luca; Gasser, Ralph | Feature-driven Video Segmentation and Advanced Querying with vitrivr-engine |
503 | Nguyen, Tai; Vo, Anh Ngoc Minh; Pham, Dat Duc; Tran, Vinh Quang; Duong, Nhu Thi Quynh; Le, Tien Anh; Le, Tan Duy; Nguyen, Binh T. | HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025 |
504 | CHENG, Yu Tong; WU, Jiaxin; MA, Zhixin; HE, Jiangshan; WEI, Xiao-Yong; NGO, Chong Wah | Interactive Video Search with Multi-modal LLM Video Captioning |
505 | Le, Huy M.; Nguyen Tien, Dat; Le Duy, Khang; Nguyen Dang Quang, Tuan; Nguyen Khanh, Toan; Nguyen, Binh T. | FUSIONISTA: Fusion of 3-D Information of Video in Retrieval System |
506 | C. Quan, Khanh-An; Ngoc Nguyen, Qui; Tran, Minh-Triet | ViFi: A Video Finding System at Video Browser Showdown 2025 |
507 | Vuong, Gia-Huy; Ho, Van-Son; Nguyen-Dang, Tien-Thanh; Thai, Xuan-Dang; Ho-Le, Minh-Quan; Le, Tu-Khiem; Pham, Minh-Khoi; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet | ViewsInsight2.0: Enhancing Video Retrieval for VBS 2025 with an Automatic Query Generator Powered by Large Language Models |
508 | Pantelidis, Nick; Georgalis, Dimitris; Pegia, Maria; Galanopoulos, Damianos; Apostolidis, Konstantinos; Stavrothanasopoulos, Klearchos; Moumtzidou, Anastasia; Gkountakos, Konstantinos; Gialampoukidis, Ilias; Vrochidis, Stefanos; Mezaris, Vasileios; Kompatsiaris, Ioannis | VERGE in VBS 2025 |
509 | Sharma, Ujjwal; Khan, Omar Shahbaz; Rudinac, Stevan; Jónsson, Björn Þór | Exquisitor at the Video Browser Showdown 2025: Unifying Conversational Search and User Relevance Feedback |
510 | Spiess, Florian; Rossetto, Luca; Schuldt, Heiko | Simplified Video Retrieval in Virtual Reality with vitrivr-VR |
511 | Leopold, Mario; Schöffmann, Klaus | diveXplore at the Video Browser Showdown 2025 |
512 | Tran Gia, Bao; Bui Cong Khanh, Tuong; Le Thi Thanh, Tam; Tran Doan, Thuyen; Le Tran Trong, Khiem; Do, Tien; Mai, Tien-Dung; Duc Ngo, Thanh; Le, Duy-Dinh; Satoh, Shin’ichi | NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search |
513 | Stroh, Michael; Kloda, Vojtěch; Verner, Benjamin; Vopálková, Zuzana; Buchmüller, Raphael; Jäckl, Bastian; Lokoč, Jakub; Hajko, Jakob | PraK Tool V3: Enhancing Video Item Search Using Localized Text and Texture Queries |
514 | Arnold, Rahel; Kempf, Rahel; Waltenspül, Raphael; Schuldt, Heiko | MediaMix: Multimedia Retrieval in Mixed Reality |
515 | Ho-Le, Minh-Quan; Ho, Duy-Khang; Do-Huu, Huy-Hoang; Le-Hinh, Nhut-Thanh; Vo-Hoang, Hoa-Vien; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet | SnapSeek 2.0 at Video Browser Showdown 2025 |
517 | Luu, Duc-Tuan; C. Quan, Khanh-An; Nguyen, Duy-Ngoc; Bui-Le, Khanh-Linh; Doan, Nhat-Sang; Le-Ngo, Minh-Duc; Nguyen, Vinh-Tiep; Tran, Minh-Triet | IMSearch 2.0: Toward User-centric and Efficient Interactive Multimedia Retrieval System |
Social Events
Welcome Reception (Day 1: 8 January)
We warmly invite attendees to the reception.
- Time: 6:00 PM ~ 8:00 PM (tentative)
- Location: Reception Hall 1
- Refreshments including a variety of foods and drinks will be provided.
Banquet (Day 2: 9 January)
- Time: starts at 6:00 PM (tentative)
- Location: KOTOWA Nara-Koen Premium View
Address
〒630-8374 奈良県奈良市今御門町15
15 Imamikadocho, Nara, 630-8374, Japan
- Food and drinks will be provided.
- Highlight: Kiki-sake (利き酒) will be held as part of the banquet.
- “Kiki-sake” is the Japanese tradition of sake tasting: sampling and evaluating different types of sake to appreciate their flavors, aromas, and characteristics, much like wine tasting in Western cultures. The word ‘kiki’ refers to discerning or distinguishing, and ‘sake’ is Japan’s traditional rice wine. It is often done in a formal setting or as an enjoyable way to explore the rich variety of sake styles.
Accepted Special Sessions
- Spatial Intelligence in Multimedia Analytics (SpIMA)
- Multimedia Research in Robotics
- MLLMA: Multimodal Large Language Models and Applications
- ExpertSUM: Expert-Level Text Summarization from Fine-Grained Multimedia Analytics
- Simulating Edge Computing and Multimodal AI: A Benchmark for Real-World Applications