Awards
Proceedings
The proceedings are available on Springer’s site:
- Part I: https://link.springer.com/book/10.1007/978-981-96-2054-8
- Part II: https://link.springer.com/book/10.1007/978-981-96-2061-6
- Part III: https://link.springer.com/book/10.1007/978-981-96-2064-7
- Part IV: https://link.springer.com/book/10.1007/978-981-96-2071-5
- Part V: https://link.springer.com/book/10.1007/978-981-96-2074-6
Keynote Talks
Multimodal, Multilingual Generative AI: From Multicultural Contextualization to Empathetic Reasoning
Dr. Nancy F. Chen
Manga109 and MangaUB: How Far Can Large Multimodal Models (LMMs) Go in Understanding Manga?
Prof. Kiyoharu Aizawa
Multi-modal foundation models in the automotive industry
Dr. Andrei Bursuc
Oral Sessions
Day 1: 8 January
Paper ID | Paper Title | Authors |
---|---|---|
196 | RoLD: Robot Latent Diffusion for Multi-task Policy Modeling | Tan, Wenhui; Liu, Bei; Zhang, Junbo; Song, Ruihua; Fu, Jianlong |
379 | TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration | Li, Yizhou; Liu, Zihua; Monno, Yusuke; Okutomi, Masatoshi |
451 | ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing | Zhang, Zhihui; Pang, Jinhui; Li, Jianan; Hao, Xiaoshuai |
462 | Flat Local Minima for Continual learning on Semantic Segmentation | Huang, Zhongzhan; Liang, Mingfu; Liang, Senwei; Zhong, Shanshan |
Paper ID | Paper Title | Authors |
---|---|---|
268 | AD2AT: Audio Description to Alternative Text, a Dataset of Alternative Text from Movies | Lincker, Elise; Guinaudeau, Camille; Satoh, Shin’ichi |
310 | KuzushijiDiffuser: Japanese Kuzushiji Font Generation with FontDiffuser | YUAN, HONGHUI; YANAI, KEIJI |
167 | Saliency Guided Optimization Of Diffusion Latents | Wang, Xiwen; Zhou, Jizhe; Li, Mao; Zhu, Xuekang; Li, Cheng |
308 | Skin-Adapter: Fine-Grained Skin-Color Preservation for Text-to-Image Generation | Chen, Zhuowei; Huang, Mengqi; Chen, Nan; Mao, Zhendong |
Paper ID | Paper Title | Authors |
---|---|---|
273 | Operatic Singing Voice Synthesis From Inexperienced Voice Considering Tempo and Vowel Change | Sugahara, Aoto; Kishimoto, Soma; Adachi, Yuji; Tai, Kiyoto; Takashima, Ryoichi; Takiguchi, Tetsuya |
129 | Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation | Lv, Yishan; Luo, Jing; Ju, Boyuan; Yang, Xinyu |
430 | WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition | Li, Feng; Luo, Jiusong; Xia, Wanjun |
374 | SPLGAN-TTS: Learning Semantic and Prosody to Enhance the Text-to-Speech Quality of Lightweight GAN Models | Chang, Ding-Chi; Li, Shiou-Chi; Huang, Jen-Wei |
Day 2: 9 January
Paper ID | Paper Title | Authors |
---|---|---|
236 | MineTinyNet-YOLO: An Efficient Small Object Detection Method for Complex Underground Coal Mine Scenarios | Yaling, Hao; Wei, Wu |
436 | Mix-YOLONet: Deep Image Dehazing for Improving Object Detection | Lim, Xin; Wong, Lai-Kuan; Loh, Yuen Peng; Gu, Ke; Lin, Weisi |
411 | Counting Unique Objects in Geo-Tagged Street Images: A Case Study Of Homeless Encampments in Los Angeles | Ghasemi, Narges; Kim, Seon Ho; Alfarrarjeh, Abdullah; Shahabi, Cyrus |
181 | HCV: Lightweight Hybrid CNN-Vision Transformer for Visual Object Tracking | Chen, Liang-Chia; Chu, Wei-Ta |
Paper ID | Paper Title | Authors |
---|---|---|
174 | Detoxification of Unlabeled Dataset: Reducing Implicit Class Imbalance Using Pseudo-Jacobian of GAN’s Generator | Suyama, Kosei; Nakamura, Kazuaki |
244 | Making strides Security in Multimodal Fake News Detection Models: A Comprehensive Analysis of Adversarial Attacks | Si, Jiahua; Wang, Youze; Hu, Wenbo; Liu, Qiang; Hong, Richang |
415 | AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection | Xu, Xiaoman; Li, Xiangrun; Wang, Taihang; Jiang, Ye |
Paper ID | Paper Title | Authors |
---|---|---|
297 | Uncertainty-guided Joint Semi-supervised Segmentation and Registration of Cardiac Images | Chen, Junjian; Yang, Xuan |
337 | Wavelet Integrated Convolutional Neural Network for ECG Signal Denoising | Terada, Takamasa; Toyoura, Masahiro |
392 | MPPQNet: A Moment-Preserving Product Quantization Neural Network for Progressive 3D Point Cloud Transmission | Cheng, Shyi-Chyi; CHEN, YEN-LIN; Li, Shih-Yu |
Day 3: 10 January
Paper ID | Paper Title | Authors |
---|---|---|
218 | A Multi-Expert Collaborative Framework for Multimodal Named Entity Recognition | Xu, Bo; Jiang, Haiqi; Wei, Shouang; Du, Ming; Song, Hui; Wang, Hongya |
266 | SSDL: Sensor-to-Skeleton Diffusion Model with Lipschitz Regularization for Human Activity Recognition | Sharma, Nikhil; Sun, Changchang; Zhao, Zhenghao; Ngu, Anne Hee Hiong; Latapie, Hugo; Yan, Yan |
395 | Open-vocabulary Scene Graph Generation via Synonym-based Predicate Descriptor | Goto, Yuta; Yamazaki, Satoshi; Shibata, Takashi; Liu, Jianquan |
274 | Grounding Deliberate Reasoning in Multimodal Large Language Models | Chen, Jiaxing; Liu, Yuxuan; Li, Dehu; An, Xiang; Deng, Weimo; Feng, Ziyong; Zhao, Yongle; Xie, Yin |
Paper ID | Paper Title | Authors |
---|---|---|
193 | Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models | Huang, Jia-Hong; Zhu, Hongyi; Shen, Yixian; Rudinac, Stevan; Kanoulas, Evangelos |
288 | Enhanced Anomaly Detection in 3D Motion through Language-Inspired Occlusion-Aware Modeling | Li, Su; Wang, Liang; Wang, Jianye; Zhang, Ziheng; Zhang, Junjun; Zhang, Lei |
364 | Evaluating VQA Models' Consistency in the Scientific Domain | C. Quan, Khanh-An; Guinaudeau, Camille; Satoh, Shin’ichi |
Panel Discussion
Paper ID | Paper Title | Authors |
---|---|---|
346 | RobSparse: Automatic Search for GPU-Friendly Robust and Sparse Vision Transformers | Su, Yulan; Zhang, Sisi; Wang, Yan; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui |
232 | Image-Generation AI Model Retrieval by Contrastive Learning-based Style Distance Calculation | Vu, Thi Ngoc Anh; Shoji, Yoshiyuki; Oe, Yuma; PHAM, Huu Long; Ohshima, Hiroaki |
414 | Dynamic Exploration Graph: A Novel Approach for Efficient Nearest Neighbor Search in Evolving Multimedia Datasets | Hezel, Nico; Barthel, Kai Uwe; Schilling, Bruno; Schall, Konstantin; Jung, Klaus |
Poster Sessions
To Presenters
- Please set up your poster after 1:00 PM and before your poster session starts.
Day 1: 8 January 14:00 – 15:30
Poster ID | Paper ID | Paper Title | Authors |
---|---|---|---|
PS1-1 | 120 | Quantized-ViT Efficient Training via Fisher Matrix Regularization | Shang, Yuzhang; Liu, Gaowen; Kompella, Ramana; Yan, Yan |
PS1-2 | 121 | Saliency based data augmentation for few-shot video action recognition | Kong, Yongqiang; Wang, Yunhong; Li, Annan |
PS1-3 | 128 | Hybrid Scalable Video Coding with Neural Compression and Enhancement for Streaming Media | Ye, Yuyao; Yang, Jiayu; Zhao, Yang; Gao, Mengping; Cao, Hongbin; Wang, Ronggang |
PS1-4 | 130 | Pubic Symphysis-Fetal Head Segmentation Network Using BiFormer Attention Mechanism and Multipath Dilated Convolution | Cai, Pengzhou; Jiang, Lu; Li, Yanxin; Liu, Xiaojuan; Lan, Libin |
PS1-5 | 131 | DART: Depth-Enhanced Accurate and Real-Time Background Matting | Li, Guofeng; Li, Hanxi; Li, Bo; Wu, Lin; Cheng, Yan |
PS1-6 | 141 | MLP-AMDC: A MLP Architecture for Adaptive-Mask-based Dual-Camera snapshot hyperspectral imaging | Cai, Zeyu; Chen, Xunhao; Zhang, Can; Chen, Yuchong; Yang, Jiming; Shi, Wubin; Jin, Chengqian; Da, Feipeng |
PS1-7 | 144 | Kiite World: Socializing Map-Based Music Exploration Through Playlist Sharing and Synchronized Listening | Tsukuda, Kosetsu; Takahashi, Takumi; Ishida, Keisuke; Hamasaki, Masahiro; Goto, Masataka |
PS1-8 | 146 | Enhancing Environmental Monitoring through Multispectral Imaging: The WasteMS Dataset for Semantic Segmentation of Lakeside Waste | Zhu, Qinfeng; Weng, Ningxin; Fan, Lei; Cai, Yuanzhi |
PS1-9 | 158 | Frequency-aware Convolution for Sound Event Detection | Song, Tao; Zhang, Wenwen |
PS1-10 | 163 | MSD-YOLO: An efficient algorithm for small target detection | Liu, Dongyu; Zhu, Yuan; Liu, Rui; Xing, Zhecong; Geng, Weiyang; Wang, Yanqiang |
PS1-11 | 166 | Robust Active Speaker Detection in Challenging Environments Using GNN-Fused Multi-Modal Cues and Body Language | Li, Yongqian; Luo, Yong; Zhou, Xin |
PS1-12 | 172 | Intra-Class Compact Facial Expression Recognition Based on Amplitude Phase Separation | Tian, Xiang; Zhang, Yuan; Mu, Chang; Zhang, Ziyang |
PS1-13 | 176 | PA2Net: Pyramid Attention Aggregation Network for Saliency detection | Yu, Jizhe; Liu, Yu; Wu, Xiaoshuai; Xu, Kaiping; Li, Jiangquan |
PS1-14 | 188 | LIESA: Low-light Image Enhancement with Semantic Awareness | Zhang, Jingyao; Hao, Shijie; Sun, Fuming; Rao, Yuan |
PS1-15 | 195 | Deep Dual Internal Learning for Hyperspectral Image Super-Resolution | Sun, Yongqing; Liu, Hong; Chang, Qiong; Han, Xianhua |
PS1-16 | 198 | Zero-shot sketch-based image retrieval with hybrid information fusion and sample relationship modeling | Wu, Weijie; Li, Jun; Wu, Zhijian; Xu, Jianhua |
PS1-17 | 206 | The Right to an Explanation under the GDPR and the AI Act | Juliussen, Bjørn Aslak |
PS1-18 | 221 | Improving singing voice transcription generalization with AI generated accompaniments | Perez, Miguel; Kirchhoff, Holger; Grosche, Peter; Serra, Xavier |
PS1-19 | 228 | LITA: LMM-guided Image-Text Alignment for Art Assessment | Sunada, Tatsumi; Shiohara, Kaede; Xiao, Ling; Yamasaki, Toshihiko |
PS1-20 | 229 | Towards Inclusive Education: Multimodal Classification of Textbook Images for Accessibility | Yadav, Saumya; Lincker, Élise; Huron, Caroline; Martin, Stéphanie; Guinaudeau, Camille; Satoh, Shin’ichi; Shukla, Jainendra |
PS1-21 | 296 | GWUNet: A UNet with Gated Attention and Improved Wavelet Transform for Thyroid Nodules Segmentation | Zheng, Shuijing; Yu, Suxi; Wang, Yi; Wen, Jing |
PS1-22 | 111 | SCLSTE: Semi-Supervised Contrastive Learning-Guided Scene Text Editing | Yin, Min; Xie, Liang; Liang, HaoRan; Zhao, Xing; Chen, Ben; Liang, RongHua |
Day 2: 9 January 13:30 – 15:00
Poster ID | Paper ID | Paper Title | Authors |
---|---|---|---|
PS2-1 | 192 | Comparative Analysis of Relevance Feedback Techniques for Image Retrieval | Vadicamo, Lucia; Scotti, Francesca; Dearle, Alan; Connor, Richard |
PS2-2 | 241 | Understanding the Roles of Visual Modality in Multimodal Dialogue: An Empirical Study | Cao, Qian; Song, Ruihua; Chen, Xu |
PS2-3 | 242 | DistillSleep: Leverage Self-Distillation to Improve Performance After Representation Learning for Sleep Staging | Yu, Le; Zhang, Xianchao; Qian, Shuxia; Sun, Hong |
PS2-4 | 246 | Temporal Closeness for Enhanced Cross-Modal Retrieval of Sensor and Image Data | Yamamoto, Shuhei; Kando, Noriko |
PS2-5 | 247 | An Analytical Method for Rendering Plenoptic Cameras 2.0 on 3D Multi-Layer Displays | Losfeld, Armand; Seznec, Nicolas; Van Bogaert, Laurie; Lafruit, Gauthier; Teratani, Mehrdad |
PS2-6 | 251 | QRALadder: QoE and Resource Consumption-Aware Encoding Ladder Optimization for Live Video Streaming | Zhu, Yingqian; Gao, Guanyu |
PS2-7 | 256 | Boosting Human Pose Estimation via Heatmap Refinement | Jiang, Ling; Liu, Zhuocheng; Li, Kaige; Wu, Wei |
PS2-8 | 265 | FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation | Imajuku, Yuki; Yamakata, Yoko; Aizawa, Kiyoharu |
PS2-9 | 283 | LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets | Wang, Qing; Ngo, Chong Wah; Lim, Ee-Peng; Sun, Qianru |
PS2-10 | 292 | Music2MIDI: Pop Music to MIDI Piano Cover Generation | Yip, Tin Yui; Chau, Chuck-jee |
PS2-11 | 293 | Balancing Efficiency and Accuracy: An Analysis of Sampling for Video Copy Detection | Chen, Xiangyu; Satoh, Shin’ichi |
PS2-12 | 295 | One-Shot Generative Domain Adaptation by Constructing Self-Amplifying Datasets | Xiang, Yanru; Li, Yi |
PS2-13 | 306 | Visual Anomaly Detection on Topological Connectivity under Improved YOLOv8 | Li, Yu; Xie, Zhenping |
PS2-14 | 315 | HierArtEx: Hierarchical Representations and Art Experts Supporting the Retrieval of Museums in the Metaverse | Falcon, Alex; Abdari, Ali; Serra, Giuseppe |
PS2-15 | 317 | DocMamba: Robust Document Image Dewarping via Selective State Space Sequence Modeling | Han, Miaolin; Li, Huibin |
PS2-16 | 326 | Real-Time Action Detection in Volleyball Matches Using DETR Architecture | Shih, Mu-Jan; Hsu, Yi-Yu |
PS2-17 | 332 | Select and Order: Enhancing Few-Shot Image Classification through In-Context Learning | Huang, Hujiang; Xie, Yu; Gao, Jun; Fan, Chuanliu; Cao, Ziqiang |
PS2-18 | 336 | SMG-Diff: Adversarial Attack Method Based on Semantic Mask-Guided Diffusion | Zhang, Yongliang; Liu, Jing |
PS2-19 | 344 | Dual-Task Feedback Learning for Tongue Detection via Super-Resolution Integration | Sun, Ying; Wei, Meiyi; Chen, Gang |
PS2-20 | 354 | Towards Visual Storytelling by Understanding Narrative Context through Scene-Graphs | Phueaksri, Itthisak; Kastner, Marc A.; Kawanishi, Yasutomo; Komamizu, Takahiro; Ide, Ichiro |
PS2-21 | 456 | AMFT-YOLO: A Adaptive Multi-Scale YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes | Wang, Tiebiao; Li, Xiaoyang; Cui, Zhenchao |
PS2-22 | 276 | Lightweight Dual Grouped Large-Kernel Convolutions for Salient Object Detection Network | Liu, Jiajie; Zhang, Zhibin |
PS2-23 | 312 | Modeling High-order Relationships between Human and Video for Emotion Recognition | Ai, Hanxu; Tao, Xiaomei; Li, Xingbing; Gan, Yanling |
DP | 117 | EIA: Edge-aware Imperceptible Adversarial Attacks on 3D Point Clouds | Wang, Zhensu; Peng, Weilong; Wang, Le; Wu, Zhizhe; Zhu, Peican; Tang, Keke |
DP | 127 | MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms | Zhang, Jiahao; Gao, Guangyu; Zhao, Xiao |
DP | 140 | Infrared Small Target Detection with Feature Refinement and Context Enhancement | Li, Xiuhong; Zhu, Xinyue; Li, Boyuan; Li, Songlin; Wang, Luyao; Jia, Zhenhong |
DP | 173 | Modality-Specific Hashing: Transform Cross-Modal Retrieval into Single-Modal Retrieval | Ding, Guohui; Li, Zhonghua; Ren, Yongqiang |
DP | 178 | Multimodal Prompt Learning for Audio Visual Scene-aware Dialog | Xu, Feifei; Jia, Fumiaoyue; Zhou, Wang |
DP | 182 | MSA-Former: Multi-Scale Adaptive Transformer for Image Snow Removal | Wang, Bin; Chen, Zekun; Zhang, Lei; Liang, Shili; Guo, Sijia; Kang, Xinyu; Li, Huajing |
DP | 184 | SES-Net: Multi-dimensional Spot-Edge-Surface Network for Nuclei Segmentation | Lu, Congjian; Zhou, Shuwang; Shan, Ke; Zhang, Hongkuan; Liu, Zhaoyang |
DP | 189 | PianoPal: A Robotic Multimedia System for Interactive Piano Instruction Based on Q-learning and Real-time Feedback | Wang, Yufei; Yao, Junfeng; Wang, Zefeng |
DP | 199 | CLIP Multi-modal Hashing for Multimedia Retrieval | Zhu, Jian; Sheng, Mingkai; Huang, Zhangmin; Chang, Jingfei; Long, Jian; Jiang, Jinling; Liu, Lei; Luo, Cheng |
DP | 223 | Integrating S1&S2 Framework for Enhanced Semantic Match in Person Re-identification | Yang, Xiukang; Ge, Jingguo; Li, Hui; Li, Liangxiong; Wu, Bingzhen |
DP | 237 | Hyper-NeuS: Hypernetworks for Neural SDF Implicit Surface Reconstruction by Volume Rendering | Li, Jingkun; Qi, Na; Zhu, Qing |
DP | 253 | Structural Information-guided Fine-grained Texture Image Inpainting | Fang, Zhiyi; Qian, Yi; Dai, Xiyue |
DP | 272 | GFA-UDIS: Global-to-Flow Alignment for Unsupervised Deep Image Stitching | Han, Sijia; Zhang, Zhibin |
DP | 275 | Joint Decision Network with Modality-Specific and Dual Interactive Features for Fake News Detection | Wu, Fei; Zhou, Ruixuan; Ji, Yimu; Jing, Xiao-Yuan |
DP | 277 | MS-SAM: Multi-Scale SAM based on Dynamic Weighted Agent Attention | Yang, Enhui; Zhang, Zhibin |
DP | 281 | Multi-Modal Information Multi-Angle Mining For Multimedia Recommendation | ZHU, YIJIE; Li, MingYong |
DP | 305 | MambaTalk: Speech-driven 3D Facial Animation with Mamba | Zhu, Deli; Xu, Zhao; Yang, Yunong |
Day 3: 10 January 13:30 – 15:00
Poster ID | Paper ID | Paper Title | Authors |
---|---|---|---|
PS3-1 | 356 | Rotation Methods for 360-degree Videos in Virtual Reality - A Comparative Study | Hürst, Wolfgang; Zeches, Leo |
PS3-2 | 360 | Camouflaged Object Detection Based on Localization Guidance and Multi-Scale Refinement | Wang, JinYang; Wu, Wei |
PS3-3 | 362 | Poseidon: A NAS-Based Ensemble Defense Method against Multiple Perturbations | Su, Yulan; Zhang, Sisi; Lin, Zechao; Wang, Xingbin; Zhao, Lutan; Meng, Dan; Hou, Rui |
PS3-4 | 363 | MM-CARP: Multimodal Model with Cross-modal retrieval-Augmented and visual Region Perception | Guo, Junhao; Fu, Chenhan; Wang, Guoming; Lu, Rongxing; Chen, Dong; Tang, Siliang |
PS3-5 | 365 | Revisit Data Association in Semantic SLAM Systems for Autonomous Parking | Shao, Xuan; Huang, Leming; Liu, Xinghua |
PS3-6 | 368 | Lightweight Motion-Aware Video Super-Resolution for Compressed Videos | KWON, ILHWAN; Li, Jun; Shah, Rajiv Ratn; Prasad, Mukesh |
PS3-7 | 373 | Vision-Language Pretraining for Variable-shot Image Classification | Papadopoulos, Sotirios; Ioannidis, Konstantinos; Vrochidis, Stefanos; Kompatsiaris, Ioannis; Patras, Ioannis |
PS3-8 | 377 | A Multi-Aspect Multi-Granularity Pronunciation Assessment Method Based on Branchformer Encoder and Hierarchical Aggregation | Du, Wenxu; Wumaier, Aishan; Shi, Yahui; Yi, Nian; Liu, Dehua |
PS3-9 | 386 | SCANet: Semantic Coherence Attention Network for Clothing Change Person Re-identification | Yang, Dajiang; Wu, Wei; Lee, Yuxing |
PS3-10 | 417 | Toward A Full Pipeline Approach to Autonomous Drone Landing Site Identification: From Terrain Survey to Embedded Classifier | Springer, Joshua David; Guðmundsson, Gylfi Þór; Kyas, Marcel |
PS3-11 | 429 | Innovative Lifelog Visualization and Exploration in Virtual Reality - A Comparative Study | Hürst, Wolfgang; Visser, Yannick |
PS3-12 | 435 | Synchronization and Calibration of Video Sequences acquired using Multiple Plenoptic 2.0 Cameras | Bonatto, Daniele; Fernandes Pinto Fachada, Sarah; Sancho, Jaime; Juarez, Eduardo; Lafruit, Gauthier; Teratani, Mehrdad |
PS3-13 | 444 | A Dual-Branch Model for Color Constancy | Chen, Zhaoxin; Ma, Bo |
PS3-14 | 445 | Data-free Functional Projection of Large Language Models onto Social Media Tagging Domain | Mu, Wenchuan; Lim, Kwan Hui |
PS3-15 | 455 | MDT-Net: a mask decoder tuning strategy for CLIP-based zero-shot 3D Classification | Yan, Hao; Bai, Jing |
PS3-16 | 458 | Optimally Planning Drone Trajectory to Capture a 3D Gaussian Splatting Object | Wu, Cheng-Yuan; Sun, Yuan-Chun; Lee, Cheng-Tse; Hsu, Cheng-Hsin |
PS3-17 | 230 | Quantifying Image-Adjective Associations by Leveraging Large-Scale Pretrained Models | Matsuhira, Chihaya; Kastner, Marc A.; Komamizu, Takahiro; Hirayama, Takatsugu; Ide, Ichiro |
PS3-18 | 137 | Can masking background and object reduce static bias for zero-shot action recognition? | Fukuzawa, Takumi; Hara, Kensho; Kataoka, Hirokatsu; Tamaki, Toru |
PS3-19 | 355 | CalorieVoL: Integrating Volumetric Context into Multimodal Large Language Models for Image-based Calorie Estimation | Tanabe, Hikaru; Yanai, Keiji |
PS3-20 | 416 | Multimodal Engagement Prediction in Human-Robot Interaction using Transformer Neural Networks | Lim, Jia Yap; See, John; Dondrup, Christian |
PS3-21 | 431 | What Should Autonomous Robots Verbalize and What Should They Not? | Yoshihara, Daichi; Yuguchi, Akishige; Kawano, Seiya; Iio, Takamasa; Yoshino, Koichiro |
PS3-22 | 438 | BiCA-YOLO: Bidirectional Feature Enhancement and Cross Coordinate Attention for Small Object Detection | Lv, Jinyan; Xiao, Guoqiang |
DP | 307 | Frequency-Based Unsupervised Low-Light Image Enhancement Framework | Wang, Haodian |
DP | 309 | Target-Oriented Dynamic Denoising Curriculum Learning for Multimodel Stance Detection | Suo, Zihao; Pan, Shanliang |
DP | 316 | Noise-robust Separating Multi-source Aliased Vibration Signal Based on Transformer Demucs | Jiang, Wanchang; Jiang, Yuxin |
DP | 321 | gFlow: Distributed Real-Time Reverse Remote Rendering System Model | Xu, Yixiao; Li, Yubo; Xu, Wanzhao; Gu, Yicheng; Wang, Yun; Ma, Jiangyuan; Qi, Zhengwei |
DP | 331 | BLCC: A Benchmark for Multi-LiDAR and Multi-Camera Calibration | Minghui, Hou; Gang, Wang; Zhiyang, Wang; Tongzhou, Zhang; Baorui, Ma |
DP | 342 | MC-YOLO: Multi-scale Transmission Line Defect Target Recognition Network | Wang, Jingdong; Ding, XU; Meng, Fanqi |
DP | 350 | A Novel Human Abnormal Posture Detection Method Based on Spatial-Topological Feature Fusion of Skeleton | Ma, Yuefeng; Cheng, Zhiqi; Liu, Deheng; Tang, Shiying |
DP | 359 | SSCDUF: Spatial-Spectral Correlation Transformer Based on Deep Unfolding Framework for Hyperspectral Image Reconstruction | Zhao, Hui; Qi, Na; Zhu, Qing; Lin, Xiumin |
DP | 383 | Cross-View Geo-Localization via Learning Correspondence Semantic Similarity Knowledge | Chen, Guanli; Huang, Guoheng; Yuan, Xiaochen; Chen, Xuhang; Zhong, Guo; Pun, Chi-Man |
DP | 385 | Style Separation and Content Recovery for Generalizable Sketch Re-identification and A New Benchmark | Lu, Lingyi; Xu, Xin; Wang, Xiao |
DP | 387 | Chain of Thought Guided Few-shot Fine-tuning of LLMs for Multimodal Aspect-based Sentiment Classification | Wu, Hao; Yang, Danping; Liu, Peng; Li, Xianxian |
DP | 393 | Progressive Neural Architecture Generation with Weaker Predictors | Zhang, Zhengzhuo; Zhuang, Liansheng |
DP | 420 | Self-Supervised Reference-based Image Super-Resolution with Conditional Diffusion Model | Shi, Shuai; Qi, Na; Li, Yezi; Zhu, Qing |
DP | 447 | TPS-YOLO: The Efficient Tiny Person Detection Network Based on Improved YOLOv8 and Model Pruning | Yao, Li; Huang, Qianni; Wan, Yan |
DP | 460 | MICAN: Multi-modal Inconsistency-based Cooperation Attention Network for fake news detection | Yi, Zepu; Lu, Songfeng; Tang, Xueming; Zhu, Jianxin; Wu, Junjun |
DP | 214 | TACST: Time-Aware Transformer for Robust Speech Emotion Recognition | Wei, Wei; Zhang, Bingkun; Wang, Yibing |
DP | 215 | TS-MEFM: A New Multimodal Speech Emotion Recognition Network Based on Speech and Text Fusion | Wei, Wei; Zhang, Bingkun; Wang, Yibing |
Demonstrations: Day 2 & 3 (9 and 10 January 13:30 – 15:00)
Demo ID | Paper ID | Title | Authors |
---|---|---|---|
D01 | 468 | SelectSum: Topic-Based Selective Summarization of Speech-Based Videos | Wattasseril, Jobin Idiculla; Döllner, Jürgen |
D02 | 469 | Real-time Visualizer for Turntablist Performance | Hamanaka, Masatoshi |
D03 | 494 | Multi-Dimensional Exploration of Media Collection Metadata | Khan, Omar Shahbaz; Duane, Aaron; Hasnan, Hariz; Blavec, Noé Le; Ouvrard, Pierre; Verdon, Johan; d’Orazio, Laurent; Thierry, Constance; Jónsson, Björn Þór |
D04 | 470 | DriveCoach: Smart Driving Assistance with Multimodal Risk Prediction and Risk Adaptive Behavior Recommendation | Gan, Wenbin; Dao, Minh-Son; Zettsu, Koji |
D05 | 472 | System Demo of Modeling Smart University Campus Virtual Environments | Fernandez Roblero, Jaime Boanerjes; Ali, Muhammad Intizar |
D06 | 473 | AMDA: Advancing Multimedia Data Annotation for human-centric situations | Mohamed Serouis, Ibrahim; Sèdes, Florence |
D07 | 475 | FencBuddy: Action-aware Depth Perception Training for Fencing Attacks | HUNG-YAO, PENG; ZI-HENG, ZHONG; CHENG-CHIH, TSAI; CHING-YEH, CHIANG; TSE-YU, PAN |
D08 | 477 | WaveFontStyler: Font Style Transfer Based on Sound | Izumi, Kota; Yanai, Keiji |
D09 | 479 | Training a Segmentation-based Visual Anonymization Service for Street Scenes | Korb, Martin; Bailer, Werner |
D10 | 481 | CleverFox: Integrating Visual Mnemonics with AI for Enhanced Language Learning | Chiang, Yung-Chu; Tang, Zi-Xian; Luo, Yi-Ching; Chang, Jason S. |
D11 | 482 | Fingering Prediction for Classical Guitar: Dataset Creation and Model Development | Iino, Nami; Iino, Akinaru |
D12 | 483 | An Implementation of Networked JamSketch | Kitahara, Tetsuro; Tsutsumi, Takuya; Nagoshi, Takaaki; Suzuki, Taizan |
D13 | 485 | Using Language Models to Generate and Forget the Narrative Memories of an Assistive Robot | Garcia Contreras, Angel Fernando; Chang, Wen-Yu; Kawano, Seiya; Chen, Yun-Nung; Yoshino, Koichiro |
D14 | 486 | Better Image Segmentation with Classification: Guiding Zero-Shot Models Using Class Activation Maps | Borgli, Hanna; Stensland, Håkon Kvale; Halvorsen, Pål |
D15 | 488 | Transformer-Based Audio Generation Conditioned by 2D Latent Maps: A Demonstration | Limberg, Christian; Zhang, Zhe; Kastner, Marc A. |
D16 | 489 | KuzushijiFontDiff: Diffusion Model for Japanese Kuzushiji Font Generation | YUAN, HONGHUI; YANAI, KEIJI |
D17 | 490 | SceneTextStyler: Editing Text with Style Transformation | YUAN, HONGHUI; YANAI, KEIJI |
D18 | 492 | Multimodal Interoperability with the CLAMS Platform | Lynch, Kelley; Rim, Kyeongmin; King, Owen; Pustejovsky, James |
D19 | 493 | Enhancing User Control in AI-Based Video Summarization for Social Media | Kontostathis, Ioannis; Apostolidis, Evlampios; Apostolidis, Konstantinos; Mezaris, Vasileios |
D20 | 496 | Movie Retrieval Systems Using Genre-guided Multimodal Learning Techniques | Huang, Wei-Lun; Hidayati, Shintami Chusnul; Pan, Tse-Yu |
D21 | 497 | A User Identification and Reading Style Detection System Based on Eye Movement Patterns During Reading | Kongmeesub, Onanong; Gurrin, Cathal; Nie, Dongyun |
D22 | 484 | Federated Learning with Multimodal-Sensing and Knowledge Distillation: An application on real-world benchmark dataset | Le, Duy-Dong; Huynh, Duy-Thanh; Bao, Pham The |
D23 | 499 | Efficient Deployment of Multimodal AI Models: Leveraging Pruning, Quantization and Multi-Objective Optimization for Edge Computing | Vu, Dang; Dang, Tien; Nguyen, Quoc-Trung; Pham, Tan |
D24 | 466 | Badminton Footwork Practice via an Immersive Virtual Reality System | Jheng, Duen-Chian; Harchan, Bill Louis; Kostka de Sztemberg, Berenika Nawoja; Hsu, Jen-Hao; Hu, Min-Chun |
D25 | 480 | RoboDJ: Live Commentary Robots System Driven by Physical- and Cyber-world Observations | Kawanishi, Yasutomo; Nakamura, Yutaka; Shintani, Taiken; Ishi, Carlos T.; Kawano, Seiya; Yoshino, Koichiro; Minato, Takashi; Minoh, Michihiko |
D26 | 487 | Leveraging Latent Diffusion in 3D Gaussian Splatting for Novel View Synthesis | Li, Bohan; Li, Xingyi; Liang, Yangwen; Wang, Shuangquan; Song, Kee-Bong |
VBS: Video Browser Showdown: Day 1 (8 January)
Paper ID | Authors | Title |
---|---|---|
406 | Nguyen-Ho, Thang-Long; Huynh, Viet-Tham; Kongmeesub, Onanong; Tran, Minh-Triet; Nie, Dongyun; Healy, Graham; Gurrin, Cathal | VEAGLE: Eye Gaze-Assisted Guidance for Video Browser Showdown |
501 | Tran, Quang-Linh; Nguyen, Binh; Jones, Gareth J. F.; Gurrin, Cathal | VideoEase at VBS2025: An Interactive Video Retrieval System |
502 | Rossetto, Luca; Gasser, Ralph | Feature-driven Video Segmentation and Advanced Querying with vitrivr-engine |
503 | Nguyen, Tai; Vo, Anh Ngoc Minh; Pham, Dat Duc; Tran, Vinh Quang; Duong, Nhu Thi Quynh; Le, Tien Anh; Le, Tan Duy; Nguyen, Binh T. | HORUS: Multimodal Large Language Models Framework for Video Retrieval at VBS 2025 |
504 | CHENG, Yu Tong; WU, Jiaxin; MA, Zhixin; HE, Jiangshan; WEI, Xiao-Yong; NGO, Chong Wah | Interactive Video Search with Multi-modal LLM Video Captioning |
505 | Le, Huy M.; Nguyen Tien, Dat; Le Duy, Khang; Nguyen Dang Quang, Tuan; Nguyen Khanh, Toan; Nguyen, Binh T. | FUSIONISTA: Fusion of 3-D Information of Video in Retrieval System |
506 | C. Quan, Khanh-An; Ngoc Nguyen, Qui; Tran, Minh-Triet | ViFi: A Video Finding System at Video Browser Showdown 2025 |
507 | Vuong, Gia-Huy; Ho, Van-Son; Nguyen-Dang, Tien-Thanh; Thai, Xuan-Dang; Ho-Le, Minh-Quan; Le, Tu-Khiem; Pham, Minh-Khoi; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet | ViewsInsight2.0: Enhancing Video Retrieval for VBS 2025 with an Automatic Query Generator Powered by Large Language Models |
508 | Pantelidis, Nick; Georgalis, Dimitris; Pegia, Maria; Galanopoulos, Damianos; Apostolidis, Konstantinos; Stavrothanasopoulos, Klearchos; Moumtzidou, Anastasia; Gkountakos, Konstantinos; Gialampoukidis, Ilias; Vrochidis, Stefanos; Mezaris, Vasileios; Kompatsiaris, Ioannis | VERGE in VBS 2025 |
509 | Sharma, Ujjwal; Khan, Omar Shahbaz; Rudinac, Stevan; Jónsson, Björn Þór | Exquisitor at the Video Browser Showdown 2025: Unifying Conversational Search and User Relevance Feedback |
510 | Spiess, Florian; Rossetto, Luca; Schuldt, Heiko | Simplified Video Retrieval in Virtual Reality with vitrivr-VR |
511 | Leopold, Mario; Schöffmann, Klaus | diveXplore at the Video Browser Showdown 2025 |
512 | Tran Gia, Bao; Bui Cong Khanh, Tuong; Le Thi Thanh, Tam; Tran Doan, Thuyen; Le Tran Trong, Khiem; Do, Tien; Mai, Tien-Dung; Duc Ngo, Thanh; Le, Duy-Dinh; Satoh, Shin’ichi | NII-UIT at VBS2025: Multimodal Video Retrieval with LLM Integration and Dynamic Temporal Search |
513 | Stroh, Michael; Kloda, Vojtěch; Verner, Benjamin; Vopálková, Zuzana; Buchmüller, Raphael; Jäckl, Bastian; Lokoč, Jakub; Hajko, Jakob | PraK Tool V3: Enhancing Video Item Search Using Localized Text and Texture Queries |
514 | Arnold, Rahel; Kempf, Rahel; Waltenspül, Raphael; Schuldt, Heiko | MediaMix: Multimedia Retrieval in Mixed Reality |
515 | Ho-Le, Minh-Quan; Ho, Duy-Khang; Do-Huu, Huy-Hoang; Le-Hinh, Nhut-Thanh; Vo-Hoang, Hoa-Vien; Ninh, Van-Tu; Gurrin, Cathal; Tran, Minh-Triet | SnapSeek 2.0 at Video Browser Showdown 2025 |
517 | Luu, Duc-Tuan; C. Quan, Khanh-An; Nguyen, Duy-Ngoc; Bui-Le, Khanh-Linh; Doan, Nhat-Sang; Le-Ngo, Minh-Duc; Nguyen, Vinh-Tiep; Tran, Minh-Triet | IMSearch 2.0: Toward User-centric and Efficient Interactive Multimedia Retrieval System |
Social Events
Welcome Reception (Day 1: 8 January)
We warmly invite attendees to the reception.
- Time: 6:00 PM – 8:00 PM (tentative)
- Location: Reception Hall 1
- Refreshments, including a variety of food and drinks, will be provided.
Banquet (Day 2: 9 January)
- Time: starts at 6:00 PM (tentative)
- Location: KOTOWA Nara-Koen Premium View
Address
〒630-8374 奈良県奈良市今御門町15
15 Imamikadocho, Nara, 630-8374, Japan
- Food and drinks will be provided.
- Highlight: Kiki-sake (利き酒) will be held as part of the banquet.
- Kiki-sake is the Japanese tradition of sake tasting. It involves sampling and evaluating different types of sake to appreciate their flavors, aromas, and characteristics, much like wine tasting in Western cultures. The word ‘kiki’ refers to discerning or distinguishing, and ‘sake’ is Japan’s traditional rice wine. It is often held in a formal setting or as an enjoyable way to explore the rich variety of sake styles.
Accepted Special Sessions
- Spatial Intelligence in Multimedia Analytics (SpIMA)
- Multimedia Research in Robotics
- MLLMA: Multimodal Large Language Models and Applications
- ExpertSUM: Expert-Level Text Summarization from Fine-Grained Multimedia Analytics
- Simulating Edge Computing and Multimodal AI: A Benchmark for Real-World Applications