
Yen-Hua Lu, Ching-Hung Lee

Abstract

With the rapid advancement of large language models (LLMs), intelligent chatbots are increasingly being adopted for maintenance documentation, fault diagnosis, and personnel training. This study introduces a multimodal Retrieval-Augmented Generation (RAG) chatbot designed to provide accurate, natural-language support for robotic arm maintenance tasks. The system separates textual and visual content from maintenance manuals and processes them through two complementary pipelines. Caption RAG employs a vision-language model (VLM) to generate contextual captions for images, improving the retrieval of relevant documents. VLM RAG then integrates the retrieved text and associated images, using GPT-4o to deliver more precise and context-aware answers. To address industrial data-privacy concerns, the system also supports local deployment with open-source LLaMA models and Taiwan's TAIDE model. The evaluation dataset was curated and validated by senior experts from an industrial robotic arm manufacturer, ensuring strong domain alignment. Experimental results show high accuracy: 96% with GPT-4o, 92% with LLaMA 8B, and 74.67% with TAIDE 8B. Incorporating visual context via VLM RAG further improved performance to 96.67%, highlighting the benefit of multimodal integration. In summary, the proposed chatbot enhances maintenance efficiency and fault resolution while preserving data privacy, making it a practical solution for real-world industrial deployment.
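
The two-stage flow described in the abstract can be sketched in a few functions. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it uses the OpenAI Python SDK for both the captioning VLM and the answering model, represents manual chunks as plain dictionaries, and substitutes a toy keyword-overlap ranking for a real vector-store retriever. All function names and the data layout are hypothetical.

```python
# Sketch of the Caption RAG / VLM RAG flow described in the abstract (assumptions noted above).
import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment


def _encode_image(image_path: str) -> str:
    """Read an image file and return a base64 data URL payload."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()


def caption_image(image_path: str) -> str:
    """Caption RAG step: ask a vision-language model to describe a manual figure for indexing."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this robotic-arm maintenance figure for retrieval indexing."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{_encode_image(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def retrieve(query: str, docs: list[dict], k: int = 3) -> list[dict]:
    """Toy retriever: rank manual chunks (text plus image caption) by keyword overlap."""
    q = set(query.lower().split())
    def score(d: dict) -> int:
        return len(q & set((d["text"] + " " + d.get("caption", "")).lower().split()))
    return sorted(docs, key=score, reverse=True)[:k]


def answer(query: str, retrieved: list[dict]) -> str:
    """VLM RAG step: pass retrieved text and the associated images to GPT-4o for the final answer."""
    content = [{"type": "text",
                "text": f"Question: {query}\n\nManual excerpts:\n"
                        + "\n".join(d["text"] for d in retrieved)}]
    for d in retrieved:
        if "image_path" in d:
            content.append({"type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{_encode_image(d['image_path'])}"}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```

In this sketch, local deployment with LLaMA or TAIDE would replace the GPT-4o calls with a locally hosted model endpoint; the retrieval and prompting structure stays the same.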


Keywords

Multimodal, RAG, Large language model, Vision-language model, Chatbot, Maintenance manuals, Robotic arm
