Abstract
This paper addresses the challenge of traffic monitoring and incident detection in remote areas by using multimodal large language models (LLMs) deployed on edge AI devices. The key novelty of our approach lies in using the LLM to convert real-time video streams into descriptive text, enabling low-bandwidth transmission and reliable detection of anomalies and incidents in environments with intermittent connectivity. The model is developed by fine-tuning an open-source LLM and extending it with multimodal capabilities to analyze video frames. Our work also involves deploying this model on edge devices such as the Nvidia IGX Orin; testing in realistic environments is planned as future work. The methodology includes dataset curation, iterative model fine-tuning and compression, and hardware-based optimization. This approach aims to improve traffic safety and response times in remote areas, marking a significant advancement in the application of AI to traffic monitoring and safety management.