Combat Ready Vision AI

Introducing Combat Ready Vision AI

Artificial Intelligence (AI) is transforming modern military operations by enhancing intelligence gathering, situational awareness, and decision-making. This project focuses on developing an AI-powered video captioning model to generate real-time textual descriptions from military footage. By automating intelligence reporting, our solution aims to improve efficiency, reduce operational risks, and address manpower constraints in military intelligence. Through advanced machine learning techniques, our model will enable rapid analysis of video data, ensuring timely and accurate insights for mission-critical operations.

Team members

Teo Keng Wee, Skyler (EPD), Shaun Tan Ke Jia (ESD), Sherin Karuvallil Saji (CSD), Siddharth Ganesh (CSD), Jash Jignesh Veragiwala (CSD), Jerel Ong Bao Xiang (CSD)

Instructors:

  • Mahamarakkalage Dileepa Yasas Fernando

Writing Instructors:

  • Rashmi Kumar

  • Bernard Tan

Problem With Manual Report Writing

In military operations, timely and accurate intelligence reporting is critical for decision-making. However, the current manual report-filling process for documenting military observations and surveillance faces the following challenges:

Limited Accuracy in Scene Understanding

Traditional surveillance and reconnaissance methods struggle with accurate object detection and scene captioning, making it difficult to extract meaningful intelligence from raw video data.

Cognitive Overload on Soldiers

Manually analyzing combat scenes and writing reports under high-stress conditions burdens soldiers, leading to fatigue, slower responses, and reduced mission effectiveness.

Delayed Decision-Making

Military operations require quick decision-making, but existing systems do not provide automated, real-time insights from combat footage.

Introducing DeepSub

Transforming Battlefield Footage into Actionable Intelligence.

Our Solution

Multimodal Model

What is a Multimodal Model?

InternVL is a powerful multimodal model that processes both visual and textual inputs, enabling deeper scene understanding than traditional image captioning models. Unlike single-modality systems that rely solely on images or text, multimodal models like InternVL can reason across both, making them more effective for complex tasks like visual question answering and context-aware image captioning.

InternVL: Bridging Vision and Language

InternVL is a cutting-edge multimodal model that seamlessly integrates visual and textual data, enabling deep comprehension of scenes and their contextual meanings. It processes an image or video frame alongside textual prompts to produce accurate, descriptive, and context-aware outputs. InternVL stands out due to its large-scale pretraining across diverse datasets and its ability to handle open-domain and task-specific challenges effectively.

How We Use InternVL in Our Project

In our project, InternVL is the core engine for battlefield scene interpretation. We fine-tuned it on a curated military dataset to help the model specialize in detecting and understanding combat-related elements—like military vehicles, personnel, weapons, and formations. By feeding each frame of a video into InternVL, the model generates precise and context-sensitive captions. These captions are then compiled into structured reports, automating the task of analyzing combat footage and enabling quicker, data-driven decision-making in mission-critical environments.
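The report-compilation step described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the `compile_report` function and the sample captions are hypothetical, and in practice the per-frame captions would come from InternVL rather than a hard-coded list.

```python
from datetime import timedelta

def compile_report(frame_captions, fps=1.0):
    """Merge consecutive identical captions into timestamped report entries.

    frame_captions: one caption string per sampled frame, in order.
    fps: rate at which frames were sampled (frames per second).
    """
    entries = []
    start = 0
    for i, cap in enumerate(frame_captions):
        # Close the current entry whenever the scene description changes
        # (or at the end of the footage).
        if i + 1 == len(frame_captions) or frame_captions[i + 1] != cap:
            t0 = timedelta(seconds=start / fps)
            t1 = timedelta(seconds=(i + 1) / fps)
            entries.append(f"[{t0} - {t1}] {cap}")
            start = i + 1
    return "\n".join(entries)

captions = [
    "Two armoured vehicles moving east along a dirt road",
    "Two armoured vehicles moving east along a dirt road",
    "Personnel dismounting from the lead vehicle",
]
print(compile_report(captions, fps=1.0))
```

Grouping identical consecutive captions keeps the report concise: a scene that is stable for a minute produces one timestamped entry rather than sixty near-duplicate lines.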

Finetuning Workflow
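As one concrete illustration of what a parameter-efficient fine-tuning setup for a model of this size might look like, the sketch below shows a LoRA-style configuration. Every value here is hypothetical — the model identifier, dataset path, and hyperparameters are illustrative placeholders, not the settings actually used in the project.

```python
# Illustrative fine-tuning configuration. All values are hypothetical
# placeholders, not the project's published hyperparameters.
finetune_config = {
    "base_model": "OpenGVLab/InternVL2-8B",  # assumed model identifier
    "method": "lora",                        # parameter-efficient fine-tuning
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "learning_rate": 2e-5,
    "epochs": 3,
    "batch_size": 8,
    "dataset": "curated_military_captions.jsonl",  # hypothetical path
}
```

LoRA-style approaches are a common choice for specializing large vision-language models on a narrow domain, since they update only a small set of adapter weights rather than the full model.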

Hardware Used

Camera for Real-Time Video Capture

A battlefield-mounted camera captures live video streams, providing real-time input to the model for scene analysis and automated reporting.
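A live camera typically produces far more frames than a captioning model can process, so real-time operation implies sampling the stream. The helper below is a simple sketch of that idea — the function name and parameters are illustrative, not part of the actual system.

```python
def frames_to_caption(total_frames, video_fps, captions_per_second=1.0):
    """Pick evenly spaced frame indices so the captioner keeps pace
    with the live stream instead of processing every frame."""
    step = max(1, round(video_fps / captions_per_second))
    return list(range(0, total_frames, step))

# A 30 fps camera captioned once per second: every 30th frame.
idx = frames_to_caption(total_frames=90, video_fps=30, captions_per_second=1.0)
print(idx)  # [0, 30, 60]
```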

GPU for Model Training and Inference

A GPU accelerates both model training and real-time inference, enabling fast, accurate processing of video feeds for instant scene understanding.

Tech Stack & Platforms Used


Project Video

Live Captioning Demo

Evaluation/Results

To assess the performance of our system, we evaluated the captioning quality using standard Natural Language Generation (NLG) metrics such as BLEU, METEOR, and CIDEr. These metrics compare the generated captions against human-annotated references to measure precision, relevance, and fluency.
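To give a sense of what these metrics measure, the snippet below implements clipped unigram precision, the basic building block of BLEU. Real evaluation would use a full implementation of BLEU, METEOR, and CIDEr (with brevity penalties, higher-order n-grams, and so on); this hand-rolled version, with made-up example captions, only illustrates the core idea.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: the fraction of candidate words that
    also appear in the reference, with counts clipped so repeated
    words are not over-credited."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / sum(cand.values())

p = unigram_precision(
    "two tanks advancing along the road",                # generated caption
    "two tanks are advancing along a dirt road",         # human reference
)
print(round(p, 3))  # 0.833 — 5 of 6 candidate words appear in the reference
```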

User Testimonial

After using DeepSub, “…we believe this will significantly improve report accuracy while reducing the cognitive load on soldiers, resulting in a strategic advantage during missions. As the industry mentor, I am proud of the team’s effort and determination to see the success of this project.”
– Mr. Hing Wen Siang, Army Division Lead, AI Dev, ST Engineering

Our Team

Acknowledgements

This project on Combat Ready Vision AI, conducted in partnership with ST Engineering, has benefited greatly from the support and guidance of several key individuals.
We extend our gratitude to Prof. Mahamarakkalage Dileepa Yasas Fernando, our capstone mentor, for his expert advice and for encouraging us to challenge the limits of our understanding in this rapidly evolving field.
Our thanks also go to Dr. Bernard Tan and Prof. Rashmi Kumar from the Centre for Writing and Rhetoric, who provided crucial guidance on communicating complex ideas effectively, thereby enhancing the presentation of our project.
Special appreciation is given to Mr. Wen Siang and Ms. Wen Hui from ST Engineering, our industry mentors, whose insights and guidance were instrumental in the practical success of our project. Their expertise in the industry significantly contributed to the depth and applicability of our research.
The collective wisdom and support of our mentors have been invaluable. We are grateful for their contributions to our project’s success.


Contact the Capstone Office :

+65 6499 4076

8 Somapah Road Singapore 487372
