Technical Report • 2026
VAJRA: A Multi-Sensor On-Device Counter-UAS System with Custom-Trained Visual and Acoustic Deep Learning Models
360Labs
Abstract
The proliferation of small unmanned aerial systems (sUAS) poses significant security challenges across military, critical infrastructure, and civilian domains. Current counter-UAS (C-UAS) solutions rely on expensive, centralized, network-dependent hardware that is impractical for forward-deployed or resource-constrained environments. We present VAJRA, a fully on-device, multi-sensor drone detection and neutralization system that runs entirely on a commercial Android smartphone. VAJRA integrates three detection modalities - visual (camera + custom-trained YOLOv8n), acoustic (microphone + FFT + custom-trained CNN classifier), and RF spectrum analysis - fused into a unified threat display with countermeasure control. All machine learning inference runs locally with zero network dependency, enabling operation in denied/degraded communications environments. We describe our approach to custom training both the visual object detection model on drone-specific imagery and the acoustic classification model on synthetically generated propeller audio profiles. The system achieves real-time performance (>25 FPS visual, <100ms acoustic classification) on mid-range Android hardware with a total application size under 40MB. We discuss the current limitations of synthetic training data and outline a roadmap for improving detection accuracy through expanded real-world datasets.
1. Introduction
1.1 The Growing UAS Threat
The rapid democratization of drone technology has created an asymmetric threat landscape. Consumer quadcopters costing under $1,000 can carry payloads, conduct surveillance, and penetrate restricted airspace with minimal operator skill. In military contexts, the Ukraine-Russia conflict has demonstrated the devastating effectiveness of first-person-view (FPV) attack drones, fiber-optic guided munitions, and loitering munitions such as the Shahed-136 [1]. The Indian subcontinent faces similar challenges along its borders, where adversary drones conduct reconnaissance and cross-border smuggling operations [2].
1.2 Limitations of Existing C-UAS Systems
Current counter-drone systems, such as DroneShield's DroneSentry, Rafael's Drone Dome, and Dedrone's DedroneTracker, suffer from several limitations:
- Cost: Military-grade systems cost $500K-$5M per installation
- Infrastructure dependency: Require dedicated radar arrays, RF sensors, and command stations
- Network dependency: Cloud-based ML processing requires persistent connectivity
- Portability: Fixed installations cannot support mobile patrols or forward positions
- Latency: Cloud round-trips add 200-2000ms to detection-to-response time
1.3 Our Contribution
VAJRA addresses these limitations by implementing a complete C-UAS pipeline on a single Android device:
- Visual detection using a custom-trained YOLOv8n model (12MB) for real-time drone recognition via the device camera
- Acoustic detection using a dual-layer approach: real-time FFT peak analysis for propeller frequency identification, combined with a custom-trained CNN classifier (129KB) for drone type classification from mel spectrograms
- RF analysis for control link identification and protocol fingerprinting
- Multi-sensor fusion combining all modalities into a unified tactical threat display
- Countermeasure control interface for RF jamming and GPS spoofing operations
All inference runs on-device using TensorFlow Lite, requiring zero network connectivity. The complete application is under 40MB and runs on any Android 8.0+ device.
2. System Architecture
2.1 Overview
VAJRA follows a modular architecture where each sensor modality operates as an independent detection pipeline, with results fused at the threat display layer.
VAJRA System Architecture
CAMERA MICROPHONE SDR/RF
(CameraX) (AudioRec) (RTL-SDR)
| | |
v v v
YOLOv8n FFT Engine Spectrum
TFLite 1024-pt FFT Analyzer
(12MB) @ 44.1 kHz 2.048 MSPS
| | |
| v |
| CNN Mel |
| Classifier |
| (129KB) |
| | |
v v v
DRONE DATABASE
8 profiles x 22 parameters
|
v
THREAT DISPLAY (Fusion Layer)
Radar sweep + bearing/range/speed/IFF
|
v
COUNTERMEASURE CONTROL
Barrage jam / Protocol exploit / GPS spoof
2.2 Drone Database
At the core of VAJRA's identification capability is a structured database of 8 drone profiles spanning consumer, commercial, military, FPV attack, and loitering munition categories. Each profile contains 22 parameters:
| Parameter Category | Fields |
|---|---|
| Identity | ID, name, manufacturer, country of origin |
| Physical | Category, type (quadcopter/fixed-wing/hybrid), control link type |
| Performance | Max range (7-150 km), max speed (72-220 km/h), endurance (8 min-36 hrs) |
| RF Signature | Frequency bands (2.4/5.8 GHz, 900 MHz, C/Ku-band SAT), RF protocol |
| Acoustic Signature | Propeller fundamental frequency (28-400 Hz), acoustic description |
| Threat Assessment | Threat level (LOW/MEDIUM/HIGH/CRITICAL), payload capability, countermeasure |
Table 1: Drone profiles in VAJRA database
| Drone | Country | Category | Prop Freq | Control | Threat | Jammable |
|---|---|---|---|---|---|---|
| DJI Mavic 3 | CN | Consumer | 240 Hz | OcuSync 3.0 | MEDIUM | Yes |
| DJI Phantom 4 | CN | Consumer | 215 Hz | Lightbridge 2 | MEDIUM | Yes |
| Bayraktar TB2 | TR | Military | 95 Hz | Satellite | HIGH | No |
| Shahed-136 | IR | Loitering | 65 Hz | Autonomous | CRITICAL | No* |
| FPV Attack | UA/RU | FPV Attack | 400 Hz | ExpressLRS | CRITICAL | Yes |
| Fiber-Optic FPV | UA/RU | FPV Attack | 375 Hz | Fiber Optic | CRITICAL | No |
| Orlan-10 | RU | Military | 110 Hz | MIL-STD | HIGH | Partial |
| Heron TP | IL | Military | 55 Hz | Satellite | HIGH | No |
*GPS spoofing may be effective against GPS-guided autonomous drones.
2.3 Platform and Dependencies
- Target platform: Android 8.0+ (API 26), optimized for landscape tablet/phone
- ML runtime: TensorFlow Lite 2.14.0 with 4-thread CPU inference
- Camera framework: AndroidX CameraX 1.4.1
- Audio capture: Android AudioRecord at 44,100 Hz, 16-bit mono PCM
- Total APK size: ~38 MB (including all models and assets)
3. Visual Detection Module
3.1 Model Architecture
For real-time visual drone detection, we employ YOLOv8n (nano variant), the smallest model in the Ultralytics YOLOv8 family. YOLOv8n uses a CSPDarknet backbone with a Path Aggregation Network (PAN) feature pyramid and a decoupled detection head [3].
Model specifications:
| Parameter | Value |
|---|---|
| Architecture | YOLOv8n (nano) |
| Input resolution | 320 x 320 x 3 (RGB, float32, normalized 0-1) |
| Output format | [1, 8, 2100] (8 values x 2100 predictions) |
| Output structure | [cx, cy, w, h, class0, class1, class2, class3] per prediction |
| Predictions | 2100 (40x40 + 20x20 + 10x10 multi-scale grid; anchor-free head) |
| Number of classes | 4 (all mapped to DRONE) |
| Model size | 12 MB (TFLite, float16 quantized) |
| NMS IoU threshold | 0.45 |
| Confidence threshold | 0.35 |
| Max detections | 20 per frame |
3.2 Custom Training
The YOLOv8n model was custom-trained on a drone detection dataset rather than using the general-purpose COCO pre-trained weights. The training process:
- Dataset: Drone imagery from multiple angles, distances, and backgrounds, including both consumer and military UAS types
- Augmentation: Standard YOLOv8 augmentation pipeline (mosaic, mixup, random perspective, HSV shifts)
- Training: Transfer learning from COCO pre-trained weights with custom drone classes
- Export: Converted to TFLite with float16 quantization for mobile deployment
3.3 Inference Pipeline
The visual detection pipeline processes camera frames in real-time:
Camera Frame (YUV_420_888, 30 FPS)
|
v
YUV -> Bitmap Conversion (with rotation correction)
|
v
Resize to 320x320 (bilinear interpolation)
|
v
Normalize to [0, 1] float32 RGB
|
v
YOLOv8n TFLite Inference (4 threads)
|
v
Parse [1, 8, 2100] output tensor
|
v
Filter by confidence (>0.35)
|
v
Non-Maximum Suppression (IoU >0.45)
|
v
Tactical overlay rendering (corner brackets + label + confidence)
The inference runs on a dedicated single-thread executor to avoid blocking the UI thread. Detection results are marshaled to the main thread via Handler.post() for overlay rendering with an 800ms per-class cooldown to prevent log spam.
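The post-processing steps above (parse, confidence filter, NMS) can be sketched in Python. This is an illustrative re-implementation, not the app's Kotlin code; it assumes the class scores in the exported tensor are already sigmoid-activated probabilities, and the toy box values below are made up.

```python
# Sketch of decoding the [1, 8, 2100] YOLOv8n output and applying NMS.
# Thresholds mirror the text: confidence 0.35, IoU 0.45, max 20 detections.

CONF_THRESH = 0.35
IOU_THRESH = 0.45

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def decode(preds):
    """preds: rows of [cx, cy, w, h, class0..class3]. All four classes map
    to DRONE, so the max class score is the detection confidence."""
    out = []
    for cx, cy, w, h, *cls in preds:
        score = max(cls)
        if score >= CONF_THRESH:
            out.append(((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), score))
    return out

def nms(dets, iou_thresh=IOU_THRESH, max_det=20):
    """Greedy NMS: keep highest-scoring boxes, drop heavy overlaps."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in dets:
        if all(iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, score))
        if len(kept) == max_det:
            break
    return kept
```

Two near-duplicate high-confidence boxes collapse to one after NMS, while sub-threshold predictions never reach the NMS stage at all.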
3.4 Tactical Display
Detections are rendered as tactical corner-bracket overlays (rather than simple rectangles) with:
- Drone class label and confidence percentage
- FPS counter for performance monitoring
- Running detection statistics (contacts, drones, vehicles)
- Timestamped detection log (50-entry circular buffer)
4. Acoustic Detection Module
4.1 Dual-Layer Architecture
VAJRA's acoustic detection employs two complementary layers that run simultaneously on live microphone audio:
Layer 1: Real-time FFT peak detection (every ~23ms)
- Detects propeller fundamental frequency in the 50-500 Hz range
- Validates harmonic structure (2nd harmonic presence)
- Tracks frequency stability across consecutive frames
- Matches against drone database profiles
Layer 2: ML CNN classifier (every 1 second)
- Computes mel spectrogram from 1-second audio buffer
- Classifies drone type using a trained CNN
- Provides probabilistic output across 5 classes
4.2 FFT Engine
The FFT engine implements a Cooley-Tukey radix-2 decimation-in-time algorithm in pure Kotlin, operating on 1024-point windows at 44,100 Hz sampling rate.
FFT specifications:
| Parameter | Value |
|---|---|
| FFT size | 1024 points |
| Sample rate | 44,100 Hz |
| Frequency resolution | 43.07 Hz per bin |
| Frame rate | ~43 frames/second |
| Window function | Hanning (periodic) |
| Output bins | 256 (mapped to 0-1000 Hz) |
| Bit-reversal | Pre-computed lookup table |
| Twiddle factors | Pre-computed sin/cos table |
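The engine itself is pure Kotlin; the same iterative radix-2 decimation-in-time structure, with precomputed bit-reversal and twiddle tables, can be sketched in Python:

```python
import cmath

def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT (illustrative Python sketch;
    the on-device Kotlin engine precomputes the bit-reversal and twiddle
    tables once and reuses them for every 1024-point frame)."""
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # Bit-reversal permutation (precomputable lookup table).
    rev = [int(f"{i:0{bits}b}"[::-1], 2) for i in range(n)]
    a = [complex(x[rev[i]]) for i in range(n)]
    # Precomputed twiddle factors exp(-2*pi*i*k/n).
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    size = 2
    while size <= n:
        half, step = size // 2, n // size
        for start in range(0, n, size):
            for k in range(half):
                t = w[k * step] * a[start + half + k]
                a[start + half + k] = a[start + k] - t
                a[start + k] = a[start + k] + t
        size *= 2
    return a
```

At the production parameters (1024 points, 44,100 Hz) each output bin spans 44100/1024 ≈ 43.07 Hz, matching the frequency resolution in the table above.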
Detection criteria (all must be satisfied):
| Criterion | Threshold | Purpose |
|---|---|---|
| SNR | >= 12 dB | Peak must be significantly above noise floor |
| Amplitude | >= -55 dBFS | Reject quiet ambient fluctuations |
| Harmonic | 2nd harmonic >= 4 dB above local noise | Confirm propeller signature vs. random peak |
| Stability | +/- 15 Hz across 3+ consecutive frames | Reject transient peaks |
| Confirmation | 6 consecutive valid frames (~150ms) | Reject brief coincidences |
Harmonic validation algorithm:
Drone propellers produce strong harmonics due to blade geometry. A peak at fundamental frequency f0 should have a corresponding peak at 2f0 (second harmonic). VAJRA validates this by:
- Computing the expected harmonic bin: bin_h2 = 2 * bin_f0
- Finding the maximum magnitude within +/-1 bin of the expected position
- Computing the local noise floor in a +/-15 bin window (excluding +/-3 bins around the harmonic)
- Requiring the harmonic SNR >= 4 dB
This criterion effectively rejects non-propeller sources (speech, music, traffic) that may have energy in the 50-500 Hz range but lack the harmonic structure of rotating blades.
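The harmonic check can be written as a small Python function (an illustrative sketch of the steps above, not the app's Kotlin code), operating on a per-bin dB magnitude spectrum:

```python
def harmonic_valid(mags_db, bin_f0, snr_db=4.0):
    """2nd-harmonic validation: the strongest bin within +/-1 of 2*bin_f0
    must sit >= snr_db above the local noise floor, which is averaged over
    a +/-15 bin window excluding +/-3 bins around the harmonic itself."""
    bin_h2 = 2 * bin_f0
    if bin_h2 + 1 >= len(mags_db):
        return False
    peak = max(mags_db[bin_h2 - 1 : bin_h2 + 2])
    window = [mags_db[i]
              for i in range(max(0, bin_h2 - 15), min(len(mags_db), bin_h2 + 16))
              if abs(i - bin_h2) > 3]
    noise = sum(window) / len(window)
    return peak - noise >= snr_db
```

A spectrum with a peak at bin f0 but a flat floor around 2*f0 fails the check, which is exactly how speech and traffic noise get rejected.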
Distance estimation:
Signal amplitude provides a rough distance estimate by linearly interpolating between two calibration points in the dB domain (a coarse proxy for inverse-square attenuation):
distance = DIST_CLOSE + (amplitude_dB - AMP_CLOSE) / (AMP_FAR - AMP_CLOSE)
* (DIST_FAR - DIST_CLOSE)
Where AMP_CLOSE = -20 dBFS -> 50m, AMP_FAR = -60 dBFS -> 1000m.
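The mapping is a two-point linear interpolation; a direct Python transcription (with clamping to the calibrated range, an assumption not stated in the formula):

```python
AMP_CLOSE, AMP_FAR = -20.0, -60.0     # dBFS calibration points
DIST_CLOSE, DIST_FAR = 50.0, 1000.0   # metres

def estimate_distance(amplitude_db):
    """Linear interpolation between the two calibration points,
    clamped to [DIST_CLOSE, DIST_FAR]."""
    frac = (amplitude_db - AMP_CLOSE) / (AMP_FAR - AMP_CLOSE)
    dist = DIST_CLOSE + frac * (DIST_FAR - DIST_CLOSE)
    return min(DIST_FAR, max(DIST_CLOSE, dist))
```

For example, -40 dBFS (halfway between the calibration points) maps to 525 m, halfway between 50 m and 1000 m.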
4.3 CNN Acoustic Classifier
4.3.1 Model Architecture
The acoustic classifier is a compact CNN operating on log-mel spectrograms:
Input: (64, 44, 1) -64 mel bands x 44 time frames x 1 channel
Conv2D(16, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2) -> (32, 22, 16)
Conv2D(32, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2) -> (16, 11, 32)
Conv2D(64, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2) -> (8, 5, 64)
Conv2D(64, 3x3, same) -> BatchNorm -> ReLU -> GlobalAvgPool -> (64,)
Dropout(0.3) -> Dense(32, ReLU) -> Dropout(0.2) -> Dense(5, softmax)
Output: [ambient, quadcopter_small, quadcopter_large, fixed_wing, helicopter_uav]
Model specifications:
| Parameter | Value |
|---|---|
| Total parameters | ~52K |
| Model size (TFLite float16) | 129 KB |
| Mel filterbank size | 128 KB (64 x 513 float32) |
| Input shape | [1, 64, 44, 1] |
| Output classes | 5 |
| Confidence threshold | 0.35 |
| Inference time | < 50ms on mobile CPU |
4.3.2 Mel Spectrogram Computation
The on-device mel spectrogram pipeline matches the training pipeline (Python librosa):
- Windowing: 1024-sample Hanning window, 1024-sample hop (non-overlapping)
- FFT: 1024-point radix-2 Cooley-Tukey (implemented in pure Kotlin)
- Power spectrum: |X[k]|^2 for k = 0..512
- Mel filterbank: 64 triangular filters spanning 20-2000 Hz, applied as matrix multiply
- Log compression: 10 x log10(power + 10^-10), clipped to 80 dB dynamic range (matching librosa.power_to_db(ref=np.max, top_db=80))
- Normalization: Min-max scaling to [0, 1]
The mel filterbank weights are pre-computed using librosa.filters.mel(sr=44100, n_fft=1024, n_mels=64, fmin=20, fmax=2000) and shipped as a binary asset (128 KB).
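The log-compression and normalization steps can be sketched in pure Python. This approximates librosa's behavior (a small epsilon stands in for librosa's exact amin handling) and operates on a flat list of mel filterbank outputs:

```python
import math

def log_compress_normalize(power, top_db=80.0):
    """Log compression referenced to the maximum (as in
    librosa.power_to_db(ref=np.max, top_db=80)), then min-max
    scaling to [0, 1] as described in the pipeline above."""
    eps = 1e-10
    ref = max(power)
    db = [10.0 * math.log10(p + eps) - 10.0 * math.log10(ref + eps)
          for p in power]
    floor = max(db) - top_db          # clip to top_db dynamic range
    db = [max(d, floor) for d in db]
    lo, hi = min(db), max(db)
    return [(d - lo) / (hi - lo + eps) for d in db]
```

The loudest bin always maps to 1.0 and anything 80 dB or more below it maps to 0.0, so the CNN input range is stable regardless of absolute microphone gain.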
4.3.3 Training Data: Synthetic Drone Audio Generation
A key contribution of this work is the synthetic training data generation pipeline. Due to the difficulty of collecting real drone audio across multiple types and conditions, we generate physically accurate propeller audio from known frequency profiles.
Generation parameters per drone profile:
| Parameter | Description | Value range |
|---|---|---|
| Fundamental frequency | Blade-pass frequency | 28-400 Hz |
| Harmonics | 2nd, 3rd, 4th overtones | Amplitude: 0.5, 0.25, 0.12 x fundamental |
| Number of propellers | Multi-rotor detuning | 1 (fixed-wing) to 6 (hexcopter) |
| Propeller detuning | Inter-motor frequency offset | +/-3-8 Hz |
| RPM modulation | Throttle variation | +/-5% at 0.2-0.5 Hz |
| Amplitude modulation | Doppler/distance simulation | +/-20% at 0.05-0.15 Hz |
| Background noise | Pink noise (1/f spectrum) | SNR: 15, 25, 40 dB |
Drone profiles used for training:
| Class | Profiles | Frequencies |
|---|---|---|
| quadcopter_small | DJI Mavic 3, DJI Mini, DJI Phantom 4, Generic | 215, 230, 240, 260 Hz |
| quadcopter_large | FPV Racing, FPV Attack, Matrice 600, Heavy Quad | 160, 180, 375, 400 Hz |
| fixed_wing | Bayraktar TB2, Shahed-136, Orlan-10, RC Plane | 65, 95, 110, 130 Hz |
| helicopter_uav | Heron TP, Coaxial Heli, Heli UAV, Medium Heli | 28, 35, 45, 55 Hz |
| ambient | ESC-50 dataset (21 categories) | N/A |
Signal synthesis formula:
For a drone with N propellers, fundamental frequency f0, and detuning delta:
signal(t) = Sum_p=1^N Sum_h=1^4 (a_h / N) x sin(2pi x integral_0^t f_ph(tau) dtau)
where:
f_ph(t) = h x [f0 + delta_p + delta_f x sin(2pi x fmod x t + phi_p)]
delta_p = (p - N/2) x delta / N (per-propeller detuning)
delta_f = 0.05 x f0 (RPM modulation depth)
fmod ~ U(0.2, 0.5) Hz (modulation rate)
a_h = [1.0, 0.5, 0.25, 0.12] (harmonic amplitudes)
This produces audio with the characteristic "buzzing" quality of multi-rotor drones, where slightly detuned motors create beating patterns in the frequency domain.
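The synthesis formula above can be implemented directly, accumulating the phase integral of the instantaneous frequency per sample. This is an illustrative Python sketch (the `tone_magnitude` probe is a helper added here for inspection, not part of the pipeline):

```python
import math
import random

HARMONIC_AMPS = [1.0, 0.5, 0.25, 0.12]   # a_h from the formula above

def synth_drone(f0, n_props=4, detune=5.0, seconds=1.0, sr=44100):
    """N detuned propellers, four harmonics, slow sinusoidal RPM
    modulation. Each propeller's phase is the running integral of its
    instantaneous frequency f_ph(t), accumulated sample by sample."""
    fmod = random.uniform(0.2, 0.5)       # modulation rate (Hz)
    phis = [random.uniform(0, 2 * math.pi) for _ in range(n_props)]
    n = int(seconds * sr)
    out = [0.0] * n
    for p in range(n_props):
        delta_p = (p - n_props / 2) * detune / n_props  # per-prop detuning
        acc = 0.0                                       # integral of f dt
        for i in range(n):
            t = i / sr
            f = f0 + delta_p + 0.05 * f0 * math.sin(
                2 * math.pi * fmod * t + phis[p])
            acc += f / sr
            for h, a in enumerate(HARMONIC_AMPS, start=1):
                out[i] += (a / n_props) * math.sin(2 * math.pi * h * acc)
    return out

def tone_magnitude(sig, freq, sr=44100):
    """Naive single-frequency DFT probe for checking where energy sits."""
    re = sum(x * math.cos(2 * math.pi * freq * i / sr)
             for i, x in enumerate(sig))
    im = sum(x * math.sin(2 * math.pi * freq * i / sr)
             for i, x in enumerate(sig))
    return math.hypot(re, im)
```

Synthesizing a 240 Hz profile (DJI Mavic 3 class) concentrates energy in a cluster around the fundamental, with the slight spread from detuning and RPM modulation producing the beating patterns described above.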
Training dataset composition:
| Source | Samples | Duration |
|---|---|---|
| Synthetic drone audio (16 profiles x 3 SNR x 60s) | ~5,600 clips | 48 minutes |
| ESC-50 ambient (21 categories) | ~6,300 clips | 35 minutes |
| Total (before augmentation) | ~11,900 | 83 minutes |
Data augmentation (SpecAugment-style [4]):
- Time masking: random 1-5 frame zeroing
- Frequency masking: random 1-8 mel band zeroing
- Gaussian noise injection: sigma in [0.01, 0.05]
- Time shifting: +/-3 frames circular shift
After augmentation and class balancing, the final training set contains over 3,000 samples per class.
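The time- and frequency-masking augmentations can be sketched as a small Python function operating on a mel spectrogram stored as a list of rows (an illustrative sketch; the training code's exact masking policy may differ):

```python
import random

def spec_augment(spec, max_t=5, max_f=8):
    """SpecAugment-style masking per the table above: one random time
    mask (1-5 frames) and one frequency mask (1-8 mel bands).
    `spec` is n_mels rows x n_frames columns; the input is not mutated."""
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    # Time mask: zero a contiguous block of frames across all bands.
    tw = random.randint(1, max_t)
    t0 = random.randint(0, n_frames - tw)
    for row in out:
        for t in range(t0, t0 + tw):
            row[t] = 0.0
    # Frequency mask: zero a contiguous block of mel bands.
    fw = random.randint(1, max_f)
    f0 = random.randint(0, n_mels - fw)
    for m in range(f0, f0 + fw):
        out[m] = [0.0] * n_frames
    return out
```

Applying this to the (64, 44) spectrograms used here zeroes at most 5 of 44 frames and 8 of 64 bands per sample, forcing the CNN not to rely on any single time slice or frequency band.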
Training configuration:
| Parameter | Value |
|---|---|
| Optimizer | Adam (lr=0.001) |
| Loss | Sparse categorical cross-entropy |
| Batch size | 32 |
| Epochs | 100 (early stopping, patience=15) |
| LR schedule | ReduceLROnPlateau (factor=0.5, patience=5) |
| Train/val split | 80/20 (stratified) |
| Best val accuracy | 100% (42 epochs) |
4.4 Spectrogram Visualization
The acoustic module includes a real-time waterfall spectrogram display:
- Resolution: 256 frequency bins x 200 time rows (~4.6 seconds of visible history at ~43 rows/second)
- Frequency range: 0-1000 Hz
- Color mapping: Dark green -> bright green -> yellow -> orange/red -> white (magnitude-proportional)
- Harmonic markers: Vertical dashed lines at detected fundamental and harmonics when drone is detected
5. RF Analysis Module
5.1 Architecture
The RF analysis module performs spectrum monitoring to detect drone control link transmissions. In production deployment, this requires an external SDR (software-defined radio) connected via USB-OTG:
| Hardware | Cost | Bandwidth | Frequency Range |
|---|---|---|---|
| RTL-SDR v3 | $25 | 2.048 MSPS | 25 MHz - 1.7 GHz |
| HackRF One | $300 | 20 MSPS | 1 MHz - 6 GHz |
5.2 Detection Pipeline
IQ Samples (SDR) -> FFT -> Peak Detection -> Bandwidth Extraction
-> Modulation Classification -> Protocol Fingerprinting -> Database Match
Target frequency bands:
| Band | Usage | Typical drones |
|---|---|---|
| 2.4 GHz | Control link (primary) | DJI, consumer quads |
| 5.8 GHz | HD video downlink | DJI, racing drones |
| 900 MHz | Long-range control | FPV (Crossfire), military |
| 1575.42 MHz | GPS L1 navigation | All GPS-dependent drones |
5.3 Protocol Fingerprinting
Each drone control protocol has a distinctive RF fingerprint:
| Protocol | Bandwidth | Modulation | Characteristic |
|---|---|---|---|
| DJI OcuSync 3.0 | 10-20 MHz | OFDM | Dual-band simultaneous |
| DJI Lightbridge 2 | 10 MHz | OFDM | Single-band |
| ExpressLRS | 500 kHz | LoRa | Narrow-band, frequency-hopping |
| Crossfire | 1 MHz | LoRa | 900 MHz long-range |
| MIL-STD encrypted | Variable | FHSS | Frequency-hopping spread spectrum |
6. Multi-Sensor Fusion and Threat Display
6.1 Fusion Architecture
The threat display module aggregates detections from all sensor modalities into a unified tactical picture. Each detection source provides complementary information:
| Sensor | Provides | Limitations |
|---|---|---|
| Visual (camera) | Bearing, visual classification, size estimation | Limited range (~500m), LoS only, weather-dependent |
| Acoustic (microphone) | Bearing (with arrays), type classification, presence detection | Limited range (~1km), affected by ambient noise |
| RF (SDR) | Control protocol, operator direction, jammability assessment | Cannot detect autonomous drones |
6.2 Radar Display
The threat display renders a rotating radar sweep with contact blips:
data class ThreatBlip(
    val id: String,               // Database profile ID
    val label: String,            // Display name
    val bearingDeg: Float,        // 0-360 from north
    val rangeMeter: Float,        // Distance from sensor
    val speedMps: Float,          // Closing speed
    val headingDeg: Float,        // Direction of travel
    val threatLevel: ThreatLevel, // LOW/MEDIUM/HIGH/CRITICAL
    val isFriendly: Boolean       // IFF classification
)
6.3 Threat Assessment
Each detected contact receives a composite threat score based on:
- Drone type: Military/FPV attack -> CRITICAL, consumer -> LOW/MEDIUM
- Control link: Autonomous/fiber-optic -> higher threat (unjammable)
- Payload capability: Munition-capable -> CRITICAL
- Closing speed and range: Fast-approaching -> elevated priority
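The four factors above can be combined into a simple additive score. The weights, score bands, and category names below are purely illustrative assumptions, not VAJRA's actual values:

```python
# Hypothetical composite threat scoring over the four listed factors.
# All weights and thresholds here are illustrative placeholders.

BASE = {"consumer": 1, "commercial": 2, "military": 3,
        "fpv_attack": 4, "loitering": 4}
LEVELS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

def threat_level(category, jammable, munition_capable, closing_mps, range_m):
    """Additive score mapped onto the four threat levels."""
    score = BASE.get(category, 1)
    if not jammable:
        score += 1   # autonomous / fiber-optic links resist RF denial
    if munition_capable:
        score += 2
    if closing_mps > 15 and range_m < 1000:
        score += 1   # fast-approaching contact gets elevated priority
    return LEVELS[min(3, (score - 1) // 2)]
```

With these placeholder weights, an unjammable munition-capable military UAS scores HIGH and a fast-closing FPV attack drone scores CRITICAL, roughly mirroring Table 1.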
7. Countermeasure Control
7.1 Jammer Specifications
VAJRA includes an interface for directing RF countermeasures (requires external jammer hardware in production):
| Band | Power | Target |
|---|---|---|
| 2.4 GHz | 10W | Wi-Fi, DJI OcuSync, ExpressLRS |
| 5.8 GHz | 10W | Video downlinks |
| 900 MHz | 10W | Crossfire, military FH protocols |
| GPS L1 | 5W | Navigation denial |
7.2 Countermeasure Modes
- Barrage jamming: Simultaneous wideband transmission across all bands, forcing drone failsafe (hover -> land)
- DJI RTH spoof: Protocol-level command injection exploiting an OcuSync vulnerability to trigger return-to-home
- GPS spoofing: Transmission of false GPS L1 signals at +20 dBm over real signal, redirecting GPS-dependent drones to a designated safe zone
7.3 Jammability Matrix
| Control Link | Jammable | Alternative |
|---|---|---|
| Wireless RF (2.4/5.8 GHz) | Yes | Barrage jam + protocol exploit |
| Fiber-optic | No | Kinetic intercept required |
| Satellite (C/Ku-band) | No | GPS spoofing (if GPS-dependent) |
| Autonomous (GPS-guided) | No (RF) | GPS spoofing may redirect |
8. Results and Discussion
8.1 Visual Detection Performance
The custom-trained YOLOv8n achieves real-time inference on mobile hardware:
| Metric | Value |
|---|---|
| Inference speed | > 25 FPS on mid-range Android |
| Model size | 12 MB (float16 TFLite) |
| Input resolution | 320 x 320 |
| NMS processing | < 2ms |
| Memory footprint | ~45 MB (model + buffers) |
8.2 Acoustic Detection Performance
FFT layer:
| Metric | Value |
|---|---|
| Processing rate | ~43 frames/second |
| Detection latency | ~150ms (6 confirmation frames) |
| Frequency resolution | 43.07 Hz |
| Effective range | 50-1000m (estimated) |
CNN classifier:
| Metric | Value |
|---|---|
| Inference speed | < 50ms per 1-second clip |
| Model size | 129 KB (float16 TFLite) + 128 KB filterbank |
| Validation accuracy | 100% on synthetic test set |
| Classes | 5 (ambient + 4 drone types) |
8.3 System-Level Performance
| Metric | Value |
|---|---|
| Total APK size | 38 MB |
| Cold start time | < 3 seconds |
| Battery impact | ~15% per hour (active scanning) |
| Network requirement | None (fully offline) |
| Minimum hardware | Android 8.0, ARM64, 2GB RAM |
8.4 Limitations
- Synthetic training data gap: The acoustic classifier achieves 100% accuracy on synthetic test data but shows reduced performance on real-world audio played through speakers. This is expected: synthetic audio has an idealized harmonic structure that differs from audio captured through the speaker -> air -> microphone chain, whose spectral characteristics are shaped by speaker frequency response, room acoustics, and microphone response curves.
- Visual detection range: At 320x320 input resolution, small drones at distances beyond ~200m occupy very few pixels, reducing detection confidence. Higher input resolution (640x640) would improve range at the cost of inference speed.
- Low-frequency drones: Fixed-wing drones (65-130 Hz) and helicopter UAVs (28-55 Hz) produce fundamental frequencies below the reproduction capability of most phone speakers (which typically roll off below ~150 Hz), making speaker-based testing impossible for these categories. Real-world testing with actual drones is required to validate these classes.
- RF module: Currently operates with simulated data; production deployment requires external SDR hardware.
9. Improving with More Data
9.1 Visual Model Improvements
The current YOLOv8n model can be significantly improved through expanded training data:
Priority datasets:
| Source | Type | Expected benefit |
|---|---|---|
| Real drone flight recordings | Video frames at various distances, angles, backgrounds | +10-15% accuracy at range |
| Thermal/IR imagery | FLIR sensor captures | Night-time detection capability |
| Adverse weather footage | Rain, fog, dusk/dawn | Robustness in operational conditions |
| Drone-specific negative samples | Birds, aircraft, kites, balloons | Reduced false positive rate |
| Synthetic data (AirSim/Gazebo) | Rendered drone models on diverse backgrounds | Scale training data 10-100x |
Recommended training pipeline:
- Start with current custom-trained model as baseline
- Collect 10,000+ real drone images across 5+ drone types
- Include 20,000+ negative samples (birds at various sizes, aircraft)
- Train YOLOv8s (small) or YOLOv8m (medium) for higher accuracy
- Apply INT8 quantization for mobile deployment
- Target: > 90% mAP@0.5 at 50-500m range
9.2 Acoustic Model Improvements
Path 1: Real drone recordings
The most impactful improvement is supplementing synthetic data with real recordings:
Recording Protocol:
1. Record each drone type at 3 distances (50m, 200m, 500m)
2. Record in 3 environments (open field, urban, forested)
3. Use phone microphone (matches deployment hardware)
4. Minimum 2 minutes per recording
5. Include hover, forward flight, and approach maneuvers
Even 30 seconds of real drone audio per type would significantly improve generalization over synthetic-only training.
Path 2: Transfer learning from large audio models
Pre-trained audio models (AudioSet, VGGish, YAMNet) can provide learned feature representations that generalize better than training from scratch:
- Use YAMNet (Google's AudioSet model) as feature extractor
- Replace final classification head with 5-class drone classifier
- Fine-tune on synthetic + real drone data
- Expected benefit: better ambient rejection, more robust features
Path 3: Environmental calibration
Record 1-2 minutes of ambient sound at the deployment location. Use this as additional "ambient" training data specific to the operating environment. This trains the model to reject site-specific background noise (nearby roads, industrial equipment, wildlife).
Path 4: Data augmentation improvements
| Augmentation | Purpose |
|---|---|
| Room impulse response (RIR) convolution | Simulate speaker-to-mic acoustic path |
| Speaker frequency response simulation | Model typical phone/laptop speaker rolloff |
| Multi-drone mixing | Detect when 2+ drones present simultaneously |
| Wind noise overlay | Outdoor robustness |
| Variable-distance amplitude scaling | Continuous distance modeling |
9.3 Cross-Modal Learning
Future work could explore cross-modal training where visual and acoustic detections reinforce each other:
- When visual detection confirms a drone, label the concurrent acoustic data
- Build a self-supervised dataset from field deployments
- Train a fusion model that jointly processes visual features + mel spectrograms
- This approach could improve detection confidence when either modality alone is uncertain
10. Related Work
Commercial C-UAS systems: DroneShield DroneSentry [5] uses acoustic arrays + radar + RF detection with costs exceeding $500K. Dedrone DedroneTracker [6] uses camera + RF + radar with cloud processing. Neither operates on mobile devices.
Academic drone detection: Kim et al. [7] demonstrated acoustic drone detection using mel spectrograms with 94% accuracy on recorded drone audio. Al-Emadi et al. [8] proposed a CNN-based acoustic classifier achieving 96% accuracy on real drone recordings. Our synthetic data approach complements these works by enabling rapid prototyping without drone access.
On-device ML: YOLOv8n represents the state-of-the-art in mobile object detection [3], achieving 37.3% mAP on COCO at 80+ FPS on mobile GPUs. Our application demonstrates its viability for specialized drone detection tasks.
11. Conclusion
VAJRA demonstrates that a practical multi-sensor counter-UAS system can operate entirely on a commercial smartphone. By combining custom-trained visual (YOLOv8n, 12MB) and acoustic (CNN, 129KB) deep learning models with real-time FFT signal processing and an RF analysis pipeline, the system provides drone detection, classification, and countermeasure guidance without any network dependency.
The synthetic audio training approach enables rapid model development for new drone types without requiring physical access to each drone. While the synthetic-to-real domain gap remains a challenge, we outline a clear path to closing this gap through real-world recordings, transfer learning, and environmental calibration.
VAJRA's fully on-device architecture makes it uniquely suited for:
- Forward military positions with denied/degraded communications
- Border security posts in remote areas without network infrastructure
- Critical infrastructure protection where data sovereignty requires on-premises processing
- Rapid deployment -any soldier with a smartphone becomes a drone detection node
Future work will focus on expanding the training datasets with real drone recordings, implementing acoustic direction-of-arrival using multi-microphone arrays, and integrating external SDR hardware for production RF analysis capability.
References
- [1] Watling, J. & Reynolds, N. "The Role of Drones in the Russia-Ukraine War." Royal United Services Institute, 2023.
- [2] Ministry of Defence, Government of India. "Anti-Drone Technology Requirements for Border Security." DRDO Technology Perspective, 2024.
- [3] Jocher, G., Chaurasia, A., & Qiu, J. "Ultralytics YOLOv8." Ultralytics, 2023.
- [4] Park, D.S. et al. "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." Interspeech, 2019.
- [5] DroneShield. "DroneSentry: Integrated Detect-and-Defeat Counter-Drone System." DroneShield Technical Datasheet, 2024.
- [6] Dedrone. "DedroneTracker: AI-Powered Airspace Security Platform." Dedrone Product Documentation, 2024.
- [7] Kim, J. et al. "Acoustic-Based Drone Detection and Classification Using Mel Spectrograms and Convolutional Neural Networks." IEEE Access, vol. 9, 2021.
- [8] Al-Emadi, S. et al. "Audio Based Drone Detection and Identification Using Deep Learning." IEEE International Workshop on Signal Processing Advances in Wireless Communications, 2019.