Technical Report • 2026
VAJRA: A Multi-Sensor On-Device Counter-UAS System with Custom-Trained Visual and Acoustic Deep Learning Models
360Labs
Abstract
The proliferation of small unmanned aerial systems (sUAS) poses significant security challenges across military, critical infrastructure, and civilian domains. Current counter-UAS (C-UAS) solutions rely on expensive, centralized, network-dependent hardware that is impractical for forward-deployed or resource-constrained environments. We present VAJRA, a fully on-device, multi-sensor drone detection and neutralization system that runs entirely on a commercial Android smartphone. VAJRA integrates three detection modalities - visual (camera + custom-trained YOLOv8n), acoustic (microphone + FFT + custom-trained CNN classifier), and RF spectrum analysis - fused into a unified threat display with countermeasure control. All machine learning inference runs locally with zero network dependency, enabling operation in denied/degraded communications environments. We describe our approach to custom training both the visual object detection model on drone-specific imagery and the acoustic classification model on synthetically generated propeller audio profiles. The system achieves real-time performance (>25 FPS visual, <100ms acoustic classification) on mid-range Android hardware with a total application size under 40MB. We discuss the current limitations of synthetic training data and outline a roadmap for improving detection accuracy through expanded real-world datasets.
1. Introduction
1.1 The Growing UAS Threat
The rapid democratization of drone technology has created an asymmetric threat landscape. Consumer quadcopters costing under $1,000 can carry payloads, conduct surveillance, and penetrate restricted airspace with minimal operator skill. In military contexts, the Ukraine-Russia conflict has demonstrated the devastating effectiveness of first-person-view (FPV) attack drones, fiber-optic guided munitions, and loitering munitions such as the Shahed-136 [1]. The Indian subcontinent faces similar challenges along its borders, where adversary drones conduct reconnaissance and cross-border smuggling operations [2].
1.2 Limitations of Existing C-UAS Systems
Current counter-drone systems, such as DroneShield's DroneSentry, Rafael's Drone Dome, and Dedrone's DedroneTracker, suffer from several limitations:
- Cost: Military-grade systems cost $500K-$5M per installation
- Infrastructure dependency: Require dedicated radar arrays, RF sensors, and command stations
- Network dependency: Cloud-based ML processing requires persistent connectivity
- Portability: Fixed installations cannot support mobile patrols or forward positions
- Latency: Cloud round-trips add 200-2000ms to detection-to-response time
1.3 Our Contribution
VAJRA addresses these limitations by implementing a complete C-UAS pipeline on a single Android device:
- Visual detection using a custom-trained YOLOv8n model (12MB) for real-time drone recognition via the device camera
- Acoustic detection using a dual-layer approach: real-time FFT peak analysis for propeller frequency identification, combined with a custom-trained CNN classifier (129KB) for drone type classification from mel spectrograms
- RF analysis for control link identification and protocol fingerprinting
- Multi-sensor fusion combining all modalities into a unified tactical threat display
- Countermeasure control interface for RF jamming and GPS spoofing operations
All inference runs on-device using TensorFlow Lite, requiring zero network connectivity. The complete application is under 40MB and runs on any Android 8.0+ device.
2. System Architecture
2.1 Overview
VAJRA follows a modular architecture where each sensor modality operates as an independent detection pipeline, with results fused at the threat display layer.
VAJRA System Architecture
CAMERA MICROPHONE SDR/RF
(CameraX) (AudioRec) (RTL-SDR)
| | |
v v v
YOLOv8n FFT Engine Spectrum
TFLite 1024-pt FFT Analyzer
(12MB) @ 44.1 kHz 2.048 MSPS
| | |
| v |
| CNN Mel |
| Classifier |
| (129KB) |
| | |
v v v
DRONE DATABASE
8 profiles x 22 parameters
|
v
THREAT DISPLAY (Fusion Layer)
Radar sweep + bearing/range/speed/IFF
|
v
COUNTERMEASURE CONTROL
Barrage jam / Protocol exploit / GPS spoof
2.2 Drone Database
At the core of VAJRA's identification capability is a structured database of 8 drone profiles spanning consumer, commercial, military, FPV attack, and loitering munition categories. Each profile contains 22 parameters:
| Parameter Category | Fields |
|---|---|
| Identity | ID, name, manufacturer, country of origin |
| Physical | Category, type (quadcopter/fixed-wing/hybrid), control link type |
| Performance | Max range (7-150 km), max speed (72-220 km/h), endurance (8 min-36 hrs) |
| RF Signature | Frequency bands (2.4/5.8 GHz, 900 MHz, C/Ku-band SAT), RF protocol |
| Acoustic Signature | Propeller fundamental frequency (28-400 Hz), acoustic description |
| Threat Assessment | Threat level (LOW/MEDIUM/HIGH/CRITICAL), payload capability, countermeasure |
Table 1: Drone profiles in VAJRA database
| Drone | Country | Category | Prop Freq | Control | Threat | Jammable |
|---|---|---|---|---|---|---|
| DJI Mavic 3 | CN | Consumer | 240 Hz | OcuSync 3.0 | MEDIUM | Yes |
| DJI Phantom 4 | CN | Consumer | 215 Hz | Lightbridge 2 | MEDIUM | Yes |
| Bayraktar TB2 | TR | Military | 95 Hz | Satellite | HIGH | No |
| Shahed-136 | IR | Loitering | 65 Hz | Autonomous | CRITICAL | No* |
| FPV Attack | UA/RU | FPV Attack | 400 Hz | ExpressLRS | CRITICAL | Yes |
| Fiber-Optic FPV | UA/RU | FPV Attack | 375 Hz | Fiber Optic | CRITICAL | No |
| Orlan-10 | RU | Military | 110 Hz | MIL-STD | HIGH | Partial |
| Heron TP | IL | Military | 55 Hz | Satellite | HIGH | No |
*GPS spoofing may be effective against GPS-guided autonomous drones.
2.3 Platform and Dependencies
- Target platform: Android 8.0+ (API 26), optimized for landscape tablet/phone
- ML runtime: TensorFlow Lite 2.14.0 with 4-thread CPU inference
- Camera framework: AndroidX CameraX 1.4.1
- Audio capture: Android AudioRecord at 44,100 Hz, 16-bit mono PCM
- Total APK size: ~38 MB (including all models and assets)
3. Visual Detection Module
3.1 Model Architecture
For real-time visual drone detection, we employ YOLOv8n (nano variant), the smallest model in the Ultralytics YOLOv8 family. YOLOv8n uses a CSPDarknet backbone with a Path Aggregation Network (PAN) feature pyramid and a decoupled detection head [3].
Model specifications:
| Parameter | Value |
|---|---|
| Architecture | YOLOv8n (nano) |
| Input resolution | 320 x 320 x 3 (RGB, float32, normalized 0-1) |
| Output format | [1, 8, 2100] (8 values x 2100 predictions) |
| Output structure | [cx, cy, w, h, class0, class1, class2, class3] per prediction |
| Predictions | 2100 (40x40 + 20x20 + 10x10 multi-scale grid; anchor-free head) |
| Number of classes | 4 (all mapped to DRONE) |
| Model size | 12 MB (TFLite, float16 quantized) |
| NMS IoU threshold | 0.45 |
| Confidence threshold | 0.35 |
| Max detections | 20 per frame |
3.2 Custom Training
The YOLOv8n model was custom-trained on a drone detection dataset rather than using the general-purpose COCO pre-trained weights. The training process:
- Dataset: Drone imagery from multiple angles, distances, and backgrounds, including both consumer and military UAS types
- Augmentation: Standard YOLOv8 augmentation pipeline (mosaic, mixup, random perspective, HSV shifts)
- Training: Transfer learning from COCO pre-trained weights with custom drone classes
- Export: Converted to TFLite with float16 quantization for mobile deployment
3.3 Inference Pipeline
The visual detection pipeline processes camera frames in real-time:
Camera Frame (YUV_420_888, 30 FPS)
|
v
YUV -> Bitmap Conversion (with rotation correction)
|
v
Resize to 320x320 (bilinear interpolation)
|
v
Normalize to [0, 1] float32 RGB
|
v
YOLOv8n TFLite Inference (4 threads)
|
v
Parse [1, 8, 2100] output tensor
|
v
Filter by confidence (>0.35)
|
v
Non-Maximum Suppression (IoU >0.45)
|
v
Tactical overlay rendering (corner brackets + label + confidence)
The inference runs on a dedicated single-thread executor to avoid blocking the UI thread. Detection results are marshaled to the main thread via Handler.post() for overlay rendering with an 800ms per-class cooldown to prevent log spam.
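The post-processing steps above (parse, confidence filter, NMS) can be sketched in Python. This is an illustrative re-implementation, not the app's Kotlin code; it assumes the class scores in the exported tensor are already sigmoid-activated probabilities, and the toy box values below are made up.

```python
# Sketch of decoding the [1, 8, 2100] YOLOv8n output and applying NMS.
# Thresholds mirror the text: confidence 0.35, IoU 0.45, max 20 detections.

CONF_THRESH = 0.35
IOU_THRESH = 0.45

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def decode(preds):
    """preds: rows of [cx, cy, w, h, class0..class3]. All four classes map
    to DRONE, so the max class score is the detection confidence."""
    out = []
    for cx, cy, w, h, *cls in preds:
        score = max(cls)
        if score >= CONF_THRESH:
            out.append(((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), score))
    return out

def nms(dets, iou_thresh=IOU_THRESH, max_det=20):
    """Greedy NMS: keep highest-scoring boxes, drop heavy overlaps."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in dets:
        if all(iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, score))
        if len(kept) == max_det:
            break
    return kept
```

Two near-duplicate high-confidence boxes collapse to one after NMS, while sub-threshold predictions never reach the NMS stage at all.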
3.4 Tactical Display
Detections are rendered as tactical corner-bracket overlays (rather than simple rectangles) with:
- Drone class label and confidence percentage
- FPS counter for performance monitoring
- Running detection statistics (contacts, drones, vehicles)
- Timestamped detection log (50-entry circular buffer)
4. Acoustic Detection Module
4.1 Dual-Layer Architecture
VAJRA's acoustic detection employs two complementary layers that run simultaneously on live microphone audio:
Layer 1: Real-time FFT peak detection (every ~23ms)
- Detects propeller fundamental frequency in the 50-500 Hz range
- Validates harmonic structure (2nd harmonic presence)
- Tracks frequency stability across consecutive frames
- Matches against drone database profiles
Layer 2: ML CNN classifier (every 1 second)
- Computes mel spectrogram from 1-second audio buffer
- Classifies drone type using a trained CNN
- Provides probabilistic output across 5 classes
4.2 FFT Engine
The FFT engine implements a Cooley-Tukey radix-2 decimation-in-time algorithm in pure Kotlin, operating on 1024-point windows at 44,100 Hz sampling rate.
FFT specifications:
| Parameter | Value |
|---|---|
| FFT size | 1024 points |
| Sample rate | 44,100 Hz |
| Frequency resolution | 43.07 Hz per bin |
| Frame rate | ~43 frames/second |
| Window function | Hanning (periodic) |
| Output bins | 256 (mapped to 0-1000 Hz) |
| Bit-reversal | Pre-computed lookup table |
| Twiddle factors | Pre-computed sin/cos table |
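The engine itself is pure Kotlin; the same iterative radix-2 decimation-in-time structure, with precomputed bit-reversal and twiddle tables, can be sketched in Python:

```python
import cmath

def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT (illustrative Python sketch;
    the on-device Kotlin engine precomputes the bit-reversal and twiddle
    tables once and reuses them for every 1024-point frame)."""
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # Bit-reversal permutation (precomputable lookup table).
    rev = [int(f"{i:0{bits}b}"[::-1], 2) for i in range(n)]
    a = [complex(x[rev[i]]) for i in range(n)]
    # Precomputed twiddle factors exp(-2*pi*i*k/n).
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    size = 2
    while size <= n:
        half, step = size // 2, n // size
        for start in range(0, n, size):
            for k in range(half):
                t = w[k * step] * a[start + half + k]
                a[start + half + k] = a[start + k] - t
                a[start + k] = a[start + k] + t
        size *= 2
    return a
```

At the production parameters (1024 points, 44,100 Hz) each output bin spans 44100/1024 ≈ 43.07 Hz, matching the frequency resolution in the table above.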
Detection criteria (all must be satisfied):
| Criterion | Threshold | Purpose |
|---|---|---|
| SNR | >= 12 dB | Peak must be significantly above noise floor |
| Amplitude | >= -55 dBFS | Reject quiet ambient fluctuations |
| Harmonic | 2nd harmonic >= 4 dB above local noise | Confirm propeller signature vs. random peak |
| Stability | +/- 15 Hz across 3+ consecutive frames | Reject transient peaks |
| Confirmation | 6 consecutive valid frames (~150ms) | Reject brief coincidences |
Harmonic validation algorithm:
Drone propellers produce strong harmonics due to blade geometry. A peak at fundamental frequency f0 should have a corresponding peak at 2f0 (second harmonic). VAJRA validates this by:
- Computing the expected harmonic bin: bin_h2 = 2 * bin_f0
- Finding the maximum magnitude within +/-1 bin of the expected position
- Computing the local noise floor in a +/-15 bin window (excluding +/-3 bins around the harmonic)
- Requiring the harmonic SNR >= 4 dB
This criterion effectively rejects non-propeller sources (speech, music, traffic) that may have energy in the 50-500 Hz range but lack the harmonic structure of rotating blades.
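The harmonic check can be written as a small Python function (an illustrative sketch of the steps above, not the app's Kotlin code), operating on a per-bin dB magnitude spectrum:

```python
def harmonic_valid(mags_db, bin_f0, snr_db=4.0):
    """2nd-harmonic validation: the strongest bin within +/-1 of 2*bin_f0
    must sit >= snr_db above the local noise floor, which is averaged over
    a +/-15 bin window excluding +/-3 bins around the harmonic itself."""
    bin_h2 = 2 * bin_f0
    if bin_h2 + 1 >= len(mags_db):
        return False
    peak = max(mags_db[bin_h2 - 1 : bin_h2 + 2])
    window = [mags_db[i]
              for i in range(max(0, bin_h2 - 15), min(len(mags_db), bin_h2 + 16))
              if abs(i - bin_h2) > 3]
    noise = sum(window) / len(window)
    return peak - noise >= snr_db
```

A spectrum with a peak at bin f0 but a flat floor around 2*f0 fails the check, which is exactly how speech and traffic noise get rejected.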
Distance estimation:
Signal amplitude provides a rough distance estimate by linearly interpolating between two calibration points in the dB domain (a coarse proxy for inverse-square attenuation):
distance = DIST_CLOSE + (amplitude_dB - AMP_CLOSE) / (AMP_FAR - AMP_CLOSE)
* (DIST_FAR - DIST_CLOSE)
Where AMP_CLOSE = -20 dBFS -> 50m, AMP_FAR = -60 dBFS -> 1000m.
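The mapping is a two-point linear interpolation; a direct Python transcription (with clamping to the calibrated range, an assumption not stated in the formula):

```python
AMP_CLOSE, AMP_FAR = -20.0, -60.0     # dBFS calibration points
DIST_CLOSE, DIST_FAR = 50.0, 1000.0   # metres

def estimate_distance(amplitude_db):
    """Linear interpolation between the two calibration points,
    clamped to [DIST_CLOSE, DIST_FAR]."""
    frac = (amplitude_db - AMP_CLOSE) / (AMP_FAR - AMP_CLOSE)
    dist = DIST_CLOSE + frac * (DIST_FAR - DIST_CLOSE)
    return min(DIST_FAR, max(DIST_CLOSE, dist))
```

For example, -40 dBFS (halfway between the calibration points) maps to 525 m, halfway between 50 m and 1000 m.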
4.3 CNN Acoustic Classifier
4.3.1 Model Architecture
The acoustic classifier is a compact CNN operating on log-mel spectrograms:
Input: (64, 44, 1) -64 mel bands x 44 time frames x 1 channel
Conv2D(16, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2) -> (32, 22, 16)
Conv2D(32, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2) -> (16, 11, 32)
Conv2D(64, 3x3, same) -> BatchNorm -> ReLU -> MaxPool(2x2) -> (8, 5, 64)
Conv2D(64, 3x3, same) -> BatchNorm -> ReLU -> GlobalAvgPool -> (64,)
Dropout(0.3) -> Dense(32, ReLU) -> Dropout(0.2) -> Dense(5, softmax)
Output: [ambient, quadcopter_small, quadcopter_large, fixed_wing, helicopter_uav]
Model specifications:
| Parameter | Value |
|---|---|
| Total parameters | ~52K |
| Model size (TFLite float16) | 129 KB |
| Mel filterbank size | 128 KB (64 x 513 float32) |
| Input shape | [1, 64, 44, 1] |
| Output classes | 5 |
| Confidence threshold | 0.35 |
| Inference time | < 50ms on mobile CPU |
4.3.2 Mel Spectrogram Computation
The on-device mel spectrogram pipeline matches the training pipeline (Python librosa):
- Windowing: 1024-sample Hanning window, 1024-sample hop (non-overlapping)
- FFT: 1024-point radix-2 Cooley-Tukey (implemented in pure Kotlin)
- Power spectrum: |X[k]|^2 for k = 0..512
- Mel filterbank: 64 triangular filters spanning 20-2000 Hz, applied as matrix multiply
- Log compression: 10 x log10(power + 10^-10), clipped to 80 dB dynamic range (matching librosa.power_to_db(ref=np.max, top_db=80))
- Normalization: Min-max scaling to [0, 1]
The mel filterbank weights are pre-computed using librosa.filters.mel(sr=44100, n_fft=1024, n_mels=64, fmin=20, fmax=2000) and shipped as a binary asset (128 KB).
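The log-compression and normalization steps can be sketched in pure Python. This approximates librosa's behavior (a small epsilon stands in for librosa's exact amin handling) and operates on a flat list of mel filterbank outputs:

```python
import math

def log_compress_normalize(power, top_db=80.0):
    """Log compression referenced to the maximum (as in
    librosa.power_to_db(ref=np.max, top_db=80)), then min-max
    scaling to [0, 1] as described in the pipeline above."""
    eps = 1e-10
    ref = max(power)
    db = [10.0 * math.log10(p + eps) - 10.0 * math.log10(ref + eps)
          for p in power]
    floor = max(db) - top_db          # clip to top_db dynamic range
    db = [max(d, floor) for d in db]
    lo, hi = min(db), max(db)
    return [(d - lo) / (hi - lo + eps) for d in db]
```

The loudest bin always maps to 1.0 and anything 80 dB or more below it maps to 0.0, so the CNN input range is stable regardless of absolute microphone gain.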
4.3.3 Training Data: Synthetic Drone Audio Generation
A key contribution of this work is the synthetic training data generation pipeline. Due to the difficulty of collecting real drone audio across multiple types and conditions, we generate physically accurate propeller audio from known frequency profiles.
Generation parameters per drone profile:
| Parameter | Description | Value range |
|---|---|---|
| Fundamental frequency | Blade-pass frequency | 28-400 Hz |
| Harmonics | 2nd, 3rd, 4th overtones | Amplitude: 0.5, 0.25, 0.12 x fundamental |
| Number of propellers | Multi-rotor detuning | 1 (fixed-wing) to 6 (hexcopter) |
| Propeller detuning | Inter-motor frequency offset | +/-3-8 Hz |
| RPM modulation | Throttle variation | +/-5% at 0.2-0.5 Hz |
| Amplitude modulation | Doppler/distance simulation | +/-20% at 0.05-0.15 Hz |
| Background noise | Pink noise (1/f spectrum) | SNR: 15, 25, 40 dB |
Drone profiles used for training:
| Class | Profiles | Frequencies |
|---|---|---|
| quadcopter_small | DJI Mavic 3, DJI Mini, DJI Phantom 4, Generic | 215, 230, 240, 260 Hz |
| quadcopter_large | FPV Racing, FPV Attack, Matrice 600, Heavy Quad | 160, 180, 375, 400 Hz |
| fixed_wing | Bayraktar TB2, Shahed-136, Orlan-10, RC Plane | 65, 95, 110, 130 Hz |
| helicopter_uav | Heron TP, Coaxial Heli, Heli UAV, Medium Heli | 28, 35, 45, 55 Hz |
| ambient | ESC-50 dataset (21 categories) | N/A |
Signal synthesis formula:
For a drone with N propellers, fundamental frequency f0, and detuning delta:
signal(t) = Sum_p=1^N Sum_h=1^4 (a_h / N) x sin(2pi x integral_0^t f_ph(tau) dtau)
where:
f_ph(t) = h x [f0 + delta_p + delta_f x sin(2pi x fmod x t + phi_p)]
delta_p = (p - N/2) x delta / N (per-propeller detuning)
delta_f = 0.05 x f0 (RPM modulation depth)
fmod ~ U(0.2, 0.5) Hz (modulation rate)
a_h = [1.0, 0.5, 0.25, 0.12] (harmonic amplitudes)
This produces audio with the characteristic "buzzing" quality of multi-rotor drones, where slightly detuned motors create beating patterns in the frequency domain.
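The synthesis formula above can be implemented directly, accumulating the phase integral of the instantaneous frequency per sample. This is an illustrative Python sketch (the `tone_magnitude` probe is a helper added here for inspection, not part of the pipeline):

```python
import math
import random

HARMONIC_AMPS = [1.0, 0.5, 0.25, 0.12]   # a_h from the formula above

def synth_drone(f0, n_props=4, detune=5.0, seconds=1.0, sr=44100):
    """N detuned propellers, four harmonics, slow sinusoidal RPM
    modulation. Each propeller's phase is the running integral of its
    instantaneous frequency f_ph(t), accumulated sample by sample."""
    fmod = random.uniform(0.2, 0.5)       # modulation rate (Hz)
    phis = [random.uniform(0, 2 * math.pi) for _ in range(n_props)]
    n = int(seconds * sr)
    out = [0.0] * n
    for p in range(n_props):
        delta_p = (p - n_props / 2) * detune / n_props  # per-prop detuning
        acc = 0.0                                       # integral of f dt
        for i in range(n):
            t = i / sr
            f = f0 + delta_p + 0.05 * f0 * math.sin(
                2 * math.pi * fmod * t + phis[p])
            acc += f / sr
            for h, a in enumerate(HARMONIC_AMPS, start=1):
                out[i] += (a / n_props) * math.sin(2 * math.pi * h * acc)
    return out

def tone_magnitude(sig, freq, sr=44100):
    """Naive single-frequency DFT probe for checking where energy sits."""
    re = sum(x * math.cos(2 * math.pi * freq * i / sr)
             for i, x in enumerate(sig))
    im = sum(x * math.sin(2 * math.pi * freq * i / sr)
             for i, x in enumerate(sig))
    return math.hypot(re, im)
```

Synthesizing a 240 Hz profile (DJI Mavic 3 class) concentrates energy in a cluster around the fundamental, with the slight spread from detuning and RPM modulation producing the beating patterns described above.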
Training dataset composition:
| Source | Samples | Duration |
|---|---|---|
| Synthetic drone audio (16 profiles x 3 SNR x 60s) | ~5,600 clips | 48 minutes |
| ESC-50 ambient (21 categories) | ~6,300 clips | 35 minutes |
| Total (before augmentation) | ~11,900 | 83 minutes |
Data augmentation (SpecAugment-style [4]):
- Time masking: random 1-5 frame zeroing
- Frequency masking: random 1-8 mel band zeroing
- Gaussian noise injection: sigma in [0.01, 0.05]
- Time shifting: +/-3 frames circular shift
After augmentation and class balancing, the final training set contains over 3,000 samples per class.
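The time- and frequency-masking augmentations can be sketched as a small Python function operating on a mel spectrogram stored as a list of rows (an illustrative sketch; the training code's exact masking policy may differ):

```python
import random

def spec_augment(spec, max_t=5, max_f=8):
    """SpecAugment-style masking per the table above: one random time
    mask (1-5 frames) and one frequency mask (1-8 mel bands).
    `spec` is n_mels rows x n_frames columns; the input is not mutated."""
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    # Time mask: zero a contiguous block of frames across all bands.
    tw = random.randint(1, max_t)
    t0 = random.randint(0, n_frames - tw)
    for row in out:
        for t in range(t0, t0 + tw):
            row[t] = 0.0
    # Frequency mask: zero a contiguous block of mel bands.
    fw = random.randint(1, max_f)
    f0 = random.randint(0, n_mels - fw)
    for m in range(f0, f0 + fw):
        out[m] = [0.0] * n_frames
    return out
```

Applying this to the (64, 44) spectrograms used here zeroes at most 5 of 44 frames and 8 of 64 bands per sample, forcing the CNN not to rely on any single time slice or frequency band.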
Training configuration:
| Parameter | Value |
|---|---|
| Optimizer | Adam (lr=0.001) |
| Loss | Sparse categorical cross-entropy |
| Batch size | 32 |
| Epochs | 100 (early stopping, patience=15) |
| LR schedule | ReduceLROnPlateau (factor=0.5, patience=5) |
| Train/val split | 80/20 (stratified) |
| Best val accuracy | 100% (42 epochs) |
4.4 Spectrogram Visualization
The acoustic module includes a real-time waterfall spectrogram display:
- Resolution: 256 frequency bins x 200 time rows (~4.6 seconds of visible history at ~43 rows/second)
- Frequency range: 0-1000 Hz
- Color mapping: Dark green -> bright green -> yellow -> orange/red -> white (magnitude-proportional)
- Harmonic markers: Vertical dashed lines at detected fundamental and harmonics when drone is detected
5. RF Analysis Module
5.1 Architecture
The RF analysis module performs spectrum monitoring to detect drone control link transmissions. In production deployment, this requires an external SDR (software-defined radio) connected via USB-OTG:
| Hardware | Cost | Bandwidth | Frequency Range |
|---|---|---|---|
| RTL-SDR v3 | $25 | 2.048 MSPS | 25 MHz - 1.7 GHz |
| HackRF One | $300 | 20 MSPS | 1 MHz - 6 GHz |
5.2 Detection Pipeline
IQ Samples (SDR) -> FFT -> Peak Detection -> Bandwidth Extraction
-> Modulation Classification -> Protocol Fingerprinting -> Database Match
Target frequency bands:
| Band | Usage | Typical drones |
|---|---|---|
| 2.4 GHz | Control link (primary) | DJI, consumer quads |
| 5.8 GHz | HD video downlink | DJI, racing drones |
| 900 MHz | Long-range control | FPV (Crossfire), military |
| 1575.42 MHz | GPS L1 navigation | All GPS-dependent drones |
5.3 Protocol Fingerprinting
Each drone control protocol has a distinctive RF fingerprint:
| Protocol | Bandwidth | Modulation | Characteristic |
|---|---|---|---|
| DJI OcuSync 3.0 | 10-20 MHz | OFDM | Dual-band simultaneous |
| DJI Lightbridge 2 | 10 MHz | OFDM | Single-band |
| ExpressLRS | 500 kHz | LoRa | Narrow-band, frequency-hopping |
| Crossfire | 1 MHz | LoRa | 900 MHz long-range |
| MIL-STD encrypted | Variable | FHSS | Frequency-hopping spread spectrum |
6. Multi-Sensor Fusion and Threat Display
6.1 Fusion Architecture
The threat display module aggregates detections from all sensor modalities into a unified tactical picture. Each detection source provides complementary information:
| Sensor | Provides | Limitations |
|---|---|---|
| Visual (camera) | Bearing, visual classification, size estimation | Limited range (~500m), LoS only, weather-dependent |
| Acoustic (microphone) | Bearing (with arrays), type classification, presence detection | Limited range (~1km), affected by ambient noise |
| RF (SDR) | Control protocol, operator direction, jammability assessment | Cannot detect autonomous drones |
6.2 Radar Display
The threat display renders a rotating radar sweep with contact blips:
data class ThreatBlip(
    val id: String,               // Database profile ID
    val label: String,            // Display name
    val bearingDeg: Float,        // 0-360 from north
    val rangeMeter: Float,        // Distance from sensor
    val speedMps: Float,          // Closing speed
    val headingDeg: Float,        // Direction of travel
    val threatLevel: ThreatLevel, // LOW/MEDIUM/HIGH/CRITICAL
    val isFriendly: Boolean       // IFF classification
)
6.3 Threat Assessment
Each detected contact receives a composite threat score based on:
- Drone type: Military/FPV attack -> CRITICAL, consumer -> LOW/MEDIUM
- Control link: Autonomous/fiber-optic -> higher threat (unjammable)
- Payload capability: Munition-capable -> CRITICAL
- Closing speed and range: Fast-approaching -> elevated priority
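The four factors above can be combined into a simple additive score. The weights, score bands, and category names below are purely illustrative assumptions, not VAJRA's actual values:

```python
# Hypothetical composite threat scoring over the four listed factors.
# All weights and thresholds here are illustrative placeholders.

BASE = {"consumer": 1, "commercial": 2, "military": 3,
        "fpv_attack": 4, "loitering": 4}
LEVELS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

def threat_level(category, jammable, munition_capable, closing_mps, range_m):
    """Additive score mapped onto the four threat levels."""
    score = BASE.get(category, 1)
    if not jammable:
        score += 1   # autonomous / fiber-optic links resist RF denial
    if munition_capable:
        score += 2
    if closing_mps > 15 and range_m < 1000:
        score += 1   # fast-approaching contact gets elevated priority
    return LEVELS[min(3, (score - 1) // 2)]
```

With these placeholder weights, an unjammable munition-capable military UAS scores HIGH and a fast-closing FPV attack drone scores CRITICAL, roughly mirroring Table 1.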
7. Countermeasure Control
7.1 Jammer Specifications
VAJRA includes an interface for directing RF countermeasures (requires external jammer hardware in production):
| Band | Power | Target |
|---|---|---|
| 2.4 GHz | 10W | Wi-Fi, DJI OcuSync, ExpressLRS |
| 5.8 GHz | 10W | Video downlinks |
| 900 MHz | 10W | Crossfire, military FH protocols |
| GPS L1 | 5W | Navigation denial |
7.2 Countermeasure Modes
- Barrage jamming: Simultaneous wideband transmission across all bands, forcing drone failsafe (hover -> land)
- DJI RTH spoof: Protocol-level command injection exploiting an OcuSync vulnerability to trigger return-to-home
- GPS spoofing: Transmission of false GPS L1 signals at +20 dBm over real signal, redirecting GPS-dependent drones to a designated safe zone
7.3 Jammability Matrix
| Control Link | Jammable | Alternative |
|---|---|---|
| Wireless RF (2.4/5.8 GHz) | Yes | Barrage jam + protocol exploit |
| Fiber-optic | No | Kinetic intercept required |
| Satellite (C/Ku-band) | No | GPS spoofing (if GPS-dependent) |
| Autonomous (GPS-guided) | No (RF) | GPS spoofing may redirect |
8. Results and Discussion
8.1 Visual Detection Performance
The custom-trained YOLOv8n achieves real-time inference on mobile hardware:
| Metric | Value |
|---|---|
| Inference speed | > 25 FPS on mid-range Android |
| Model size | 12 MB (float16 TFLite) |
| Input resolution | 320 x 320 |
| NMS processing | < 2ms |
| Memory footprint | ~45 MB (model + buffers) |
8.2 Acoustic Detection Performance
FFT layer:
| Metric | Value |
|---|---|
| Processing rate | ~43 frames/second |
| Detection latency | ~150ms (6 confirmation frames) |
| Frequency resolution | 43.07 Hz |
| Effective range | 50-1000m (estimated) |
CNN classifier:
| Metric | Value |
|---|---|
| Inference speed | < 50ms per 1-second clip |
| Model size | 129 KB (float16 TFLite) + 128 KB filterbank |
| Validation accuracy | 100% on synthetic test set |
| Classes | 5 (ambient + 4 drone types) |
8.3 System-Level Performance
| Metric | Value |
|---|---|
| Total APK size | 38 MB |
| Cold start time | < 3 seconds |
| Battery impact | ~15% per hour (active scanning) |
| Network requirement | None (fully offline) |
| Minimum hardware | Android 8.0, ARM64, 2GB RAM |
8.4 Limitations
- Synthetic training data gap: The acoustic classifier achieves 100% accuracy on synthetic test data but shows reduced performance on real-world audio played through speakers. This is expected: synthetic audio has an idealized harmonic structure that differs from audio captured through the speaker -> air -> microphone chain, whose spectral characteristics are shaped by speaker frequency response, room acoustics, and microphone response curves.
- Visual detection range: At 320x320 input resolution, small drones at distances beyond ~200m occupy very few pixels, reducing detection confidence. Higher input resolution (640x640) would improve range at the cost of inference speed.
- Low-frequency drones: Fixed-wing drones (65-130 Hz) and helicopter UAVs (28-55 Hz) produce fundamental frequencies below the reproduction capability of most phone speakers (which typically roll off below ~150 Hz), making speaker-based testing impossible for these categories. Real-world testing with actual drones is required to validate these classes.
- RF module: Currently operates with simulated data; production deployment requires external SDR hardware.
9. Improving with More Data
9.1 Visual Model Improvements
The current YOLOv8n model can be significantly improved through expanded training data:
Priority datasets:
| Source | Type | Expected benefit |
|---|---|---|
| Real drone flight recordings | Video frames at various distances, angles, backgrounds | +10-15% accuracy at range |
| Thermal/IR imagery | FLIR sensor captures | Night-time detection capability |
| Adverse weather footage | Rain, fog, dusk/dawn | Robustness in operational conditions |
| Drone-specific negative samples | Birds, aircraft, kites, balloons | Reduced false positive rate |
| Synthetic data (AirSim/Gazebo) | Rendered drone models on diverse backgrounds | Scale training data 10-100x |
Recommended training pipeline:
- Start with current custom-trained model as baseline
- Collect 10,000+ real drone images across 5+ drone types
- Include 20,000+ negative samples (birds at various sizes, aircraft)
- Train YOLOv8s (small) or YOLOv8m (medium) for higher accuracy
- Apply INT8 quantization for mobile deployment
- Target: > 90% mAP@0.5 at 50-500m range
9.2 Acoustic Model Improvements
Path 1: Real drone recordings
The most impactful improvement is supplementing synthetic data with real recordings:
Recording Protocol:
1. Record each drone type at 3 distances (50m, 200m, 500m)
2. Record in 3 environments (open field, urban, forested)
3. Use phone microphone (matches deployment hardware)
4. Minimum 2 minutes per recording
5. Include hover, forward flight, and approach maneuvers
Even 30 seconds of real drone audio per type would significantly improve generalization over synthetic-only training.
Path 2: Transfer learning from large audio models
Pre-trained audio models (AudioSet, VGGish, YAMNet) can provide learned feature representations that generalize better than training from scratch:
- Use YAMNet (Google's AudioSet model) as feature extractor
- Replace final classification head with 5-class drone classifier
- Fine-tune on synthetic + real drone data
- Expected benefit: better ambient rejection, more robust features
Path 3: Environmental calibration
Record 1-2 minutes of ambient sound at the deployment location. Use this as additional "ambient" training data specific to the operating environment. This trains the model to reject site-specific background noise (nearby roads, industrial equipment, wildlife).
Path 4: Data augmentation improvements
| Augmentation | Purpose |
|---|---|
| Room impulse response (RIR) convolution | Simulate speaker-to-mic acoustic path |
| Speaker frequency response simulation | Model typical phone/laptop speaker rolloff |
| Multi-drone mixing | Detect when 2+ drones present simultaneously |
| Wind noise overlay | Outdoor robustness |
| Variable-distance amplitude scaling | Continuous distance modeling |
9.3 Cross-Modal Learning
Future work could explore cross-modal training where visual and acoustic detections reinforce each other:
- When visual detection confirms a drone, label the concurrent acoustic data
- Build a self-supervised dataset from field deployments
- Train a fusion model that jointly processes visual features + mel spectrograms
- This approach could improve detection confidence when either modality alone is uncertain
10. Related Work
Commercial C-UAS systems: DroneShield DroneSentry [5] uses acoustic arrays + radar + RF detection with costs exceeding $500K. Dedrone DedroneTracker [6] uses camera + RF + radar with cloud processing. Neither operates on mobile devices.
Academic drone detection: Kim et al. [7] demonstrated acoustic drone detection using mel spectrograms with 94% accuracy on recorded drone audio. Al-Emadi et al. [8] proposed a CNN-based acoustic classifier achieving 96% accuracy on real drone recordings. Our synthetic data approach complements these works by enabling rapid prototyping without drone access.
On-device ML: YOLOv8n represents the state-of-the-art in mobile object detection [3], achieving 37.3% mAP on COCO at 80+ FPS on mobile GPUs. Our application demonstrates its viability for specialized drone detection tasks.
11. Conclusion
VAJRA demonstrates that a practical multi-sensor counter-UAS system can operate entirely on a commercial smartphone. By combining custom-trained visual (YOLOv8n, 12MB) and acoustic (CNN, 129KB) deep learning models with real-time FFT signal processing and an RF analysis pipeline, the system provides drone detection, classification, and countermeasure guidance without any network dependency.
The synthetic audio training approach enables rapid model development for new drone types without requiring physical access to each drone. While the synthetic-to-real domain gap remains a challenge, we outline a clear path to closing this gap through real-world recordings, transfer learning, and environmental calibration.
VAJRA's fully on-device architecture makes it uniquely suited for:
- Forward military positions with denied/degraded communications
- Border security posts in remote areas without network infrastructure
- Critical infrastructure protection where data sovereignty requires on-premises processing
- Rapid deployment -any soldier with a smartphone becomes a drone detection node
Future work will focus on expanding the training datasets with real drone recordings, implementing acoustic direction-of-arrival using multi-microphone arrays, and integrating external SDR hardware for production RF analysis capability.
References
- [1] Watling, J. & Reynolds, N. "The Role of Drones in the Russia-Ukraine War." Royal United Services Institute, 2023.
- [2] Ministry of Defence, Government of India. "Anti-Drone Technology Requirements for Border Security." DRDO Technology Perspective, 2024.
- [3] Jocher, G., Chaurasia, A., & Qiu, J. "Ultralytics YOLOv8." Ultralytics, 2023.
- [4] Park, D.S. et al. "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." Interspeech, 2019.
- [5] DroneShield. "DroneSentry: Integrated Detect-and-Defeat Counter-Drone System." DroneShield Technical Datasheet, 2024.
- [6] Dedrone. "DedroneTracker: AI-Powered Airspace Security Platform." Dedrone Product Documentation, 2024.
- [7] Kim, J. et al. "Acoustic-Based Drone Detection and Classification Using Mel Spectrograms and Convolutional Neural Networks." IEEE Access, vol. 9, 2021.
- [8] Al-Emadi, S. et al. "Audio Based Drone Detection and Identification Using Deep Learning." IEEE International Workshop on Signal Processing Advances in Wireless Communications, 2019.