SLM360 Base
Status: Released
On-Device Decoder for Text Generation
SLM360 Base is a 125M-parameter causal decoder for autoregressive text generation. With INT4 quantization it generates text in under 50ms per token on mobile hardware. A KV cache lets each decode step process only the newest token instead of recomputing the whole sequence. Tying the LM head to the embedding matrix saves 24.6M parameters (about 20%). Grouped Query Attention, with 12 query heads sharing 4 KV heads, shrinks the KV cache 3x relative to multi-head attention. LoRA adapters enable per-user personalization in under 500KB per adapter.
Specifications
| Specification | Value |
|---|---|
| Parameters | 125M |
| Embedding Dim | 768 |
| Layers | 16 |
| Attention Heads (Q/KV) | 12 / 4 (GQA) |
| Vocabulary | 32,000 |
| Max Sequence Length | 2,048 |
| Size (f32) | 501MB |
| Size (INT4) | 63MB |
| Latency/Token (INT4, ARM) | <50ms |
| Latency/Token (INT4, x86) | <20ms |
| Compression Ratio | 8x |
| KV Cache Savings | 3x vs MHA |
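
The 3x KV cache figure follows directly from the head counts: the cache stores keys and values for 4 KV heads instead of 12. The sketch below works through that arithmetic. The helper `kv_cache_bytes` is hypothetical, and the 2-byte (f16) cache entry is an assumption, since the page does not state the cache precision; the ratio holds for any precision, but the absolute sizes do not.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Values taken from the specifications.
LAYERS, EMBED_DIM, Q_HEADS, KV_HEADS, SEQ_LEN = 16, 768, 12, 4, 2048
HEAD_DIM = EMBED_DIM // Q_HEADS  # 64 dimensions per head

BYTES_PER_VALUE = 2  # assumption: f16 cache entries (not stated in the source)

# Multi-head attention would keep one KV head per query head.
mha_bytes = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN, BYTES_PER_VALUE)
# Grouped Query Attention keeps only the 4 shared KV heads.
gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, BYTES_PER_VALUE)

print(f"MHA cache: {mha_bytes / 1e6:.1f} MB")
print(f"GQA cache: {gqa_bytes / 1e6:.1f} MB")
print(f"Reduction: {mha_bytes / gqa_bytes:.1f}x")
```

Changing `BYTES_PER_VALUE` scales both sizes equally, so the reduction factor stays at 12 / 4 = 3.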
Architecture
1. Token IDs -> Embedding (32,000 x 768)
2. + RoPE Position Encoding
3. 16 x DecoderBlock: RMSNorm -> GQA (12 heads, 4 KV, causal mask) -> + Residual
4. 16 x DecoderBlock: RMSNorm -> SwiGLU (768 -> 2,048 -> 768) -> + Residual
5. RMSNorm -> LM Head (weight-tied with embedding, -> 32,000)
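
As a sanity check, the parameter count can be tallied from the layer shapes above. This is a rough sketch under stated assumptions, not the project's own accounting: it assumes bias-free projections, a separate attention output projection, the usual three-matrix SwiGLU (gate, up, down), and one gain vector per RMSNorm. None of these details are confirmed by the page.

```python
VOCAB, DIM, LAYERS = 32_000, 768, 16
Q_HEADS, KV_HEADS, FFN_DIM = 12, 4, 2_048
HEAD_DIM = DIM // Q_HEADS        # 64
KV_DIM = KV_HEADS * HEAD_DIM     # 256: width of the K and V projections

# Embedding matrix; the weight-tied LM head reuses it, so it is counted once.
embedding = VOCAB * DIM

# Attention: Q and output projections are DIM x DIM, K and V are DIM x KV_DIM.
attention = 2 * DIM * DIM + 2 * DIM * KV_DIM

# SwiGLU: gate, up, and down matrices (assumption: no biases).
swiglu = 3 * DIM * FFN_DIM

# Two RMSNorm gain vectors per block (one before each sublayer).
norms = 2 * DIM

per_block = attention + swiglu + norms
total = embedding + LAYERS * per_block + DIM  # + final RMSNorm

print(f"Embedding (shared with LM head): {embedding / 1e6:.1f}M")
print(f"Total parameters:                {total / 1e6:.1f}M")
print(f"f32 size:  {total * 4 / 1e6:.0f} MB")
print(f"INT4 size: {total * 0.5 / 1e6:.0f} MB (excluding quantization scales)")
```

Under these assumptions the tally lands on the quoted 125M parameters, 501MB in f32, and roughly 63MB in INT4. An untied LM head would add a second 32,000 x 768 matrix, which is the 24.6M parameters that weight tying avoids.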
Features
- Causal attention mask for autoregressive generation
- Weight-tied LM head saves 24.6M parameters (20% reduction)
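- Pre-allocated KV cache, so each decode step computes keys and values only for the new token (attention over the cached entries still scales with context length)
- 12:4 Grouped Query Attention for 3x KV cache reduction (~38MB saved at 2,048 sequence length)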
- LoRA adapters for per-user personalization (<500KB per adapter)
- Sampling: greedy, top-k, top-p, temperature, repetition penalty
- Two-phase generation: parallel prefill + sequential decode
- Binary serialization with 64-byte alignment for SIMD and cache-line optimization
- Memory-mapped loading for zero-copy weight access on native platforms
- INT4 group-wise quantization with >0.99 cosine similarity to f32
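
The sampling options listed above are commonly combined into a single next-token step. The following is a minimal standard-library sketch of one such combination, not the SLM360 API: the function name `sample_next_token`, its parameters, the order in which the filters are applied, and the divide-or-multiply form of the repetition penalty are all assumptions.

```python
import math
import random


def sample_next_token(logits, generated=(), temperature=1.0, top_k=0,
                      top_p=1.0, repetition_penalty=1.0, rng=random):
    """Pick the next token id from raw logits (hypothetical helper)."""
    logits = list(logits)

    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Greedy decoding: temperature 0 means "take the argmax".
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)

    # Candidate ids, highest logit first; top-k keeps only the first k.
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Temperature-scaled softmax over the remaining candidates.
    scaled = [logits[i] / temperature for i in order]
    peak = scaled[0]
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p < 1.0:
        cumulative, cutoff = 0.0, len(probs)
        for count, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = count
                break
        order, probs = order[:cutoff], probs[:cutoff]

    # random.choices renormalizes the weights internally.
    return rng.choices(order, weights=probs, k=1)[0]


example_logits = [1.0, 3.0, 2.0]
print(sample_next_token(example_logits, temperature=0))
print(sample_next_token(example_logits, top_k=2, top_p=0.9,
                        rng=random.Random(0)))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is useful when comparing sampling settings.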
Benchmarks
| Metric | Value | Comparison |
|---|---|---|
| Prefill (128 tokens, ARM) | ~200ms | - |
| Prefill (128 tokens, x86) | ~80ms | - |
| INT4 Accuracy Retention | 98.5% | - |
| Mean Absolute Error (INT4) | <0.01 | - |
| Energy per Query | 0.001 Wh | Cloud LLM: 0.42-29 Wh |
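
To make the INT4 fidelity metrics concrete, the sketch below quantizes a synthetic weight vector group by group, restores it, and measures mean absolute error and cosine similarity against the original. It is an illustration only: the symmetric scheme, the group size of 32, the function names, and the Gaussian test data are all assumptions, and the printed numbers say nothing about the model's real weights.

```python
import math
import random


def quantize_int4(values, group_size=32):
    """Symmetric group-wise INT4: one float scale per group, codes in [-7, 7]."""
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        peak = max(abs(v) for v in group)
        scale = peak / 7 if peak > 0 else 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(max(-7, min(7, round(v / scale))) for v in group)
    return codes, scales


def dequantize_int4(codes, scales, group_size=32):
    """Map each code back to a float using its group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_absolute_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a weight tensor (assumption: small Gaussian values).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

codes, scales = quantize_int4(weights)
restored = dequantize_int4(codes, scales)

print(f"cosine similarity:   {cosine_similarity(weights, restored):.4f}")
print(f"mean absolute error: {mean_absolute_error(weights, restored):.6f}")
```

Because each group carries its own scale, a single outlier only coarsens the 32 values that share its group rather than the whole tensor, which is the usual motivation for group-wise over per-tensor quantization.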
Deployment Targets
- Native ARM with NEON SIMD acceleration
- Native x86 with AVX2 + FMA acceleration
- WebAssembly for browser-based generation
- Android via JNI bindings
- iOS via FFI bindings