{ slm360-base } · Released
SLM360 Base
On-Device Decoder for Text Generation
// overview
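SLM360 Base is a 125M-parameter causal decoder for autoregressive text generation. With INT4 quantization it generates text in under 50ms per token on ARM mobile hardware. A pre-allocated KV cache keeps the per-token projection and cache-update cost at O(1): attention still reads the cached prefix, but nothing is recomputed for earlier tokens. Tying the LM head to the embedding matrix saves 24.6M parameters (about 20%). Grouped Query Attention with 12 query heads and 4 KV heads shrinks the KV cache 3x relative to multi-head attention. LoRA adapters enable per-user personalization in under 500KB per adapter.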
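To give a feel for the sub-500KB adapter budget, the arithmetic below counts LoRA parameters for one hypothetical configuration. The source does not state the adapter rank, which matrices are adapted, or the storage precision, so the choices here (rank 4, query and value projections only, 16-bit values) are assumptions made purely for illustration.

```python
DIM = 768       # embedding dimension (from the specifications)
LAYERS = 16     # decoder blocks (from the specifications)
KV_DIM = 256    # 4 KV heads x 64 dims per head, assuming head_dim = 768 / 12

def lora_params(rank, targets):
    """Count LoRA parameters when every layer adapts the given matrices.

    targets is a list of (in_features, out_features) pairs. Each adapted
    matrix gains two low-rank factors: A (rank x in) and B (out x rank).
    """
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in targets)
    return LAYERS * per_layer

def adapter_bytes(rank, targets, bytes_per_value):
    """Storage for one adapter at the given element width."""
    return lora_params(rank, targets) * bytes_per_value

# Hypothetical target set: query (768 -> 768) and value (768 -> 256) projections.
QV_TARGETS = [(DIM, DIM), (DIM, KV_DIM)]

for rank in (2, 4, 8):
    size_kb = adapter_bytes(rank, QV_TARGETS, bytes_per_value=2) / 1000
    print(f"rank {rank}: {lora_params(rank, QV_TARGETS):,} params, {size_kb:.0f} KB at 16 bits")
```

Under these assumptions a rank-4 adapter fits the stated budget while a rank-8 adapter does not, which suggests the budget constrains rank, target set, or precision fairly tightly.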
// specs
Specifications
| Specification | Value |
|---|---|
| Parameters | 125M |
| Embedding Dim | 768 |
| Layers | 16 |
| Attention Heads (Q/KV) | 12 / 4 (GQA) |
| Vocabulary | 32,000 |
| Max Sequence Length | 2,048 |
| Size (f32) | 501MB |
| Size (INT4) | 63MB |
| Latency/Token (INT4, ARM) | <50ms |
| Latency/Token (INT4, x86) | <20ms |
| Compression Ratio | 8x |
| KV Cache Savings | 3x vs MHA |
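The 3x KV cache figure follows directly from the head counts: the cache stores keys and values only for KV heads, so 4 KV heads instead of 12 cuts it to a third. The sketch below works through that arithmetic. The per-head dimension of 64 assumes the embedding is split evenly across the 12 query heads, and the cache element width is not stated in the source, so several widths are shown rather than one.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

LAYERS, Q_HEADS, KV_HEADS = 16, 12, 4
HEAD_DIM = 768 // Q_HEADS   # 64, assuming an even split across query heads
SEQ_LEN = 2048

# The element width is an assumption; the source does not say how the cache is stored.
for name, width in (("f32", 4), ("f16", 2), ("int8", 1)):
    mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN, width)
    gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, width)
    print(f"{name}: MHA {mha / 1e6:.1f} MB, GQA {gqa / 1e6:.1f} MB, "
          f"saved {(mha - gqa) / 1e6:.1f} MB, ratio {mha // gqa}x")
```

The ratio is independent of element width and sequence length; only the absolute number of megabytes saved changes.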
// architecture
Architecture
- 01 Token IDs > Embedding (32,000 x 768)
- 02 + RoPE Position Encoding
- 03 16 x DecoderBlock, attention sublayer: RMSNorm > GQA (12 heads, 4 KV, causal mask) > + Residual
- 04 16 x DecoderBlock, feed-forward sublayer: RMSNorm > SwiGLU (768 > 2,048 > 768) > + Residual
- 05 RMSNorm > LM Head (weight-tied with embedding, > 32,000)
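As a sanity check, the parameter count can be rebuilt from the layer shapes above. This is a sketch under stated assumptions, not a dump of the real checkpoint: it assumes no bias terms, three weight matrices in each SwiGLU block (gate, up, down), one learned scale vector per RMSNorm, and no learned parameters in RoPE.

```python
VOCAB, DIM, LAYERS = 32_000, 768, 16
Q_HEADS, KV_HEADS, FFN_DIM = 12, 4, 2_048
HEAD_DIM = DIM // Q_HEADS          # 64
KV_DIM = KV_HEADS * HEAD_DIM       # 256

def count_parameters():
    """Total parameters under the assumptions listed in the lead-in."""
    embedding = VOCAB * DIM                        # also serves as the LM head (weight tying)
    attention = 2 * DIM * DIM + 2 * DIM * KV_DIM   # Q and O are full width; K and V are KV_DIM wide
    swiglu = 3 * DIM * FFN_DIM                     # gate, up, and down projections
    norms = 2 * DIM                                # one RMSNorm before each sublayer
    block = attention + swiglu + norms
    final_norm = DIM
    return embedding + LAYERS * block + final_norm

total = count_parameters()
print(f"parameters: {total:,}")
print(f"f32 size:   {total * 4 / 1e6:.1f} MB")
print(f"INT4 size:  {total * 0.5 / 1e6:.1f} MB before quantization scales")
print(f"untied LM head would add {VOCAB * DIM:,} parameters")
```

Under these assumptions the total comes to roughly 125.3M parameters and about 501 MB in f32, which lines up with the listed specifications.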
// features
Features
- 01 Causal attention mask for autoregressive generation
- 02 Weight-tied LM head saves 24.6M parameters (about 20% reduction)
- 03 Pre-allocated KV cache for O(1) per-token projection and cache-update cost (attention still reads the cached prefix)
- 04 12:4 Grouped Query Attention for 3x KV cache reduction (~38MB saved at 2,048 seq length)
- 05 LoRA adapters for per-user personalization (<500KB per adapter)
- 06 Sampling: greedy, top-k, top-p, temperature, repetition penalty
- 07 Two-phase generation: parallel prefill + sequential decode
- 08 Binary serialization with 64-byte alignment for SIMD and cache-line optimization
- 09 Memory-mapped loading for zero-copy weight access on native platforms
- 10 INT4 group-wise quantization with >0.99 cosine similarity to f32
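The sampling options in the list above could be combined roughly as follows. This is a minimal sketch, not the SLM360 runtime's API: the function name, the parameter names, and the order in which the filters are applied are all assumptions, and the repetition penalty uses one common formulation (divide positive logits, multiply negative ones).

```python
import math
import random

def sample_next_token(logits, previous_tokens, temperature=1.0, top_k=0,
                      top_p=1.0, repetition_penalty=1.0, rng=random):
    """Choose the next token id from raw logits (illustrative helper)."""
    logits = list(logits)

    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(previous_tokens):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Greedy decoding: temperature 0 means "take the argmax".
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)

    scaled = [x / temperature for x in logits]
    order = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)

    # Top-k: keep only the k highest-scoring candidates.
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the surviving candidates (shifted by the max for stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        cumulative, cutoff = 0.0, len(order)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        order, probs = order[:cutoff], probs[:cutoff]

    return rng.choices(order, weights=probs, k=1)[0]

print(sample_next_token([1.0, 3.0, 2.0], previous_tokens=[], temperature=0.0))
```

Passing a seeded `random.Random` instance as `rng` makes the stochastic paths reproducible, which is convenient when comparing settings.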
// benchmarks
Benchmarks
| Metric | Value | Comparison |
|---|---|---|
| Prefill (128 tokens, ARM) | ~200ms | - |
| Prefill (128 tokens, x86) | ~80ms | - |
| INT4 Accuracy Retention | 98.5% | - |
| Mean Absolute Error (INT4) | <0.01 | - |
| Energy per Query | 0.001 Wh | Cloud LLM: 0.42-29 Wh |
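The INT4 rows above report reconstruction metrics (cosine similarity and mean absolute error against f32). The sketch below shows how such metrics can be computed for a symmetric group-wise 4-bit scheme. The group size of 32, the symmetric [-8, 7] code range, and the synthetic Gaussian weights are assumptions; the source does not describe the actual quantizer, so the printed numbers are not comparable to the table.

```python
import math
import random

GROUP_SIZE = 32  # assumed; the source does not state the group size

def quantize_group(values):
    """Symmetric 4-bit quantization: integer codes in [-8, 7] plus one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale

def quantize(weights, group_size=GROUP_SIZE):
    """Split the weights into fixed-size groups and quantize each one independently."""
    return [quantize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]

def dequantize(groups):
    """Rebuild approximate float weights from codes and per-group scales."""
    return [code * scale for codes, scale in groups for code in codes]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Synthetic stand-in for a weight tensor (not real model weights).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
groups = quantize(weights)
restored = dequantize(groups)

print(f"cosine similarity: {cosine_similarity(weights, restored):.4f}")
print(f"mean abs error:    {mean_abs_error(weights, restored):.5f}")
```

Using one scale per small group rather than one per tensor limits the damage a single outlier weight can do, since it only coarsens the grid for its own group.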
// deployment
Deployment
- 01 Native ARM with NEON SIMD acceleration
- 02 Native x86 with AVX2 + FMA acceleration
- 03 WebAssembly for browser-based generation
- 04 Android via JNI bindings
- 05 iOS via FFI bindings