SLM360 Base

On-Device Decoder for Text Generation

// overview

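SLM360 Base is a 125M-parameter causal decoder for autoregressive text generation. With INT4 quantization it generates text in under 50ms per token on mobile ARM hardware. A pre-allocated KV cache avoids recomputing keys and values for earlier tokens, giving O(1) key/value work per generated token. Tying the LM head to the embedding matrix saves 24.6M parameters, about 20% of the model's size. Grouped Query Attention (12 query heads sharing 4 KV heads) shrinks the KV cache 3x relative to multi-head attention. LoRA adapters enable per-user personalization in under 500KB per adapter.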
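To make the adapter budget concrete, here is a rough sizing sketch. The rank (4), the choice to adapt only the query and value projections, and f16 storage are assumptions for illustration; the card only states the under-500KB bound. The layer count (16) and dimensions come from the specification and architecture sections, and `lora_linear` is a hypothetical helper showing the usual low-rank update.

```python
import numpy as np

D_MODEL = 768                 # embedding dim from the spec sheet
N_LAYERS = 16                 # decoder blocks from the architecture section
HEAD_DIM = D_MODEL // 12      # 12 query heads -> 64 dims per head
KV_DIM = 4 * HEAD_DIM         # 4 KV heads -> 256
RANK = 4                      # ASSUMPTION: adapter rank is not given in the card
BYTES_PER_PARAM = 2           # ASSUMPTION: adapters stored as f16


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A (d_in x rank) plus B (rank x d_out)."""
    return rank * (d_in + d_out)


def adapter_bytes(rank: int = RANK) -> int:
    """Adapter size when only Wq and Wv are adapted (an assumption)."""
    per_layer = lora_params(D_MODEL, D_MODEL, rank) + lora_params(D_MODEL, KV_DIM, rank)
    return N_LAYERS * per_layer * BYTES_PER_PARAM


def lora_linear(x, w, a, b, alpha: float = 8.0):
    """Frozen weight w plus the scaled low-rank update: x W + (alpha / r) x A B."""
    rank = a.shape[1]
    return x @ w + (alpha / rank) * ((x @ a) @ b)


if __name__ == "__main__":
    print(f"adapter size: {adapter_bytes() / 1024:.0f} KiB")
```

Under these assumptions the adapter comes to 320 KiB, which fits the stated budget; a higher rank or more adapted projections would need to be checked against the same bound.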

// specs

Specifications

Specification	Value
Parameters	125M
Embedding Dim	768
Layers	16
Attention Heads (Q/KV)	12 / 4 (GQA)
Vocabulary	32,000
Max Sequence Length	2,048
Size (f32)	501MB
Size (INT4)	63MB
Latency/Token (INT4, ARM)	<50ms
Latency/Token (INT4, x86)	<20ms
Compression Ratio	~8x (f32 to INT4)
KV Cache Savings	3x vs MHA
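The headline numbers can be cross-checked with a little arithmetic. The sketch below assumes bias-free linear layers, a three-matrix SwiGLU (gate, up, down), 16 decoder blocks and a 2,048-wide feed-forward layer as listed in the architecture section, and it ignores the small RMSNorm gain vectors and any INT4 scale metadata. Those assumptions are not spelled out in the card, so treat this as a sanity check rather than the exact accounting.

```python
VOCAB, D_MODEL, N_LAYERS = 32_000, 768, 16
N_Q_HEADS, N_KV_HEADS, D_FF = 12, 4, 2_048
HEAD_DIM = D_MODEL // N_Q_HEADS          # 64
KV_DIM = N_KV_HEADS * HEAD_DIM           # 256


def param_count() -> int:
    """Parameters with the LM head tied to the embedding matrix."""
    embedding = VOCAB * D_MODEL                            # shared with the LM head
    attention = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * KV_DIM  # Wq, Wo, Wk, Wv
    feed_forward = 3 * D_MODEL * D_FF                      # gate, up, down
    return embedding + N_LAYERS * (attention + feed_forward)


def kv_cache_elements(seq_len: int, kv_dim: int) -> int:
    """Cached K and V values across all layers (dtype-independent)."""
    return 2 * N_LAYERS * seq_len * kv_dim


if __name__ == "__main__":
    n = param_count()
    print(f"parameters:      {n:,}")                       # 125,239,296
    print(f"f32 size:        {n * 4 / 1e6:.0f} MB")        # 501 MB
    print(f"INT4 size:       {n * 0.5 / 1e6:.0f} MB")      # 63 MB
    print(f"tied-head share: {VOCAB * D_MODEL / n:.1%}")   # 19.6%
    mha = kv_cache_elements(2_048, D_MODEL)
    gqa = kv_cache_elements(2_048, KV_DIM)
    print(f"KV cache ratio:  {mha // gqa}x")               # 3x
```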

// architecture

Architecture

 1  Token IDs > Embedding (32,000 x 768)
 2  + RoPE Position Encoding
 3  In each of 16 DecoderBlocks, attention sublayer: RMSNorm > GQA (12 query heads, 4 KV heads, causal mask) > + Residual
 4  In each of the same 16 DecoderBlocks, feed-forward sublayer: RMSNorm > SwiGLU (768 > 2,048 > 768) > + Residual
 5  RMSNorm > LM Head (weight-tied with embedding, > 32,000)
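The block structure above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the shipped implementation: weights are random, linear layers are bias-free, RoPE is omitted for brevity, and all function and parameter names (`decoder_block`, `init_block`, the dictionary keys) are hypothetical.

```python
import numpy as np


def rms_norm(x, gain, eps=1e-6):
    """Scale each row to unit root-mean-square, then apply a learned gain."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def gqa(x, wq, wk, wv, wo, n_q=12, n_kv=4):
    """Grouped Query Attention: every 3 query heads share one K/V head."""
    t, d = x.shape
    hd = d // n_q
    q = (x @ wq).reshape(t, n_q, hd).transpose(1, 0, 2)    # (n_q, t, hd)
    k = (x @ wk).reshape(t, n_kv, hd).transpose(1, 0, 2)   # (n_kv, t, hd)
    v = (x @ wv).reshape(t, n_kv, hd).transpose(1, 0, 2)
    group = n_q // n_kv
    k = np.repeat(k, group, axis=0)                        # expand to (n_q, t, hd)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)        # (n_q, t, t)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)     # causal mask
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v                              # (n_q, t, hd)
    return out.transpose(1, 0, 2).reshape(t, d) @ wo


def swiglu(x, w_gate, w_up, w_down):
    """SiLU(x W_gate) * (x W_up), projected back down to the model width."""
    g = x @ w_gate
    return ((g / (1.0 + np.exp(-g))) * (x @ w_up)) @ w_down


def decoder_block(x, p):
    """Pre-norm block: attention sublayer, then feed-forward sublayer."""
    x = x + gqa(rms_norm(x, p["g1"]), p["wq"], p["wk"], p["wv"], p["wo"])
    x = x + swiglu(rms_norm(x, p["g2"]), p["w_gate"], p["w_up"], p["w_down"])
    return x


def init_block(rng, d=768, kv_dim=256, d_ff=2048, std=0.02):
    """Random weights with the shapes implied by the spec sheet."""
    w = lambda *shape: rng.standard_normal(shape) * std
    return {
        "g1": np.ones(d), "g2": np.ones(d),
        "wq": w(d, d), "wk": w(d, kv_dim), "wv": w(d, kv_dim), "wo": w(d, d),
        "w_gate": w(d, d_ff), "w_up": w(d, d_ff), "w_down": w(d_ff, d),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.standard_normal((5, 768))     # 5 token embeddings
    print(decoder_block(hidden, init_block(rng)).shape)
```

Because of the causal mask, changing a later token leaves the outputs at earlier positions unchanged, which is what lets keys and values be cached during decoding.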

// features

Features

01 Causal attention mask for autoregressive generation
02 Weight-tied LM head saves 24.6M parameters (20% reduction)
03 Pre-allocated KV cache for O(1) per-token generation
04 12:4 Grouped Query Attention for 3x KV cache reduction (~38MB saved at 2,048 seq length)
05 LoRA adapters for per-user personalization (<500KB per adapter)
06 Sampling: greedy, top-k, top-p, temperature, repetition penalty
07 Two-phase generation: parallel prefill + sequential decode
08 Binary serialization with 64-byte alignment for SIMD and cache-line optimization
09 Memory-mapped loading for zero-copy weight access on native platforms
10 INT4 group-wise quantization with >0.99 cosine similarity to f32
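The sampling options in item 06 are commonly combined in a single step like the one sketched below. The function name `sample_next`, its parameters, the order of the filters, and the divide-or-multiply form of the repetition penalty are assumptions for illustration; the card does not say how SLM360 Base implements them.

```python
import numpy as np


def sample_next(logits, prev_tokens=(), temperature=1.0, top_k=0,
                top_p=1.0, repetition_penalty=1.0, rng=None):
    """Pick the next token id from a 1-D array of logits."""
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(prev_tokens):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Greedy decoding: temperature 0 means "always take the argmax".
    if temperature <= 0:
        return int(np.argmax(logits))
    logits /= temperature

    # Top-k: drop everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set whose mass reaches top_p.
    if top_p < 1.0:
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        kept = np.zeros_like(probs)
        kept[order[:cutoff]] = probs[order[:cutoff]]
        probs = kept / kept.sum()

    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(probs.size, p=probs))


if __name__ == "__main__":
    demo_logits = [0.1, 2.0, 0.3, -1.0]
    print(sample_next(demo_logits, temperature=0))          # greedy
    print(sample_next(demo_logits, prev_tokens=[1], temperature=0.8,
                      top_k=3, top_p=0.9, repetition_penalty=1.2,
                      rng=np.random.default_rng(0)))
```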

// benchmarks

Benchmarks

Metric	Value	Comparison
Prefill (128 tokens, ARM)	~200ms	-
Prefill (128 tokens, x86)	~80ms	-
INT4 Accuracy Retention	98.5%	-
Mean Absolute Error (INT4)	<0.01	-
Energy per Query	0.001 Wh	Cloud LLM: 0.42-29 Wh
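The INT4 fidelity metrics can be measured with a sketch like the following, which applies symmetric group-wise quantization to a synthetic weight matrix and reports cosine similarity and mean absolute error against the f32 original. The group size (32), the symmetric scheme, the Gaussian test weights, and all function names are assumptions; this shows how such metrics are computed and does not reproduce the benchmark figures in the table.

```python
import numpy as np


def quantize_int4(w, group_size=32):
    """Symmetric group-wise INT4: one f32 scale per group of weights."""
    flat = w.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                      # avoid dividing by zero
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale                              # int8 holds the 4-bit codes here


def dequantize_int4(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)


def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for one 768 x 2,048 projection matrix.
    weights = (rng.standard_normal((768, 2048)) * 0.02).astype(np.float32)
    codes, scales = quantize_int4(weights)
    restored = dequantize_int4(codes, scales, weights.shape)
    print(f"cosine similarity: {cosine_similarity(weights, restored):.4f}")
    print(f"mean abs error:    {np.abs(weights - restored).mean():.5f}")
```

A production kernel would pack two 4-bit codes per byte; the int8 storage here keeps the sketch short.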

// deployment

Deployment

01 Native ARM with NEON SIMD acceleration
02 Native x86 with AVX2 + FMA acceleration
03 WebAssembly for browser-based generation
04 Android via JNI bindings
05 iOS via FFI bindings
