[360Labs.ai]
// slm360-base

SLM360 Base

Released

On-Device Decoder for Text Generation

SLM360 Base is a 125M-parameter causal decoder for autoregressive text generation. With INT4 quantization it generates text at under 50ms per token on mobile hardware, using a pre-allocated KV cache for O(1) per-token generation. A weight-tied LM head cuts the parameter count by 20%, Grouped Query Attention shrinks the KV cache 3x versus multi-head attention, and LoRA adapters enable per-user personalization in under 500KB per adapter.
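The O(1)-per-token claim comes from caching each past token's key/value projections so a decode step only computes attention for the new token. A minimal NumPy sketch of the idea (toy dimensions, single head, no RoPE or normalization; all names here are illustrative, not the SLM360 API):

```python
import numpy as np

def toy_decode_step(x, k_cache, v_cache, Wq, Wk, Wv):
    """One decode step: append this token's K/V to the cache, then
    attend over the full cache -- past tokens are never recomputed."""
    q = x @ Wq                       # query for the new token only
    k_cache.append(x @ Wk)           # cache grows by one entry per token
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)            # (t, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                     # attention output for the new token

rng = np.random.default_rng(0)
d = 8                                # toy width; the real model uses 768
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []
for _ in range(4):                   # 4 sequential decode steps
    out = toy_decode_step(rng.standard_normal(d), k_cache, v_cache, Wq, Wk, Wv)
assert len(k_cache) == 4 and out.shape == (d,)
```

Per step this is one matrix-vector product per projection plus attention over the cache, rather than re-running the whole sequence.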

Specifications

  Parameters: 125M
  Embedding Dim: 768
  Layers: 16
  Attention Heads (Q/KV): 12 / 4 (GQA)
  Vocabulary: 32,000
  Max Sequence Length: 2,048
  Size (f32): 501MB
  Size (INT4): 63MB
  Latency/Token (INT4, ARM): <50ms
  Latency/Token (INT4, x86): <20ms
  Compression Ratio: 8x
  KV Cache Savings: 3x vs MHA
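The headline figures can be reproduced from the listed dimensions. A back-of-envelope check, assuming bias-free projections, head dim 768/12 = 64, and standard GQA/SwiGLU shapes (a sketch, not the shipped layout):

```python
# Reproduce the parameter count and sizes from the spec table above.
d_model, d_ff, n_layers = 768, 2048, 16
n_q_heads, n_kv_heads, head_dim = 12, 4, 64
vocab = 32_000

embedding = vocab * d_model                      # weight-tied with the LM head
attn = (d_model * n_q_heads * head_dim           # Q projection
        + 2 * d_model * n_kv_heads * head_dim    # K and V (one KV head per 3 Q heads)
        + n_q_heads * head_dim * d_model)        # output projection
ffn = 3 * d_model * d_ff                         # SwiGLU: gate, up, down
total = embedding + n_layers * (attn + ffn)      # RMSNorm params omitted (~25K)

print(f"{total / 1e6:.1f}M parameters")          # ~125.2M
print(f"f32: {total * 4 / 1e6:.0f} MB, INT4 at 8x: ~{total * 4 / 8 / 1e6:.0f} MB")
```

This lands on ~125.2M parameters, 501MB at f32, and ~63MB at INT4, matching the table.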

Architecture

1 Token IDs > Embedding (32,000 x 768)
2 + RoPE Position Encoding
3 16 x DecoderBlock, attention sublayer: RMSNorm > GQA (12 heads, 4 KV, causal mask) > + Residual
4 16 x DecoderBlock, FFN sublayer: RMSNorm > SwiGLU (768 > 2,048 > 768) > + Residual
5 RMSNorm > LM Head (weight-tied with embedding, > 32,000)
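The two sublayers of each decoder block can be sketched in NumPy. This is a scaled-down illustration (d=24, 6 Q heads / 2 KV heads instead of 768 and 12/4; RoPE and the KV cache omitted for brevity) showing the pre-norm residual structure and how GQA shares each KV head across a group of query heads:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * g

def gqa_attention(x, Wq, Wk, Wv, Wo, n_q, n_kv):
    """Causal grouped-query attention: each KV head serves n_q/n_kv query heads."""
    t, d = x.shape
    hd = d // n_q
    q = (x @ Wq).reshape(t, n_q, hd)
    k = (x @ Wk).reshape(t, n_kv, hd).repeat(n_q // n_kv, axis=1)  # share KV heads
    v = (x @ Wv).reshape(t, n_kv, hd).repeat(n_q // n_kv, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(hd)
    scores += np.triu(np.full((t, t), -1e9), 1)          # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", w, v).reshape(t, d) @ Wo

def swiglu(x, Wg, Wu, Wd):
    gate = x @ Wg
    return (gate / (1 + np.exp(-gate)) * (x @ Wu)) @ Wd  # SiLU(gate) * up, then down

def decoder_block(x, p):
    # Pre-norm attention sublayer + residual, then pre-norm FFN sublayer + residual
    h = x + gqa_attention(rms_norm(x, p["g1"]), p["Wq"], p["Wk"], p["Wv"], p["Wo"], 6, 2)
    return h + swiglu(rms_norm(h, p["g2"]), p["Wg"], p["Wu"], p["Wd"])

rng = np.random.default_rng(0)
d, d_ff, t = 24, 64, 5
kv_dim = 2 * (d // 6)                                    # n_kv * head_dim = 8
p = {"g1": np.ones(d), "g2": np.ones(d),
     "Wq": rng.standard_normal((d, d)) * 0.1, "Wo": rng.standard_normal((d, d)) * 0.1,
     "Wk": rng.standard_normal((d, kv_dim)) * 0.1, "Wv": rng.standard_normal((d, kv_dim)) * 0.1,
     "Wg": rng.standard_normal((d, d_ff)) * 0.1, "Wu": rng.standard_normal((d, d_ff)) * 0.1,
     "Wd": rng.standard_normal((d_ff, d)) * 0.1}
out = decoder_block(rng.standard_normal((t, d)), p)
assert out.shape == (t, d)
```

Because K and V are projected to only 4 heads at the real model's 12/4 ratio, the cache stores a third of the entries of full multi-head attention, which is where the 3x KV cache reduction comes from.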

Features

  • Causal attention mask for autoregressive generation
  • Weight-tied LM head saves 24.6M parameters (20% reduction)
  • Pre-allocated KV cache for O(1) per-token generation
  • 12:4 Grouped Query Attention for 3x KV cache reduction (~38MB saved at 2,048 seq length)
  • LoRA adapters for per-user personalization (<500KB per adapter)
  • Sampling: greedy, top-k, top-p, temperature, repetition penalty
  • Two-phase generation: parallel prefill + sequential decode
  • Binary serialization with 64-byte alignment for SIMD and cache-line optimization
  • Memory-mapped loading for zero-copy weight access on native platforms
  • INT4 group-wise quantization with >0.99 cosine similarity to f32
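The listed sampling strategies compose in a standard pipeline: temperature rescales the logits, top-k prunes to the k largest, and top-p keeps the smallest nucleus of tokens whose mass reaches p. A hedged sketch of that composition (illustrative only, not the SLM360 API; repetition penalty omitted):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature + top-k + top-p (nucleus) sampling over next-token logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    if top_k > 0:                             # keep only the k largest logits
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:                           # smallest set with mass >= top_p
        order = np.argsort(probs)[::-1]
        cut = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep = order[:cut]
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked / masked.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]
tok = sample(logits, temperature=0.8, top_k=3, top_p=0.9,
             rng=np.random.default_rng(1))
assert 0 <= tok <= 2                          # token 3 is pruned by top-k
```

Greedy decoding is the degenerate case: `argmax` over the logits, equivalent to temperature approaching 0.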

Benchmarks

  Metric                        Score      Comparison
  Prefill (128 tokens, ARM)     ~200ms     -
  Prefill (128 tokens, x86)     ~80ms      -
  INT4 Accuracy Retention       98.5%      -
  Mean Absolute Error (INT4)    <0.01      -
  Energy per Query              0.001 Wh   Cloud LLM: 0.42-29 Wh
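The INT4 fidelity numbers are easy to reproduce for group-wise symmetric quantization: each group of weights shares one f32 scale, and 4-bit codes span [-8, 7]. A sketch under those assumptions (the exact group size and scheme used by SLM360 are not specified here; 32 is a common choice):

```python
import numpy as np

def quantize_int4(w, group=32):
    """Group-wise symmetric INT4: one f32 scale per `group` weights."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map max to code 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)

cos = np.dot(w, w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat))
mae = np.abs(w - w_hat).mean()
print(f"cosine similarity: {cos:.4f}, MAE: {mae:.4f}")
assert cos > 0.99          # consistent with the >0.99 figure in the features list
```

Storage-wise, 4-bit codes plus one f32 scale per 32 weights is 4 + 32/32 bits per weight, which is where the roughly 8x compression over f32 comes from.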

Deployment Targets

  • Native ARM with NEON SIMD acceleration
  • Native x86 with AVX2 + FMA acceleration
  • WebAssembly for browser-based generation
  • Android via JNI bindings
  • iOS via FFI bindings