[360Labs.ai]
{ 360Labs } Model
{ slm360-base } · Released

SLM360 Base

On-Device Decoder for Text Generation

// overview

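SLM360 Base is a 125M-parameter causal decoder for autoregressive text generation. With INT4 quantization it generates text in under 50ms per token on mobile (ARM) hardware. A pre-allocated KV cache means each new token needs only O(1) new key/value work, because keys and values for earlier positions are reused instead of recomputed. The weight-tied LM head saves 24.6M parameters (about 20%). Grouped Query Attention with 12 query heads and 4 KV heads shrinks the KV cache 3x relative to multi-head attention. LoRA adapters enable per-user personalization in under 500KB per adapter.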
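The headline numbers can be cross-checked with a little arithmetic. The sketch below rebuilds the parameter count from the dimensions listed in the specifications. It assumes a head size of 768 / 12 = 64, bias-free projections, three SwiGLU matrices, and one weight vector per RMSNorm. None of these layout details are stated on this page, and `count_parameters` is a hypothetical helper, not part of any SLM360 API.

```python
# Dimensions taken from the specifications on this page.
VOCAB, DIM, LAYERS = 32_000, 768, 16
Q_HEADS, KV_HEADS, FFN_HIDDEN = 12, 4, 2_048
HEAD_DIM = DIM // Q_HEADS  # ASSUMPTION: head size is 768 / 12 = 64


def count_parameters(tied: bool = True) -> int:
    """Estimate the parameter count under the assumptions stated above."""
    embedding = VOCAB * DIM
    # Q and output projections are DIM x DIM; K and V project down to
    # KV_HEADS * HEAD_DIM columns (ASSUMPTION: no bias terms).
    attention = 2 * DIM * DIM + 2 * DIM * KV_HEADS * HEAD_DIM
    # SwiGLU with gate, up and down matrices (ASSUMPTION: three matrices, no bias).
    ffn = 3 * DIM * FFN_HIDDEN
    # Two RMSNorm weight vectors per block plus the final norm.
    norms = (2 * LAYERS + 1) * DIM
    # A tied LM head reuses the embedding matrix, so it adds nothing.
    lm_head = 0 if tied else VOCAB * DIM
    return embedding + LAYERS * (attention + ffn) + norms + lm_head


tied = count_parameters(tied=True)
untied = count_parameters(tied=False)
saved = untied - tied

print(f"parameters (tied head): {tied:,}")
print(f"saved by weight tying:  {saved:,} ({saved / tied:.1%} of the tied total)")
print(f"f32 size:  {tied * 4 / 1e6:.0f} MB")
# 4 bits per weight; per-group scale overhead is ignored here.
print(f"INT4 size: {tied * 0.5 / 1e6:.0f} MB")
```

Under these assumptions the estimate comes to 125,264,640 parameters, which rounds to the quoted 125M, 501MB in f32, and 63MB at 4 bits per weight. The 24,576,000 parameters saved by tying are about 19.6% of the tied total.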

// specs

Specifications

Parameters: 125M
Embedding Dim: 768
Layers: 16
Attention Heads (Q/KV): 12 / 4 (GQA)
Vocabulary: 32,000
Max Sequence Length: 2,048
Size (f32): 501MB
Size (INT4): 63MB
Latency/Token (INT4, ARM): <50ms
Latency/Token (INT4, x86): <20ms
Compression Ratio: 8x
KV Cache Savings: 3x vs MHA
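The 3x KV cache figure follows directly from the 12:4 head ratio, and the absolute saving depends on how cache entries are stored. The sketch below counts cached elements at the full 2,048-token context, assuming a head size of 64. The cache storage format is not stated on this page, so the byte widths in `ASSUMED_BYTES_PER_ELEMENT` are purely illustrative guesses. The last entry is simply one layout that happens to land near the ~38MB quoted in the feature list, not a confirmed format.

```python
LAYERS, SEQ_LEN = 16, 2_048
Q_HEADS, KV_HEADS = 12, 4
HEAD_DIM = 768 // Q_HEADS  # ASSUMPTION: head size of 64


def kv_cache_elements(heads: int) -> int:
    """Keys and values (factor 2) for every layer, position and head."""
    return 2 * LAYERS * SEQ_LEN * heads * HEAD_DIM


mha_elements = kv_cache_elements(Q_HEADS)   # one KV head per query head
gqa_elements = kv_cache_elements(KV_HEADS)  # 4 shared KV heads
ratio = mha_elements // gqa_elements
saved_elements = mha_elements - gqa_elements

# ASSUMPTION: none of these storage formats is confirmed by the source.
ASSUMED_BYTES_PER_ELEMENT = {
    "f32": 4.0,
    "f16": 2.0,
    "int8 + one f32 scale per 32 values": 1.0 + 4.0 / 32,
}

print(f"cache reduction: {ratio}x ({mha_elements:,} -> {gqa_elements:,} elements)")
for name, width in ASSUMED_BYTES_PER_ELEMENT.items():
    print(f"saved with {name}: {saved_elements * width / 1e6:.1f} MB")
```

The ratio is independent of sequence length and precision. Only the absolute megabyte saving changes with the assumed element width.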

// architecture

Architecture

1 Token IDs > Embedding (32,000 x 768)
2 + RoPE Position Encoding
3 16 x DecoderBlock, attention sublayer: RMSNorm > GQA (12 query heads, 4 KV heads, causal mask) > + Residual
4 16 x DecoderBlock, feed-forward sublayer: RMSNorm > SwiGLU (768 > 2,048 > 768) > + Residual
5 RMSNorm > LM Head (weight-tied with embedding, 768 > 32,000)
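To make the grouped-query step concrete, here is a minimal pure-Python sketch of attention for a single new position against cached keys and values. The function name, the nested-list data layout, and the omission of RoPE and the surrounding projections are all simplifications for illustration, not the SLM360 implementation. The causal mask is implicit because the cache only ever holds positions up to the current one.

```python
import math


def grouped_query_attention(q, k_cache, v_cache, n_q_heads, n_kv_heads):
    """Attend from ONE new position over the cached positions.

    q:       [n_q_heads][head_dim]       query vectors for the new token
    k_cache: [n_kv_heads][t][head_dim]   keys for positions 0..t-1
    v_cache: [n_kv_heads][t][head_dim]   values for positions 0..t-1
    Returns  [n_q_heads][head_dim].
    """
    group = n_q_heads // n_kv_heads  # 12 // 4 = 3 query heads per KV head
    head_dim = len(q[0])
    outputs = []
    for head in range(n_q_heads):
        kv = head // group  # query heads 0-2 share KV head 0, 3-5 share 1, ...
        # Scaled dot-product scores against every cached key.
        scores = [
            sum(a * b for a, b in zip(q[head], key)) / math.sqrt(head_dim)
            for key in k_cache[kv]
        ]
        # Numerically stable softmax.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the cached values.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, v_cache[kv]))
            for d in range(head_dim)
        ])
    return outputs


# Toy usage: 12 query heads, 4 KV heads, head_dim 2, one cached position.
q = [[1.0, 0.0] for _ in range(12)]
k_cache = [[[float(i), 1.0]] for i in range(4)]
v_cache = [[[float(i), 2.0 * i]] for i in range(4)]
outputs = grouped_query_attention(q, k_cache, v_cache, n_q_heads=12, n_kv_heads=4)
print(outputs[0], outputs[5], outputs[11])
```

With a single cached position the softmax weight is 1, so each query head simply returns the value vector of the KV head it shares, which makes the 3-to-1 grouping easy to see.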
// features

Features

  • 01 Causal attention mask for autoregressive generation
  • 02 Weight-tied LM head saves 24.6M parameters (20% reduction)
  • 03 Pre-allocated KV cache for O(1) per-token generation
  • 04 12:4 Grouped Query Attention for 3x KV cache reduction (~38MB saved at 2,048 seq length)
  • 05 LoRA adapters for per-user personalization (<500KB per adapter)
  • 06 Sampling: greedy, top-k, top-p, temperature, repetition penalty
  • 07 Two-phase generation: parallel prefill + sequential decode
  • 08 Binary serialization with 64-byte alignment for SIMD and cache-line optimization
  • 09 Memory-mapped loading for zero-copy weight access on native platforms
  • 10 INT4 group-wise quantization with >0.99 cosine similarity to f32
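The sampling options in the list above usually compose in a fixed order. The following is a minimal sketch of one common ordering (repetition penalty, temperature, top-k, then top-p), written against the standard library only. `sample_next_token` and its parameters are hypothetical names, and the divide-or-multiply repetition penalty convention is an assumption, since the page does not describe how SLM360 applies it.

```python
import math
import random


def sample_next_token(logits, previous_tokens=(), temperature=1.0,
                      top_k=0, top_p=1.0, repetition_penalty=1.0, rng=random):
    """Pick the next token id from raw logits (illustrative sketch)."""
    logits = list(logits)
    # Repetition penalty (ASSUMED convention): push already-seen tokens down.
    for token in set(previous_tokens):
        if logits[token] > 0:
            logits[token] /= repetition_penalty
        else:
            logits[token] *= repetition_penalty
    # Greedy decoding: temperature 0 means "take the argmax".
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    # Candidate ids sorted from most to least likely.
    order = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the remaining candidates.
    peak = scaled[order[0]]
    probs = [math.exp(scaled[i] - peak) for i in order]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        cumulative, cutoff = 0.0, len(order)
        for count, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = count
                break
        order, probs = order[:cutoff], probs[:cutoff]
    return rng.choices(order, weights=probs, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]
greedy = sample_next_token(toy_logits, temperature=0)
sampled = sample_next_token(toy_logits, temperature=0.8, top_k=3, top_p=0.9,
                            rng=random.Random(0))
print(greedy, sampled)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing sampler settings.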
// benchmarks

Benchmarks

Metric                        Result      Comparison
Prefill (128 tokens, ARM)     ~200ms      -
Prefill (128 tokens, x86)     ~80ms       -
INT4 Accuracy Retention       98.5%       -
Mean Absolute Error (INT4)    <0.01       -
Energy per Query              0.001 Wh    Cloud LLM: 0.42-29 Wh
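The INT4 rows above report fidelity metrics after quantization. As a rough illustration of how such numbers can be measured, the sketch below applies symmetric group-wise 4-bit quantization to synthetic weights and computes cosine similarity and mean absolute error against the originals. The group size of 32, the symmetric [-8, 7] scheme, and the Gaussian toy weights are all assumptions. The page does not describe the actual quantizer, and this toy run does not reproduce the published benchmark.

```python
import math
import random


def quantize_int4(weights, group_size=32):
    """Symmetric group-wise quantization to integer codes in [-8, 7]."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One scale per group maps the largest magnitude to code 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize_int4(groups):
    """Expand (scale, codes) pairs back into floating-point weights."""
    return [scale * code for scale, codes in groups for code in codes]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_absolute_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a weight tensor (ASSUMPTION: Gaussian, sigma 0.02).
rng = random.Random(0)
original = [rng.gauss(0.0, 0.02) for _ in range(4096)]
restored = dequantize_int4(quantize_int4(original))

print(f"cosine similarity: {cosine_similarity(original, restored):.4f}")
print(f"mean absolute error: {mean_absolute_error(original, restored):.5f}")
```

Smaller groups give each scale fewer outliers to cover, which generally improves fidelity at the cost of storing more scales.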
// deployment

Deployment

  • 01 Native ARM with NEON SIMD acceleration
  • 02 Native x86 with AVX2 + FMA acceleration
  • 03 WebAssembly for browser-based generation
  • 04 Android via JNI bindings
  • 05 iOS via FFI bindings
// end of model: SLM360 Base