Product / Technology

Enterprise AI on Your Hardware

Local AI that grows with your organization: from a compact entry-level server to a high-availability cluster. Two AI model tiers, fully on-premises — comparable to commercial cloud AI services, but under your control.

Cluster Architecture

Every Layer Optimized for Its Purpose

The hardware scales with your user count — from a single GPU server (entry level) to a compact DGX Spark cluster to a high-availability rack setup. The table below shows the logical layers; the compute layer grows with the chosen configuration (see below).

Layer	Component	Specification	Role
Compute	2× NVIDIA L40S → 4× DGX Spark	96 GB → 512 GB	LLM Inference
Interconnect	InfiniBand / Load-Balancer	200 Gbps (Cluster / HA)	Node Fabric
Model (Quality)	Qwen3.5-35B-A3B (MoE)	3.3B active / 35B total, FP8	Sonnet-Tier Tasks
Model (Throughput)	Qwen3.5-4B	FP8, Mamba+MoE	Haiku-Tier Tasks
Inference Stack	SGLang / vLLM	CUDA, TRT-LLM, NCCL	Request Routing
API Layer	OpenAI-compatible REST API	HTTPS, mTLS, JWT Auth	Atlas Integration
Application	contboxx Atlas	On-premises installation	Knowledge Management

Hardware Configurations

Three Configurations — Scaled to Your Size

The local AI runs on your own hardware — a one-time purchase, no recurring cloud costs. The right size depends on user count and usage intensity: from a single GPU server for entry level to a high-availability cluster. Hardware is not part of the license and can also be customer-provided.

Entry · Baseline

Compact GPU server

up to ~250 employees

2× NVIDIA L40S 48 GB (96 GB total) — 864 GB/s per card
One model tier per card
2U standard server — no special rack, no water cooling
Incl. next-business-day support, redundancy optional

Cluster

4× NVIDIA DGX Spark

up to ~500 employees

512 GB Unified Memory (4× 128 GB)
200 Gbps InfiniBand RDMA fabric
Higher concurrency & throughput headroom
Desktop form factor, ~1,000 W, air cooling

High Availability · With Redundancy

2× rack servers, redundant

500+ employees

2× redundant GPU servers with load balancer
N+1 fault tolerance, SLA-capable
GPU class scalable: L40S to H100/H200
For business-critical continuous operation

NVIDIA DGX Spark Cluster — die Cluster-Konfiguration von contboxx Vault

Shown: the cluster configuration (4× NVIDIA DGX Spark).

Indicative guidance; final sizing is determined by the load profile. Prices and configuration details are on the pricing page.

Two-Tier Model Architecture

The cluster runs two LLM tiers simultaneously, tuned to the different processing requirements of contboxx Atlas.

Sonnet Tier — Deep Processing

Qwen3.5-35B-A3B

Sonnet-Tier Quality

Mixture-of-experts with just 3.3 billion active of 35 billion parameters, FP8-quantized — runs efficiently on a single GPU. For tasks where quality, nuance, and reasoning depth matter:

Complex RAG queries
Long-form summaries
Cross-document synthesis
Search intent detection
Compliance analysis
Draft generation
Onboarding assistance

Throughput: ~30–75 tokens/s Parameter: 35B (3,3B aktiv) VRAM: ~30 GB (FP8)

Haiku Tier — Fast Processing

Qwen3.5-4B

Haiku-Tier Speed

Compact, FP8-quantized Mamba+MoE model with ample concurrency headroom. For routine operations that require speed over deep reasoning:

Full-text indexing
Embedding generation
Auto-tagging & classification
Short Q&A
Duplicate detection
Automatic summaries

Throughput: ~30–40 tokens/s Success rate: 98.8% VRAM: ~8 GB (FP8)

Performance & Capacity

Enterprise-Grade Throughput

Measured in a multi-week sustained test on NVIDIA DGX Spark (GB10) under real pipeline load:

Model	Tier	Architecture	Decode (Tok/s)	Success rate
Qwen3.5-4B	Speed tier	Mamba+MoE · 4B	27–42	98,8 %
Qwen3.5-35B-A3B	Quality-Tier	MoE · 3,3B aktiv	28–77	95–100 %

Model memory (FP8)

~40 GB

Weights of both model tiers (FP8)

Streaming

Real-time

Progressive output after first token

Speculative Decoding

1,5–2× Speedup

EAGLE3, minimal accuracy loss

Software Stack

Inference SGLang / vLLM — optimized for continuous batching and high throughput, CUDA, TRT-LLM, NCCL

API OpenAI-compatible REST API (POST /v1/chat/completions) — drop-in replacement for existing cloud integrations

RAG Retrieval-Augmented Generation with vector database for semantic search, local embedding generation

Security mTLS, JWT-based authorization, encrypted storage, audit logging, network isolation

Network Fully air-gapped capable — internet only required for initial model download

Operating System NVIDIA DGX OS (Ubuntu-based) with defined security patch cycle

Availability & Reliability

Fault-Tolerant by Design

SGLang Inference Server runs as a systemd service with automatic restart on failure

Graceful Model Failover: On Sonnet-tier error, Atlas falls back to Haiku tier — limited, but functional

DGX Spark nodes run independently; failure of one node degrades the service but does not eliminate it

Optional redundant QM8700 switch for full high availability

NAS backup system secures model weights, configuration, and indices for recovery on node failure

Technical questions? We have answers.

Schedule a technical conversation with our architecture team.

Schedule Technical Call