Product / Technology
Enterprise AI on Your Hardware
Local AI that grows with your organization: from a compact entry-level server to a high-availability cluster. Two AI model tiers, fully on-premises — comparable to commercial cloud AI services, but under your control.
Cluster Architecture
Every Layer Optimized for Its Purpose
The hardware scales with your user count — from a single GPU server (entry level) to a compact DGX Spark cluster to a high-availability rack setup. The table below shows the logical layers; the compute layer grows with the chosen configuration (see below).
| Layer | Component | Specification | Role |
|---|---|---|---|
| Compute | 2× NVIDIA L40S → 4× DGX Spark | 96 GB → 512 GB | LLM Inference |
| Interconnect | InfiniBand / Load-Balancer | 200 Gbps (Cluster / HA) | Node Fabric |
| Model (Quality) | Qwen3.5-35B-A3B (MoE) | 3.3B active / 35B total, FP8 | Sonnet-Tier Tasks |
| Model (Throughput) | Qwen3.5-4B | FP8, Mamba+MoE | Haiku-Tier Tasks |
| Inference Stack | SGLang / vLLM | CUDA, TRT-LLM, NCCL | Request Routing |
| API Layer | OpenAI-compatible REST API | HTTPS, mTLS, JWT Auth | Atlas Integration |
| Application | contboxx Atlas | On-premises installation | Knowledge Management |
Hardware Configurations
Three Configurations — Scaled to Your Size
The local AI runs on your own hardware — a one-time purchase, no recurring cloud costs. The right size depends on user count and usage intensity: from a single GPU server for entry level to a high-availability cluster. Hardware is not part of the license and can also be customer-provided.
Compact GPU server
up to ~250 employees
- 2× NVIDIA L40S 48 GB (96 GB total) — 864 GB/s per card
- One model tier per card
- 2U standard server — no special rack, no water cooling
- Incl. next-business-day support, redundancy optional
4× NVIDIA DGX Spark
up to ~500 employees
- 512 GB Unified Memory (4× 128 GB)
- 200 Gbps InfiniBand RDMA fabric
- Higher concurrency & throughput headroom
- Desktop form factor, ~1,000 W, air cooling
2× rack servers, redundant
500+ employees
- 2× redundant GPU servers with load balancer
- N+1 fault tolerance, SLA-capable
- GPU class scalable: L40S to H100/H200
- For business-critical continuous operation
Shown: the cluster configuration (4× NVIDIA DGX Spark).
Indicative guidance; final sizing is determined by the load profile. Prices and configuration details are on the pricing page.
Two-Tier Model Architecture
The cluster runs two LLM tiers simultaneously, tuned to the different processing requirements of contboxx Atlas.
Qwen3.5-35B-A3B
Mixture-of-experts with just 3.3 billion active of 35 billion parameters, FP8-quantized — runs efficiently on a single GPU. For tasks where quality, nuance, and reasoning depth matter:
- Complex RAG queries
- Long-form summaries
- Cross-document synthesis
- Search intent detection
- Compliance analysis
- Draft generation
- Onboarding assistance
Qwen3.5-4B
Compact, FP8-quantized Mamba+MoE model with ample concurrency headroom. For routine operations that require speed over deep reasoning:
- Full-text indexing
- Embedding generation
- Auto-tagging & classification
- Short Q&A
- Duplicate detection
- Automatic summaries
Performance & Capacity
Enterprise-Grade Throughput
Measured in a multi-week sustained test on NVIDIA DGX Spark (GB10) under real pipeline load:
| Model | Tier | Architecture | Decode (Tok/s) | Success rate |
|---|---|---|---|---|
| Qwen3.5-4B | Speed tier | Mamba+MoE · 4B | 27–42 | 98,8 % |
| Qwen3.5-35B-A3B | Quality-Tier | MoE · 3,3B aktiv | 28–77 | 95–100 % |
Model memory (FP8)
~40 GB
Weights of both model tiers (FP8)
Streaming
Real-time
Progressive output after first token
Speculative Decoding
1,5–2× Speedup
EAGLE3, minimal accuracy loss
Software Stack
Availability & Reliability
Fault-Tolerant by Design
Technical questions? We have answers.
Schedule a technical conversation with our architecture team.
Schedule Technical Call