Product / Technology

Enterprise AI on Your Hardware

Local AI that grows with your organization: from a compact entry-level server to a high-availability cluster. Two AI model tiers, fully on-premises — comparable to commercial cloud AI services, but under your control.

Cluster Architecture

Every Layer Optimized for Its Purpose

The hardware scales with your user count — from a single GPU server (entry level) to a compact DGX Spark cluster to a high-availability rack setup. The table below shows the logical layers; the compute layer grows with the chosen configuration (see below).

Layer Component Specification Role
Compute 2× NVIDIA L40S → 4× DGX Spark 96 GB → 512 GB LLM Inference
Interconnect InfiniBand / Load-Balancer 200 Gbps (Cluster / HA) Node Fabric
Model (Quality) Qwen3.5-35B-A3B (MoE) 3.3B active / 35B total, FP8 Sonnet-Tier Tasks
Model (Throughput) Qwen3.5-4B FP8, Mamba+MoE Haiku-Tier Tasks
Inference Stack SGLang / vLLM CUDA, TRT-LLM, NCCL Request Routing
API Layer OpenAI-compatible REST API HTTPS, mTLS, JWT Auth Atlas Integration
Application contboxx Atlas On-premises installation Knowledge Management

Hardware Configurations

Three Configurations — Scaled to Your Size

The local AI runs on your own hardware — a one-time purchase, no recurring cloud costs. The right size depends on user count and usage intensity: from a single GPU server for entry level to a high-availability cluster. Hardware is not part of the license and can also be customer-provided.

Entry · Baseline

Compact GPU server

up to ~250 employees

  • 2× NVIDIA L40S 48 GB (96 GB total) — 864 GB/s per card
  • One model tier per card
  • 2U standard server — no special rack, no water cooling
  • Incl. next-business-day support, redundancy optional
Cluster

4× NVIDIA DGX Spark

up to ~500 employees

  • 512 GB Unified Memory (4× 128 GB)
  • 200 Gbps InfiniBand RDMA fabric
  • Higher concurrency & throughput headroom
  • Desktop form factor, ~1,000 W, air cooling
High Availability · With Redundancy

2× rack servers, redundant

500+ employees

  • 2× redundant GPU servers with load balancer
  • N+1 fault tolerance, SLA-capable
  • GPU class scalable: L40S to H100/H200
  • For business-critical continuous operation
NVIDIA DGX Spark Cluster — die Cluster-Konfiguration von contboxx Vault

Shown: the cluster configuration (4× NVIDIA DGX Spark).

Indicative guidance; final sizing is determined by the load profile. Prices and configuration details are on the pricing page.

Two-Tier Model Architecture

The cluster runs two LLM tiers simultaneously, tuned to the different processing requirements of contboxx Atlas.

Sonnet Tier — Deep Processing

Qwen3.5-35B-A3B

Mixture-of-experts with just 3.3 billion active of 35 billion parameters, FP8-quantized — runs efficiently on a single GPU. For tasks where quality, nuance, and reasoning depth matter:

  • Complex RAG queries
  • Long-form summaries
  • Cross-document synthesis
  • Search intent detection
  • Compliance analysis
  • Draft generation
  • Onboarding assistance
Throughput: ~30–75 tokens/s Parameter: 35B (3,3B aktiv) VRAM: ~30 GB (FP8)
Haiku Tier — Fast Processing

Qwen3.5-4B

Compact, FP8-quantized Mamba+MoE model with ample concurrency headroom. For routine operations that require speed over deep reasoning:

  • Full-text indexing
  • Embedding generation
  • Auto-tagging & classification
  • Short Q&A
  • Duplicate detection
  • Automatic summaries
Throughput: ~30–40 tokens/s Success rate: 98.8% VRAM: ~8 GB (FP8)

Performance & Capacity

Enterprise-Grade Throughput

Measured in a multi-week sustained test on NVIDIA DGX Spark (GB10) under real pipeline load:

Model Tier Architecture Decode (Tok/s) Success rate
Qwen3.5-4B Speed tier Mamba+MoE · 4B 27–42 98,8 %
Qwen3.5-35B-A3B Quality-Tier MoE · 3,3B aktiv 28–77 95–100 %

Model memory (FP8)

~40 GB

Weights of both model tiers (FP8)

Streaming

Real-time

Progressive output after first token

Speculative Decoding

1,5–2× Speedup

EAGLE3, minimal accuracy loss

Software Stack

Inference SGLang / vLLM — optimized for continuous batching and high throughput, CUDA, TRT-LLM, NCCL
API OpenAI-compatible REST API (POST /v1/chat/completions) — drop-in replacement for existing cloud integrations
RAG Retrieval-Augmented Generation with vector database for semantic search, local embedding generation
Security mTLS, JWT-based authorization, encrypted storage, audit logging, network isolation
Network Fully air-gapped capable — internet only required for initial model download
Operating System NVIDIA DGX OS (Ubuntu-based) with defined security patch cycle

Availability & Reliability

Fault-Tolerant by Design

SGLang Inference Server runs as a systemd service with automatic restart on failure
Graceful Model Failover: On Sonnet-tier error, Atlas falls back to Haiku tier — limited, but functional
DGX Spark nodes run independently; failure of one node degrades the service but does not eliminate it
Optional redundant QM8700 switch for full high availability
NAS backup system secures model weights, configuration, and indices for recovery on node failure

Technical questions? We have answers.

Schedule a technical conversation with our architecture team.

Schedule Technical Call