TonoFabric™ — Enterprise AI Orchestration Platform | Tonomia
TonoFabric™ Platform Architecture

TonoFabric™, The Brain Behind
AI Factories

TonoFabric™ is the enterprise orchestration platform that transforms distributed TonoForge™ hardware into a unified, intelligent AI cloud. Ten microservices, six global regions, one seamless API.

[Diagram: TonoFabric™ microservices. API Gateway · Auth :8001 · Router :8004 · Session :8005 · Inference :8007 · Model Registry · Cluster Manager · Storage :8006 · Telemetry :8008 · Usage :8009. 10 services · 6 regions · 200 clusters · dual transport.]
10
Microservices
6
Global Regions
200
Cluster Nodes
gRPC+HTTP
Dual Transport
OTel
Full Observability
Platform Architecture

Four Layers, One API

TonoFabric™ is structured as a layered microservices platform. Every request traverses the same deterministic path — from authentication through intelligent routing to distributed inference — with full telemetry at every hop.

[Diagram: TonoFabric™ request lifecycle across four layers. Edge: API Gateway :8000 (JWT, rate limit), Auth :8001 (JWT sign), Usage :8009 (quota, billing). Control: Router :8004 (orchestration), Model Registry :8002 (10B–1T params), Cluster Manager :8003 (200 clusters). Compute: Inference Proxy :8007 (gRPC :9000), Session :8005 (TTL, events), Storage :8006 (JSON). Observability: Telemetry :8008 with OpenTelemetry, Prometheus, and Grafana, tracing every service. Below: distributed GPU clusters in EU-WEST/EU-CENTRAL (~66 clusters, NVIDIA + AMD), US-EAST/US-WEST (~66 clusters, NVIDIA + AMD), AP-S/AP-NE (~68 clusters, MI355X/GB300), plus the model catalogue from m-10b-general (8K ctx) to m-1000b-research (128K ctx).]
Request Lifecycle

From Prompt to Response in 7 Hops

Every API call follows a deterministic path through the platform. Authentication, quota enforcement, model selection, cluster allocation, and inference happen in a single round-trip.

01
Authenticate
JWT token validated via Auth Service
02
Rate Check
Token-bucket limiter per user at Gateway
03
Quota Gate
Usage Service enforces daily quota & plan
04
Open Session
Session Service tracks conversation state
05
Route
Model recommended → cluster allocated
06
Infer
gRPC/HTTP proxy to GPU cluster
07
Record
Usage metered, session updated, response sent
POST /v1/query
// Single-call API — TonoFabric™ handles everything
{
  "prompt": "Analyse this turbine sensor data for anomalies",
  "purpose": "general",
  "region_pref": "eu-west",
  "session_id": null // auto-created if omitted
}

// Response
{
  "session_id": "a3f8…",
  "model_id": "m-70b-general",
  "answer": "Based on the sensor readings, I detect 3 anomalies…",
  "latency_ms": 142
}
Microservices

10 Services, Zero Single Points of Failure

API Gateway
:8000 · FastAPI
Unified entry point for all client requests. Validates JWT tokens, enforces per-user token-bucket rate limiting, checks quotas, orchestrates the full request lifecycle, and records usage after each call. Supports both gRPC and HTTP inference backends.
JWT Auth · Rate Limiting · gRPC Client · OTel Tracing
Auth Service
:8001 · SQLAlchemy + Alembic
Handles user signup, login, and JWT issuance. Password hashing with SHA-256 salting. Token validation endpoint consumed by the Gateway on every request. Database-backed user store with Alembic migrations for schema evolution.
JWT HS256 · SQLite/Postgres · Alembic Migrations
Model Registry
:8002 · In-Memory Catalogue
Central catalogue of all available AI models — from 10B general-purpose to 1T research-class. Intelligent recommendation engine matches prompt length, purpose (general/coder/research), latency SLA, and hardware affinity (NVIDIA/AMD/any).
10B–1T Models · Auto-Recommend · HW Affinity · Context Sizing
Cluster Manager
:8003 · 200 Clusters
Manages a fleet of 200 GPU clusters across 6 regions and dual data centres (dc-a / dc-b). Three-tier allocation with failover: exact region+DC match → same region fallback → global fallback. Health-aware — only routes to healthy racks.
6 Regions · Dual DC · Multi-Tier Failover · NVIDIA/AMD
Router Service
:8004 · Smart Orchestration
The brain of TonoFabric™. Receives a prompt + purpose, calls Model Registry for the optimal model, then Cluster Manager for the best-fit GPU cluster. Returns a complete routing decision: model_id, cluster_id, rack_id, and endpoint URL.
Model Selection · Cluster Allocation · Region Preference
Session Service
:8005 · SQLAlchemy + APScheduler
Manages conversational state with full event sourcing. Opens sessions, appends user/assistant messages with timestamps, enforces TTL (default 4 hours), and ships expired or closed sessions to central Storage for long-term retention.
Event Sourcing · TTL Expiry · Auto-Archival · APScheduler
Storage Service
:8006 · File System / S3
Persistent JSON store for completed sessions. Receives session + events payload and writes to mounted volume or cloud object store. Each session is stored as an individual JSON file, enabling simple backup and compliance export.
JSON Persistence · Volume Mount · Session Archive
Inference Proxy
:8007 HTTP · :9000 gRPC
Dual-transport inference endpoint. Exposes both a REST /infer endpoint and a gRPC InferenceServicer (compiled from proto/inference.proto). The Gateway auto-detects the transport scheme (grpc:// vs http://) and routes accordingly.
gRPC + HTTP · Proto Compile · Dual Transport · GPU Backend
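The Gateway's transport auto-detection described above keys on the endpoint's URL scheme; a minimal sketch of that dispatch, where the function name and the error handling are assumptions:

```python
# Route an inference call by scheme: grpc:// endpoints go to the gRPC
# client, http(s):// endpoints to the REST /infer endpoint.
from urllib.parse import urlparse

def pick_transport(endpoint: str) -> str:
    """Return 'grpc' for grpc:// endpoints, 'http' for http(s):// endpoints."""
    scheme = urlparse(endpoint).scheme
    if scheme == "grpc":
        return "grpc"
    if scheme in ("http", "https"):
        return "http"
    raise ValueError(f"unsupported transport scheme: {scheme!r}")

print(pick_transport("grpc://inference_proxy:9000"))  # → grpc
print(pick_transport("http://inference_proxy:8007"))  # → http
```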
Telemetry Service
:8008 · Prometheus-compatible
Collects metrics from all services and exposes them in Prometheus exposition format. Ingests structured telemetry items (name, value, labels, timestamp) and feeds the Prometheus → Grafana observability pipeline.
Prometheus · OTel Collector · Grafana · Metrics API
Usage Service
:8009 · Billing Engine
Tracks per-user daily quotas, per-model pricing (price_per_request, price_per_1k_tokens), and billing summaries. Enforces quota gates before inference and increments counters after each request. Supports tiered plan-based rate limits.
Daily Quota · Per-Model Pricing · Billing Summary · Plan Tiers
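The pricing model above implies a flat per-request fee plus a token-metered component; a minimal sketch of that arithmetic, where the helper and the example prices are illustrative (only the field names price_per_request and price_per_1k_tokens come from the description):

```python
# Per-request cost = flat fee + tokens scaled by the per-1K-token rate.
def request_cost(price_per_request: float, price_per_1k_tokens: float,
                 tokens: int) -> float:
    """Cost of one inference call under per-model pricing."""
    return price_per_request + price_per_1k_tokens * (tokens / 1000)

# e.g. a 2,500-token call on a model priced at $0.002/request + $0.010/1K tokens
print(round(request_cost(0.002, 0.010, 2500), 6))  # → 0.027
```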
Global Infrastructure

Six Regions, Dual Data Centres

The Cluster Manager seeds 200 GPU clusters across six regions, each with dual data centres for redundancy. The three-tier allocation algorithm guarantees a healthy endpoint for every request — with graceful failover across DCs and regions.

EU-WEST
Europe West
MAIN STORAGE · dc-a
EU-CENTRAL
Europe Central
BACK-UP STORAGE · dc-b
US-EAST
North America East
MAIN STORAGE · dc-a
US-WEST
North America West
BACK-UP STORAGE · dc-b
AP-SOUTH
Asia Pacific South
MAIN STORAGE · dc-a
AP-NORTHEAST
Asia Pacific Northeast
BACK-UP STORAGE · dc-b

Three-Tier Failover

TIER 1
Exact Match
Requested region + preferred DC + vendor match
TIER 2
Region Fallback
Same region, alternate DC, vendor match
TIER 3
Global Fallback
Any healthy cluster in any region
Model Catalogue

Six Model Classes, One API

Model ID · Parameters · Family · Hardware · Max Context · Throughput Hint
m-10b-general · 10 B · General · Any (NVIDIA / AMD) · 8,192 tokens · 2,000 rps
m-34b-general · 34 B · General · Any (NVIDIA / AMD) · 16,384 tokens · 1,200 rps
m-70b-general · 70 B · General · NVIDIA preferred · 32,768 tokens · 600 rps
m-120b-coder · 120 B · Coder · NVIDIA preferred · 32,768 tokens · 400 rps
m-250b-general · 250 B · General · NVIDIA preferred · 65,536 tokens · 200 rps
m-1000b-research · 1 T · Research · NVIDIA required · 131,072 tokens · 50 rps

The recommendation engine matches prompt token count, purpose, and latency SLA to the optimal model. Hardware affinity ensures large models run on NVIDIA clusters with sufficient HBM3e capacity.

Observability

Full-Stack Visibility

Every service is instrumented with OpenTelemetry. Traces flow through the OTel Collector to Prometheus for metrics and Grafana for visualisation. Every hop is observable — from API Gateway to GPU rack.

OpenTelemetry SDK
Distributed tracing across all 10 services with BatchSpanProcessor and OTLP HTTP export
OTel Collector
Central telemetry hub receiving traces on :4317 (gRPC) and :4318 (HTTP)
Prometheus
Time-series metrics store scraping /metrics from Telemetry Service on :9090
Grafana
Real-time dashboards on :3000 with pre-provisioned Prometheus datasource
.env
# Enable distributed tracing
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

# Grafana access
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASS=[secure-password]
Security

Defence in Depth

JWT Authentication
Every API call requires a Bearer JWT token (HS256). The Auth Service signs tokens on login with a configurable secret and 1-hour expiry. The Gateway validates every token before processing.
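The HS256 mechanics described above can be made visible with the standard library alone; a minimal sketch, noting that the Auth Service itself uses PyJWT and that the secret and claims here are illustrative:

```python
# HS256 JWT signing: base64url(header).base64url(payload).base64url(HMAC-SHA256).
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims: dict, secret: str, ttl_s: int = 3600) -> str:
    """Sign a Bearer token with a 1-hour expiry, as the Auth Service does."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({**claims, "exp": int(time.time()) + ttl_s}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

token = sign_jwt({"sub": "user-42"}, secret="dev-only-secret")
print(token.count("."))  # → 2  (header.payload.signature)
```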
Internal Token Guard
Service-to-service calls are protected by shared internal tokens. Every inter-service request carries an x-internal-token header, verified with a constant-time (HMAC) comparison to prevent timing attacks.
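The constant-time check is exactly what Python's hmac module provides; a minimal sketch, where the header name comes from the description and the verification function itself is illustrative:

```python
# Verify the x-internal-token header without leaking timing information.
import hmac

EXPECTED_TOKEN = "internal-secret"  # in production, injected via secrets

def verify_internal(headers: dict) -> bool:
    """Compare the supplied token in constant time to resist timing attacks."""
    supplied = headers.get("x-internal-token", "")
    return hmac.compare_digest(supplied, EXPECTED_TOKEN)

print(verify_internal({"x-internal-token": "internal-secret"}))  # → True
print(verify_internal({"x-internal-token": "guess"}))            # → False
```

hmac.compare_digest examines every byte regardless of where the first mismatch occurs, so response time does not reveal how much of a guessed token was correct.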
Secret Management
Production secrets are managed via Sealed Secrets (kubeseal for in-cluster encryption) or External Secrets Operator for syncing from AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault.
Rate Limiting & Quotas
Token-bucket rate limiting (configurable RPS) at the Gateway level. Daily quotas enforced by the Usage Service. Plan-based tiers dynamically adjust rate limits per user. Excess requests receive HTTP 429 or 402.
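A token bucket of the kind the Gateway applies per user can be sketched in a few lines; the capacity and refill rate here are illustrative, since the platform's values are configurable:

```python
# Token bucket: refill continuously at `rate` tokens/sec up to `capacity`,
# spend one token per request, reject when empty (mapped to HTTP 429).
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity   # tokens/sec, max burst
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller maps this to HTTP 429

bucket = TokenBucket(rate=5, capacity=2)  # 5 rps sustained, burst of 2
print([bucket.allow() for _ in range(3)])  # → [True, True, False]
```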
Deployment

From Docker Compose to Kubernetes

Development Stack

Single-command local development with Docker Compose. All 10 services, plus the observability stack (OTel Collector, Prometheus, Grafana), spin up with docker compose up.

docker-compose.yml
services:
  api_gateway:       # :8000
  auth_service:      # :8001
  model_registry:    # :8002
  cluster_manager:   # :8003
  router_service:    # :8004
  session_service:   # :8005
  storage_service:   # :8006
  inference_proxy:   # :8007 + gRPC :9000
  telemetry_service: # :8008
  usage_service:     # :8009
  otel-collector:    # :4317/:4318
  prometheus:        # :9090
  grafana:           # :3000

Production Stack

Kubernetes deployment with Kustomize overlays or Helm umbrella chart. Horizontal Pod Autoscalers for central and edge clusters. CI/CD via GitHub Actions — build, push to GHCR, deploy.

Kubernetes Features
HPAs — Central + edge autoscaling based on CPU/memory
Sealed Secrets — Encrypt secrets with kubeseal, commit safely
External Secrets — Sync from AWS/GCP/Azure secret stores
Kustomize — Environment overlays (dev, staging, prod)
Helm Chart — Umbrella chart for full platform deployment
GitHub Actions — Build → push GHCR → deploy overlay/Helm
Tech Stack

Built for Scale

Layer · Technology · Purpose
API Framework · FastAPI (Python) · Async HTTP services with Pydantic validation
RPC Transport · gRPC + protobuf · High-performance inference calls (:9000)
HTTP Client · httpx (async) · Inter-service communication
Auth · PyJWT + HS256 · Token signing & verification
Database · SQLAlchemy (async) + Alembic · User store, session store, schema migrations
Scheduling · APScheduler · Session TTL expiry & archival
Tracing · OpenTelemetry SDK · Distributed tracing across all services
Metrics · Prometheus + Grafana · Time-series metrics & dashboards
Orchestration · Docker Compose / Kubernetes · Dev & production deployment
CI/CD · GitHub Actions · Build, push GHCR, deploy Kustomize/Helm
Secrets · Sealed Secrets / External Secrets · Encrypted in-cluster or cloud-synced secrets
Get Started

Deploy TonoFabric™ in your infrastructure

One API to orchestrate AI across every TonoForge™ node. Contact us to schedule an architecture review.