Operator's Manual · Architecture · Procedures

A virtual Kubernetes engineer you talk to.

One engineer, three faculties on three ports — it watches and converses (9998), acts and stays awake (9999 · Telegram), and trains its own domain models on what it sees (9996). This manual is the complete operating guide: every surface, every procedure, with screenshots and worked use cases.

Explore DKube SRE
Surfaces9998 · 9999 · 9996
PillarsSRE · Factory · Hermes-TG
Revision2026-06-11 · v2
Learning loopClosed ✓
3
Faculties, one engineer
2
Self-trained models served
1
Stable endpoint, never re-URLs
Overview

One engineer, three faculties.

It is not three tools. It is one virtual Kubernetes engineer, split across three ports only to keep each surface uncluttered. You converse with it; it acts on its own; and it gets measurably better at your cluster over time.

Fig 1 — three faculties architecture
Fig 1 · Three ports, one engineer. Each faculty is independent but reads as one operator.

The differentiator is not the chat — plenty of tools chat about Kubernetes. It is that this engineer observes (9998), acts (9999), and retrains on its own incident history (9996), becoming a specialist in your cluster rather than a generic model. That closed loop is covered in The closed learning loop.

How to read this manual. Each pillar has its own section with annotated screenshots; the Procedures section is task-oriented (“how do I…”), and Reference holds the ports, env vars, API index, and deployment matrix.
SRE Console · Port 9998

The console & sidebar.

The conversational front. You ask in plain English; every turn is grounded in a live cluster snapshot. The model answering is selectable — including the Factory's own finetunes.

SRE Console
The SRE Console. Note the 🏭 Model Factory ↗ sidebar tab, the k8s-sre model in the top-right dropdown, and the K8s slash-command cards.

§Chat — the cluster-aware agent

The Chat pane injects a live snapshot (nodes ready, pod phases, restart counts, pending pods, top-memory pods) ahead of your question, so answers reason from real signals. Type plain English, or click a slash-command card.

Use case“Is the cluster healthy and what should I watch?” → the agent reads the snapshot and replies with a verdict, the 3–5 notable items, and one action — the same shape as the daily audit.

§The panes

SRE Console sidebar
PaneWhat it does
ChatCluster-aware conversational agent (predictive — reasons from live signals).
🏭 Model Factory ↗Opens the Factory (9996) in a new tab — the SRE's model-support port.
ResourcesLive K8s / PM2 / Helm control surface.
Cloud accountsMulti-cloud credential & resource discovery.
RunbooksExecutable per-resource remediation (RunWhen SLX).
TopologyCluster graph from discovery.
TelemetryLogs + compute metrics via ClickHouse / OTel.
HistorySQLite audit store — every state-changing op (the loop's raw material).
Analytics · Reports · AlertsHealth snapshots & improvement proposals · printable incidents · rule-based watchdog.
SettingsProvider config, playbook roles, theme (defaults to light).

§The model dropdown & per-model personas

The top-right dropdown is provider-aware: it probes Ollama, LM Studio, Claude Max, LiteLLM — and the Model Factory (9996/v1/models). Every finetune the Factory serves shows up as a distinct entry under the <domain>-sre naming convention.

Selecting a Factory model swaps both the system prompt and the chat cards:

ModelPersona & cards when selected
k8s-sreCluster-SRE prompt; cards: assess-health, triage-restarts, disk-pressure, daily-audit.
insurance-sreUnderwriting prompt; cards: triage-claim, explain-risk, premium-band, audit-cohort.
Verified selecting k8s-sre yields on-domain cluster answers; insurance-sre yields underwriting answers — same endpoint, correct persona each time.
Model Factory · Port 9996

Tabs & the stable endpoint.

The model lifecycle, decoupled from the SRE: build a dataset, finetune it on burst GPU, serve the result on a stable endpoint the SRE consumes. Re-finetuning hot-swaps the weights — the URL and the dropdown entry never change.

Model Factory — Insurance Demo tab
The Insurance Demo tab — the golden path read live from real artifacts (dataset → finetune → serve → SRE consumes), with a chat box on the stable endpoint.
Model Factory tabs
TabPurpose
🎬 Insurance DemoLanding tab — the end-to-end story, live.
🧪 DatasetsSynthetic · BigSet · SRE-history sources + registry.
🔁 FinetuneRunPod burst, spend-gated, persists adapter + job ledger.
📦 ModelsMulti-alias serving registry behind the stable /v1.
⏱️ SchedulerDataset + base model + cadence; on-demand & Telegram retrain.
💬 PlaygroundChat / compare base vs finetuned.

§The stable, multi-alias endpoint

Model Factory — Models tab
The Models tab — the serving registry. Each named model sits behind one fixed endpoint.
# one URL, many models, never re-URLs on retrain
GET http://hub:9996/v1/models # → k8s-sre (default), insurance-sre
POST http://hub:9996/v1/chat/completions {"model":"k8s-sre", "messages":[...]}
Verified hot-swap the backend behind an alias (re-finetune, or swap qwen→gemma) and the SRE dropdown entry, selection, and chat are unchanged.

§Datasets

Model Factory — Datasets tab
The Datasets tab — synthetic generator, BigSet (real web data), and the registry.

Three sources feed one registry:

a.Synthetic

An LLM (OpenRouter, Ollama fallback) turns a plain-English description into a structured CSV. Controlled, repeatable — ideal for demos.

b.BigSet (real web data)

Self-hosted TinyFish BigSet (ports 3500/3501) researches the live web and returns a verified, source-attributed table. Used to build the us-health-insurers set (15 real insurers).

c.SRE-history

The cluster's own record — /api/history events plus daily audit summaries and published reports — converted to training rows. This is what makes the loop close.

§Finetune — burst, persist, serve

Model Factory — Finetune tab
The Finetune tab — RunPod burst with a spend-confirmation gate and a live job ledger.
Fig 2 — finetune pipeline
Fig 2 · The finetune pipeline. Engine: transformers + peft (Unsloth deferred — pip overlay broke torchvision on the generic CUDA image).
Verified runs: insurance-uw (loss 1.47), us-health-insurers (loss 1.78), k8s-sre (loss 3.22). ~$0.06/run, every GPU torn down. Total build spend ≈ $0.21.
Operational note. Don't restart the Model Factory during a finetune — the in-process monitor thread is orphaned and the run won't auto-persist. (Recovery: the RunPod cluster has a 10-min idle autostop; rsync the adapter down before it tears down.)

§Scheduler — keep the model fresh

Model Factory — Scheduler tab
The Scheduler tab — pick dataset + base model + cadence; run on schedule, on-demand, or from Telegram.

The Scheduler owns the retrain knobs: alias, dataset (dropdown from the registry), base model (swap to Qwen/Qwen2.5-1.5B-Instruct when the corpus is rich), and cadence (manual / 6h / 12h / daily / weekly) with an Enabled toggle. Each fire rebuilds+enriches the corpus, finetunes, and hot-swaps behind the alias.

Each scheduled run spends ~$0.06 on RunPod. The cadence only fires when Enabled + a non-manual cadence; otherwise it's manual/Telegram only.
Hermes-TG · Port 9999 · Telegram

Autonomous SRE & Telegram.

The faculty that acts on its own and reaches your phone: scheduled audits, report generation, and a two-way chat surface.

§Autonomous SRE

Hermes drives scheduled work via cron writers (daily-audit.sh, daily-report.sh): a deterministic cluster snapshot, summarized by the Console's model, written to the audit log + a saved HTML report, and pinged to Telegram.

§Two bots, two directions

BotWhereDirection · role
Telegram bot (VPS)VPSOutbound — daily audits + report links (production).
Telegram bot (hub)Hub · 9998Inbound — chat the SRE agent from your phone. One poller per token (DKUBE_SRE_TELEGRAM_BOT=1).

§Commands

Plain English routes to the cluster-aware agent. Slash commands map to the active model's cards. And:

/retrain # retrain the default alias
/retrain k8s-sre # retrain a specific model — hot-swaps behind its alias when done
/assess_health /triage_restarts /disk_pressure /daily_audit
Core idea

The closed learning loop.

Most assistants are read-only or generic. This one feeds its own operational record back into training — so it specializes in your cluster, and improves every retrain.

Fig 3 — closed learning loop
Fig 3 · The loop. New incidents accumulate → the Scheduler retrains → the model gets sharper, behind the same alias.
Verified the k8s-sre model, trained on the cluster's own audit history, answers cluster-health questions on-domain (disk I/O, pod restarts) through the SRE dropdown — the loop closed end-to-end.
Operate

Procedures — how do I…

1 · Build a dataset (synthetic or real)

  1. Open Model Factory → Datasets.
  2. Synthetic: describe the schema (“US health-insurance underwriting: policy_id, state, …”), set rows, Generate, then Save to registry.
  3. Real (BigSet): type a plain-English request and Build via BigSet (2–5 min of live web research). It lands in the registry.

2 · Train the cluster's own model (close the loop)

  1. Datasets → build the SRE-history dataset (pulls /api/history + audit summaries + reports).
  2. Finetune → pick that dataset → Burst finetune → confirm the ~$0.06 spend.
  3. On completion the adapter is merged and served behind k8s-sre automatically.

3 · Make a model selectable in the SRE

  1. Serving registers the alias on the stable endpoint (/api/serve/register) — automatic after a finetune.
  2. In the SRE, open the top-right dropdown → Model Factory group → pick the model by name.
  3. The chat persona + cards switch to that model's domain.

4 · Schedule retrains (or trigger from Telegram)

  1. Scheduler → set alias, dataset, base model, cadence → toggle EnabledSave schedule.
  2. Or hit Run now for a one-off.
  3. From your phone: message the SRE bot /retrain k8s-sre.

5 · Graduate the base model

  1. When the corpus is rich, set the Scheduler's Base model to Qwen/Qwen2.5-1.5B-Instruct (or set RETRAIN_BASE_MODEL).
  2. Run a retrain — the better model hot-swaps behind the same alias; no SRE reconfig.
Reference

Ports, deploys, env & API.

§Port map

PortServiceNotes
9998SRE Console (native)Converse + observe · light theme default
9997SRE Console (docker)Same code, container (host→9998 inside)
9999HermesAutonomous SRE
9996Model FactoryDatasets · finetune · serve · scheduler
3500/3501BigSetSelf-hosted dataset builder
<sre-host>VPS SREProduction demo + Telegram bot

§Deployment matrix

SurfaceFactory tabTG inboundAppstore tile
Hub native (9998)appstore (local)
Hub docker (9997)off
VPS (production)

§Env vars

VariableEffect
DKUBE_SRE_MODEL_FACTORY_BASEWhere the SRE dropdown finds the Factory (default 9996)
DKUBE_SRE_TELEGRAM_BOT1 = run inbound TG poller (one per token)
SRE_HISTORY_URLOperational-history source (hub vs real cluster)
SRE_REPORTS_BASEDaily-report base to mine (default: your SRE host)
RETRAIN_BASE_MODELBase model for scheduled retrains

§API index (Model Factory)

EndpointPurpose
/v1/models · /v1/chat/completionsStable multi-alias OpenAI surface
/api/datasets/{generate,save,bigset,sre-history}Dataset gen + registry + sources
/api/finetune/{burst,jobs,gpus}Spend-gated burst finetune + ledger
/api/serve/{served,register,unregister}Multi-alias serving registry
/api/scheduler/{status,config,run}Retrain cadence + on-demand + Telegram
Operate

Troubleshooting.

SymptomCause & fix
Model answers in the wrong domainThe Ollama model's baked SYSTEM prompt is generic/wrong — recreate via serve_finetune.sh <run> <alias> (it sets a domain-aware prompt by alias).
Output shows [UNK_BYTE_…]The merged model is missing the SentencePiece tokenizer.modelserve_finetune.sh copies it from the base; re-run it.
Finetune stuck “running” foreverModel Factory was restarted mid-run (monitor thread orphaned). Rsync the adapter from the still-up cluster, sky down, then re-serve.
Telegram bot doesn't replyInbound poller is off — set DKUBE_SRE_TELEGRAM_BOT=1 on exactly one deployment (one poller per token).
Model not in SRE dropdownCheck 9996/v1/models lists it; the SRE probes that endpoint. Register via /api/serve/register.
— one engineer, three faculties —

The SRE converses and observes; Hermes acts and stays awake; the Model Factory learns and specializes.
Re-finetune all you like — the endpoint and the dropdown never change.