Debugging a Self-Hosted Grafana LLM Router Dashboard

Q: Why does restart: unless-stopped not restart my container?

Because unless-stopped excludes containers stopped on purpose. It restarts containers that crashed or that a host reboot took down, but docker stop on the container, or systemctl stop docker on the whole daemon, counts as intentional and Docker leaves it down. That is the difference from restart: always. After maintenance that stops Docker, run docker compose up -d to bring those containers back.

Within a week of standing up the dashboard, three panels were showing bad data. None of them was a Grafana bug.

I had built a Grafana dashboard to watch my homelab LLM router dispatch requests across three GPU backends and a handful of cloud fallbacks. The point of LLM observability is to see which backend is hot, which lane is busy, and which model is actually answering, instead of guessing from “the local tier feels slow today.” It worked on day one. By the following weekend, three panels were lying to me, and each lie traced back to a different gap between what a config file claimed and what the system was actually doing.

The self-hosted LLM observability stack I’m watching

The dashboard watches the homelab LLM router described above, and the monitoring stack underneath it has four pieces:

A small Python host-metrics exporter running in Docker on the GPU server, exposing GPU temperature and utilization over HTTP for Prometheus to scrape.
llama.cpp’s built-in /metrics endpoint on each local backend, also scraped by Prometheus.
Request and task logs flowing into Loki via Alloy.
Grafana on top, pulling from Prometheus for time-series metrics and from Loki for log-based aggregations, rendered as a “Local Model Backend Performance” dashboard.
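For readers who have not written one, an exporter like the first piece in that list can be very small. The sketch below is not the exporter from this post. It assumes nvidia-smi is on the PATH, and the metric names, the function names, and the port in the usage note are all made up for illustration.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# The nvidia-smi query flags are real; the metric names further down are hypothetical.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_nvidia_smi(csv_text):
    """Turn 'index, temp, util' CSV rows into a list of dicts."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, temp, util = [field.strip() for field in line.split(",")]
        readings.append({"gpu": index, "temp_c": float(temp), "util_pct": float(util)})
    return readings


def render_metrics(readings):
    """Render readings in the Prometheus text exposition format."""
    lines = ["# TYPE gpu_temperature_celsius gauge"]
    lines += [f'gpu_temperature_celsius{{gpu="{r["gpu"]}"}} {r["temp_c"]}' for r in readings]
    lines.append("# TYPE gpu_utilization_percent gauge")
    lines += [f'gpu_utilization_percent{{gpu="{r["gpu"]}"}} {r["util_pct"]}' for r in readings]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Query the driver on every scrape so Prometheus never sees a cached value.
        output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        body = render_metrics(parse_nvidia_smi(output)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def serve(port):
    """Block forever, answering GET /metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve(9101) would start it, with the port being an arbitrary choice. Note that float() will raise if a card reports a field as unsupported, so a real exporter would want to handle that case.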

Three backends sit behind it:

Backend	Host	Model	Notes
RTX 3090 Ti	ubuntu1	Qwen3.6-27B, Q4_K_M	~21.5 GB VRAM
Quadro M4000	ubuntu1	Ministral-8B
Mac Studio	(Apple Silicon)	Qwen3.6-35B-A3B (MoE, llama.cpp)	the model that lied

None of this needs a Grafana Cloud account. It is the free, self-hosted stack running on hardware I already own, and every bug below is about the monitoring pipeline, not the router.

Lie 1: every temperature panel went dark

Every GPU temperature panel showed a flat “No data.” Not a weird number, not a stale reading. Nothing.

My first instinct was to blame myself, since the panels were new and I assumed I had fat-fingered a query. The query looked right. So did the Prometheus targets page, which showed the scrape job as up. So I walked one step upstream.

The exporter container was gone. docker ps did not list it. docker ps -a did, and the status column said it had exited with code 137, which is the polite Docker way of telling you something sent it a SIGKILL (137 is 128 + 9, and signal 9 is SIGKILL).
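The same check can be scripted. This is a minimal sketch, assuming you feed it the JSON that docker inspect prints for a set of containers; the function names are mine, and the 128 + N reading of exit codes is the usual shell convention rather than anything specific to this setup.

```python
import json
import signal


def describe_exit(code):
    """Explain an exit code; codes above 128 conventionally mean 'killed by signal (code - 128)'."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return f"signal {code - 128}"
    return "clean exit" if code == 0 else f"error exit {code}"


def exited_containers(inspect_json):
    """Pick out exited containers from `docker inspect` JSON (a list of container objects)."""
    found = []
    for container in json.loads(inspect_json):
        state = container["State"]
        if state["Status"] == "exited":
            # docker inspect reports names with a leading slash.
            name = container["Name"].lstrip("/")
            found.append((name, describe_exit(state["ExitCode"])))
    return found
```

Piping the output of docker inspect $(docker ps -aq) into exited_containers gives a list of (name, reason) pairs for everything that is down, which is quicker to scan than the status column.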

Then I remembered what I had done two days earlier. A disk migration on that host needed me to unmount the old Docker data-root cleanly, and the safe way to do that is systemctl stop docker, which stops the daemon and every container with it. I had restarted the router itself deliberately. The little metrics exporter I never noticed, because nothing depends on it except a dashboard I was not looking at that day.

The exporter’s Compose file had restart: unless-stopped, and I had assumed that meant “always bring this back.” It does not. Per Docker’s restart-policy docs, unless-stopped revives a container that crashed or that the host rebooted out from under, but it will not revive a container that was stopped on purpose, and systemctl stop docker counts as on purpose. That is the whole difference from restart: always: the policy lets you stop something deliberately without the daemon fighting you to bring it back.

The fix was one command:

docker compose up -d

Everything came back, temperatures included. After any operation that runs docker stop or systemctl stop docker, your unless-stopped containers need a manual up -d.
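A post-maintenance check along these lines can catch a container that stayed down. It is a sketch under assumptions: Docker Compose v2 is installed, the script runs from the directory holding the Compose file, and the function names are hypothetical. It compares what the file declares with what is running right now.

```python
import subprocess


def compose_lines(*args):
    """Run a `docker compose` subcommand and return its non-empty output lines."""
    result = subprocess.run(
        ["docker", "compose", *args],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def missing_services(declared, running):
    """Services the Compose file declares that are not currently running."""
    return sorted(set(declared) - set(running))


def check_stack():
    """Return the declared services that are down; an empty list means all are up."""
    declared = compose_lines("config", "--services")
    running = compose_lines("ps", "--services", "--status", "running")
    return missing_services(declared, running)
```

Running check_stack() after maintenance and alerting on a non-empty result turns "I forgot the exporter" into something a cron job can notice.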

Lie 2: the backend that misreported its own model

This one was sneakier, because the panel was not blank. It was confidently wrong.

The backend-comparison section had two panels both labelled “27B.” Only one of my backends runs a 27B model. The other panel was the Mac Studio, which serves the 35B-A3B mixture-of-experts model, and it had been wearing a 27B label the whole time. The performance numbers underneath were subtly off in a way you would only catch if you already knew the Mac was running something bigger.

The label came from the job_name field in prometheus.yml, a hardcoded static string I had typed in months earlier, when the routing table looked different and that slot really was a 27B model. When I later swapped the Mac’s backend, I updated the model on the Mac and never touched the scrape config that described it.

The backend already knew the truth. llama.cpp exposes a /v1/models endpoint that reports the model it has loaded right now, and a quick curl against it returned the real name. The config had never asked.

The immediate fix was to correct the string in prometheus.yml. The better fix, which I will do when it matters more, is a small startup hook that reads /v1/models and writes the value into a file for node_exporter’s textfile collector, so the label is sourced from the backend instead of from my memory.
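One way that startup hook could look is sketched below. It assumes the backend answers /v1/models with an OpenAI-style list (a "data" array whose entries carry an "id"), and that something such as node_exporter's textfile collector picks up .prom files from a directory. The metric name llm_backend_model_info, the function names, and any paths or URLs you pass in are placeholders, not values from this setup.

```python
import json
import os
import tempfile
import urllib.request


def loaded_model(payload):
    """Extract the first model id from an OpenAI-style /v1/models response."""
    return payload["data"][0]["id"]


def escape_label(value):
    """Escape a label value for the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_model_info(backend, model):
    """Render an info-style gauge whose labels carry the backend and model names."""
    return (
        "# TYPE llm_backend_model_info gauge\n"
        f'llm_backend_model_info{{backend="{escape_label(backend)}",'
        f'model="{escape_label(model)}"}} 1\n'
    )


def write_textfile(path, text):
    """Write via a temp file and rename, so a scrape never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.write(text)
    os.chmod(tmp.name, 0o644)  # temp files default to owner-only permissions
    os.replace(tmp.name, path)


def refresh(base_url, backend, path):
    """Ask the backend what it has loaded and publish that as a metric file."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        payload = json.load(resp)
    write_textfile(path, render_model_info(backend, loaded_model(payload)))
```

A Grafana panel can then join on the backend label to pick up the model name, so a model swap shows up in the legend without anyone editing the scrape config.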

Lie 3: the task-mix panel that wouldn’t break down

The third panel was the one I most wanted to work, and it stubbornly refused.

The “task type distribution” panel is supposed to show how router traffic splits across task types: local classification, cloud reasoning, cloud code, and the rest. The breakdown is the whole value. Instead it rendered a single fat bar with no per-series split, which tells you the total volume and nothing about the mix.

The panel was a Loki instant query feeding a bargauge visualization, and that pairing is the problem. A Loki instant query always returns a result frame carrying a Time field, and Grafana’s bargauge collapses a multi-series frame that includes a Time column down into one value. The data was there. The visualization was throwing the breakdown away before it drew anything.

I reached for a transform, switching to a bar chart and stacking a Reduce plus a seriesToRows transform to flatten the frame into the shape I wanted. It failed with “No numeric fields found,” because the transform pipeline expects a frame shape the Loki query does not produce in that visualization context. I spent a while convinced the transform was one checkbox away from working. It was not.

What actually fixed it was sitting right next to the broken panel. The panel immediately to its right already rendered a correctly broken-down, stacked-bar timeseries for a different metric. So I cloned it and swapped the query for this one:

sum by (task_type) (count_over_time({job="ilp-router"} | json | task_type != "" [$__interval]))

A range query with sum by (task_type) produces one series per label value, and a timeseries panel stacks those cleanly. Bargauge and timeseries are not interchangeable for multi-series Loki data, no matter how similar the query looks. When a sibling panel already produces the shape you want, the fastest fix is to clone it rather than reconfigure the transforms from scratch.
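To make the "one series per label value" shape concrete, here is a rough Python imitation of what that query computes. It is an illustration, not how Loki evaluates LogQL: it uses fixed, non-overlapping buckets instead of a sliding range, and the task_type values in the usage note are hypothetical.

```python
import json
from collections import defaultdict


def count_by_task_type(log_lines, interval_s):
    """Bucket (unix_ts, json_line) pairs into per-task_type counts per interval.

    Returns {task_type: {bucket_start: count}}, i.e. one series per label value.
    """
    series = defaultdict(lambda: defaultdict(int))
    for ts, line in log_lines:
        task_type = json.loads(line).get("task_type", "")
        if task_type == "":
            continue  # mirrors the task_type != "" filter in the query
        bucket = int(ts // interval_s) * interval_s
        series[task_type][bucket] += 1
    return {task: dict(buckets) for task, buckets in series.items()}
```

Fed three "local_classification" lines and one "cloud_code" line, it returns two separate series, each with its own timestamps and counts. That per-series structure is what a timeseries panel stacks.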

The pattern: declared state versus running state

Three panels, three layers, one shape. The temperature panels looked fine at the Prometheus config level; the container behind them was not running. The model label looked correct in the YAML; the backend had drifted out from under it. The task-mix query logic was sound; the visualization was wrong for the data shape. In every case the diagnostic move was the same: stop reading the config file and look at the running system. I learned to trust docker ps -a over docker-compose.yml, the backend’s /v1/models over prometheus.yml, and a working panel over the transform documentation.

Was it worth the trouble?

The whole stack took a weekend to stand up. The three debugging sessions added maybe another half day spread across a couple of weeks. Real cost, not enormous, but real.

Two concrete things have paid it back. The GPU-temperature visibility caught a stuck llama.cpp process once, which before the dashboard would have shown up only as “the local tier is slow” with no obvious cause. And the task-mix panel, once it actually broke down, revealed that roughly 40% of router traffic was hitting the local classification lane rather than a cloud one, which pushed me to rebalance some of that load and free GPU headroom for reasoning work. Neither of those is a finding you can pull out of a config file.

LLM Observability FAQ

How do I monitor a self-hosted LLM?

Scrape the metrics your inference server already exposes. llama.cpp publishes a /metrics endpoint with latency and token-throughput counters that Prometheus can read. Pair it with a host exporter for GPU temperature and utilization. Ship request logs to Loki to aggregate by task type or model, then chart it all per backend in Grafana so a stuck server stands out instead of disappearing into an average.
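Before wiring up Prometheus, it can help to read the endpoint by hand and confirm the counters move. The sketch below parses the plain-text exposition format and computes a crude rate between two scrapes. It ignores optional timestamps, keeps any label set as part of the key, and the counter name in the usage note is only an example; check your own server's /metrics output for the real names.

```python
def parse_metrics(text):
    """Parse 'series value' lines from Prometheus exposition text into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def per_second(before, after, elapsed_s, counter):
    """Rate of a counter between two scrapes, a hand-rolled stand-in for rate()."""
    return (after[counter] - before[counter]) / elapsed_s
```

For example, if a counter such as llamacpp:tokens_predicted_total reads 1000 on one scrape and 1600 thirty seconds later, per_second returns 20.0 tokens per second. In the llama.cpp builds I am aware of, the endpoint has to be enabled with the server's --metrics flag, so an empty or 404 response is worth checking against that first.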

Why does restart: unless-stopped not restart my container?

Because unless-stopped excludes containers stopped on purpose. It restarts containers that crashed or that a host reboot took down, but docker stop on the container, or systemctl stop docker on the whole daemon, counts as intentional and Docker leaves it down. That is the difference from restart: always. After maintenance that stops Docker, run docker compose up -d to bring those containers back.

Do I need Prometheus and Loki, or just one?

Both, because they answer different questions. Prometheus handles numeric time-series like GPU temperature, latency, and throughput, the things you chart and alert on. Loki stores logs, so it handles per-request detail like which task type or model served a call. Grafana reads both in one dashboard. You can skip Loki if you never need log-level breakdowns, but the task-mix panel that earned its keep here came straight from it.



