
Building a Production Monitoring Stack from Scratch — Part 1: Prometheus, Grafana, Node Exporter & AlertManager

Series: From NagiosXI to a Modern Observability Stack (Part 1 of 4)


The Problem with NagiosXI

We had been running NagiosXI for a while. It worked, in the way that something can work while also quietly frustrating everyone who touches it. It checked hosts, fired alerts, and we had even wired up scripts to push notifications to Mattermost. But the gaps were real and getting harder to ignore.

It was a paid solution running on our own infrastructure — a licensing cost that got harder to justify every time someone asked for something it couldn't do. OpenTelemetry support was essentially nonexistent. Application log aggregation wasn't on the table at all. Every extension we had made through plugins had taken us about as far as plugins could go.

The conversation about replacing it had been happening for a while. Eventually it stopped being a conversation and became a project. The goal: a full open-source replacement covering host metrics, alerting, log aggregation, and eventually distributed tracing. One cohesive system instead of a patchwork.

I took on the work. Phase 1 was about standing up the foundation and proving it could actually replace what NagiosXI was doing before we went further.


The Starting Point

The first week was spent getting four things working together:

Component       Role
Prometheus      Time-series metrics database and scraping engine
Grafana         Visualization and dashboarding
Node Exporter   Host-level metrics (CPU, memory, disk, network)
AlertManager    Alert routing, grouping, and silencing

The deployment runs across two nodes — mon-node-a for data collection (Prometheus, AlertManager, and agent-side components) and mon-node-b for presentation (Grafana). Keeping the presentation layer separate from the data layer was a deliberate decision: if we need to update or rebuild Grafana, it doesn't touch Prometheus, and vice versa. Everything runs in Docker.

How these pieces talk to each other matters, because one architectural choice here — pull vs push — ended up being the central problem in Part 2.

[ Linux Hosts ]
      |
  node_exporter  (runs on each host, exposes /metrics on port 9100)
      |
      ↓  (pull — Prometheus reaches out every 15s)
[ Prometheus ]  ←── scrape_configs + alerting_rules
      |
      ├──→ [ AlertManager ]
      |           |
      |           └──→ Email / Mattermost
      |
[ Grafana ]  ←── queries Prometheus via PromQL

Prometheus is pull-based. It reaches out to each target on a schedule and pulls metrics. The targets don't know Prometheus exists — they just expose an HTTP endpoint and wait. This distinction ends up mattering a lot.
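To make the pull model concrete, here is a minimal Python sketch using only the standard library. It is not how Prometheus or Node Exporter are implemented; the handler, the demo_requests_total metric, and the scrape helper are all hypothetical names for illustration. The target only serves /metrics; the scraper decides when to fetch, and a failed fetch is what turns into an up value of 0.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """A toy target: serves one metric in the Prometheus text format."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = b"# TYPE demo_requests_total counter\ndemo_requests_total 42\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def scrape(url, timeout=5.0):
    """One pull: returns (up, body). up is 1 on success, 0 on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 1, resp.read().decode("utf-8")
    except OSError:  # covers URLError, refused connections, timeouts
        return 0, ""


def demo():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    reachable = scrape(f"http://127.0.0.1:{port}/metrics")
    # Stop the target, then scrape again: the scraper notices, the target does not.
    server.shutdown()
    server.server_close()
    unreachable = scrape(f"http://127.0.0.1:{port}/metrics", timeout=1.0)
    return reachable, unreachable
```

Calling demo() returns one successful scrape and one failed scrape. The target never learns who scraped it or whether anyone did, which is the property the rest of this series keeps running into.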


Getting Host Metrics In

Node Exporter is a lightweight binary that runs on each host and exposes hardware and OS-level metrics at a /metrics HTTP endpoint. Deploy one per machine, point Prometheus at it, done.

# Verify it's running
curl -s http://<host-ip>:9100/metrics | head -50

If you see # HELP and # TYPE blocks followed by metric lines, you're good.
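If you would rather check this from a script than by eye, a rough sketch along these lines may help. The summarize_exposition function and the SAMPLE text are made up for illustration (the values are not from a real host), and the parser only counts line types; it does not validate the full exposition format.

```python
def summarize_exposition(text):
    """Count HELP lines, TYPE lines, and sample lines in text-format output."""
    counts = {"help": 0, "type": 0, "samples": 0}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# HELP "):
            counts["help"] += 1
        elif line.startswith("# TYPE "):
            counts["type"] += 1
        elif not line.startswith("#"):
            counts["samples"] += 1  # anything that is not a comment is a sample
    return counts


# Illustrative input only; real Node Exporter output is much longer.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.3e+09
"""

summary = summarize_exposition(SAMPLE)
```

A healthy endpoint should produce non-zero counts for all three keys.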

Getting the metrics in wasn't the hard part. The harder part was getting them in cleanly, with enough context attached that alerts and dashboards would actually be useful. A raw IP address as the target label tells you very little when something breaks at 2am.

The solution was file-based service discovery with rich labels. Instead of listing targets directly in prometheus.yml, Prometheus watches a directory of JSON files:

[
  {
    "targets": ["192.168.0.101:9100"],
    "labels": {
      "hostname": "web-server-01",
      "environment": "production",
      "location": "Primary Rack",
      "maintainers": "admin@domain.com"
    }
  }
]

# prometheus.yml
scrape_configs:
  - job_name: "node_exporter"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 30s

Drop a file in and the host is monitored without a reload: Prometheus watches the files for changes and also re-reads them every refresh_interval (30 seconds here) as a fallback. The labels on each target are attached to every metric scraped from that host, which means they're available in alert annotations, in Grafana, everywhere, as long as a query doesn't aggregate them away. When HostDown fires, the alert can say which host, in which environment, and who to contact. That's the payoff.
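Target files like the one above are easy to generate from whatever inventory you already have. The following is a sketch under assumptions, not the tooling we used: target_group and write_targets are hypothetical helpers. The one design choice worth copying is writing to a temporary file and renaming it into place, so Prometheus never reads a half-written JSON file.

```python
import json
import os
import tempfile


def target_group(address, hostname, environment, location, maintainers):
    """Build one file_sd entry with the same label set used in this article."""
    return {
        "targets": [address],
        "labels": {
            "hostname": hostname,
            "environment": environment,
            "location": location,
            "maintainers": maintainers,
        },
    }


def write_targets(path, groups):
    """Write the target file atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the temp file out of the *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
        handle.write("\n")
    # mkstemp creates the file readable by the owner only; widen that so a
    # Prometheus process running as another user can still read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

Usage would look like write_targets("/etc/prometheus/targets/web-server-01.json", [target_group("192.168.0.101:9100", "web-server-01", "production", "Primary Rack", "admin@domain.com")]), with one file per host or per group of hosts.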


The up Metric

One of Prometheus's built-in synthetic metrics is up. For every scrape target:

  • up = 1 — scrape succeeded
  • up = 0 — scrape failed

This is the most fundamental health signal in the stack. Everything else — CPU, memory, disk — is meaningless if you can't even reach the host. And because up carries all the labels from your target file, you can immediately see which host is down, in which environment.

# All down hosts right now
up{job="node_exporter"} == 0

I keep coming back to up throughout this series because it's also where things can silently break if you change the architecture carelessly. More on that in Part 2.
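The same query can be run outside Grafana through the Prometheus HTTP API (the /api/v1/query endpoint). Here is a small sketch, assuming Prometheus is reachable at whatever base URL you pass in; query_url, down_hosts, and fetch_down_hosts are hypothetical helper names, and error handling is kept to a minimum.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def down_hosts(response):
    """Return the label sets of series whose value is 0 in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return [
        series["metric"]
        for series in response["data"]["result"]
        # Each value is a [timestamp, "string value"] pair.
        if float(series["value"][1]) == 0.0
    ]


def fetch_down_hosts(base_url, job="node_exporter"):
    """Ask a live Prometheus which targets of the given job are down."""
    url = query_url(base_url, f'up{{job="{job}"}} == 0')
    with urllib.request.urlopen(url, timeout=10) as resp:
        return down_hosts(json.load(resp))
```

Because up carries the target labels, each returned label set includes hostname, environment, and maintainers alongside instance.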


Dashboards

Grafana connects to Prometheus as a data source and queries it via PromQL. The community dashboards are easy to import and useful for getting started, but building your own is worth doing because it forces you to understand exactly what you're looking at.

The core panels and the queries behind them:

CPU Usage (%)

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
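To make the arithmetic in that query concrete, here is a rough Python sketch with made-up numbers. It is a simplification: the real rate() also handles counter resets and extrapolates to the window edges, which simple_rate below does not. The function names and the two-core host are hypothetical.

```python
def simple_rate(first, last):
    """Per-second increase between two (timestamp, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_per_core):
    """Mirror of 100 - avg(rate(idle)) * 100 for one host."""
    idle_rates = [simple_rate(first, last) for first, last in idle_samples_per_core]
    avg_idle = sum(idle_rates) / len(idle_rates)  # fraction of time spent idle
    return 100 - avg_idle * 100


# Hypothetical host with two cores, sampled 300 seconds apart.
# node_cpu_seconds_total{mode="idle"} counts seconds spent idle per core.
cores = [
    ((0, 1000.0), (300, 1270.0)),  # idle for 270 of 300 s
    ((0, 2000.0), (300, 2150.0)),  # idle for 150 of 300 s
]
usage = cpu_usage_percent(cores)  # roughly 30
```

The idle counter grows by at most one second per second per core, so its rate is the idle fraction; averaging across cores and subtracting from 100 gives the busy percentage.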

Memory Usage (%)

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk Usage (%) — the fstype filter excludes Docker overlays and tmpfs mounts that inflate results

(1 - (
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} /
  node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
)) * 100

Fleet status — a stat panel showing every host's current state

up{job="node_exporter"}

Value mappings: 1 → 🟢 UP, 0 → 🔴 DOWN.

Adding a dashboard variable for instance, defined as label_values(up{job="node_exporter"}, instance), gives you a dropdown to filter to a specific host or view the whole fleet. That one change makes the dashboard genuinely useful for day-to-day operations.


Alerting

Prometheus evaluates alerting rules and forwards firing alerts to AlertManager. AlertManager handles the business logic: who gets notified, when, how often, and what to suppress.

The rules themselves live in separate YAML files:

groups:
  - name: node_exporter_alerts
    rules:

      - alert: HostDown
        expr: up{job="node_exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: >
            {{ $labels.hostname }} has been unreachable for more than 2 minutes.
            Maintainers: {{ $labels.maintainers }}

      - alert: HighCPUUsage
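        # Keep hostname in the aggregation: "by (instance)" alone would drop it,
        # and {{ $labels.hostname }} in the description would render empty.
        expr: >
          100 - (avg by (instance, hostname) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85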
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: >
            CPU on {{ $labels.hostname }} has been above 85% for 5 minutes.
            Current: {{ $value | printf "%.1f" }}%

The for: 2m on HostDown absorbs brief network glitches. Without it, a momentary scrape failure sends an alert. The rich labels on the target — hostname, maintainers — show up directly in the alert annotations.
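The effect of the for clause is easiest to see as a small state machine. The sketch below is a simplified model, not Prometheus code: alert_states is a hypothetical function, and the 60-second evaluation interval is an assumption chosen to keep the example short.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # time the condition first became true, if it still is
    for index, condition in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not condition:
            active_since = None  # any healthy evaluation resets the clock
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# One failed scrape (a glitch), a recovery, then a real outage.
# Assumed: rules evaluated every 60 s, for: 2m as in HostDown.
history = [False, True, False, True, True, True, True]
states = alert_states(history, eval_interval_s=60, for_s=120)
```

Here the single failed evaluation only reaches pending and never notifies anyone, while the sustained outage moves from pending to firing once the condition has held for the full two minutes.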

One AlertManager config worth explaining is the inhibit rule:

inhibit_rules:
  - source_match:
      alertname: "HostDown"
    target_match_re:
      alertname: "HighCPUUsage|HighMemoryUsage|DiskSpaceLow"
    equal: ["instance"]

When HostDown fires for a host, AlertManager suppresses the alerts named in target_match_re for that same host (matched on the instance label). There's no useful signal in a HighMemoryUsage alert for a machine that isn't reachable. Without this, a single dead host can generate a cascade of noise.
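The matching logic behind that rule can be sketched in a few lines of Python. This is a simplified model of inhibition, not AlertManager's implementation; is_inhibited and the example alerts are hypothetical. As written, the rule only touches the alert names listed in target_match_re, and only when the instance label matches the firing HostDown alert.

```python
import re


def is_inhibited(target, firing_alerts, source_name, target_pattern, equal):
    """Return True if `target` should be muted by a firing source alert.

    Alerts are plain dicts of labels. The regex is matched against the whole
    alert name, and a missing label is treated the same as an empty one.
    """
    if not re.fullmatch(target_pattern, target.get("alertname", "")):
        return False  # the rule does not cover this alert at all
    for source in firing_alerts:
        if source.get("alertname") != source_name:
            continue
        if all(source.get(label, "") == target.get(label, "") for label in equal):
            return True
    return False


RULE = {
    "source_name": "HostDown",
    "target_pattern": "HighCPUUsage|HighMemoryUsage|DiskSpaceLow",
    "equal": ["instance"],
}

firing = [{"alertname": "HostDown", "instance": "192.168.0.101:9100"}]
same_host = {"alertname": "HighMemoryUsage", "instance": "192.168.0.101:9100"}
other_host = {"alertname": "HighMemoryUsage", "instance": "192.168.0.102:9100"}

muted = is_inhibited(same_host, firing, **RULE)
still_sent = not is_inhibited(other_host, firing, **RULE)
```

The equal list is what scopes the suppression to one host; without it, a single HostDown would mute resource alerts across the whole fleet.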


The last_seen Pattern

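When a host is down but still listed as a target, up returns 0. When it disappears completely (the target file is removed, or the host drops out of discovery), Prometheus stops scraping it and its series go stale. At that point up{instance="..."} doesn't return 0; it returns nothing, because there's no scrape happening. You lose the ability to answer "when did this thing last check in?"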

A recording rule fixes this by continuously writing a timestamp whenever a host is up:

groups:
  - name: recording_rules
    rules:
      - record: node_last_seen_timestamp
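        expr: time() * (up{job="node_exporter"} == 1)

The == 1 filter matters: without it, a failed scrape (up == 0) would record a value of 0 rather than nothing. With it, the rule writes the current Unix timestamp on every evaluation cycle, but only while the host is up. When a host goes dark, the series stops receiving new samples, so a plain instant query stops returning it after a few minutes. The old samples are still in storage, though, and a range function can reach back for them. In Grafana (the 30d window is an arbitrary choice; use however far back you want to look):

time() - max_over_time(node_last_seen_timestamp[30d])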

Format as duration and you get: "last seen 3h 22m ago." It's a small thing but it's become one of the most-used panels.
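Stripped of PromQL, the idea is simply "find the most recent moment the host was up, then subtract it from now". Here is a minimal sketch in Python, with hypothetical function names and made-up sample data:

```python
def last_seen(samples):
    """Return the latest timestamp at which up was 1, or None if never seen.

    `samples` is an iterable of (unix_timestamp, up_value) pairs.
    """
    seen = [timestamp for timestamp, up in samples if up == 1]
    return max(seen) if seen else None


def format_age(seconds):
    """Render a duration in seconds as 'Xh Ym'."""
    hours, remainder = divmod(int(seconds), 3600)
    return f"{hours}h {remainder // 60}m"


# Made-up history: the host answers twice, then fails two scrapes.
history = [
    (1_700_000_000, 1),
    (1_700_000_015, 1),
    (1_700_000_030, 0),
    (1_700_000_045, 0),
]
seen_at = last_seen(history)

# Pretend "now" is 3 hours 22 minutes after the last successful scrape.
now = seen_at + 3 * 3600 + 22 * 60
age = format_age(now - seen_at)
```

The failed scrapes at the end do not move the last-seen time forward, which is exactly the behavior the panel depends on.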


Where This Left Off

By the end of the first week, the stack was functionally replacing NagiosXI for host monitoring: Prometheus was scraping every host every 15 seconds, dashboards showed the fleet, AlertManager was routing alerts with inhibit rules and deduplication, and recording rules kept last-seen timestamps for hosts that went dark.

But there was a question I hadn't resolved yet.

Node Exporter is a single-purpose binary: host metrics and nothing else. The moment we wanted logs or traces from these same hosts, we'd need additional agents running alongside it. And adding a host to monitoring still meant manual work on every machine: SSH in, install Node Exporter, and write the target file by hand. File-based discovery spares the Prometheus reload, but nothing else.

My colleague had been working in parallel, exploring the multiple-exporter approach — a separate binary for each signal type. I'd been looking at Grafana Alloy, which promised a single agent that could handle all of it. We hadn't converged yet, and there were real questions about whether Alloy was ready enough to build on.

That's what Part 2 is about.

Next: Building a Production Monitoring Stack from Scratch — Part 2: Grafana Alloy & the Push vs Pull Problem