[{"data":1,"prerenderedAt":2753},["ShallowReactive",2],{"nav-stories":3,"blog-lgtm-stack\u002Fpart-4":61,"ref-\u002Fblog\u002Flgtm-stack\u002Fpart-1":343,"ref-\u002Fblog\u002Flgtm-stack\u002Fpart-2":1454,"ref-\u002Fblog\u002Flgtm-stack\u002Fpart-3":2133},[4,16,25,34,43,52],{"id":5,"color":6,"extension":7,"image":8,"label":9,"link":10,"meta":11,"order":12,"stem":13,"text":14,"__hash__":15},"stories\u002Fstories\u002F01-data-center.yml",null,"yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1558494949-ef010cbdcc31?w=1080","DATA_CENTER","https:\u002F\u002Fx.com\u002Fabbeytetteh_",{},1,"stories\u002F01-data-center","Racking new servers. 40gbit backbone online.","0QUZQbaANhdO8WemZxkDdO7vbVopfnynHtH9FxBZb_w",{"id":17,"color":6,"extension":7,"image":18,"label":19,"link":6,"meta":20,"order":21,"stem":22,"text":23,"__hash__":24},"stories\u002Fstories\u002F02-thoughts.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1498050108023-c5249f4df085?w=1080","THOUGHTS",{},2,"stories\u002F02-thoughts","Late night bug hunting. Found the memory leak.","Gd1am954aasY6HRHD7hCtOuessXb6zYZ8iizS501ICg",{"id":26,"color":27,"extension":7,"image":6,"label":28,"link":6,"meta":29,"order":30,"stem":31,"text":32,"__hash__":33},"stories\u002Fstories\u002F03-coding.yml","#3b82f6","CODING",{},3,"stories\u002F03-coding","Just thinking about how much easier life is with Swarm. https:\u002F\u002Fgoogle.com","-WTk-47jnLM-TZRWBg0VbJyZJfIM7FpQ5HGbc8LEdhQ",{"id":35,"color":6,"extension":7,"image":36,"label":37,"link":6,"meta":38,"order":39,"stem":40,"text":41,"__hash__":42},"stories\u002Fstories\u002F04-update.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1591799264318-7e6ef8ddb7ea?w=1080","UPDATE",{},4,"stories\u002F04-update","New cluster nodes arrived. 
Prepping for installation.","kyT60N5C6Re_jMonZbgNy0PbQhzXmUWxDbD0D_v43ts",{"id":44,"color":45,"extension":7,"image":6,"label":46,"link":6,"meta":47,"order":48,"stem":49,"text":50,"__hash__":51},"stories\u002Fstories\u002F05-setup.yml","#86868b","SETUP",{},5,"stories\u002F05-setup","Optimizing the telemetry pipeline for 1M req\u002Fs.","cPOBkzoyXsCmPgRO2d80Hj3vm4MP-6nAejtlQ5iuSzw",{"id":53,"color":6,"extension":7,"image":54,"label":55,"link":6,"meta":56,"order":57,"stem":58,"text":59,"__hash__":60},"stories\u002Fstories\u002F06-travel.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1560969184-10fe8719e047?w=1080","TRAVEL",{},6,"stories\u002F06-travel","Travel log — system architecture workshop in Berlin.","jnOxerdF6usAIHdR35Z-opx0LJAy9kZluXnZhtz62Z0",{"id":62,"title":63,"body":64,"category":323,"date":324,"description":325,"extension":326,"meta":327,"navigation":328,"path":329,"readTime":330,"seo":331,"stem":332,"tags":333,"thumbnail":341,"__hash__":342},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-4.md","Building a Production Monitoring Stack from Scratch — Part 4: The Enrollment API",{"type":65,"value":66,"toc":313},"minimark",[67,81,85,88,91,94,99,102,130,133,136,138,142,145,148,151,154,156,160,163,166,169,171,175,178,209,215,218,225,228,230,234,241,244,247,249,253,256,267,270,273,275,279,282,288,298,304,310],[68,69,70],"blockquote",{},[71,72,73,77,78],"p",{},[74,75,76],"strong",{},"Series:"," From NagiosXI to a Modern Observability Stack\n",[74,79,80],{},"Part 4 of 4",[82,83],"reference",{"path":84},"\u002Fblog\u002Flgtm-stack\u002Fpart-1",[82,86],{"path":87},"\u002Fblog\u002Flgtm-stack\u002Fpart-2",[82,89],{"path":90},"\u002Fblog\u002Flgtm-stack\u002Fpart-3",[92,93],"hr",{},[95,96,98],"h2",{"id":97},"the-remaining-friction","The Remaining Friction",[71,100,101],{},"Three parts in, the monitoring stack was complete in terms of capability. 
But adding a new host still looked like this:",[103,104,105,109,112,115,118,121,124,127],"ol",{},[106,107,108],"li",{},"SSH onto the host",[106,110,111],{},"Install Grafana Alloy",[106,113,114],{},"Write the Alloy config for that host's specific services",[106,116,117],{},"Start the service",[106,119,120],{},"Write a Prometheus target JSON file with the host's full metadata",[106,122,123],{},"Place that file in the targets directory on mon-node-a",[106,125,126],{},"Reload Prometheus",[106,128,129],{},"Verify the host appeared in Grafana",[71,131,132],{},"At ten hosts this is manageable. We were heading toward significantly more than ten, across different subnets, different OS versions, different service combinations. Each enrollment was a context switch — SSH session, config writing, file placement — and each one was another opportunity for a typo in a label that would surface as a confusing gap in a dashboard weeks later.",[71,134,135],{},"The obvious answer was automation. The question was what form it should take.",[92,137],{},[95,139,141],{"id":140},"why-not-ansible","Why Not Ansible",[71,143,144],{},"Ansible was the straightforward choice. Write a playbook, run it against a host, done. It's what most teams would reach for.",[71,146,147],{},"The problem was the operational model it would create. A playbook lives in a repository. Adding a host means committing a vars file, or updating an inventory, pushing to a repo, waiting for a pipeline. Secrets need to be encrypted, which means setting up Vault or ansible-vault, wiring that into CI, making sure anyone who needs to enroll a host has access to the right keys. You've traded one kind of manual work for another.",[71,149,150],{},"It also meant the monitoring stack would have an external dependency — a separate repository, a CI system — just to add a host to Prometheus. 
If the pipeline was down, enrollment was blocked.",[71,152,153],{},"What I wanted was something self-contained that lived with the monitoring stack and could be used by anyone with access to Grafana, without needing to touch a repository or understand the underlying infrastructure.",[92,155],{},[95,157,159],{"id":158},"the-grafana-plugin-insight","The Grafana Plugin Insight",[71,161,162],{},"Grafana has a plugin system. You can build a custom frontend that installs directly into Grafana as a panel or app, with its own pages, its own navigation, its own UI. It appears in the sidebar alongside Dashboards and Alerting as if it were a native part of the product.",[71,164,165],{},"That was the piece that made the whole thing click. If I built an enrollment API and a Grafana plugin that called it, the entire workflow would live inside the tool the team was already using. No separate app to navigate to, no CLI to remember. Just a form in Grafana: fill in the host details, submit, host is monitored.",[71,167,168],{},"I'd built APIs in Python before and had experience writing Python to SSH into hosts and execute commands — that background came from working on a CD pipeline from scratch. The backend wasn't the unknown part. The Grafana plugin was new territory.",[92,170],{},[95,172,174],{"id":173},"the-api","The API",[71,176,177],{},"The backend is a Python API. An enrollment request comes in with the host's IP, SSH credentials, the services running on it, and the label metadata to attach to it in Prometheus. 
The API then:",[103,179,180,183,186,189,192,200,203,206],{},[106,181,182],{},"Connects to the host via SSH (key or password)",[106,184,185],{},"Detects the OS — Debian, RHEL, SUSE, Windows each have different package managers and service managers",[106,187,188],{},"Installs Grafana Alloy if not present, or validates the existing installation",[106,190,191],{},"Generates an Alloy config appropriate for that host and its services from templates",[106,193,194,195,199],{},"Deploys the config and validates it by running ",[196,197,198],"code",{},"alloy fmt"," on the remote host before restarting the service",[106,201,202],{},"Writes the Prometheus target file with the full label set",[106,204,205],{},"Reloads Prometheus via its HTTP API",[106,207,208],{},"Returns a structured response",[71,210,211,212,214],{},"The validation step — running ",[196,213,198],{}," on the host itself before restarting Alloy — was an early decision that proved its worth. Config template bugs would otherwise surface as a silent Alloy failure: service appears to restart, metrics stop appearing, nothing in the logs that makes the cause obvious. Catching the syntax error before committing to it saved that confusion more than once.",[71,216,217],{},"Enrollment is idempotent. Running it against an existing host checks what's already there, updates what's changed, and skips what hasn't. Re-enrolling a host after a template update is a normal operation, not a risky one.",[71,219,220,221,224],{},"For hosts being decommissioned, the API renames the target file with a ",[196,222,223],{},".deleted.\u003Ctimestamp>"," suffix rather than deleting it. 
The targets directory ends up as a passive audit trail — you can see every host that was ever enrolled and when it was removed, without digging through logs.",[71,226,227],{},"The API also handles batch enrollment — a list of hosts processed concurrently up to a configurable limit, with per-host status tracking so failed hosts can be retried independently.",[92,229],{},[95,231,233],{"id":232},"the-grafana-plugin","The Grafana Plugin",[71,235,236,237,240],{},"The plugin is a Grafana app plugin — it installs into Grafana and adds pages to the sidebar. The enrollment form lives at ",[196,238,239],{},"\u002Fgrafana\u002Fa\u002Firis\u002Fenroll",". It has fields for connection details, host type, SSH credentials, service configuration, and the label metadata that ends up in Prometheus.",[71,242,243],{},"Building a Grafana plugin for the first time meant learning the plugin SDK, understanding how Grafana's frontend architecture works, and figuring out how to wire API calls through Grafana's proxy so the backend isn't exposed directly. None of that was especially difficult, but it was all new, and the documentation for app plugins is thinner than for panel plugins.",[71,245,246],{},"The result is that enrollment happens entirely within Grafana. An operator fills in the form, hits submit, and within about 30 seconds the host appears in the fleet dashboard. The underlying SSH, config generation, and Prometheus reload are invisible. The plugin also has pages for viewing enrolled hosts, managing labels, and handling batch enrollments from a file upload.",[92,248],{},[95,250,252],{"id":251},"what-changed","What Changed",[71,254,255],{},"Before the API, enrolling a host was a sequence of manual steps spread across multiple systems. After it, the same operation takes 30 seconds and happens inside the tool the team already has open.",[71,257,258,259,262,263,266],{},"The label consistency improved noticeably. 
When metadata is entered through a form with defined fields rather than hand-written into a JSON file, the alert annotations and dashboard filters stay clean. No more ",[196,260,261],{},"maintainer"," vs ",[196,264,265],{},"maintainers"," label mismatches surfacing weeks later.",[71,268,269],{},"The audit trail in the targets directory — active files, deleted files with timestamps — became useful almost immediately. During a network audit, being able to answer \"when was this host enrolled and when was it removed\" from the directory listing alone, without touching logs or databases, turned out to be genuinely handy.",[71,271,272],{},"The API also became the foundation for more. Once you have a reliable programmatic path into the monitoring stack, other things become possible — automated enrollment from infrastructure provisioning, health checks, label updates when service ownership changes. That expansion became its own project.",[92,274],{},[95,276,278],{"id":277},"closing-the-series","Closing the Series",[71,280,281],{},"The series started with the question of whether an open-source observability stack could genuinely replace NagiosXI in production. By Part 4, the answer was clearly yes — and the result had gone further than parity.",[71,283,284,287],{},[74,285,286],{},"Part 1"," — Prometheus, Grafana, Node Exporter, AlertManager. Functional host monitoring, replacing what NagiosXI did.",[71,289,290,293,294,297],{},[74,291,292],{},"Part 2"," — Grafana Alloy replacing Node Exporter. One agent per host, the push vs pull problem, recovering the ",[196,295,296],{},"up"," metric.",[71,299,300,303],{},[74,301,302],{},"Part 3"," — Loki and Tempo. Logs and distributed traces alongside metrics, all queryable from Grafana with signals linked to each other.",[71,305,306,309],{},[74,307,308],{},"Part 4"," — The enrollment API and Grafana plugin. 
The operational friction of adding hosts, eliminated.",[71,311,312],{},"The stack covers the full production fleet with metrics, logs, and traces. Alerts are accurate. Enrollment takes under a minute. The whole thing runs on open-source software with no licensing costs.",{"title":314,"searchDepth":21,"depth":21,"links":315},"",[316,317,318,319,320,321,322],{"id":97,"depth":21,"text":98},{"id":140,"depth":21,"text":141},{"id":158,"depth":21,"text":159},{"id":173,"depth":21,"text":174},{"id":232,"depth":21,"text":233},{"id":251,"depth":21,"text":252},{"id":277,"depth":21,"text":278},"Blog","2026-05-20","Eliminating manual SSH work — and why the solution ended up living inside Grafana.","md",{},true,"\u002Fblog\u002Flgtm-stack\u002Fpart-4","10 min",{"title":63,"description":325},"blog\u002Flgtm-stack\u002Fpart-4",[334,335,336,337,338,339,340],"Monitoring","Prometheus","Alloy","DevOps","Automation","API","Grafana","\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-4.png","5IGjILaaxONrcL_a3lge9njzhMlPwFM5F8Xz1yvMaI8",{"id":344,"title":345,"body":346,"category":323,"date":1446,"description":1447,"extension":326,"meta":1448,"navigation":328,"path":84,"readTime":330,"seo":1449,"stem":1450,"tags":1451,"thumbnail":1452,"__hash__":1453},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-1.md","Building a Production Monitoring Stack from Scratch — Part 1: Prometheus, Grafana, Node Exporter & AlertManager",{"type":65,"value":347,"toc":1434},[348,357,359,363,366,369,372,375,377,381,384,442,453,456,465,472,474,478,485,536,547,550,557,720,792,804,806,813,819,834,840,857,863,865,869,872,875,880,889,894,903,913,938,944,953,964,975,977,981,984,987,1218,1233,1236,1307,1317,1319,1326,1336,1339,1388,1395,1404,1407,1409,1413,1416,1419,1422,1425,1428,1430],[68,349,350],{},[71,351,352,77,354],{},[74,353,76],{},[74,355,356],{},"Part 1 of 4",[92,358],{},[95,360,362],{"id":361},"the-problem-with-nagiosxi","The Problem with NagiosXI",[71,364,365],{},"We had been running NagiosXI for a while. 
It worked, in the way that something can work while also quietly frustrating everyone who touches it. It checked hosts, fired alerts, and we had even wired up scripts to push notifications to Mattermost. But the gaps were real and getting harder to ignore.",[71,367,368],{},"It was a paid solution running on our own infrastructure — a licensing cost that got harder to justify every time someone asked for something it couldn't do. OpenTelemetry support was essentially nonexistent. Application log aggregation wasn't on the table at all. Every extension we had made through plugins had taken us about as far as plugins could go.",[71,370,371],{},"The conversation about replacing it had been happening for a while. Eventually it stopped being a conversation and became a project. The goal: a full open-source replacement covering host metrics, alerting, log aggregation, and eventually distributed tracing. One cohesive system instead of a patchwork.",[71,373,374],{},"I took on the work. Phase 1 was about standing up the foundation and proving it could actually replace what NagiosXI was doing before we went further.",[92,376],{},[95,378,380],{"id":379},"the-starting-point","The Starting Point",[71,382,383],{},"The first week was spent getting four things working together:",[385,386,387,400],"table",{},[388,389,390],"thead",{},[391,392,393,397],"tr",{},[394,395,396],"th",{},"Component",[394,398,399],{},"Role",[401,402,403,413,422,432],"tbody",{},[391,404,405,410],{},[406,407,408],"td",{},[74,409,335],{},[406,411,412],{},"Time-series metrics database and scraping engine",[391,414,415,419],{},[406,416,417],{},[74,418,340],{},[406,420,421],{},"Visualization and dashboarding",[391,423,424,429],{},[406,425,426],{},[74,427,428],{},"Node Exporter",[406,430,431],{},"Host-level metrics (CPU, memory, disk, network)",[391,433,434,439],{},[406,435,436],{},[74,437,438],{},"AlertManager",[406,440,441],{},"Alert routing, grouping, and silencing",[71,443,444,445,448,449,452],{},"The deployment 
runs across two nodes — ",[74,446,447],{},"mon-node-a"," for data collection (Prometheus, AlertManager, and agent-side components) and ",[74,450,451],{},"mon-node-b"," for presentation (Grafana). Keeping the presentation layer separate from the data layer was a deliberate decision: if we need to update or rebuild Grafana, it doesn't touch Prometheus, and vice versa. Everything runs in Docker.",[71,454,455],{},"How these pieces talk to each other matters, because one architectural choice here — pull vs push — ended up being the central problem in Part 2.",[457,458,463],"pre",{"className":459,"code":461,"language":462},[460],"language-text","[ Linux Hosts ]\n      |\n  node_exporter  (runs on each host, exposes \u002Fmetrics on port 9100)\n      |\n      ↓  (pull — Prometheus reaches out every 15s)\n[ Prometheus ]  ←── scrape_configs + alerting_rules\n      |\n      ├──→ [ AlertManager ]\n      |           |\n      |           └──→ Email \u002F Mattermost\n      |\n[ Grafana ]  ←── queries Prometheus via PromQL\n","text",[196,464,461],{"__ignoreMap":314},[71,466,467,468,471],{},"Prometheus is ",[74,469,470],{},"pull-based",". It reaches out to each target on a schedule and pulls metrics. The targets don't know Prometheus exists — they just expose an HTTP endpoint and wait. This distinction ends up mattering a lot.",[92,473],{},[95,475,477],{"id":476},"getting-host-metrics-in","Getting Host Metrics In",[71,479,480,481,484],{},"Node Exporter is a lightweight binary that runs on each host and exposes hardware and OS-level metrics at a ",[196,482,483],{},"\u002Fmetrics"," HTTP endpoint. 
Deploy one per machine, point Prometheus at it, done.",[457,486,490],{"className":487,"code":488,"language":489,"meta":314,"style":314},"language-bash shiki shiki-themes vitesse-light","# Verify it's running\ncurl http:\u002F\u002F\u003Chost-ip>:9100\u002Fmetrics | head -50\n","bash",[196,491,492,500],{"__ignoreMap":314},[493,494,496],"span",{"class":495,"line":12},"line",[493,497,499],{"class":498},"s8zF2","# Verify it's running\n",[493,501,502,506,510,514,517,520,523,526,529,532],{"class":495,"line":21},[493,503,505],{"class":504},"sySUi","curl",[493,507,509],{"class":508},"spphp"," http:\u002F\u002F",[493,511,513],{"class":512},"si04Y","\u003C",[493,515,516],{"class":508},"host-i",[493,518,71],{"class":519},"suHK_",[493,521,522],{"class":512},">",[493,524,525],{"class":508},":9100\u002Fmetrics",[493,527,528],{"class":512}," |",[493,530,531],{"class":504}," head",[493,533,535],{"class":534},"sEi1f"," -50\n",[71,537,538,539,542,543,546],{},"If you see ",[196,540,541],{},"# HELP"," and ",[196,544,545],{},"# TYPE"," blocks followed by metric lines, you're good.",[71,548,549],{},"Getting the metrics in wasn't the hard part. The harder part was getting them in cleanly, with enough context attached that alerts and dashboards would actually be useful. A raw IP address as the target label tells you very little when something breaks at 2am.",[71,551,552,553,556],{},"The solution was file-based service discovery with rich labels. 
Instead of listing targets directly in ",[196,554,555],{},"prometheus.yml",", Prometheus watches a directory of JSON files:",[457,558,562],{"className":559,"code":560,"language":561,"meta":314,"style":314},"language-json shiki shiki-themes vitesse-light","[\n  {\n    \"targets\": [\"192.168.0.101:9100\"],\n    \"labels\": {\n      \"hostname\": \"web-server-01\",\n      \"environment\": \"production\",\n      \"location\": \"Primary Rack\",\n      \"maintainers\": \"admin@domain.com\"\n    }\n  }\n]\n","json",[196,563,564,570,575,605,619,642,662,683,702,708,714],{"__ignoreMap":314},[493,565,566],{"class":495,"line":12},[493,567,569],{"class":568},"sYZai","[\n",[493,571,572],{"class":495,"line":21},[493,573,574],{"class":568},"  {\n",[493,576,577,581,585,588,591,594,597,600,602],{"class":495,"line":30},[493,578,580],{"class":579},"s61at","    \"",[493,582,584],{"class":583},"su6XF","targets",[493,586,587],{"class":579},"\"",[493,589,590],{"class":568},":",[493,592,593],{"class":568}," [",[493,595,587],{"class":596},"sSP4y",[493,598,599],{"class":508},"192.168.0.101:9100",[493,601,587],{"class":596},[493,603,604],{"class":568},"],\n",[493,606,607,609,612,614,616],{"class":495,"line":39},[493,608,580],{"class":579},[493,610,611],{"class":583},"labels",[493,613,587],{"class":579},[493,615,590],{"class":568},[493,617,618],{"class":568}," {\n",[493,620,621,624,627,629,631,634,637,639],{"class":495,"line":48},[493,622,623],{"class":579},"      \"",[493,625,626],{"class":583},"hostname",[493,628,587],{"class":579},[493,630,590],{"class":568},[493,632,633],{"class":596}," 
\"",[493,635,636],{"class":508},"web-server-01",[493,638,587],{"class":596},[493,640,641],{"class":568},",\n",[493,643,644,646,649,651,653,655,658,660],{"class":495,"line":57},[493,645,623],{"class":579},[493,647,648],{"class":583},"environment",[493,650,587],{"class":579},[493,652,590],{"class":568},[493,654,633],{"class":596},[493,656,657],{"class":508},"production",[493,659,587],{"class":596},[493,661,641],{"class":568},[493,663,665,667,670,672,674,676,679,681],{"class":495,"line":664},7,[493,666,623],{"class":579},[493,668,669],{"class":583},"location",[493,671,587],{"class":579},[493,673,590],{"class":568},[493,675,633],{"class":596},[493,677,678],{"class":508},"Primary Rack",[493,680,587],{"class":596},[493,682,641],{"class":568},[493,684,686,688,690,692,694,696,699],{"class":495,"line":685},8,[493,687,623],{"class":579},[493,689,265],{"class":583},[493,691,587],{"class":579},[493,693,590],{"class":568},[493,695,633],{"class":596},[493,697,698],{"class":508},"admin@domain.com",[493,700,701],{"class":596},"\"\n",[493,703,705],{"class":495,"line":704},9,[493,706,707],{"class":568},"    }\n",[493,709,711],{"class":495,"line":710},10,[493,712,713],{"class":568},"  }\n",[493,715,717],{"class":495,"line":716},11,[493,718,719],{"class":568},"]\n",[457,721,725],{"className":722,"code":723,"language":724,"meta":314,"style":314},"language-yaml shiki shiki-themes vitesse-light","# prometheus.yml\nscrape_configs:\n  - job_name: \"node_exporter\"\n    file_sd_configs:\n      - files:\n          - \u002Fetc\u002Fprometheus\u002Ftargets\u002F*.json\n        refresh_interval: 30s\n","yaml",[196,726,727,732,740,757,764,774,782],{"__ignoreMap":314},[493,728,729],{"class":495,"line":12},[493,730,731],{"class":498},"# prometheus.yml\n",[493,733,734,737],{"class":495,"line":21},[493,735,736],{"class":583},"scrape_configs",[493,738,739],{"class":568},":\n",[493,741,742,745,748,750,752,755],{"class":495,"line":30},[493,743,744],{"class":568},"  -",[493,746,747],{"class":583}," 
job_name",[493,749,590],{"class":568},[493,751,633],{"class":596},[493,753,754],{"class":508},"node_exporter",[493,756,701],{"class":596},[493,758,759,762],{"class":495,"line":39},[493,760,761],{"class":583},"    file_sd_configs",[493,763,739],{"class":568},[493,765,766,769,772],{"class":495,"line":48},[493,767,768],{"class":568},"      -",[493,770,771],{"class":583}," files",[493,773,739],{"class":568},[493,775,776,779],{"class":495,"line":57},[493,777,778],{"class":568},"          -",[493,780,781],{"class":508}," \u002Fetc\u002Fprometheus\u002Ftargets\u002F*.json\n",[493,783,784,787,789],{"class":495,"line":664},[493,785,786],{"class":583},"        refresh_interval",[493,788,590],{"class":568},[493,790,791],{"class":508}," 30s\n",[71,793,794,795,798,799,803],{},"Drop a file in, get a monitored host within 30 seconds. No reload required. The labels on each target flow through to every metric scraped from that host — which means they're available in alert annotations, in Grafana, everywhere. When ",[196,796,797],{},"HostDown"," fires, the alert can say ",[800,801,802],"em",{},"which"," host, in which environment, and who to contact. That's the payoff.",[92,805],{},[95,807,809,810,812],{"id":808},"the-up-metric","The ",[196,811,296],{}," Metric",[71,814,815,816,818],{},"One of Prometheus's built-in synthetic metrics is ",[196,817,296],{},". For every scrape target:",[820,821,822,828],"ul",{},[106,823,824,827],{},[196,825,826],{},"up = 1"," — scrape succeeded",[106,829,830,833],{},[196,831,832],{},"up = 0"," — scrape failed",[71,835,836,837,839],{},"This is the most fundamental health signal in the stack. Everything else — CPU, memory, disk — is meaningless if you can't even reach the host. 
And because ",[196,838,296],{}," carries all the labels from your target file, you can immediately see which host is down, in which environment.",[457,841,845],{"className":842,"code":843,"language":844,"meta":314,"style":314},"language-promql shiki shiki-themes vitesse-light","# All down hosts right now\nup{job=\"node_exporter\"} == 0\n","promql",[196,846,847,852],{"__ignoreMap":314},[493,848,849],{"class":495,"line":12},[493,850,851],{},"# All down hosts right now\n",[493,853,854],{"class":495,"line":21},[493,855,856],{},"up{job=\"node_exporter\"} == 0\n",[71,858,859,860,862],{},"I keep coming back to ",[196,861,296],{}," throughout this series because it's also where things can silently break if you change the architecture carelessly. More on that in Part 2.",[92,864],{},[95,866,868],{"id":867},"dashboards","Dashboards",[71,870,871],{},"Grafana connects to Prometheus as a data source and queries it via PromQL. The community dashboards are easy to import and useful for getting started, but building your own is worth doing because it forces you to understand exactly what you're looking at.",[71,873,874],{},"The core panels and the queries behind them:",[71,876,877],{},[74,878,879],{},"CPU Usage (%)",[457,881,883],{"className":842,"code":882,"language":844,"meta":314,"style":314},"100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)\n",[196,884,885],{"__ignoreMap":314},[493,886,887],{"class":495,"line":12},[493,888,882],{},[71,890,891],{},[74,892,893],{},"Memory Usage (%)",[457,895,897],{"className":842,"code":896,"language":844,"meta":314,"style":314},"(1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100\n",[196,898,899],{"__ignoreMap":314},[493,900,901],{"class":495,"line":12},[493,902,896],{},[71,904,905,908,909,912],{},[74,906,907],{},"Disk Usage (%)"," — the ",[196,910,911],{},"fstype"," filter excludes Docker overlays and tmpfs mounts that inflate 
results",[457,914,916],{"className":842,"code":915,"language":844,"meta":314,"style":314},"(1 - (\n  node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay\"} \u002F\n  node_filesystem_size_bytes{fstype!~\"tmpfs|overlay\"}\n)) * 100\n",[196,917,918,923,928,933],{"__ignoreMap":314},[493,919,920],{"class":495,"line":12},[493,921,922],{},"(1 - (\n",[493,924,925],{"class":495,"line":21},[493,926,927],{},"  node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay\"} \u002F\n",[493,929,930],{"class":495,"line":30},[493,931,932],{},"  node_filesystem_size_bytes{fstype!~\"tmpfs|overlay\"}\n",[493,934,935],{"class":495,"line":39},[493,936,937],{},")) * 100\n",[71,939,940,943],{},[74,941,942],{},"Fleet status"," — a stat panel showing every host's current state",[457,945,947],{"className":842,"code":946,"language":844,"meta":314,"style":314},"up{job=\"node_exporter\"}\n",[196,948,949],{"__ignoreMap":314},[493,950,951],{"class":495,"line":12},[493,952,946],{},[71,954,955,956,959,960,963],{},"Value mappings: ",[196,957,958],{},"1"," → 🟢 UP, ",[196,961,962],{},"0"," → 🔴 DOWN.",[71,965,966,967,970,971,974],{},"Adding a dashboard variable for ",[196,968,969],{},"instance"," — ",[196,972,973],{},"label_values(up{job=\"node_exporter\"}, instance)"," — gives you a dropdown to filter to a specific host or view the whole fleet. That one change makes the dashboard genuinely useful for day-to-day operations.",[92,976],{},[95,978,980],{"id":979},"alerting","Alerting",[71,982,983],{},"Prometheus evaluates alerting rules and forwards firing alerts to AlertManager. 
AlertManager handles the business logic: who gets notified, when, how often, and what to suppress.",[71,985,986],{},"The rules themselves live in separate YAML files:",[457,988,990],{"className":722,"code":989,"language":724,"meta":314,"style":314},"groups:\n  - name: node_exporter_alerts\n    rules:\n\n      - alert: HostDown\n        expr: up{job=\"node_exporter\"} == 0\n        for: 2m\n        labels:\n          severity: critical\n        annotations:\n          summary: \"Host {{ $labels.instance }} is down\"\n          description: >\n            {{ $labels.hostname }} has been unreachable for more than 2 minutes.\n            Maintainers: {{ $labels.maintainers }}\n\n      - alert: HighCPUUsage\n        expr: >\n          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 85\n        for: 5m\n        labels:\n          severity: warning\n        annotations:\n          summary: \"High CPU on {{ $labels.instance }}\"\n          description: >\n            CPU on {{ $labels.hostname }} has been above 85% for 5 minutes.\n            Current: {{ $value | printf \"%.1f\" }}%\n",[196,991,992,999,1011,1018,1023,1035,1045,1055,1062,1072,1079,1093,1105,1111,1117,1122,1134,1143,1149,1159,1166,1176,1183,1197,1206,1212],{"__ignoreMap":314},[493,993,994,997],{"class":495,"line":12},[493,995,996],{"class":583},"groups",[493,998,739],{"class":568},[493,1000,1001,1003,1006,1008],{"class":495,"line":21},[493,1002,744],{"class":568},[493,1004,1005],{"class":583}," name",[493,1007,590],{"class":568},[493,1009,1010],{"class":508}," node_exporter_alerts\n",[493,1012,1013,1016],{"class":495,"line":30},[493,1014,1015],{"class":583},"    rules",[493,1017,739],{"class":568},[493,1019,1020],{"class":495,"line":39},[493,1021,1022],{"emptyLinePlaceholder":328},"\n",[493,1024,1025,1027,1030,1032],{"class":495,"line":48},[493,1026,768],{"class":568},[493,1028,1029],{"class":583}," alert",[493,1031,590],{"class":568},[493,1033,1034],{"class":508}," 
HostDown\n",[493,1036,1037,1040,1042],{"class":495,"line":57},[493,1038,1039],{"class":583},"        expr",[493,1041,590],{"class":568},[493,1043,1044],{"class":508}," up{job=\"node_exporter\"} == 0\n",[493,1046,1047,1050,1052],{"class":495,"line":664},[493,1048,1049],{"class":583},"        for",[493,1051,590],{"class":568},[493,1053,1054],{"class":508}," 2m\n",[493,1056,1057,1060],{"class":495,"line":685},[493,1058,1059],{"class":583},"        labels",[493,1061,739],{"class":568},[493,1063,1064,1067,1069],{"class":495,"line":704},[493,1065,1066],{"class":583},"          severity",[493,1068,590],{"class":568},[493,1070,1071],{"class":508}," critical\n",[493,1073,1074,1077],{"class":495,"line":710},[493,1075,1076],{"class":583},"        annotations",[493,1078,739],{"class":568},[493,1080,1081,1084,1086,1088,1091],{"class":495,"line":716},[493,1082,1083],{"class":583},"          summary",[493,1085,590],{"class":568},[493,1087,633],{"class":596},[493,1089,1090],{"class":508},"Host {{ $labels.instance }} is down",[493,1092,701],{"class":596},[493,1094,1096,1099,1101],{"class":495,"line":1095},12,[493,1097,1098],{"class":583},"          description",[493,1100,590],{"class":568},[493,1102,1104],{"class":1103},"sbBg2"," >\n",[493,1106,1108],{"class":495,"line":1107},13,[493,1109,1110],{"class":508},"            {{ $labels.hostname }} has been unreachable for more than 2 minutes.\n",[493,1112,1114],{"class":495,"line":1113},14,[493,1115,1116],{"class":508},"            Maintainers: {{ $labels.maintainers }}\n",[493,1118,1120],{"class":495,"line":1119},15,[493,1121,1022],{"emptyLinePlaceholder":328},[493,1123,1125,1127,1129,1131],{"class":495,"line":1124},16,[493,1126,768],{"class":568},[493,1128,1029],{"class":583},[493,1130,590],{"class":568},[493,1132,1133],{"class":508}," 
HighCPUUsage\n",[493,1135,1137,1139,1141],{"class":495,"line":1136},17,[493,1138,1039],{"class":583},[493,1140,590],{"class":568},[493,1142,1104],{"class":1103},[493,1144,1146],{"class":495,"line":1145},18,[493,1147,1148],{"class":508},"          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 85\n",[493,1150,1152,1154,1156],{"class":495,"line":1151},19,[493,1153,1049],{"class":583},[493,1155,590],{"class":568},[493,1157,1158],{"class":508}," 5m\n",[493,1160,1162,1164],{"class":495,"line":1161},20,[493,1163,1059],{"class":583},[493,1165,739],{"class":568},[493,1167,1169,1171,1173],{"class":495,"line":1168},21,[493,1170,1066],{"class":583},[493,1172,590],{"class":568},[493,1174,1175],{"class":508}," warning\n",[493,1177,1179,1181],{"class":495,"line":1178},22,[493,1180,1076],{"class":583},[493,1182,739],{"class":568},[493,1184,1186,1188,1190,1192,1195],{"class":495,"line":1185},23,[493,1187,1083],{"class":583},[493,1189,590],{"class":568},[493,1191,633],{"class":596},[493,1193,1194],{"class":508},"High CPU on {{ $labels.instance }}",[493,1196,701],{"class":596},[493,1198,1200,1202,1204],{"class":495,"line":1199},24,[493,1201,1098],{"class":583},[493,1203,590],{"class":568},[493,1205,1104],{"class":1103},[493,1207,1209],{"class":495,"line":1208},25,[493,1210,1211],{"class":508},"            CPU on {{ $labels.hostname }} has been above 85% for 5 minutes.\n",[493,1213,1215],{"class":495,"line":1214},26,[493,1216,1217],{"class":508},"            Current: {{ $value | printf \"%.1f\" }}%\n",[71,1219,809,1220,1223,1224,1226,1227,1229,1230,1232],{},[196,1221,1222],{},"for: 2m"," on ",[196,1225,797],{}," absorbs brief network glitches. Without it, a momentary scrape failure sends an alert. 
The rich labels on the target (",[196,1228,626],{},", ",[196,1231,265],{},") show up directly in the alert annotations.",[71,1234,1235],{},"One AlertManager config worth explaining is the inhibit rule:",[457,1237,1239],{"className":722,"code":1238,"language":724,"meta":314,"style":314},"inhibit_rules:\n  - source_match:\n      alertname: \"HostDown\"\n    target_match_re:\n      alertname: \"HighCPUUsage|HighMemoryUsage|DiskSpaceLow\"\n    equal: [\"instance\"]\n",[196,1240,1241,1248,1257,1270,1277,1290],{"__ignoreMap":314},[493,1242,1243,1246],{"class":495,"line":12},[493,1244,1245],{"class":583},"inhibit_rules",[493,1247,739],{"class":568},[493,1249,1250,1252,1255],{"class":495,"line":21},[493,1251,744],{"class":568},[493,1253,1254],{"class":583}," source_match",[493,1256,739],{"class":568},[493,1258,1259,1262,1264,1266,1268],{"class":495,"line":30},[493,1260,1261],{"class":583},"      alertname",[493,1263,590],{"class":568},[493,1265,633],{"class":596},[493,1267,797],{"class":508},[493,1269,701],{"class":596},[493,1271,1272,1275],{"class":495,"line":39},[493,1273,1274],{"class":583},"    target_match_re",[493,1276,739],{"class":568},[493,1278,1279,1281,1283,1285,1288],{"class":495,"line":48},[493,1280,1261],{"class":583},[493,1282,590],{"class":568},[493,1284,633],{"class":596},[493,1286,1287],{"class":508},"HighCPUUsage|HighMemoryUsage|DiskSpaceLow",[493,1289,701],{"class":596},[493,1291,1292,1295,1297,1299,1301,1303,1305],{"class":495,"line":57},[493,1293,1294],{"class":583},"    equal",[493,1296,590],{"class":568},[493,1298,593],{"class":568},[493,1300,587],{"class":596},[493,1302,969],{"class":508},[493,1304,587],{"class":596},[493,1306,719],{"class":568},[71,1308,1309,1310,1312,1313,1316],{},"When ",[196,1311,797],{}," fires for a host, AlertManager suppresses the resource alerts named in the rule for that same host. There's no useful signal in a ",[196,1314,1315],{},"HighMemoryUsage"," alert for a machine that isn't reachable. 
Without this, a single dead host can generate a cascade of noise.",[92,1318],{},[95,1320,809,1322,1325],{"id":1321},"the-last_seen-pattern",[196,1323,1324],{},"last_seen"," Pattern",[71,1327,1328,1329,1332,1333,1335],{},"When a host disappears completely, Prometheus eventually stops having active series data for it. ",[196,1330,1331],{},"up{instance=\"...\"}"," doesn't return ",[196,1334,962],{},"; it returns nothing, because there's no scrape happening. You lose the ability to answer \"when did this thing last check in?\"",[71,1337,1338],{},"A recording rule fixes this by continuously writing a timestamp whenever a host is up:",[457,1340,1342],{"className":722,"code":1341,"language":724,"meta":314,"style":314},"groups:\n  - name: recording_rules\n    rules:\n      - record: node_last_seen_timestamp\n        expr: time() * up{job=\"node_exporter\"}\n",[196,1343,1344,1350,1361,1367,1379],{"__ignoreMap":314},[493,1345,1346,1348],{"class":495,"line":12},[493,1347,996],{"class":583},[493,1349,739],{"class":568},[493,1351,1352,1354,1356,1358],{"class":495,"line":21},[493,1353,744],{"class":568},[493,1355,1005],{"class":583},[493,1357,590],{"class":568},[493,1359,1360],{"class":508}," recording_rules\n",[493,1362,1363,1365],{"class":495,"line":30},[493,1364,1015],{"class":583},[493,1366,739],{"class":568},[493,1368,1369,1371,1374,1376],{"class":495,"line":39},[493,1370,768],{"class":568},[493,1372,1373],{"class":583}," record",[493,1375,590],{"class":568},[493,1377,1378],{"class":508}," node_last_seen_timestamp\n",[493,1380,1381,1383,1385],{"class":495,"line":48},[493,1382,1039],{"class":583},[493,1384,590],{"class":568},[493,1386,1387],{"class":508}," time() * up{job=\"node_exporter\"}\n",[71,1389,1390,1391,1394],{},"This writes the current Unix timestamp on every evaluation cycle, but only when ",[196,1392,1393],{},"up == 1",". A target that is still configured but failing its scrapes has up at 0, so the rule records 0 for that cycle rather than a timestamp. Once a host drops out of the target list entirely, the last written value persists in storage. 
In Grafana:",[457,1396,1398],{"className":842,"code":1397,"language":844,"meta":314,"style":314},"time() - node_last_seen_timestamp\n",[196,1399,1400],{"__ignoreMap":314},[493,1401,1402],{"class":495,"line":12},[493,1403,1397],{},[71,1405,1406],{},"Format as duration and you get: \"last seen 3h 22m ago.\" It's a small thing but it's become one of the most-used panels.",[92,1408],{},[95,1410,1412],{"id":1411},"where-this-left-off","Where This Left Off",[71,1414,1415],{},"By the end of the first week, the stack was functionally replacing NagiosXI for host monitoring. Prometheus scraping every host every 15 seconds, dashboards showing the fleet, AlertManager routing alerts with inhibit rules and deduplication, recording rules keeping last-seen timestamps for hosts that went dark.",[71,1417,1418],{},"But there was a question I hadn't resolved yet.",[71,1420,1421],{},"Node Exporter is a single-purpose binary — host metrics and nothing else. The moment we wanted logs or traces from these same hosts, we'd need additional agents running alongside it. And adding a host to monitoring still meant four manual steps: SSH in, install Node Exporter, write the target file, reload Prometheus.",[71,1423,1424],{},"My colleague had been working in parallel, exploring the multiple-exporter approach — a separate binary for each signal type. I'd been looking at Grafana Alloy, which promised a single agent that could handle all of it. 
We hadn't converged yet, and there were real questions about whether Alloy was ready enough to build on.",[71,1426,1427],{},"That's what Part 2 is about.",[82,1429],{"path":87},[1431,1432,1433],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .s8zF2, html code.shiki .s8zF2{--shiki-default:#A0ADA0}html pre.shiki code .sySUi, html code.shiki .sySUi{--shiki-default:#59873A}html pre.shiki code .spphp, html code.shiki .spphp{--shiki-default:#B56959}html pre.shiki code .si04Y, html code.shiki .si04Y{--shiki-default:#AB5959}html pre.shiki code .suHK_, html code.shiki .suHK_{--shiki-default:#393A34}html pre.shiki code .sEi1f, html code.shiki .sEi1f{--shiki-default:#A65E2B}html pre.shiki code .sYZai, html code.shiki .sYZai{--shiki-default:#999999}html pre.shiki code .s61at, html code.shiki .s61at{--shiki-default:#99841877}html pre.shiki code .su6XF, html code.shiki .su6XF{--shiki-default:#998418}html pre.shiki code .sSP4y, html code.shiki .sSP4y{--shiki-default:#B5695977}html pre.shiki code .sbBg2, html code.shiki .sbBg2{--shiki-default:#1E754F}",{"title":314,"searchDepth":21,"depth":21,"links":1435},[1436,1437,1438,1439,1441,1442,1443,1445],{"id":361,"depth":21,"text":362},{"id":379,"depth":21,"text":380},{"id":476,"depth":21,"text":477},{"id":808,"depth":21,"text":1440},"The up Metric",{"id":867,"depth":21,"text":868},{"id":979,"depth":21,"text":980},{"id":1321,"depth":21,"text":1444},"The last_seen Pattern",{"id":1411,"depth":21,"text":1412},"2025-01-17","How we migrated from NagiosXI to a modern open-source observability stack — and 
why getting the foundation right mattered more than I expected.",{},{"title":345,"description":1447},"blog\u002Flgtm-stack\u002Fpart-1",[334,340,335,438,337],"\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-1.png","7NeM6IsFhBh6e-n5FcTCZrpdQwLcnU1u-1RFk4aMi1A",{"id":1455,"title":1456,"body":1457,"category":323,"date":2124,"description":2125,"extension":326,"meta":2126,"navigation":328,"path":87,"readTime":2127,"seo":2128,"stem":2129,"tags":2130,"thumbnail":2131,"__hash__":2132},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-2.md","Building a Production Monitoring Stack from Scratch — Part 2: Grafana Alloy & the Push vs Pull Problem",{"type":65,"value":1458,"toc":2112},[1459,1468,1470,1472,1476,1479,1482,1485,1487,1491,1498,1501,1504,1521,1524,1596,1603,1605,1609,1615,1627,1635,1642,1648,1657,1660,1665,1670,1677,1682,1685,1694,1700,1704,1710,1717,1723,1729,1738,1825,1838,1968,1974,1976,1980,1983,1986,2005,2016,2019,2032,2035,2037,2041,2044,2047,2050,2053,2056,2058,2062,2067,2077,2086,2092,2094,2096,2099,2102,2107,2109],[68,1460,1461],{},[71,1462,1463,77,1465],{},[74,1464,76],{},[74,1466,1467],{},"Part 2 of 4",[82,1469],{"path":84},[92,1471],{},[95,1473,1475],{"id":1474},"the-open-question-from-part-1","The Open Question from Part 1",[71,1477,1478],{},"By the end of Phase 1, the stack was working. But Node Exporter is a single-purpose binary — host metrics, nothing else. The plan was always to get logs and traces into the same system, which meant we'd eventually need more agents on each host. A separate exporter for postgres metrics, another for nginx, maybe more after that. Each one is another thing to deploy, another thing to update, another thing to break in a subtly different way.",[71,1480,1481],{},"My colleague and I had been running in parallel on this. He was working through the multiple-exporter approach — the established path, a separate binary per signal type. 
I'd been looking at Grafana Alloy, which promised a single agent that could handle metrics, logs, and traces from one deployed process.",[71,1483,1484],{},"The question was whether Alloy was actually ready to build on.",[92,1486],{},[95,1488,1490],{"id":1489},"what-alloy-is","What Alloy Is",[71,1492,1493,1494,1497],{},"Grafana Alloy is Grafana Labs' open-source observability agent, positioned as the successor to Grafana Agent Flow. It's built around a pipeline model: you define sources, processors, and exporters as typed components and wire them together in ",[196,1495,1496],{},".alloy"," config files.",[71,1499,1500],{},"When I started working with it, it was still fairly new. The Agent Flow rebranding into Alloy had just stabilised, documentation was still filling in gaps, and community examples were sparse. You were going to hit sharp edges. But the direction seemed clearly right — one agent, multiple signals, explicit pipelines.",[71,1502,1503],{},"What made it compelling on paper:",[820,1505,1506,1512,1515,1518],{},[106,1507,1508,1511],{},[196,1509,1510],{},"prometheus.exporter.unix"," replicates Node Exporter's collectors without a separate binary",[106,1513,1514],{},"First-class support for OpenTelemetry receivers and exporters",[106,1516,1517],{},"Composable pipeline configs where data flow is visible and readable",[106,1519,1520],{},"Application metric endpoints (postgres, nginx, etc.) accessible as pipeline components",[71,1522,1523],{},"The config model is clean. 
Here's a simple pipeline — collect host metrics, send to Prometheus:",[457,1525,1529],{"className":1526,"code":1527,"language":1528,"meta":314,"style":314},"language-alloy shiki shiki-themes vitesse-light","prometheus.exporter.unix \"localhost\" {\n  set_collectors = [\"cpu\", \"meminfo\", \"diskstats\", \"filesystem\", \"netdev\", \"loadavg\"]\n}\n\nprometheus.scrape \"node\" {\n  targets    = prometheus.exporter.unix.localhost.targets\n  forward_to = [prometheus.remote_write.default.receiver]\n}\n\nprometheus.remote_write \"default\" {\n  endpoint {\n    url = \"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\"\n  }\n}\n","alloy",[196,1530,1531,1536,1541,1546,1550,1555,1560,1565,1569,1573,1578,1583,1588,1592],{"__ignoreMap":314},[493,1532,1533],{"class":495,"line":12},[493,1534,1535],{},"prometheus.exporter.unix \"localhost\" {\n",[493,1537,1538],{"class":495,"line":21},[493,1539,1540],{},"  set_collectors = [\"cpu\", \"meminfo\", \"diskstats\", \"filesystem\", \"netdev\", \"loadavg\"]\n",[493,1542,1543],{"class":495,"line":30},[493,1544,1545],{},"}\n",[493,1547,1548],{"class":495,"line":39},[493,1549,1022],{"emptyLinePlaceholder":328},[493,1551,1552],{"class":495,"line":48},[493,1553,1554],{},"prometheus.scrape \"node\" {\n",[493,1556,1557],{"class":495,"line":57},[493,1558,1559],{},"  targets    = prometheus.exporter.unix.localhost.targets\n",[493,1561,1562],{"class":495,"line":664},[493,1563,1564],{},"  forward_to = [prometheus.remote_write.default.receiver]\n",[493,1566,1567],{"class":495,"line":685},[493,1568,1545],{},[493,1570,1571],{"class":495,"line":704},[493,1572,1022],{"emptyLinePlaceholder":328},[493,1574,1575],{"class":495,"line":710},[493,1576,1577],{},"prometheus.remote_write \"default\" {\n",[493,1579,1580],{"class":495,"line":716},[493,1581,1582],{},"  endpoint {\n",[493,1584,1585],{"class":495,"line":1095},[493,1586,1587],{},"    url = 
\"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\"\n",[493,1589,1590],{"class":495,"line":1107},[493,1591,713],{},[493,1593,1594],{"class":495,"line":1113},[493,1595,1545],{},[71,1597,1598,1599,1602],{},"That ",[196,1600,1601],{},"remote_write"," line introduced a problem I didn't see coming.",[92,1604],{},[95,1606,1608],{"id":1607},"the-push-vs-pull-problem","The Push vs Pull Problem",[71,1610,1611,1612,1614],{},"Prometheus is a pull-based system. It owns the scrape cycle — it reaches out to targets, pulls metrics, and as a side effect of each successful scrape, generates a synthetic ",[196,1613,296],{}," metric:",[820,1616,1617,1622],{},[106,1618,1619,1621],{},[196,1620,826],{}," — scrape succeeded, host is reachable",[106,1623,1624,1626],{},[196,1625,832],{}," — scrape failed, something is wrong",[71,1628,1629,1631,1632,1634],{},[196,1630,296],{}," isn't something your application exports. Prometheus generates it internally, based on whether the HTTP request to ",[196,1633,483],{}," succeeded. The entire alerting chain from Part 1 depended on it.",[71,1636,1637,1638,1641],{},"When Alloy uses ",[196,1639,1640],{},"prometheus.remote_write",", the data flow reverses. Alloy pushes metrics to Prometheus via HTTP POST. Prometheus sits passively and receives what's sent.",[71,1643,1644,1645,1647],{},"And that means Prometheus never scrapes these hosts. So Prometheus never generates ",[196,1646,296],{}," for them.",[71,1649,1650,1651,1653,1654,1656],{},"The first time I checked the Prometheus targets page after switching to push-based Alloy, those hosts weren't in the targets list at all. Not showing ",[196,1652,832],{}," — not there at all. Prometheus had no scrape config for them; it was just receiving a stream of metrics it hadn't asked for. 
The ",[196,1655,296],{}," metric had silently disappeared, and everything built on top of it — every alert, every \"host is down\" notification — had gone with it.",[71,1658,1659],{},"This was the kind of failure that wouldn't surface immediately. The dashboards still had data. Metrics were still flowing. It would only become obvious the next time a host actually went down and nobody got paged.",[1661,1662,1664],"h3",{"id":1663},"the-workarounds-i-tried","The Workarounds I Tried",[71,1666,1667],{},[74,1668,1669],{},"Heartbeat metric from Alloy's internal health",[71,1671,1672,1673,1676],{},"Alloy exposes internal component status metrics. You can check if the pipeline is running. But this only tells you Alloy is alive on the server side — it says nothing about whether the ",[800,1674,1675],{},"host"," is reachable. A host could be completely unreachable and Alloy's own health metrics would look fine from Prometheus's perspective, because Prometheus was never reaching out to check.",[71,1678,1679],{},[74,1680,1681],{},"Staleness detection via timestamp",[71,1683,1684],{},"If a host stops pushing, its metrics go stale. You can detect this:",[457,1686,1688],{"className":842,"code":1687,"language":844,"meta":314,"style":314},"(time() - max by (instance) (timestamp(node_cpu_seconds_total))) > 120\n",[196,1689,1690],{"__ignoreMap":314},[493,1691,1692],{"class":495,"line":12},[493,1693,1687],{},[71,1695,1696,1697,1699],{},"This technically works. But it's fragile — dependent on a specific metric being present and recently written, prone to false positives from remote_write buffer lag or brief network hiccups. And it means rewriting every alert and dashboard around staleness rather than the clean binary ",[196,1698,296],{}," signal. 
It felt like building on sand.",[1661,1701,1703],{"id":1702},"the-actual-fix","The Actual Fix",[71,1705,1706,1707],{},"After long enough on the workarounds, the right answer was simpler: ",[74,1708,1709],{},"keep Prometheus pulling, just pull from Alloy's HTTP endpoint instead of a standalone Node Exporter binary.",[71,1711,1712,1713,1716],{},"Alloy exposes an HTTP API on port ",[196,1714,1715],{},"12345",". Every component that produces metrics is accessible at a path under that API:",[457,1718,1721],{"className":1719,"code":1720,"language":462},[460],"http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n",[196,1722,1720],{"__ignoreMap":314},[71,1724,1725,1726,1728],{},"This is a plain HTTP endpoint serving Prometheus text format — exactly what Node Exporter served on port 9100. Prometheus can scrape it exactly like any other target. When it does, it generates ",[196,1727,296],{},". Everything from Part 1 works without modification.",[71,1730,1731,1732,1735,1736,590],{},"The Alloy config on each host becomes simpler, not more complex — no ",[196,1733,1734],{},"prometheus.scrape",", no ",[196,1737,1640],{},[457,1739,1741],{"className":1526,"code":1740,"language":1528,"meta":314,"style":314},"prometheus.exporter.unix \"localhost\" {\n  set_collectors = [\n    \"cpu\",\n    \"meminfo\",\n    \"diskstats\",\n    \"filesystem\",\n    \"netdev\",\n    \"loadavg\",\n    \"uname\",\n    \"time\",\n    \"systemd\",\n    \"processes\",\n  ]\n}\n\n\u002F\u002F Prometheus will pull from:\n\u002F\u002F http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n",[196,1742,1743,1747,1752,1757,1762,1767,1772,1777,1782,1787,1792,1797,1802,1807,1811,1815,1820],{"__ignoreMap":314},[493,1744,1745],{"class":495,"line":12},[493,1746,1535],{},[493,1748,1749],{"class":495,"line":21},[493,1750,1751],{},"  set_collectors = 
[\n",[493,1753,1754],{"class":495,"line":30},[493,1755,1756],{},"    \"cpu\",\n",[493,1758,1759],{"class":495,"line":39},[493,1760,1761],{},"    \"meminfo\",\n",[493,1763,1764],{"class":495,"line":48},[493,1765,1766],{},"    \"diskstats\",\n",[493,1768,1769],{"class":495,"line":57},[493,1770,1771],{},"    \"filesystem\",\n",[493,1773,1774],{"class":495,"line":664},[493,1775,1776],{},"    \"netdev\",\n",[493,1778,1779],{"class":495,"line":685},[493,1780,1781],{},"    \"loadavg\",\n",[493,1783,1784],{"class":495,"line":704},[493,1785,1786],{},"    \"uname\",\n",[493,1788,1789],{"class":495,"line":710},[493,1790,1791],{},"    \"time\",\n",[493,1793,1794],{"class":495,"line":716},[493,1795,1796],{},"    \"systemd\",\n",[493,1798,1799],{"class":495,"line":1095},[493,1800,1801],{},"    \"processes\",\n",[493,1803,1804],{"class":495,"line":1107},[493,1805,1806],{},"  ]\n",[493,1808,1809],{"class":495,"line":1113},[493,1810,1545],{},[493,1812,1813],{"class":495,"line":1119},[493,1814,1022],{"emptyLinePlaceholder":328},[493,1816,1817],{"class":495,"line":1124},[493,1818,1819],{},"\u002F\u002F Prometheus will pull from:\n",[493,1821,1822],{"class":495,"line":1136},[493,1823,1824],{},"\u002F\u002F http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n",[71,1826,1827,1828,1830,1831,1834,1835,1837],{},"The target file for each Alloy host points to port ",[196,1829,1715],{}," and uses ",[196,1832,1833],{},"__metrics_path__"," — a special Prometheus label that overrides the default ",[196,1836,483],{}," scrape path — to point at the correct component endpoint:",[457,1839,1841],{"className":559,"code":1840,"language":561,"meta":314,"style":314},"[\n  {\n    \"targets\": [\"10.200.3.23:12345\"],\n    \"labels\": {\n      \"hostname\": \"cloud-network-3\",\n      \"environment\": \"production\",\n      \"maintainers\": \"admin@domain.com\",\n      \"__metrics_path__\": 
\"\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\"\n    }\n  }\n]\n",[196,1842,1843,1847,1851,1872,1884,1903,1921,1939,1956,1960,1964],{"__ignoreMap":314},[493,1844,1845],{"class":495,"line":12},[493,1846,569],{"class":568},[493,1848,1849],{"class":495,"line":21},[493,1850,574],{"class":568},[493,1852,1853,1855,1857,1859,1861,1863,1865,1868,1870],{"class":495,"line":30},[493,1854,580],{"class":579},[493,1856,584],{"class":583},[493,1858,587],{"class":579},[493,1860,590],{"class":568},[493,1862,593],{"class":568},[493,1864,587],{"class":596},[493,1866,1867],{"class":508},"10.200.3.23:12345",[493,1869,587],{"class":596},[493,1871,604],{"class":568},[493,1873,1874,1876,1878,1880,1882],{"class":495,"line":39},[493,1875,580],{"class":579},[493,1877,611],{"class":583},[493,1879,587],{"class":579},[493,1881,590],{"class":568},[493,1883,618],{"class":568},[493,1885,1886,1888,1890,1892,1894,1896,1899,1901],{"class":495,"line":48},[493,1887,623],{"class":579},[493,1889,626],{"class":583},[493,1891,587],{"class":579},[493,1893,590],{"class":568},[493,1895,633],{"class":596},[493,1897,1898],{"class":508},"cloud-network-3",[493,1900,587],{"class":596},[493,1902,641],{"class":568},[493,1904,1905,1907,1909,1911,1913,1915,1917,1919],{"class":495,"line":57},[493,1906,623],{"class":579},[493,1908,648],{"class":583},[493,1910,587],{"class":579},[493,1912,590],{"class":568},[493,1914,633],{"class":596},[493,1916,657],{"class":508},[493,1918,587],{"class":596},[493,1920,641],{"class":568},[493,1922,1923,1925,1927,1929,1931,1933,1935,1937],{"class":495,"line":664},[493,1924,623],{"class":579},[493,1926,265],{"class":583},[493,1928,587],{"class":579},[493,1930,590],{"class":568},[493,1932,633],{"class":596},[493,1934,698],{"class":508},[493,1936,587],{"class":596},[493,1938,641],{"class":568},[493,1940,1941,1943,1945,1947,1949,1951,1954],{"class":495,"line":685},[493,1942,623],{"class":579},[493,1944,1833],{"class":583},[493,1946,587],{"class":579},[
493,1948,590],{"class":568},[493,1950,633],{"class":596},[493,1952,1953],{"class":508},"\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics",[493,1955,701],{"class":596},[493,1957,1958],{"class":495,"line":704},[493,1959,707],{"class":568},[493,1961,1962],{"class":495,"line":710},[493,1963,713],{"class":568},[493,1965,1966],{"class":495,"line":716},[493,1967,719],{"class":568},[71,1969,1970,1971,1973],{},"Prometheus scrapes it, gets back standard metrics, generates ",[196,1972,826],{},", stores everything as normal. The alerting chain is intact.",[92,1975],{},[95,1977,1979],{"id":1978},"getting-application-metrics-in-the-part-that-nearly-broke-alloy-for-us","Getting Application Metrics In: The Part That Nearly Broke Alloy for Us",[71,1981,1982],{},"The unix exporter was the straightforward part. The harder part was getting application-level metrics — specifically postgres and nginx — through Alloy rather than through separate exporter binaries.",[71,1984,1985],{},"Alloy has built-in components for both. The postgres one:",[457,1987,1989],{"className":1526,"code":1988,"language":1528,"meta":314,"style":314},"prometheus.exporter.postgres \"db\" {\n  data_source_names = [\"postgresql:\u002F\u002Fuser:pass@localhost:5432\u002Fmydb?sslmode=disable\"]\n}\n",[196,1990,1991,1996,2001],{"__ignoreMap":314},[493,1992,1993],{"class":495,"line":12},[493,1994,1995],{},"prometheus.exporter.postgres \"db\" {\n",[493,1997,1998],{"class":495,"line":21},[493,1999,2000],{},"  data_source_names = [\"postgresql:\u002F\u002Fuser:pass@localhost:5432\u002Fmydb?sslmode=disable\"]\n",[493,2002,2003],{"class":495,"line":30},[493,2004,1545],{},[71,2006,2007,2008,2011,2012,2015],{},"I tried the nginx equivalent first. Alloy has ",[196,2009,2010],{},"prometheus.exporter.nginx",", which connects to nginx's ",[196,2013,2014],{},"stub_status"," endpoint and pulls metrics from it. I set it up, checked the output — nothing. No metrics, no errors, just silence. 
I spent time on it, checked the nginx config, checked the Alloy config, tried different approaches. At some point I started thinking seriously about just installing the standalone nginx exporter and giving up on Alloy for application metrics entirely.",[71,2017,2018],{},"Before doing that, I tried the postgres component instead. It worked immediately — metrics flowing through on the first attempt. That was the signal I needed. If postgres worked, nginx should work too. The problem wasn't Alloy. Something was wrong with my specific nginx setup.",[71,2020,2021,2022,2024,2025,2027,2028,2031],{},"I went back to nginx, looked more carefully at the ",[196,2023,2014],{}," configuration, and found it. The endpoint wasn't properly enabled — the nginx config had the ",[196,2026,2014],{}," block but it was only accessible from ",[196,2029,2030],{},"127.0.0.1",", and Alloy was trying to reach it in a way that wasn't matching that restriction. A small fix, and nginx metrics started flowing.",[71,2033,2034],{},"The near-ditch was worth it. Running separate exporters for every application would have meant my colleague's approach and my approach converging on the same place — a proliferation of binaries per host. The whole point of Alloy was avoiding that.",[92,2036],{},[95,2038,2040],{"id":2039},"why-alloy-over-multiple-exporters","Why Alloy Over Multiple Exporters",[71,2042,2043],{},"My colleague's multiple-exporter approach was working. It's the established path, well-documented, stable. The case for it isn't wrong.",[71,2045,2046],{},"But the case for Alloy is better for where we're going. The moment you want logs — which was always the plan — you need another agent anyway. If you're already running Node Exporter, postgres exporter, and nginx exporter, you're at three binaries per host. Adding a log agent makes four. 
Each one needs to be deployed, configured, updated, and monitored independently.",[71,2048,2049],{},"With Alloy, adding logs is another component in the same config file on the same process. Adding traces is the same. The operational footprint stays at one agent per host regardless of how many signals you're collecting.",[71,2051,2052],{},"There's also the matter of the pipeline model. When you have a single config that describes exactly what data is flowing where, debugging is straightforward. With four separate agents running independently, understanding the full picture requires checking four separate processes.",[71,2054,2055],{},"The sharp edges were real — the push vs pull problem cost me real time, the nginx issue nearly derailed the whole approach. But those were solvable problems. The structural limitation of multiple exporters — complexity that compounds as you add signals — isn't.",[92,2057],{},[95,2059,2061],{"id":2060},"what-changed-in-grafana","What Changed in Grafana",[71,2063,2064,2066],{},[196,2065,1510],{}," uses the same metric names as standalone Node Exporter — it's built on the same underlying collectors. Every PromQL query from Part 1 works unchanged.",[71,2068,2069,2070,2072,2073,2076],{},"The one real change is how ",[196,2071,296],{}," is queried. With Alloy, the job label becomes ",[196,2074,2075],{},"\"alloy\"",", and filtering is done through the richer label set on each target — environment, priority, instance — rather than anything tied to a port or exporter binary. 
For example, the fleet status panel:",[457,2078,2080],{"className":842,"code":2079,"language":844,"meta":314,"style":314},"count(up{job=\"alloy\", priority=~\"${priority}\", environment=~\"${environment}\", instance!~\"^localhost.*\", instance=~\"${instance}\"} == 1) or vector(0)\n",[196,2081,2082],{"__ignoreMap":314},[493,2083,2084],{"class":495,"line":12},[493,2085,2079],{},[71,2087,809,2088,2091],{},[196,2089,2090],{},"or vector(0)"," ensures the panel returns zero rather than no data when nothing matches — a small thing that matters when you're staring at a dashboard at 2am wondering if the query is broken or the hosts are genuinely all down.",[92,2093],{},[95,2095,1412],{"id":1411},[71,2097,2098],{},"By the end of Phase 2, the stack had a single agent per host handling metrics across system and application layers, pull-based scraping preserved so all the alerting machinery from Phase 1 still worked, and OTel ports open on the Alloy container for what came next.",[71,2100,2101],{},"The next gap was observability beyond metrics. Host CPU and memory tell you a machine is struggling; they don't tell you why, or what a request was doing when it failed. 
That meant Loki for logs and Tempo for distributed traces.",[71,2103,2104],{},[74,2105,2106],{},"Next:",[82,2108],{"path":90},[1431,2110,2111],{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sYZai, html code.shiki .sYZai{--shiki-default:#999999}html pre.shiki code .s61at, html code.shiki .s61at{--shiki-default:#99841877}html pre.shiki code .su6XF, html code.shiki .su6XF{--shiki-default:#998418}html pre.shiki code .sSP4y, html code.shiki .sSP4y{--shiki-default:#B5695977}html pre.shiki code .spphp, html code.shiki .spphp{--shiki-default:#B56959}",{"title":314,"searchDepth":21,"depth":21,"links":2113},[2114,2115,2116,2120,2121,2122,2123],{"id":1474,"depth":21,"text":1475},{"id":1489,"depth":21,"text":1490},{"id":1607,"depth":21,"text":1608,"children":2117},[2118,2119],{"id":1663,"depth":30,"text":1664},{"id":1702,"depth":30,"text":1703},{"id":1978,"depth":21,"text":1979},{"id":2039,"depth":21,"text":2040},{"id":2060,"depth":21,"text":2061},{"id":1411,"depth":21,"text":1412},"2026-05-18","Why we replaced individual exporters with Grafana Alloy, why push-based metrics silently broke our alerting, and what it took to figure that out.",{},"11 
min",{"title":1456,"description":2125},"blog\u002Flgtm-stack\u002Fpart-2",[334,340,336,335,337],"\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-2.png","5rTJAJ6R9qZvwa3_G-JIs2egK0cngWF_IBU66dyKnZk",{"id":2134,"title":2135,"body":2136,"category":323,"date":2744,"description":2745,"extension":326,"meta":2746,"navigation":328,"path":90,"readTime":330,"seo":2747,"stem":2748,"tags":2749,"thumbnail":2751,"__hash__":2752},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-3.md","Building a Production Monitoring Stack from Scratch — Part 3: Loki, Tempo & the Full Observability Picture",{"type":65,"value":2137,"toc":2734},[2138,2147,2149,2151,2153,2157,2160,2163,2165,2169,2172,2178,2184,2193,2199,2201,2205,2208,2222,2342,2345,2347,2351,2354,2361,2372,2375,2381,2384,2395,2397,2400,2403,2475,2478,2481,2574,2576,2580,2583,2610,2613,2616,2664,2674,2677,2679,2683,2686,2703,2706,2712,2715,2717,2719,2722,2725,2729,2731],[68,2139,2140],{},[71,2141,2142,77,2144],{},[74,2143,76],{},[74,2145,2146],{},"Part 3 of 4",[82,2148],{"path":84},[82,2150],{"path":87},[92,2152],{},[95,2154,2156],{"id":2155},"what-was-still-missing","What Was Still Missing",[71,2158,2159],{},"After Parts 1 and 2, host health was solid. Prometheus pulling from Alloy on every node, dashboards showing the fleet, AlertManager firing when something went wrong. But all of that is infrastructure-level visibility — CPU spiking, disk filling, host going dark. It tells you a machine is struggling. It doesn't tell you what was happening inside your applications when it did.",[71,2161,2162],{},"For that you need logs and traces. 
This part covers adding both — Loki for log aggregation and Tempo for distributed tracing — and how the central Alloy instance on mon-node-a ties all three signal types together.",[92,2164],{},[95,2166,2168],{"id":2167},"two-alloy-roles","Two Alloy Roles",[71,2170,2171],{},"Before getting into Loki and Tempo, it's worth being clear about something that can cause confusion: there are two distinct Alloy deployments in this setup and they do completely different things.",[71,2173,2174,2177],{},[74,2175,2176],{},"Alloy on each enrolled node"," — runs the unix exporter, exposes host metrics on port 12345, gets scraped by Prometheus. This is the pull-based setup from Part 2. Nothing changes here.",[71,2179,2180,2183],{},[74,2181,2182],{},"Central Alloy on mon-node-a"," — runs as a container alongside Prometheus, Loki, and Tempo. Opens OTel endpoints on ports 4317 (gRPC) and 4318 (HTTP). Any instrumented application sends its telemetry here, and Alloy routes each signal type to the right backend.",[71,2185,2186,2187,2189,2190,2192],{},"The separation is clean: node Alloy handles infrastructure signals via pull, central Alloy handles application signals via OTel push. Prometheus only scrapes the node Alloys. 
The ",[196,2188,296],{}," metric concern from Part 2 doesn't apply here — we're not relying on ",[196,2191,296],{}," for application health, only for host availability.",[457,2194,2197],{"className":2195,"code":2196,"language":462},[460],"[ Enrolled Nodes ]\n      |\n  Alloy :12345  (one per node, host metrics)\n      |\n      ↓  PULL\n[ Prometheus ]  →  AlertManager\n      ↑\n      |  remote_write (application metrics only)\n      |\n[ Central Alloy :4317\u002F:4318 ]  ←  instrumented applications (OTel)\n      |\n      ├──→ Tempo      (traces)\n      └──→ Loki       (logs)\n\n[ Grafana ]  ←── queries Prometheus, Loki, Tempo\n",[196,2198,2196],{"__ignoreMap":314},[92,2200],{},[95,2202,2204],{"id":2203},"storage-minio","Storage: MinIO",[71,2206,2207],{},"Both Loki and Tempo need a durable storage backend. In a cloud environment that would be S3. Here, MinIO provides an S3-compatible store running as a container on mon-node-a.",[71,2209,2210,2211,1229,2214,2217,2218,2221],{},"Three buckets: ",[196,2212,2213],{},"loki-data",[196,2215,2216],{},"loki-ruler",", and ",[196,2219,2220],{},"tempo",". 
The entrypoint script pre-creates the directories before MinIO starts — a small thing that saves a confusing startup failure on first run.",[457,2223,2225],{"className":722,"code":2224,"language":724,"meta":314,"style":314},"minio:\n  image: minio\u002Fminio:latest\n  environment:\n    - MINIO_ACCESS_KEY=observability\n    - MINIO_SECRET_KEY=supersecret\n  entrypoint:\n    - sh\n    - -euc\n    - |\n      mkdir -p \u002Fdata\u002Ftempo\n      mkdir -p \u002Fdata\u002Floki-data\n      mkdir -p \u002Fdata\u002Floki-ruler\n      minio server \u002Fdata --console-address ':9001'\n  networks:\n    - monitoring\n  volumes:\n    - .\u002Fdata\u002Fminio:\u002Fdata\n",[196,2226,2227,2234,2244,2251,2259,2266,2273,2280,2287,2294,2299,2304,2309,2314,2321,2328,2335],{"__ignoreMap":314},[493,2228,2229,2232],{"class":495,"line":12},[493,2230,2231],{"class":583},"minio",[493,2233,739],{"class":568},[493,2235,2236,2239,2241],{"class":495,"line":21},[493,2237,2238],{"class":583},"  image",[493,2240,590],{"class":568},[493,2242,2243],{"class":508}," minio\u002Fminio:latest\n",[493,2245,2246,2249],{"class":495,"line":30},[493,2247,2248],{"class":583},"  environment",[493,2250,739],{"class":568},[493,2252,2253,2256],{"class":495,"line":39},[493,2254,2255],{"class":568},"    -",[493,2257,2258],{"class":508}," MINIO_ACCESS_KEY=observability\n",[493,2260,2261,2263],{"class":495,"line":48},[493,2262,2255],{"class":568},[493,2264,2265],{"class":508}," MINIO_SECRET_KEY=supersecret\n",[493,2267,2268,2271],{"class":495,"line":57},[493,2269,2270],{"class":583},"  entrypoint",[493,2272,739],{"class":568},[493,2274,2275,2277],{"class":495,"line":664},[493,2276,2255],{"class":568},[493,2278,2279],{"class":508}," sh\n",[493,2281,2282,2284],{"class":495,"line":685},[493,2283,2255],{"class":568},[493,2285,2286],{"class":508}," -euc\n",[493,2288,2289,2291],{"class":495,"line":704},[493,2290,2255],{"class":568},[493,2292,2293],{"class":1103}," 
|\n",[493,2295,2296],{"class":495,"line":710},[493,2297,2298],{"class":508},"      mkdir -p \u002Fdata\u002Ftempo\n",[493,2300,2301],{"class":495,"line":716},[493,2302,2303],{"class":508},"      mkdir -p \u002Fdata\u002Floki-data\n",[493,2305,2306],{"class":495,"line":1095},[493,2307,2308],{"class":508},"      mkdir -p \u002Fdata\u002Floki-ruler\n",[493,2310,2311],{"class":495,"line":1107},[493,2312,2313],{"class":508},"      minio server \u002Fdata --console-address ':9001'\n",[493,2315,2316,2319],{"class":495,"line":1113},[493,2317,2318],{"class":583},"  networks",[493,2320,739],{"class":568},[493,2322,2323,2325],{"class":495,"line":1119},[493,2324,2255],{"class":568},[493,2326,2327],{"class":508}," monitoring\n",[493,2329,2330,2333],{"class":495,"line":1124},[493,2331,2332],{"class":583},"  volumes",[493,2334,739],{"class":568},[493,2336,2337,2339],{"class":495,"line":1136},[493,2338,2255],{"class":568},[493,2340,2341],{"class":508}," .\u002Fdata\u002Fminio:\u002Fdata\n",[71,2343,2344],{},"The MinIO web console on port 9001 is useful when first bringing things up — you can watch objects appearing in the buckets and confirm that Loki and Tempo are actually flushing data rather than buffering it indefinitely.",[92,2346],{},[95,2348,2350],{"id":2349},"loki","Loki",[71,2352,2353],{},"Loki runs in microservices mode with read, write, and backend roles as separate containers, each with three replicas. The read and write paths scale independently, which matters as log volume grows.",[71,2355,2356,2357,2360],{},"A ",[196,2358,2359],{},"loki-init"," container runs first to set correct directory ownership — Loki processes run as UID 10001 and the volume mount needs to reflect that before anything starts.",[71,2362,2363,2364,2367,2368,2371],{},"All external traffic goes through an nginx gateway in front of the cluster. Central Alloy pushes logs to ",[196,2365,2366],{},"http:\u002F\u002Floki-gateway:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush",". 
Grafana queries ",[196,2369,2370],{},"http:\u002F\u002Floki-gateway:3100",". Neither needs to know which replica handles a given request.",[71,2373,2374],{},"A few config decisions worth noting:",[71,2376,2377,2380],{},[196,2378,2379],{},"s3forcepathstyle: true"," is required when talking to MinIO — it uses path-style URLs rather than the virtual-hosted style AWS uses, and without this flag nothing stores correctly.",[71,2382,2383],{},"Replication factor 3 means each chunk is written to all three write replicas. Since they all back onto the same MinIO instance this is about write redundancy rather than independent storage — but it means the cluster survives a replica restart without data loss in the WAL.",[71,2385,2386,2387,2390,2391,2394],{},"The three component types discover each other via memberlist gossip on port 7946, joining by container name. Getting the ",[196,2388,2389],{},"join_members"," list right — ",[196,2392,2393],{},"[\"loki-read\", \"loki-write\", \"loki-backend\"]"," — is what brings the cluster together.",[92,2396],{},[95,2398,2399],{"id":2220},"Tempo",[71,2401,2402],{},"Tempo also runs in microservices mode. 
The components and what each does:",[385,2404,2405,2413],{},[388,2406,2407],{},[391,2408,2409,2411],{},[394,2410,396],{},[394,2412,399],{},[401,2414,2415,2425,2435,2445,2455,2465],{},[391,2416,2417,2422],{},[406,2418,2419],{},[196,2420,2421],{},"tempo-distributor",[406,2423,2424],{},"Receives traces from Alloy, routes to ingesters",[391,2426,2427,2432],{},[406,2428,2429],{},[196,2430,2431],{},"tempo-ingester-0\u002F1\u002F2",[406,2433,2434],{},"Buffers traces in memory, flushes to MinIO",[391,2436,2437,2442],{},[406,2438,2439],{},[196,2440,2441],{},"tempo-query-frontend",[406,2443,2444],{},"Entry point for Grafana queries",[391,2446,2447,2452],{},[406,2448,2449],{},[196,2450,2451],{},"tempo-querier",[406,2453,2454],{},"Executes queries against ingesters and object storage",[391,2456,2457,2462],{},[406,2458,2459],{},[196,2460,2461],{},"tempo-compactor",[406,2463,2464],{},"Merges and compacts trace blocks",[391,2466,2467,2472],{},[406,2468,2469],{},[196,2470,2471],{},"tempo-metrics-generator",[406,2473,2474],{},"Derives RED metrics from trace data, writes to Prometheus",[71,2476,2477],{},"The metrics generator is worth understanding. It reads incoming traces and derives standard RED metrics — Rate, Errors, Duration — then writes them back to Prometheus via remote_write. The practical effect is that you get service-level dashboards showing request rates, error rates, and latency percentiles automatically from trace data, without any additional metric instrumentation in your applications. 
The traces are the source of truth; Tempo does the calculation.",[71,2479,2480],{},"It also builds a service dependency graph from trace data that Grafana can render as an interactive topology map — which services call which, with live latency and error rates on each edge.",[457,2482,2484],{"className":722,"code":2483,"language":724,"meta":314,"style":314},"metrics_generator:\n  storage:\n    remote_write:\n      - url: http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\n        send_exemplars: true\n  processor:\n    service_graphs:\n      wait: 10s\n      max_items: 10000\n      workers: 10\n",[196,2485,2486,2493,2500,2507,2519,2529,2536,2543,2553,2564],{"__ignoreMap":314},[493,2487,2488,2491],{"class":495,"line":12},[493,2489,2490],{"class":583},"metrics_generator",[493,2492,739],{"class":568},[493,2494,2495,2498],{"class":495,"line":21},[493,2496,2497],{"class":583},"  storage",[493,2499,739],{"class":568},[493,2501,2502,2505],{"class":495,"line":30},[493,2503,2504],{"class":583},"    remote_write",[493,2506,739],{"class":568},[493,2508,2509,2511,2514,2516],{"class":495,"line":39},[493,2510,768],{"class":568},[493,2512,2513],{"class":583}," url",[493,2515,590],{"class":568},[493,2517,2518],{"class":508}," http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\n",[493,2520,2521,2524,2526],{"class":495,"line":48},[493,2522,2523],{"class":583},"        send_exemplars",[493,2525,590],{"class":568},[493,2527,2528],{"class":1103}," true\n",[493,2530,2531,2534],{"class":495,"line":57},[493,2532,2533],{"class":583},"  processor",[493,2535,739],{"class":568},[493,2537,2538,2541],{"class":495,"line":664},[493,2539,2540],{"class":583},"    service_graphs",[493,2542,739],{"class":568},[493,2544,2545,2548,2550],{"class":495,"line":685},[493,2546,2547],{"class":583},"      wait",[493,2549,590],{"class":568},[493,2551,2552],{"class":508}," 10s\n",[493,2554,2555,2558,2560],{"class":495,"line":704},[493,2556,2557],{"class":583},"      
max_items",[493,2559,590],{"class":568},[493,2561,2563],{"class":2562},"s-TwI"," 10000\n",[493,2565,2566,2569,2571],{"class":495,"line":710},[493,2567,2568],{"class":583},"      workers",[493,2570,590],{"class":568},[493,2572,2573],{"class":2562}," 10\n",[92,2575],{},[95,2577,2579],{"id":2578},"getting-application-signals-in","Getting Application Signals In",[71,2581,2582],{},"From an application's perspective, the integration is a single environment variable:",[457,2584,2586],{"className":487,"code":2585,"language":489,"meta":314,"style":314},"OTEL_EXPORTER_OTLP_ENDPOINT=http:\u002F\u002Fmon-node-a:4317\nOTEL_SERVICE_NAME=my-service\n",[196,2587,2588,2600],{"__ignoreMap":314},[493,2589,2590,2594,2597],{"class":495,"line":12},[493,2591,2593],{"class":2592},"svycV","OTEL_EXPORTER_OTLP_ENDPOINT",[493,2595,2596],{"class":568},"=",[493,2598,2599],{"class":508},"http:\u002F\u002Fmon-node-a:4317\n",[493,2601,2602,2605,2607],{"class":495,"line":21},[493,2603,2604],{"class":2592},"OTEL_SERVICE_NAME",[493,2606,2596],{"class":568},[493,2608,2609],{"class":508},"my-service\n",[71,2611,2612],{},"The OTel SDK handles the rest. 
Traces, logs, and metrics all go to the same endpoint and Alloy sorts them.",[71,2614,2615],{},"The central Alloy config receives all three signal types through one receiver and routes each to its backend:",[457,2617,2619],{"className":1526,"code":2618,"language":1528,"meta":314,"style":314},"otelcol.receiver.otlp \"otlp_receiver\" {\n  grpc { endpoint = \"0.0.0.0:4317\" }\n  http { endpoint = \"0.0.0.0:4318\" }\n  output {\n    traces  = [otelcol.processor.batch.default.input]\n    logs    = [otelcol.processor.batch.default.input]\n    metrics = [otelcol.processor.batch.default.input]\n  }\n}\n",[196,2620,2621,2626,2631,2636,2641,2646,2651,2656,2660],{"__ignoreMap":314},[493,2622,2623],{"class":495,"line":12},[493,2624,2625],{},"otelcol.receiver.otlp \"otlp_receiver\" {\n",[493,2627,2628],{"class":495,"line":21},[493,2629,2630],{},"  grpc { endpoint = \"0.0.0.0:4317\" }\n",[493,2632,2633],{"class":495,"line":30},[493,2634,2635],{},"  http { endpoint = \"0.0.0.0:4318\" }\n",[493,2637,2638],{"class":495,"line":39},[493,2639,2640],{},"  output {\n",[493,2642,2643],{"class":495,"line":48},[493,2644,2645],{},"    traces  = [otelcol.processor.batch.default.input]\n",[493,2647,2648],{"class":495,"line":57},[493,2649,2650],{},"    logs    = [otelcol.processor.batch.default.input]\n",[493,2652,2653],{"class":495,"line":664},[493,2654,2655],{},"    metrics = [otelcol.processor.batch.default.input]\n",[493,2657,2658],{"class":495,"line":685},[493,2659,713],{},[493,2661,2662],{"class":495,"line":704},[493,2663,1545],{},[71,2665,2666,2667,2670,2671,2673],{},"After batching, signals split to their respective exporters: traces to the Tempo distributor via OTLP, logs to Loki via ",[196,2668,2669],{},"loki.write",", application metrics to Prometheus via ",[196,2672,1601],{},".",[71,2675,2676],{},"The OTel to Alloy to Tempo path worked on the first proper attempt — the pipeline model makes the data flow explicit enough that when something isn't arriving where you expect it, it's 
usually obvious which component in the chain is the problem.",[92,2678],{},[95,2680,2682],{"id":2681},"connecting-everything-in-grafana","Connecting Everything in Grafana",[71,2684,2685],{},"Three data sources on mon-node-b:",[820,2687,2688,2693,2698],{},[106,2689,2690,2692],{},[74,2691,335],{}," — host metrics, application metrics, and the RED metrics Tempo generates",[106,2694,2695,2697],{},[74,2696,2350],{}," — application logs",[106,2699,2700,2702],{},[74,2701,2399],{}," — distributed traces",[71,2704,2705],{},"The part that makes these three genuinely useful together rather than just three separate views is derived fields in Loki. Any log line containing a trace ID becomes a clickable link to that trace in Tempo:",[457,2707,2710],{"className":2708,"code":2709,"language":462},[460],"Field name: traceId\nRegex: traceId=(\\w+)\nInternal link: Tempo → ${__value.raw}\n",[196,2711,2709],{"__ignoreMap":314},[71,2713,2714],{},"From a trace in Tempo you can navigate back to the Loki logs for that service in the same time window. The three signals become navigable together rather than three separate places to look.",[92,2716],{},[95,2718,1412],{"id":1411},[71,2720,2721],{},"The stack now covers all three observability pillars. Host health and availability through Prometheus and Alloy on each node, unchanged from Part 2. Application logs through Loki. Distributed traces through Tempo, with RED metrics derived automatically. All queryable from Grafana with the signals linked to each other.",[71,2723,2724],{},"What was still manual: enrolling a new host still meant SSHing in, installing Alloy, writing its config, creating a target file, and reloading Prometheus. 
That friction was the last remaining operational problem — and fixing it turned into something bigger than just a script.",[71,2726,2727],{},[74,2728,2106],{},[82,2730],{"path":329},[1431,2732,2733],{},"html pre.shiki code .su6XF, html code.shiki .su6XF{--shiki-default:#998418}html pre.shiki code .sYZai, html code.shiki .sYZai{--shiki-default:#999999}html pre.shiki code .spphp, html code.shiki .spphp{--shiki-default:#B56959}html pre.shiki code .sbBg2, html code.shiki .sbBg2{--shiki-default:#1E754F}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .s-TwI, html code.shiki .s-TwI{--shiki-default:#2F798A}html pre.shiki code .svycV, html code.shiki .svycV{--shiki-default:#B07D48}",{"title":314,"searchDepth":21,"depth":21,"links":2735},[2736,2737,2738,2739,2740,2741,2742,2743],{"id":2155,"depth":21,"text":2156},{"id":2167,"depth":21,"text":2168},{"id":2203,"depth":21,"text":2204},{"id":2349,"depth":21,"text":2350},{"id":2220,"depth":21,"text":2399},{"id":2578,"depth":21,"text":2579},{"id":2681,"depth":21,"text":2682},{"id":1411,"depth":21,"text":1412},"2026-05-19","Adding log aggregation with Loki and distributed tracing with Tempo — completing the metrics, logs, and traces picture.",{},{"title":2135,"description":2745},"blog\u002Flgtm-stack\u002Fpart-3",[334,2350,2399,340,336,2750,337],"OpenTelemetry","\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-3.png","9s48r2qM_PDpsOglDZPADwXDaUQsdhpZExXgq7Db3Q0",1780657374510]