[{"data":1,"prerenderedAt":1877},["ShallowReactive",2],{"nav-stories":3,"blog-lgtm-stack\u002Fpart-1":61,"ref-\u002Fblog\u002Flgtm-stack\u002Fpart-2":1196},[4,16,25,34,43,52],{"id":5,"color":6,"extension":7,"image":8,"label":9,"link":10,"meta":11,"order":12,"stem":13,"text":14,"__hash__":15},"stories\u002Fstories\u002F01-data-center.yml",null,"yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1558494949-ef010cbdcc31?w=1080","DATA_CENTER","https:\u002F\u002Fx.com\u002Fabbeytetteh_",{},1,"stories\u002F01-data-center","Racking new servers. 40gbit backbone online.","0QUZQbaANhdO8WemZxkDdO7vbVopfnynHtH9FxBZb_w",{"id":17,"color":6,"extension":7,"image":18,"label":19,"link":6,"meta":20,"order":21,"stem":22,"text":23,"__hash__":24},"stories\u002Fstories\u002F02-thoughts.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1498050108023-c5249f4df085?w=1080","THOUGHTS",{},2,"stories\u002F02-thoughts","Late night bug hunting. Found the memory leak.","Gd1am954aasY6HRHD7hCtOuessXb6zYZ8iizS501ICg",{"id":26,"color":27,"extension":7,"image":6,"label":28,"link":6,"meta":29,"order":30,"stem":31,"text":32,"__hash__":33},"stories\u002Fstories\u002F03-coding.yml","#3b82f6","CODING",{},3,"stories\u002F03-coding","Just thinking about how much easier life is with Swarm. https:\u002F\u002Fgoogle.com","-WTk-47jnLM-TZRWBg0VbJyZJfIM7FpQ5HGbc8LEdhQ",{"id":35,"color":6,"extension":7,"image":36,"label":37,"link":6,"meta":38,"order":39,"stem":40,"text":41,"__hash__":42},"stories\u002Fstories\u002F04-update.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1591799264318-7e6ef8ddb7ea?w=1080","UPDATE",{},4,"stories\u002F04-update","New cluster nodes arrived. 
Prepping for installation.","kyT60N5C6Re_jMonZbgNy0PbQhzXmUWxDbD0D_v43ts",{"id":44,"color":45,"extension":7,"image":6,"label":46,"link":6,"meta":47,"order":48,"stem":49,"text":50,"__hash__":51},"stories\u002Fstories\u002F05-setup.yml","#86868b","SETUP",{},5,"stories\u002F05-setup","Optimizing the telemetry pipeline for 1M req\u002Fs.","cPOBkzoyXsCmPgRO2d80Hj3vm4MP-6nAejtlQ5iuSzw",{"id":53,"color":6,"extension":7,"image":54,"label":55,"link":6,"meta":56,"order":57,"stem":58,"text":59,"__hash__":60},"stories\u002Fstories\u002F06-travel.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1560969184-10fe8719e047?w=1080","TRAVEL",{},6,"stories\u002F06-travel","Travel log — system architecture workshop in Berlin.","jnOxerdF6usAIHdR35Z-opx0LJAy9kZluXnZhtz62Z0",{"id":62,"title":63,"body":64,"category":1182,"date":1183,"description":1184,"extension":1185,"meta":1186,"navigation":755,"path":1187,"readTime":1188,"seo":1189,"stem":1190,"tags":1191,"thumbnail":1194,"__hash__":1195},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-1.md","Building a Production Monitoring Stack from Scratch — Part 1: Prometheus, Grafana, Node Exporter & AlertManager",{"type":65,"value":66,"toc":1170},"minimark",[67,81,84,89,92,95,98,101,103,107,110,170,181,184,195,202,204,208,215,266,277,280,287,451,523,535,537,545,551,567,573,590,596,598,602,605,608,613,622,627,636,646,671,677,686,697,708,710,714,717,720,952,967,970,1041,1051,1053,1060,1070,1073,1122,1129,1138,1141,1143,1147,1150,1153,1156,1159,1162,1166],[68,69,70],"blockquote",{},[71,72,73,77,78],"p",{},[74,75,76],"strong",{},"Series:"," From NagiosXI to a Modern Observability Stack\n",[74,79,80],{},"Part 1 of 4",[82,83],"hr",{},[85,86,88],"h2",{"id":87},"the-problem-with-nagiosxi","The Problem with NagiosXI",[71,90,91],{},"We had been running NagiosXI for a while. It worked, in the way that something can work while also quietly frustrating everyone who touches it. 
It checked hosts, fired alerts, and we had even wired up scripts to push notifications to Mattermost. But the gaps were real and getting harder to ignore.",[71,93,94],{},"It was a paid solution running on our own infrastructure — a licensing cost that got harder to justify every time someone asked for something it couldn't do. OpenTelemetry support was essentially nonexistent. Application log aggregation wasn't on the table at all. Every extension we had made through plugins had taken us about as far as plugins could go.",[71,96,97],{},"The conversation about replacing it had been happening for a while. Eventually it stopped being a conversation and became a project. The goal: a full open-source replacement covering host metrics, alerting, log aggregation, and eventually distributed tracing. One cohesive system instead of a patchwork.",[71,99,100],{},"I took on the work. Phase 1 was about standing up the foundation and proving it could actually replace what NagiosXI was doing before we went further.",[82,102],{},[85,104,106],{"id":105},"the-starting-point","The Starting Point",[71,108,109],{},"The first week was spent getting four things working together:",[111,112,113,126],"table",{},[114,115,116],"thead",{},[117,118,119,123],"tr",{},[120,121,122],"th",{},"Component",[120,124,125],{},"Role",[127,128,129,140,150,160],"tbody",{},[117,130,131,137],{},[132,133,134],"td",{},[74,135,136],{},"Prometheus",[132,138,139],{},"Time-series metrics database and scraping engine",[117,141,142,147],{},[132,143,144],{},[74,145,146],{},"Grafana",[132,148,149],{},"Visualization and dashboarding",[117,151,152,157],{},[132,153,154],{},[74,155,156],{},"Node Exporter",[132,158,159],{},"Host-level metrics (CPU, memory, disk, network)",[117,161,162,167],{},[132,163,164],{},[74,165,166],{},"AlertManager",[132,168,169],{},"Alert routing, grouping, and silencing",[71,171,172,173,176,177,180],{},"The deployment runs across two nodes — ",[74,174,175],{},"mon-node-a"," for data collection 
(Prometheus, AlertManager, and agent-side components) and ",[74,178,179],{},"mon-node-b"," for presentation (Grafana). Keeping the presentation layer separate from the data layer was a deliberate decision: if we need to update or rebuild Grafana, it doesn't touch Prometheus, and vice versa. Everything runs in Docker.",[71,182,183],{},"How these pieces talk to each other matters, because one architectural choice here — pull vs push — ended up being the central problem in Part 2.",[185,186,191],"pre",{"className":187,"code":189,"language":190},[188],"language-text","[ Linux Hosts ]\n      |\n  node_exporter  (runs on each host, exposes \u002Fmetrics on port 9100)\n      |\n      ↓  (pull — Prometheus reaches out every 15s)\n[ Prometheus ]  ←── scrape_configs + alerting_rules\n      |\n      ├──→ [ AlertManager ]\n      |           |\n      |           └──→ Email \u002F Mattermost\n      |\n[ Grafana ]  ←── queries Prometheus via PromQL\n","text",[192,193,189],"code",{"__ignoreMap":194},"",[71,196,197,198,201],{},"Prometheus is ",[74,199,200],{},"pull-based",". It reaches out to each target on a schedule and pulls metrics. The targets don't know Prometheus exists — they just expose an HTTP endpoint and wait. This distinction ends up mattering a lot.",[82,203],{},[85,205,207],{"id":206},"getting-host-metrics-in","Getting Host Metrics In",[71,209,210,211,214],{},"Node Exporter is a lightweight binary that runs on each host and exposes hardware and OS-level metrics at a ",[192,212,213],{},"\u002Fmetrics"," HTTP endpoint. 
Deploy one per machine, point Prometheus at it, done.",[185,216,220],{"className":217,"code":218,"language":219,"meta":194,"style":194},"language-bash shiki shiki-themes vitesse-light","# Verify it's running\ncurl http:\u002F\u002F\u003Chost-ip>:9100\u002Fmetrics | head -50\n","bash",[192,221,222,230],{"__ignoreMap":194},[223,224,226],"span",{"class":225,"line":12},"line",[223,227,229],{"class":228},"s8zF2","# Verify it's running\n",[223,231,232,236,240,244,247,250,253,256,259,262],{"class":225,"line":21},[223,233,235],{"class":234},"sySUi","curl",[223,237,239],{"class":238},"spphp"," http:\u002F\u002F",[223,241,243],{"class":242},"si04Y","\u003C",[223,245,246],{"class":238},"host-i",[223,248,71],{"class":249},"suHK_",[223,251,252],{"class":242},">",[223,254,255],{"class":238},":9100\u002Fmetrics",[223,257,258],{"class":242}," |",[223,260,261],{"class":234}," head",[223,263,265],{"class":264},"sEi1f"," -50\n",[71,267,268,269,272,273,276],{},"If you see ",[192,270,271],{},"# HELP"," and ",[192,274,275],{},"# TYPE"," blocks followed by metric lines, you're good.",[71,278,279],{},"Getting the metrics in wasn't the hard part. The harder part was getting them in cleanly, with enough context attached that alerts and dashboards would actually be useful. A raw IP address as the target label tells you very little when something breaks at 2am.",[71,281,282,283,286],{},"The solution was file-based service discovery with rich labels. 
Instead of listing targets directly in ",[192,284,285],{},"prometheus.yml",", Prometheus watches a directory of JSON files:",[185,288,292],{"className":289,"code":290,"language":291,"meta":194,"style":194},"language-json shiki shiki-themes vitesse-light","[\n  {\n    \"targets\": [\"192.168.0.101:9100\"],\n    \"labels\": {\n      \"hostname\": \"web-server-01\",\n      \"environment\": \"production\",\n      \"location\": \"Primary Rack\",\n      \"maintainers\": \"admin@domain.com\"\n    }\n  }\n]\n","json",[192,293,294,300,305,335,349,372,392,413,433,439,445],{"__ignoreMap":194},[223,295,296],{"class":225,"line":12},[223,297,299],{"class":298},"sYZai","[\n",[223,301,302],{"class":225,"line":21},[223,303,304],{"class":298},"  {\n",[223,306,307,311,315,318,321,324,327,330,332],{"class":225,"line":30},[223,308,310],{"class":309},"s61at","    \"",[223,312,314],{"class":313},"su6XF","targets",[223,316,317],{"class":309},"\"",[223,319,320],{"class":298},":",[223,322,323],{"class":298}," [",[223,325,317],{"class":326},"sSP4y",[223,328,329],{"class":238},"192.168.0.101:9100",[223,331,317],{"class":326},[223,333,334],{"class":298},"],\n",[223,336,337,339,342,344,346],{"class":225,"line":39},[223,338,310],{"class":309},[223,340,341],{"class":313},"labels",[223,343,317],{"class":309},[223,345,320],{"class":298},[223,347,348],{"class":298}," {\n",[223,350,351,354,357,359,361,364,367,369],{"class":225,"line":48},[223,352,353],{"class":309},"      \"",[223,355,356],{"class":313},"hostname",[223,358,317],{"class":309},[223,360,320],{"class":298},[223,362,363],{"class":326}," 
\"",[223,365,366],{"class":238},"web-server-01",[223,368,317],{"class":326},[223,370,371],{"class":298},",\n",[223,373,374,376,379,381,383,385,388,390],{"class":225,"line":57},[223,375,353],{"class":309},[223,377,378],{"class":313},"environment",[223,380,317],{"class":309},[223,382,320],{"class":298},[223,384,363],{"class":326},[223,386,387],{"class":238},"production",[223,389,317],{"class":326},[223,391,371],{"class":298},[223,393,395,397,400,402,404,406,409,411],{"class":225,"line":394},7,[223,396,353],{"class":309},[223,398,399],{"class":313},"location",[223,401,317],{"class":309},[223,403,320],{"class":298},[223,405,363],{"class":326},[223,407,408],{"class":238},"Primary Rack",[223,410,317],{"class":326},[223,412,371],{"class":298},[223,414,416,418,421,423,425,427,430],{"class":225,"line":415},8,[223,417,353],{"class":309},[223,419,420],{"class":313},"maintainers",[223,422,317],{"class":309},[223,424,320],{"class":298},[223,426,363],{"class":326},[223,428,429],{"class":238},"admin@domain.com",[223,431,432],{"class":326},"\"\n",[223,434,436],{"class":225,"line":435},9,[223,437,438],{"class":298},"    }\n",[223,440,442],{"class":225,"line":441},10,[223,443,444],{"class":298},"  }\n",[223,446,448],{"class":225,"line":447},11,[223,449,450],{"class":298},"]\n",[185,452,456],{"className":453,"code":454,"language":455,"meta":194,"style":194},"language-yaml shiki shiki-themes vitesse-light","# prometheus.yml\nscrape_configs:\n  - job_name: \"node_exporter\"\n    file_sd_configs:\n      - files:\n          - \u002Fetc\u002Fprometheus\u002Ftargets\u002F*.json\n        refresh_interval: 30s\n","yaml",[192,457,458,463,471,488,495,505,513],{"__ignoreMap":194},[223,459,460],{"class":225,"line":12},[223,461,462],{"class":228},"# prometheus.yml\n",[223,464,465,468],{"class":225,"line":21},[223,466,467],{"class":313},"scrape_configs",[223,469,470],{"class":298},":\n",[223,472,473,476,479,481,483,486],{"class":225,"line":30},[223,474,475],{"class":298},"  
-",[223,477,478],{"class":313}," job_name",[223,480,320],{"class":298},[223,482,363],{"class":326},[223,484,485],{"class":238},"node_exporter",[223,487,432],{"class":326},[223,489,490,493],{"class":225,"line":39},[223,491,492],{"class":313},"    file_sd_configs",[223,494,470],{"class":298},[223,496,497,500,503],{"class":225,"line":48},[223,498,499],{"class":298},"      -",[223,501,502],{"class":313}," files",[223,504,470],{"class":298},[223,506,507,510],{"class":225,"line":57},[223,508,509],{"class":298},"          -",[223,511,512],{"class":238}," \u002Fetc\u002Fprometheus\u002Ftargets\u002F*.json\n",[223,514,515,518,520],{"class":225,"line":394},[223,516,517],{"class":313},"        refresh_interval",[223,519,320],{"class":298},[223,521,522],{"class":238}," 30s\n",[71,524,525,526,529,530,534],{},"Drop a file in, get a monitored host within 30 seconds. No reload required. The labels on each target flow through to every metric scraped from that host — which means they're available in alert annotations, in Grafana, everywhere. When ",[192,527,528],{},"HostDown"," fires, the alert can say ",[531,532,533],"em",{},"which"," host, in which environment, and who to contact. That's the payoff.",[82,536],{},[85,538,540,541,544],{"id":539},"the-up-metric","The ",[192,542,543],{},"up"," Metric",[71,546,547,548,550],{},"One of Prometheus's built-in synthetic metrics is ",[192,549,543],{},". For every scrape target:",[552,553,554,561],"ul",{},[555,556,557,560],"li",{},[192,558,559],{},"up = 1"," — scrape succeeded",[555,562,563,566],{},[192,564,565],{},"up = 0"," — scrape failed",[71,568,569,570,572],{},"This is the most fundamental health signal in the stack. Everything else — CPU, memory, disk — is meaningless if you can't even reach the host. 
And because ",[192,571,543],{}," carries all the labels from your target file, you can immediately see which host is down, in which environment.",[185,574,578],{"className":575,"code":576,"language":577,"meta":194,"style":194},"language-promql shiki shiki-themes vitesse-light","# All down hosts right now\nup{job=\"node_exporter\"} == 0\n","promql",[192,579,580,585],{"__ignoreMap":194},[223,581,582],{"class":225,"line":12},[223,583,584],{},"# All down hosts right now\n",[223,586,587],{"class":225,"line":21},[223,588,589],{},"up{job=\"node_exporter\"} == 0\n",[71,591,592,593,595],{},"I keep coming back to ",[192,594,543],{}," throughout this series because it's also where things can silently break if you change the architecture carelessly. More on that in Part 2.",[82,597],{},[85,599,601],{"id":600},"dashboards","Dashboards",[71,603,604],{},"Grafana connects to Prometheus as a data source and queries it via PromQL. The community dashboards are easy to import and useful for getting started, but building your own is worth doing because it forces you to understand exactly what you're looking at.",[71,606,607],{},"The core panels and the queries behind them:",[71,609,610],{},[74,611,612],{},"CPU Usage (%)",[185,614,616],{"className":575,"code":615,"language":577,"meta":194,"style":194},"100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)\n",[192,617,618],{"__ignoreMap":194},[223,619,620],{"class":225,"line":12},[223,621,615],{},[71,623,624],{},[74,625,626],{},"Memory Usage (%)",[185,628,630],{"className":575,"code":629,"language":577,"meta":194,"style":194},"(1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100\n",[192,631,632],{"__ignoreMap":194},[223,633,634],{"class":225,"line":12},[223,635,629],{},[71,637,638,641,642,645],{},[74,639,640],{},"Disk Usage (%)"," — the ",[192,643,644],{},"fstype"," filter excludes Docker overlays and tmpfs mounts that inflate 
results",[185,647,649],{"className":575,"code":648,"language":577,"meta":194,"style":194},"(1 - (\n  node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay\"} \u002F\n  node_filesystem_size_bytes{fstype!~\"tmpfs|overlay\"}\n)) * 100\n",[192,650,651,656,661,666],{"__ignoreMap":194},[223,652,653],{"class":225,"line":12},[223,654,655],{},"(1 - (\n",[223,657,658],{"class":225,"line":21},[223,659,660],{},"  node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay\"} \u002F\n",[223,662,663],{"class":225,"line":30},[223,664,665],{},"  node_filesystem_size_bytes{fstype!~\"tmpfs|overlay\"}\n",[223,667,668],{"class":225,"line":39},[223,669,670],{},")) * 100\n",[71,672,673,676],{},[74,674,675],{},"Fleet status"," — a stat panel showing every host's current state",[185,678,680],{"className":575,"code":679,"language":577,"meta":194,"style":194},"up{job=\"node_exporter\"}\n",[192,681,682],{"__ignoreMap":194},[223,683,684],{"class":225,"line":12},[223,685,679],{},[71,687,688,689,692,693,696],{},"Value mappings: ",[192,690,691],{},"1"," → 🟢 UP, ",[192,694,695],{},"0"," → 🔴 DOWN.",[71,698,699,700,703,704,707],{},"Adding a dashboard variable for ",[192,701,702],{},"instance"," — ",[192,705,706],{},"label_values(up{job=\"node_exporter\"}, instance)"," — gives you a dropdown to filter to a specific host or view the whole fleet. That one change makes the dashboard genuinely useful for day-to-day operations.",[82,709],{},[85,711,713],{"id":712},"alerting","Alerting",[71,715,716],{},"Prometheus evaluates alerting rules and forwards firing alerts to AlertManager. 
AlertManager handles the business logic: who gets notified, when, how often, and what to suppress.",[71,718,719],{},"The rules themselves live in separate YAML files:",[185,721,723],{"className":453,"code":722,"language":455,"meta":194,"style":194},"groups:\n  - name: node_exporter_alerts\n    rules:\n\n      - alert: HostDown\n        expr: up{job=\"node_exporter\"} == 0\n        for: 2m\n        labels:\n          severity: critical\n        annotations:\n          summary: \"Host {{ $labels.instance }} is down\"\n          description: >\n            {{ $labels.hostname }} has been unreachable for more than 2 minutes.\n            Maintainers: {{ $labels.maintainers }}\n\n      - alert: HighCPUUsage\n        expr: >\n          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 85\n        for: 5m\n        labels:\n          severity: warning\n        annotations:\n          summary: \"High CPU on {{ $labels.instance }}\"\n          description: >\n            CPU on {{ $labels.hostname }} has been above 85% for 5 minutes.\n            Current: {{ $value | printf \"%.1f\" }}%\n",[192,724,725,732,744,751,757,769,779,789,796,806,813,827,839,845,851,856,868,877,883,893,900,910,917,931,940,946],{"__ignoreMap":194},[223,726,727,730],{"class":225,"line":12},[223,728,729],{"class":313},"groups",[223,731,470],{"class":298},[223,733,734,736,739,741],{"class":225,"line":21},[223,735,475],{"class":298},[223,737,738],{"class":313}," name",[223,740,320],{"class":298},[223,742,743],{"class":238}," node_exporter_alerts\n",[223,745,746,749],{"class":225,"line":30},[223,747,748],{"class":313},"    rules",[223,750,470],{"class":298},[223,752,753],{"class":225,"line":39},[223,754,756],{"emptyLinePlaceholder":755},true,"\n",[223,758,759,761,764,766],{"class":225,"line":48},[223,760,499],{"class":298},[223,762,763],{"class":313}," alert",[223,765,320],{"class":298},[223,767,768],{"class":238}," 
HostDown\n",[223,770,771,774,776],{"class":225,"line":57},[223,772,773],{"class":313},"        expr",[223,775,320],{"class":298},[223,777,778],{"class":238}," up{job=\"node_exporter\"} == 0\n",[223,780,781,784,786],{"class":225,"line":394},[223,782,783],{"class":313},"        for",[223,785,320],{"class":298},[223,787,788],{"class":238}," 2m\n",[223,790,791,794],{"class":225,"line":415},[223,792,793],{"class":313},"        labels",[223,795,470],{"class":298},[223,797,798,801,803],{"class":225,"line":435},[223,799,800],{"class":313},"          severity",[223,802,320],{"class":298},[223,804,805],{"class":238}," critical\n",[223,807,808,811],{"class":225,"line":441},[223,809,810],{"class":313},"        annotations",[223,812,470],{"class":298},[223,814,815,818,820,822,825],{"class":225,"line":447},[223,816,817],{"class":313},"          summary",[223,819,320],{"class":298},[223,821,363],{"class":326},[223,823,824],{"class":238},"Host {{ $labels.instance }} is down",[223,826,432],{"class":326},[223,828,830,833,835],{"class":225,"line":829},12,[223,831,832],{"class":313},"          description",[223,834,320],{"class":298},[223,836,838],{"class":837},"sbBg2"," >\n",[223,840,842],{"class":225,"line":841},13,[223,843,844],{"class":238},"            {{ $labels.hostname }} has been unreachable for more than 2 minutes.\n",[223,846,848],{"class":225,"line":847},14,[223,849,850],{"class":238},"            Maintainers: {{ $labels.maintainers }}\n",[223,852,854],{"class":225,"line":853},15,[223,855,756],{"emptyLinePlaceholder":755},[223,857,859,861,863,865],{"class":225,"line":858},16,[223,860,499],{"class":298},[223,862,763],{"class":313},[223,864,320],{"class":298},[223,866,867],{"class":238}," HighCPUUsage\n",[223,869,871,873,875],{"class":225,"line":870},17,[223,872,773],{"class":313},[223,874,320],{"class":298},[223,876,838],{"class":837},[223,878,880],{"class":225,"line":879},18,[223,881,882],{"class":238},"          100 - (avg by (instance) 
(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 85\n",[223,884,886,888,890],{"class":225,"line":885},19,[223,887,783],{"class":313},[223,889,320],{"class":298},[223,891,892],{"class":238}," 5m\n",[223,894,896,898],{"class":225,"line":895},20,[223,897,793],{"class":313},[223,899,470],{"class":298},[223,901,903,905,907],{"class":225,"line":902},21,[223,904,800],{"class":313},[223,906,320],{"class":298},[223,908,909],{"class":238}," warning\n",[223,911,913,915],{"class":225,"line":912},22,[223,914,810],{"class":313},[223,916,470],{"class":298},[223,918,920,922,924,926,929],{"class":225,"line":919},23,[223,921,817],{"class":313},[223,923,320],{"class":298},[223,925,363],{"class":326},[223,927,928],{"class":238},"High CPU on {{ $labels.instance }}",[223,930,432],{"class":326},[223,932,934,936,938],{"class":225,"line":933},24,[223,935,832],{"class":313},[223,937,320],{"class":298},[223,939,838],{"class":837},[223,941,943],{"class":225,"line":942},25,[223,944,945],{"class":238},"            CPU on {{ $labels.hostname }} has been above 85% for 5 minutes.\n",[223,947,949],{"class":225,"line":948},26,[223,950,951],{"class":238},"            Current: {{ $value | printf \"%.1f\" }}%\n",[71,953,540,954,957,958,960,961,963,964,966],{},[192,955,956],{},"for: 2m"," on ",[192,959,528],{}," absorbs brief network glitches. Without it, a momentary scrape failure sends an alert. 
The rich labels on the target (",[192,962,356],{},", ",[192,965,420],{},") show up directly in the alert annotations.",[71,968,969],{},"One AlertManager config worth explaining is the inhibit rule:",[185,971,973],{"className":453,"code":972,"language":455,"meta":194,"style":194},"inhibit_rules:\n  - source_match:\n      alertname: \"HostDown\"\n    target_match_re:\n      alertname: \"HighCPUUsage|HighMemoryUsage|DiskSpaceLow\"\n    equal: [\"instance\"]\n",[192,974,975,982,991,1004,1011,1024],{"__ignoreMap":194},[223,976,977,980],{"class":225,"line":12},[223,978,979],{"class":313},"inhibit_rules",[223,981,470],{"class":298},[223,983,984,986,989],{"class":225,"line":21},[223,985,475],{"class":298},[223,987,988],{"class":313}," source_match",[223,990,470],{"class":298},[223,992,993,996,998,1000,1002],{"class":225,"line":30},[223,994,995],{"class":313},"      alertname",[223,997,320],{"class":298},[223,999,363],{"class":326},[223,1001,528],{"class":238},[223,1003,432],{"class":326},[223,1005,1006,1009],{"class":225,"line":39},[223,1007,1008],{"class":313},"    target_match_re",[223,1010,470],{"class":298},[223,1012,1013,1015,1017,1019,1022],{"class":225,"line":48},[223,1014,995],{"class":313},[223,1016,320],{"class":298},[223,1018,363],{"class":326},[223,1020,1021],{"class":238},"HighCPUUsage|HighMemoryUsage|DiskSpaceLow",[223,1023,432],{"class":326},[223,1025,1026,1029,1031,1033,1035,1037,1039],{"class":225,"line":57},[223,1027,1028],{"class":313},"    equal",[223,1030,320],{"class":298},[223,1032,323],{"class":298},[223,1034,317],{"class":326},[223,1036,702],{"class":238},[223,1038,317],{"class":326},[223,1040,450],{"class":298},[71,1042,1043,1044,1046,1047,1050],{},"When ",[192,1045,528],{}," fires for a host, AlertManager suppresses the resource alerts named in the rule for that same instance. There's no useful signal in a ",[192,1048,1049],{},"HighMemoryUsage"," alert for a machine that isn't reachable. 
Without this, a single dead host can generate a cascade of noise.",[82,1052],{},[85,1054,540,1056,1059],{"id":1055},"the-last_seen-pattern",[192,1057,1058],{},"last_seen"," Pattern",[71,1061,1062,1063,1066,1067,1069],{},"When a host disappears completely, Prometheus eventually stops having active series data for it. ",[192,1064,1065],{},"up{instance=\"...\"}"," doesn't return ",[192,1068,695],{}," — it returns nothing, because there's no scrape happening. You lose the ability to answer \"when did this thing last check in?\"",[71,1071,1072],{},"A recording rule fixes this by continuously writing a timestamp whenever a host is up:",[185,1074,1076],{"className":453,"code":1075,"language":455,"meta":194,"style":194},"groups:\n  - name: recording_rules\n    rules:\n      - record: node_last_seen_timestamp\n        expr: time() * up{job=\"node_exporter\"}\n",[192,1077,1078,1084,1095,1101,1113],{"__ignoreMap":194},[223,1079,1080,1082],{"class":225,"line":12},[223,1081,729],{"class":313},[223,1083,470],{"class":298},[223,1085,1086,1088,1090,1092],{"class":225,"line":21},[223,1087,475],{"class":298},[223,1089,738],{"class":313},[223,1091,320],{"class":298},[223,1093,1094],{"class":238}," recording_rules\n",[223,1096,1097,1099],{"class":225,"line":30},[223,1098,748],{"class":313},[223,1100,470],{"class":298},[223,1102,1103,1105,1108,1110],{"class":225,"line":39},[223,1104,499],{"class":298},[223,1106,1107],{"class":313}," record",[223,1109,320],{"class":298},[223,1111,1112],{"class":238}," node_last_seen_timestamp\n",[223,1114,1115,1117,1119],{"class":225,"line":48},[223,1116,773],{"class":313},[223,1118,320],{"class":298},[223,1120,1121],{"class":238}," time() * up{job=\"node_exporter\"}\n",[71,1123,1124,1125,1128],{},"This writes the current Unix timestamp on every evaluation cycle, but only when ",[192,1126,1127],{},"up == 1",". When a host goes dark, the last written value persists in storage. 
In Grafana:",[185,1130,1132],{"className":575,"code":1131,"language":577,"meta":194,"style":194},"time() - node_last_seen_timestamp\n",[192,1133,1134],{"__ignoreMap":194},[223,1135,1136],{"class":225,"line":12},[223,1137,1131],{},[71,1139,1140],{},"Format as duration and you get: \"last seen 3h 22m ago.\" It's a small thing, but it's become one of the most-used panels.",[82,1142],{},[85,1144,1146],{"id":1145},"where-this-left-off","Where This Left Off",[71,1148,1149],{},"By the end of the first week, the stack was functionally replacing NagiosXI for host monitoring. Prometheus was scraping every host every 15 seconds, dashboards showed the fleet, AlertManager routed alerts with inhibit rules and deduplication, and recording rules kept last-seen timestamps for hosts that went dark.",[71,1151,1152],{},"But there was a question I hadn't resolved yet.",[71,1154,1155],{},"Node Exporter is a single-purpose binary: host metrics and nothing else. The moment we wanted logs or traces from these same hosts, we'd need additional agents running alongside it. And adding a host to monitoring still meant manual work on every machine: SSH in, install Node Exporter, and write the target file for Prometheus to pick up.",[71,1157,1158],{},"My colleague had been working in parallel, exploring the multiple-exporter approach, with a separate binary for each signal type. I'd been looking at Grafana Alloy, which promised a single agent that could handle all of it. 
We hadn't converged yet, and there were real questions about whether Alloy was ready enough to build on.",[71,1160,1161],{},"That's what Part 2 is about.",[1163,1164],"reference",{"path":1165},"\u002Fblog\u002Flgtm-stack\u002Fpart-2",[1167,1168,1169],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .s8zF2, html code.shiki .s8zF2{--shiki-default:#A0ADA0}html pre.shiki code .sySUi, html code.shiki .sySUi{--shiki-default:#59873A}html pre.shiki code .spphp, html code.shiki .spphp{--shiki-default:#B56959}html pre.shiki code .si04Y, html code.shiki .si04Y{--shiki-default:#AB5959}html pre.shiki code .suHK_, html code.shiki .suHK_{--shiki-default:#393A34}html pre.shiki code .sEi1f, html code.shiki .sEi1f{--shiki-default:#A65E2B}html pre.shiki code .sYZai, html code.shiki .sYZai{--shiki-default:#999999}html pre.shiki code .s61at, html code.shiki .s61at{--shiki-default:#99841877}html pre.shiki code .su6XF, html code.shiki .su6XF{--shiki-default:#998418}html pre.shiki code .sSP4y, html code.shiki .sSP4y{--shiki-default:#B5695977}html pre.shiki code .sbBg2, html code.shiki .sbBg2{--shiki-default:#1E754F}",{"title":194,"searchDepth":21,"depth":21,"links":1171},[1172,1173,1174,1175,1177,1178,1179,1181],{"id":87,"depth":21,"text":88},{"id":105,"depth":21,"text":106},{"id":206,"depth":21,"text":207},{"id":539,"depth":21,"text":1176},"The up Metric",{"id":600,"depth":21,"text":601},{"id":712,"depth":21,"text":713},{"id":1055,"depth":21,"text":1180},"The last_seen Pattern",{"id":1145,"depth":21,"text":1146},"Blog","2025-01-17","How we migrated 
from NagiosXI to a modern open-source observability stack — and why getting the foundation right mattered more than I expected.","md",{},"\u002Fblog\u002Flgtm-stack\u002Fpart-1","10 min",{"title":63,"description":1184},"blog\u002Flgtm-stack\u002Fpart-1",[1192,146,136,166,1193],"Monitoring","DevOps","\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-1.png","7NeM6IsFhBh6e-n5FcTCZrpdQwLcnU1u-1RFk4aMi1A",{"id":1197,"title":1198,"body":1199,"category":1182,"date":1867,"description":1868,"extension":1185,"meta":1869,"navigation":755,"path":1165,"readTime":1870,"seo":1871,"stem":1872,"tags":1873,"thumbnail":1875,"__hash__":1876},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-2.md","Building a Production Monitoring Stack from Scratch — Part 2: Grafana Alloy & the Push vs Pull Problem",{"type":65,"value":1200,"toc":1855},[1201,1210,1212,1214,1218,1221,1224,1227,1229,1233,1240,1243,1246,1263,1266,1338,1345,1347,1351,1357,1369,1377,1384,1390,1399,1402,1407,1412,1419,1424,1427,1436,1442,1446,1452,1459,1465,1471,1480,1567,1580,1710,1716,1718,1722,1725,1728,1747,1758,1761,1774,1777,1779,1783,1786,1789,1792,1795,1798,1800,1804,1809,1819,1828,1834,1836,1838,1841,1844,1849,1852],[68,1202,1203],{},[71,1204,1205,77,1207],{},[74,1206,76],{},[74,1208,1209],{},"Part 2 of 4",[1163,1211],{"path":1187},[82,1213],{},[85,1215,1217],{"id":1216},"the-open-question-from-part-1","The Open Question from Part 1",[71,1219,1220],{},"By the end of Phase 1, the stack was working. But Node Exporter is a single-purpose binary — host metrics, nothing else. The plan was always to get logs and traces into the same system, which meant we'd eventually need more agents on each host. A separate exporter for postgres metrics, another for nginx, maybe more after that. Each one is another thing to deploy, another thing to update, another thing to break in a subtly different way.",[71,1222,1223],{},"My colleague and I had been running in parallel on this. 
He was working through the multiple-exporter approach — the established path, a separate binary per signal type. I'd been looking at Grafana Alloy, which promised a single agent that could handle metrics, logs, and traces from one deployed process.",[71,1225,1226],{},"The question was whether Alloy was actually ready to build on.",[82,1228],{},[85,1230,1232],{"id":1231},"what-alloy-is","What Alloy Is",[71,1234,1235,1236,1239],{},"Grafana Alloy is Grafana Labs' open-source observability agent, positioned as the successor to Grafana Agent Flow. It's built around a pipeline model: you define sources, processors, and exporters as typed components and wire them together in ",[192,1237,1238],{},".alloy"," config files.",[71,1241,1242],{},"When I started working with it, it was still fairly new. The Agent Flow rebranding into Alloy had just stabilised, documentation was still filling in gaps, and community examples were sparse. You were going to hit sharp edges. But the direction seemed clearly right — one agent, multiple signals, explicit pipelines.",[71,1244,1245],{},"What made it compelling on paper:",[552,1247,1248,1254,1257,1260],{},[555,1249,1250,1253],{},[192,1251,1252],{},"prometheus.exporter.unix"," replicates Node Exporter's collectors without a separate binary",[555,1255,1256],{},"First-class support for OpenTelemetry receivers and exporters",[555,1258,1259],{},"Composable pipeline configs where data flow is visible and readable",[555,1261,1262],{},"Application metric endpoints (postgres, nginx, etc.) accessible as pipeline components",[71,1264,1265],{},"The config model is clean. 
Here's a simple pipeline — collect host metrics, send to Prometheus:",[185,1267,1271],{"className":1268,"code":1269,"language":1270,"meta":194,"style":194},"language-alloy shiki shiki-themes vitesse-light","prometheus.exporter.unix \"localhost\" {\n  set_collectors = [\"cpu\", \"meminfo\", \"diskstats\", \"filesystem\", \"netdev\", \"loadavg\"]\n}\n\nprometheus.scrape \"node\" {\n  targets    = prometheus.exporter.unix.localhost.targets\n  forward_to = [prometheus.remote_write.default.receiver]\n}\n\nprometheus.remote_write \"default\" {\n  endpoint {\n    url = \"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\"\n  }\n}\n","alloy",[192,1272,1273,1278,1283,1288,1292,1297,1302,1307,1311,1315,1320,1325,1330,1334],{"__ignoreMap":194},[223,1274,1275],{"class":225,"line":12},[223,1276,1277],{},"prometheus.exporter.unix \"localhost\" {\n",[223,1279,1280],{"class":225,"line":21},[223,1281,1282],{},"  set_collectors = [\"cpu\", \"meminfo\", \"diskstats\", \"filesystem\", \"netdev\", \"loadavg\"]\n",[223,1284,1285],{"class":225,"line":30},[223,1286,1287],{},"}\n",[223,1289,1290],{"class":225,"line":39},[223,1291,756],{"emptyLinePlaceholder":755},[223,1293,1294],{"class":225,"line":48},[223,1295,1296],{},"prometheus.scrape \"node\" {\n",[223,1298,1299],{"class":225,"line":57},[223,1300,1301],{},"  targets    = prometheus.exporter.unix.localhost.targets\n",[223,1303,1304],{"class":225,"line":394},[223,1305,1306],{},"  forward_to = [prometheus.remote_write.default.receiver]\n",[223,1308,1309],{"class":225,"line":415},[223,1310,1287],{},[223,1312,1313],{"class":225,"line":435},[223,1314,756],{"emptyLinePlaceholder":755},[223,1316,1317],{"class":225,"line":441},[223,1318,1319],{},"prometheus.remote_write \"default\" {\n",[223,1321,1322],{"class":225,"line":447},[223,1323,1324],{},"  endpoint {\n",[223,1326,1327],{"class":225,"line":829},[223,1328,1329],{},"    url = 
\"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\"\n",[223,1331,1332],{"class":225,"line":841},[223,1333,444],{},[223,1335,1336],{"class":225,"line":847},[223,1337,1287],{},[71,1339,1340,1341,1344],{},"That ",[192,1342,1343],{},"remote_write"," line introduced a problem I didn't see coming.",[82,1346],{},[85,1348,1350],{"id":1349},"the-push-vs-pull-problem","The Push vs Pull Problem",[71,1352,1353,1354,1356],{},"Prometheus is a pull-based system. It owns the scrape cycle — it reaches out to targets, pulls metrics, and as a side effect of each successful scrape, generates a synthetic ",[192,1355,543],{}," metric:",[552,1358,1359,1364],{},[555,1360,1361,1363],{},[192,1362,559],{}," — scrape succeeded, host is reachable",[555,1365,1366,1368],{},[192,1367,565],{}," — scrape failed, something is wrong",[71,1370,1371,1373,1374,1376],{},[192,1372,543],{}," isn't something your application exports. Prometheus generates it internally, based on whether the HTTP request to ",[192,1375,213],{}," succeeded. The entire alerting chain from Part 1 depended on it.",[71,1378,1379,1380,1383],{},"When Alloy uses ",[192,1381,1382],{},"prometheus.remote_write",", the data flow reverses. Alloy pushes metrics to Prometheus via HTTP POST. Prometheus sits passively and receives what's sent.",[71,1385,1386,1387,1389],{},"And that means Prometheus never scrapes these hosts. So Prometheus never generates ",[192,1388,543],{}," for them.",[71,1391,1392,1393,1395,1396,1398],{},"The first time I checked the Prometheus targets page after switching to push-based Alloy, those hosts weren't in the targets list at all. Not showing ",[192,1394,565],{}," — not there at all. Prometheus had no scrape config for them; it was just receiving a stream of metrics it hadn't asked for. 
The ",[192,1397,543],{}," metric had silently disappeared, and everything built on top of it — every alert, every \"host is down\" notification — had gone with it.",[71,1400,1401],{},"This was the kind of failure that wouldn't surface immediately. The dashboards still had data. Metrics were still flowing. It would only become obvious the next time a host actually went down and nobody got paged.",[1403,1404,1406],"h3",{"id":1405},"the-workarounds-i-tried","The Workarounds I Tried",[71,1408,1409],{},[74,1410,1411],{},"Heartbeat metric from Alloy's internal health",[71,1413,1414,1415,1418],{},"Alloy exposes internal component status metrics. You can check if the pipeline is running. But this only tells you Alloy is alive on the server side — it says nothing about whether the ",[531,1416,1417],{},"host"," is reachable. A host could be completely unreachable and Alloy's own health metrics would look fine from Prometheus's perspective, because Prometheus was never reaching out to check.",[71,1420,1421],{},[74,1422,1423],{},"Staleness detection via timestamp",[71,1425,1426],{},"If a host stops pushing, its metrics go stale. You can detect this:",[185,1428,1430],{"className":575,"code":1429,"language":577,"meta":194,"style":194},"(time() - max by (instance) (timestamp(node_cpu_seconds_total))) > 120\n",[192,1431,1432],{"__ignoreMap":194},[223,1433,1434],{"class":225,"line":12},[223,1435,1429],{},[71,1437,1438,1439,1441],{},"This technically works. But it's fragile — dependent on a specific metric being present and recently written, prone to false positives from remote_write buffer lag or brief network hiccups. And it means rewriting every alert and dashboard around staleness rather than the clean binary ",[192,1440,543],{}," signal. 
It felt like building on sand.",[1403,1443,1445],{"id":1444},"the-actual-fix","The Actual Fix",[71,1447,1448,1449],{},"After long enough on the workarounds, the right answer was simpler: ",[74,1450,1451],{},"keep Prometheus pulling, just pull from Alloy's HTTP endpoint instead of a standalone Node Exporter binary.",[71,1453,1454,1455,1458],{},"Alloy exposes an HTTP API on port ",[192,1456,1457],{},"12345",". Every component that produces metrics is accessible at a path under that API:",[185,1460,1463],{"className":1461,"code":1462,"language":190},[188],"http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n",[192,1464,1462],{"__ignoreMap":194},[71,1466,1467,1468,1470],{},"This is a plain HTTP endpoint serving Prometheus text format — exactly what Node Exporter served on port 9100. Prometheus can scrape it exactly like any other target. When it does, it generates ",[192,1469,543],{},". Everything from Part 1 works without modification.",[71,1472,1473,1474,1477,1478,320],{},"The Alloy config on each host becomes simpler, not more complex — no ",[192,1475,1476],{},"prometheus.scrape",", no ",[192,1479,1382],{},[185,1481,1483],{"className":1268,"code":1482,"language":1270,"meta":194,"style":194},"prometheus.exporter.unix \"localhost\" {\n  set_collectors = [\n    \"cpu\",\n    \"meminfo\",\n    \"diskstats\",\n    \"filesystem\",\n    \"netdev\",\n    \"loadavg\",\n    \"uname\",\n    \"time\",\n    \"systemd\",\n    \"processes\",\n  ]\n}\n\n\u002F\u002F Prometheus will pull from:\n\u002F\u002F http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n",[192,1484,1485,1489,1494,1499,1504,1509,1514,1519,1524,1529,1534,1539,1544,1549,1553,1557,1562],{"__ignoreMap":194},[223,1486,1487],{"class":225,"line":12},[223,1488,1277],{},[223,1490,1491],{"class":225,"line":21},[223,1492,1493],{},"  set_collectors = 
[\n",[223,1495,1496],{"class":225,"line":30},[223,1497,1498],{},"    \"cpu\",\n",[223,1500,1501],{"class":225,"line":39},[223,1502,1503],{},"    \"meminfo\",\n",[223,1505,1506],{"class":225,"line":48},[223,1507,1508],{},"    \"diskstats\",\n",[223,1510,1511],{"class":225,"line":57},[223,1512,1513],{},"    \"filesystem\",\n",[223,1515,1516],{"class":225,"line":394},[223,1517,1518],{},"    \"netdev\",\n",[223,1520,1521],{"class":225,"line":415},[223,1522,1523],{},"    \"loadavg\",\n",[223,1525,1526],{"class":225,"line":435},[223,1527,1528],{},"    \"uname\",\n",[223,1530,1531],{"class":225,"line":441},[223,1532,1533],{},"    \"time\",\n",[223,1535,1536],{"class":225,"line":447},[223,1537,1538],{},"    \"systemd\",\n",[223,1540,1541],{"class":225,"line":829},[223,1542,1543],{},"    \"processes\",\n",[223,1545,1546],{"class":225,"line":841},[223,1547,1548],{},"  ]\n",[223,1550,1551],{"class":225,"line":847},[223,1552,1287],{},[223,1554,1555],{"class":225,"line":853},[223,1556,756],{"emptyLinePlaceholder":755},[223,1558,1559],{"class":225,"line":858},[223,1560,1561],{},"\u002F\u002F Prometheus will pull from:\n",[223,1563,1564],{"class":225,"line":870},[223,1565,1566],{},"\u002F\u002F http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n",[71,1568,1569,1570,1572,1573,1576,1577,1579],{},"The target file for each Alloy host points to port ",[192,1571,1457],{}," and uses ",[192,1574,1575],{},"__metrics_path__"," — a special Prometheus label that overrides the default ",[192,1578,213],{}," scrape path — to point at the correct component endpoint:",[185,1581,1583],{"className":289,"code":1582,"language":291,"meta":194,"style":194},"[\n  {\n    \"targets\": [\"10.200.3.23:12345\"],\n    \"labels\": {\n      \"hostname\": \"cloud-network-3\",\n      \"environment\": \"production\",\n      \"maintainers\": \"admin@domain.com\",\n      \"__metrics_path__\": 
\"\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\"\n    }\n  }\n]\n",[192,1584,1585,1589,1593,1614,1626,1645,1663,1681,1698,1702,1706],{"__ignoreMap":194},[223,1586,1587],{"class":225,"line":12},[223,1588,299],{"class":298},[223,1590,1591],{"class":225,"line":21},[223,1592,304],{"class":298},[223,1594,1595,1597,1599,1601,1603,1605,1607,1610,1612],{"class":225,"line":30},[223,1596,310],{"class":309},[223,1598,314],{"class":313},[223,1600,317],{"class":309},[223,1602,320],{"class":298},[223,1604,323],{"class":298},[223,1606,317],{"class":326},[223,1608,1609],{"class":238},"10.200.3.23:12345",[223,1611,317],{"class":326},[223,1613,334],{"class":298},[223,1615,1616,1618,1620,1622,1624],{"class":225,"line":39},[223,1617,310],{"class":309},[223,1619,341],{"class":313},[223,1621,317],{"class":309},[223,1623,320],{"class":298},[223,1625,348],{"class":298},[223,1627,1628,1630,1632,1634,1636,1638,1641,1643],{"class":225,"line":48},[223,1629,353],{"class":309},[223,1631,356],{"class":313},[223,1633,317],{"class":309},[223,1635,320],{"class":298},[223,1637,363],{"class":326},[223,1639,1640],{"class":238},"cloud-network-3",[223,1642,317],{"class":326},[223,1644,371],{"class":298},[223,1646,1647,1649,1651,1653,1655,1657,1659,1661],{"class":225,"line":57},[223,1648,353],{"class":309},[223,1650,378],{"class":313},[223,1652,317],{"class":309},[223,1654,320],{"class":298},[223,1656,363],{"class":326},[223,1658,387],{"class":238},[223,1660,317],{"class":326},[223,1662,371],{"class":298},[223,1664,1665,1667,1669,1671,1673,1675,1677,1679],{"class":225,"line":394},[223,1666,353],{"class":309},[223,1668,420],{"class":313},[223,1670,317],{"class":309},[223,1672,320],{"class":298},[223,1674,363],{"class":326},[223,1676,429],{"class":238},[223,1678,317],{"class":326},[223,1680,371],{"class":298},[223,1682,1683,1685,1687,1689,1691,1693,1696],{"class":225,"line":415},[223,1684,353],{"class":309},[223,1686,1575],{"class":313},[223,1688,317],{"class":309},[
223,1690,320],{"class":298},[223,1692,363],{"class":326},[223,1694,1695],{"class":238},"\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics",[223,1697,432],{"class":326},[223,1699,1700],{"class":225,"line":435},[223,1701,438],{"class":298},[223,1703,1704],{"class":225,"line":441},[223,1705,444],{"class":298},[223,1707,1708],{"class":225,"line":447},[223,1709,450],{"class":298},[71,1711,1712,1713,1715],{},"Prometheus scrapes it, gets back standard metrics, generates ",[192,1714,559],{},", stores everything as normal. The alerting chain is intact.",[82,1717],{},[85,1719,1721],{"id":1720},"getting-application-metrics-in-the-part-that-nearly-broke-alloy-for-us","Getting Application Metrics In: The Part That Nearly Broke Alloy for Us",[71,1723,1724],{},"The unix exporter was the straightforward part. The harder part was getting application-level metrics — specifically postgres and nginx — through Alloy rather than through separate exporter binaries.",[71,1726,1727],{},"Alloy has built-in components for both. The postgres one:",[185,1729,1731],{"className":1268,"code":1730,"language":1270,"meta":194,"style":194},"prometheus.exporter.postgres \"db\" {\n  data_source_names = [\"postgresql:\u002F\u002Fuser:pass@localhost:5432\u002Fmydb?sslmode=disable\"]\n}\n",[192,1732,1733,1738,1743],{"__ignoreMap":194},[223,1734,1735],{"class":225,"line":12},[223,1736,1737],{},"prometheus.exporter.postgres \"db\" {\n",[223,1739,1740],{"class":225,"line":21},[223,1741,1742],{},"  data_source_names = [\"postgresql:\u002F\u002Fuser:pass@localhost:5432\u002Fmydb?sslmode=disable\"]\n",[223,1744,1745],{"class":225,"line":30},[223,1746,1287],{},[71,1748,1749,1750,1753,1754,1757],{},"I tried the nginx equivalent first. Alloy has ",[192,1751,1752],{},"prometheus.exporter.nginx",", which connects to nginx's ",[192,1755,1756],{},"stub_status"," endpoint and pulls metrics from it. I set it up, checked the output — nothing. No metrics, no errors, just silence. 
I spent time on it, checked the nginx config, checked the Alloy config, tried different approaches. At some point I started thinking seriously about just installing the standalone nginx exporter and giving up on Alloy for application metrics entirely.",[71,1759,1760],{},"Before doing that, I tried the postgres component instead. It worked immediately — metrics flowing through on the first attempt. That was the signal I needed. If postgres worked, nginx should work too. The problem wasn't Alloy. Something was wrong with my specific nginx setup.",[71,1762,1763,1764,1766,1767,1769,1770,1773],{},"I went back to nginx, looked more carefully at the ",[192,1765,1756],{}," configuration, and found it. The endpoint wasn't properly enabled — the nginx config had the ",[192,1768,1756],{}," block but it was only accessible from ",[192,1771,1772],{},"127.0.0.1",", and Alloy was trying to reach it in a way that wasn't matching that restriction. A small fix, and nginx metrics started flowing.",[71,1775,1776],{},"The near-ditch was worth it. Running separate exporters for every application would have meant my colleague's approach and my approach converging on the same place — a proliferation of binaries per host. The whole point of Alloy was avoiding that.",[82,1778],{},[85,1780,1782],{"id":1781},"why-alloy-over-multiple-exporters","Why Alloy Over Multiple Exporters",[71,1784,1785],{},"My colleague's multiple-exporter approach was working. It's the established path, well-documented, stable. The case for it isn't wrong.",[71,1787,1788],{},"But the case for Alloy is better for where we're going. The moment you want logs — which was always the plan — you need another agent anyway. If you're already running Node Exporter, postgres exporter, and nginx exporter, you're at three binaries per host. Adding a log agent makes four. 
Each one needs to be deployed, configured, updated, and monitored independently.",[71,1790,1791],{},"With Alloy, adding logs is another component in the same config file on the same process. Adding traces is the same. The operational footprint stays at one agent per host regardless of how many signals you're collecting.",[71,1793,1794],{},"There's also the matter of the pipeline model. When you have a single config that describes exactly what data is flowing where, debugging is straightforward. With four separate agents running independently, understanding the full picture requires checking four separate processes.",[71,1796,1797],{},"The sharp edges were real — the push vs pull problem cost me real time, the nginx issue nearly derailed the whole approach. But those were solvable problems. The structural limitation of multiple exporters — complexity that compounds as you add signals — isn't.",[82,1799],{},[85,1801,1803],{"id":1802},"what-changed-in-grafana","What Changed in Grafana",[71,1805,1806,1808],{},[192,1807,1252],{}," uses the same metric names as standalone Node Exporter — it's built on the same underlying collectors. Every PromQL query from Part 1 works unchanged.",[71,1810,1811,1812,1814,1815,1818],{},"The one real change is how ",[192,1813,543],{}," is queried. With Alloy, the job label becomes ",[192,1816,1817],{},"\"alloy\"",", and filtering is done through the richer label set on each target — environment, priority, instance — rather than anything tied to a port or exporter binary. 
For example, the fleet status panel:",[185,1820,1822],{"className":575,"code":1821,"language":577,"meta":194,"style":194},"count(up{job=\"alloy\", priority=~\"${priority}\", environment=~\"${environment}\", instance!~\"^localhost.*\", instance=~\"${instance}\"} == 1) or vector(0)\n",[192,1823,1824],{"__ignoreMap":194},[223,1825,1826],{"class":225,"line":12},[223,1827,1821],{},[71,1829,540,1830,1833],{},[192,1831,1832],{},"or vector(0)"," ensures the panel returns zero rather than no data when nothing matches — a small thing that matters when you're staring at a dashboard at 2am wondering if the query is broken or the hosts are genuinely all down.",[82,1835],{},[85,1837,1146],{"id":1145},[71,1839,1840],{},"By the end of Phase 2, the stack had a single agent per host handling metrics across system and application layers, pull-based scraping preserved so all the alerting machinery from Phase 1 still worked, and OTel ports open on the Alloy container for what came next.",[71,1842,1843],{},"The next gap was observability beyond metrics. Host CPU and memory tell you a machine is struggling; they don't tell you why, or what a request was doing when it failed. 
That meant Loki for logs and Tempo for distributed traces.",[71,1845,1846],{},[74,1847,1848],{},"Next:",[1163,1850],{"path":1851},"\u002Fblog\u002Flgtm-stack\u002Fpart-3",[1167,1853,1854],{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sYZai, html code.shiki .sYZai{--shiki-default:#999999}html pre.shiki code .s61at, html code.shiki .s61at{--shiki-default:#99841877}html pre.shiki code .su6XF, html code.shiki .su6XF{--shiki-default:#998418}html pre.shiki code .sSP4y, html code.shiki .sSP4y{--shiki-default:#B5695977}html pre.shiki code .spphp, html code.shiki .spphp{--shiki-default:#B56959}",{"title":194,"searchDepth":21,"depth":21,"links":1856},[1857,1858,1859,1863,1864,1865,1866],{"id":1216,"depth":21,"text":1217},{"id":1231,"depth":21,"text":1232},{"id":1349,"depth":21,"text":1350,"children":1860},[1861,1862],{"id":1405,"depth":30,"text":1406},{"id":1444,"depth":30,"text":1445},{"id":1720,"depth":21,"text":1721},{"id":1781,"depth":21,"text":1782},{"id":1802,"depth":21,"text":1803},{"id":1145,"depth":21,"text":1146},"2026-05-18","Why we replaced individual exporters with Grafana Alloy, why push-based metrics silently broke our alerting, and what it took to figure that out.",{},"11 min",{"title":1198,"description":1868},"blog\u002Flgtm-stack\u002Fpart-2",[1192,146,1874,136,1193],"Alloy","\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-2.png","5rTJAJ6R9qZvwa3_G-JIs2egK0cngWF_IBU66dyKnZk",1780657374705]