[{"data":1,"prerenderedAt":2498},["ShallowReactive",2],{"nav-stories":3,"blog-lgtm-stack\u002Fpart-2":61,"ref-\u002Fblog\u002Flgtm-stack\u002Fpart-1":821,"ref-\u002Fblog\u002Flgtm-stack\u002Fpart-3":1877},[4,16,25,34,43,52],{"id":5,"color":6,"extension":7,"image":8,"label":9,"link":10,"meta":11,"order":12,"stem":13,"text":14,"__hash__":15},"stories\u002Fstories\u002F01-data-center.yml",null,"yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1558494949-ef010cbdcc31?w=1080","DATA_CENTER","https:\u002F\u002Fx.com\u002Fabbeytetteh_",{},1,"stories\u002F01-data-center","Racking new servers. 40gbit backbone online.","0QUZQbaANhdO8WemZxkDdO7vbVopfnynHtH9FxBZb_w",{"id":17,"color":6,"extension":7,"image":18,"label":19,"link":6,"meta":20,"order":21,"stem":22,"text":23,"__hash__":24},"stories\u002Fstories\u002F02-thoughts.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1498050108023-c5249f4df085?w=1080","THOUGHTS",{},2,"stories\u002F02-thoughts","Late night bug hunting. Found the memory leak.","Gd1am954aasY6HRHD7hCtOuessXb6zYZ8iizS501ICg",{"id":26,"color":27,"extension":7,"image":6,"label":28,"link":6,"meta":29,"order":30,"stem":31,"text":32,"__hash__":33},"stories\u002Fstories\u002F03-coding.yml","#3b82f6","CODING",{},3,"stories\u002F03-coding","Just thinking about how much easier life is with Swarm. https:\u002F\u002Fgoogle.com","-WTk-47jnLM-TZRWBg0VbJyZJfIM7FpQ5HGbc8LEdhQ",{"id":35,"color":6,"extension":7,"image":36,"label":37,"link":6,"meta":38,"order":39,"stem":40,"text":41,"__hash__":42},"stories\u002Fstories\u002F04-update.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1591799264318-7e6ef8ddb7ea?w=1080","UPDATE",{},4,"stories\u002F04-update","New cluster nodes arrived. 
Prepping for installation.","kyT60N5C6Re_jMonZbgNy0PbQhzXmUWxDbD0D_v43ts",{"id":44,"color":45,"extension":7,"image":6,"label":46,"link":6,"meta":47,"order":48,"stem":49,"text":50,"__hash__":51},"stories\u002Fstories\u002F05-setup.yml","#86868b","SETUP",{},5,"stories\u002F05-setup","Optimizing the telemetry pipeline for 1M req\u002Fs.","cPOBkzoyXsCmPgRO2d80Hj3vm4MP-6nAejtlQ5iuSzw",{"id":53,"color":6,"extension":7,"image":54,"label":55,"link":6,"meta":56,"order":57,"stem":58,"text":59,"__hash__":60},"stories\u002Fstories\u002F06-travel.yml","https:\u002F\u002Fimages.unsplash.com\u002Fphoto-1560969184-10fe8719e047?w=1080","TRAVEL",{},6,"stories\u002F06-travel","Travel log — system architecture workshop in Berlin.","jnOxerdF6usAIHdR35Z-opx0LJAy9kZluXnZhtz62Z0",{"id":62,"title":63,"body":64,"category":804,"date":805,"description":806,"extension":807,"meta":808,"navigation":174,"path":809,"readTime":810,"seo":811,"stem":812,"tags":813,"thumbnail":819,"__hash__":820},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-2.md","Building a Production Monitoring Stack from Scratch — Part 2: Grafana Alloy & the Push vs Pull Problem",{"type":65,"value":66,"toc":792},"minimark",[67,81,85,88,93,96,99,102,104,108,116,119,122,141,144,231,238,240,244,251,265,274,281,287,296,299,304,309,317,322,325,336,342,346,352,359,367,373,383,473,486,643,649,651,655,658,661,680,691,694,707,710,712,716,719,722,725,728,731,733,737,742,752,761,768,770,774,777,780,785,788],[68,69,70],"blockquote",{},[71,72,73,77,78],"p",{},[74,75,76],"strong",{},"Series:"," From NagiosXI to a Modern Observability Stack\n",[74,79,80],{},"Part 2 of 4",[82,83],"reference",{"path":84},"\u002Fblog\u002Flgtm-stack\u002Fpart-1",[86,87],"hr",{},[89,90,92],"h2",{"id":91},"the-open-question-from-part-1","The Open Question from Part 1",[71,94,95],{},"By the end of Phase 1, the stack was working. But Node Exporter is a single-purpose binary — host metrics, nothing else. 
The plan was always to get logs and traces into the same system, which meant we'd eventually need more agents on each host. A separate exporter for postgres metrics, another for nginx, maybe more after that. Each one is another thing to deploy, another thing to update, another thing to break in a subtly different way.",[71,97,98],{},"My colleague and I had been running in parallel on this. He was working through the multiple-exporter approach — the established path, a separate binary per signal type. I'd been looking at Grafana Alloy, which promised a single agent that could handle metrics, logs, and traces from one deployed process.",[71,100,101],{},"The question was whether Alloy was actually ready to build on.",[86,103],{},[89,105,107],{"id":106},"what-alloy-is","What Alloy Is",[71,109,110,111,115],{},"Grafana Alloy is Grafana Labs' open-source observability agent, positioned as the successor to Grafana Agent Flow. It's built around a pipeline model: you define sources, processors, and exporters as typed components and wire them together in ",[112,113,114],"code",{},".alloy"," config files.",[71,117,118],{},"When I started working with it, it was still fairly new. The Agent Flow rebranding into Alloy had just stabilised, documentation was still filling in gaps, and community examples were sparse. You were going to hit sharp edges. But the direction seemed clearly right — one agent, multiple signals, explicit pipelines.",[71,120,121],{},"What made it compelling on paper:",[123,124,125,132,135,138],"ul",{},[126,127,128,131],"li",{},[112,129,130],{},"prometheus.exporter.unix"," replicates Node Exporter's collectors without a separate binary",[126,133,134],{},"First-class support for OpenTelemetry receivers and exporters",[126,136,137],{},"Composable pipeline configs where data flow is visible and readable",[126,139,140],{},"Application metric endpoints (postgres, nginx, etc.) accessible as pipeline components",[71,142,143],{},"The config model is clean. 
Here's a simple pipeline — collect host metrics, send to Prometheus:",[145,146,151],"pre",{"className":147,"code":148,"language":149,"meta":150,"style":150},"language-alloy shiki shiki-themes vitesse-light","prometheus.exporter.unix \"localhost\" {\n  set_collectors = [\"cpu\", \"meminfo\", \"diskstats\", \"filesystem\", \"netdev\", \"loadavg\"]\n}\n\nprometheus.scrape \"node\" {\n  targets    = prometheus.exporter.unix.localhost.targets\n  forward_to = [prometheus.remote_write.default.receiver]\n}\n\nprometheus.remote_write \"default\" {\n  endpoint {\n    url = \"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\"\n  }\n}\n","alloy","",[112,152,153,160,165,170,176,181,186,192,197,202,208,214,220,226],{"__ignoreMap":150},[154,155,157],"span",{"class":156,"line":12},"line",[154,158,159],{},"prometheus.exporter.unix \"localhost\" {\n",[154,161,162],{"class":156,"line":21},[154,163,164],{},"  set_collectors = [\"cpu\", \"meminfo\", \"diskstats\", \"filesystem\", \"netdev\", \"loadavg\"]\n",[154,166,167],{"class":156,"line":30},[154,168,169],{},"}\n",[154,171,172],{"class":156,"line":39},[154,173,175],{"emptyLinePlaceholder":174},true,"\n",[154,177,178],{"class":156,"line":48},[154,179,180],{},"prometheus.scrape \"node\" {\n",[154,182,183],{"class":156,"line":57},[154,184,185],{},"  targets    = prometheus.exporter.unix.localhost.targets\n",[154,187,189],{"class":156,"line":188},7,[154,190,191],{},"  forward_to = [prometheus.remote_write.default.receiver]\n",[154,193,195],{"class":156,"line":194},8,[154,196,169],{},[154,198,200],{"class":156,"line":199},9,[154,201,175],{"emptyLinePlaceholder":174},[154,203,205],{"class":156,"line":204},10,[154,206,207],{},"prometheus.remote_write \"default\" {\n",[154,209,211],{"class":156,"line":210},11,[154,212,213],{},"  endpoint {\n",[154,215,217],{"class":156,"line":216},12,[154,218,219],{},"    url = 
\"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\"\n",[154,221,223],{"class":156,"line":222},13,[154,224,225],{},"  }\n",[154,227,229],{"class":156,"line":228},14,[154,230,169],{},[71,232,233,234,237],{},"That ",[112,235,236],{},"remote_write"," line introduced a problem I didn't see coming.",[86,239],{},[89,241,243],{"id":242},"the-push-vs-pull-problem","The Push vs Pull Problem",[71,245,246,247,250],{},"Prometheus is a pull-based system. It owns the scrape cycle — it reaches out to targets, pulls metrics, and as a side effect of each successful scrape, generates a synthetic ",[112,248,249],{},"up"," metric:",[123,252,253,259],{},[126,254,255,258],{},[112,256,257],{},"up = 1"," — scrape succeeded, host is reachable",[126,260,261,264],{},[112,262,263],{},"up = 0"," — scrape failed, something is wrong",[71,266,267,269,270,273],{},[112,268,249],{}," isn't something your application exports. Prometheus generates it internally, based on whether the HTTP request to ",[112,271,272],{},"\u002Fmetrics"," succeeded. The entire alerting chain from Part 1 depended on it.",[71,275,276,277,280],{},"When Alloy uses ",[112,278,279],{},"prometheus.remote_write",", the data flow reverses. Alloy pushes metrics to Prometheus via HTTP POST. Prometheus sits passively and receives what's sent.",[71,282,283,284,286],{},"And that means Prometheus never scrapes these hosts. So Prometheus never generates ",[112,285,249],{}," for them.",[71,288,289,290,292,293,295],{},"The first time I checked the Prometheus targets page after switching to push-based Alloy, those hosts weren't in the targets list at all. Not showing ",[112,291,263],{}," — not there at all. Prometheus had no scrape config for them; it was just receiving a stream of metrics it hadn't asked for. 
The ",[112,294,249],{}," metric had silently disappeared, and everything built on top of it — every alert, every \"host is down\" notification — had gone with it.",[71,297,298],{},"This was the kind of failure that wouldn't surface immediately. The dashboards still had data. Metrics were still flowing. It would only become obvious the next time a host actually went down and nobody got paged.",[300,301,303],"h3",{"id":302},"the-workarounds-i-tried","The Workarounds I Tried",[71,305,306],{},[74,307,308],{},"Heartbeat metric from Alloy's internal health",[71,310,311,312,316],{},"Alloy exposes internal component status metrics. You can check if the pipeline is running. But this only tells you Alloy is alive on the server side — it says nothing about whether the ",[313,314,315],"em",{},"host"," is reachable. A host could be completely unreachable and Alloy's own health metrics would look fine from Prometheus's perspective, because Prometheus was never reaching out to check.",[71,318,319],{},[74,320,321],{},"Staleness detection via timestamp",[71,323,324],{},"If a host stops pushing, its metrics go stale. You can detect this:",[145,326,330],{"className":327,"code":328,"language":329,"meta":150,"style":150},"language-promql shiki shiki-themes vitesse-light","(time() - max by (instance) (timestamp(node_cpu_seconds_total))) > 120\n","promql",[112,331,332],{"__ignoreMap":150},[154,333,334],{"class":156,"line":12},[154,335,328],{},[71,337,338,339,341],{},"This technically works. But it's fragile — dependent on a specific metric being present and recently written, prone to false positives from remote_write buffer lag or brief network hiccups. And it means rewriting every alert and dashboard around staleness rather than the clean binary ",[112,340,249],{}," signal. 
It felt like building on sand.",[300,343,345],{"id":344},"the-actual-fix","The Actual Fix",[71,347,348,349],{},"After long enough on the workarounds, the right answer was simpler: ",[74,350,351],{},"keep Prometheus pulling, just pull from Alloy's HTTP endpoint instead of a standalone Node Exporter binary.",[71,353,354,355,358],{},"Alloy exposes an HTTP API on port ",[112,356,357],{},"12345",". Every component that produces metrics is accessible at a path under that API:",[145,360,365],{"className":361,"code":363,"language":364},[362],"language-text","http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n","text",[112,366,363],{"__ignoreMap":150},[71,368,369,370,372],{},"This is a plain HTTP endpoint serving Prometheus text format — exactly what Node Exporter served on port 9100. Prometheus can scrape it exactly like any other target. When it does, it generates ",[112,371,249],{},". Everything from Part 1 works without modification.",[71,374,375,376,379,380,382],{},"The Alloy config on each host becomes simpler, not more complex — no ",[112,377,378],{},"prometheus.scrape",", no ",[112,381,279],{},":",[145,384,386],{"className":147,"code":385,"language":149,"meta":150,"style":150},"prometheus.exporter.unix \"localhost\" {\n  set_collectors = [\n    \"cpu\",\n    \"meminfo\",\n    \"diskstats\",\n    \"filesystem\",\n    \"netdev\",\n    \"loadavg\",\n    \"uname\",\n    \"time\",\n    \"systemd\",\n    \"processes\",\n  ]\n}\n\n\u002F\u002F Prometheus will pull from:\n\u002F\u002F http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n",[112,387,388,392,397,402,407,412,417,422,427,432,437,442,447,452,456,461,467],{"__ignoreMap":150},[154,389,390],{"class":156,"line":12},[154,391,159],{},[154,393,394],{"class":156,"line":21},[154,395,396],{},"  set_collectors = [\n",[154,398,399],{"class":156,"line":30},[154,400,401],{},"    
\"cpu\",\n",[154,403,404],{"class":156,"line":39},[154,405,406],{},"    \"meminfo\",\n",[154,408,409],{"class":156,"line":48},[154,410,411],{},"    \"diskstats\",\n",[154,413,414],{"class":156,"line":57},[154,415,416],{},"    \"filesystem\",\n",[154,418,419],{"class":156,"line":188},[154,420,421],{},"    \"netdev\",\n",[154,423,424],{"class":156,"line":194},[154,425,426],{},"    \"loadavg\",\n",[154,428,429],{"class":156,"line":199},[154,430,431],{},"    \"uname\",\n",[154,433,434],{"class":156,"line":204},[154,435,436],{},"    \"time\",\n",[154,438,439],{"class":156,"line":210},[154,440,441],{},"    \"systemd\",\n",[154,443,444],{"class":156,"line":216},[154,445,446],{},"    \"processes\",\n",[154,448,449],{"class":156,"line":222},[154,450,451],{},"  ]\n",[154,453,454],{"class":156,"line":228},[154,455,169],{},[154,457,459],{"class":156,"line":458},15,[154,460,175],{"emptyLinePlaceholder":174},[154,462,464],{"class":156,"line":463},16,[154,465,466],{},"\u002F\u002F Prometheus will pull from:\n",[154,468,470],{"class":156,"line":469},17,[154,471,472],{},"\u002F\u002F http:\u002F\u002F\u003Chost>:12345\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\n",[71,474,475,476,478,479,482,483,485],{},"The target file for each Alloy host points to port ",[112,477,357],{}," and uses ",[112,480,481],{},"__metrics_path__"," — a special Prometheus label that overrides the default ",[112,484,272],{}," scrape path — to point at the correct component endpoint:",[145,487,491],{"className":488,"code":489,"language":490,"meta":150,"style":150},"language-json shiki shiki-themes vitesse-light","[\n  {\n    \"targets\": [\"10.200.3.23:12345\"],\n    \"labels\": {\n      \"hostname\": \"cloud-network-3\",\n      \"environment\": \"production\",\n      \"maintainers\": \"admin@domain.com\",\n      \"__metrics_path__\": \"\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics\"\n    }\n  
}\n]\n","json",[112,492,493,499,504,534,548,571,591,611,629,634,638],{"__ignoreMap":150},[154,494,495],{"class":156,"line":12},[154,496,498],{"class":497},"sYZai","[\n",[154,500,501],{"class":156,"line":21},[154,502,503],{"class":497},"  {\n",[154,505,506,510,514,517,519,522,525,529,531],{"class":156,"line":30},[154,507,509],{"class":508},"s61at","    \"",[154,511,513],{"class":512},"su6XF","targets",[154,515,516],{"class":508},"\"",[154,518,382],{"class":497},[154,520,521],{"class":497}," [",[154,523,516],{"class":524},"sSP4y",[154,526,528],{"class":527},"spphp","10.200.3.23:12345",[154,530,516],{"class":524},[154,532,533],{"class":497},"],\n",[154,535,536,538,541,543,545],{"class":156,"line":39},[154,537,509],{"class":508},[154,539,540],{"class":512},"labels",[154,542,516],{"class":508},[154,544,382],{"class":497},[154,546,547],{"class":497}," {\n",[154,549,550,553,556,558,560,563,566,568],{"class":156,"line":48},[154,551,552],{"class":508},"      \"",[154,554,555],{"class":512},"hostname",[154,557,516],{"class":508},[154,559,382],{"class":497},[154,561,562],{"class":524}," 
\"",[154,564,565],{"class":527},"cloud-network-3",[154,567,516],{"class":524},[154,569,570],{"class":497},",\n",[154,572,573,575,578,580,582,584,587,589],{"class":156,"line":57},[154,574,552],{"class":508},[154,576,577],{"class":512},"environment",[154,579,516],{"class":508},[154,581,382],{"class":497},[154,583,562],{"class":524},[154,585,586],{"class":527},"production",[154,588,516],{"class":524},[154,590,570],{"class":497},[154,592,593,595,598,600,602,604,607,609],{"class":156,"line":188},[154,594,552],{"class":508},[154,596,597],{"class":512},"maintainers",[154,599,516],{"class":508},[154,601,382],{"class":497},[154,603,562],{"class":524},[154,605,606],{"class":527},"admin@domain.com",[154,608,516],{"class":524},[154,610,570],{"class":497},[154,612,613,615,617,619,621,623,626],{"class":156,"line":194},[154,614,552],{"class":508},[154,616,481],{"class":512},[154,618,516],{"class":508},[154,620,382],{"class":497},[154,622,562],{"class":524},[154,624,625],{"class":527},"\u002Fapi\u002Fv0\u002Fcomponent\u002Fprometheus.exporter.unix.localhost\u002Fmetrics",[154,627,628],{"class":524},"\"\n",[154,630,631],{"class":156,"line":199},[154,632,633],{"class":497},"    }\n",[154,635,636],{"class":156,"line":204},[154,637,225],{"class":497},[154,639,640],{"class":156,"line":210},[154,641,642],{"class":497},"]\n",[71,644,645,646,648],{},"Prometheus scrapes it, gets back standard metrics, generates ",[112,647,257],{},", stores everything as normal. The alerting chain is intact.",[86,650],{},[89,652,654],{"id":653},"getting-application-metrics-in-the-part-that-nearly-broke-alloy-for-us","Getting Application Metrics In: The Part That Nearly Broke Alloy for Us",[71,656,657],{},"The unix exporter was the straightforward part. The harder part was getting application-level metrics — specifically postgres and nginx — through Alloy rather than through separate exporter binaries.",[71,659,660],{},"Alloy has built-in components for both. 
The postgres one:",[145,662,664],{"className":147,"code":663,"language":149,"meta":150,"style":150},"prometheus.exporter.postgres \"db\" {\n  data_source_names = [\"postgresql:\u002F\u002Fuser:pass@localhost:5432\u002Fmydb?sslmode=disable\"]\n}\n",[112,665,666,671,676],{"__ignoreMap":150},[154,667,668],{"class":156,"line":12},[154,669,670],{},"prometheus.exporter.postgres \"db\" {\n",[154,672,673],{"class":156,"line":21},[154,674,675],{},"  data_source_names = [\"postgresql:\u002F\u002Fuser:pass@localhost:5432\u002Fmydb?sslmode=disable\"]\n",[154,677,678],{"class":156,"line":30},[154,679,169],{},[71,681,682,683,686,687,690],{},"I tried the nginx equivalent first. Alloy has ",[112,684,685],{},"prometheus.exporter.nginx",", which connects to nginx's ",[112,688,689],{},"stub_status"," endpoint and pulls metrics from it. I set it up, checked the output — nothing. No metrics, no errors, just silence. I spent time on it, checked the nginx config, checked the Alloy config, tried different approaches. At some point I started thinking seriously about just installing the standalone nginx exporter and giving up on Alloy for application metrics entirely.",[71,692,693],{},"Before doing that, I tried the postgres component instead. It worked immediately — metrics flowing through on the first attempt. That was the signal I needed. If postgres worked, nginx should work too. The problem wasn't Alloy. Something was wrong with my specific nginx setup.",[71,695,696,697,699,700,702,703,706],{},"I went back to nginx, looked more carefully at the ",[112,698,689],{}," configuration, and found it. The endpoint wasn't properly enabled — the nginx config had the ",[112,701,689],{}," block but it was only accessible from ",[112,704,705],{},"127.0.0.1",", and Alloy was trying to reach it in a way that wasn't matching that restriction. A small fix, and nginx metrics started flowing.",[71,708,709],{},"The near-ditch was worth it. 
Running separate exporters for every application would have meant my colleague's approach and my approach converging on the same place — a proliferation of binaries per host. The whole point of Alloy was avoiding that.",[86,711],{},[89,713,715],{"id":714},"why-alloy-over-multiple-exporters","Why Alloy Over Multiple Exporters",[71,717,718],{},"My colleague's multiple-exporter approach was working. It's the established path, well-documented, stable. The case for it isn't wrong.",[71,720,721],{},"But the case for Alloy is better for where we're going. The moment you want logs — which was always the plan — you need another agent anyway. If you're already running Node Exporter, postgres exporter, and nginx exporter, you're at three binaries per host. Adding a log agent makes four. Each one needs to be deployed, configured, updated, and monitored independently.",[71,723,724],{},"With Alloy, adding logs is another component in the same config file on the same process. Adding traces is the same. The operational footprint stays at one agent per host regardless of how many signals you're collecting.",[71,726,727],{},"There's also the matter of the pipeline model. When you have a single config that describes exactly what data is flowing where, debugging is straightforward. With four separate agents running independently, understanding the full picture requires checking four separate processes.",[71,729,730],{},"The sharp edges were real — the push vs pull problem cost me real time, the nginx issue nearly derailed the whole approach. But those were solvable problems. The structural limitation of multiple exporters — complexity that compounds as you add signals — isn't.",[86,732],{},[89,734,736],{"id":735},"what-changed-in-grafana","What Changed in Grafana",[71,738,739,741],{},[112,740,130],{}," uses the same metric names as standalone Node Exporter — it's built on the same underlying collectors. 
Every PromQL query from Part 1 works unchanged.",[71,743,744,745,747,748,751],{},"The one real change is how ",[112,746,249],{}," is queried. With Alloy, the job label becomes ",[112,749,750],{},"\"alloy\"",", and filtering is done through the richer label set on each target — environment, priority, instance — rather than anything tied to a port or exporter binary. For example, the fleet status panel:",[145,753,755],{"className":327,"code":754,"language":329,"meta":150,"style":150},"count(up{job=\"alloy\", priority=~\"${priority}\", environment=~\"${environment}\", instance!~\"^localhost.*\", instance=~\"${instance}\"} == 1) or vector(0)\n",[112,756,757],{"__ignoreMap":150},[154,758,759],{"class":156,"line":12},[154,760,754],{},[71,762,763,764,767],{},"The ",[112,765,766],{},"or vector(0)"," ensures the panel returns zero rather than no data when nothing matches — a small thing that matters when you're staring at a dashboard at 2am wondering if the query is broken or the hosts are genuinely all down.",[86,769],{},[89,771,773],{"id":772},"where-this-left-off","Where This Left Off",[71,775,776],{},"By the end of Phase 2, the stack had a single agent per host handling metrics across system and application layers, pull-based scraping preserved so all the alerting machinery from Phase 1 still worked, and OTel ports open on the Alloy container for what came next.",[71,778,779],{},"The next gap was observability beyond metrics. Host CPU and memory tell you a machine is struggling; they don't tell you why, or what a request was doing when it failed. 
That meant Loki for logs and Tempo for distributed traces.",[71,781,782],{},[74,783,784],{},"Next:",[82,786],{"path":787},"\u002Fblog\u002Flgtm-stack\u002Fpart-3",[789,790,791],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sYZai, html code.shiki .sYZai{--shiki-default:#999999}html pre.shiki code .s61at, html code.shiki .s61at{--shiki-default:#99841877}html pre.shiki code .su6XF, html code.shiki .su6XF{--shiki-default:#998418}html pre.shiki code .sSP4y, html code.shiki .sSP4y{--shiki-default:#B5695977}html pre.shiki code .spphp, html code.shiki .spphp{--shiki-default:#B56959}",{"title":150,"searchDepth":21,"depth":21,"links":793},[794,795,796,800,801,802,803],{"id":91,"depth":21,"text":92},{"id":106,"depth":21,"text":107},{"id":242,"depth":21,"text":243,"children":797},[798,799],{"id":302,"depth":30,"text":303},{"id":344,"depth":30,"text":345},{"id":653,"depth":21,"text":654},{"id":714,"depth":21,"text":715},{"id":735,"depth":21,"text":736},{"id":772,"depth":21,"text":773},"Blog","2026-05-18","Why we replaced individual exporters with Grafana Alloy, why push-based metrics silently broke our alerting, and what it took to figure that out.","md",{},"\u002Fblog\u002Flgtm-stack\u002Fpart-2","11 
min",{"title":63,"description":806},"blog\u002Flgtm-stack\u002Fpart-2",[814,815,816,817,818],"Monitoring","Grafana","Alloy","Prometheus","DevOps","\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-2.png","5rTJAJ6R9qZvwa3_G-JIs2egK0cngWF_IBU66dyKnZk",{"id":822,"title":823,"body":824,"category":804,"date":1868,"description":1869,"extension":807,"meta":1870,"navigation":174,"path":84,"readTime":1871,"seo":1872,"stem":1873,"tags":1874,"thumbnail":1875,"__hash__":1876},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-1.md","Building a Production Monitoring Stack from Scratch — Part 1: Prometheus, Grafana, Node Exporter & AlertManager",{"type":65,"value":825,"toc":1856},[826,835,837,841,844,847,850,853,855,859,862,920,931,934,940,947,949,953,959,1007,1018,1021,1028,1159,1231,1242,1244,1250,1256,1268,1274,1289,1295,1297,1301,1304,1307,1312,1321,1326,1335,1345,1370,1376,1385,1396,1407,1409,1413,1416,1419,1643,1658,1661,1732,1742,1744,1751,1761,1764,1813,1820,1829,1832,1834,1836,1839,1842,1845,1848,1851,1853],[68,827,828],{},[71,829,830,77,832],{},[74,831,76],{},[74,833,834],{},"Part 1 of 4",[86,836],{},[89,838,840],{"id":839},"the-problem-with-nagiosxi","The Problem with NagiosXI",[71,842,843],{},"We had been running NagiosXI for a while. It worked, in the way that something can work while also quietly frustrating everyone who touches it. It checked hosts, fired alerts, and we had even wired up scripts to push notifications to Mattermost. But the gaps were real and getting harder to ignore.",[71,845,846],{},"It was a paid solution running on our own infrastructure — a licensing cost that got harder to justify every time someone asked for something it couldn't do. OpenTelemetry support was essentially nonexistent. Application log aggregation wasn't on the table at all. Every extension we had made through plugins had taken us about as far as plugins could go.",[71,848,849],{},"The conversation about replacing it had been happening for a while. 
Eventually it stopped being a conversation and became a project. The goal: a full open-source replacement covering host metrics, alerting, log aggregation, and eventually distributed tracing. One cohesive system instead of a patchwork.",[71,851,852],{},"I took on the work. Phase 1 was about standing up the foundation and proving it could actually replace what NagiosXI was doing before we went further.",[86,854],{},[89,856,858],{"id":857},"the-starting-point","The Starting Point",[71,860,861],{},"The first week was spent getting four things working together:",[863,864,865,878],"table",{},[866,867,868],"thead",{},[869,870,871,875],"tr",{},[872,873,874],"th",{},"Component",[872,876,877],{},"Role",[879,880,881,891,900,910],"tbody",{},[869,882,883,888],{},[884,885,886],"td",{},[74,887,817],{},[884,889,890],{},"Time-series metrics database and scraping engine",[869,892,893,897],{},[884,894,895],{},[74,896,815],{},[884,898,899],{},"Visualization and dashboarding",[869,901,902,907],{},[884,903,904],{},[74,905,906],{},"Node Exporter",[884,908,909],{},"Host-level metrics (CPU, memory, disk, network)",[869,911,912,917],{},[884,913,914],{},[74,915,916],{},"AlertManager",[884,918,919],{},"Alert routing, grouping, and silencing",[71,921,922,923,926,927,930],{},"The deployment runs across two nodes — ",[74,924,925],{},"mon-node-a"," for data collection (Prometheus, AlertManager, and agent-side components) and ",[74,928,929],{},"mon-node-b"," for presentation (Grafana). Keeping the presentation layer separate from the data layer was a deliberate decision: if we need to update or rebuild Grafana, it doesn't touch Prometheus, and vice versa. 
Everything runs in Docker.",[71,932,933],{},"How these pieces talk to each other matters, because one architectural choice here — pull vs push — ended up being the central problem in Part 2.",[145,935,938],{"className":936,"code":937,"language":364},[362],"[ Linux Hosts ]\n      |\n  node_exporter  (runs on each host, exposes \u002Fmetrics on port 9100)\n      |\n      ↓  (pull — Prometheus reaches out every 15s)\n[ Prometheus ]  ←── scrape_configs + alerting_rules\n      |\n      ├──→ [ AlertManager ]\n      |           |\n      |           └──→ Email \u002F Mattermost\n      |\n[ Grafana ]  ←── queries Prometheus via PromQL\n",[112,939,937],{"__ignoreMap":150},[71,941,942,943,946],{},"Prometheus is ",[74,944,945],{},"pull-based",". It reaches out to each target on a schedule and pulls metrics. The targets don't know Prometheus exists — they just expose an HTTP endpoint and wait. This distinction ends up mattering a lot.",[86,948],{},[89,950,952],{"id":951},"getting-host-metrics-in","Getting Host Metrics In",[71,954,955,956,958],{},"Node Exporter is a lightweight binary that runs on each host and exposes hardware and OS-level metrics at a ",[112,957,272],{}," HTTP endpoint. 
Deploy one per machine, point Prometheus at it, done.",[145,960,964],{"className":961,"code":962,"language":963,"meta":150,"style":150},"language-bash shiki shiki-themes vitesse-light","# Verify it's running\ncurl http:\u002F\u002F\u003Chost-ip>:9100\u002Fmetrics | head -50\n","bash",[112,965,966,972],{"__ignoreMap":150},[154,967,968],{"class":156,"line":12},[154,969,971],{"class":970},"s8zF2","# Verify it's running\n",[154,973,974,978,981,985,988,991,994,997,1000,1003],{"class":156,"line":21},[154,975,977],{"class":976},"sySUi","curl",[154,979,980],{"class":527}," http:\u002F\u002F",[154,982,984],{"class":983},"si04Y","\u003C",[154,986,987],{"class":527},"host-i",[154,989,71],{"class":990},"suHK_",[154,992,993],{"class":983},">",[154,995,996],{"class":527},":9100\u002Fmetrics",[154,998,999],{"class":983}," |",[154,1001,1002],{"class":976}," head",[154,1004,1006],{"class":1005},"sEi1f"," -50\n",[71,1008,1009,1010,1013,1014,1017],{},"If you see ",[112,1011,1012],{},"# HELP"," and ",[112,1015,1016],{},"# TYPE"," blocks followed by metric lines, you're good.",[71,1019,1020],{},"Getting the metrics in wasn't the hard part. The harder part was getting them in cleanly, with enough context attached that alerts and dashboards would actually be useful. A raw IP address as the target label tells you very little when something breaks at 2am.",[71,1022,1023,1024,1027],{},"The solution was file-based service discovery with rich labels. 
Instead of listing targets directly in ",[112,1025,1026],{},"prometheus.yml",", Prometheus watches a directory of JSON files:",[145,1029,1031],{"className":488,"code":1030,"language":490,"meta":150,"style":150},"[\n  {\n    \"targets\": [\"192.168.0.101:9100\"],\n    \"labels\": {\n      \"hostname\": \"web-server-01\",\n      \"environment\": \"production\",\n      \"location\": \"Primary Rack\",\n      \"maintainers\": \"admin@domain.com\"\n    }\n  }\n]\n",[112,1032,1033,1037,1041,1062,1074,1093,1111,1131,1147,1151,1155],{"__ignoreMap":150},[154,1034,1035],{"class":156,"line":12},[154,1036,498],{"class":497},[154,1038,1039],{"class":156,"line":21},[154,1040,503],{"class":497},[154,1042,1043,1045,1047,1049,1051,1053,1055,1058,1060],{"class":156,"line":30},[154,1044,509],{"class":508},[154,1046,513],{"class":512},[154,1048,516],{"class":508},[154,1050,382],{"class":497},[154,1052,521],{"class":497},[154,1054,516],{"class":524},[154,1056,1057],{"class":527},"192.168.0.101:9100",[154,1059,516],{"class":524},[154,1061,533],{"class":497},[154,1063,1064,1066,1068,1070,1072],{"class":156,"line":39},[154,1065,509],{"class":508},[154,1067,540],{"class":512},[154,1069,516],{"class":508},[154,1071,382],{"class":497},[154,1073,547],{"class":497},[154,1075,1076,1078,1080,1082,1084,1086,1089,1091],{"class":156,"line":48},[154,1077,552],{"class":508},[154,1079,555],{"class":512},[154,1081,516],{"class":508},[154,1083,382],{"class":497},[154,1085,562],{"class":524},[154,1087,1088],{"class":527},"web-server-01",[154,1090,516],{"class":524},[154,1092,570],{"class":497},[154,1094,1095,1097,1099,1101,1103,1105,1107,1109],{"class":156,"line":57},[154,1096,552],{"class":508},[154,1098,577],{"class":512},[154,1100,516],{"class":508},[154,1102,382],{"class":497},[154,1104,562],{"class":524},[154,1106,586],{"class":527},[154,1108,516],{"class":524},[154,1110,570],{"class":497},[154,1112,1113,1115,1118,1120,1122,1124,1127,1129],{"class":156,"line":188},[154,1114,552],{"class":508},[154,111
6,1117],{"class":512},"location",[154,1119,516],{"class":508},[154,1121,382],{"class":497},[154,1123,562],{"class":524},[154,1125,1126],{"class":527},"Primary Rack",[154,1128,516],{"class":524},[154,1130,570],{"class":497},[154,1132,1133,1135,1137,1139,1141,1143,1145],{"class":156,"line":194},[154,1134,552],{"class":508},[154,1136,597],{"class":512},[154,1138,516],{"class":508},[154,1140,382],{"class":497},[154,1142,562],{"class":524},[154,1144,606],{"class":527},[154,1146,628],{"class":524},[154,1148,1149],{"class":156,"line":199},[154,1150,633],{"class":497},[154,1152,1153],{"class":156,"line":204},[154,1154,225],{"class":497},[154,1156,1157],{"class":156,"line":210},[154,1158,642],{"class":497},[145,1160,1164],{"className":1161,"code":1162,"language":1163,"meta":150,"style":150},"language-yaml shiki shiki-themes vitesse-light","# prometheus.yml\nscrape_configs:\n  - job_name: \"node_exporter\"\n    file_sd_configs:\n      - files:\n          - \u002Fetc\u002Fprometheus\u002Ftargets\u002F*.json\n        refresh_interval: 30s\n","yaml",[112,1165,1166,1171,1179,1196,1203,1213,1221],{"__ignoreMap":150},[154,1167,1168],{"class":156,"line":12},[154,1169,1170],{"class":970},"# prometheus.yml\n",[154,1172,1173,1176],{"class":156,"line":21},[154,1174,1175],{"class":512},"scrape_configs",[154,1177,1178],{"class":497},":\n",[154,1180,1181,1184,1187,1189,1191,1194],{"class":156,"line":30},[154,1182,1183],{"class":497},"  -",[154,1185,1186],{"class":512}," job_name",[154,1188,382],{"class":497},[154,1190,562],{"class":524},[154,1192,1193],{"class":527},"node_exporter",[154,1195,628],{"class":524},[154,1197,1198,1201],{"class":156,"line":39},[154,1199,1200],{"class":512},"    file_sd_configs",[154,1202,1178],{"class":497},[154,1204,1205,1208,1211],{"class":156,"line":48},[154,1206,1207],{"class":497},"      -",[154,1209,1210],{"class":512}," files",[154,1212,1178],{"class":497},[154,1214,1215,1218],{"class":156,"line":57},[154,1216,1217],{"class":497},"          
-",[154,1219,1220],{"class":527}," \u002Fetc\u002Fprometheus\u002Ftargets\u002F*.json\n",[154,1222,1223,1226,1228],{"class":156,"line":188},[154,1224,1225],{"class":512},"        refresh_interval",[154,1227,382],{"class":497},[154,1229,1230],{"class":527}," 30s\n",[71,1232,1233,1234,1237,1238,1241],{},"Drop a file in, get a monitored host within 30 seconds. No reload required. The labels on each target flow through to every metric scraped from that host — which means they're available in alert annotations, in Grafana, everywhere. When ",[112,1235,1236],{},"HostDown"," fires, the alert can say ",[313,1239,1240],{},"which"," host, in which environment, and who to contact. That's the payoff.",[86,1243],{},[89,1245,763,1247,1249],{"id":1246},"the-up-metric",[112,1248,249],{}," Metric",[71,1251,1252,1253,1255],{},"One of Prometheus's built-in synthetic metrics is ",[112,1254,249],{},". For every scrape target:",[123,1257,1258,1263],{},[126,1259,1260,1262],{},[112,1261,257],{}," — scrape succeeded",[126,1264,1265,1267],{},[112,1266,263],{}," — scrape failed",[71,1269,1270,1271,1273],{},"This is the most fundamental health signal in the stack. Everything else — CPU, memory, disk — is meaningless if you can't even reach the host. And because ",[112,1272,249],{}," carries all the labels from your target file, you can immediately see which host is down, in which environment.",[145,1275,1277],{"className":327,"code":1276,"language":329,"meta":150,"style":150},"# All down hosts right now\nup{job=\"node_exporter\"} == 0\n",[112,1278,1279,1284],{"__ignoreMap":150},[154,1280,1281],{"class":156,"line":12},[154,1282,1283],{},"# All down hosts right now\n",[154,1285,1286],{"class":156,"line":21},[154,1287,1288],{},"up{job=\"node_exporter\"} == 0\n",[71,1290,1291,1292,1294],{},"I keep coming back to ",[112,1293,249],{}," throughout this series because it's also where things can silently break if you change the architecture carelessly. 
More on that in Part 2.",[86,1296],{},[89,1298,1300],{"id":1299},"dashboards","Dashboards",[71,1302,1303],{},"Grafana connects to Prometheus as a data source and queries it via PromQL. The community dashboards are easy to import and useful for getting started, but building your own is worth doing because it forces you to understand exactly what you're looking at.",[71,1305,1306],{},"The core panels and the queries behind them:",[71,1308,1309],{},[74,1310,1311],{},"CPU Usage (%)",[145,1313,1315],{"className":327,"code":1314,"language":329,"meta":150,"style":150},"100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)\n",[112,1316,1317],{"__ignoreMap":150},[154,1318,1319],{"class":156,"line":12},[154,1320,1314],{},[71,1322,1323],{},[74,1324,1325],{},"Memory Usage (%)",[145,1327,1329],{"className":327,"code":1328,"language":329,"meta":150,"style":150},"(1 - (node_memory_MemAvailable_bytes \u002F node_memory_MemTotal_bytes)) * 100\n",[112,1330,1331],{"__ignoreMap":150},[154,1332,1333],{"class":156,"line":12},[154,1334,1328],{},[71,1336,1337,1340,1341,1344],{},[74,1338,1339],{},"Disk Usage (%)"," — the ",[112,1342,1343],{},"fstype"," filter excludes Docker overlays and tmpfs mounts that inflate results",[145,1346,1348],{"className":327,"code":1347,"language":329,"meta":150,"style":150},"(1 - (\n  node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay\"} \u002F\n  node_filesystem_size_bytes{fstype!~\"tmpfs|overlay\"}\n)) * 100\n",[112,1349,1350,1355,1360,1365],{"__ignoreMap":150},[154,1351,1352],{"class":156,"line":12},[154,1353,1354],{},"(1 - (\n",[154,1356,1357],{"class":156,"line":21},[154,1358,1359],{},"  node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay\"} \u002F\n",[154,1361,1362],{"class":156,"line":30},[154,1363,1364],{},"  node_filesystem_size_bytes{fstype!~\"tmpfs|overlay\"}\n",[154,1366,1367],{"class":156,"line":39},[154,1368,1369],{},")) * 100\n",[71,1371,1372,1375],{},[74,1373,1374],{},"Fleet status"," — a stat panel showing every 
host's current state",[145,1377,1379],{"className":327,"code":1378,"language":329,"meta":150,"style":150},"up{job=\"node_exporter\"}\n",[112,1380,1381],{"__ignoreMap":150},[154,1382,1383],{"class":156,"line":12},[154,1384,1378],{},[71,1386,1387,1388,1391,1392,1395],{},"Value mappings: ",[112,1389,1390],{},"1"," → 🟢 UP, ",[112,1393,1394],{},"0"," → 🔴 DOWN.",[71,1397,1398,1399,1402,1403,1406],{},"Adding a dashboard variable for ",[112,1400,1401],{},"instance"," — ",[112,1404,1405],{},"label_values(up{job=\"node_exporter\"}, instance)"," — gives you a dropdown to filter to a specific host or view the whole fleet. That one change makes the dashboard genuinely useful for day-to-day operations.",[86,1408],{},[89,1410,1412],{"id":1411},"alerting","Alerting",[71,1414,1415],{},"Prometheus evaluates alerting rules and forwards firing alerts to AlertManager. AlertManager handles the business logic: who gets notified, when, how often, and what to suppress.",[71,1417,1418],{},"The rules themselves live in separate YAML files:",[145,1420,1422],{"className":1161,"code":1421,"language":1163,"meta":150,"style":150},"groups:\n  - name: node_exporter_alerts\n    rules:\n\n      - alert: HostDown\n        expr: up{job=\"node_exporter\"} == 0\n        for: 2m\n        labels:\n          severity: critical\n        annotations:\n          summary: \"Host {{ $labels.instance }} is down\"\n          description: >\n            {{ $labels.hostname }} has been unreachable for more than 2 minutes.\n            Maintainers: {{ $labels.maintainers }}\n\n      - alert: HighCPUUsage\n        expr: >\n          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 85\n        for: 5m\n        labels:\n          severity: warning\n        annotations:\n          summary: \"High CPU on {{ $labels.instance }}\"\n          description: >\n            CPU on {{ $labels.hostname }} has been above 85% for 5 minutes.\n            Current: {{ $value | printf \"%.1f\" 
}}%\n",[112,1423,1424,1431,1443,1450,1454,1466,1476,1486,1493,1503,1510,1524,1535,1540,1545,1549,1560,1568,1574,1584,1591,1601,1608,1622,1631,1637],{"__ignoreMap":150},[154,1425,1426,1429],{"class":156,"line":12},[154,1427,1428],{"class":512},"groups",[154,1430,1178],{"class":497},[154,1432,1433,1435,1438,1440],{"class":156,"line":21},[154,1434,1183],{"class":497},[154,1436,1437],{"class":512}," name",[154,1439,382],{"class":497},[154,1441,1442],{"class":527}," node_exporter_alerts\n",[154,1444,1445,1448],{"class":156,"line":30},[154,1446,1447],{"class":512},"    rules",[154,1449,1178],{"class":497},[154,1451,1452],{"class":156,"line":39},[154,1453,175],{"emptyLinePlaceholder":174},[154,1455,1456,1458,1461,1463],{"class":156,"line":48},[154,1457,1207],{"class":497},[154,1459,1460],{"class":512}," alert",[154,1462,382],{"class":497},[154,1464,1465],{"class":527}," HostDown\n",[154,1467,1468,1471,1473],{"class":156,"line":57},[154,1469,1470],{"class":512},"        expr",[154,1472,382],{"class":497},[154,1474,1475],{"class":527}," up{job=\"node_exporter\"} == 0\n",[154,1477,1478,1481,1483],{"class":156,"line":188},[154,1479,1480],{"class":512},"        for",[154,1482,382],{"class":497},[154,1484,1485],{"class":527}," 2m\n",[154,1487,1488,1491],{"class":156,"line":194},[154,1489,1490],{"class":512},"        labels",[154,1492,1178],{"class":497},[154,1494,1495,1498,1500],{"class":156,"line":199},[154,1496,1497],{"class":512},"          severity",[154,1499,382],{"class":497},[154,1501,1502],{"class":527}," critical\n",[154,1504,1505,1508],{"class":156,"line":204},[154,1506,1507],{"class":512},"        annotations",[154,1509,1178],{"class":497},[154,1511,1512,1515,1517,1519,1522],{"class":156,"line":210},[154,1513,1514],{"class":512},"          summary",[154,1516,382],{"class":497},[154,1518,562],{"class":524},[154,1520,1521],{"class":527},"Host {{ $labels.instance }} is 
down",[154,1523,628],{"class":524},[154,1525,1526,1529,1531],{"class":156,"line":216},[154,1527,1528],{"class":512},"          description",[154,1530,382],{"class":497},[154,1532,1534],{"class":1533},"sbBg2"," >\n",[154,1536,1537],{"class":156,"line":222},[154,1538,1539],{"class":527},"            {{ $labels.hostname }} has been unreachable for more than 2 minutes.\n",[154,1541,1542],{"class":156,"line":228},[154,1543,1544],{"class":527},"            Maintainers: {{ $labels.maintainers }}\n",[154,1546,1547],{"class":156,"line":458},[154,1548,175],{"emptyLinePlaceholder":174},[154,1550,1551,1553,1555,1557],{"class":156,"line":463},[154,1552,1207],{"class":497},[154,1554,1460],{"class":512},[154,1556,382],{"class":497},[154,1558,1559],{"class":527}," HighCPUUsage\n",[154,1561,1562,1564,1566],{"class":156,"line":469},[154,1563,1470],{"class":512},[154,1565,382],{"class":497},[154,1567,1534],{"class":1533},[154,1569,1571],{"class":156,"line":1570},18,[154,1572,1573],{"class":527},"          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100) > 85\n",[154,1575,1577,1579,1581],{"class":156,"line":1576},19,[154,1578,1480],{"class":512},[154,1580,382],{"class":497},[154,1582,1583],{"class":527}," 5m\n",[154,1585,1587,1589],{"class":156,"line":1586},20,[154,1588,1490],{"class":512},[154,1590,1178],{"class":497},[154,1592,1594,1596,1598],{"class":156,"line":1593},21,[154,1595,1497],{"class":512},[154,1597,382],{"class":497},[154,1599,1600],{"class":527}," warning\n",[154,1602,1604,1606],{"class":156,"line":1603},22,[154,1605,1507],{"class":512},[154,1607,1178],{"class":497},[154,1609,1611,1613,1615,1617,1620],{"class":156,"line":1610},23,[154,1612,1514],{"class":512},[154,1614,382],{"class":497},[154,1616,562],{"class":524},[154,1618,1619],{"class":527},"High CPU on {{ $labels.instance 
}}",[154,1621,628],{"class":524},[154,1623,1625,1627,1629],{"class":156,"line":1624},24,[154,1626,1528],{"class":512},[154,1628,382],{"class":497},[154,1630,1534],{"class":1533},[154,1632,1634],{"class":156,"line":1633},25,[154,1635,1636],{"class":527},"            CPU on {{ $labels.hostname }} has been above 85% for 5 minutes.\n",[154,1638,1640],{"class":156,"line":1639},26,[154,1641,1642],{"class":527},"            Current: {{ $value | printf \"%.1f\" }}%\n",[71,1644,763,1645,1648,1649,1651,1652,1654,1655,1657],{},[112,1646,1647],{},"for: 2m"," on ",[112,1650,1236],{}," absorbs brief network glitches. Without it, a momentary scrape failure sends an alert. The rich labels on the target — ",[112,1653,555],{},", ",[112,1656,597],{}," — show up directly in the alert annotations.",[71,1659,1660],{},"One AlertManager config worth explaining is the inhibit rule:",[145,1662,1664],{"className":1161,"code":1663,"language":1163,"meta":150,"style":150},"inhibit_rules:\n  - source_match:\n      alertname: \"HostDown\"\n    target_match_re:\n      alertname: \"HighCPUUsage|HighMemoryUsage|DiskSpaceLow\"\n    equal: [\"instance\"]\n",[112,1665,1666,1673,1682,1695,1702,1715],{"__ignoreMap":150},[154,1667,1668,1671],{"class":156,"line":12},[154,1669,1670],{"class":512},"inhibit_rules",[154,1672,1178],{"class":497},[154,1674,1675,1677,1680],{"class":156,"line":21},[154,1676,1183],{"class":497},[154,1678,1679],{"class":512}," source_match",[154,1681,1178],{"class":497},[154,1683,1684,1687,1689,1691,1693],{"class":156,"line":30},[154,1685,1686],{"class":512},"      alertname",[154,1688,382],{"class":497},[154,1690,562],{"class":524},[154,1692,1236],{"class":527},[154,1694,628],{"class":524},[154,1696,1697,1700],{"class":156,"line":39},[154,1698,1699],{"class":512},"    
target_match_re",[154,1701,1178],{"class":497},[154,1703,1704,1706,1708,1710,1713],{"class":156,"line":48},[154,1705,1686],{"class":512},[154,1707,382],{"class":497},[154,1709,562],{"class":524},[154,1711,1712],{"class":527},"HighCPUUsage|HighMemoryUsage|DiskSpaceLow",[154,1714,628],{"class":524},[154,1716,1717,1720,1722,1724,1726,1728,1730],{"class":156,"line":57},[154,1718,1719],{"class":512},"    equal",[154,1721,382],{"class":497},[154,1723,521],{"class":497},[154,1725,516],{"class":524},[154,1727,1401],{"class":527},[154,1729,516],{"class":524},[154,1731,642],{"class":497},[71,1733,1734,1735,1737,1738,1741],{},"When ",[112,1736,1236],{}," fires for a host, AlertManager suppresses the CPU, memory, and disk alerts for that same host. There's no useful signal in a ",[112,1739,1740],{},"HighMemoryUsage"," alert for a machine that isn't reachable. Without this, a single dead host can generate a cascade of noise.",[86,1743],{},[89,1745,763,1747,1750],{"id":1746},[112,1748,1749],{},"last_seen"," Pattern",[71,1752,1753,1754,1757,1758,1760],{},"When a host disappears completely, Prometheus eventually stops having active series data for it. ",[112,1755,1756],{},"up{instance=\"...\"}"," doesn't return ",[112,1759,1394],{}," — it returns nothing, because there's no scrape happening. 
You lose the ability to answer \"when did this thing last check in?\"",[71,1762,1763],{},"A recording rule fixes this by continuously writing a timestamp whenever a host is up:",[145,1765,1767],{"className":1161,"code":1766,"language":1163,"meta":150,"style":150},"groups:\n  - name: recording_rules\n    rules:\n      - record: node_last_seen_timestamp\n        expr: time() * (up{job=\"node_exporter\"} == 1)\n",[112,1768,1769,1775,1786,1792,1804],{"__ignoreMap":150},[154,1770,1771,1773],{"class":156,"line":12},[154,1772,1428],{"class":512},[154,1774,1178],{"class":497},[154,1776,1777,1779,1781,1783],{"class":156,"line":21},[154,1778,1183],{"class":497},[154,1780,1437],{"class":512},[154,1782,382],{"class":497},[154,1784,1785],{"class":527}," recording_rules\n",[154,1787,1788,1790],{"class":156,"line":30},[154,1789,1447],{"class":512},[154,1791,1178],{"class":497},[154,1793,1794,1796,1799,1801],{"class":156,"line":39},[154,1795,1207],{"class":497},[154,1797,1798],{"class":512}," record",[154,1800,382],{"class":497},[154,1802,1803],{"class":527}," node_last_seen_timestamp\n",[154,1805,1806,1808,1810],{"class":156,"line":48},[154,1807,1470],{"class":512},[154,1809,382],{"class":497},[154,1811,1812],{"class":527}," time() * (up{job=\"node_exporter\"} == 1)\n",[71,1814,1815,1816,1819],{},"This writes the current Unix timestamp on every evaluation cycle, but only when ",[112,1817,1818],{},"up == 1",". The filter matters: without it, a failed scrape would record 0 instead of leaving the last timestamp in place. When a host goes dark, the last written value persists in storage. In Grafana:",[145,1821,1823],{"className":327,"code":1822,"language":329,"meta":150,"style":150},"time() - node_last_seen_timestamp\n",[112,1824,1825],{"__ignoreMap":150},[154,1826,1827],{"class":156,"line":12},[154,1828,1822],{},[71,1830,1831],{},"Format as duration and you get: \"last seen 3h 22m ago.\" It's a small thing but it's become one of the most-used panels.",[86,1833],{},[89,1835,773],{"id":772},[71,1837,1838],{},"By the end of the first week, the stack was functionally replacing NagiosXI for host monitoring. 
Prometheus scraping every host every 15 seconds, dashboards showing the fleet, AlertManager routing alerts with inhibit rules and deduplication, recording rules keeping last-seen timestamps for hosts that went dark.",[71,1840,1841],{},"But there was a question I hadn't resolved yet.",[71,1843,1844],{},"Node Exporter is a single-purpose binary — host metrics and nothing else. The moment we wanted logs or traces from these same hosts, we'd need additional agents running alongside it. And adding a host to monitoring still meant four manual steps: SSH in, install Node Exporter, write the target file, reload Prometheus.",[71,1846,1847],{},"My colleague had been working in parallel, exploring the multiple-exporter approach — a separate binary for each signal type. I'd been looking at Grafana Alloy, which promised a single agent that could handle all of it. We hadn't converged yet, and there were real questions about whether Alloy was ready enough to build on.",[71,1849,1850],{},"That's what Part 2 is about.",[82,1852],{"path":809},[789,1854,1855],{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .s8zF2, html code.shiki .s8zF2{--shiki-default:#A0ADA0}html pre.shiki code .sySUi, html code.shiki .sySUi{--shiki-default:#59873A}html pre.shiki code .spphp, html code.shiki .spphp{--shiki-default:#B56959}html pre.shiki code .si04Y, html code.shiki .si04Y{--shiki-default:#AB5959}html pre.shiki code .suHK_, html code.shiki .suHK_{--shiki-default:#393A34}html pre.shiki code .sEi1f, html code.shiki .sEi1f{--shiki-default:#A65E2B}html pre.shiki code 
.sYZai, html code.shiki .sYZai{--shiki-default:#999999}html pre.shiki code .s61at, html code.shiki .s61at{--shiki-default:#99841877}html pre.shiki code .su6XF, html code.shiki .su6XF{--shiki-default:#998418}html pre.shiki code .sSP4y, html code.shiki .sSP4y{--shiki-default:#B5695977}html pre.shiki code .sbBg2, html code.shiki .sbBg2{--shiki-default:#1E754F}",{"title":150,"searchDepth":21,"depth":21,"links":1857},[1858,1859,1860,1861,1863,1864,1865,1867],{"id":839,"depth":21,"text":840},{"id":857,"depth":21,"text":858},{"id":951,"depth":21,"text":952},{"id":1246,"depth":21,"text":1862},"The up Metric",{"id":1299,"depth":21,"text":1300},{"id":1411,"depth":21,"text":1412},{"id":1746,"depth":21,"text":1866},"The last_seen Pattern",{"id":772,"depth":21,"text":773},"2025-01-17","How we migrated from NagiosXI to a modern open-source observability stack — and why getting the foundation right mattered more than I expected.",{},"10 min",{"title":823,"description":1869},"blog\u002Flgtm-stack\u002Fpart-1",[814,815,817,916,818],"\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-1.png","7NeM6IsFhBh6e-n5FcTCZrpdQwLcnU1u-1RFk4aMi1A",{"id":1878,"title":1879,"body":1880,"category":804,"date":2489,"description":2490,"extension":807,"meta":2491,"navigation":174,"path":787,"readTime":1871,"seo":2492,"stem":2493,"tags":2494,"thumbnail":2496,"__hash__":2497},"blog\u002Fblog\u002Flgtm-stack\u002Fpart-3.md","Building a Production Monitoring Stack from Scratch — Part 3: Loki, Tempo & the Full Observability Picture",{"type":65,"value":1881,"toc":2479},[1882,1891,1893,1895,1897,1901,1904,1907,1909,1913,1916,1922,1928,1937,1943,1945,1949,1952,1966,2086,2089,2091,2095,2098,2105,2116,2119,2125,2128,2139,2141,2144,2147,2219,2222,2225,2318,2320,2324,2327,2354,2357,2360,2408,2418,2421,2423,2427,2430,2447,2450,2456,2459,2461,2463,2466,2469,2473,2476],[68,1883,1884],{},[71,1885,1886,77,1888],{},[74,1887,76],{},[74,1889,1890],{},"Part 3 of 
4",[82,1892],{"path":84},[82,1894],{"path":809},[86,1896],{},[89,1898,1900],{"id":1899},"what-was-still-missing","What Was Still Missing",[71,1902,1903],{},"After Parts 1 and 2, host health was solid. Prometheus pulling from Alloy on every node, dashboards showing the fleet, AlertManager firing when something went wrong. But all of that is infrastructure-level visibility — CPU spiking, disk filling, host going dark. It tells you a machine is struggling. It doesn't tell you what was happening inside your applications when it did.",[71,1905,1906],{},"For that you need logs and traces. This part covers adding both — Loki for log aggregation and Tempo for distributed tracing — and how the central Alloy instance on mon-node-a ties all three signal types together.",[86,1908],{},[89,1910,1912],{"id":1911},"two-alloy-roles","Two Alloy Roles",[71,1914,1915],{},"Before getting into Loki and Tempo, it's worth being clear about something that can cause confusion: there are two distinct Alloy deployments in this setup and they do completely different things.",[71,1917,1918,1921],{},[74,1919,1920],{},"Alloy on each enrolled node"," — runs the unix exporter, exposes host metrics on port 12345, gets scraped by Prometheus. This is the pull-based setup from Part 2. Nothing changes here.",[71,1923,1924,1927],{},[74,1925,1926],{},"Central Alloy on mon-node-a"," — runs as a container alongside Prometheus, Loki, and Tempo. Opens OTel endpoints on ports 4317 (gRPC) and 4318 (HTTP). Any instrumented application sends its telemetry here, and Alloy routes each signal type to the right backend.",[71,1929,1930,1931,1933,1934,1936],{},"The separation is clean: node Alloy handles infrastructure signals via pull, central Alloy handles application signals via OTel push. Prometheus only scrapes the node Alloys. 
The ",[112,1932,249],{}," metric concern from Part 2 doesn't apply here — we're not relying on ",[112,1935,249],{}," for application health, only for host availability.",[145,1938,1941],{"className":1939,"code":1940,"language":364},[362],"[ Enrolled Nodes ]\n      |\n  Alloy :12345  (one per node, host metrics)\n      |\n      ↓  PULL\n[ Prometheus ]  →  AlertManager\n      ↑\n      |  remote_write (application metrics only)\n      |\n[ Central Alloy :4317\u002F:4318 ]  ←  instrumented applications (OTel)\n      |\n      ├──→ Tempo      (traces)\n      └──→ Loki       (logs)\n\n[ Grafana ]  ←── queries Prometheus, Loki, Tempo\n",[112,1942,1940],{"__ignoreMap":150},[86,1944],{},[89,1946,1948],{"id":1947},"storage-minio","Storage: MinIO",[71,1950,1951],{},"Both Loki and Tempo need a durable storage backend. In a cloud environment that would be S3. Here, MinIO provides an S3-compatible store running as a container on mon-node-a.",[71,1953,1954,1955,1654,1958,1961,1962,1965],{},"Three buckets: ",[112,1956,1957],{},"loki-data",[112,1959,1960],{},"loki-ruler",", and ",[112,1963,1964],{},"tempo",". 
The entrypoint script pre-creates the directories before MinIO starts — a small thing that saves a confusing startup failure on first run.",[145,1967,1969],{"className":1161,"code":1968,"language":1163,"meta":150,"style":150},"minio:\n  image: minio\u002Fminio:latest\n  environment:\n    - MINIO_ACCESS_KEY=observability\n    - MINIO_SECRET_KEY=supersecret\n  entrypoint:\n    - sh\n    - -euc\n    - |\n      mkdir -p \u002Fdata\u002Ftempo\n      mkdir -p \u002Fdata\u002Floki-data\n      mkdir -p \u002Fdata\u002Floki-ruler\n      minio server \u002Fdata --console-address ':9001'\n  networks:\n    - monitoring\n  volumes:\n    - .\u002Fdata\u002Fminio:\u002Fdata\n",[112,1970,1971,1978,1988,1995,2003,2010,2017,2024,2031,2038,2043,2048,2053,2058,2065,2072,2079],{"__ignoreMap":150},[154,1972,1973,1976],{"class":156,"line":12},[154,1974,1975],{"class":512},"minio",[154,1977,1178],{"class":497},[154,1979,1980,1983,1985],{"class":156,"line":21},[154,1981,1982],{"class":512},"  image",[154,1984,382],{"class":497},[154,1986,1987],{"class":527}," minio\u002Fminio:latest\n",[154,1989,1990,1993],{"class":156,"line":30},[154,1991,1992],{"class":512},"  environment",[154,1994,1178],{"class":497},[154,1996,1997,2000],{"class":156,"line":39},[154,1998,1999],{"class":497},"    -",[154,2001,2002],{"class":527}," MINIO_ACCESS_KEY=observability\n",[154,2004,2005,2007],{"class":156,"line":48},[154,2006,1999],{"class":497},[154,2008,2009],{"class":527}," MINIO_SECRET_KEY=supersecret\n",[154,2011,2012,2015],{"class":156,"line":57},[154,2013,2014],{"class":512},"  entrypoint",[154,2016,1178],{"class":497},[154,2018,2019,2021],{"class":156,"line":188},[154,2020,1999],{"class":497},[154,2022,2023],{"class":527}," sh\n",[154,2025,2026,2028],{"class":156,"line":194},[154,2027,1999],{"class":497},[154,2029,2030],{"class":527}," -euc\n",[154,2032,2033,2035],{"class":156,"line":199},[154,2034,1999],{"class":497},[154,2036,2037],{"class":1533}," 
|\n",[154,2039,2040],{"class":156,"line":204},[154,2041,2042],{"class":527},"      mkdir -p \u002Fdata\u002Ftempo\n",[154,2044,2045],{"class":156,"line":210},[154,2046,2047],{"class":527},"      mkdir -p \u002Fdata\u002Floki-data\n",[154,2049,2050],{"class":156,"line":216},[154,2051,2052],{"class":527},"      mkdir -p \u002Fdata\u002Floki-ruler\n",[154,2054,2055],{"class":156,"line":222},[154,2056,2057],{"class":527},"      minio server \u002Fdata --console-address ':9001'\n",[154,2059,2060,2063],{"class":156,"line":228},[154,2061,2062],{"class":512},"  networks",[154,2064,1178],{"class":497},[154,2066,2067,2069],{"class":156,"line":458},[154,2068,1999],{"class":497},[154,2070,2071],{"class":527}," monitoring\n",[154,2073,2074,2077],{"class":156,"line":463},[154,2075,2076],{"class":512},"  volumes",[154,2078,1178],{"class":497},[154,2080,2081,2083],{"class":156,"line":469},[154,2082,1999],{"class":497},[154,2084,2085],{"class":527}," .\u002Fdata\u002Fminio:\u002Fdata\n",[71,2087,2088],{},"The MinIO web console on port 9001 is useful when first bringing things up — you can watch objects appearing in the buckets and confirm that Loki and Tempo are actually flushing data rather than buffering it indefinitely.",[86,2090],{},[89,2092,2094],{"id":2093},"loki","Loki",[71,2096,2097],{},"Loki runs in simple scalable mode, with read, write, and backend roles as separate containers, each with three replicas. The read and write paths scale independently, which matters as log volume grows.",[71,2099,2100,2101,2104],{},"A ",[112,2102,2103],{},"loki-init"," container runs first to set correct directory ownership — Loki processes run as UID 10001 and the volume mount needs to reflect that before anything starts.",[71,2106,2107,2108,2111,2112,2115],{},"All external traffic goes through an nginx gateway in front of the cluster. Central Alloy pushes logs to ",[112,2109,2110],{},"http:\u002F\u002Floki-gateway:3100\u002Floki\u002Fapi\u002Fv1\u002Fpush",". 
Grafana queries ",[112,2113,2114],{},"http:\u002F\u002Floki-gateway:3100",". Neither needs to know which replica handles a given request.",[71,2117,2118],{},"A few config decisions worth noting:",[71,2120,2121,2124],{},[112,2122,2123],{},"s3forcepathstyle: true"," is required when talking to MinIO, which expects path-style URLs rather than the virtual-hosted style AWS uses; without this flag nothing stores correctly.",[71,2126,2127],{},"Replication factor 3 means each chunk is written to all three write replicas. Since they all back onto the same MinIO instance this is about write redundancy rather than independent storage — but it means the cluster survives a replica restart without data loss in the WAL.",[71,2129,2130,2131,2134,2135,2138],{},"The three component types discover each other via memberlist gossip on port 7946, joining by container name. Getting the ",[112,2132,2133],{},"join_members"," list right — ",[112,2136,2137],{},"[\"loki-read\", \"loki-write\", \"loki-backend\"]"," — is what brings the cluster together.",[86,2140],{},[89,2142,2143],{"id":1964},"Tempo",[71,2145,2146],{},"Tempo runs in microservices mode. 
The components and what each does:",[863,2148,2149,2157],{},[866,2150,2151],{},[869,2152,2153,2155],{},[872,2154,874],{},[872,2156,877],{},[879,2158,2159,2169,2179,2189,2199,2209],{},[869,2160,2161,2166],{},[884,2162,2163],{},[112,2164,2165],{},"tempo-distributor",[884,2167,2168],{},"Receives traces from Alloy, routes to ingesters",[869,2170,2171,2176],{},[884,2172,2173],{},[112,2174,2175],{},"tempo-ingester-0\u002F1\u002F2",[884,2177,2178],{},"Buffers traces in memory, flushes to MinIO",[869,2180,2181,2186],{},[884,2182,2183],{},[112,2184,2185],{},"tempo-query-frontend",[884,2187,2188],{},"Entry point for Grafana queries",[869,2190,2191,2196],{},[884,2192,2193],{},[112,2194,2195],{},"tempo-querier",[884,2197,2198],{},"Executes queries against ingesters and object storage",[869,2200,2201,2206],{},[884,2202,2203],{},[112,2204,2205],{},"tempo-compactor",[884,2207,2208],{},"Merges and compacts trace blocks",[869,2210,2211,2216],{},[884,2212,2213],{},[112,2214,2215],{},"tempo-metrics-generator",[884,2217,2218],{},"Derives RED metrics from trace data, writes to Prometheus",[71,2220,2221],{},"The metrics generator is worth understanding. It reads incoming traces and derives standard RED metrics — Rate, Errors, Duration — then writes them back to Prometheus via remote_write. The practical effect is that you get service-level dashboards showing request rates, error rates, and latency percentiles automatically from trace data, without any additional metric instrumentation in your applications. 
The traces are the source of truth; Tempo does the calculation.",[71,2223,2224],{},"It also builds a service dependency graph from trace data that Grafana can render as an interactive topology map — which services call which, with live latency and error rates on each edge.",[145,2226,2228],{"className":1161,"code":2227,"language":1163,"meta":150,"style":150},"metrics_generator:\n  storage:\n    remote_write:\n      - url: http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\n        send_exemplars: true\n  processor:\n    service_graphs:\n      wait: 10s\n      max_items: 10000\n      workers: 10\n",[112,2229,2230,2237,2244,2251,2263,2273,2280,2287,2297,2308],{"__ignoreMap":150},[154,2231,2232,2235],{"class":156,"line":12},[154,2233,2234],{"class":512},"metrics_generator",[154,2236,1178],{"class":497},[154,2238,2239,2242],{"class":156,"line":21},[154,2240,2241],{"class":512},"  storage",[154,2243,1178],{"class":497},[154,2245,2246,2249],{"class":156,"line":30},[154,2247,2248],{"class":512},"    remote_write",[154,2250,1178],{"class":497},[154,2252,2253,2255,2258,2260],{"class":156,"line":39},[154,2254,1207],{"class":497},[154,2256,2257],{"class":512}," url",[154,2259,382],{"class":497},[154,2261,2262],{"class":527}," http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fwrite\n",[154,2264,2265,2268,2270],{"class":156,"line":48},[154,2266,2267],{"class":512},"        send_exemplars",[154,2269,382],{"class":497},[154,2271,2272],{"class":1533}," true\n",[154,2274,2275,2278],{"class":156,"line":57},[154,2276,2277],{"class":512},"  processor",[154,2279,1178],{"class":497},[154,2281,2282,2285],{"class":156,"line":188},[154,2283,2284],{"class":512},"    service_graphs",[154,2286,1178],{"class":497},[154,2288,2289,2292,2294],{"class":156,"line":194},[154,2290,2291],{"class":512},"      wait",[154,2293,382],{"class":497},[154,2295,2296],{"class":527}," 10s\n",[154,2298,2299,2302,2304],{"class":156,"line":199},[154,2300,2301],{"class":512},"      
max_items",[154,2303,382],{"class":497},[154,2305,2307],{"class":2306},"s-TwI"," 10000\n",[154,2309,2310,2313,2315],{"class":156,"line":204},[154,2311,2312],{"class":512},"      workers",[154,2314,382],{"class":497},[154,2316,2317],{"class":2306}," 10\n",[86,2319],{},[89,2321,2323],{"id":2322},"getting-application-signals-in","Getting Application Signals In",[71,2325,2326],{},"From an application's perspective, the integration is a single environment variable:",[145,2328,2330],{"className":961,"code":2329,"language":963,"meta":150,"style":150},"OTEL_EXPORTER_OTLP_ENDPOINT=http:\u002F\u002Fmon-node-a:4317\nOTEL_SERVICE_NAME=my-service\n",[112,2331,2332,2344],{"__ignoreMap":150},[154,2333,2334,2338,2341],{"class":156,"line":12},[154,2335,2337],{"class":2336},"svycV","OTEL_EXPORTER_OTLP_ENDPOINT",[154,2339,2340],{"class":497},"=",[154,2342,2343],{"class":527},"http:\u002F\u002Fmon-node-a:4317\n",[154,2345,2346,2349,2351],{"class":156,"line":21},[154,2347,2348],{"class":2336},"OTEL_SERVICE_NAME",[154,2350,2340],{"class":497},[154,2352,2353],{"class":527},"my-service\n",[71,2355,2356],{},"The OTel SDK handles the rest. 
Traces, logs, and metrics all go to the same endpoint and Alloy sorts them.",[71,2358,2359],{},"The central Alloy config receives all three signal types through one receiver and routes each to its backend:",[145,2361,2363],{"className":147,"code":2362,"language":149,"meta":150,"style":150},"otelcol.receiver.otlp \"otlp_receiver\" {\n  grpc { endpoint = \"0.0.0.0:4317\" }\n  http { endpoint = \"0.0.0.0:4318\" }\n  output {\n    traces  = [otelcol.processor.batch.default.input]\n    logs    = [otelcol.processor.batch.default.input]\n    metrics = [otelcol.processor.batch.default.input]\n  }\n}\n",[112,2364,2365,2370,2375,2380,2385,2390,2395,2400,2404],{"__ignoreMap":150},[154,2366,2367],{"class":156,"line":12},[154,2368,2369],{},"otelcol.receiver.otlp \"otlp_receiver\" {\n",[154,2371,2372],{"class":156,"line":21},[154,2373,2374],{},"  grpc { endpoint = \"0.0.0.0:4317\" }\n",[154,2376,2377],{"class":156,"line":30},[154,2378,2379],{},"  http { endpoint = \"0.0.0.0:4318\" }\n",[154,2381,2382],{"class":156,"line":39},[154,2383,2384],{},"  output {\n",[154,2386,2387],{"class":156,"line":48},[154,2388,2389],{},"    traces  = [otelcol.processor.batch.default.input]\n",[154,2391,2392],{"class":156,"line":57},[154,2393,2394],{},"    logs    = [otelcol.processor.batch.default.input]\n",[154,2396,2397],{"class":156,"line":188},[154,2398,2399],{},"    metrics = [otelcol.processor.batch.default.input]\n",[154,2401,2402],{"class":156,"line":194},[154,2403,225],{},[154,2405,2406],{"class":156,"line":199},[154,2407,169],{},[71,2409,2410,2411,2414,2415,2417],{},"After batching, signals split to their respective exporters: traces to the Tempo distributor via OTLP, logs to Loki via ",[112,2412,2413],{},"loki.write",", application metrics to Prometheus via ",[112,2416,236],{},".",[71,2419,2420],{},"The OTel to Alloy to Tempo path worked on the first proper attempt — the pipeline model makes the data flow explicit enough that when something isn't arriving where you expect it, it's 
usually obvious which component in the chain is the problem.",[86,2422],{},[89,2424,2426],{"id":2425},"connecting-everything-in-grafana","Connecting Everything in Grafana",[71,2428,2429],{},"Three data sources on mon-node-b:",[123,2431,2432,2437,2442],{},[126,2433,2434,2436],{},[74,2435,817],{}," — host metrics, application metrics, and the RED metrics Tempo generates",[126,2438,2439,2441],{},[74,2440,2094],{}," — application logs",[126,2443,2444,2446],{},[74,2445,2143],{}," — distributed traces",[71,2448,2449],{},"What makes these three genuinely useful together, rather than just three separate views, is derived fields in Loki. Any log line containing a trace ID becomes a clickable link to that trace in Tempo:",[145,2451,2454],{"className":2452,"code":2453,"language":364},[362],"Field name: traceId\nRegex: traceId=(\\w+)\nInternal link: Tempo → ${__value.raw}\n",[112,2455,2453],{"__ignoreMap":150},[71,2457,2458],{},"From a trace in Tempo you can navigate back to the Loki logs for that service in the same time window. The three signals become navigable from one another instead of sitting in three separate places to look.",[86,2460],{},[89,2462,773],{"id":772},[71,2464,2465],{},"The stack now covers all three observability pillars. Host health and availability come through Prometheus and Alloy on each node, unchanged from Part 2. Application logs go through Loki. Distributed traces go through Tempo, with RED metrics derived automatically. All of it is queryable from Grafana, with the signals linked to each other.",[71,2467,2468],{},"What was still manual: enrolling a new host meant SSHing in, installing Alloy, writing its config, creating a target file, and reloading Prometheus. 
That friction was the last remaining operational problem — and fixing it turned into something bigger than just a script.",[71,2470,2471],{},[74,2472,784],{},[82,2474],{"path":2475},"\u002Fblog\u002Flgtm-stack\u002Fpart-4",[789,2477,2478],{},"html pre.shiki code .su6XF, html code.shiki .su6XF{--shiki-default:#998418}html pre.shiki code .sYZai, html code.shiki .sYZai{--shiki-default:#999999}html pre.shiki code .spphp, html code.shiki .spphp{--shiki-default:#B56959}html pre.shiki code .sbBg2, html code.shiki .sbBg2{--shiki-default:#1E754F}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .s-TwI, html code.shiki .s-TwI{--shiki-default:#2F798A}html pre.shiki code .svycV, html code.shiki .svycV{--shiki-default:#B07D48}",{"title":150,"searchDepth":21,"depth":21,"links":2480},[2481,2482,2483,2484,2485,2486,2487,2488],{"id":1899,"depth":21,"text":1900},{"id":1911,"depth":21,"text":1912},{"id":1947,"depth":21,"text":1948},{"id":2093,"depth":21,"text":2094},{"id":1964,"depth":21,"text":2143},{"id":2322,"depth":21,"text":2323},{"id":2425,"depth":21,"text":2426},{"id":772,"depth":21,"text":773},"2026-05-19","Adding log aggregation with Loki and distributed tracing with Tempo — completing the metrics, logs, and traces picture.",{},{"title":1879,"description":2490},"blog\u002Flgtm-stack\u002Fpart-3",[814,2094,2143,815,816,2495,818],"OpenTelemetry","\u002Fimages\u002Fthumbnails\u002Flgtm-stack\u002Fpart-3.png","9s48r2qM_PDpsOglDZPADwXDaUQsdhpZExXgq7Db3Q0",1780657374636]