
Iris — Giving LGTM a brain

Prometheus · Grafana · Alloy

Overview

Iris is a production-grade infrastructure monitoring platform built for KNUST (Kwame Nkrumah University of Science and Technology). It automates the onboarding of servers into a Prometheus-based observability stack, manages the full lifecycle of infrastructure incidents, and gives the monitoring environment an intelligent layer through AI-powered analysis, automated digests, and operational runbooks.

The system is named after the Greek goddess of the rainbow and messenger of the gods — fitting for a platform whose job is to relay the state of infrastructure to the people responsible for it.


What It Does

At its core, Iris solves a real operational problem: getting dozens (or hundreds) of servers properly instrumented, monitored, and connected to the right people when something goes wrong — without doing it manually each time.

Host Enrollment is the entry point. An administrator submits a Linux or Windows host; Iris connects to it over SSH or WinRM, installs the Grafana Alloy metrics agent, deploys a configuration tailored to the services running on that host, registers it in Prometheus service discovery, and sends a confirmation notification. What would take 20 minutes manually takes under 2 minutes, consistently, for every host.

Incident Management is where Iris does its most important work. AlertManager fires a webhook when Prometheus rules trigger. Iris receives that webhook, enriches the alert with context from a vector knowledge base (relevant runbooks, similar past incidents, infrastructure documentation), generates an AI-authored notification with recommended remediation steps, routes it to the right team via Microsoft Teams and email, and stores the incident for future learning. Every alert becomes more useful than it would be alone.

Operational Intelligence sits on top of all of this. Iris collects metrics snapshots, tracks maintenance windows with full lifecycle management, notifies teams when services are deployed, delivers daily, weekly, and monthly digest reports scoped to each maintainer's specific responsibilities, and continuously builds a richer knowledge base that makes future incidents faster to resolve.


Key Features

Automated Host Enrollment

  • SSH-based enrollment for Linux hosts (Debian, RHEL, SUSE, and derivatives)
  • WinRM-based enrollment for Windows hosts
  • Automatic detection of host OS, architecture, and installed services
  • Grafana Alloy agent installation with version management
  • Jinja2-templated Alloy configurations backed by a database template store
  • Support for service-specific configurations: node metrics, Nginx, Apache, MySQL, PostgreSQL, MongoDB, Windows Performance Counters
  • Prometheus target file registration with full label management (hostname, job, environment, service type, host ID)
  • Validation of prerequisites: disk space, connectivity, systemd availability
  • Firewall rule management for metrics ports

Batch Enrollment

  • Bulk enrollment via JSON, CSV, or YAML file upload
  • Sequential or concurrent execution strategies
  • Per-host progress tracking and result storage
  • Retry logic for failed hosts
  • Downloadable templates for batch file formats
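
The batch flow above can be sketched roughly as follows. This is a minimal illustration, not Iris's actual code: `load_hosts`, `enroll_with_retry`, `run_batch`, and the `hostname`/`os` field names are hypothetical, the real enrollment call is replaced by a stub, and YAML parsing is left out because it needs a third-party parser.

```python
import csv
import io
import json
from concurrent.futures import ThreadPoolExecutor


def load_hosts(text, fmt):
    """Parse an uploaded batch file into a list of host dicts (JSON or CSV)."""
    if fmt == "json":
        return json.loads(text)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported batch format: {fmt}")


def enroll_with_retry(host, enroll, max_attempts=2):
    """Run the enroll callable, retrying on failure; return a per-host result."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            enroll(host)
            return {"host": host["hostname"], "status": "success", "attempts": attempt}
        except Exception as exc:  # a real system would catch narrower errors
            last_error = str(exc)
    return {"host": host["hostname"], "status": "failed",
            "attempts": max_attempts, "error": last_error}


def run_batch(hosts, enroll, strategy="sequential", max_workers=4):
    """Enroll every host and collect results in input order."""
    if strategy == "sequential":
        return [enroll_with_retry(h, enroll) for h in hosts]
    if strategy == "concurrent":
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map yields results in the order the hosts were given.
            return list(pool.map(lambda h: enroll_with_retry(h, enroll), hosts))
    raise ValueError(f"unknown strategy: {strategy}")


def fake_enroll(host):
    """Stub standing in for the real SSH/WinRM enrollment."""
    if host["hostname"].startswith("bad"):
        raise RuntimeError("connection refused")


BATCH_CSV = "hostname,os\nweb-01,linux\nbad-02,windows\n"
results = run_batch(load_hosts(BATCH_CSV, "csv"), fake_enroll, strategy="concurrent")
```

Because each host produces its own result record, one failing host does not abort the rest of the batch, and the failed records can be fed back into `run_batch` later.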

Prometheus & Grafana Alloy Integration

  • File-based service discovery for Prometheus
  • Support for multiple federated Prometheus instances
  • Atomic target file updates with Prometheus reload
  • Prometheus API querying for target verification and metrics collection
  • Alloy configuration validation using alloy fmt on the remote host
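
One common way to make a target file update atomic is to write a temporary file in the same directory and rename it over the original. The sketch below assumes that approach; the function names, label values, and the example address are hypothetical, and the Prometheus reload step is not shown. The JSON shape (a list of objects with `targets` and `labels`) is the standard Prometheus file-based service discovery format.

```python
import json
import os
import tempfile


def build_target_group(address, labels):
    """One file_sd entry: a list of scrape addresses plus shared labels."""
    return {"targets": [address], "labels": labels}


def write_targets_atomically(path, target_groups):
    """Write the target file so readers never see a half-written document."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(target_groups, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        # mkstemp creates the file as 0600; relax it so another user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


example_group = build_target_group(
    "web-01.example.edu:12345",  # hypothetical host and port
    {"hostname": "web-01", "job": "node", "environment": "production",
     "service_type": "nginx", "host_id": "42"},
)
```

Calling `write_targets_atomically("/some/dir/linux.json", [example_group])` replaces the file in a single rename, so a reader sees either the old contents or the new ones.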

AlertManager Webhook Processing

  • Receives AlertManager v4 webhook payloads
  • Asynchronous processing via Celery task queue
  • Alert enrichment with runbooks, similar incidents, and infrastructure context
  • LLM-generated incident notifications with remediation recommendations
  • Team-aware routing based on service and host ownership
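
As an illustration of the first step, the sketch below flattens an AlertManager version 4 webhook payload into simple records that a background task could enrich. The payload field names (`version`, `alerts`, `labels`, `annotations`, `startsAt`, `fingerprint`) follow the AlertManager webhook format; the `extract_alerts` function, the output field names, and the sample values are hypothetical.

```python
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "receiver": "iris",
    "groupLabels": {"alertname": "HighCPU"},
    "commonLabels": {"alertname": "HighCPU", "severity": "critical"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighCPU", "instance": "web-01:12345",
                       "severity": "critical"},
            "annotations": {"summary": "CPU above 95% for 10 minutes"},
            "startsAt": "2024-01-01T10:00:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "fingerprint": "abc123",
        }
    ],
}


def extract_alerts(payload):
    """Turn a v4 webhook payload into one flat record per alert."""
    if payload.get("version") != "4":
        raise ValueError(f"unsupported webhook version: {payload.get('version')!r}")
    records = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        records.append({
            "fingerprint": alert.get("fingerprint"),
            "status": alert.get("status"),
            "alertname": labels.get("alertname"),
            "instance": labels.get("instance"),
            "severity": labels.get("severity", "unknown"),
            "summary": alert.get("annotations", {}).get("summary", ""),
            "starts_at": alert.get("startsAt"),
        })
    return records


records = extract_alerts(SAMPLE_PAYLOAD)
```

Keeping the HTTP handler down to validation and hand-off like this is what lets the slower enrichment and LLM steps run in the task queue.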

AI-Powered Incident Intelligence (RAG)

  • Weaviate vector database for semantic storage and retrieval
  • Ollama-backed embeddings (nomic-embed-text) and language model (Llama 3)
  • Three knowledge collections: runbooks, past incidents, infrastructure documentation
  • Semantic search across all collections at incident time
  • Continuously growing knowledge base as incidents are processed and resolved
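
The retrieval step can be pictured as a nearest-neighbour search in each collection followed by prompt assembly. The sketch below is a toy stand-in, not the Weaviate or Ollama integration: the three-dimensional vectors are made up by hand in place of real nomic-embed-text embeddings, and `search_all`, `build_prompt`, and all document texts are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy collections: (text, embedding) pairs with hand-made vectors.
COLLECTIONS = {
    "runbooks": [
        ("Restart nginx and check upstream health", [0.9, 0.1, 0.0]),
        ("Free disk space under /var", [0.0, 0.2, 0.9]),
    ],
    "incidents": [
        ("Earlier nginx 502 spike resolved by restarting the upstream", [0.8, 0.3, 0.1]),
    ],
    "infrastructure": [
        ("web-01 hosts the nginx reverse proxy", [0.7, 0.4, 0.1]),
    ],
}


def search_all(query_vec, collections, top_k=1):
    """Return the top_k most similar texts from every collection."""
    hits = {}
    for name, docs in collections.items():
        ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
        hits[name] = [text for text, _ in ranked[:top_k]]
    return hits


def build_prompt(alert_summary, hits):
    """Assemble retrieved context and the alert into one prompt string."""
    lines = [f"Alert: {alert_summary}", "Context:"]
    for name, texts in hits.items():
        lines.extend(f"- [{name}] {text}" for text in texts)
    lines.append("Recommend remediation steps.")
    return "\n".join(lines)


hits = search_all([1.0, 0.0, 0.0], COLLECTIONS)
prompt = build_prompt("nginx returning 502 on web-01", hits)
```

In a real deployment the vector database performs the similarity search, and the prompt would go to the language model rather than being kept as a string.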

Host Tagging System

  • Operational tags: ignore_alerts, known_issue, under_maintenance, flaky, custom
  • Optional expiry timestamps for temporary tags
  • Metadata key-value pairs for custom annotation
  • Notification suppression for tagged hosts
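
A minimal sketch of how expiring tags could drive suppression. Which tags actually suppress notifications is an assumption here (`SUPPRESSING_TAGS`), as are the function names and the tag dictionary shape.

```python
from datetime import datetime, timedelta, timezone

# Assumption: only these tags silence notifications; the others are informational.
SUPPRESSING_TAGS = {"ignore_alerts", "under_maintenance"}


def active_tags(tags, now):
    """Drop tags whose optional expiry timestamp has passed."""
    return [t for t in tags if t.get("expires_at") is None or t["expires_at"] > now]


def should_suppress(tags, now):
    """True if any still-active tag is one that suppresses notifications."""
    return any(t["name"] in SUPPRESSING_TAGS for t in active_tags(tags, now))


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
host_tags = [
    {"name": "flaky", "expires_at": None, "metadata": {"ticket": "OPS-1"}},
    {"name": "under_maintenance", "expires_at": now + timedelta(hours=2)},
]
expired_tags = [
    {"name": "ignore_alerts", "expires_at": now - timedelta(minutes=5)},
]
```

With these inputs, `should_suppress(host_tags, now)` is true because the maintenance tag is still active, while `should_suppress(expired_tags, now)` is false because the only tag has expired.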

Incident Lifecycle Management

  • Full incident storage with status tracking (firing → resolved)
  • Resolution notes and root cause documentation
  • Filter and query by alert name, instance, severity, service type, status
  • Resolution time tracking and SLA flagging in digest reports
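
Resolution time and SLA flagging reduce to simple timestamp arithmetic. The sketch below assumes per-severity targets; the target values, field names, and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA targets per severity.
SLA_TARGETS = {
    "critical": timedelta(hours=1),
    "warning": timedelta(hours=8),
}


def resolution_time(incident, now):
    """Time from firing to resolution, or time open so far if still firing."""
    end = incident["resolved_at"] if incident["resolved_at"] is not None else now
    return end - incident["started_at"]


def sla_breached(incident, now):
    """Flag incidents that took (or are taking) longer than their target."""
    target = SLA_TARGETS.get(incident["severity"])
    if target is None:
        return False  # no target defined for this severity
    return resolution_time(incident, now) > target


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
resolved_quickly = {
    "severity": "critical",
    "started_at": datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    "resolved_at": datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc),
}
still_firing = {
    "severity": "critical",
    "started_at": datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    "resolved_at": None,
}
```

Here `resolved_quickly` took 30 minutes and is within target, while `still_firing` has been open for two hours and would be flagged.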

Automated Digests

  • Daily, weekly, and monthly digest reports via Celery Beat
  • Personalized: each maintainer receives only their hosts and services
  • Severity distribution, SLA flags, and resolution status
  • Delivered via Microsoft Teams adaptive cards and HTML email
  • Stakeholder segmentation: scoped maintainers, service stakeholders, infrastructure stakeholders, general recipients
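
Personalized scoping can be sketched as grouping incidents by the maintainers of each affected host. The data shapes, names, and email addresses below are hypothetical, and rendering to Teams cards or HTML email is not shown.

```python
from collections import Counter


def build_digests(incidents, maintainers_by_host):
    """Group incidents per maintainer with a severity breakdown and open count."""
    digests = {}
    for incident in incidents:
        for person in maintainers_by_host.get(incident["host"], []):
            digest = digests.setdefault(
                person, {"total": 0, "open": 0, "by_severity": Counter()}
            )
            digest["total"] += 1
            digest["by_severity"][incident["severity"]] += 1
            if incident["status"] == "firing":
                digest["open"] += 1
    return digests


incidents = [
    {"host": "web-01", "severity": "critical", "status": "resolved"},
    {"host": "web-01", "severity": "warning", "status": "firing"},
    {"host": "db-01", "severity": "critical", "status": "firing"},
]
maintainers_by_host = {
    "web-01": ["ama@example.edu"],
    "db-01": ["kofi@example.edu", "ama@example.edu"],
}
digests = build_digests(incidents, maintainers_by_host)
```

A maintainer who owns no affected hosts gets no entry at all, which is one simple way to keep digests limited to what each person is responsible for.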

Maintenance Window Management

  • Full lifecycle: plan → start → extend → end / cancel
  • Types: scheduled and emergency
  • Categories: infrastructure, application, network, database, security
  • Automated notifications at window start, during the window, and on completion
  • AI-generated summaries of maintenance activities
  • Configurable reminder scheduling before planned windows
  • Bulk cancel operations
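
The lifecycle above maps naturally onto a small state machine. The sketch below is one possible reading: the state names and the assumption that a window can be cancelled both before and after it starts are mine, not confirmed by the source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# (current state, action) -> next state. Assumed transition table.
TRANSITIONS = {
    ("planned", "start"): "active",
    ("planned", "cancel"): "cancelled",
    ("active", "extend"): "active",
    ("active", "end"): "completed",
    ("active", "cancel"): "cancelled",
}


@dataclass
class MaintenanceWindow:
    title: str
    ends_at: datetime
    state: str = "planned"

    def apply(self, action, extend_by=None):
        """Apply a lifecycle action, rejecting anything the table does not allow."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a window that is {self.state}")
        if action == "extend":
            if extend_by is None:
                raise ValueError("extend requires extend_by")
            self.ends_at += extend_by
        self.state = TRANSITIONS[key]
        return self.state


window = MaintenanceWindow(
    title="Database patching",
    ends_at=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
)
window.apply("start")
window.apply("extend", extend_by=timedelta(minutes=30))
window.apply("end")
```

Keeping the allowed transitions in one table makes invalid operations, such as extending a completed window, fail loudly instead of silently corrupting state.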

Deployment Notifications

  • Deployment event ingestion with version, environment, and service information
  • Semantic version comparison (upgrade vs rollback detection)
  • AI-generated change summaries from commit messages
  • Notifications routed to service maintainers
  • Deployment record storage for audit trail
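
Upgrade versus rollback detection comes down to comparing version numbers numerically rather than as strings. The sketch below is deliberately simplified: it compares only the major, minor, and patch components and ignores pre-release ordering; `parse_version` and `classify_deployment` are hypothetical names.

```python
def parse_version(version):
    """Parse 'v1.4.2', '1.4.2-rc.1' or '1.4.2+build' into (1, 4, 2)."""
    core = version.lstrip("v").split("-", 1)[0].split("+", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)


def classify_deployment(previous, new):
    """Label a deployment by comparing the new version with the previous one."""
    old_v, new_v = parse_version(previous), parse_version(new)
    if new_v > old_v:
        return "upgrade"
    if new_v < old_v:
        return "rollback"
    return "redeploy"
```

Tuple comparison handles the case that trips up string comparison: `classify_deployment("v1.9.0", "v1.10.0")` is an upgrade, even though `"1.10.0" < "1.9.0"` as text.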

Metrics Collection & Snapshots

  • Metrics queried from Prometheus and stored as snapshots in PostgreSQL every 15 minutes
  • Configurable retention (default 90 days) with automated pruning
  • CPU, memory, and disk thresholds with warning and critical levels
  • Historical trend data for generating digests and reporting cards
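
Threshold classification and retention pruning can be sketched in a few lines. The threshold percentages and function names are hypothetical; only the 90-day default retention comes from the description above, and a real implementation would prune with a database query rather than an in-memory filter.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical (warning, critical) thresholds in percent.
THRESHOLDS = {
    "cpu": (80.0, 95.0),
    "memory": (80.0, 95.0),
    "disk": (85.0, 95.0),
}


def classify(metric, value):
    """Map a usage percentage to 'ok', 'warning' or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def prune(snapshots, now, retention_days=90):
    """Keep only snapshots collected within the retention period."""
    cutoff = now - timedelta(days=retention_days)
    return [s for s in snapshots if s["collected_at"] >= cutoff]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
snapshots = [
    {"host": "web-01", "cpu": 97.0, "collected_at": now - timedelta(days=1)},
    {"host": "web-01", "cpu": 40.0, "collected_at": now - timedelta(days=120)},
]
kept = prune(snapshots, now)
```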

Service & Maintainer Registry

  • Service registry with maintainer and escalation manager assignments
  • Host-level maintainer overrides independent of service assignments
  • Notification preference flags per host and service
  • Used across enrollment, incident routing, digests, and deployment notifications
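
A minimal sketch of how host-level overrides and notification flags could combine when choosing recipients. The `resolve_recipients` function, the `notify` flag, and the record shapes are hypothetical.

```python
def resolve_recipients(host, services):
    """Pick who to notify: host-level maintainers win over service maintainers."""
    service = services.get(host.get("service"), {})
    # Either side can opt out through its notification preference flag.
    if not host.get("notify", True) or not service.get("notify", True):
        return []
    if host.get("maintainers"):
        return list(host["maintainers"])
    return list(service.get("maintainers", []))


services = {
    "student-portal": {"maintainers": ["ama@example.edu"], "notify": True},
}
plain_host = {"hostname": "web-01", "service": "student-portal"}
override_host = {"hostname": "web-02", "service": "student-portal",
                 "maintainers": ["kofi@example.edu"]}
muted_host = {"hostname": "web-03", "service": "student-portal", "notify": False}
```

With one resolver like this, enrollment confirmations, incident routing, digests, and deployment notices can all share the same ownership rules.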

Observability of Iris Itself

  • OpenTelemetry integration for distributed tracing, metrics, and structured logging
  • Configurable OTLP export to a collector endpoint
  • Per-service tracing across FastAPI routes and Celery tasks

Architecture

Iris is built on a modern async Python stack:

Layer             | Technology
API Framework     | FastAPI (async)
Task Queue        | Celery with Redis broker
Scheduler         | Celery Beat
Primary Database  | PostgreSQL 16 (via asyncpg / SQLModel)
Vector Database   | Weaviate 1.33
AI / Embeddings   | Ollama (Llama 3, nomic-embed-text)
Remote Execution  | Paramiko (SSH), PyWinRM (WinRM)
Notifications     | Microsoft Teams (webhooks + Graph API), KNUST Email Gateway
Monitoring Agent  | Grafana Alloy
Metrics Source    | Prometheus
Observability     | OpenTelemetry

The application is split into three cooperating processes:

  • iris — the FastAPI API server, handling synchronous request/response and dispatching async work
  • iris-worker — one or more Celery worker processes executing enrollment, incident processing, digest generation, maintenance, and deployment tasks
  • iris-beat — the Celery Beat scheduler driving periodic tasks (digests, metrics collection, metric pruning)
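
For illustration, the periodic work driven by iris-beat could be declared with a Celery `beat_schedule` mapping along these lines. The task names and entry names are hypothetical, and only the 15-minute metrics interval comes from this document. Celery accepts a plain number of seconds as a schedule; fixed times of day and weekly or monthly runs would normally use `celery.schedules.crontab` instead.

```python
FIFTEEN_MINUTES = 15 * 60
ONE_DAY = 24 * 60 * 60

# Hypothetical schedule; would be assigned to the Celery app's beat_schedule setting.
BEAT_SCHEDULE = {
    "collect-metrics": {
        "task": "iris.tasks.collect_metrics",
        "schedule": FIFTEEN_MINUTES,
    },
    "prune-metrics": {
        "task": "iris.tasks.prune_metrics",
        "schedule": ONE_DAY,
    },
    "daily-digest": {
        "task": "iris.tasks.send_digest",
        "schedule": ONE_DAY,
        "args": ("daily",),
    },
}
```

The scheduler only enqueues these tasks; the iris-worker processes pick them up from the broker and do the actual work.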

Deployment

Iris runs on Docker Swarm in production, with images published to KNUST's internal Docker registry (dreg.knust.edu.gh). The stack is pinned to a specific monitoring node (knust-monitoring) and shares a Docker overlay network with the rest of the monitoring stack (Prometheus, Grafana, AlertManager, Weaviate, Ollama).

Production resource allocation:

Container    | Memory (limit / reserved) | CPU (limit / reserved)
iris (API)   | 2 GB / 512 MB             | 1.0 / 0.25
iris-worker  | 4 GB / 1 GB               | 2.0 / 0.5
iris-beat    | Lightweight               | Minimal

Prometheus target files are mounted directly from the host at /opt/monitoring-stack/shared/prometheus/targets, allowing Iris to write target files that Prometheus reads without any additional network hop. Iris configuration files (templates, credentials) are mounted from /opt/monitoring-stack/shared/iris/.

The application is built from a two-stage Dockerfile:

  1. A builder stage installs Python dependencies from KNUST's private PyPI registry
  2. A slim runtime stage runs the application as a non-root user

Database migrations are managed with Alembic and applied as part of the deployment pipeline.


Security

  • API key authentication middleware (configurable, disabled in dev)
  • All secrets externalized to environment variables
  • SSH private key support for host enrollment (no password storage required)
  • Azure AD app registration for Microsoft Teams Graph API access
  • KNUST email gateway uses API key authentication over HTTPS
  • CORS configured per environment (open in dev, restricted in production)
  • Non-root container user in production images
  • Internal Docker network isolation between services
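
The API key check at the heart of such middleware can be sketched independently of the web framework. The header name `x-api-key`, the function name, and the `enabled` switch are assumptions for illustration; the constant-time comparison uses the standard library's `hmac.compare_digest`.

```python
import hmac


def is_authorized(headers, expected_key, enabled=True):
    """Return True if the request may proceed.

    headers: mapping of lower-cased header names to values.
    expected_key: the configured secret, typically read from an environment variable.
    enabled: lets development environments switch the check off.
    """
    if not enabled:
        return True
    if not expected_key:
        return False  # refuse everything when no key is configured
    supplied = headers.get("x-api-key", "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

Failing closed when the key is missing means a misconfigured production deployment rejects requests rather than silently accepting all of them.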

Technology Summary

Python 3.12 · FastAPI · Celery · PostgreSQL · Weaviate · Ollama · Grafana Alloy · Prometheus · AlertManager · Microsoft Teams · OpenTelemetry · Docker Swarm · Redis · Paramiko · SQLModel · Alembic · Jinja2 · LangChain