Iris — Giving LGTM a brain

Overview

Iris is a production-grade infrastructure monitoring platform built for KNUST (Kwame Nkrumah University of Science and Technology). It automates the onboarding of servers into a Prometheus-based observability stack, manages the full lifecycle of infrastructure incidents, and gives the monitoring environment an intelligent layer through AI-powered analysis, automated digests, and operational runbooks.

The system is named after the Greek goddess of the rainbow and messenger of the gods — fitting for a platform whose job is to relay the state of infrastructure to the people responsible for it.

What It Does

At its core, Iris solves a real operational problem: getting dozens (or hundreds) of servers properly instrumented, monitored, and connected to the right people when something goes wrong — without doing it manually each time.

Host Enrollment is the entry point. An administrator submits a Linux or Windows host, and Iris connects to it over SSH or WinRM, installs the Grafana Alloy metrics agent, deploys a configuration tailored to the services running on that host, registers it in Prometheus service discovery, and sends a confirmation notification. What would take 20 minutes manually takes under 2 minutes, consistently, for every host.

Incident Management is where Iris does its most important work. AlertManager fires a webhook when Prometheus rules trigger. Iris receives that webhook, enriches the alert with context from a vector knowledge base (relevant runbooks, similar past incidents, infrastructure documentation), generates an AI-authored notification with recommended remediation steps, routes it to the right team via Microsoft Teams and email, and stores the incident for future learning. Each alert therefore reaches its recipients with more context than the raw Prometheus alert carries on its own.

Operational Intelligence sits on top of all of this. Iris collects metrics snapshots, tracks maintenance windows with full lifecycle management, notifies teams when services are deployed, delivers daily, weekly, and monthly digest reports scoped to each maintainer's specific responsibilities, and continuously builds a richer knowledge base that makes future incidents faster to resolve.

Key Features

Automated Host Enrollment

SSH-based enrollment for Linux hosts (Debian, RHEL, SUSE, and derivatives)
WinRM-based enrollment for Windows hosts
Automatic detection of host OS, architecture, and installed services
Grafana Alloy agent installation with version management
Jinja2-templated Alloy configurations backed by a database template store
Support for service-specific configurations: node metrics, Nginx, Apache, MySQL, PostgreSQL, MongoDB, Windows Performance Counters
Prometheus target file registration with full label management (hostname, job, environment, service type, host ID)
Validation of prerequisites: disk space, connectivity, systemd availability
Firewall rule management for metrics ports

Batch Enrollment

Bulk enrollment via JSON, CSV, or YAML file upload
Sequential or concurrent execution strategies
Per-host progress tracking and result storage
Retry logic for failed hosts
Downloadable templates for batch file formats
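
The sequential and concurrent strategies with per-host retries could be sketched as follows. This is a minimal illustration, not Iris's actual implementation: HostResult, enroll_with_retry, enroll_batch, and the enroll coroutine passed in are all hypothetical names, and the default of three attempts is an assumption.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class HostResult:
    host: str
    ok: bool
    attempts: int
    error: str = ""


async def enroll_with_retry(host, enroll, max_attempts=3):
    """Call the (hypothetical) enroll coroutine, retrying on failure."""
    error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            await enroll(host)
            return HostResult(host, True, attempt)
        except Exception as exc:  # keep the last error and try again
            error = str(exc)
    return HostResult(host, False, max_attempts, error)


async def enroll_batch(hosts, enroll, concurrency=1, max_attempts=3):
    """concurrency=1 runs hosts one at a time; higher values run them in parallel."""
    gate = asyncio.Semaphore(concurrency)

    async def guarded(host):
        async with gate:
            return await enroll_with_retry(host, enroll, max_attempts)

    # gather preserves input order, so results line up with hosts
    return await asyncio.gather(*(guarded(h) for h in hosts))
```

A single semaphore covers both strategies: a concurrency of 1 degenerates to sequential execution, so the per-host result handling stays identical in either mode.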

Prometheus & Grafana Alloy Integration

File-based service discovery for Prometheus
Support for multiple federated Prometheus instances
Atomic target file updates with Prometheus reload
Prometheus API querying for target verification and metrics collection
Alloy configuration validation using alloy fmt on the remote host
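
An atomic target file update for file-based service discovery might look like the sketch below. The function names, the label keys (hostname, environment, service_type, host_id), and the example address are assumptions made for illustration; only the overall shape of a file_sd document (a list of groups, each with targets and labels) comes from Prometheus itself.

```python
import json
import os
import tempfile


def build_target_group(address, hostname, job, environment, service_type, host_id):
    """Build one file_sd target group. Label keys here are illustrative."""
    return {
        "targets": [address],
        "labels": {
            "hostname": hostname,
            "job": job,
            "environment": environment,
            "service_type": service_type,
            "host_id": str(host_id),  # label values must be strings
        },
    }


def write_targets_atomically(path, target_groups):
    """Write the JSON to a temp file, then rename it over the real file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(target_groups, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; let other users read it
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Creating the temporary file in the same directory as the target keeps the rename on one filesystem, so a reader sees either the old file or the new one and never a half-written document.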

AlertManager Webhook Processing

Receives AlertManager v4 webhook payloads
Asynchronous processing via Celery task queue
Alert enrichment with runbooks, similar incidents, and infrastructure context
LLM-generated incident notifications with remediation recommendations
Team-aware routing based on service and host ownership
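
To make the webhook step concrete, here is a minimal sketch of turning a version 4 payload into typed alert records before handing them to a background task. The Alert dataclass and parse_webhook function are hypothetical; the payload keys used (version, status, alerts, labels, fingerprint) follow the AlertManager webhook format, and the "unknown" fallbacks are an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    instance: str
    severity: str
    status: str
    fingerprint: str


def parse_webhook(payload):
    """Extract the fields needed for routing from a v4 webhook payload."""
    version = payload.get("version")
    if version != "4":
        raise ValueError(f"unsupported webhook version: {version!r}")
    alerts = []
    for item in payload.get("alerts", []):
        labels = item.get("labels", {})
        alerts.append(
            Alert(
                name=labels.get("alertname", "unknown"),
                instance=labels.get("instance", "unknown"),
                severity=labels.get("severity", "unknown"),
                # fall back to the group-level status if the alert omits it
                status=item.get("status", payload.get("status", "firing")),
                fingerprint=item.get("fingerprint", ""),
            )
        )
    return alerts
```

Rejecting unexpected versions early keeps malformed payloads out of the task queue, where failures are harder to trace back to the sender.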

AI-Powered Incident Intelligence (RAG)

Weaviate vector database for semantic storage and retrieval
Ollama-backed embeddings (nomic-embed-text) and language model (Llama 3)
Three knowledge collections: runbooks, past incidents, infrastructure documentation
Semantic search across all collections at incident time
Continuously growing knowledge base as incidents are processed and resolved
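
The idea behind semantic search at incident time can be illustrated with a small in-memory stand-in. This is not Weaviate's API: it is a plain-Python cosine-similarity ranking over hand-written vectors, assuming the embeddings have already been produced elsewhere. The function names and the tuple layout are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, documents, k=3):
    """Rank (collection, text, vector) tuples by similarity to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), collection, text)
        for collection, text, vector in documents
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:k]
```

In the real system the vector database performs this ranking; the sketch only shows why an alert's embedding can surface a related runbook even when the two share no keywords.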

Host Tagging System

Operational tags: ignore_alerts, known_issue, under_maintenance, flaky, custom
Optional expiry timestamps for temporary tags
Metadata key-value pairs for custom annotation
Notification suppression for tagged hosts
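
How expiring tags might feed notification suppression is sketched below. The HostTag class and should_suppress function are hypothetical, and the choice of which tags suppress notifications (ignore_alerts and under_maintenance here) is an assumption; the source only states that tagged hosts can have notifications suppressed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Assumption: only these tags silence notifications.
SUPPRESSING_TAGS = {"ignore_alerts", "under_maintenance"}


@dataclass
class HostTag:
    name: str
    expires_at: Optional[datetime] = None  # None means the tag never expires
    metadata: dict = field(default_factory=dict)

    def is_active(self, now):
        return self.expires_at is None or now < self.expires_at


def should_suppress(tags, now=None):
    """True if any unexpired suppressing tag is present on the host."""
    now = now or datetime.now(timezone.utc)
    return any(t.is_active(now) and t.name in SUPPRESSING_TAGS for t in tags)
```

Checking expiry at read time means an expired tag stops suppressing immediately, without waiting for a cleanup job to delete it.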

Incident Lifecycle Management

Full incident storage with status tracking (firing → resolved)
Resolution notes and root cause documentation
Filter and query by alert name, instance, severity, service type, status
Resolution time tracking and SLA flagging in digest reports
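
Resolution time tracking and SLA flagging reduce to simple timestamp arithmetic. The sketch below assumes per-severity SLA targets; the specific durations and the function names are invented for illustration and are not taken from Iris.

```python
from datetime import datetime, timedelta

# Hypothetical SLA targets per severity; real values would be configured.
SLA_TARGETS = {
    "critical": timedelta(hours=1),
    "warning": timedelta(hours=8),
}


def resolution_time(started_at, resolved_at):
    """Elapsed time between an incident firing and being resolved."""
    return resolved_at - started_at


def breaches_sla(severity, started_at, resolved_at):
    """True when the incident took longer than its severity's target."""
    target = SLA_TARGETS.get(severity)
    if target is None:
        return False  # no target defined for this severity
    return resolution_time(started_at, resolved_at) > target
```

For example, a critical incident that fires at 10:00 and resolves at 11:30 exceeds a one-hour target and would be flagged in the next digest.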

Automated Digests

Daily, weekly, and monthly digest reports via Celery Beat
Personalized: each maintainer receives only their hosts and services
Severity distribution, SLA flags, and resolution status
Delivered via Microsoft Teams adaptive cards and HTML email
Stakeholder segmentation: scoped maintainers, service stakeholders, infrastructure stakeholders, general recipients

Maintenance Window Management

Full lifecycle: plan → start → extend → end / cancel
Types: scheduled and emergency
Categories: infrastructure, application, network, database, security
Automated notifications at window start, during the window, and on completion
AI-generated summaries of maintenance activities
Configurable reminder scheduling before planned windows
Bulk cancel operations
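
The lifecycle above can be modelled as a small state machine. This is a sketch under assumptions: the status names (planned, active, completed, cancelled), the rule that an active window may still be cancelled, and the MaintenanceWindow class itself are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# (current status, action) -> next status
TRANSITIONS = {
    ("planned", "start"): "active",
    ("active", "extend"): "active",
    ("active", "end"): "completed",
    ("planned", "cancel"): "cancelled",
    ("active", "cancel"): "cancelled",  # assumption: active windows can be cancelled
}


@dataclass
class MaintenanceWindow:
    starts_at: datetime
    ends_at: datetime
    status: str = "planned"

    def apply(self, action, extra=timedelta(0)):
        """Apply a lifecycle action, rejecting transitions not in the table."""
        key = (self.status, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a {self.status} window")
        if action == "extend":
            self.ends_at += extra  # push the planned end time back
        self.status = TRANSITIONS[key]
```

Keeping the allowed transitions in one table makes invalid operations, such as extending a window that has already ended, fail loudly instead of silently corrupting state.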

Deployment Notifications

Deployment event ingestion with version, environment, and service information
Semantic version comparison (upgrade vs rollback detection)
AI-generated change summaries from commit messages
Notifications routed to service maintainers
Deployment record storage for audit trail
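
Upgrade versus rollback detection can be sketched by comparing versions numerically rather than as strings. The function names and the "redeploy" label for identical versions are assumptions, and this simplified parser ignores pre-release and build metadata.

```python
import re

# Matches MAJOR.MINOR.PATCH with an optional leading "v"; anything after is ignored.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def parse_version(text):
    """Return (major, minor, patch) as integers."""
    match = _SEMVER.match(text.strip())
    if not match:
        raise ValueError(f"not a semantic version: {text!r}")
    return tuple(int(part) for part in match.groups())


def classify_change(previous, new):
    """Compare two versions and label the deployment direction."""
    old, cur = parse_version(previous), parse_version(new)
    if cur > old:
        return "upgrade"
    if cur < old:
        return "rollback"
    return "redeploy"
```

Comparing integer tuples matters because a plain string comparison would rank 1.9.0 above 1.10.0 and misreport an upgrade as a rollback.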

Metrics Collection & Snapshots

Prometheus metrics queried and stored as snapshots in PostgreSQL every 15 minutes
Configurable retention (default 90 days) with automated pruning
CPU, memory, and disk thresholds with warning and critical levels
Historical trend data for generating digests and report cards
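
Threshold classification and retention pruning might be expressed as below. The threshold percentages are invented for illustration; only the 90-day default retention comes from the description above. The names classify and prune_cutoff are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical (warning, critical) thresholds in percent.
THRESHOLDS = {
    "cpu": (80.0, 95.0),
    "memory": (85.0, 95.0),
    "disk": (80.0, 90.0),
}


def classify(metric, value_percent):
    """Map a utilisation percentage to ok, warning, or critical."""
    warning, critical = THRESHOLDS[metric]
    if value_percent >= critical:
        return "critical"
    if value_percent >= warning:
        return "warning"
    return "ok"


def prune_cutoff(now, retention_days=90):
    """Snapshots older than the returned timestamp are eligible for pruning."""
    return now - timedelta(days=retention_days)
```

With these example values, a disk at 90 percent is already critical while a CPU at the same level is only a warning, which is why thresholds are kept per metric.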

Service & Maintainer Registry

Service registry with maintainer and escalation manager assignments
Host-level maintainer overrides independent of service assignments
Notification preference flags per host and service
Used across enrollment, incident routing, digests, and deployment notifications
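
One way to read the override rule is that a host's own maintainers, when set, replace those inherited from its service. The sketch below encodes that reading; the dictionary layout and the resolve_maintainers name are assumptions, and the email addresses are placeholders.

```python
def resolve_maintainers(host, services):
    """Return host-level maintainers if set, else the service's maintainers."""
    override = host.get("maintainers")
    if override:
        return list(override)
    service = services.get(host.get("service"), {})
    return list(service.get("maintainers", []))


# Example data (hypothetical): one service with a default maintainer.
SERVICES = {"portal": {"maintainers": ["web-team@example.org"]}}

default_host = {"hostname": "web01", "service": "portal"}
overridden_host = {
    "hostname": "web02",
    "service": "portal",
    "maintainers": ["dba@example.org"],
}
```

Because enrollment, incident routing, digests, and deployment notifications all need the same answer, a single resolution function avoids each feature interpreting overrides differently.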

Observability of Iris Itself

OpenTelemetry integration for distributed tracing, metrics, and structured logging
Configurable OTLP export to a collector endpoint
Per-service tracing across FastAPI routes and Celery tasks

Architecture

Iris is built on a modern async Python stack:

Layer	Technology
API Framework	FastAPI (async)
Task Queue	Celery with Redis broker
Scheduler	Celery Beat
Primary Database	PostgreSQL 16 (via asyncpg / SQLModel)
Vector Database	Weaviate 1.33
AI / Embeddings	Ollama (Llama 3, nomic-embed-text)
Remote Execution	Paramiko (SSH), PyWinRM (WinRM)
Notifications	Microsoft Teams (webhooks + Graph API), KNUST Email Gateway
Monitoring Agent	Grafana Alloy
Metrics Source	Prometheus
Observability	OpenTelemetry

The application is split into three cooperating processes:

iris — the FastAPI API server, handling synchronous request/response and dispatching async work
iris-worker — one or more Celery worker processes executing enrollment, incident processing, digest generation, maintenance, and deployment tasks
iris-beat — the Celery Beat scheduler driving periodic tasks (digests, metrics collection, metric pruning)

Deployment

Iris runs on Docker Swarm in production, with images pulled from KNUST's internal Docker registry (dreg.knust.edu.gh). The stack is pinned to a specific monitoring node (knust-monitoring) and shares a Docker overlay network with the rest of the monitoring stack (Prometheus, Grafana, AlertManager, Weaviate, Ollama).

Production resource allocation:

Container	Memory (limit / reserved)	CPU (limit / reserved)
iris (API)	2 GB / 512 MB	1.0 / 0.25
iris-worker	4 GB / 1 GB	2.0 / 0.5
iris-beat	Lightweight	Minimal

Prometheus target files are mounted directly from the host at /opt/monitoring-stack/shared/prometheus/targets, allowing Iris to write target files that Prometheus reads without any additional network hop. Iris configuration files (templates, credentials) are mounted from /opt/monitoring-stack/shared/iris/.

The application is built from a two-stage Dockerfile:

A builder stage installs Python dependencies from KNUST's private PyPI registry
A slim runtime stage runs the application as a non-root user

Database migrations are managed with Alembic, and schema evolution is handled as part of the deployment pipeline.

Security

API key authentication middleware (configurable, disabled in dev)
All secrets externalized to environment variables
SSH private key support for host enrollment (no password storage required)
Azure AD app registration for Microsoft Teams Graph API access
KNUST email gateway uses API key authentication over HTTPS
CORS configured per environment (open in dev, restricted in production)
Non-root container user in production images
Internal Docker network isolation between services
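
The API key check at the heart of such a middleware could be sketched as follows. The X-API-Key header name, the is_authorized function, and the enabled flag are assumptions; the source only says the middleware is configurable and disabled in development.

```python
import hmac


def is_authorized(headers, expected_key, enabled=True):
    """Validate the API key from request headers (header name is hypothetical)."""
    if not enabled:
        return True  # e.g. development, where the check is switched off
    if not expected_key:
        return False  # refuse everything if no key is configured
    supplied = headers.get("X-API-Key", "")
    # compare_digest takes time independent of where the strings differ
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

Using a constant-time comparison instead of == avoids leaking how many leading characters of a guessed key were correct, and failing closed when no key is configured prevents an empty setting from accidentally opening the API.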

Technology Summary

Python 3.12 · FastAPI · Celery · PostgreSQL · Weaviate · Ollama · Grafana Alloy · Prometheus · AlertManager · Microsoft Teams · OpenTelemetry · Docker Swarm · Redis · Paramiko · SQLModel · Alembic · Jinja2 · LangChain