Iris — Giving LGTM a brain
Overview
Iris is a production-grade infrastructure monitoring platform built for KNUST (Kwame Nkrumah University of Science and Technology). It automates the onboarding of servers into a Prometheus-based observability stack, manages the full lifecycle of infrastructure incidents, and gives the monitoring environment an intelligent layer through AI-powered analysis, automated digests, and operational runbooks.
The system is named after the Greek goddess of the rainbow and messenger of the gods — fitting for a platform whose job is to relay the state of infrastructure to the people responsible for it.
What It Does
At its core, Iris solves a real operational problem: getting dozens (or hundreds) of servers properly instrumented, monitored, and connected to the right people when something goes wrong — without doing it manually each time.
Host Enrollment is the entry point. An administrator submits a Linux or Windows host, and Iris connects to it over SSH or WinRM, installs the Grafana Alloy metrics agent, deploys a configuration tailored to the services running on that host, registers it in Prometheus service discovery, and sends a confirmation notification. What would take 20 minutes manually takes under 2 minutes, consistently, for every host.
Incident Management is where Iris does its most important work. AlertManager fires a webhook when Prometheus rules trigger. Iris receives that webhook, enriches the alert with context from a vector knowledge base (relevant runbooks, similar past incidents, infrastructure documentation), generates an AI-authored notification with recommended remediation steps, routes it to the right team via Microsoft Teams and email, and stores the incident for future learning. The notification that reaches the team therefore carries context and suggested next steps, not just the raw alert.
Operational Intelligence sits on top of all of this. Iris collects metrics snapshots, tracks maintenance windows with full lifecycle management, notifies teams when services are deployed, delivers daily, weekly, and monthly digest reports scoped to each maintainer's specific responsibilities, and continuously builds a richer knowledge base that makes future incidents faster to resolve.
Key Features
Automated Host Enrollment
- SSH-based enrollment for Linux hosts (Debian, RHEL, SUSE, and derivatives)
- WinRM-based enrollment for Windows hosts
- Automatic detection of host OS, architecture, and installed services
- Grafana Alloy agent installation with version management
- Jinja2-templated Alloy configurations backed by a database template store
- Support for service-specific configurations: node metrics, Nginx, Apache, MySQL, PostgreSQL, MongoDB, Windows Performance Counters
- Prometheus target file registration with full label management (hostname, job, environment, service type, host ID)
- Validation of prerequisites: disk space, connectivity, systemd availability
- Firewall rule management for metrics ports
Batch Enrollment
- Bulk enrollment via JSON, CSV, or YAML file upload
- Sequential or concurrent execution strategies
- Per-host progress tracking and result storage
- Retry logic for failed hosts
- Downloadable templates for batch file formats
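The sequential and concurrent strategies, with per-host results and retries, can be sketched as follows. This is a minimal illustration using only the standard library, not Iris's actual implementation: `HostResult`, `enroll_with_retry`, `run_batch`, and the `enroll` callable are hypothetical names, and in Iris the work is dispatched through Celery rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class HostResult:
    """Outcome of enrolling one host (hypothetical result shape)."""
    host: str
    ok: bool
    attempts: int
    error: Optional[str] = None


def enroll_with_retry(host: str, enroll: Callable[[str], None],
                      max_attempts: int = 3) -> HostResult:
    """Call enroll(host) until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            enroll(host)
            return HostResult(host, True, attempt)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = str(exc)
    return HostResult(host, False, max_attempts, last_error)


def run_batch(hosts: Iterable[str], enroll: Callable[[str], None],
              strategy: str = "sequential", max_workers: int = 4,
              max_attempts: int = 3) -> List[HostResult]:
    """Enroll every host and return one result per host, in input order."""
    hosts = list(hosts)
    if strategy == "sequential":
        return [enroll_with_retry(h, enroll, max_attempts) for h in hosts]
    if strategy == "concurrent":
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map yields results in the order of the inputs.
            return list(pool.map(
                lambda h: enroll_with_retry(h, enroll, max_attempts), hosts))
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Keeping one result object per host means a failed host never hides the outcome of the others, which is what makes per-host progress tracking and targeted retries possible.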
Prometheus & Grafana Alloy Integration
- File-based service discovery for Prometheus
- Support for multiple federated Prometheus instances
- Atomic target file updates with Prometheus reload
- Prometheus API querying for target verification and metrics collection
- Alloy configuration validation using `alloy fmt` on the remote host
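To illustrate file-based service discovery with atomic updates, the sketch below writes a Prometheus `file_sd` target file by first writing a temporary file in the same directory and then renaming it over the final path. The `targets` and `labels` structure is the standard `file_sd` JSON format; the function names, the label keys, and the example address are assumptions made for illustration, and the Prometheus reload step is left out.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def build_target_group(address: str, labels: Dict[str, str]) -> dict:
    """One file_sd entry: a list of targets plus the labels attached to them."""
    return {"targets": [address], "labels": dict(labels)}


def write_targets_atomically(path: Path, groups: List[dict]) -> None:
    """Write groups as JSON so readers never observe a half-written file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one
    # filesystem; the ".tmp" suffix keeps it out of a "*.json" discovery glob.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(groups, tmp, indent=2, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())
        # mkstemp creates the file as 0600; relax it so another user can read it.
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A call such as `write_targets_atomically(Path("targets/web-01.json"), [build_target_group("web-01:12345", {"hostname": "web-01", "job": "node"})])` either leaves the previous file untouched or replaces it with the complete new content.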
AlertManager Webhook Processing
- Receives AlertManager v4 webhook payloads
- Asynchronous processing via Celery task queue
- Alert enrichment with runbooks, similar incidents, and infrastructure context
- LLM-generated incident notifications with remediation recommendations
- Team-aware routing based on service and host ownership
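As a rough sketch of the first step in that pipeline, the code below parses a version 4 webhook body into simple alert objects before any enrichment or routing happens. The JSON field names (`version`, `alerts`, `status`, `labels`, `annotations`, `startsAt`, `fingerprint`) follow the AlertManager webhook format; the `Alert` class, `parse_webhook`, and `severity_of` are hypothetical names, and reading severity from a `severity` label is an assumption about how the alert rules are labeled.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    status: str                  # "firing" or "resolved"
    labels: Dict[str, str]
    annotations: Dict[str, str]
    starts_at: str               # RFC 3339 timestamp, kept as a string here
    fingerprint: str


def parse_webhook(body: str) -> List[Alert]:
    """Turn a webhook request body into a list of Alert objects."""
    payload = json.loads(body)
    if payload.get("version") != "4":
        raise ValueError(f"unsupported webhook version: {payload.get('version')!r}")
    return [
        Alert(
            status=item["status"],
            labels=item.get("labels", {}),
            annotations=item.get("annotations", {}),
            starts_at=item.get("startsAt", ""),
            fingerprint=item.get("fingerprint", ""),
        )
        for item in payload.get("alerts", [])
    ]


def severity_of(alert: Alert, default: str = "warning") -> str:
    """Assumption: severity is carried as a 'severity' label on the alert."""
    return alert.labels.get("severity", default)
```

Parsing into a small typed object up front keeps the later stages (enrichment, notification, routing) independent of the raw payload shape.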
AI-Powered Incident Intelligence (RAG)
- Weaviate vector database for semantic storage and retrieval
- Ollama-backed embeddings (nomic-embed-text) and language model (Llama 3)
- Three knowledge collections: runbooks, past incidents, infrastructure documentation
- Semantic search across all collections at incident time
- Continuously growing knowledge base as incidents are processed and resolved
Host Tagging System
- Operational tags: `ignore_alerts`, `known_issue`, `under_maintenance`, `flaky`, `custom`
- Optional expiry timestamps for temporary tags
- Metadata key-value pairs for custom annotation
- Notification suppression for tagged hosts
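One way to model tags with optional expiry and use them for notification suppression is sketched below. The tag names come from the list above; which tags actually suppress notifications is not stated here, so `SUPPRESSING_TAGS` is an assumption, as are the `HostTag` class and the `should_suppress` function.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Optional

# Assumption: only these tags silence notifications; the others are informational.
SUPPRESSING_TAGS = {"ignore_alerts", "under_maintenance"}


@dataclass
class HostTag:
    name: str
    expires_at: Optional[datetime] = None           # None means the tag never expires
    metadata: Dict[str, str] = field(default_factory=dict)

    def is_active(self, now: datetime) -> bool:
        """A tag is active until its expiry timestamp is reached."""
        return self.expires_at is None or now < self.expires_at


def should_suppress(tags: Iterable[HostTag], now: datetime) -> bool:
    """True if any active tag on the host is one that suppresses notifications."""
    return any(t.name in SUPPRESSING_TAGS and t.is_active(now) for t in tags)
```

Checking expiry at evaluation time means a temporary tag stops suppressing notifications on its own, without a cleanup job having to remove it first.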
Incident Lifecycle Management
- Full incident storage with status tracking (firing → resolved)
- Resolution notes and root cause documentation
- Filter and query by alert name, instance, severity, service type, status
- Resolution time tracking and SLA flagging in digest reports
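Resolution time tracking and SLA flagging reduce to simple timestamp arithmetic, sketched below. The per-severity targets in `SLA_TARGETS` are invented for illustration and are not Iris's real values; the function names are hypothetical as well.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical targets; real values would come from configuration.
SLA_TARGETS = {
    "critical": timedelta(hours=1),
    "warning": timedelta(hours=8),
}


def resolution_time(started_at: datetime,
                    resolved_at: Optional[datetime]) -> Optional[timedelta]:
    """Time from firing to resolved, or None while the incident is still open."""
    if resolved_at is None:
        return None
    return resolved_at - started_at


def breaches_sla(severity: str, started_at: datetime,
                 resolved_at: Optional[datetime], now: datetime) -> bool:
    """Flag an incident whose elapsed time exceeds the target for its severity.

    Open incidents are measured up to 'now'; severities without a target
    are never flagged.
    """
    target = SLA_TARGETS.get(severity)
    if target is None:
        return False
    elapsed = (resolved_at or now) - started_at
    return elapsed > target
```

Measuring open incidents against the current time lets a digest flag an incident that is still firing past its target, not only those that were resolved late.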
Automated Digests
- Daily, weekly, and monthly digest reports via Celery Beat
- Personalized: each maintainer receives only their hosts and services
- Severity distribution, SLA flags, and resolution status
- Delivered via Microsoft Teams adaptive cards and HTML email
- Stakeholder segmentation: scoped maintainers, service stakeholders, infrastructure stakeholders, general recipients
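The personalization described above amounts to grouping incidents by the people responsible for each host. The sketch below shows that scoping step only, under the assumption that incidents are plain dictionaries with `host`, `severity`, and `status` keys and that a host-to-maintainers mapping is already available; these shapes and the `build_digests` name are illustrative, not Iris's data model.

```python
from collections import Counter
from typing import Dict, List


def build_digests(incidents: List[dict],
                  host_maintainers: Dict[str, List[str]]) -> Dict[str, dict]:
    """Return one digest per maintainer, containing only their hosts' incidents."""
    scoped: Dict[str, List[dict]] = {}
    for incident in incidents:
        for maintainer in host_maintainers.get(incident["host"], []):
            scoped.setdefault(maintainer, []).append(incident)
    return {
        maintainer: {
            "incidents": items,
            "by_severity": dict(Counter(i["severity"] for i in items)),
            "unresolved": sum(1 for i in items if i["status"] != "resolved"),
        }
        for maintainer, items in scoped.items()
    }
```

A maintainer with no incidents in the period simply does not appear in the result, so the caller can decide whether to skip them or send an "all clear" digest.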
Maintenance Window Management
- Full lifecycle: plan → start → extend → end / cancel
- Types: scheduled and emergency
- Categories: infrastructure, application, network, database, security
- Automated notifications at window start, during, and on completion
- AI-generated summaries of maintenance activities
- Configurable reminder scheduling before planned windows
- Bulk cancel operations
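The lifecycle above can be expressed as a small state machine. In this sketch the action names (`start`, `extend`, `end`, `cancel`) come from the lifecycle list, while the state names and the choice to allow cancelling an active window are assumptions.

```python
from enum import Enum


class State(str, Enum):
    PLANNED = "planned"
    ACTIVE = "active"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# (current state, action) -> next state. Anything not listed is rejected.
TRANSITIONS = {
    (State.PLANNED, "start"): State.ACTIVE,
    (State.PLANNED, "cancel"): State.CANCELLED,
    (State.ACTIVE, "extend"): State.ACTIVE,      # extending keeps the window open
    (State.ACTIVE, "end"): State.COMPLETED,
    (State.ACTIVE, "cancel"): State.CANCELLED,   # assumption: active windows can be cancelled
}


def apply_action(state: State, action: str) -> State:
    """Return the next state, or raise ValueError for an invalid transition."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"cannot {action!r} a window in state {state.value!r}") from None
```

Keeping the allowed transitions in a single table makes it easy to reject operations such as ending a window that was never started, and a bulk cancel becomes a loop that applies `cancel` to each window and collects the failures.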
Deployment Notifications
- Deployment event ingestion with version, environment, and service information
- Semantic version comparison (upgrade vs rollback detection)
- AI-generated change summaries from commit messages
- Notifications routed to service maintainers
- Deployment record storage for audit trail
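Upgrade versus rollback detection can be illustrated by comparing versions as numeric tuples. This sketch handles only the `MAJOR.MINOR.PATCH` core with an optional leading `v` and ignores pre-release and build identifiers, which a full semantic version comparison would need to take into account; the function names and the `redeploy` label for equal versions are assumptions.

```python
import re
from typing import Tuple

_CORE_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) as integers from a version string."""
    match = _CORE_VERSION.match(version.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return int(match.group(1)), int(match.group(2)), int(match.group(3))


def classify_change(previous: str, new: str) -> str:
    """Return 'upgrade', 'rollback', or 'redeploy' for a version change."""
    old_parts, new_parts = parse_version(previous), parse_version(new)
    if new_parts > old_parts:
        return "upgrade"
    if new_parts < old_parts:
        return "rollback"
    return "redeploy"
```

Comparing integer tuples rather than strings matters: as strings, `"1.10.0"` sorts before `"1.9.0"`, which would misreport an upgrade as a rollback.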
Metrics Collection & Snapshots
- Prometheus metrics scraped and stored in PostgreSQL every 15 minutes
- Configurable retention (default 90 days) with automated pruning
- CPU, memory, and disk thresholds with warning and critical levels
- Historical trend data for digest and reporting card generation
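A minimal sketch of two-level thresholds and retention pruning follows. The 90-day default comes from the list above; the percentage values in `THRESHOLDS` are placeholders rather than Iris's configured limits, and the names `Threshold`, `classify`, and `is_expired` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Placeholder percentages; real limits would be configurable.
THRESHOLDS = {
    "cpu": Threshold(warning=80.0, critical=95.0),
    "memory": Threshold(warning=85.0, critical=95.0),
    "disk": Threshold(warning=80.0, critical=90.0),
}


def classify(metric: str, value: float) -> str:
    """Map a utilisation percentage to 'ok', 'warning', or 'critical'."""
    threshold = THRESHOLDS[metric]
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


def is_expired(snapshot_time: datetime, now: datetime,
               retention_days: int = 90) -> bool:
    """A snapshot older than the retention period is eligible for pruning."""
    return snapshot_time < now - timedelta(days=retention_days)
```

The critical level is checked first so that a value above both limits is reported once, at the higher severity.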
Service & Maintainer Registry
- Service registry with maintainer and escalation manager assignments
- Host-level maintainer overrides independent of service assignments
- Notification preference flags per host and service
- Used across enrollment, incident routing, digests, and deployment notifications
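The interaction between service assignments and host-level overrides can be sketched as a lookup with a fallback. The assumption here is that an override replaces the service's maintainers for that host rather than adding to them; the mappings and the `resolve_maintainers` name are illustrative.

```python
from typing import Dict, List


def resolve_maintainers(host: str, service: str,
                        host_overrides: Dict[str, List[str]],
                        service_maintainers: Dict[str, List[str]]) -> List[str]:
    """Pick who is notified for a host: its own override, else the service's maintainers."""
    override = host_overrides.get(host)
    if override:
        return list(override)
    return list(service_maintainers.get(service, []))
```

Because enrollment, incident routing, digests, and deployment notifications all need the same answer, resolving maintainers through one function keeps those features consistent with each other.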
Observability of Iris Itself
- OpenTelemetry integration for distributed tracing, metrics, and structured logging
- Configurable OTLP export to a collector endpoint
- Per-service tracing across FastAPI routes and Celery tasks
Architecture
Iris is built on a modern async Python stack:
| Layer | Technology |
|---|---|
| API Framework | FastAPI (async) |
| Task Queue | Celery with Redis broker |
| Scheduler | Celery Beat |
| Primary Database | PostgreSQL 16 (via asyncpg / SQLModel) |
| Vector Database | Weaviate 1.33 |
| AI / Embeddings | Ollama (Llama 3, nomic-embed-text) |
| Remote Execution | Paramiko (SSH), PyWinRM (WinRM) |
| Notifications | Microsoft Teams (webhooks + Graph API), KNUST Email Gateway |
| Monitoring Agent | Grafana Alloy |
| Metrics Source | Prometheus |
| Observability | OpenTelemetry |
The application is split into three cooperating processes:
- iris — the FastAPI API server, handling synchronous request/response and dispatching async work
- iris-worker — one or more Celery worker processes executing enrollment, incident processing, digest generation, maintenance, and deployment tasks
- iris-beat — the Celery Beat scheduler driving periodic tasks (digests, metrics collection, metric pruning)
Deployment
Iris runs on Docker Swarm in production, with images pulled from KNUST's internal Docker registry (dreg.knust.edu.gh). The stack is pinned to a specific monitoring node (knust-monitoring) and shares a Docker overlay network with the rest of the monitoring stack (Prometheus, Grafana, AlertManager, Weaviate, Ollama).
Production resource allocation:
| Container | Memory (limit / reserved) | CPU cores (limit / reserved) |
|---|---|---|
| iris (API) | 2 GB / 512 MB | 1.0 / 0.25 |
| iris-worker | 4 GB / 1 GB | 2.0 / 0.5 |
| iris-beat | Lightweight | Minimal |
Prometheus target files are mounted directly from the host at /opt/monitoring-stack/shared/prometheus/targets, allowing Iris to write target files that Prometheus reads without any additional network hop. Iris configuration files (templates, credentials) are mounted from /opt/monitoring-stack/shared/iris/.
The application is built from a two-stage Dockerfile:
- A builder stage installs Python dependencies from KNUST's private PyPI registry
- A slim runtime stage runs the application as a non-root user
Database migrations are managed with Alembic, and schema evolution is handled as part of the deployment pipeline.
Security
- API key authentication middleware (configurable, disabled in dev)
- All secrets externalized to environment variables
- SSH private key support for host enrollment (no password storage required)
- Azure AD app registration for Microsoft Teams Graph API access
- KNUST email gateway uses API key authentication over HTTPS
- CORS configured per environment (open in dev, restricted in production)
- Non-root container user in production images
- Internal Docker network isolation between services
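The core check behind a configurable API key middleware might look like the sketch below. It shows only the comparison logic, not the FastAPI wiring; `is_authorized` and its parameters are hypothetical, and the expected key is assumed to be loaded from an environment variable by the caller.

```python
import hmac
from typing import Optional


def is_authorized(provided_key: Optional[str], expected_key: str,
                  auth_enabled: bool = True) -> bool:
    """Decide whether a request may proceed.

    With auth disabled (as in a dev environment) every request passes.
    Otherwise the key sent by the client must match the configured one.
    """
    if not auth_enabled:
        return True
    if not provided_key or not expected_key:
        return False
    # compare_digest avoids leaking, through timing, how much of the key matched.
    return hmac.compare_digest(provided_key.encode("utf-8"),
                               expected_key.encode("utf-8"))
```

Rejecting an empty expected key when authentication is enabled guards against a misconfigured deployment silently accepting blank keys.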
Technology Summary
Python 3.12 · FastAPI · Celery · PostgreSQL · Weaviate · Ollama · Grafana Alloy · Prometheus · AlertManager · Microsoft Teams · OpenTelemetry · Docker Swarm · Redis · Paramiko · SQLModel · Alembic · Jinja2 · LangChain