SRE-Agent is a Kubernetes-native operator that brings automated self-healing and AI-powered root cause analysis (RCA) to the demanding environment of financial services applications.
It was built as part of the Bank of Anthos project to demonstrate how Site Reliability Engineering (SRE) principles can improve reliability and resilience in mission-critical banking systems.
In financial services, downtime is extremely costlyβnot just financially, but also reputationally.
As banks adopt cloud-native technologies and microservices, complexity increases and failures become harder to manage.
β οΈ Slow Incident Response β Manual fixes can take hours.- π Toil & Burnout β Repetitive manual ops cause fatigue and errors.
- π Reactive Mode β Teams firefight instead of preventing issues.
- $152M/year β Average annual loss due to downtime for large financial firms.
- $9,000/minute β Average cost of downtime across industries.
- $5M/hour β Potential cost of outages in banking/finance.
- 48% of financial firms experience a βhigh-impactβ outage weekly.
π These numbers make automation and self-healing a necessity.
We designed SRE-Agent to:
- π€ Automate Incident Response β Detect & remediate failures (pod crashes, resource contention) instantly.
- π Reduce Toil β Free ops teams from repetitive fixes.
- π§ Provide Actionable Insights β AI-powered RCA via Google Gemini API.
- βοΈ Stay Flexible β YAML-based healing rules for easy customization.
- π AI-Powered RCA β Analyze logs & metrics with
kubectl-ai
+ Gemini. - π οΈ Automated Remediation β Take corrective actions automatically.
- πΈ FinOps & Cost Optimization β Identify and remove waste.
- π Rule-Based Healing β YAML playbooks for custom rules.
- βΈοΈ Kubernetes-Native Operator β Uses Kubernetes API.
- π Prometheus Integration β Metric-driven healing actions.
- π€ AI-Powered RCA β Google Gemini integration.
- π Leader Election β Prevents conflicting actions.
- π§© Configurable & Extensible β Add new rules easily.
- π§ͺ Dry-Run Mode β Safe testing before applying fixes.
- π REST API β For manual interventions & status checks.
- β±οΈ Reduce MTTR β 5x faster incident resolution (inspired by Netflix/Etsy practices).
- πΈ Cut Cloud Costs β Up to 28β32% savings via automation.
- π Reduce Toil β More focus on strategy, less on firefighting.
The SRE-Agent integrates with Kubernetes and Prometheus to detect, heal, and analyze failures automatically.

## β‘ Getting Started
### β
Prerequisites
- A Kubernetes cluster (GKE preferred)
- `kubectl` configured
- Prometheus installed (optional)
- Google Cloud project with Gemini API enabled
---
### βοΈ Configuration
**Define healing rules:**
```bash
kubectl create configmap sre-agent-playbook --from-file=healing-playbook.yaml
Set Gemini API key:
export GEMINI_API_KEY=<YOUR_GEMINI_API_KEY>
Build & push Docker image:
docker build -t gcr.io/<YOUR_PROJECT_ID>/sre-agent:latest .
docker push gcr.io/<YOUR_PROJECT_ID>/sre-agent:latest
Apply Kubernetes manifests:
kubectl apply -f kubernetes-manifests/sre-agent.yaml
- Runs as a Kubernetes Deployment
- Uses leader election for high availability (HA)
- Continuously monitors cluster events & metrics
- Executes healing rules (e.g., restarts
CrashLoopBackOff
pods) - Triggers RCA via Gemini API for detailed insights