Self-healing Kubernetes Agent built with FastAPI, Prometheus, and Google Gemini API, deployed on GKE for automatic pod recovery and monitoring.

lakkawardhananjay/sre-agent

🚀 SRE-Agent: A Kubernetes Self-Healing Operator for Financial Services

SRE-Agent is a Kubernetes-native operator that brings automated self-healing and AI-powered root cause analysis (RCA) to the demanding environment of financial services applications.
It was built as part of the Bank of Anthos project to demonstrate how Site Reliability Engineering (SRE) principles can improve reliability and resilience in mission-critical banking systems.


🌍 Why SRE Matters in Banking

In financial services, downtime is extremely costly, not just financially but also reputationally.
As banks adopt cloud-native technologies and microservices, complexity increases and failures become harder to manage.

Key Challenges:

  • ⚠️ Slow Incident Response – Manual fixes can take hours.
  • 😓 Toil & Burnout – Repetitive manual ops cause fatigue and errors.
  • 🛑 Reactive Mode – Teams firefight instead of preventing issues.

💰 The Financial Impact of Downtime

  • $152M/year – Average annual loss due to downtime for large financial firms.
  • $9,000/minute – Average cost of downtime across industries.
  • $5M/hour – Potential cost of outages in banking/finance.
  • 48% of financial firms experience a "high-impact" outage weekly.

👉 These numbers make automation and self-healing a necessity.


🎯 Why We Built SRE-Agent

We designed SRE-Agent to:

  • 🤖 Automate Incident Response – Detect & remediate failures (pod crashes, resource contention) instantly.
  • 🔄 Reduce Toil – Free ops teams from repetitive fixes.
  • 🧠 Provide Actionable Insights – AI-powered RCA via Google Gemini API.
  • ⚙️ Stay Flexible – YAML-based healing rules for easy customization.

🧠 AI + SRE: A New Era

  • 🔍 AI-Powered RCA – Analyze logs & metrics with kubectl-ai + Gemini.
  • 🛠️ Automated Remediation – Take corrective actions automatically.
  • 💸 FinOps & Cost Optimization – Identify and remove waste.
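
As a rough illustration of how logs and metrics might be packaged for an AI-powered RCA request, the sketch below assembles a plain-text prompt. It makes no API call. The function name `build_rca_prompt`, its parameters, and the prompt wording are assumptions made for this example, not the project's actual code.

```python
from typing import Mapping, Sequence


def build_rca_prompt(
    pod: str,
    namespace: str,
    log_lines: Sequence[str],
    metrics: Mapping[str, float],
    max_log_lines: int = 50,
) -> str:
    """Assemble a plain-text RCA prompt (hypothetical helper, not the project's API)."""
    # Keep only the most recent lines so the prompt stays small.
    # A slice of [-0:] would return everything, so handle zero explicitly.
    recent = list(log_lines)[-max_log_lines:] if max_log_lines > 0 else []
    metric_text = "\n".join(
        f"- {name}: {value}" for name, value in sorted(metrics.items())
    )
    log_text = "\n".join(recent)
    return (
        f"Pod {namespace}/{pod} is failing.\n"
        "Identify the most likely root cause and suggest a remediation.\n\n"
        f"Metrics:\n{metric_text}\n\n"
        f"Last {len(recent)} log lines:\n{log_text}\n"
    )
```

The returned string could then be passed to whichever Gemini client the deployment uses. Trimming the logs before sending keeps the request small and limits how much log data leaves the cluster.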

✨ Features

  • 📜 Rule-Based Healing – YAML playbooks for custom rules.
  • ☸️ Kubernetes-Native Operator – Uses Kubernetes API.
  • 📊 Prometheus Integration – Metric-driven healing actions.
  • 🤖 AI-Powered RCA – Google Gemini integration.
  • 🔒 Leader Election – Prevents conflicting actions.
  • 🧩 Configurable & Extensible – Add new rules easily.
  • 🧪 Dry-Run Mode – Safe testing before applying fixes.
  • 🌐 REST API – For manual interventions & status checks.
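
To make rule-based healing and dry-run mode more concrete, here is a minimal sketch of how rules could be matched against observed pod conditions. The `Rule` and `Event` field names and the `apply_rules` function are illustrative assumptions and do not reflect the project's actual playbook schema. The rules are plain Python objects that mirror what a YAML playbook entry might contain.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Rule:
    """One healing rule; field names are illustrative only."""
    name: str
    condition: str  # e.g. "CrashLoopBackOff"
    action: str     # e.g. "restart_pod"


@dataclass(frozen=True)
class Event:
    """An observed pod condition (hypothetical shape)."""
    pod: str
    namespace: str
    condition: str


def apply_rules(
    events: Iterable[Event],
    rules: Iterable[Rule],
    execute: Callable[[str, Event], None],
    dry_run: bool = True,
) -> List[str]:
    """Match events to rules; call `execute` only when dry_run is False."""
    rules = tuple(rules)  # allow repeated iteration over the rules
    log: List[str] = []
    for event in events:
        for rule in rules:
            if rule.condition != event.condition:
                continue
            target = f"{event.namespace}/{event.pod}"
            if dry_run:
                # Report what would happen without touching the cluster.
                log.append(f"[dry-run] would {rule.action} {target}")
            else:
                execute(rule.action, event)
                log.append(f"{rule.action} {target}")
    return log
```

Passing the side-effecting step in as `execute` keeps the matching logic testable without a cluster. It also guarantees that dry-run mode cannot change anything, because the callback is never invoked.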

📊 Impact & ROI

  • ⏱️ Reduce MTTR – 5x faster incident resolution (inspired by Netflix/Etsy practices).
  • 💸 Cut Cloud Costs – Up to 28–32% savings via automation.
  • 😌 Reduce Toil – More focus on strategy, less on firefighting.

🏗️ Architecture

The SRE-Agent integrates with Kubernetes and Prometheus to detect, heal, and analyze failures automatically.
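
For the Prometheus side of this integration, a metric-driven check could look roughly like the sketch below. It parses the JSON body returned by a Prometheus instant query (`/api/v1/query`) and returns the pods whose value exceeds a threshold. The function name, the threshold, and the choice of the `pod` label are assumptions. Which queries SRE-Agent actually runs is not documented here. A restart counter such as kube-state-metrics' `kube_pod_container_status_restarts_total` is one plausible candidate.

```python
import json
from typing import Dict


def pods_over_threshold(
    response_body: str, threshold: float, label: str = "pod"
) -> Dict[str, float]:
    """Return {label value: sample value} for samples above `threshold`.

    Expects the body of a Prometheus instant-query response
    (hypothetical helper, not part of the project).
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")

    flagged: Dict[str, float] = {}
    for sample in data["result"]:
        # Each sample's value is [unix_timestamp, "value-as-string"].
        _timestamp, raw_value = sample["value"]
        value = float(raw_value)
        if value > threshold:
            flagged[sample["metric"].get(label, "<unknown>")] = value
    return flagged
```

The returned mapping can be turned into events for the healing rules, so metric thresholds and Kubernetes status conditions go through the same rule-matching path.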

Visual Flow

[Image: visual flow diagram]

## ⚡ Getting Started

### ✅ Prerequisites
- A Kubernetes cluster (GKE preferred)  
- `kubectl` configured  
- Prometheus installed (optional)  
- Google Cloud project with Gemini API enabled  

---

### ⚙️ Configuration

**Define healing rules:**
```bash
kubectl create configmap sre-agent-playbook --from-file=healing-playbook.yaml
```

**Set Gemini API key:**
```bash
export GEMINI_API_KEY=<YOUR_GEMINI_API_KEY>
```

---

### 🚀 Deployment

**Build & push Docker image:**
```bash
docker build -t gcr.io/<YOUR_PROJECT_ID>/sre-agent:latest .
docker push gcr.io/<YOUR_PROJECT_ID>/sre-agent:latest
```

**Apply Kubernetes manifests:**
```bash
kubectl apply -f kubernetes-manifests/sre-agent.yaml
```

---

## 🧩 How It Works

  • Runs as a Kubernetes Deployment
  • Uses leader election for high availability (HA)
  • Continuously monitors cluster events & metrics
  • Executes healing rules (e.g., restarts CrashLoopBackOff pods)
  • Triggers RCA via Gemini API for detailed insights
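
The detection step in the list above can be sketched as follows. The function scans a pod list shaped like the output of `kubectl get pods -o json` and reports containers waiting with reason `CrashLoopBackOff`. The function name and the return shape are assumptions for illustration. The field paths (`status.containerStatuses[].state.waiting.reason`) follow the Kubernetes Pod API.

```python
from typing import Any, Dict, List, Tuple


def find_crashlooping(pod_list: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (namespace, pod, container) for containers in CrashLoopBackOff.

    Hypothetical helper; `pod_list` is a dict like `kubectl get pods -o json`.
    """
    found: List[Tuple[str, str, str]] = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # containerStatuses may be absent while a pod is still being scheduled.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append(
                    (meta.get("namespace", "default"), meta.get("name", ""), cs.get("name", ""))
                )
    return found
```

Each match could then be handed to a healing rule. For pods managed by a Deployment, one common way to "restart" is to delete the pod so its controller recreates it. Whether SRE-Agent takes that exact approach is not stated here.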
