|
1 |
| -# Bank of Anthos |
| 1 | +# 🚀 SRE-Agent: A Kubernetes Self-Healing Operator for Financial Services |
2 | 2 |
|
3 |
| -<!-- Checks badge below seem to take a "neutral" check as a negative and shows failures if some checks are neutral. Commenting out the badge for now. --> |
4 |
| -<!--  --> |
5 |
| -[](https://cymbal-bank.fsi.cymbal.dev) |
| 3 | +SRE-Agent is a **Kubernetes-native operator** that brings **automated self-healing** and **AI-powered root cause analysis (RCA)** to the demanding environment of **financial services applications**. |
| 4 | +It was built as part of the **Bank of Anthos** project to demonstrate how Site Reliability Engineering (SRE) principles can improve reliability and resilience in **mission-critical banking systems**. |
7 | 5 |
|
8 |
| -**Bank of Anthos** is a sample HTTP-based web app that simulates a bank's payment processing network, allowing users to create artificial bank accounts and complete transactions. |
| 6 | +--- |
9 | 7 |
|
10 |
| -Google uses this application to demonstrate how developers can modernize enterprise applications using Google Cloud products, including: [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine), [Anthos Service Mesh (ASM)](https://cloud.google.com/anthos/service-mesh), [Anthos Config Management (ACM)](https://cloud.google.com/anthos/config-management), [Migrate to Containers](https://cloud.google.com/migrate/containers), [Spring Cloud GCP](https://spring.io/projects/spring-cloud-gcp), [Cloud Operations](https://cloud.google.com/products/operations), [Cloud SQL](https://cloud.google.com/sql/docs), [Cloud Build](https://cloud.google.com/build), and [Cloud Deploy](https://cloud.google.com/deploy). This application works on any Kubernetes cluster. |
| 8 | +## 🌍 Why SRE Matters in Banking |
11 | 9 |
|
12 |
| -If you are using Bank of Anthos, please ★Star this repository to show your interest! |
| 10 | +In financial services, downtime is **extremely costly**—not just financially, but also reputationally. |
| 11 | +As banks adopt **cloud-native** technologies and **microservices**, complexity increases and failures become harder to manage. |
13 | 12 |
|
14 |
| -**Note to Googlers:** Please fill out the form at [go/bank-of-anthos-form](https://goto2.corp.google.com/bank-of-anthos-form). |
| 13 | +### Key Challenges: |
| 14 | +- ⚠️ **Slow Incident Response** – Manual fixes can take hours. |
| 15 | +- 😓 **Toil & Burnout** – Repetitive manual ops cause fatigue and errors. |
| 16 | +- 🛑 **Reactive Mode** – Teams firefight instead of preventing issues. |
15 | 17 |
|
16 |
| -## Screenshots |
| 18 | +--- |
17 | 19 |
|
18 |
| -| Sign in | Home | |
19 |
| -| ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | |
20 |
| -| [](/docs/img/login.png) | [](/docs/img/transactions.png) | |
| 20 | +## 💰 The Financial Impact of Downtime |
21 | 21 |
|
| 22 | +- **$152M/year** – Average annual loss due to downtime for large financial firms. |
| 23 | +- **$9,000/minute** – Average cost of downtime across industries. |
| 24 | +- **$5M/hour** – Potential cost of outages in banking/finance. |
| 25 | +- **48%** of financial firms experience a “high-impact” outage **weekly**. |
22 | 26 |
|
23 |
| -## Service architecture |
| 27 | +👉 These numbers make **automation and self-healing a necessity**. |
24 | 28 |
|
25 |
| - |
| 29 | +--- |
26 | 30 |
|
27 |
| -| Service | Language | Description | |
28 |
| -| ------------------------------------------------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | |
29 |
| -| [frontend](/src/frontend) | Python | Exposes an HTTP server to serve the website. Contains login page, signup page, and home page. | |
30 |
| -| [ledger-writer](/src/ledger/ledgerwriter) | Java | Accepts and validates incoming transactions before writing them to the ledger. | |
31 |
| -| [balance-reader](/src/ledger/balancereader) | Java | Provides efficient readable cache of user balances, as read from `ledger-db`. | |
32 |
| -| [transaction-history](/src/ledger/transactionhistory) | Java | Provides efficient readable cache of past transactions, as read from `ledger-db`. | |
33 |
| -| [ledger-db](/src/ledger/ledger-db) | PostgreSQL | Ledger of all transactions. Option to pre-populate with transactions for demo users. | |
34 |
| -| [user-service](/src/accounts/userservice) | Python | Manages user accounts and authentication. Signs JWTs used for authentication by other services. | |
35 |
| -| [contacts](/src/accounts/contacts) | Python | Stores list of other accounts associated with a user. Used for drop down in "Send Payment" and "Deposit" forms. | |
36 |
| -| [accounts-db](/src/accounts/accounts-db) | PostgreSQL | Database for user accounts and associated data. Option to pre-populate with demo users. | |
37 |
| -| [loadgenerator](/src/loadgenerator) | Python/Locust | Continuously sends requests imitating users to the frontend. Periodically creates new accounts and simulates transactions between them. | |
| 31 | +## 🎯 Why We Built SRE-Agent |
38 | 32 |
|
39 |
| -## Interactive quickstart (GKE) |
| 33 | +We designed SRE-Agent to: |
40 | 34 |
|
41 |
| -The following button opens up an interactive tutorial showing how to deploy Bank of Anthos in GKE: |
| 35 | +- 🤖 **Automate Incident Response** – Detect & remediate failures (pod crashes, resource contention) instantly. |
| 36 | +- 🔄 **Reduce Toil** – Free ops teams from repetitive fixes. |
| 37 | +- 🧠 **Provide Actionable Insights** – AI-powered RCA via **Google Gemini API**. |
| 38 | +- ⚙️ **Stay Flexible** – YAML-based healing rules for easy customization. |
42 | 39 |
|
43 |
| -[](https://ssh.cloud.google.com/cloudshell/editor?show=ide&cloudshell_git_repo=https://github.com/GoogleCloudPlatform/bank-of-anthos&cloudshell_workspace=.&cloudshell_tutorial=extras/cloudshell/tutorial.md) |
| 40 | +--- |
44 | 41 |
|
45 |
| -## Quickstart (GKE) |
| 42 | +## 🧠 AI + SRE: A New Era |
46 | 43 |
|
47 |
| -1. Ensure you have the following requirements: |
48 |
| - - [Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project). |
49 |
| - - Shell environment with `gcloud`, `git`, and `kubectl`. |
| 44 | +- 🔍 **AI-Powered RCA** – Analyze logs & metrics with `kubectl-ai` + Gemini. |
| 45 | +- 🛠️ **Automated Remediation** – Take corrective actions automatically. |
| 46 | +- 💸 **FinOps & Cost Optimization** – Identify and remove waste. |
50 | 47 |
|
51 |
| -2. Clone the repository. |
| 48 | +--- |
52 | 49 |
|
53 |
| - ```sh |
54 |
| - git clone https://github.com/GoogleCloudPlatform/bank-of-anthos |
55 |
| - cd bank-of-anthos/ |
56 |
| - ``` |
| 50 | +## ✨ Features |
57 | 51 |
|
58 |
| -3. Set the Google Cloud project and region and ensure the Google Kubernetes Engine API is enabled. |
| 52 | +- 📜 **Rule-Based Healing** – YAML playbooks for custom rules. |
| 53 | +- ☸️ **Kubernetes-Native Operator** – Uses Kubernetes API. |
| 54 | +- 📊 **Prometheus Integration** – Metric-driven healing actions. |
| 55 | +- 🤖 **AI-Powered RCA** – Google Gemini integration. |
| 56 | +- 🔒 **Leader Election** – Prevents conflicting actions. |
| 57 | +- 🧩 **Configurable & Extensible** – Add new rules easily. |
| 58 | +- 🧪 **Dry-Run Mode** – Safe testing before applying fixes. |
| 59 | +- 🌐 **REST API** – For manual interventions & status checks. |
59 | 60 |
|
60 |
| - ```sh |
61 |
| - export PROJECT_ID=<PROJECT_ID> |
62 |
| - export REGION=us-central1 |
63 |
| - gcloud services enable container.googleapis.com \ |
64 |
| - --project=${PROJECT_ID} |
65 |
| - ``` |
| 61 | +--- |
66 | 62 |
|
67 |
| - Substitute `<PROJECT_ID>` with the ID of your Google Cloud project. |
| 63 | +## 📊 Impact & ROI |
68 | 64 |
|
69 |
| -4. Create a GKE cluster and get the credentials for it. |
| 65 | +- ⏱️ **Reduce MTTR** – 5x faster incident resolution (inspired by Netflix/Etsy practices). |
| 66 | +- 💸 **Cut Cloud Costs** – Up to **28–32% savings** via automation. |
| 67 | +- 😌 **Reduce Toil** – More focus on strategy, less on firefighting. |
70 | 68 |
|
71 |
| - ```sh |
72 |
| - gcloud container clusters create-auto bank-of-anthos \ |
73 |
| - --project=${PROJECT_ID} --region=${REGION} |
74 |
| - ``` |
| 69 | +--- |
75 | 70 |
|
76 |
| - Creating the cluster may take a few minutes. |
| 71 | +## 🏗️ Architecture |
77 | 72 |
|
78 |
| -5. Deploy Bank of Anthos to the cluster. |
| 73 | +The SRE-Agent integrates with Kubernetes and Prometheus to detect, heal, and analyze failures automatically. |
79 | 74 |
|
80 |
| - ```sh |
81 |
| - kubectl apply -f ./extras/jwt/jwt-secret.yaml |
82 |
| - kubectl apply -f ./kubernetes-manifests |
83 |
| - ``` |
| 75 | +### Visual Flow |
| 76 | +.png) |
84 | 77 |
|
85 |
| -6. Wait for the pods to be ready. |
86 | 78 |
|
87 |
| - ```sh |
88 |
| - kubectl get pods |
89 |
| - ``` |
| 79 | +```mermaid |
| 80 | +flowchart TD |
| 81 | + A[Prometheus Metrics & Kubernetes Events] --> B[SRE-Agent Operator] |
| 82 | + B -->|Healing Rules| C[Automated Remediation] |
| 83 | + B -->|Logs & Events| D[Gemini API] |
| 84 | + D --> E[AI-Powered RCA Report] |
| 85 | + C --> F[Kubernetes Cluster Stabilized] |
| 86 | +``` |
| 87 | +## ⚡ Getting Started |
| 88 | +### ✅ Prerequisites |
90 | 89 |
|
91 |
| - After a few minutes, you should see the Pods in a `Running` state: |
| 90 | +A Kubernetes cluster (GKE preferred), with `kubectl` configured to reach it. |
92 | 91 |
|
93 |
| - ``` |
94 |
| - NAME READY STATUS RESTARTS AGE |
95 |
| - accounts-db-6f589464bc-6r7b7 1/1 Running 0 99s |
96 |
| - balancereader-797bf6d7c5-8xvp6 1/1 Running 0 99s |
97 |
| - contacts-769c4fb556-25pg2 1/1 Running 0 98s |
98 |
| - frontend-7c96b54f6b-zkdbz 1/1 Running 0 98s |
99 |
| - ledger-db-5b78474d4f-p6xcb 1/1 Running 0 98s |
100 |
| - ledgerwriter-84bf44b95d-65mqf 1/1 Running 0 97s |
101 |
| - loadgenerator-559667b6ff-4zsvb 1/1 Running 0 97s |
102 |
| - transactionhistory-5569754896-z94cn 1/1 Running 0 97s |
103 |
| - userservice-78dc876bff-pdhtl 1/1 Running 0 96s |
104 |
| - ``` |
| 92 | +Prometheus installed (optional). |
105 | 93 |
|
106 |
| -7. Access the web frontend in a browser using the frontend's external IP. |
| 94 | +Google Cloud project with Gemini API enabled. |
107 | 95 |
|
108 |
| - ```sh |
109 |
| - kubectl get service frontend | awk '{print $4}' |
110 |
| - ``` |
| 96 | +### ⚙️ Configuration |
| 97 | +Define healing rules: |
111 | 98 |
|
112 |
| - Visit `http://EXTERNAL_IP` in a web browser to access your instance of Bank of Anthos. |
| 99 | +```bash |
| 100 | +kubectl create configmap sre-agent-playbook --from-file=healing-playbook.yaml |
| 101 | +``` |
| 102 | +Set Gemini API key: |
113 | 103 |
|
114 |
| -8. Once you are done with it, delete the GKE cluster. |
| 104 | +```bash |
| 105 | +export GEMINI_API_KEY=<YOUR_GEMINI_API_KEY> |
| 106 | +``` |
| 107 | +### 🚀 Deployment |
| 108 | +Build & push image: |
115 | 109 |
|
116 |
| - ```sh |
117 |
| - gcloud container clusters delete bank-of-anthos \ |
118 |
| - --project=${PROJECT_ID} --region=${REGION} |
119 |
| - ``` |
| 110 | +```bash |
| 111 | +docker build -t gcr.io/<YOUR_PROJECT_ID>/sre-agent:latest . |
| 112 | +docker push gcr.io/<YOUR_PROJECT_ID>/sre-agent:latest |
| 113 | +``` |
| 114 | +Apply manifests: |
120 | 115 |
|
121 |
| - Deleting the cluster may take a few minutes. |
| 116 | +```bash |
| 117 | +kubectl apply -f kubernetes-manifests/sre-agent.yaml |
| 118 | +``` |
| 119 | +## 🧩 How It Works |
| 120 | +Runs as a Kubernetes Deployment. |
122 | 121 |
|
123 |
| -## Additional deployment options |
| 122 | +Uses leader election for HA. |
124 | 123 |
|
125 |
| -- **Workload Identity**: [See these instructions.](/docs/workload-identity.md) |
126 |
| -- **Cloud SQL**: [See these instructions](/extras/cloudsql) to replace the in-cluster databases with hosted Google Cloud SQL. |
127 |
| -- **Multi Cluster with Cloud SQL**: [See these instructions](/extras/cloudsql-multicluster) to replicate the app across two regions using GKE, Multi Cluster Ingress, and Google Cloud SQL. |
128 |
| -- **Istio**: [See these instructions](/extras/istio) to configure an IngressGateway. |
129 |
| -- **Anthos Service Mesh**: ASM requires Workload Identity to be enabled in your GKE cluster. [See the workload identity instructions](/docs/workload-identity.md) to configure and deploy the app. Then, apply `extras/istio/` to your cluster to configure frontend ingress. |
130 |
| -- **Java Monolith (VM)**: We provide a version of this app where the three Java microservices are coupled together into one monolithic service, which you can deploy inside a VM (eg. Google Compute Engine). See the [ledgermonolith](/src/ledgermonolith) directory. |
| 124 | +Continuously monitors cluster events & metrics. |
131 | 125 |
|
132 |
| -## Documentation |
| 126 | +Executes healing rules (e.g., restart CrashLoopBackOff pods). |
133 | 127 |
|
134 |
| -<!-- This section is duplicated in the docs/ README: https://github.com/GoogleCloudPlatform/bank-of-anthos/blob/main/docs/README.md --> |
135 |
| - |
136 |
| -- [Development](/docs/development.md) to learn how to run and develop this app locally. |
137 |
| -- [Environments](/docs/environments.md) to learn how to deploy on non-GKE clusters. |
138 |
| -- [Workload Identity](/docs/workload-identity.md) to learn how to set-up Workload Identity. |
139 |
| -- [CI/CD pipeline](/docs/ci-cd-pipeline.md) to learn details about and how to set-up the CI/CD pipeline. |
140 |
| -- [Troubleshooting](/docs/troubleshooting.md) to learn how to resolve common problems. |
141 |
| - |
142 |
| -## Demos featuring Bank of Anthos |
143 |
| -- [Tutorial: Explore Anthos (Google Cloud docs)](https://cloud.google.com/anthos/docs/tutorials/explore-anthos) |
144 |
| -- [Tutorial: Migrating a monolith VM to GKE](https://cloud.google.com/migrate/containers/docs/migrating-monolith-vm-overview-setup) |
145 |
| -- [Tutorial: Running distributed services on GKE private clusters using ASM](https://cloud.google.com/service-mesh/docs/distributed-services-private-clusters) |
146 |
| -- [Tutorial: Run full-stack workloads at scale on GKE](https://cloud.google.com/kubernetes-engine/docs/tutorials/full-stack-scale) |
147 |
| -- [Architecture: Anthos on bare metal](https://cloud.google.com/architecture/ara-anthos-on-bare-metal) |
148 |
| -- [Architecture: Creating and deploying secured applications](https://cloud.google.com/architecture/security-foundations/creating-deploying-secured-apps) |
149 |
| -- [Keynote @ Google Cloud Next '20: Building trust for speedy innovation](https://www.youtube.com/watch?v=7QR1z35h_yc) |
150 |
| -- [Workshop @ IstioCon '22: Manage and secure distributed services with ASM](https://www.youtube.com/watch?v=--mPdAxovfE) |
| 128 | +Triggers RCA via Gemini API for detailed insights. |
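
The rule-execution step in "How It Works" (match a failing pod against a healing rule, then act or only report in dry-run mode) could be sketched roughly as follows. This is an illustrative sketch only: `HealingRule`, `PodStatus`, `plan_actions`, and every field name are hypothetical and are not taken from SRE-Agent's code or from the schema of `healing-playbook.yaml`, which this README does not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealingRule:
    """One playbook entry (hypothetical schema, not SRE-Agent's real format)."""
    name: str
    match_state: str   # pod state to match, e.g. "CrashLoopBackOff"
    min_restarts: int  # only act once the pod has restarted this many times
    action: str        # label for the remediation, e.g. "restart_pod"


@dataclass(frozen=True)
class PodStatus:
    """Minimal view of a pod, as an operator might summarize it."""
    name: str
    state: str
    restarts: int


def plan_actions(rules, pods, dry_run=True):
    """Return (pod_name, action, applied) tuples for each pod that matches a rule.

    With dry_run=True the plan is produced but nothing is marked as applied,
    so it can be reviewed before any change is made.
    """
    plan = []
    for pod in pods:
        for rule in rules:
            if pod.state == rule.match_state and pod.restarts >= rule.min_restarts:
                plan.append((pod.name, rule.action, not dry_run))
                break  # first matching rule wins for this pod
    return plan


# Hypothetical rule and pod data, for illustration only.
rules = [HealingRule("restart-crashlooping", "CrashLoopBackOff", 3, "restart_pod")]
pods = [
    PodStatus("frontend-abc", "CrashLoopBackOff", 5),
    PodStatus("ledger-db-xyz", "Running", 0),
    PodStatus("contacts-def", "CrashLoopBackOff", 1),
]

dry_plan = plan_actions(rules, pods, dry_run=True)
live_plan = plan_actions(rules, pods, dry_run=False)
print(dry_plan)
print(live_plan)
```

Keeping the matching logic separate from the code that actually calls the Kubernetes API is one way to make a dry-run mode cheap to support: the same plan is computed either way, and only the final step differs.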