CI: Scale Test Egress Gateway (scale-egw) #41492

@marseel

CI failure

Workflow: https://github.com/cilium/cilium/actions/workflows/scale-test-egw.yaml

Note that both the "baseline" and "egw" tests fail, so this is not just an EGW regression but potentially a Cilium regression.

The workflow started failing on August 5th.
Around that time, there were only two modifications to the workflow itself:
f2239b7
d0d7455

The most relevant change was the bump of egw_utils_ref from ebe06a35f96ed5458603c2744b91d1b86cc6c2a4 to 79b62757cbccec717ecdb1395505434f24242616, which pulled in the following commits:
cilium/scaffolding@94bba22
cilium/scaffolding@1c65e9b
cilium/scaffolding@6973fa4
cilium/scaffolding@e32304f

The most suspicious thing is that the EGW test pod only becomes ready after ~10 minutes:

I0903 01:12:00.885130   10340 wait_for_pods.go:122] WaitForRunningPods: labelSelector(app.kubernetes.io/name=egw-client,app.kubernetes.io/instance=stress-base): Pods: 1 out of 1 created, 0 running (0 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 1 runningButNotReady 
I0903 01:22:05.970642   10340 wait_for_pods.go:122] WaitForRunningPods: labelSelector(app.kubernetes.io/name=egw-client,app.kubernetes.io/instance=stress-base): Pods: 1 out of 1 created, 1 running (1 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 

This causes the whole workflow to time out.
Before, the same operation took less than 4 minutes:

I0804 01:18:29.449542   10023 wait_for_pods.go:122] WaitForRunningPods: labelSelector(app.kubernetes.io/name=egw-client,app.kubernetes.io/instance=stress-base): Pods: 1 out of 1 created, 0 running (0 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 1 runningButNotReady 
I0804 01:22:14.556450   10023 wait_for_pods.go:122] WaitForRunningPods: labelSelector(app.kubernetes.io/name=egw-client,app.kubernetes.io/instance=stress-base): Pods: 1 out of 1 created, 1 running (1 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
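
To make the two readiness times directly comparable, the klog timestamps above can be subtracted mechanically. The following is a minimal sketch, not part of the test tooling: the helper names `klog_time` and `elapsed` are hypothetical, and because the klog header carries no year, 2025 is assumed.

```python
import re
from datetime import datetime

# klog header: severity letter, mmdd, then hh:mm:ss.uuuuuu
KLOG_RE = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")


def klog_time(line, year=2025):
    """Parse the timestamp of a klog line. The year is an assumption."""
    m = KLOG_RE.match(line)
    if m is None:
        raise ValueError(f"not a klog line: {line!r}")
    return datetime.strptime(f"{year}{m.group(1)} {m.group(2)}",
                             "%Y%m%d %H:%M:%S.%f")


def elapsed(first, last, year=2025):
    """Time between two klog lines taken from the same run."""
    return klog_time(last, year) - klog_time(first, year)


# Line prefixes copied from the excerpts above (message body omitted).
slow = elapsed("I0903 01:12:00.885130   10340 wait_for_pods.go:122]",
               "I0903 01:22:05.970642   10340 wait_for_pods.go:122]")
fast = elapsed("I0804 01:18:29.449542   10023 wait_for_pods.go:122]",
               "I0804 01:22:14.556450   10023 wait_for_pods.go:122]")

print(slow)  # 0:10:05.085512
print(fast)  # 0:03:45.106908
```

With these inputs the failing run waits roughly 10m05s for the pod to become ready, against roughly 3m45s in the earlier passing run.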

logs-egw-client-pod-stress-base-0-egw-client-20250902-013908.log contains many warnings like:

2025-09-02T01:11:35.262051497Z time=2025-09-02T01:11:35.261Z level=WARN source=/go/src/github.com/cilium/scaffolding/egw-scale-utils/pkg/client.go:118 msg="Dialing took more than 100ms" component=client external-target=192.168.99.158:1338 cnt=20026 errcnt=0 elapsed=1.042292325s

which suggests there might be a regression in connecting to external targets, even without EGW, since the baseline test fails in the same way.
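
To quantify how far these dials exceed the 100ms threshold, the `elapsed` field can be pulled out of the client log. This is a rough sketch under stated assumptions: the function names are hypothetical, the line format is inferred only from the single warning quoted above, and the duration parser covers Go-style strings such as `1.042292325s` or `250ms` without being a full implementation of Go's duration syntax.

```python
import re

# key=value pairs; values are either double-quoted or whitespace-free
FIELD_RE = re.compile(r'([\w-]+)=("[^"]*"|\S+)')
# "ms" must be tried before "m" and "s" so that "250ms" parses correctly
DURATION_RE = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6,
                "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_fields(line):
    """Return the key=value fields of one log line as a dict."""
    return {k: v.strip('"') for k, v in FIELD_RE.findall(line)}


def go_duration_seconds(text):
    """Convert a Go-style duration string to seconds."""
    parts = DURATION_RE.findall(text)
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(num) * UNIT_SECONDS[unit] for num, unit in parts)


def slow_dials(lines, threshold_s=0.1):
    """Return (external-target, seconds) for dials slower than threshold_s."""
    result = []
    for line in lines:
        fields = parse_fields(line)
        if "elapsed" not in fields or "external-target" not in fields:
            continue
        secs = go_duration_seconds(fields["elapsed"])
        if secs > threshold_s:
            result.append((fields["external-target"], secs))
    return result


# The warning quoted above, copied verbatim.
sample = ('2025-09-02T01:11:35.262051497Z time=2025-09-02T01:11:35.261Z '
          'level=WARN source=/go/src/github.com/cilium/scaffolding/'
          'egw-scale-utils/pkg/client.go:118 '
          'msg="Dialing took more than 100ms" component=client '
          'external-target=192.168.99.158:1338 cnt=20026 errcnt=0 '
          'elapsed=1.042292325s')

for target, secs in slow_dials([sample]):
    print(f"{target} {secs:.3f}s")  # 192.168.99.158:1338 1.042s
```

For the quoted line, the dial took about 1.04s, roughly ten times the 100ms warning threshold. Running the same function over the whole log file would show whether such slow dials are isolated or systematic.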

Labels: area/CI, ci/flake, feature/egress-gateway, sig/scalability, stale