Description
CI failure
Workflow: https://github.com/cilium/cilium/actions/workflows/scale-test-egw.yaml
Note that both the "baseline" and "egw" tests fail, so this is not just an EGW regression but potentially a Cilium regression.
Started failing on August 5th.
Around that time, there were only two modifications to the workflow itself:
f2239b7
d0d7455
The most relevant change was egw_utils_ref moving from ebe06a35f96ed5458603c2744b91d1b86cc6c2a4 to 79b62757cbccec717ecdb1395505434f24242616, which pulled in the following commits:
cilium/scaffolding@94bba22
cilium/scaffolding@1c65e9b
cilium/scaffolding@6973fa4
cilium/scaffolding@e32304f
The most suspicious thing is that the EGW test pod only becomes ready after ~10 minutes:
I0903 01:12:00.885130 10340 wait_for_pods.go:122] WaitForRunningPods: labelSelector(app.kubernetes.io/name=egw-client,app.kubernetes.io/instance=stress-base): Pods: 1 out of 1 created, 0 running (0 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 1 runningButNotReady
I0903 01:22:05.970642 10340 wait_for_pods.go:122] WaitForRunningPods: labelSelector(app.kubernetes.io/name=egw-client,app.kubernetes.io/instance=stress-base): Pods: 1 out of 1 created, 1 running (1 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
This causes the whole workflow to time out.
Before, the same operation took < 4 minutes:
I0804 01:18:29.449542 10023 wait_for_pods.go:122] WaitForRunningPods: labelSelector(app.kubernetes.io/name=egw-client,app.kubernetes.io/instance=stress-base): Pods: 1 out of 1 created, 0 running (0 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 1 runningButNotReady
I0804 01:22:14.556450 10023 wait_for_pods.go:122] WaitForRunningPods: labelSelector(app.kubernetes.io/name=egw-client,app.kubernetes.io/instance=stress-base): Pods: 1 out of 1 created, 1 running (1 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
logs-egw-client-pod-stress-base-0-egw-client-20250902-013908.log contains many warnings like:
2025-09-02T01:11:35.262051497Z time=2025-09-02T01:11:35.261Z level=WARN source=/go/src/github.com/cilium/scaffolding/egw-scale-utils/pkg/client.go:118 msg="Dialing took more than 100ms" component=client external-target=192.168.99.158:1338 cnt=20026 errcnt=0 elapsed=1.042292325s
which suggests there might be a regression in connecting to external targets, even without EGW, as the baseline test fails in the same way.