
flake: TestPGCoordinatorSingle_AgentWithClient - timeout updating node for reconnection (PostgreSQL 17) #1045

@blink-so

Description


CI Failure Details

CI Run Link: https://github.com/coder/coder/actions/runs/18180473350/job/51755178055

Commit Info:

  • SHA: 76d6e13185b2acd8d0fd23e43e0d9c3eb8bcccc0
  • Date: 2025-10-02 01:28:48 UTC
  • Failed Job: test-go-pg-17 (Job ID: 51755178055)

Test Location: enterprise/tailnet/pgcoord_test.go:204

Test Name: TestPGCoordinatorSingle_AgentWithClient

Error Analysis

Error Message:

Received unexpected error:
context deadline exceeded

timeout updating node for reconnection (repeated 9 times)

Condition never satisfied

timeout waiting for responses to close for client

Root Cause: The test is timing out while waiting for tailnet coordinator peer updates during a reconnection scenario. The test expects the coordinator to send node updates after an agent reconnects, but those updates are not being received within the timeout period (testutil.WaitSuperLong).

Specific failures:

  1. Line 234: agent.AssertEventuallyHasDERP(client.ID, 11) - times out waiting for the reconnected agent to receive the client's node
  2. Line 241: client.AssertEventuallyHasDERP(agent.ID, 35) - times out waiting for the client to receive the agent's final DERP update
  3. Lines 244-246: the cleanup assertions also time out
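The second failure waits for the client to see the agent's final DERP value (35). One way to state that expectation, assuming the coordinator may coalesce intermediate updates but must neither reorder them nor drop the last one, is the hypothetical Go check below. receivedInOrder is not part of the coder codebase, and the intermediate values used with it here are illustrative only.

```go
package main

// receivedInOrder reports whether received is a subsequence of sent that
// ends with sent's final value. Hypothetical check: it assumes intermediate
// updates may be coalesced (skipped) but never reordered, and that the final
// update must always be delivered.
func receivedInOrder(sent, received []int) bool {
	if len(sent) == 0 {
		return len(received) == 0
	}
	// The last value seen by the receiver must be the last value sent.
	if len(received) == 0 || received[len(received)-1] != sent[len(sent)-1] {
		return false
	}
	// Walk sent once; each received value must appear after the previous match.
	i := 0
	for _, r := range received {
		for i < len(sent) && sent[i] != r {
			i++
		}
		if i == len(sent) {
			return false // r is out of order or was never sent
		}
		i++
	}
	return true
}
```

For example, sending 32, 33, 34, 35 and receiving 33, 35 passes, while receiving 34, 33, 35 (reordered) or 32, 33, 34 (final update missing) fails.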

The test logs show extensive coordinator activity (heartbeats, peer updates, tunnel updates), yet the expected node updates never reach the waiting peers.

PostgreSQL 17 Specific

This failure occurred in the PostgreSQL 17 job (test-go-pg-17). It may be related to:

  • PostgreSQL 17-specific behavior changes
  • Timing differences in PostgreSQL 17's query execution
  • Differences in pubsub notification behavior

Assignment Analysis

Git Blame Results:

git blame -L 204,246 enterprise/tailnet/pgcoord_test.go

The test was primarily authored by:

  • Spike Curtis (commit cc17d2feea) - 2023-06-21: Core test structure
  • Spike Curtis (commit d6154c4310) - 2024-09-12: Most recent significant updates including the failing assertions

Assignment Rationale: Following the primary assignment strategy (git blame on the failing test function), Spike Curtis should be assigned because he authored both the original test and the most recent changes to the failing assertions (lines 234, 241, 244-246).

Test Details

The test validates:

  1. Initial agent-client DERP connection exchange
  2. DERP updates propagate between peers
  3. Agent reconnection - this is where it fails
  4. Rapid successive updates maintain order
  5. Cleanup after ungraceful disconnects

The failure occurs specifically during the reconnection scenario (lines 229-234) where a new agent connection with the same ID should immediately receive the client's node information.
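The expected reconnection behavior can be modeled with a small in-memory Go sketch. toyCoordinator, Update, Connect, AddTunnel, and UpdateNode are hypothetical names; there is no PostgreSQL, pubsub, or heartbeat logic, so this only illustrates the contract the test checks, not how pgcoord implements it. When an agent reconnects with the same ID, the old connection is closed and the new one is immediately sent the last known node of every tunnel peer.

```go
package main

import "sync"

// Update is a hypothetical peer update: which peer changed and its DERP region.
type Update struct {
	From string
	DERP int
}

// toyCoordinator is a hypothetical in-memory model of the expected behavior.
// It is not pgcoord: state lives in maps guarded by a mutex.
type toyCoordinator struct {
	mu      sync.Mutex
	derp    map[string]int             // last known DERP region per peer ID
	tunnels map[string]map[string]bool // peer ID -> set of tunnel peers
	conns   map[string]chan Update     // current connection per peer ID
}

func newToyCoordinator() *toyCoordinator {
	return &toyCoordinator{
		derp:    map[string]int{},
		tunnels: map[string]map[string]bool{},
		conns:   map[string]chan Update{},
	}
}

// AddTunnel records a bidirectional tunnel between a and b.
func (c *toyCoordinator) AddTunnel(a, b string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, p := range [][2]string{{a, b}, {b, a}} {
		if c.tunnels[p[0]] == nil {
			c.tunnels[p[0]] = map[string]bool{}
		}
		c.tunnels[p[0]][p[1]] = true
	}
}

// Connect registers a (re)connection for id. A reconnect with the same ID
// closes the old channel, then immediately replays the known node of every
// tunnel peer: the step that times out in the flaky test.
func (c *toyCoordinator) Connect(id string) <-chan Update {
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.conns[id]; ok {
		close(old)
	}
	ch := make(chan Update, 64) // buffered so sends under the lock do not block in this sketch
	c.conns[id] = ch
	for peer := range c.tunnels[id] {
		if d, ok := c.derp[peer]; ok {
			ch <- Update{From: peer, DERP: d}
		}
	}
	return ch
}

// UpdateNode stores id's DERP region and forwards it to connected tunnel peers.
func (c *toyCoordinator) UpdateNode(id string, derp int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.derp[id] = derp
	for peer := range c.tunnels[id] {
		if ch, ok := c.conns[peer]; ok {
			ch <- Update{From: id, DERP: derp}
		}
	}
}
```

A usage sequence: add a tunnel between "client" and "agent", connect both, update the client to DERP 11, then call Connect("agent") again. The old agent channel closes and the new one receives {From: "client", DERP: 11} without any further update from the client.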

Related Issues

No duplicate issues were found in a search of the coder/internal repository.
