@nikhilsuri-db nikhilsuri-db commented Nov 4, 2025

What type of PR is this?

  • Refactor
  • Feature
  • Bug Fix
  • Other

Description

This PR introduces a circuit breaker pattern to the telemetry system to prevent cascading failures and improve system resilience. The implementation uses the pybreaker library to monitor telemetry request failures and automatically open the circuit when failure rates exceed configurable thresholds, blocking further requests to protect downstream services. When the circuit is open, telemetry requests are temporarily blocked until the system recovers, at which point the circuit transitions to half-open for testing and eventually closes when normal operation resumes. The circuit breaker configuration is centralized with immutable settings, includes comprehensive logging for monitoring, and maintains full backward compatibility with existing telemetry functionality.
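For readers unfamiliar with the pattern, here is a minimal stdlib-only sketch of the closed / open / half-open cycle described above. It is a toy stand-in, not pybreaker and not the code in this PR: the class name, the `fail_max` and `reset_timeout` parameters, and the choice to model the threshold as a consecutive-failure count are all assumptions made for illustration.

```python
import time


class ToyCircuitBreaker:
    """Illustrative closed/open/half-open state machine (not pybreaker)."""

    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max            # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self._clock = clock                 # injectable clock, handy for tests
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request blocked")
            self.state = "half-open"  # timeout elapsed, allow one trial request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        # Any success resets the failure count and closes the circuit.
        self._failures = 0
        self.state = "closed"
        return result

    def _on_failure(self):
        self._failures += 1
        # A failed trial call, or too many failures in a row, opens the circuit.
        if self.state == "half-open" or self._failures >= self.fail_max:
            self.state = "open"
            self._opened_at = self._clock()
```

While the circuit is open, callers get an immediate error instead of waiting on a failing endpoint, which is the "blocking further requests" behavior the description refers to.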

How is this tested?

  • Unit tests
  • E2E Tests
  • Manually
  • N/A

Related Tickets & Documents

https://docs.google.com/document/d/1ftRvby9bwDZzE3s1tOb4hJ4Pd9USiXskb9cDw-uQNPM/edit?usp=sharing

Signed-off-by: Nikhil Suri <nikhil.suri@databricks.com>
github-actions bot commented Nov 4, 2025

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

)


class CircuitBreakerStateListener(CircuitBreakerListener):
Contributor Author

Only used for logging purposes for now.
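A logging-only listener along these lines might look like the sketch below. `CircuitBreakerListener` and its `state_change(cb, old_state, new_state)` hook come from pybreaker; the subclass name, the logger name, and the message format are assumptions, and the import fallback exists only so the sketch runs where pybreaker is not installed.

```python
import logging

try:
    from pybreaker import CircuitBreakerListener
except ImportError:  # fallback so the sketch runs without pybreaker installed
    class CircuitBreakerListener:
        pass

logger = logging.getLogger("telemetry.circuit_breaker")  # hypothetical logger name


class LoggingStateListener(CircuitBreakerListener):
    """Logs state transitions only; it does not influence breaker behavior."""

    def state_change(self, cb, old_state, new_state):
        # pybreaker passes state objects that expose a `name` attribute;
        # getattr keeps the sketch tolerant of a missing previous state.
        logger.info(
            "Circuit breaker %s: %s -> %s",
            getattr(cb, "name", None),
            getattr(old_state, "name", old_state),
            getattr(new_state, "name", new_state),
        )
```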

{ version = ">=18.0.0", python = ">=3.13", optional=true }
]
pyjwt = "^2.0.0"
pybreaker = "^1.0.0"
Contributor Author

@vikrantpuppala Can we be sure that adding a new library won't break any other client usage?

pool_connections: Optional[int] = None,
pool_maxsize: Optional[int] = None,
user_agent: Optional[str] = None,
telemetry_circuit_breaker_enabled: Optional[bool] = None,
Contributor Author

@nikhilsuri-db nikhilsuri-db Nov 7, 2025

Will set this to true in the next PR, after end-to-end testing.

logger.error("HTTP request failed after retries: %s", e)
raise RequestError(f"HTTP request failed: {e}")

# Try to extract HTTP status code from the MaxRetryError
Contributor Author

Setting http_code here; the circuit breaker will use it to detect regressions in the telemetry endpoint.



@dataclass(frozen=True)
class CircuitBreakerConfig:
Contributor Author

A single object to store the circuit breaker (CB) config.
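To illustrate what `@dataclass(frozen=True)` buys here, a small sketch follows. The class name and every field and default in it are hypothetical; the actual fields of `CircuitBreakerConfig` are not shown in this thread.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class CircuitBreakerConfigSketch:
    # Field names and defaults are hypothetical, chosen for illustration.
    failure_threshold: int = 5      # failures tolerated before the circuit opens
    reset_timeout_s: float = 30.0   # seconds to wait before a half-open trial
    name: str = "telemetry"


config = CircuitBreakerConfigSketch()

try:
    config.failure_threshold = 10   # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# To vary a setting, derive a new instance instead of mutating the shared one.
strict = replace(config, failure_threshold=2)
```

Because instances cannot be changed after construction, one config object can be shared across breakers without one caller silently altering another's thresholds.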

return breaker

@classmethod
def _create_noop_circuit_breaker(cls) -> CircuitBreaker:
Contributor Author

Used when client config setup fails, or by test cases.

This allows telemetry to fail silently without raising exceptions.
"""
from unittest.mock import Mock
Contributor

Umm, is using test imports in production code intended here?
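One possible way to avoid `unittest.mock` in production code is a small pass-through class, sketched below. `NoOpCircuitBreaker` is a hypothetical name, and only `call`, `name`, and `current_state` are modeled; whether the rest of the PR relies on more of the `CircuitBreaker` interface, or on the `-> CircuitBreaker` return annotation holding literally, is not visible from this thread.

```python
class NoOpCircuitBreaker:
    """Pass-through stand-in: never opens, never blocks (hypothetical class)."""

    name = "noop"
    current_state = "closed"

    def call(self, func, *args, **kwargs):
        # Invoke the wrapped function directly, with no failure tracking.
        return func(*args, **kwargs)


# Usage: callers treat it like any other breaker.
breaker = NoOpCircuitBreaker()
total = breaker.call(lambda a, b: a + b, 1, 2)
```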


# Clear any existing state
CircuitBreakerManager._instances.clear()
CircuitBreakerManager.initialize(CircuitBreakerConfig())
Contributor

Do we need to add a similar initialization in the main function too?

self.pool_connections = pool_connections or 10
self.pool_maxsize = pool_maxsize or 20
self.user_agent = user_agent
self.telemetry_circuit_breaker_enabled = bool(telemetry_circuit_breaker_enabled)
Contributor

Shouldn't we add this to build_client_context too?

"""
config = cls._config
if config is None:
raise RuntimeError("CircuitBreakerManager not initialized")
Contributor

We shouldn't be throwing errors at all, I suppose?

# The reason may contain a response object with status
http_code = getattr(e.reason.response, "status", None)
elif (
hasattr(e, "response")
Contributor

nit: these ~30 lines feel like they would live more happily in a helper
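A helper of that kind could be sketched roughly as follows. The function name is hypothetical, and it assumes the status lives in a `status` attribute on either `e.reason.response` or `e.response`, which is what the visible diff lines suggest; the full branch logic in the PR is not shown here and may check more cases.

```python
from types import SimpleNamespace
from typing import Optional


def extract_http_status(exc: BaseException) -> Optional[int]:
    """Best-effort lookup of an HTTP status on an exception; None if absent."""
    # Candidate locations, checked in order: exc.reason.response, then exc.response.
    reason = getattr(exc, "reason", None)
    candidates = (getattr(reason, "response", None), getattr(exc, "response", None))
    for response in candidates:
        status = getattr(response, "status", None)
        if isinstance(status, int):
            return status
    return None


# Usage with stand-in objects shaped like the two cases in the diff.
wrapped = Exception("retries exhausted")
wrapped.reason = SimpleNamespace(response=SimpleNamespace(status=503))
direct = Exception("bad response")
direct.response = SimpleNamespace(status=429)
```

Using `getattr` with defaults at every step means a missing or differently shaped attribute yields `None` rather than an `AttributeError`, which also helps if the exception structure differs between library versions.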

# Try to extract HTTP status code from the MaxRetryError
http_code = None
if (
hasattr(e, "reason")
Contributor

Great to see the defensive approach with the urllib3 object structure! We have had several compatibility issues with this library before (I also remember a sev1 once). Can you also double-check that this is the expected structure?

Contributor

(I meant across versions. See the internal doc for more context on the type of issues we get.)

Returns:
CircuitBreaker instance for the host
"""
if not cls._config:
Contributor

Can we have this inside the lock too?
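A sketch of what moving the check inside the lock could look like is below. The class, its attributes, the plain-dict "breakers", and the `_NOOP` sentinel returned when no config is set are all hypothetical stand-ins; the manager in the PR may behave differently in the uninitialized case.

```python
import threading
from typing import Dict, Optional


class BreakerRegistrySketch:
    """Per-host registry; every read and write of shared state holds one lock."""

    _lock = threading.RLock()
    _config: Optional[dict] = None
    _instances: Dict[str, dict] = {}
    _NOOP: dict = {"host": None, "config": None}  # stand-in for a no-op breaker

    @classmethod
    def initialize(cls, config: dict) -> None:
        with cls._lock:
            cls._config = config

    @classmethod
    def get(cls, host: str) -> dict:
        with cls._lock:
            # The config check sits inside the lock, so it cannot race
            # with a concurrent initialize() call.
            if cls._config is None:
                return cls._NOOP
            if host not in cls._instances:
                cls._instances[host] = {"host": host, "config": cls._config}
            return cls._instances[host]
```

Returning a no-op object instead of raising when the registry is uninitialized also lines up with the earlier remark that telemetry code should not throw.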

3 participants