- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
In a HA cluster, each kube-apiserver has an ID. Controllers have access to the list of IDs for living kube-apiservers in the cluster.
The dynamic coordinated storage version API needs such a list to garbage collect stale records. The API priority and fairness feature needs a unique identifier for an apiserver reporting its concurrency limit.
Currently, such a list is already maintained in the “kubernetes” Endpoints, where the kube-apiservers’ advertised IP addresses serve as the IDs. However, this does not work for all flavors of Kubernetes deployments. For example, if the cluster sits behind a load balancer and each kube-apiserver advertises the load balancer's IP address, all kube-apiservers end up with the same advertised IP address.
- Provide a mechanism by which controllers can uniquely identify the kube-apiservers in a cluster.
- Improving the availability of kube-apiserver
Similar to the node heartbeats, a kube-apiserver will store its ID in a Lease object. All kube-apiserver Leases will be stored in the kube-system namespace. Lease creation and heartbeats will be managed by the start-kube-apiserver-identity-lease-controller post-start hook, and expired Leases will be garbage collected by the start-kube-apiserver-identity-lease-garbage-collector post-start hook in kube-apiserver.
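For illustration, the identity Lease could be built with the standard coordination/v1 types roughly as follows. This is a hedged sketch, not the actual implementation; the function name, the 3600-second lease duration, and the way the holder ID is passed in are assumptions made for the example.

```go
// Illustrative sketch only: what a kube-apiserver identity Lease could look
// like when built with client-go types, following the naming and label scheme
// described in this KEP.
package identity

import (
	"context"
	"fmt"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/pointer"
)

func createIdentityLease(ctx context.Context, client kubernetes.Interface, leaseName, hostname, holderID string) error {
	now := metav1.NewMicroTime(time.Now())
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name:      leaseName, // e.g. "kube-apiserver-<hash-using-hostname>"
			Namespace: "kube-system",
			Labels: map[string]string{
				"kubernetes.io/hostname": hostname,
				"k8s.io/component":       "kube-apiserver",
			},
		},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       pointer.String(holderID), // regenerated on every start-up
			LeaseDurationSeconds: pointer.Int32(3600),      // assumed value for illustration
			AcquireTime:          &now,
			RenewTime:            &now,
		},
	}
	if _, err := client.CoordinationV1().Leases("kube-system").Create(ctx, lease, metav1.CreateOptions{}); err != nil {
		return fmt.Errorf("creating identity lease: %w", err)
	}
	return nil
}
```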
In this proposal we focus on kube-apiservers. Aggregated apiservers don’t have the same problem, because their record is already exposed via their Service: by listing the pods selected by the Service, an aggregated server can learn the list of living servers with distinct podIPs. A server can get its own ID via the downward API.
We prefer that expired Leases remain for a longer duration rather than being collected quickly, because a Lease that is collected by mistake can do more damage than one that lingers. Take the storage version API as an example: if a kube-apiserver accidentally misses a heartbeat and gets its Lease garbage collected, its StorageVersion can be falsely garbage collected as a consequence. In that case, the storage migrator won’t be able to migrate the storage unless this kube-apiserver gets restarted and re-registers its StorageVersion. On the other hand, if a kube-apiserver is gone and its Lease stays around for an hour or two, it only delays the storage migration by the same period of time.
The kubelet heartbeat logic already written will be re-used. Lease creation and heartbeats will be managed by the start-kube-apiserver-identity-lease-controller post-start hook, and expired Leases will be garbage collected by the start-kube-apiserver-identity-lease-garbage-collector post-start hook in kube-apiserver. The refresh rate and lease duration will be configurable through kube-apiserver flags.
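A minimal sketch of what the heartbeat could look like, assuming a plain get-and-update loop; the real post-start hook reuses the kubelet lease-controller logic, including its retry and backoff behavior, and takes the interval and duration from configuration rather than from the parameters shown here.

```go
package identity

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// renewIdentityLease is a simplified stand-in for the heartbeat performed by
// the post-start hook: every interval it re-reads the Lease and bumps
// spec.renewTime. Errors are simply retried on the next tick.
func renewIdentityLease(ctx context.Context, client kubernetes.Interface, leaseName string, interval time.Duration) {
	wait.UntilWithContext(ctx, func(ctx context.Context) {
		lease, err := client.CoordinationV1().Leases("kube-system").Get(ctx, leaseName, metav1.GetOptions{})
		if err != nil {
			return // transient failure; the next tick retries
		}
		now := metav1.NewMicroTime(time.Now())
		lease.Spec.RenewTime = &now
		if _, err := client.CoordinationV1().Leases("kube-system").Update(ctx, lease, metav1.UpdateOptions{}); err != nil {
			// conflicts and transient errors are retried on the next tick
		}
	}, interval)
}
```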
The format of the lease name will be kube-apiserver-<hash-using-hostname>. A hash based on the hostname is used for two reasons:
- To ensure that a kube-apiserver that is restarting will attempt to obtain its previous lease, avoiding system churn when a kube-apiserver Lease is garbage collected.
- To avoid the need to truncate the lease name for longer hostnames that exceed the 64 character limit for object names, which can lead to naming conflicts.
Each lease will have a kubernetes.io/hostname label with the actual hostname seen by kube-apiserver, which cluster admins can use to determine which kube-apiserver owns a Lease object. However, the holder identity of the lease (lease.spec.holderIdentity) will be uniquely generated per start-up, which can be used as an indicator of ownership churn of the lease. All kube-apiserver leases will also have a component label, k8s.io/component=kube-apiserver.
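As an illustrative sketch, a cluster admin or controller could enumerate the identity Leases by the component label and map them back to hosts like this (the function name and output format are arbitrary):

```go
package identity

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printKubeAPIServerLeases lists identity Leases via the component label and
// prints the hostname label and the per-start-up holder identity. A changed
// holderIdentity for the same Lease name indicates the kube-apiserver on that
// host has restarted (ownership churn).
func printKubeAPIServerLeases(ctx context.Context, client kubernetes.Interface) error {
	leases, err := client.CoordinationV1().Leases("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s.io/component=kube-apiserver",
	})
	if err != nil {
		return err
	}
	for _, l := range leases.Items {
		holder := ""
		if l.Spec.HolderIdentity != nil {
			holder = *l.Spec.HolderIdentity
		}
		fmt.Printf("lease=%s hostname=%s holder=%s\n", l.Name, l.Labels["kubernetes.io/hostname"], holder)
	}
	return nil
}
```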
In the future, we may consider providing a flag in kube-apiserver to override the lease name, but we don't anticipate needing this today.
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
staging/src/k8s.io/apiserver/pkg/endpoints
- integration test for creating the Namespace and the Lease on kube-apiserver startup
- integration test for not creating the StorageVersions after creating the Lease
- integration test for garbage collecting a Lease that isn't refreshed
- integration test for not garbage collecting a Lease that is refreshed
Proposed e2e tests:
- an e2e test that validates the existence of the Lease objects per kube-apiserver
- an e2e test that restarts a kube-apiserver and validates that a new Lease is created with a newly generated ID and the old lease is garbage collected
Alpha should provide basic functionality covered by the tests described above.
- Appropriate metrics are agreed on and implemented
- Sufficient integration tests covering basic functionality of this enhancement.
- e2e tests outlined in the test plan are implemented
==TODO==
For non-optional features moving to GA, the graduation criteria must include conformance tests.
- This feature is proposed for control plane internal use. Master-node skew is not considered.
- During a rolling update, an HA cluster may have old and new masters. Old masters won't create Leases, nor garbage collect Leases.
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: APIServerIdentity
  - Components depending on the feature gate: kube-apiserver, kube-controller-manager
kube-apiserver will store identity Leases in the kube-system namespace.
Expired leases will be actively garbage collected by a post-start-hook in kube-apiserver.
Yes. Stale Lease objects will remain stale (their renewTime won't get updated).
Stale Lease objects will be garbage collected.
There are some tests that require enabling the feature gate in apiserver_identity_test.go. However, there are no tests validating feature enablement/disablement based on the gate. These tests should be added prior to Beta.
Existing workloads should not be impacted by this feature, unless they were looking for kube-apiserver Lease objects in the kube-system namespace, which can be found using the k8s.io/component=kube-apiserver label.
The recently added healthcheck metrics for the apiserver, which include the health of the post-start hooks, can be used to inform a rollback, specifically kubernetes_healthcheck{name="poststarthook/start-kube-apiserver-identity-lease-controller"} and kubernetes_healthcheck{name="poststarthook/start-kube-apiserver-identity-lease-garbage-collector"}.
Manual testing for upgrade/rollback will be done prior to Beta. Steps taken for manual tests will be updated here.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
The existence of kube-apiserver Lease objects in the kube-system namespace indicates whether the feature is working. Operators can check for clients that are accessing the Lease objects to see if workloads or other controllers are relying on this feature.
- Events
- Event Reason:
- API .status
- Condition name:
- Other field: .spec.holderIdentity, .spec.acquireTime, .spec.renewTime, .spec.leaseTransitions
- Other (treat as last resort)
- Details: audit logs for clients that are reading the Lease objects
Some reasonable SLOs could be:
- The number of (non-expired) Leases in kube-system is equal to the number of expected kube-apiservers 95% of the time.
- Each kube-apiserver holds a lease that is no older than 2 times the lease heartbeat interval 95% of the time.
All leases owned by kube-apiservers can be found using the k8s.io/component=kube-apiserver label.
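A hedged sketch of how the first SLO could be evaluated client-side, assuming a freshness threshold chosen by the operator (for example, twice the heartbeat interval); the function name and parameters are illustrative, not part of the design.

```go
package identity

import (
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

// countFreshLeases returns how many of the given kube-apiserver identity
// Leases were renewed within maxAge. Comparing the result against the number
// of expected kube-apiservers gives the first SLO above.
func countFreshLeases(leases []coordinationv1.Lease, maxAge time.Duration, now time.Time) int {
	fresh := 0
	for _, l := range leases {
		if l.Spec.RenewTime != nil && now.Sub(l.Spec.RenewTime.Time) <= maxAge {
			fresh++
		}
	}
	return fresh
}
```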
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: kubernetes_healthcheck
- [Optional] Aggregation method: name="poststarthook/start-kube-apiserver-identity-lease-controller", name="poststarthook/start-kube-apiserver-identity-lease-garbage-collector"
- Components exposing the metric: kube-apiserver
Are there any missing metrics that would be useful to have to improve observability of this feature?
A metric measuring the last updated time for a lease could be useful, but it could introduce cardinality problems since the lease is changed on every restart of kube-apiserver.
We may consider adding a metric exposing the count of leases in kube-system.
No
Yes, kube-apiserver will be making new API calls as part of the lease controller.
No, the feature will use the existing Lease API.
No
Yes, it will increase the number of Leases in a cluster by the number of control plane VMs.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
The lease controller may use additional resources in kube-apiserver, but it is likely negligible.
Lease objects for a given kube-apiserver may become stale if the kube-apiserver or etcd is non-responsive. Clients should be able to respond accordingly by checking the lease expiration.
- Lease objects can become stale if etcd is unavailable and clients do not check lease expiration.
- kube-apiserver heartbeats consuming too many resources (unlikely but possible)
- 2020-09-18: KEP introduced
- 2022-10-05: KEP updated with Beta criteria and all PRR questions answered.
We define a new API for kube-apiserver identity. Similar to Events, we make the storage path for the new object type tack on a TTL. Etcd will delete objects whose TTL is not refreshed in time.
- Pros:
- We don’t need to write a controller to garbage collect expired records, nor worry about client-server clock skew.
- We can extend the API in future to include more information (e.g. version, feature, config)
- Cons:
- We need a new dedicated API
Note that the proposed solution doesn't prevent us from switching to a new API in the future, similar to how node heartbeats switched from node status to Leases.
The existing “kubernetes” Endpoints mechanism can be inherited to solve the kube-apiserver identity problem. There are two parts of the mechanism:
- Each kube-apiserver periodically writes a lease of its ID (address) with a TTL to etcd through the storage interface. The lease object itself is an Endpoints. Leases will be deleted by etcd for servers that fail to refresh the TTL in time.
- A controller reads the leases through the storage interface, to collect the list of IP addresses. The controller updates the “kubernetes” Endpoints to match the IP address list.
We inherit the first part of the existing mechanism (the etcd TTL lease), but change the key and value. The key will be the new ID. All the keys will be stored under a special prefix “/apiserverleases/” (similar to the existing mechanism). The value will be a Lease object. A kube-apiserver obtains the list of IDs by directly listing/watching the leases through the storage interface.
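For illustration only, the etcd-level mechanism resembles the following sketch written against the etcd clientv3 API; the actual design goes through the apiserver storage interface, and the stored value would be a serialized Lease object rather than the plain string used here.

```go
package identity

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// registerMasterLease writes a key under the special prefix with an etcd TTL
// lease attached and keeps it alive. If this server stops refreshing the TTL,
// etcd removes the key automatically once the TTL expires.
func registerMasterLease(ctx context.Context, cli *clientv3.Client, id string, ttlSeconds int64) error {
	grant, err := cli.Grant(ctx, ttlSeconds)
	if err != nil {
		return err
	}
	// The value is a placeholder; the proposal stores a serialized Lease object here.
	if _, err := cli.Put(ctx, "/apiserverleases/"+id, id, clientv3.WithLease(grant.ID)); err != nil {
		return err
	}
	// KeepAlive refreshes the TTL in the background for as long as ctx is alive.
	ch, err := cli.KeepAlive(ctx, grant.ID)
	if err != nil {
		return err
	}
	go func() {
		for range ch {
			// drain keepalive responses
		}
	}()
	return nil
}
```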
- Cons:
- We depend on a side-channel API, which is against Kubernetes philosophy
- Clients like the kube-controller-manager cannot access the storage interface. For the storage version API, if we put the garbage collector in kube-apiserver instead of kube-controller-manager, the lack of leader election may cause update conflicts.
The kube-apiservers still write the master leases to etcd, but a controller will watch the master leases and update an existing public API (e.g. store it in a defined way in a Lease). Note that we cannot use the endpoints API like the “kubernetes” endpoints, because the endpoints API is designed to store a list of addresses, but our IDs are not IP addresses.
- Cons:
- We depend on a side-channel API, which is against Kubernetes philosophy
Similar to Alternative 1, the kube-apiservers write the master leases to etcd, and a controller watches the master leases, but updates a new public API specifically designed to host information about the API servers, including their IDs, enabled feature gates, etc.
- Cons:
- We depend on a side-channel API, which is against Kubernetes philosophy