TransportGetAllocationStatsAction may cause significant load on elected master #110716

Closed
idegtiarenko opened this issue Jul 10, 2024 · 13 comments · Fixed by #124898
Labels
>bug :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@idegtiarenko
Contributor

Elasticsearch Version

8.14

Installed Plugins

No response

Java Version

bundled

OS Version

any

Problem Description

TransportGetAllocationStatsAction runs on the elected master, so it is theoretically possible to overload the master by sending node stats requests to various nodes in the cluster, especially in clusters with many shards, since the cost of the computation is proportional to the shard count.

The aggregated result of the computation is small (5 numbers per node), so we should consider caching it for a short period of time (1 minute?) and reusing it across calls made during that period.

if (NodesStatsRequestParameters.Metric.ALLOCATIONS.containedIn(metrics)) {
    client.execute(
        TransportGetAllocationStatsAction.TYPE,
        new TransportGetAllocationStatsAction.Request(new TaskId(clusterService.localNode().getId(), task.getId())),
        listener.delegateFailure((l, r) -> {
            ActionListener.respondAndRelease(l, newResponse(request, merge(responses, r.getNodeAllocationStats()), failures));
        })
    );
} else {
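
To make the caching idea concrete, here is a minimal sketch of a time-bounded cache for a single computed value. It is only an illustration under stated assumptions: `TtlCache`, its constructor parameters, and the injectable clock are hypothetical names and are not part of Elasticsearch, and the eventual fix may well be structured differently.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an Elasticsearch class): remembers one computed
 * value and recomputes it only after the configured TTL has elapsed.
 */
final class TtlCache<T> {
    private final Supplier<T> loader;   // the expensive computation, e.g. per-node allocation stats
    private final long ttlNanos;        // maximum age of a cached value, in nanoseconds
    private final LongSupplier clock;   // injectable time source; System::nanoTime outside of tests

    private T value;
    private long loadedAtNanos;
    private boolean loaded;

    TtlCache(Supplier<T> loader, long ttlNanos, LongSupplier clock) {
        this.loader = loader;
        this.ttlNanos = ttlNanos;
        this.clock = clock;
    }

    /**
     * Returns the cached value, recomputing it if it is missing or too old.
     * The method is synchronized, so callers that arrive during a recomputation
     * wait for it and then reuse its result instead of starting their own.
     */
    synchronized T get() {
        long now = clock.getAsLong();
        // Compare elapsed time by subtraction, which stays correct if the clock value wraps.
        if (!loaded || now - loadedAtNanos >= ttlNanos) {
            value = loader.get();
            loadedAtNanos = now;
            loaded = true;
        }
        return value;
    }
}
```

With a TTL of one minute, this would be constructed as `new TtlCache<>(computeStats, java.util.concurrent.TimeUnit.MINUTES.toNanos(1), System::nanoTime)`, where `computeStats` stands in for whatever computes the per-node stats. The trade-off is that callers may see stats up to one TTL old, in exchange for the master doing the shard-proportional work at most once per TTL.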

Steps to Reproduce

n/a

Logs (if relevant)

No response

@idegtiarenko idegtiarenko added >bug :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Jul 10, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@mhl-b
Contributor

mhl-b commented Jul 10, 2024

Do we need client caching too? Server caching should help when there are few nodes with a large number of shards. But if there are many nodes that frequently poll stats, networking overhead will creep up. Maybe we could distribute the load: the client picks a random node from the cluster, and if that node does not have the stats in its cache, it forwards the request to the master and populates its cache.

@DaveCTurner
Contributor

TransportGetAllocationStatsAction$Response is pretty small, ~100B/node or thereabouts, I think caching it on the master is sufficient.


@shreedaddy

> TransportGetAllocationStatsAction$Response is pretty small, ~100B/node or thereabouts, I think caching it on the master is sufficient.

Please let me know if this issue is still open. If it is, could you please assign it to me?

@DaveCTurner
Contributor

Still available, @shreedaddy, thanks for the offer. We can't assign issues to folks outside the @elastic org, but if you want to contribute a PR then please feel free.

@shreedaddy

> Still available @shreedaddy thanks for the offer. We can't assign issues to folks outside the @elastic org but if you want to contribute a PR then please feel free.

Will get started.


JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue Mar 14, 2025
Adds a new setting
TransportGetAllocationStatsAction.CACHE_MAX_AGE_SETTING to configure
the max age for cached AllocationStats on the master.  The default
value is currently 1 minute per the suggestion in issue 110716.

Closes elastic#110716
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue Mar 18, 2025
Adds a new setting
TransportGetAllocationStatsAction.CACHE_MAX_AGE_SETTING to configure
the max age for cached AllocationStats on the master.  The default
value is currently 1 minute per the suggestion in issue 110716.

Closes elastic#110716
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue Mar 20, 2025
Adds a new setting
TransportGetAllocationStatsAction.CACHE_MAX_AGE_SETTING to configure
the max age for cached AllocationStats on the master.  The default
value is currently 1 minute per the suggestion in issue 110716.

Closes elastic#110716
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue Mar 25, 2025
Adds a new cache and setting
TransportGetAllocationStatsAction.CACHE_TTL_SETTING
"cluster.routing.allocation.stats.cache.ttl" to configure the max age
for cached NodeAllocationStats on the master.  The default
value is currently 1 minute per the suggestion in issue 110716.

Closes elastic#110716
JeremyDahlgren added a commit that referenced this issue Mar 25, 2025
…5588)

Adds a new cache and setting
TransportGetAllocationStatsAction.CACHE_TTL_SETTING
"cluster.routing.allocation.stats.cache.ttl" to configure the max age
for cached NodeAllocationStats on the master.  The default
value is currently 1 minute per the suggestion in issue 110716.

Closes #110716
6 participants