Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add to allocation architecture guide #125328

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 33 additions & 0 deletions docs/internal/DistributedArchitectureGuide.md
Original file line number Diff line number Diff line change
Expand Up @@ -229,6 +229,23 @@ works in parallel with the storage engine.)

# Allocation

### Indexes and Shards

Each index consists of a fixed number of primary shards. The number of primary shards cannot be changed for the lifetime of the index. Each
primary shard can have zero-to-many replicas used for data redundancy. The number of replicas per shard can be changed in runtime. Each
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
primary shard can have zero-to-many replicas used for data redundancy. The number of replicas per shard can be changed in runtime. Each
primary shard can have zero-to-many replicas used for data redundancy. The number of replicas per shard can be changed dynamically. Each

(or maybe "at runtime")?

shard copy (primary or replica) can be in one of four states:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth mentioning here that these states are org.elasticsearch.cluster.routing.ShardRoutingState the ones in the routing table (i.e. within the ClusterState) and the transitions between these states are part of the dance between data node and master node to reflect the lifecycle of a shard:

UNASSIGNED -> INITIALIZING happens when the master wants the data node to start creating this shard copy.
INITIALIZING -> STARTED happens when recovery is fully complete and the data node tells the master it's ready to serve requests.
STARTED -> RELOCATING happens when the master wants to initialize the node elsewhere.

A failure can take a shard in any state back to UNASSIGNED. Or the shard entry can be removed entirely from the cluster state. In either case, that tells the data node to stop whatever it is doing and shut down the shard.

Also IMO it'd be more discoverable to expand the Javadocs for ShardRoutingState with all this detail rather than hiding it away here.

- UNASSIGNED: Not present on / assigned to any data node.
- STARTED: Assigned to a specific data node and ready for indexing and search requests.
- INITIALIZING: Running recovery on an assigned node. Fast for an empty new index shard. Possibly lengthy when restoring from a snapshot or
moving from another node
- RELOCATING - The state of a shard on a source node while the shard is being moved away to a target node: the target node is running
recovery.

The `RoutingTable` and `RoutingNodes` classes are responsible for tracking to which data nodes each shard in the cluster is allocated: see
the [routing package javadoc][] for more details about these structures.

[routing package javadoc]: https://github.com/elastic/elasticsearch/blob/v9.0.0-beta1/server/src/main/java/org/elasticsearch/cluster/routing/package-info.java

### Core Components

The `DesiredBalanceShardsAllocator` is what runs shard allocation decisions. It leverages the `DesiredBalanceComputer` to produce
Expand Down Expand Up @@ -269,6 +286,22 @@ of shards, and an incentive to distribute shards within the same index across di
`NodeAllocationStatsAndWeightsCalculator` classes for more details on the weight calculations that support the `DesiredBalanceComputer`
decisions.

### Inter-Node Communicaton

The elected master node creates a shard allocation plan with the `DesiredBalanceShardsAllocator` and then selects incremental shard
movements towards the target allocation plan with the `DesiredBalanceReconciler`. The results of the `DesiredBalanceReconciler` is an
updated `RoutingTable`. The `RoutingTable` is part of the cluster state, so the master node updates the cluster state with the new
(incremental) desired shard allocation information. The updated cluster state is then published to the data nodes. Each data node will
observe any change in shard allocation related to that node and take action to achieve the new shard allocation by initiating creation of a
new empty shard, starting recovery (copying) of an existing shard from another data node, or remove a shard. When the data node finishes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar nit:

Suggested change
new empty shard, starting recovery (copying) of an existing shard from another data node, or remove a shard. When the data node finishes
new empty shard, starting recovery (copying) of an existing shard from another data node, or removing a shard. When the data node finishes

but also I feel we should expand a bit on what "removing a shard" means in this context? It means actively shutting down the corresponding running (or recovering) IndexShard instance, releasing all its resources.

Again, this feels like detail that should be in a Javadoc somewhere, maybe org.elasticsearch.cluster.routing.GlobalRoutingTable, and linked to from RoutingTable, IndexRoutingTable, IndexShardRoutingTable and ShardRouting.

a shard change, a request is sent to the master node to update the shard as having finished recovery/removal in the cluster state. The
cluster state is used by allocation as a fancy work queue: the master node conveys new work to the data nodes, which pick up the work and
report back when done.

See `DesiredBalanceShardsAllocator#submitReconcileTask` for the master node's cluster state update post-reconciliation.
See `IndicesClusterStateService#doApplyClusterState` for the data node hook to observe shard changes in the cluster state.
See `ShardStateAction#sendShardAction` for the data node request to the master node on completion of a shard state change.

# Autoscaling

The Autoscaling API in ES (Elasticsearch) uses cluster and node level statistics to provide a recommendation
Expand Down
Loading