
ESQL autogenerate docs v3 #124312

Merged: 45 commits, Mar 13, 2025

Commits
303790e
Initial work on v3 MD docs support for ES|QL functions
craigtaverner Mar 5, 2025
003117e
Newly generated SVG images
craigtaverner Mar 5, 2025
8bb7f6e
Fix manual macros {es} and {es-sql}
craigtaverner Mar 5, 2025
617c077
Fix some macros and render parameter descriptions too
craigtaverner Mar 5, 2025
908885d
Many fixes, both in generating docs and in original sources:
craigtaverner Mar 7, 2025
f95947c
Fix invalid links to mapping-reference
craigtaverner Mar 7, 2025
337af5b
Fix one link and revert details description reformatting
craigtaverner Mar 7, 2025
6a84839
[CI] Auto commit changes from spotless
elasticsearchmachine Mar 7, 2025
bda2396
Merge branch 'main' into esql_autogenerate_docs_v3
craigtaverner Mar 7, 2025
87cad1d
Revert error message to avoid test failures
craigtaverner Mar 7, 2025
0f7186f
Merge branch 'esql_autogenerate_docs_v3' of github.com:craigtaverner/…
craigtaverner Mar 7, 2025
0aaecd1
Refactored all DocsV3 functions into DocsV3Support class
craigtaverner Mar 10, 2025
ed7cd2f
Re-write count-distinct to MD and split appendix out
craigtaverner Mar 10, 2025
a68291f
Added support for generating examples, appendices and more
craigtaverner Mar 10, 2025
609ee51
Merge branch 'main' into esql_autogenerate_docs_v3
craigtaverner Mar 10, 2025
5df0062
[CI] Auto commit changes from spotless
elasticsearchmachine Mar 10, 2025
4943cca
Fix failing test from deleted `;` character
craigtaverner Mar 11, 2025
1c22212
Merge remote-tracking branch 'origin/main' into esql_autogenerate_doc…
craigtaverner Mar 11, 2025
782eb31
Merge branch 'esql_autogenerate_docs_v3' of github.com:craigtaverner/…
craigtaverner Mar 11, 2025
58aee20
Merge branch 'main' into esql_autogenerate_docs_v3
craigtaverner Mar 11, 2025
d2fe87b
Fixed forbidden APIs issue
craigtaverner Mar 11, 2025
eada857
Merge branch 'main' into esql_autogenerate_docs_v3
craigtaverner Mar 11, 2025
7cc5304
Manually moved remaining files from old location
craigtaverner Mar 11, 2025
8039f43
Support generating kibana docs and json specs
craigtaverner Mar 11, 2025
13828a0
Move inline_cast until we know what its for
craigtaverner Mar 11, 2025
767f790
Merge branch 'esql_autogenerate_docs_v3' of github.com:craigtaverner/…
craigtaverner Mar 11, 2025
dab3e76
Merge branch 'main' into esql_autogenerate_docs_v3
craigtaverner Mar 11, 2025
8249980
Bring back inadvertently deleted files
craigtaverner Mar 12, 2025
71f1b7e
Merge remote-tracking branch 'origin/main' into esql_autogenerate_doc…
craigtaverner Mar 12, 2025
516836f
Refine kibana docs generation, include extracting/reformatted examples
craigtaverner Mar 12, 2025
8ede6a9
Try fix UTF8 examples
craigtaverner Mar 12, 2025
a052bcb
Move back to using match_operator to disambiguate, since we use strin…
craigtaverner Mar 12, 2025
299dc0f
Try fix docset ignoring of kibana docs
craigtaverner Mar 12, 2025
a9e7e89
Get match_operator example working again
craigtaverner Mar 12, 2025
4c8de32
Fixed detailed description for LIKE and RLIKE which have inline examples
craigtaverner Mar 12, 2025
279fbaf
Merge remote-tracking branch 'origin/main' into esql_autogenerate_doc…
craigtaverner Mar 12, 2025
310e016
Fixed failing test from change in escape characters
craigtaverner Mar 12, 2025
fa2200a
Merge branch 'main' into esql_autogenerate_docs_v3
craigtaverner Mar 12, 2025
13b709b
Bring back deleted Kibana files
craigtaverner Mar 13, 2025
3aa804c
Merge remote-tracking branch 'origin/main' into esql_autogenerate_doc…
craigtaverner Mar 13, 2025
b8fc814
Merge branch 'main' into esql_autogenerate_docs_v3
craigtaverner Mar 13, 2025
2a492f2
Support detailed description for more operators
craigtaverner Mar 13, 2025
221563c
Merge branch 'esql_autogenerate_docs_v3' of github.com:craigtaverner/…
craigtaverner Mar 13, 2025
7c36d76
Merge remote-tracking branch 'origin/main' into esql_autogenerate_doc…
craigtaverner Mar 13, 2025
7fc9685
[CI] Auto commit changes from spotless
elasticsearchmachine Mar 13, 2025
4 changes: 2 additions & 2 deletions docs/docset.yml
@@ -2,8 +2,8 @@ project: 'Elasticsearch'
exclude:
- README.md
- internal/*
- reference/esql/functions/kibana/docs/*
- reference/esql/functions/README.md
- reference/query-languages/esql/kibana/docs/**
- reference/query-languages/esql/README.md
cross_links:
- beats
- cloud
@@ -0,0 +1,17 @@
* configurable precision, which decides on how to trade memory for accuracy,
* excellent accuracy on low-cardinality sets,
* fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of `c`, the implementation that we are using requires about `c * 8` bytes.
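For example, assuming the default `precision_threshold` of 3000, that works out to roughly 3000 * 8 = 24000 bytes, or about 24 kB.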

The following chart shows how the error varies before and after the threshold:

![cardinality error](/images/cardinality_error.png "")

For all 3 thresholds, counts have been accurate up to the configured threshold. Although not guaranteed,
this is likely to be the case. Accuracy in practice depends on the dataset in question. In general,
most datasets show consistently good accuracy. Also note that even with a threshold as low as 100,
the error remains very low (1-6% as seen in the above graph) even when counting millions of items.

The HyperLogLog++ algorithm depends on the leading zeros of hashed values, so the exact distribution of
hashes in a dataset can affect the accuracy of the cardinality.
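As a rough illustration of that dependence on leading zeros (a minimal sketch, not the Elasticsearch implementation; the FNV-1a hash is used here only as a stand-in), a single-register estimator could look like this:

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Stream;

public class LeadingZerosSketch {

    // FNV-1a 64-bit, used here purely as a stand-in hash function.
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        // The more distinct values a stream contains, the more likely it is to
        // produce a hash with many leading zeros.
        int maxLeadingZeros = Stream.of("alice", "bob", "carol", "alice", "bob")
            .mapToInt(v -> Long.numberOfLeadingZeros(fnv1a64(v)))
            .max()
            .orElse(0);
        // 2^maxLeadingZeros is a single, very noisy cardinality estimate;
        // HyperLogLog++ keeps many such registers and combines them to reduce
        // the variance, which is why the hash distribution matters.
        System.out.println("estimate ~ " + Math.pow(2, maxLeadingZeros));
    }
}
```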
@@ -1,60 +1,3 @@
## `PERCENTILE` [esql-percentile]

**Syntax**

:::{image} ../../../../../images/percentile.svg
:alt: Embedded
:class: text-center
:::

**Parameters**

true
**Description**

Returns the value at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values and the 50th percentile is the `MEDIAN`.

**Supported types**

| number | percentile | result |
| --- | --- | --- |
| double | double | double |
| double | integer | double |
| double | long | double |
| integer | double | double |
| integer | integer | double |
| integer | long | double |
| long | double | double |
| long | integer | double |
| long | long | double |

**Examples**

```esql
FROM employees
| STATS p0 = PERCENTILE(salary, 0)
, p50 = PERCENTILE(salary, 50)
, p99 = PERCENTILE(salary, 99)
```

| p0:double | p50:double | p99:double |
| --- | --- | --- |
| 25324 | 47003 | 74970.29 |

The expression can use inline functions. For example, to calculate a percentile of the maximum values of a multivalued column, first use `MV_MAX` to get the maximum value per row, and use the result with the `PERCENTILE` function

```esql
FROM employees
| STATS p80_max_salary_change = PERCENTILE(MV_MAX(salary_change), 80)
```

| p80_max_salary_change:double |
| --- |
| 12.132 |


### `PERCENTILE` is (usually) approximate [esql-percentile-approximate]

There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.

Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.
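As a sketch of that naive approach (illustrative only, with hypothetical names; this is not how Elasticsearch computes percentiles):

```java
import java.util.Arrays;

public class NaivePercentile {

    // Exact percentile: keep every value, sort, and index into the array.
    // Memory and sort cost grow with the number of values, which is why
    // Elasticsearch switches to an approximate algorithm (t-digest) instead.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
    }

    public static void main(String[] args) {
        double[] salaries = { 25324, 47003, 74970, 12000, 63000 };
        System.out.println(percentile(salaries, 50)); // prints the median, 47003.0
    }
}
```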
@@ -72,11 +15,3 @@ The following chart shows the relative error on a uniform distribution depending
![percentiles error](/images/percentiles_error.png "")

It shows how precision is better for extreme percentiles. The reason why the error diminishes for a large number of values is that the law of large numbers makes the distribution of values more and more uniform, so the t-digest tree can do a better job of summarizing it. This would not be the case for more skewed distributions.

::::{warning}
`PERCENTILE` is also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm). This means you can get slightly different results using the same data.

::::



@@ -65,19 +65,8 @@ Computing exact counts requires loading values into a hash set and returning its

This `cardinality` aggregation is based on the [HyperLogLog++](https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf) algorithm, which counts based on the hashes of the values with some interesting properties:

* configurable precision, which decides on how to trade memory for accuracy,
* excellent accuracy on low-cardinality sets,
* fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of `c`, the implementation that we are using requires about `c * 8` bytes.

The following chart shows how the error varies before and after the threshold:

![cardinality error](../../../images/cardinality_error.png "")

For all 3 thresholds, counts have been accurate up to the configured threshold. Although not guaranteed, this is likely to be the case. Accuracy in practice depends on the dataset in question. In general, most datasets show consistently good accuracy. Also note that even with a threshold as low as 100, the error remains very low (1-6% as seen in the above graph) even when counting millions of items.

The HyperLogLog++ algorithm depends on the leading zeros of hashed values, the exact distributions of hashes in a dataset can affect the accuracy of the cardinality.
:::{include} _snippets/search-aggregations-metrics-cardinality-aggregation-explanation.md
:::


## Pre-computed hashes [_pre_computed_hashes]
@@ -175,31 +175,14 @@ GET latency/_search

## Percentiles are (usually) approximate [search-aggregations-metrics-percentile-aggregation-approximation]

There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.

Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, *approximate* percentiles are calculated.

The algorithm used by the `percentile` metric is called TDigest (introduced by Ted Dunning in [Computing Accurate Quantiles using T-Digests](https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf)).

When using this metric, there are a few guidelines to keep in mind:

* Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median
* For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
* As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and volume of data being aggregated

The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:

![percentiles error](../../../images/percentiles_error.png "")

It shows how precision is better for extreme percentiles. The reason why error diminishes for large number of values is that the law of large numbers makes the distribution of values more and more uniform and the t-digest tree can do a better job at summarizing it. It would not be the case on more skewed distributions.
:::{include} /reference/data-analysis/aggregations/_snippets/search-aggregations-metrics-percentile-aggregation-approximate.md
:::

::::{warning}
Percentile aggregations are also [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm). This means you can get slightly different results using the same data.

::::



## Compression [search-aggregations-metrics-percentile-aggregation-compression]

Approximate algorithms must balance memory utilization with estimation accuracy. This balance can be controlled using a `compression` parameter:
23 changes: 0 additions & 23 deletions docs/reference/esql/functions/README.md

This file was deleted.

10 changes: 0 additions & 10 deletions docs/reference/esql/functions/kibana/docs/not_rlike.md

This file was deleted.

22 changes: 0 additions & 22 deletions docs/reference/esql/functions/kibana/inline_cast.json

This file was deleted.

50 changes: 50 additions & 0 deletions docs/reference/query-languages/esql/README.md
@@ -0,0 +1,50 @@
The ES|QL documentation is composed of static content and generated content.
The static content exists in this directory and can be edited by hand.
However, the sub-directories `_snippets`, `images` and `kibana` contain mostly
generated content.

### _snippets

In `_snippets` there are files that can be included within other files
using the [File Inclusion](https://elastic.github.io/docs-builder/syntax/file_inclusion/)
feature of the Elastic Docs V3 system.
Most, but not all, files in this directory are generated.
In particular, the directories `_snippets/functions/*` and `_snippets/operators/*`
contain subdirectories that are mostly generated:

* `description` - description of each function scraped from `@FunctionInfo#description`
* `examples` - examples of each function scraped from `@FunctionInfo#examples`
* `parameters` - description of each function's parameters scraped from `@Param`
* `signature` - railroad diagram of the syntax to invoke each function
* `types` - a table of the supported type combinations for each function's parameters. These are generated from tests.
* `layout` - a fully generated description for each function

Most functions can use the docs generated in the `layout` directory.
If we need something more custom for a function, we can make a file in this
directory that can `include::` any parts of the files above.
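For illustration only (the file names and paths below are hypothetical), such a custom file could combine generated pieces with the same include directive used elsewhere in these docs:

```markdown
<!-- Hypothetical custom layout for the CASE function -->
## `CASE` [esql-case]

Hand-written introduction that the generator cannot produce.

:::{include} _snippets/functions/description/case.md
:::

:::{include} _snippets/functions/examples/case.md
:::
```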

To regenerate the files for a function, run its tests using Gradle.
For example, to generate the docs for the `CASE` function:
```
./gradlew :x-pack:plugin:esql:test -Dtests.class='CaseTests'
```

To regenerate the files for all functions, run all of ESQL's tests using Gradle:
```
./gradlew :x-pack:plugin:esql:test
```

### images

The `images` directory contains `functions` and `operators` sub-directories with
the `*.svg` files used to describe the syntax of each function or operator.
These are all generated by the same tests that generate the functions and operators docs above.

### kibana

The `kibana` directory contains `definition` and `docs` sub-directories that are generated:

* `kibana/definition` - function definitions for Kibana's ESQL editor
* `kibana/docs` - the inline docs for Kibana

These are also generated as part of the unit tests described above.
