lexer: Treat more floats with empty exponent as valid tokens #131656

richard-uk1 · 2024-10-13T14:02:43Z

Summary

This PR changes the lexer to allow more tokens containing a number followed by 'e' with no exponent value (see #131656 (comment)). I'll use the RFC template but keep it short.

This PR is a continuation of #79912. It is also solving exactly the same problem as #111628. Also adds tests for various edge cases and improves diagnostics marginally/subjectively.

Motivation

When Rust parses a number, it is allowed to have an arbitrary suffix. If this suffix is a number literal suffix (u8, i16, f32 etc), the compiler will use the corresponding type when parsing the number. If the suffix is not a literal, then the compiler rejects the number. This rejection happens after any proc macros have been run on the source code, enabling proc macros to interpret number suffixes as they wish. This means, for example, it is possible to parse 1px into some structure like { number: f64, unit: &str }: { number: 1.0, unit: Pixels }. However, this is not possible for prefixes that begin with e or E, because in this case the lexer attempts to parse an exponent before proc macros are run, meaning the code is rejected before the proc macro authors have chance to interpret it.

This PR removes the special case for e/E, so numeric suffixes like em can be used in proc macros. An application example (given in one of the above issues) is CSS colors. Simplifying somewhat, a color looks like #xxxxxx where x is a hexadecimal character [0-9a-fA-F]. All colors are valid Rust tokens, for example #123abc is interpreted as # followed by the number 123 with suffix abc, while #abc123 is # followed by the string "123abc". There is one exception to this: cases with 1 or more numbers followed by 'e'. With this PR, all colors are valid tokens, and so html colors can be parsed in proc macros without using strings or other syntatic clutter.

Guide-level explanation

Before this PR, it was not possible to have numbers with suffixes starting with the 'e'/'E' character. This is because when the 'e' character was seen, the lexer tries to parse an exponent, and in the case of e.g. "1em" fails, resulting in a compiler error before a proc macro receives the input tokens. After this PR, the same check is carried out, but after proc macros have been run on the input.

Reference-level explanation

This PR ensures that invalid exponents are passed to the parser as suffixes for numbers rather than an error. The parser then checks the suffix to see if it is prefixed by a valid exponent, and in this case parses the float. Otherwise, if the suffix is invalid (not u8, i16 etc.) the compiler will reject the number with an 'invalid suffix' error.

Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like 1e_______________23) but nevertheless it is handled correctly.

Drawbacks

I don't think there are any obvious drawbacks to doing this. Existing diagnostics are maintained, or in one case subjectively slightly improved (1em would give "invalid suffix" rather than "empty exponent"). The lexer still has bounded read-ahead (the new patch requires 2 character lookahead). The new code is perhaps slightly more complex, but I have endeavoured to document it clearly. Perf testing showed no regression.

Rationale and alternatives

The alternatives are to do nothing or to implement #111628 (comment). Implementing this is a long term goal, but will take more work, and someone motivated to do that work. This PR fixes the specific issue with exponents without waiting for that work to be completed.

Prior art

There are a number of previous attempts to implement this functionality: #111628, #79912.

Unresolved questions

None

Future possibilities

A more wholesale refactor of the lexer is proposed here: #111628 (comment). This PR does not block any future work towards that goal.

r: @petrochenkov, since they reviewed #79912.

rustbot · 2024-10-13T14:02:51Z

r? @petrochenkov

rustbot has assigned @petrochenkov.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

richard-uk1 · 2024-11-10T12:57:10Z

I've added some ui tests and I think this is now ready for review. It doesn't stop @petrochenkov changing the whole thing later if they so choose.

petrochenkov · 2024-12-10T18:51:22Z

Apologies for the delays.
This is a language change rather than a fix, and it's not very urgent, so I'll prioritize some other things in my review queue over it.
In any case I should get to in in February, when I return to work.

tests/ui/consts/custom-suffixes-exponent-like.rs

compiler/rustc_session/src/errors.rs

compiler/rustc_lexer/src/lib.rs

petrochenkov · 2025-02-21T18:13:54Z

I'd still prefer to see #111628 (comment) implemented, but I think this looks like a compatible subset, so we can go forward.

compiler/rustc_lexer/src/lib.rs

petrochenkov · 2025-02-21T18:42:57Z

There's also one subtle change in behavior here.
If we have a suffix that starts with _s and then continues with some unicode character that is

is_id_continue
not is_id_start, and
not an ascii digit,

then eat_literal_suffix fill fail to eat the tail of the suffix correctly because it expects a is_id_start character at the start.

For correct behavior we'll need to use self.eat_while(is_id_continue); instead of eat_literal_suffix if the suffix start position is overridden.
This should be testable with U+005F _ LOW LINE.

richard-uk1 · 2025-02-23T13:41:00Z

Hey @petrochenkov,

did you mean U+005F (underscore) as the test character? It wouldn't fit your criteria since it is a valid identifier start character. I've been using U+00B7 (middle dot) as it meets your criteria.
I checked and the nuber parser already fails to handle the case where there is a id_continue but not id_start character that isn't a digit in the suffix, e.g. (1_\u{00B7}). If I understand correctly, this means it's not a regression, so perhaps it should be fixed in a separate PR? Please LMK if I've misunderstood.

Other than the points above, it should be ready for review again.

petrochenkov · 2025-02-28T21:27:40Z

@bors try @rust-timer queue

…ustc_session, r=<try> move some invalid exponent detection into rustc_session This PR moves part of the exponent checks from `rustc_lexer`/`rustc_parser` into `rustc_session`. This change does not affect which programs are accepted by the complier, or the diagnostics that are reported, with one main exception. That exception is that floats or ints with suffixes beginning with `e` are rejected *after* the token stream is passed to proc macros, rather than being rejected by the parser as was the case. This gives proc macro authors more consistent access to numeric literals: currently a proc macro could interpret `1m` or `30s` but not `7eggs` or `3em`. After this change all are handled the same. The lexer will still reject input if it contains `e` followed by a number, `+`/`-`, or `_` if they are not followed by a valid integer literal (number + `_`), but this doesn't affect macro authors who just want to access alpha suffixes. This PR is a continuation of rust-lang#79912. It is also solving exactly the same problem as [rust-lang#111628](rust-lang#111628). Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like `1e_______________23`) but nevertheless it is handled correctly. Also adds tests for various edge cases and improves diagnostics marginally. r: `@petrochenkov,` since they reviewed rust-lang#79912.

bors · 2025-02-28T21:28:51Z

⌛ Trying commit e276417 with merge c0ae197...

bors · 2025-02-28T23:28:42Z

☀️ Try build successful - checks-actions
Build commit: c0ae197 (c0ae1973f0fe7ffdee839ec09c530a222755a710)

rust-timer · 2025-03-01T00:57:21Z

Finished benchmarking commit (c0ae197): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results (secondary -2.7%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-2.7%	[-2.7%, -2.7%]	1
All ❌✅ (primary)	-	-	0

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 770.987s -> 770.548s (-0.06%)
Artifact size: 361.90 MiB -> 361.98 MiB (0.02%)

petrochenkov · 2025-03-01T12:43:01Z

Okay, now this needs a proposal for the language team, with the motivation, description of the change, drawbacks, etc. (Probably right in this PR for a start, not a full RFC, but the basic RFC template can be used.)
With the proposal I'll nominate this for a formal lang team approval.
@rustbot author

mattheww · 2025-03-01T12:46:41Z

Note that the parallel issue #111615 is already nominated for lang team discussion (and marked as "easy decision").

richard-uk1 · 2025-03-01T22:25:34Z

@mattheww do you think that I can commandeer that issue? This PR doesn't completely fix the issue (1e+ is still parsed as an exponent rather than 1e, + but it does fix the case of hex colors (and indeed any sequence of letters and numbers now gets passed to the proc macro).

mattheww · 2025-03-01T22:36:38Z

I can't speak for the lang team but I expect they'd prefer to have one proposal on their plate rather than two similar ones. It seems to me that this PR covers what that issue is asking for.

I expect a writeup of exactly what's been implemented would be welcome.

mattheww · 2025-03-04T20:27:39Z

I've made a branch in my model of the lexer which makes it match the result of this PR. That should provide one way to see exactly what the difference in observable behaviour is.

Changes:
https://github.com/mattheww/lexeywan/compare/2025-03_e-suffix

Rendered writeup for this PR's behaviour:
https://mjw.woodcraft.me.uk/2025-lexeywan-e-suffix/

Rendered writeup for Rust 1.85:
https://mjw.woodcraft.me.uk/2024-lexeywan/

Here are some examples of what it does for particular tokens:

Newly accepted

source	suffix
`1e`	`e`
`1em`	`em`
`0b1em`	`em`
`1.0e`	`e`
`1.0em`	`em`
`1ef32`	`ef32`
`1e_f32`	`e_f32`

Still accepted

source	suffix
`1e2e`	`e`
`1e2em`	`em`
`1.0e2e`	`e`
`1.0e2em`	`em`

Still rejected

source	possible future tokenisation
`1e+f32`	`1` with suffix `e`, `+`, `f32`
`1e_·`	`1` with suffix `e_·`
`1e2·`	`1` with suffix `e2·`

Note that · is a character with XID_Continue but not XID_Start.

1e_· is the case discussed above and here.

1e2· is analysed as 1e2 followed by ·, and the · is rejected.
Compare 0b1·, which we've always rejected in a similar way although it could be taken to be 0 with suffix b1·.

richard-uk1 · 2025-03-07T15:51:18Z

@petrochenkov I've updated the PR comment to follow the RFC template. I think I've now done everything requested of me.

petrochenkov · 2025-03-07T22:40:56Z

Nominated.

This PR moves part of the exponent checks from rustc_lexer/rustc_parser into rustc_session

Things like this are not relevant to the language team.
It's better to give some grammar or at least link to #131656 (comment).

richard-uk1 · 2025-03-08T00:26:28Z

It's better to give some grammar or at least link to #131656 (comment).

I updated the 'summary' to point to the comment showing how different input is lexed.

rustbot assigned petrochenkov Oct 13, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 13, 2024