-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lexer: Treat more floats with empty exponent as valid tokens #131656
base: master
Are you sure you want to change the base?
lexer: Treat more floats with empty exponent as valid tokens #131656
Conversation
rustbot has assigned @petrochenkov. Use |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
8b41315
to
e8c244e
Compare
This comment has been minimized.
This comment has been minimized.
a642bff
to
82a8103
Compare
This comment has been minimized.
This comment has been minimized.
I've added some ui tests and I think this is now ready for review. It doesn't stop @petrochenkov changing the whole thing later if they so choose. |
This comment has been minimized.
This comment has been minimized.
Apologies for the delays. |
I'd still prefer to see #111628 (comment) implemented, but I think this looks like a compatible subset, so we can go forward. |
There's also one subtle change in behavior here.
then For correct behavior we'll need to use |
633b1b0
to
d13c474
Compare
This comment has been minimized.
This comment has been minimized.
d13c474
to
2c33df5
Compare
This comment has been minimized.
This comment has been minimized.
c3947a0
to
cfb8823
Compare
This comment has been minimized.
This comment has been minimized.
cfb8823
to
79d5952
Compare
Hey @petrochenkov,
Other than the points above, it should be ready for review again. |
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
…ustc_session, r=<try> move some invalid exponent detection into rustc_session This PR moves part of the exponent checks from `rustc_lexer`/`rustc_parser` into `rustc_session`. This change does not affect which programs are accepted by the complier, or the diagnostics that are reported, with one main exception. That exception is that floats or ints with suffixes beginning with `e` are rejected *after* the token stream is passed to proc macros, rather than being rejected by the parser as was the case. This gives proc macro authors more consistent access to numeric literals: currently a proc macro could interpret `1m` or `30s` but not `7eggs` or `3em`. After this change all are handled the same. The lexer will still reject input if it contains `e` followed by a number, `+`/`-`, or `_` if they are not followed by a valid integer literal (number + `_`), but this doesn't affect macro authors who just want to access alpha suffixes. This PR is a continuation of rust-lang#79912. It is also solving exactly the same problem as [rust-lang#111628](rust-lang#111628). Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like `1e_______________23`) but nevertheless it is handled correctly. Also adds tests for various edge cases and improves diagnostics marginally. r: `@petrochenkov,` since they reviewed rust-lang#79912.
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (c0ae197): comparison URL. Overall result: no relevant changes - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis benchmark run did not return any relevant results for this metric. Max RSS (memory usage)This benchmark run did not return any relevant results for this metric. CyclesResults (secondary -2.7%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 770.987s -> 770.548s (-0.06%) |
Okay, now this needs a proposal for the language team, with the motivation, description of the change, drawbacks, etc. (Probably right in this PR for a start, not a full RFC, but the basic RFC template can be used.) |
Note that the parallel issue #111615 is already nominated for lang team discussion (and marked as "easy decision"). |
@mattheww do you think that I can commandeer that issue? This PR doesn't completely fix the issue ( |
I can't speak for the lang team but I expect they'd prefer to have one proposal on their plate rather than two similar ones. It seems to me that this PR covers what that issue is asking for. I expect a writeup of exactly what's been implemented would be welcome. |
I've made a branch in my model of the lexer which makes it match the result of this PR. That should provide one way to see exactly what the difference in observable behaviour is. Changes: Rendered writeup for this PR's behaviour: Rendered writeup for Rust 1.85: Here are some examples of what it does for particular tokens: Newly accepted
Still accepted
Still rejected
Note that
|
@petrochenkov I've updated the PR comment to follow the RFC template. I think I've now done everything requested of me. |
Nominated.
Things like this are not relevant to the language team. |
I updated the 'summary' to point to the comment showing how different input is lexed. |
Summary
This PR changes the lexer to allow more tokens containing a number followed by 'e' with no exponent value (see #131656 (comment)). I'll use the RFC template but keep it short.
This PR is a continuation of #79912. It is also solving exactly the same problem as #111628. Also adds tests for various edge cases and improves diagnostics marginally/subjectively.
Motivation
When Rust parses a number, it is allowed to have an arbitrary suffix. If this suffix is a number literal suffix (
u8
,i16
,f32
etc), the compiler will use the corresponding type when parsing the number. If the suffix is not a literal, then the compiler rejects the number. This rejection happens after any proc macros have been run on the source code, enabling proc macros to interpret number suffixes as they wish. This means, for example, it is possible to parse1px
into some structure like{ number: f64, unit: &str }
:{ number: 1.0, unit: Pixels }
. However, this is not possible for prefixes that begin withe
orE
, because in this case the lexer attempts to parse an exponent before proc macros are run, meaning the code is rejected before the proc macro authors have chance to interpret it.This PR removes the special case for
e
/E
, so numeric suffixes likeem
can be used in proc macros. An application example (given in one of the above issues) is CSS colors. Simplifying somewhat, a color looks like#xxxxxx
wherex
is a hexadecimal character[0-9a-fA-F]
. All colors are valid Rust tokens, for example#123abc
is interpreted as#
followed by the number123
with suffixabc
, while#abc123
is#
followed by the string"123abc"
. There is one exception to this: cases with 1 or more numbers followed by 'e'. With this PR, all colors are valid tokens, and so html colors can be parsed in proc macros without using strings or other syntatic clutter.Guide-level explanation
Before this PR, it was not possible to have numbers with suffixes starting with the 'e'/'E' character. This is because when the 'e' character was seen, the lexer tries to parse an exponent, and in the case of e.g. "1em" fails, resulting in a compiler error before a proc macro receives the input tokens. After this PR, the same check is carried out, but after proc macros have been run on the input.
Reference-level explanation
This PR ensures that invalid exponents are passed to the parser as suffixes for numbers rather than an error. The parser then checks the suffix to see if it is prefixed by a valid exponent, and in this case parses the float. Otherwise, if the suffix is invalid (not
u8
,i16
etc.) the compiler will reject the number with an 'invalid suffix' error.Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like
1e_______________23
) but nevertheless it is handled correctly.Drawbacks
I don't think there are any obvious drawbacks to doing this. Existing diagnostics are maintained, or in one case subjectively slightly improved (
1em
would give "invalid suffix" rather than "empty exponent"). The lexer still has bounded read-ahead (the new patch requires 2 character lookahead). The new code is perhaps slightly more complex, but I have endeavoured to document it clearly. Perf testing showed no regression.Rationale and alternatives
The alternatives are to do nothing or to implement #111628 (comment). Implementing this is a long term goal, but will take more work, and someone motivated to do that work. This PR fixes the specific issue with exponents without waiting for that work to be completed.
Prior art
There are a number of previous attempts to implement this functionality: #111628, #79912.
Unresolved questions
None
Future possibilities
A more wholesale refactor of the lexer is proposed here: #111628 (comment). This PR does not block any future work towards that goal.
r: @petrochenkov, since they reviewed #79912.