Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lexer: Treat more floats with empty exponent as valid tokens #131656

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

richard-uk1
Copy link
Contributor

@richard-uk1 richard-uk1 commented Oct 13, 2024

Summary

This PR changes the lexer to allow more tokens containing a number followed by 'e' with no exponent value (see #131656 (comment)). I'll use the RFC template but keep it short.

This PR is a continuation of #79912. It is also solving exactly the same problem as #111628. Also adds tests for various edge cases and improves diagnostics marginally/subjectively.

Motivation

When Rust parses a number, it is allowed to have an arbitrary suffix. If this suffix is a number literal suffix (u8, i16, f32 etc), the compiler will use the corresponding type when parsing the number. If the suffix is not a literal, then the compiler rejects the number. This rejection happens after any proc macros have been run on the source code, enabling proc macros to interpret number suffixes as they wish. This means, for example, it is possible to parse 1px into some structure like { number: f64, unit: &str }: { number: 1.0, unit: Pixels }. However, this is not possible for prefixes that begin with e or E, because in this case the lexer attempts to parse an exponent before proc macros are run, meaning the code is rejected before the proc macro authors have chance to interpret it.

This PR removes the special case for e/E, so numeric suffixes like em can be used in proc macros. An application example (given in one of the above issues) is CSS colors. Simplifying somewhat, a color looks like #xxxxxx where x is a hexadecimal character [0-9a-fA-F]. All colors are valid Rust tokens, for example #123abc is interpreted as # followed by the number 123 with suffix abc, while #abc123 is # followed by the string "123abc". There is one exception to this: cases with 1 or more numbers followed by 'e'. With this PR, all colors are valid tokens, and so html colors can be parsed in proc macros without using strings or other syntatic clutter.

Guide-level explanation

Before this PR, it was not possible to have numbers with suffixes starting with the 'e'/'E' character. This is because when the 'e' character was seen, the lexer tries to parse an exponent, and in the case of e.g. "1em" fails, resulting in a compiler error before a proc macro receives the input tokens. After this PR, the same check is carried out, but after proc macros have been run on the input.

Reference-level explanation

This PR ensures that invalid exponents are passed to the parser as suffixes for numbers rather than an error. The parser then checks the suffix to see if it is prefixed by a valid exponent, and in this case parses the float. Otherwise, if the suffix is invalid (not u8, i16 etc.) the compiler will reject the number with an 'invalid suffix' error.

Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like 1e_______________23) but nevertheless it is handled correctly.

Drawbacks

I don't think there are any obvious drawbacks to doing this. Existing diagnostics are maintained, or in one case subjectively slightly improved (1em would give "invalid suffix" rather than "empty exponent"). The lexer still has bounded read-ahead (the new patch requires 2 character lookahead). The new code is perhaps slightly more complex, but I have endeavoured to document it clearly. Perf testing showed no regression.

Rationale and alternatives

The alternatives are to do nothing or to implement #111628 (comment). Implementing this is a long term goal, but will take more work, and someone motivated to do that work. This PR fixes the specific issue with exponents without waiting for that work to be completed.

Prior art

There are a number of previous attempts to implement this functionality: #111628, #79912.

Unresolved questions

None

Future possibilities

A more wholesale refactor of the lexer is proposed here: #111628 (comment). This PR does not block any future work towards that goal.

r: @petrochenkov, since they reviewed #79912.

@rustbot
Copy link
Collaborator

rustbot commented Oct 13, 2024

r? @petrochenkov

rustbot has assigned @petrochenkov.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 13, 2024
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from 8b41315 to e8c244e Compare October 13, 2024 23:06
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from a642bff to 82a8103 Compare November 10, 2024 12:06
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 marked this pull request as ready for review November 10, 2024 12:56
@richard-uk1
Copy link
Contributor Author

I've added some ui tests and I think this is now ready for review. It doesn't stop @petrochenkov changing the whole thing later if they so choose.

@rust-log-analyzer

This comment has been minimized.

@petrochenkov
Copy link
Contributor

Apologies for the delays.
This is a language change rather than a fix, and it's not very urgent, so I'll prioritize some other things in my review queue over it.
In any case I should get to in in February, when I return to work.

@petrochenkov
Copy link
Contributor

I'd still prefer to see #111628 (comment) implemented, but I think this looks like a compatible subset, so we can go forward.

@petrochenkov
Copy link
Contributor

There's also one subtle change in behavior here.
If we have a suffix that starts with _s and then continues with some unicode character that is

  • is_id_continue
  • not is_id_start, and
  • not an ascii digit,

then eat_literal_suffix fill fail to eat the tail of the suffix correctly because it expects a is_id_start character at the start.

For correct behavior we'll need to use self.eat_while(is_id_continue); instead of eat_literal_suffix if the suffix start position is overridden.
This should be testable with U+005F _ LOW LINE.

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 21, 2025
@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from 633b1b0 to d13c474 Compare February 22, 2025 16:12
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from d13c474 to 2c33df5 Compare February 22, 2025 16:40
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from c3947a0 to cfb8823 Compare February 22, 2025 17:05
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from cfb8823 to 79d5952 Compare February 23, 2025 13:30
@richard-uk1
Copy link
Contributor Author

richard-uk1 commented Feb 23, 2025

Hey @petrochenkov,

  • did you mean U+005F (underscore) as the test character? It wouldn't fit your criteria since it is a valid identifier start character. I've been using U+00B7 (middle dot) as it meets your criteria.
  • I checked and the nuber parser already fails to handle the case where there is a id_continue but not id_start character that isn't a digit in the suffix, e.g. (1_\u{00B7}). If I understand correctly, this means it's not a regression, so perhaps it should be fixed in a separate PR? Please LMK if I've misunderstood.

Other than the points above, it should be ready for review again.

@rustbot rustbot removed the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Feb 28, 2025
@petrochenkov
Copy link
Contributor

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 28, 2025
@petrochenkov petrochenkov removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 28, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 28, 2025
…ustc_session, r=<try>

move some invalid exponent detection into rustc_session

This PR moves part of the exponent checks from `rustc_lexer`/`rustc_parser` into `rustc_session`.

This change does not affect which programs are accepted by the complier, or the diagnostics that are reported, with one main exception. That exception is that floats or ints with suffixes beginning with `e` are rejected *after* the token stream is passed to proc macros, rather than being rejected by the parser as was the case. This gives proc macro authors more consistent access to numeric literals: currently a proc macro could interpret `1m` or `30s` but not `7eggs` or `3em`. After this change all are handled the same. The lexer will still reject input if it contains `e` followed by a number, `+`/`-`, or `_` if they are not followed by a valid integer literal (number + `_`), but this doesn't affect macro authors who just want to access alpha suffixes.

This PR is a continuation of rust-lang#79912. It is also solving exactly the same problem as [rust-lang#111628](rust-lang#111628).

Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like `1e_______________23`) but nevertheless it is handled correctly.

Also adds tests for various edge cases and improves diagnostics marginally.

r: `@petrochenkov,` since they reviewed rust-lang#79912.
@bors
Copy link
Contributor

bors commented Feb 28, 2025

⌛ Trying commit e276417 with merge c0ae197...

@bors
Copy link
Contributor

bors commented Feb 28, 2025

☀️ Try build successful - checks-actions
Build commit: c0ae197 (c0ae1973f0fe7ffdee839ec09c530a222755a710)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (c0ae197): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results (secondary -2.7%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-2.7% [-2.7%, -2.7%] 1
All ❌✅ (primary) - - 0

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 770.987s -> 770.548s (-0.06%)
Artifact size: 361.90 MiB -> 361.98 MiB (0.02%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 1, 2025
@petrochenkov
Copy link
Contributor

Okay, now this needs a proposal for the language team, with the motivation, description of the change, drawbacks, etc. (Probably right in this PR for a start, not a full RFC, but the basic RFC template can be used.)
With the proposal I'll nominate this for a formal lang team approval.
@rustbot author

@rustbot rustbot added the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Mar 1, 2025
@petrochenkov petrochenkov changed the title move some invalid exponent detection into rustc_session lexer: Treat more floats with empty exponent as valid tokens Mar 1, 2025
@mattheww
Copy link
Contributor

mattheww commented Mar 1, 2025

Note that the parallel issue #111615 is already nominated for lang team discussion (and marked as "easy decision").

@richard-uk1
Copy link
Contributor Author

@mattheww do you think that I can commandeer that issue? This PR doesn't completely fix the issue (1e+ is still parsed as an exponent rather than 1e, + but it does fix the case of hex colors (and indeed any sequence of letters and numbers now gets passed to the proc macro).

@mattheww
Copy link
Contributor

mattheww commented Mar 1, 2025

I can't speak for the lang team but I expect they'd prefer to have one proposal on their plate rather than two similar ones. It seems to me that this PR covers what that issue is asking for.

I expect a writeup of exactly what's been implemented would be welcome.

@mattheww
Copy link
Contributor

mattheww commented Mar 4, 2025

I've made a branch in my model of the lexer which makes it match the result of this PR. That should provide one way to see exactly what the difference in observable behaviour is.

Changes:
https://github.com/mattheww/lexeywan/compare/2025-03_e-suffix

Rendered writeup for this PR's behaviour:
https://mjw.woodcraft.me.uk/2025-lexeywan-e-suffix/

Rendered writeup for Rust 1.85:
https://mjw.woodcraft.me.uk/2024-lexeywan/

Here are some examples of what it does for particular tokens:

Newly accepted

source suffix
1e e
1em em
0b1em em
1.0e e
1.0em em
1ef32 ef32
1e_f32 e_f32

Still accepted

source suffix
1e2e e
1e2em em
1.0e2e e
1.0e2em em

Still rejected

source possible future tokenisation
1e+f32 1 with suffix e, +, f32
1e_· 1 with suffix e_·
1e2· 1 with suffix e2·

Note that · is a character with XID_Continue but not XID_Start.

1e_· is the case discussed above and here.

1e2· is analysed as 1e2 followed by ·, and the · is rejected.
Compare 0b1·, which we've always rejected in a similar way although it could be taken to be 0 with suffix b1·.

@richard-uk1
Copy link
Contributor Author

@petrochenkov I've updated the PR comment to follow the RFC template. I think I've now done everything requested of me.

@richard-uk1 richard-uk1 requested a review from petrochenkov March 7, 2025 15:51
@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 7, 2025
@petrochenkov petrochenkov added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). T-lang Relevant to the language team, which will review and decide on the PR/issue. I-lang-nominated Nominated for discussion during a lang team meeting. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 7, 2025
@petrochenkov
Copy link
Contributor

Nominated.

This PR moves part of the exponent checks from rustc_lexer/rustc_parser into rustc_session

Things like this are not relevant to the language team.
It's better to give some grammar or at least link to #131656 (comment).

@richard-uk1
Copy link
Contributor Author

richard-uk1 commented Mar 8, 2025

It's better to give some grammar or at least link to #131656 (comment).

I updated the 'summary' to point to the comment showing how different input is lexed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-lang-nominated Nominated for discussion during a lang team meeting. S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-lang Relevant to the language team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants