-
-
Notifications
You must be signed in to change notification settings - Fork 33.8k
bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
4025110
2a222da
26892d3
27e8122
3762787
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -787,16 +787,14 @@ typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult; | |
| * | ||
| * If `yes_only` is true, then return MAYBE as soon as we determine | ||
| * the answer is not YES. | ||
| * | ||
| * For background and details on the algorithm, see UAX #15: | ||
| * https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms | ||
| */ | ||
| static QuickcheckResult | ||
| is_normalized_quickcheck(PyObject *self, PyObject *input, | ||
| int nfc, int k, int yes_only) | ||
|
||
| { | ||
| /* This is an implementation of the following algorithm: | ||
| https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms | ||
| See there for background. | ||
| */ | ||
|
|
||
| /* An older version of the database is requested, quickchecks must be | ||
| disabled. */ | ||
| if (self && UCD_Check(self)) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes_onlyis a bizarre name for a parameter that causes the function to returnMAYBE.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not really a fan of this name; I'd be glad for a better one.
The idea is that this flag says the caller doesn't care about NO vs. MAYBE; it only cares whether the answer is YES.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe invert the sense and call it
want_definitemeaning "process the string until we're sure it's not normalized or we reach the end"?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The algorithm in the standard is the
yes_only = false(orwant_definite = true) case, so I think things are simplest to describe if that's the baseline and the other case is described by contrast with it. Could certainly handle that with some added words, though.Another aspect is that there is an asymmetry between a definite YES and a definite NO. In the
yes_only/!want_definitecase, what the caller really most wants is a definite YES... it's just a definite NO that is no more helpful than a MAYBE. Or another angle on that: if the caller really didn't care about getting a definite answer, it could just assume MAYBE and not bother calling the helper 😉 It's in the hope of getting a YES that it makes the call.More ideas just now: perhaps
maybe_is_no, to mean "a MAYBE is no worse than a NO"?Or
yes_or_maybe, for "there are only two distinct answers, YES and MAYBE". That's basically what I was going for withyes_only, but probably a less paradoxical way of putting it 🙂There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API also surprised me. I suggest tho rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I like the idea of covering over some of the multiplication of flags with wrapper functions!
I don't like quite these names, though. First, I don't think
is_normalizedis a good name, for either value of the flag, because the function doesn't actually take full responsibility for finding out if the string is normalized -- it can return MAYBE, and then it's up to the caller to go and normalize the string and compare.What the function does do is implement a quick but partial check to try to see if the string is normalized. Specifically it implements one that's defined in the standard, which the standard calls
quickCheckin its sample code, and which uses a property in the UCD calledQuick_Check.More precisely, the yes_only=false case implements that standard quick-check algorithm; the yes_only=true (or yes_or_maybe=true, etc.) case (which is the version in master) is a variation on it. I think the name "quickcheck" or
is_normalized_quickcheck, with no further qualifications, sounds like it will mean the standard's algorithm and it's confusing if it means a different algorithm instead.Here's another possible set of names to use with the same idea:
is_normalized_quickcheck_implis_normalized_quickcheckis the standard's quickcheck (yes_only/yes_or_maybe=false)is_normalized_quickercheckis the variant that's more impatient toreturnand hand responsibility back to the caller (yes_or_maybe=true)is_normalized_quickcheck_quickerOr if those names feel too long:
is_normalized_qc_impl,is_normalized_qc,is_normalized_qc_quicker.How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea for the flag name (though its name becomes less important if we use Victor's idea of a pair of wrapper functions):
yes_vs_maybe.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like anything is standing out as dramatically better. At least the semantics are clearly documented and easily read from the function.
So, let's go with the current state.