JSONSerialization: Improve parsing of numbers #1657

Merged
merged 3 commits into from
Nov 7, 2018

Conversation

spevans
Contributor

@spevans spevans commented Aug 8, 2018

  • Check the number looks like a JSON number and exit early if not.

  • Use the native Int64(), UInt64(), Double() parsers to avoid creating
    a C string and passing to strtol()/strtod(). This also eliminates a
    memcpy() and removes the 63 digit restriction which would fail to
    parse numbers expressible by Double's full exponent.

  • For numbers with a leading '-' sign, parse using Int64(), falling
    back to Double(); otherwise parse using UInt64(), falling back to
    Double().
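
As a rough illustration of the fallback order described in the list above (not the code from this PR), the following sketch tries the integer initializer first and falls back to Double(). The `ParsedNumber` enum and `parseNumber` function are hypothetical names introduced only for this example, and the sketch assumes the text has already been checked to look like a JSON number. Swift's Double(String) initializer also accepts inputs such as "inf", "nan" and hexadecimal floats, which is one reason an early validation step is useful.

```swift
// Hypothetical result type for this sketch; the PR itself returns NSNumber.
enum ParsedNumber: Equatable {
    case signed(Int64)
    case unsigned(UInt64)
    case floating(Double)
}

// Assumes `text` has already been validated as a JSON number.
func parseNumber(_ text: String) -> ParsedNumber? {
    if text.hasPrefix("-") {
        // Negative values: try Int64 first.
        if let value = Int64(text) { return .signed(value) }
    } else if let value = UInt64(text) {
        // Non-negative values: try UInt64 first.
        return .unsigned(value)
    }
    // Fractions, exponents, and integers outside the 64-bit range.
    return Double(text).map { .floating($0) }
}
```

With this ordering, UInt64.max written out in decimal stays an exact integer, while a negative integer just below Int64.min falls through to the Double path.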

@spevans
Contributor Author

spevans commented Aug 8, 2018

@swift-ci please test

Contributor

@itaiferber itaiferber left a comment

Looks like a reasonable change to me. Do we want to apply our efforts in also trying to parse Decimals here?

let MINUS = UInt8(ascii: "-")

var isNegative = false
var string = ""
Contributor

Should we try to preserve the UTF-8 behavior of walking a pointer along, rather than appending a single character at a time? (Given that the vast majority of JSON we receive is in UTF-8, it'd be nice to maintain the performance there.)

Contributor Author

I did think about this, but on x86_64/ARM64 strings of up to 15 ASCII characters should fit into the small-string optimisation (SSO) form and so avoid a memory allocation, which probably covers most numbers to be parsed.

The other issue is that strings passed to Int64(), UInt64() and Double() can't have any invalid trailing characters. Appending character by character avoids creating a string from all of the remaining bytes using String(bytesNoCopy:), which I believe would still be validated according to the encoding. That could end up reading through the whole of the rest of the JSON document and then scanning through it again to determine the new, shorter count.

As a performance enhancement, when validating the characters it's possible to count the digits, look for '.', 'e' or 'E', and jump directly to parsing as a Double().

Contributor

I was thinking more along the lines of String.init(bytesNoCopy:length:encoding:freeWhenDone:) which would allow us to avoid reading to the end of the document and wouldn't necessitate doing any further validation. Might be worth doing some small perf tests, just to see. (Or is this not available in s-cl-f?)

As for looking for [.eE] — we do just this on Darwin: as soon as we encounter one of those characters we avoid parsing as an integer unnecessarily, which does save some time in common situations.
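
A minimal sketch of the [.eE] check discussed here, assuming the bytes of the candidate number are already available as a sequence of UTF-8 code units. `looksLikeFloatingPoint` is a hypothetical helper name, not something taken from Darwin or from this PR.

```swift
// Hypothetical helper: returns true if the number's bytes contain '.', 'e'
// or 'E', in which case integer parsing can be skipped entirely.
func looksLikeFloatingPoint<S: Sequence>(_ bytes: S) -> Bool where S.Element == UInt8 {
    let dot = UInt8(ascii: ".")
    let lowerE = UInt8(ascii: "e")
    let upperE = UInt8(ascii: "E")
    return bytes.contains { $0 == dot || $0 == lowerE || $0 == upperE }
}
```

In a real parser this test would more likely be folded into the validation loop, so that the bytes are only walked once.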

Contributor Author

Unfortunately from https://github.com/apple/swift-corelibs-foundation/blob/a2b40951e8365da696d5105fd57a19c1f1c220ef/Foundation/NSString.swift#L1237:

public convenience init?(bytesNoCopy bytes: UnsafeMutableRawPointer, length len: Int, encoding: UInt, freeWhenDone freeBuffer: Bool) /* "NoCopy" is a hint */ {
        // just copy for now since the internal storage will be a copy anyhow
        self.init(bytes: bytes, length: len, encoding: encoding)
        if freeBuffer { // dont take the hint
            free(bytes)
        }
    }

So I don't think it's that useful at the moment. I will look into bypassing the integer parsing where possible.
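
For illustration only, one way to avoid reading to the end of the document is to find the end of the number first and then build a string from just that slice. The sketch below assumes the document is held as a `[UInt8]` array, and `numberText` is a hypothetical helper name. Note that `String(decoding:as:)` also copies the bytes, so this bounds the work done but does not remove the copy.

```swift
// Hypothetical helper: returns the text of the number starting at `start`
// and the index just past it. Only bytes that can appear in a JSON number
// are consumed, so the rest of the document is never touched.
func numberText(in bytes: [UInt8], from start: Int) -> (text: String, end: Int) {
    // Built on every call for brevity; a real parser would hoist this.
    let allowed = Set("+-.0123456789eE".utf8)
    var end = start
    while end < bytes.count, allowed.contains(bytes[end]) {
        end += 1
    }
    // Decodes (and copies) only the bounded slice.
    return (String(decoding: bytes[start..<end], as: UTF8.self), end)
}
```

The returned text is only a candidate. It still has to be validated and parsed, since a run such as "1-2" contains only allowed bytes but is not a number.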

@spevans
Contributor Author

spevans commented Aug 10, 2018

For Decimal do you mean with a change like:

                 return (NSNumber(value: uintValue), index)
             }
         }
+        let decimalNumber = NSDecimalNumber(string: string)
+        if decimalNumber.isFinite {
+            return (decimalNumber, index)
+        }
         if let doubleValue = Double(string) {
             return (NSNumber(value: doubleValue), index)
         }

?

I think #1653 is needed before Decimal works properly but I can always do it as a follow-up

@itaiferber
Contributor

As for Decimal, I was thinking about the heuristic I mentioned briefly in #1655. The string representation of a Double can have at most DBL_DECIMAL_DIG digits of precision (17 for an IEEE Double) before you start to lose precision. This means that if a string is longer than:

  • 1 (if it has a sign) +
  • 1 (if it has a decimal point and starts with a leading 0 like 0.xxxxxxxxxx) +
  • 1 (if it has a decimal point) +
  • DBL_DECIMAL_DIG +
  • E (if it has an exponent of length E, max 5 digits for e±308)

then you will lose precision on the parse. For instance,

  • Strings in the form "xxxxxxxxxx": max length 17
  • Strings in the form "±xxxxxxxxxx": max length 18
  • Strings in the form "xxxxx.yyyyy": max length 18
  • Strings in the form "xxxxx.yyyyye±zzz": max length 23
  • etc.

If the string is longer than this (we're losing precision) but parsing succeeds (it's a valid Double) and the magnitude would fit in a Decimal, then it might be worth parsing as a Decimal to avoid losing that precision. (This is what we do on Darwin, at least.)
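
The length part of this heuristic could be sketched as below. This is only an interpretation of the rules listed above, not the Darwin implementation. `maxLosslessLength` and `mayLosePrecisionAsDouble` are hypothetical names, 17 is hard-coded for DBL_DECIMAL_DIG, and the exponent length is taken to include the 'e' and its sign, matching the "max length 23" example.

```swift
// DBL_DECIMAL_DIG for an IEEE 754 double; hard-coded for this sketch.
let maxSignificantDigits = 17

// Hypothetical helper: the longest string of this shape that can still
// round-trip through a Double, following the rules in the comment above.
func maxLosslessLength(of text: String) -> Int {
    var limit = maxSignificantDigits
    if text.hasPrefix("-") || text.hasPrefix("+") { limit += 1 }  // sign
    let unsignedText = text.drop(while: { $0 == "-" || $0 == "+" })
    if text.contains(where: { $0 == "." }) {
        limit += 1                                        // decimal point
        if unsignedText.hasPrefix("0.") { limit += 1 }    // leading zero
    }
    if let eIndex = text.firstIndex(where: { $0 == "e" || $0 == "E" }) {
        // Exponent part, counting the 'e', its sign, and its digits.
        limit += text.distance(from: eIndex, to: text.endIndex)
    }
    return limit
}

// Covers only the length check; the full heuristic also requires that the
// Double parse succeeds and that the magnitude fits in a Decimal.
func mayLosePrecisionAsDouble(_ text: String) -> Bool {
    return text.count > maxLosslessLength(of: text)
}
```

The check is deliberately conservative: a long string with many trailing zeros may be flagged even though a Double could represent it exactly.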

- Fully validate that the number conforms to the JSON number
  specification.

- Determine if the number should be parsed as a UInt64, Int64 or
  Decimal before falling back to Decimal.
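
For reference, a validation pass along the lines of the first point might look like the sketch below. It follows the number grammar in RFC 8259 (optional minus, integer part with no leading zeros, optional fraction, optional exponent) and is not the code from this commit. `isValidJSONNumber` is a hypothetical name, and it works on a whole string rather than on the parser's byte buffer.

```swift
// Hypothetical validator for the RFC 8259 grammar:
//   number = [ "-" ] int [ "." 1*DIGIT ] [ ("e" / "E") [ "+" / "-" ] 1*DIGIT ]
//   int    = "0" / ( digit1-9 *DIGIT )
func isValidJSONNumber(_ text: String) -> Bool {
    let bytes = Array(text.utf8)
    var index = 0

    func peek() -> UInt8? { return index < bytes.count ? bytes[index] : nil }
    func isDigit(_ byte: UInt8?) -> Bool {
        guard let byte = byte else { return false }
        return byte >= UInt8(ascii: "0") && byte <= UInt8(ascii: "9")
    }
    // Consumes a run of digits and returns how many were consumed.
    func skipDigits() -> Int {
        let start = index
        while isDigit(peek()) { index += 1 }
        return index - start
    }

    // Optional leading minus; a leading plus is not allowed in JSON.
    if peek() == UInt8(ascii: "-") { index += 1 }

    // Integer part: a single "0", or a non-empty run starting with 1-9.
    if peek() == UInt8(ascii: "0") {
        index += 1
    } else if skipDigits() == 0 {
        return false
    }

    // Optional fraction: "." followed by at least one digit.
    if peek() == UInt8(ascii: ".") {
        index += 1
        if skipDigits() == 0 { return false }
    }

    // Optional exponent: "e" or "E", optional sign, at least one digit.
    if peek() == UInt8(ascii: "e") || peek() == UInt8(ascii: "E") {
        index += 1
        if peek() == UInt8(ascii: "+") || peek() == UInt8(ascii: "-") { index += 1 }
        if skipDigits() == 0 { return false }
    }

    // Anything left over (for example the "1" in "01") is an error.
    return index == bytes.count
}
```

Inputs such as "01", "1.", ".5", "+1" and "0x10" are rejected here even though some of them would be accepted by a more permissive number parser.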
@spevans
Contributor Author

spevans commented Aug 13, 2018

@swift-ci please test

@spevans
Contributor Author

spevans commented Aug 14, 2018

@swift-ci please test

@spevans
Contributor Author

spevans commented Aug 14, 2018

@swift-ci please test

@spevans
Contributor Author

spevans commented Aug 14, 2018

@swift-ci please test

@spevans
Contributor Author

spevans commented Aug 15, 2018

@swift-ci please test

@millenomi
Contributor

@swift-ci please test and merge
