From 542a14b987a207829f43a2d52b4302aa90ae7cdd Mon Sep 17 00:00:00 2001
From: Paris
Date: Wed, 17 Jul 2024 07:39:11 -0700
Subject: [PATCH 01/22] Delete CONTRIBUTING.md

We are going to use a different contributing template going forward after migration
---
 CONTRIBUTING.md | 11 -----------
 1 file changed, 11 deletions(-)
 delete mode 100644 CONTRIBUTING.md

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
deleted file mode 100644
index b4612da5f..000000000
--- a/CONTRIBUTING.md
+++ /dev/null
@@ -1,11 +0,0 @@
-By submitting a pull request, you represent that you have the right to license
-your contribution to Apple and the community, and agree by submitting the patch
-that your contributions are licensed under the [Swift
-license](https://swift.org/LICENSE.txt).
-
----
-
-Before submitting the pull request, please make sure you have tested your
-changes and that they follow the Swift project [guidelines for contributing
-code](https://swift.org/contributing/#contributing-code).
-

From 07110eb593b519c0e2ef2d9049427a6e9819fcc7 Mon Sep 17 00:00:00 2001
From: Paris
Date: Wed, 17 Jul 2024 07:40:24 -0700
Subject: [PATCH 02/22] Delete CODE_OF_CONDUCT.md

After this repo migrates to /swiftlang today, it will opt in to the org-wide code of conduct that is present in the root .github config folder. Having this file present means that the repo opts out of the org-wide code of conduct and has its own process, which will not be true.
---
 CODE_OF_CONDUCT.md | 38 --------------------------------------
 1 file changed, 38 deletions(-)
 delete mode 100644 CODE_OF_CONDUCT.md

diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
deleted file mode 100644
index 95c80e78b..000000000
--- a/CODE_OF_CONDUCT.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# Code of Conduct
-
-To be a truly great community, Swift.org needs to welcome developers from all walks of life, with different backgrounds, and with a wide range of experience. A diverse and friendly community will have more great ideas, more unique perspectives, and produce more great code. We will work diligently to make the Swift community welcoming to everyone.
-
-To give clarity of what is expected of our members, this code of conduct is based on [contributor-covenant.org](http://contributor-covenant.org). This document is used across many open source communities, and we think it articulates our values well.
-
-### Contributor Code of Conduct v1.4
-
-In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
-
-Examples of behavior that contributes to creating a positive environment include:
-
-* Using welcoming and inclusive language (e.g., prefer non-gendered words like “folks” to “guys”, non-ableist words like “soundness check” to “sanity check”, etc.)
-* Being respectful of differing viewpoints and experiences
-* Gracefully accepting constructive criticism
-* Focusing on what is best for the community
-* Showing empathy towards other community members
-
-Examples of unacceptable behavior by participants include:
-
-* The use of sexualized language or imagery and unwelcome sexual attention or advances
-* Trolling, insulting/derogatory comments, and personal or political attacks
-* Public or private harassment
-* Publishing others’ private information, such as a physical or electronic address, without explicit permission
-* Other conduct which could reasonably be considered inappropriate in a professional setting
-
-Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
-
-Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
-
-This Code of Conduct applies within all project spaces managed by Swift.org, including (but not limited to) source code repositories, bug trackers, web sites, documentation, and online forums. It also applies when an individual is representing the project or its community in public spaces. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
-
-Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting a member of the [Swift Core Team](https://swift.org/community/#community-structure) or by flagging the behavior for moderation (e.g., in the Forums), whether you are the target of that behavior or not. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. Maintainers are obligated to maintain confidentiality with regard to the reporter of an incident. The site of the disputed behavior is usually not an acceptable place to discuss moderation decisions, and moderators may move or remove any such discussion.
-
-Project maintainers are held to a higher standard, and project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.
-If you disagree with a moderation action, you can appeal to the Core Team (or individual Core Team members) privately.
-
-This policy is adapted from the Contributor Code of Conduct [version 1.4](https://www.contributor-covenant.org/version/1/4/code-of-conduct/).

From 7162118e78d2ec19ca86f2b098c88d90665d399d Mon Sep 17 00:00:00 2001
From: Allan Shortlidge
Date: Wed, 24 Jul 2024 13:32:54 -0700
Subject: [PATCH 03/22] Migrate to internal imports.

As of the Swift 6 compiler, `@_implementationOnly import` is deprecated in favor of `internal import` and as a result the use of `@_implementationOnly import` in this project is generating a lot of diagnostic noise when building the Swift standard library. For Swift libraries with library evolution, `@_implementationOnly import` and `internal import` are roughly functionally equivalent, aside from improved diagnostics for `internal import`.
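
To make the migration concrete, here is a minimal sketch of the kind of change this commit describes. It assumes a Swift 6 or newer compiler, where access levels on import declarations are accepted. `Foundation` stands in for `_RegexParser`, and `trimmed(_:)` is a hypothetical function that is not part of this project.

```swift
// Before (deprecated as of the Swift 6 compiler):
// @_implementationOnly import Foundation

// After: the import is visible only inside this module's sources.
// Foundation is a stand-in for `_RegexParser` (assumption for illustration).
internal import Foundation

/// Hypothetical public API: the signature mentions only standard library
/// types, so the internally imported module never appears in it.
public func trimmed(_ text: String) -> String {
  // Foundation is used only in the implementation.
  text.trimmingCharacters(in: .whitespaces)
}

print(trimmed("  regex  "))
```

If a public declaration exposed a type from the internally imported module in its signature, the compiler would be expected to reject it, which is likely the kind of improved diagnostic referred to above.
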
For non-resilient libraries, the main difference is that `internal import` does not actually hide a module dependency from downstream clients because the layout of a type in a non-resilient library may depend on types coming from an `internal import` (with `@_implementationOnly import` the same situation would result in a silent mis-compile, which is the reason that `@_implementationOnly import` is deprecated). The `_RegexParser` module dependency does not need to be hidden from clients since it is installed in standard locations in the SDK/toolchain. Therefore this migration should be safe, regardless of library resilience mode.

From 82711ec4583fe96ff2a9f3ad9992fec48fb18493 Mon Sep 17 00:00:00 2001
From: Finagolfin
Date: Tue, 6 Aug 2024 20:41:20 +0530
Subject: [PATCH 04/22] Use Bionic module from new Android overlay in Swift 6 instead

The new module and overlay were merged into Swift 6 in swiftlang/swift#74758.
---
 Sources/VariadicsGenerator/VariadicsGenerator.swift | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Sources/VariadicsGenerator/VariadicsGenerator.swift b/Sources/VariadicsGenerator/VariadicsGenerator.swift
index 91f1682dd..b1953bcd4 100644
--- a/Sources/VariadicsGenerator/VariadicsGenerator.swift
+++ b/Sources/VariadicsGenerator/VariadicsGenerator.swift
@@ -18,6 +18,8 @@ import Darwin
 import Glibc
 #elseif os(Windows)
 import CRT
+#elseif canImport(Bionic)
+import Bionic
 #endif
 
 // (T), (T)

From e8951230046dbd301c68ac443a6f33a7f59a3a64 Mon Sep 17 00:00:00 2001
From: Alex Martini
Date: Wed, 18 Sep 2024 11:29:44 -0700
Subject: [PATCH 05/22] Cross reference TSPL

---
 Sources/_StringProcessing/Regex/Core.swift | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Sources/_StringProcessing/Regex/Core.swift b/Sources/_StringProcessing/Regex/Core.swift
index 16c5c6601..11445531c 100644
--- a/Sources/_StringProcessing/Regex/Core.swift
+++ b/Sources/_StringProcessing/Regex/Core.swift
@@ -76,6 +76,16 @@ public protocol RegexComponent {
 /// instances using
a clear and flexible declarative syntax. Using this /// style, you can combine, capture, and transform regexes, `RegexBuilder` /// types, and custom parsers. +/// +/// > Note: +/// > Prior to Swift 6, +/// > you might need to write `#/myregex/#` instead of `/myregex/` +/// > when you make a regular expression using a literal. +/// > For more information, +/// > see [Regular Expression Literals][regex-literal] in *[The Swift Programming Language][tspl]*. +/// +/// [regex-literal]: https://docs.swift.org/swift-book/documentation/the-swift-programming-language/lexicalstructure/#Regular-Expression-Literals +/// [tspl]: https://docs.swift.org/swift-book/ @available(SwiftStdlib 5.7, *) public struct Regex: RegexComponent { let program: Program From 8bf6e259b1612164b9250a287531e7dc3219bd05 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Wed, 9 Oct 2024 11:32:35 -0600 Subject: [PATCH 06/22] Fix bug in word boundary caching (#769) --- .../_StringProcessing/Unicode/WordBreaking.swift | 10 +++++----- Tests/RegexTests/MatchTests.swift | 13 ++++++++++++- 2 files changed, 17 insertions(+), 6 deletions(-) diff --git a/Sources/_StringProcessing/Unicode/WordBreaking.swift b/Sources/_StringProcessing/Unicode/WordBreaking.swift index a1f2f13a6..8db40efac 100644 --- a/Sources/_StringProcessing/Unicode/WordBreaking.swift +++ b/Sources/_StringProcessing/Unicode/WordBreaking.swift @@ -81,18 +81,18 @@ extension String { } if #available(SwiftStdlib 5.7, *) { - var indices: Set = [] + if cache == nil { + cache = [] + } var j = maxIndex ?? 
range.lowerBound while j < range.upperBound, j <= i { - indices.insert(j) + cache!.insert(j) j = _wordIndex(after: j) } - cache = indices maxIndex = j - - return indices.contains(i) + return cache!.contains(i) } else { return false } diff --git a/Tests/RegexTests/MatchTests.swift b/Tests/RegexTests/MatchTests.swift index e910ac318..1b5c67e0d 100644 --- a/Tests/RegexTests/MatchTests.swift +++ b/Tests/RegexTests/MatchTests.swift @@ -2381,7 +2381,18 @@ extension RegexTests { XCTAssertTrue("cafe".contains(caseInsensitiveRegex)) XCTAssertTrue("CaFe".contains(caseInsensitiveRegex)) } - + + // https://github.com/swiftlang/swift-experimental-string-processing/issues/768 + func testWordBoundaryCaching() throws { + // This will first find word boundaries up til the middle before failing, + // then it will find word boundaries til late in the string, then fail, + // and finally should succeed on a word boundary cached from the first + // attempt. + let input = "first second third fourth" + let regex = try Regex(#".*second\bX|.*third\bX|.*first\b"#) + XCTAssertTrue(input.contains(regex)) + } + // MARK: Character Semantics var eComposed: String { "é" } From d7ae010aa03b0dbb25ef0974310b1143888d2365 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Wed, 9 Oct 2024 16:16:51 -0600 Subject: [PATCH 07/22] Cleanup algorithms code (#771) Cleanup algorithms code Delete lots of dead code, get off of collections and onto iterators, simplify implementations. 
--------- Co-authored-by: Nate Cook --- .../Algorithms/Algorithms/Ranges.swift | 134 ++------ .../Algorithms/Algorithms/Replace.swift | 12 +- .../Algorithms/Algorithms/Split.swift | 66 ++-- .../Algorithms/Algorithms/Trim.swift | 8 +- .../Consumers/CollectionConsumer.swift | 25 -- .../Consumers/FixedPatternConsumer.swift | 28 -- .../Algorithms/Consumers/ManyConsumer.swift | 47 --- .../Consumers/PredicateConsumer.swift | 74 ----- .../Algorithms/Matching/FirstMatch.swift | 24 -- .../Algorithms/Matching/MatchReplace.swift | 63 ---- .../Algorithms/Matching/MatchResult.swift | 9 - .../Algorithms/Matching/Matches.swift | 288 ++---------------- .../Matching/MatchingCollectionConsumer.swift | 48 --- .../Matching/MatchingCollectionSearcher.swift | 104 ------- .../Searchers/CollectionSearcher.swift | 54 ---- .../Searchers/ConsumerSearcher.swift | 109 ------- .../Searchers/NaivePatternSearcher.swift | 93 ------ .../Algorithms/Searchers/PatternOrEmpty.swift | 65 ---- .../Searchers/PredicateSearcher.swift | 44 --- Sources/_StringProcessing/CMakeLists.txt | 7 - .../RegexTests/AlgorithmsInternalsTests.swift | 15 +- 21 files changed, 88 insertions(+), 1229 deletions(-) delete mode 100644 Sources/_StringProcessing/Algorithms/Consumers/ManyConsumer.swift delete mode 100644 Sources/_StringProcessing/Algorithms/Consumers/PredicateConsumer.swift delete mode 100644 Sources/_StringProcessing/Algorithms/Matching/MatchingCollectionConsumer.swift delete mode 100644 Sources/_StringProcessing/Algorithms/Searchers/ConsumerSearcher.swift delete mode 100644 Sources/_StringProcessing/Algorithms/Searchers/NaivePatternSearcher.swift delete mode 100644 Sources/_StringProcessing/Algorithms/Searchers/PatternOrEmpty.swift delete mode 100644 Sources/_StringProcessing/Algorithms/Searchers/PredicateSearcher.swift diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift b/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift index 3f9b8d49a..a82fb875c 100644 --- 
a/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift @@ -11,107 +11,33 @@ // MARK: `RangesCollection` -struct RangesCollection { - public typealias Base = Searcher.Searched - - let base: Base - let searcher: Searcher - private(set) public var startIndex: Index - - init(base: Base, searcher: Searcher) { - self.base = base - self.searcher = searcher - - var state = searcher.state(for: base, in: base.startIndex..: IteratorProtocol { - public typealias Base = Searcher.Searched - - let base: Base +struct RangesSequence { + let input: Searcher.Searched let searcher: Searcher - var state: Searcher.State - init(base: Base, searcher: Searcher) { - self.base = base + init(input: Searcher.Searched, searcher: Searcher) { + self.input = input self.searcher = searcher - self.state = searcher.state(for: base, in: base.startIndex.. Range? { - searcher.search(base, &state) - } -} - -extension RangesCollection: Sequence { - public func makeIterator() -> RangesIterator { - Iterator(base: base, searcher: searcher) - } -} - -extension RangesCollection: Collection { - // TODO: Custom `SubSequence` for the sake of more efficient slice iteration - - public struct Index { - var range: Range? + struct Iterator: IteratorProtocol { + let base: RangesSequence var state: Searcher.State - } - public var endIndex: Index { - // TODO: Avoid calling `state(for:startingAt)` here - Index( - range: nil, - state: searcher.state(for: base, in: base.startIndex.. Index { - var index = index - formIndex(after: &index) - return index - } - - public subscript(index: Index) -> Range { - guard let range = index.range else { - fatalError("Cannot subscript using endIndex") + init(_ base: RangesSequence) { + self.base = base + self.state = base.searcher.state(for: base.input, in: base.input.startIndex.. 
Bool { - switch (lhs.range, rhs.range) { - case (nil, nil): - return true - case (nil, _?), (_?, nil): - return false - case (let lhs?, let rhs?): - return lhs.lowerBound == rhs.lowerBound + mutating func next() -> Range? { + base.searcher.search(base.input, &state) } } +} - static func < (lhs: Self, rhs: Self) -> Bool { - switch (lhs.range, rhs.range) { - case (nil, _): - return false - case (_, nil): - return true - case (let lhs?, let rhs?): - return lhs.lowerBound < rhs.lowerBound - } +extension RangesSequence: Sequence { + func makeIterator() -> Iterator { + Iterator(self) } } @@ -122,8 +48,8 @@ extension RangesCollection.Index: Comparable { extension Collection { func _ranges( of searcher: S - ) -> RangesCollection where S.Searched == Self { - RangesCollection(base: self, searcher: searcher) + ) -> RangesSequence where S.Searched == Self { + RangesSequence(input: self, searcher: searcher) } } @@ -132,7 +58,7 @@ extension Collection { extension Collection where Element: Equatable { func _ranges( of other: C - ) -> RangesCollection> where C.Element == Element { + ) -> RangesSequence> where C.Element == Element { _ranges(of: ZSearcher(pattern: Array(other), by: ==)) } @@ -163,8 +89,8 @@ extension Collection where Element: Equatable { } @available(SwiftStdlib 5.7, *) -struct RegexRangesCollection { - let base: RegexMatchesCollection +struct RegexRangesSequence { + let base: RegexMatchesSequence init( input: String, @@ -181,9 +107,9 @@ struct RegexRangesCollection { } @available(SwiftStdlib 5.7, *) -extension RegexRangesCollection: Sequence { +extension RegexRangesSequence: Sequence { struct Iterator: IteratorProtocol { - var matchesBase: RegexMatchesCollection.Iterator + var matchesBase: RegexMatchesSequence.Iterator mutating func next() -> Range? 
{ matchesBase.next().map(\.range) @@ -195,16 +121,6 @@ extension RegexRangesCollection: Sequence { } } -@available(SwiftStdlib 5.7, *) -extension RegexRangesCollection: Collection { - typealias Index = RegexMatchesCollection.Index - - var startIndex: Index { base.startIndex } - var endIndex: Index { base.endIndex } - func index(after i: Index) -> Index { base.index(after: i) } - subscript(position: Index) -> Range { base[position].range } -} - // MARK: Regex algorithms extension Collection where SubSequence == Substring { @@ -214,8 +130,8 @@ extension Collection where SubSequence == Substring { of regex: R, subjectBounds: Range, searchBounds: Range - ) -> RegexRangesCollection { - RegexRangesCollection( + ) -> RegexRangesSequence { + RegexRangesSequence( input: self[...].base, subjectBounds: subjectBounds, searchBounds: searchBounds, @@ -226,7 +142,7 @@ extension Collection where SubSequence == Substring { @_disfavoredOverload func _ranges( of regex: R - ) -> RegexRangesCollection { + ) -> RegexRangesSequence { _ranges( of: regex, subjectBounds: startIndex..( + func _replacing( _ ranges: Ranges, with replacement: Replacement, maxReplacements: Int = .max @@ -49,13 +49,15 @@ extension RangeReplaceableCollection { var result = Self() var index = startIndex - - // `maxRanges` is a workaround for https://github.com/apple/swift/issues/59522 - let maxRanges = ranges.prefix(maxReplacements) - for range in maxRanges { + var replacements = 0 + + for range in ranges { + if replacements == maxReplacements { break } + result.append(contentsOf: self[index.. 
{ - public typealias Base = Searcher.Searched +struct SplitSequence { + typealias Input = Searcher.Searched - let ranges: RangesCollection + let ranges: RangesSequence var maxSplits: Int var omittingEmptySubsequences: Bool init( - ranges: RangesCollection, + ranges: RangesSequence, maxSplits: Int, omittingEmptySubsequences: Bool) { @@ -29,53 +29,53 @@ struct SplitCollection { } init( - base: Base, + input: Input, searcher: Searcher, maxSplits: Int, omittingEmptySubsequences: Bool) { - self.ranges = base._ranges(of: searcher) + self.ranges = input._ranges(of: searcher) self.maxSplits = maxSplits self.omittingEmptySubsequences = omittingEmptySubsequences } } -extension SplitCollection: Sequence { - public struct Iterator: IteratorProtocol { - let base: Base - var index: Base.Index - var ranges: RangesCollection.Iterator - var maxSplits: Int - var omittingEmptySubsequences: Bool +extension SplitSequence: Sequence { + struct Iterator: IteratorProtocol { + var ranges: RangesSequence.Iterator + var index: Input.Index + var maxSplits: Int var splitCounter = 0 + var omittingEmptySubsequences: Bool var isDone = false + var input: Input { ranges.base.input } + init( - ranges: RangesCollection, + ranges: RangesSequence, maxSplits: Int, omittingEmptySubsequences: Bool ) { - self.base = ranges.base - self.index = base.startIndex + self.index = ranges.input.startIndex self.ranges = ranges.makeIterator() self.maxSplits = maxSplits self.omittingEmptySubsequences = omittingEmptySubsequences } - public mutating func next() -> Base.SubSequence? { + mutating func next() -> Input.SubSequence? { guard !isDone else { return nil } /// Return the rest of base if it's non-empty or we're including /// empty subsequences. - func finish() -> Base.SubSequence? { + func finish() -> Input.SubSequence? { isDone = true - return index == base.endIndex && omittingEmptySubsequences + return index == input.endIndex && omittingEmptySubsequences ? nil - : base[index...] + : input[index...] 
} - if index == base.endIndex { + if index == input.endIndex { return finish() } @@ -96,12 +96,12 @@ extension SplitCollection: Sequence { } splitCounter += 1 - return base[index.. Iterator { + func makeIterator() -> Iterator { Iterator(ranges: ranges, maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) } } @@ -109,13 +109,13 @@ extension SplitCollection: Sequence { // MARK: `CollectionSearcher` algorithms extension Collection { - func split( + func _split( by separator: Searcher, maxSplits: Int, omittingEmptySubsequences: Bool - ) -> SplitCollection where Searcher.Searched == Self { - SplitCollection( - base: self, + ) -> SplitSequence where Searcher.Searched == Self { + SplitSequence( + input: self, searcher: separator, maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) @@ -126,12 +126,12 @@ extension Collection { extension Collection where Element: Equatable { @_disfavoredOverload - func split( + func _split( by separator: C, maxSplits: Int, omittingEmptySubsequences: Bool - ) -> SplitCollection> where C.Element == Element { - split(by: ZSearcher(pattern: Array(separator), by: ==), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) + ) -> SplitSequence> where C.Element == Element { + _split(by: ZSearcher(pattern: Array(separator), by: ==), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) } // FIXME: Return `some Collection` for SE-0346 @@ -159,7 +159,7 @@ extension Collection where Element: Equatable { return str._split(separator: sep, maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences) as! 
[SubSequence] default: - return Array(split( + return Array(_split( by: ZSearcher(pattern: Array(separator), by: ==), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences)) @@ -186,7 +186,7 @@ extension StringProtocol where SubSequence == Substring { maxSplits: Int = .max, omittingEmptySubsequences: Bool = true ) -> [Substring] { - Array(self[...].split( + Array(self[...]._split( by: SubstringSearcher(text: "" as Substring, pattern: separator[...]), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences)) @@ -199,7 +199,7 @@ extension StringProtocol where SubSequence == Substring { maxSplits: Int = .max, omittingEmptySubsequences: Bool = true ) -> [Substring] { - Array(self[...].split( + Array(self[...]._split( by: SubstringSearcher(text: "" as Substring, pattern: separator[...]), maxSplits: maxSplits, omittingEmptySubsequences: omittingEmptySubsequences)) diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/Trim.swift b/Sources/_StringProcessing/Algorithms/Algorithms/Trim.swift index e870e1493..ff385856c 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/Trim.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/Trim.swift @@ -44,7 +44,7 @@ extension RangeReplaceableCollection { // MARK: Predicate algorithms extension Collection { - fileprivate func endOfPrefix(while predicate: (Element) throws -> Bool) rethrows -> Index { + fileprivate func _endOfPrefix(while predicate: (Element) throws -> Bool) rethrows -> Index { try firstIndex(where: { try !predicate($0) }) ?? endIndex } @@ -52,7 +52,7 @@ extension Collection { public func trimmingPrefix( while predicate: (Element) throws -> Bool ) rethrows -> SubSequence { - let end = try endOfPrefix(while: predicate) + let end = try _endOfPrefix(while: predicate) return self[end...] 
} } @@ -62,7 +62,7 @@ extension Collection where SubSequence == Self { public mutating func trimPrefix( while predicate: (Element) throws -> Bool ) throws { - let end = try endOfPrefix(while: predicate) + let end = try _endOfPrefix(while: predicate) self = self[end...] } } @@ -73,7 +73,7 @@ extension RangeReplaceableCollection { public mutating func trimPrefix( while predicate: (Element) throws -> Bool ) rethrows { - let end = try endOfPrefix(while: predicate) + let end = try _endOfPrefix(while: predicate) removeSubrange(startIndex.. - ) -> Consumed.Index? -} - -extension BidirectionalCollectionConsumer { - func consumingBack(_ consumed: Consumed) -> Consumed.Index? { - consumingBack(consumed, in: consumed.startIndex.. Bool - where Consumed.SubSequence == Consumed - { - guard let index = consumingBack(consumed) else { return false } - consumed = consumed[.. - ) -> Consumed.Index? { - var index = range.upperBound - var patternIndex = pattern.endIndex - - while true { - if patternIndex == pattern.startIndex { - return index - } - - if index == range.lowerBound { - return nil - } - - consumed.formIndex(before: &index) - pattern.formIndex(before: &patternIndex) - - if consumed[index] != pattern[patternIndex] { - return nil - } - } - } -} diff --git a/Sources/_StringProcessing/Algorithms/Consumers/ManyConsumer.swift b/Sources/_StringProcessing/Algorithms/Consumers/ManyConsumer.swift deleted file mode 100644 index 10d9fd5c3..000000000 --- a/Sources/_StringProcessing/Algorithms/Consumers/ManyConsumer.swift +++ /dev/null @@ -1,47 +0,0 @@ -//===----------------------------------------------------------------------===// -// -// This source file is part of the Swift.org open source project -// -// Copyright (c) 2021-2022 Apple Inc. 
and the Swift project authors -// Licensed under Apache License v2.0 with Runtime Library Exception -// -// See https://swift.org/LICENSE.txt for license information -// -//===----------------------------------------------------------------------===// - -struct ManyConsumer { - let base: Base -} - -extension ManyConsumer: CollectionConsumer { - typealias Consumed = Base.Consumed - - func consuming( - _ consumed: Base.Consumed, - in range: Range - ) -> Base.Consumed.Index? { - var result = range.lowerBound - while let index = base.consuming(consumed, in: result.. - ) -> Base.Consumed.Index? { - var result = range.upperBound - while let index = base.consumingBack( - consumed, - in: range.lowerBound.. { - let predicate: (Consumed.Element) -> Bool -} - -extension PredicateConsumer: CollectionConsumer { - public func consuming( - _ consumed: Consumed, - in range: Range - ) -> Consumed.Index? { - let start = range.lowerBound - guard start != range.upperBound && predicate(consumed[start]) else { - return nil - } - return consumed.index(after: start) - } -} - -extension PredicateConsumer: BidirectionalCollectionConsumer - where Consumed: BidirectionalCollection -{ - func consumingBack( - _ consumed: Consumed, - in range: Range - ) -> Consumed.Index? { - let end = range.upperBound - guard end != range.lowerBound else { return nil } - let previous = consumed.index(before: end) - return predicate(consumed[previous]) ? previous : nil - } -} - -extension PredicateConsumer: StatelessCollectionSearcher { - public typealias Searched = Consumed - - public func search( - _ searched: Searched, - in range: Range - ) -> Range? { - // TODO: Make this reusable - guard let index = searched[range].firstIndex(where: predicate) else { - return nil - } - return index.. - ) -> Range? { - // TODO: Make this reusable - guard let index = searched[range].lastIndex(where: predicate) else { - return nil - } - return index..( - of searcher: S - ) -> _MatchResult? 
where S.Searched == Self { - var state = searcher.state(for: self, in: startIndex..( - of searcher: S - ) -> _BackwardMatchResult? - where S.BackwardSearched == Self - { - var state = searcher.backwardState(for: self, in: startIndex..( - _ searcher: Searcher, - with replacement: (_MatchResult) throws -> Replacement, - subrange: Range, - maxReplacements: Int = .max - ) rethrows -> Self where Searcher.Searched == SubSequence, - Replacement.Element == Element - { - precondition(maxReplacements >= 0) - - var index = subrange.lowerBound - var result = Self() - result.append(contentsOf: self[..( - _ searcher: Searcher, - with replacement: (_MatchResult) throws -> Replacement, - maxReplacements: Int = .max - ) rethrows -> Self where Searcher.Searched == SubSequence, - Replacement.Element == Element - { - try _replacing( - searcher, - with: replacement, - subrange: startIndex..( - _ searcher: Searcher, - with replacement: (_MatchResult) throws -> Replacement, - maxReplacements: Int = .max - ) rethrows where Searcher.Searched == SubSequence, - Replacement.Element == Element - { - self = try _replacing( - searcher, - with: replacement, - maxReplacements: maxReplacements) - } -} - // MARK: Regex algorithms extension RangeReplaceableCollection where SubSequence == Substring { diff --git a/Sources/_StringProcessing/Algorithms/Matching/MatchResult.swift b/Sources/_StringProcessing/Algorithms/Matching/MatchResult.swift index 94e6d8c3b..7d8157045 100644 --- a/Sources/_StringProcessing/Algorithms/Matching/MatchResult.swift +++ b/Sources/_StringProcessing/Algorithms/Matching/MatchResult.swift @@ -17,12 +17,3 @@ struct _MatchResult { match.startIndex.. { - let match: S.BackwardSearched.SubSequence - let result: S.Match - - var range: Range { - match.startIndex.. 
{ - public typealias Base = Searcher.Searched - - let base: Base - let searcher: Searcher - private(set) public var startIndex: Index - - init(base: Base, searcher: Searcher) { - self.base = base - self.searcher = searcher - - var state = searcher.state(for: base, in: base.startIndex..: IteratorProtocol { - public typealias Base = Searcher.Searched - - let base: Base - let searcher: Searcher - var state: Searcher.State - - init(base: Base, searcher: Searcher) { - self.base = base - self.searcher = searcher - self.state = searcher.state(for: base, in: base.startIndex.. _MatchResult? { - searcher.matchingSearch(base, &state).map { range, result in - _MatchResult(match: base[range], result: result) - } - } -} - -extension MatchesCollection: Sequence { - public func makeIterator() -> MatchesIterator { - Iterator(base: base, searcher: searcher) - } -} - -extension MatchesCollection: Collection { - // TODO: Custom `SubSequence` for the sake of more efficient slice iteration - - struct Index { - var match: (range: Range, match: Searcher.Match)? - var state: Searcher.State - } - - public var endIndex: Index { - // TODO: Avoid calling `state(for:startingAt)` here - Index( - match: nil, - state: searcher.state(for: base, in: base.startIndex.. 
Index { - var index = index - formIndex(after: &index) - return index - } - - public subscript(index: Index) -> _MatchResult { - guard let (range, result) = index.match else { - fatalError("Cannot subscript using endIndex") - } - return _MatchResult(match: base[range], result: result) - } -} - -extension MatchesCollection.Index: Comparable { - public static func == (lhs: Self, rhs: Self) -> Bool { - switch (lhs.match?.range, rhs.match?.range) { - case (nil, nil): - return true - case (nil, _?), (_?, nil): - return false - case (let lhs?, let rhs?): - return lhs.lowerBound == rhs.lowerBound - } - } - - public static func < (lhs: Self, rhs: Self) -> Bool { - switch (lhs.match?.range, rhs.match?.range) { - case (nil, _): - return false - case (_, nil): - return true - case (let lhs?, let rhs?): - return lhs.lowerBound < rhs.lowerBound - } - } -} - -// MARK: `ReversedMatchesCollection` -// TODO: reversed matches - -struct ReversedMatchesCollection< - Searcher: BackwardMatchingCollectionSearcher -> { - public typealias Base = Searcher.BackwardSearched - - let base: Base - let searcher: Searcher - - init(base: Base, searcher: Searcher) { - self.base = base - self.searcher = searcher - } -} - -extension ReversedMatchesCollection: Sequence { - struct Iterator: IteratorProtocol { - let base: Base - let searcher: Searcher - var state: Searcher.BackwardState - - init(base: Base, searcher: Searcher) { - self.base = base - self.searcher = searcher - self.state = searcher.backwardState( - for: base, in: base.startIndex.. _BackwardMatchResult? 
{ - searcher.matchingSearchBack(base, &state).map { range, result in - _BackwardMatchResult(match: base[range], result: result) - } - } - } - - public func makeIterator() -> Iterator { - Iterator(base: base, searcher: searcher) - } -} - -// TODO: `Collection` conformance - -// MARK: `CollectionSearcher` algorithms - -extension Collection { - func _matches( - of searcher: S - ) -> MatchesCollection where S.Searched == Self { - MatchesCollection(base: self, searcher: searcher) - } -} - -extension BidirectionalCollection { - func _matchesFromBack( - of searcher: S - ) -> ReversedMatchesCollection where S.BackwardSearched == Self { - ReversedMatchesCollection(base: self, searcher: searcher) - } -} - // MARK: Regex algorithms @available(SwiftStdlib 5.7, *) -struct RegexMatchesCollection { +struct RegexMatchesSequence { let input: String let subjectBounds: Range let searchBounds: Range let regex: Regex - let startIndex: Index - + init( input: String, subjectBounds: Range, @@ -201,15 +28,11 @@ struct RegexMatchesCollection { self.subjectBounds = subjectBounds self.searchBounds = searchBounds self.regex = regex - self.startIndex = (try? regex._firstMatch( - input, - subjectBounds: subjectBounds, - searchBounds: searchBounds)).map(Index.match) ?? .end } } @available(SwiftStdlib 5.7, *) -extension RegexMatchesCollection: Sequence { +extension RegexMatchesSequence: Sequence { /// Returns the index to start searching for the next match after `match`. fileprivate func searchIndex(after match: Regex.Match) -> String.Index? { if !match.range.isEmpty { @@ -218,7 +41,7 @@ extension RegexMatchesCollection: Sequence { // If the last match was an empty match, advance by one position and // run again, unless at the end of `input`. 
- if match.range.lowerBound == input.endIndex { + guard match.range.lowerBound < subjectBounds.upperBound else { return nil } @@ -231,29 +54,26 @@ extension RegexMatchesCollection: Sequence { } struct Iterator: IteratorProtocol { - let base: RegexMatchesCollection + let base: RegexMatchesSequence // Because `RegexMatchesCollection` eagerly computes the first match for // its `startIndex`, the iterator can use that match for its initial // iteration. For subsequent calls to `next()`, this value is `false`, and // `nextStart` is used to search for the next match. var initialIteration = true - var nextStart: String.Index? - - init(_ matches: RegexMatchesCollection) { + + // Set to nil when iteration is finished (because some regex can empty-match + // at the end of the subject). + var currentPosition: String.Index? + + init(_ matches: RegexMatchesSequence) { self.base = matches - self.nextStart = base.startIndex.match.flatMap(base.searchIndex(after:)) + self.currentPosition = base.subjectBounds.lowerBound } mutating func next() -> Regex.Match? { - // Initial case with pre-computed first match - if initialIteration { - initialIteration = false - return base.startIndex.match - } - - // `nextStart` is `nil` when iteration has completed - guard let start = nextStart, start <= base.searchBounds.upperBound else { + // `currentPosition` is `nil` when iteration has completed + guard let position = currentPosition, position <= base.searchBounds.upperBound else { return nil } @@ -261,8 +81,8 @@ extension RegexMatchesCollection: Sequence { let match = try? base.regex._firstMatch( base.input, subjectBounds: base.subjectBounds, - searchBounds: start...Match) - case end - - var match: Regex.Match? 
{ - switch self { - case .match(let match): return match - case .end: return nil - } - } - - static func == (lhs: Self, rhs: Self) -> Bool { - switch (lhs, rhs) { - case (.match(let lhs), .match(let rhs)): - return lhs.range == rhs.range - case (.end, .end): - return true - case (.end, .match), (.match, .end): - return false - } - } - - static func < (lhs: Self, rhs: Self) -> Bool { - switch (lhs, rhs) { - case (.match(let lhs), .match(let rhs)): - // This implementation uses a tuple comparison so that an empty - // range `i.. Index { - guard let currentMatch = i.match else { - fatalError("Can't advance past the 'endIndex' of a match collection.") - } - - guard - let start = searchIndex(after: currentMatch), - start <= searchBounds.upperBound, - let nextMatch = try? regex._firstMatch( - input, - subjectBounds: subjectBounds, - searchBounds: start.. Regex.Match { - guard let match = position.match else { - fatalError("Can't subscript the 'endIndex' of a match collection.") - } - return match - } -} - extension BidirectionalCollection where SubSequence == Substring { @available(SwiftStdlib 5.7, *) @_disfavoredOverload func _matches( of regex: R - ) -> RegexMatchesCollection { - RegexMatchesCollection( + ) -> RegexMatchesSequence { + RegexMatchesSequence( input: self[...].base, subjectBounds: startIndex.. - ) -> (upperBound: Consumed.Index, match: Match)? -} - -extension MatchingCollectionConsumer { - func consuming( - _ consumed: Consumed, - in range: Range - ) -> Consumed.Index? { - matchingConsuming(consumed, in: range)?.upperBound - } -} - -// MARK: Consuming from the back - -protocol BidirectionalMatchingCollectionConsumer: - MatchingCollectionConsumer, BidirectionalCollectionConsumer -{ - func matchingConsumingBack( - _ consumed: Consumed, - in range: Range - ) -> (lowerBound: Consumed.Index, match: Match)? -} - -extension BidirectionalMatchingCollectionConsumer { - func consumingBack( - _ consumed: Consumed, - in range: Range - ) -> Consumed.Index? 
{ - matchingConsumingBack(consumed, in: range)?.lowerBound - } -} - diff --git a/Sources/_StringProcessing/Algorithms/Matching/MatchingCollectionSearcher.swift b/Sources/_StringProcessing/Algorithms/Matching/MatchingCollectionSearcher.swift index 902d94591..b75f30c73 100644 --- a/Sources/_StringProcessing/Algorithms/Matching/MatchingCollectionSearcher.swift +++ b/Sources/_StringProcessing/Algorithms/Matching/MatchingCollectionSearcher.swift @@ -25,107 +25,3 @@ extension MatchingCollectionSearcher { matchingSearch(searched, &state)?.range } } - -protocol MatchingStatelessCollectionSearcher: - MatchingCollectionSearcher, StatelessCollectionSearcher -{ - func matchingSearch( - _ searched: Searched, - in range: Range - ) -> (range: Range, match: Match)? -} - -extension MatchingStatelessCollectionSearcher { - // for disambiguation between the `MatchingCollectionSearcher` and - // `StatelessCollectionSearcher` overloads - func search( - _ searched: Searched, - _ state: inout State - ) -> Range? { - matchingSearch(searched, &state)?.range - } - - func matchingSearch( - _ searched: Searched, - _ state: inout State - ) -> (range: Range, match: Match)? { - // TODO: deduplicate this logic with `StatelessCollectionSearcher`? - - guard - case .index(let index) = state.position, - let (range, value) = matchingSearch(searched, in: index.. - ) -> Range? { - matchingSearch(searched, in: range)?.range - } -} - -// MARK: Searching from the back - -protocol BackwardMatchingCollectionSearcher: BackwardCollectionSearcher { - associatedtype Match - func matchingSearchBack( - _ searched: BackwardSearched, - _ state: inout BackwardState - ) -> (range: Range, match: Match)? -} - -protocol BackwardMatchingStatelessCollectionSearcher: - BackwardMatchingCollectionSearcher, BackwardStatelessCollectionSearcher -{ - func matchingSearchBack( - _ searched: BackwardSearched, - in range: Range - ) -> (range: Range, match: Match)? 
-} - -extension BackwardMatchingStatelessCollectionSearcher { - func searchBack( - _ searched: BackwardSearched, - in range: Range - ) -> Range? { - matchingSearchBack(searched, in: range)?.range - } - - func matchingSearchBack( - _ searched: BackwardSearched, - _ state: inout BackwardState) -> (range: Range, match: Match)? - { - // TODO: deduplicate this logic with `StatelessBackwardCollectionSearcher`? - - guard - case .index(let index) = state.position, - let (range, value) = matchingSearchBack(searched, in: state.end..) -> BackwardState - func searchBack( - _ searched: BackwardSearched, - _ state: inout BackwardState - ) -> Range? -} - -protocol BackwardStatelessCollectionSearcher: BackwardCollectionSearcher - where BackwardState == DefaultSearcherState -{ - func searchBack( - _ searched: BackwardSearched, - in range: Range - ) -> Range? -} - -extension BackwardStatelessCollectionSearcher { - func backwardState( - for searched: BackwardSearched, - in range: Range - ) -> BackwardState { - BackwardState(position: .index(range.upperBound), end: range.lowerBound) - } - - func searchBack( - _ searched: BackwardSearched, - _ state: inout BackwardState) -> Range? { - guard - case .index(let index) = state.position, - let range = searchBack(searched, in: state.end.. { - let consumer: Consumer -} - -extension ConsumerSearcher: StatelessCollectionSearcher { - typealias Searched = Consumer.Consumed - - func search( - _ searched: Searched, - in range: Range - ) -> Range? { - var start = range.lowerBound - while true { - if let end = consumer.consuming(searched, in: start.. - ) -> Range? { - var end = range.upperBound - while true { - if let start = consumer.consumingBack( - searched, in: range.lowerBound.. - ) -> (range: Range, match: Consumer.Match)? { - var start = range.lowerBound - while true { - if let (end, value) = consumer.matchingConsuming( - searched, - in: start.. - ) -> (range: Range, match: Match)? 
{ - var end = range.upperBound - while true { - if let (start, value) = consumer.matchingConsumingBack( - searched, in: range.lowerBound.. - where Searched.Element: Equatable, Pattern.Element == Searched.Element -{ - let pattern: Pattern -} - -extension NaivePatternSearcher: StatelessCollectionSearcher { - func search( - _ searched: Searched, - in range: Range - ) -> Range? { - var searchStart = range.lowerBound - - guard let patternFirst = pattern.first else { - return searchStart.. - ) -> Range? { - var searchEnd = range.upperBound - - guard let otherLastIndex = pattern.indices.last else { - return searchEnd.. { - let searcher: Searcher? -} - -extension PatternOrEmpty: CollectionSearcher { - typealias Searched = Searcher.Searched - - struct State { - enum Representation { - case state(Searcher.State) - case empty(index: Searched.Index, end: Searched.Index) - case emptyDone - } - - let representation: Representation - } - - func state( - for searched: Searcher.Searched, - in range: Range - ) -> State { - if let searcher = searcher { - return State( - representation: .state(searcher.state(for: searched, in: range))) - } else { - return State( - representation: .empty(index: range.lowerBound, end: range.upperBound)) - } - } - - func search( - _ searched: Searched, - _ state: inout State - ) -> Range? { - switch state.representation { - case .state(var s): - // TODO: Avoid a potential copy-on-write copy here - let result = searcher!.search(searched, &s) - state = State(representation: .state(s)) - return result - case .empty(let index, let end): - if index == end { - state = State(representation: .emptyDone) - } else { - state = State( - representation: .empty(index: searched.index(after: index), end: end)) - } - return index.. { - let predicate: (Searched.Element) -> Bool -} - -extension PredicateSearcher: StatelessCollectionSearcher { - func search( - _ searched: Searched, - in range: Range - ) -> Range? 
{ - guard let index = searched[range].firstIndex(where: predicate) else { - return nil - } - return index.. - ) -> Range? { - guard let index = searched[range].lastIndex(where: predicate) else { - return nil - } - return index.. = matches + let _: RegexMatchesSequence = matches XCTAssertEqual(matches.map(\.output), expected) - - let i = matches.index(matches.startIndex, offsetBy: 3) - XCTAssertEqual(matches[i].output, expected[3]) - let j = matches.index(i, offsetBy: 5) - XCTAssertEqual(j, matches.endIndex) - - var index = matches.startIndex - while index < matches.endIndex { - XCTAssertEqual( - matches[index].output, - expected[matches.distance(from: matches.startIndex, to: index)]) - matches.formIndex(after: &index) - } } } From 4da66225d61bbc7c182ee1925ec318cb4c9bd2fb Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Wed, 9 Oct 2024 09:32:52 -0600 Subject: [PATCH 08/22] New benchmark: word boundaries in urls --- Package.swift | 28 +++++++++---------- Sources/RegexBenchmark/Benchmark.swift | 11 ++++++-- .../BenchmarkRegistration.swift | 2 ++ Sources/RegexBenchmark/BenchmarkRunner.swift | 13 ++++++++- Sources/RegexBenchmark/Inputs/URL.swift | 22 +++++++++++++++ Sources/RegexBenchmark/Suite/URLRegex.swift | 14 ++++++++++ 6 files changed, 72 insertions(+), 18 deletions(-) create mode 100644 Sources/RegexBenchmark/Inputs/URL.swift create mode 100644 Sources/RegexBenchmark/Suite/URLRegex.swift diff --git a/Package.swift b/Package.swift index b9b9d6d71..1f5e10f0a 100644 --- a/Package.swift +++ b/Package.swift @@ -59,9 +59,9 @@ let package = Package( name: "VariadicsGenerator", targets: ["VariadicsGenerator"]), // Disable to work around rdar://126877024 -// .executable( -// name: "RegexBenchmark", -// targets: ["RegexBenchmark"]) + .executable( + name: "RegexBenchmark", + targets: ["RegexBenchmark"]) ], dependencies: [ .package(url: "https://github.com/apple/swift-argument-parser", from: "1.0.0"), @@ -143,17 +143,17 @@ let package = Package( "_StringProcessing" ], 
swiftSettings: [availabilityDefinition]), -// .executableTarget( -// name: "RegexBenchmark", -// dependencies: [ -// .product(name: "ArgumentParser", package: "swift-argument-parser"), -// "_RegexParser", -// "_StringProcessing", -// "RegexBuilder" -// ], -// swiftSettings: [ -// .unsafeFlags(["-Xfrontend", "-disable-availability-checking"]), -// ]), + .executableTarget( + name: "RegexBenchmark", + dependencies: [ + .product(name: "ArgumentParser", package: "swift-argument-parser"), + "_RegexParser", + "_StringProcessing", + "RegexBuilder" + ], + swiftSettings: [ + .unsafeFlags(["-Xfrontend", "-disable-availability-checking"]), + ]), // MARK: Exercises .target( diff --git a/Sources/RegexBenchmark/Benchmark.swift b/Sources/RegexBenchmark/Benchmark.swift index 3a967c022..bcf8fa42a 100644 --- a/Sources/RegexBenchmark/Benchmark.swift +++ b/Sources/RegexBenchmark/Benchmark.swift @@ -153,6 +153,8 @@ struct CrossBenchmark { /// Whether to also run scalar-semantic mode var alsoRunScalarSemantic: Bool = true + var alsoRunSimpleWordBoundaries: Bool = false + func register(_ runner: inout BenchmarkRunner) { if isWhole { runner.registerCrossBenchmark( @@ -160,14 +162,16 @@ struct CrossBenchmark { input: input, pattern: regex, .whole, - alsoRunScalarSemantic: alsoRunScalarSemantic) + alsoRunScalarSemantic: alsoRunScalarSemantic, + alsoRunSimpleWordBoundaries: alsoRunSimpleWordBoundaries) } else { runner.registerCrossBenchmark( nameBase: baseName, input: input, pattern: regex, .allMatches, - alsoRunScalarSemantic: alsoRunScalarSemantic) + alsoRunScalarSemantic: alsoRunScalarSemantic, + alsoRunSimpleWordBoundaries: alsoRunSimpleWordBoundaries) if includeFirst || runner.includeFirstOverride { runner.registerCrossBenchmark( @@ -175,7 +179,8 @@ struct CrossBenchmark { input: input, pattern: regex, .first, - alsoRunScalarSemantic: alsoRunScalarSemantic) + alsoRunScalarSemantic: alsoRunScalarSemantic, + alsoRunSimpleWordBoundaries: alsoRunSimpleWordBoundaries) } } } diff --git 
a/Sources/RegexBenchmark/BenchmarkRegistration.swift b/Sources/RegexBenchmark/BenchmarkRegistration.swift index a3abef8e4..e12502e99 100644 --- a/Sources/RegexBenchmark/BenchmarkRegistration.swift +++ b/Sources/RegexBenchmark/BenchmarkRegistration.swift @@ -18,6 +18,8 @@ extension BenchmarkRunner { self.addDiceNotation() self.addErrorMessages() self.addIpAddress() + + self.addURLWithWordBoundaries() // -- end of registrations -- } } diff --git a/Sources/RegexBenchmark/BenchmarkRunner.swift b/Sources/RegexBenchmark/BenchmarkRunner.swift index b067b9679..6abee43aa 100644 --- a/Sources/RegexBenchmark/BenchmarkRunner.swift +++ b/Sources/RegexBenchmark/BenchmarkRunner.swift @@ -33,7 +33,8 @@ struct BenchmarkRunner { input: String, pattern: String, _ type: Benchmark.MatchType, - alsoRunScalarSemantic: Bool = true + alsoRunScalarSemantic: Bool = true, + alsoRunSimpleWordBoundaries: Bool ) { let swiftRegex = try! Regex(pattern) let nsRegex: NSRegularExpression @@ -58,6 +59,16 @@ struct BenchmarkRunner { type: .init(type), target: input)) + if alsoRunSimpleWordBoundaries { + register( + Benchmark( + name: nameBase + nameSuffix + "_SimpleWordBoundaries", + regex: swiftRegex.wordBoundaryKind(.simple), + pattern: pattern, + type: type, + target: input)) + } + if alsoRunScalarSemantic { register( Benchmark( diff --git a/Sources/RegexBenchmark/Inputs/URL.swift b/Sources/RegexBenchmark/Inputs/URL.swift new file mode 100644 index 000000000..b1b03f53d --- /dev/null +++ b/Sources/RegexBenchmark/Inputs/URL.swift @@ -0,0 +1,22 @@ +extension Inputs { + static let url: String = { + let element = """ + Item 1 | Item 2® •Item 3 Item4 + + + \t\t\t + + Check it out here: http://www.test.com/this-is-a-fake-url-that-should-be-replaced?a=1 + Check it out here: https://www.test.com/this-is-a-fake-url-that-should-be-replaced?a=1 + This is not a web link ftp://user@host:domain.com/path + This is a link without a scheme www.apple.com/mac + + This is some good text and should not be removed. 
+ Thanks. + 😀🩷🤵🏿 + """ + let multiplier = 30 + return Array(repeating: element, count: multiplier).joined() + }() + +} diff --git a/Sources/RegexBenchmark/Suite/URLRegex.swift b/Sources/RegexBenchmark/Suite/URLRegex.swift new file mode 100644 index 000000000..e5f00f4e7 --- /dev/null +++ b/Sources/RegexBenchmark/Suite/URLRegex.swift @@ -0,0 +1,14 @@ +import _StringProcessing + +extension BenchmarkRunner { + mutating func addURLWithWordBoundaries() { + let urlRegex = #"https?://([-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6})\b[-a-zA-Z0-9()@:%_+.~#?&=]*"# + let url = CrossBenchmark( + baseName: "URLWithWordBoundaries", + regex: urlRegex, + input: Inputs.url, + alsoRunSimpleWordBoundaries: true + ) + url.register(&self) + } +} From 2212df8a2f49b3030b58a5fbf271df9e5d9d6a48 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Fri, 11 Oct 2024 09:29:01 -0600 Subject: [PATCH 09/22] Benchmarks: can specify by exact name instead of pattern --- Sources/RegexBenchmark/CLI.swift | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/Sources/RegexBenchmark/CLI.swift b/Sources/RegexBenchmark/CLI.swift index 77ebff47b..67dc4f8e2 100644 --- a/Sources/RegexBenchmark/CLI.swift +++ b/Sources/RegexBenchmark/CLI.swift @@ -37,7 +37,10 @@ struct Runner: ParsableCommand { @Flag(help: "Exclude running NSRegex benchmarks") var excludeNs = false - + + @Flag(help: "Rather than specify specific-benchmarks as patterns, use exact names") + var exactName = false + @Flag(help: """ Enable tracing of the engine (warning: lots of output). Prints out processor state each cycle @@ -73,7 +76,11 @@ swift build -c release -Xswiftc -DPROCESSOR_MEASUREMENTS_ENABLED if !self.specificBenchmarks.isEmpty { runner.suite = runner.suite.filter { b in specificBenchmarks.contains { pattern in - try! Regex(pattern).firstMatch(in: b.name) != nil + if exactName { + return pattern == b.name + } + + return try! 
Regex(pattern).firstMatch(in: b.name) != nil } } } From 73737f7cf29d22763479d65cdcea724cd819f7b8 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Wed, 9 Oct 2024 14:32:57 -0600 Subject: [PATCH 10/22] Persist persistent state in the processor. Refactor executor interfaces as well --- .../Algorithms/Algorithms/Ranges.swift | 4 +- .../Algorithms/Matching/Matches.swift | 85 +----- Sources/_StringProcessing/CMakeLists.txt | 2 - Sources/_StringProcessing/Compiler.swift | 4 +- .../_StringProcessing/Engine/Consume.swift | 58 ----- Sources/_StringProcessing/Engine/Engine.swift | 37 --- .../_StringProcessing/Engine/Processor.swift | 52 ++-- Sources/_StringProcessing/Executor.swift | 245 +++++++++++++----- Sources/_StringProcessing/Regex/Match.swift | 74 +++--- Tests/RegexTests/CaptureTests.swift | 14 +- Tests/RegexTests/CompileTests.swift | 4 +- 11 files changed, 275 insertions(+), 304 deletions(-) delete mode 100644 Sources/_StringProcessing/Engine/Consume.swift delete mode 100644 Sources/_StringProcessing/Engine/Engine.swift diff --git a/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift b/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift index a82fb875c..9c750c979 100644 --- a/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift +++ b/Sources/_StringProcessing/Algorithms/Algorithms/Ranges.swift @@ -99,10 +99,10 @@ struct RegexRangesSequence { regex: Regex ) { self.base = .init( + program: regex.regex.program.loweredProgram, input: input, subjectBounds: subjectBounds, - searchBounds: searchBounds, - regex: regex) + searchBounds: searchBounds) } } diff --git a/Sources/_StringProcessing/Algorithms/Matching/Matches.swift b/Sources/_StringProcessing/Algorithms/Matching/Matches.swift index 3c435be97..9b111d15f 100644 --- a/Sources/_StringProcessing/Algorithms/Matching/Matches.swift +++ b/Sources/_StringProcessing/Algorithms/Matching/Matches.swift @@ -12,85 +12,7 @@ // MARK: Regex algorithms @available(SwiftStdlib 5.7, *) -struct 
RegexMatchesSequence { - let input: String - let subjectBounds: Range - let searchBounds: Range - let regex: Regex - - init( - input: String, - subjectBounds: Range, - searchBounds: Range, - regex: Regex - ) { - self.input = input - self.subjectBounds = subjectBounds - self.searchBounds = searchBounds - self.regex = regex - } -} - -@available(SwiftStdlib 5.7, *) -extension RegexMatchesSequence: Sequence { - /// Returns the index to start searching for the next match after `match`. - fileprivate func searchIndex(after match: Regex.Match) -> String.Index? { - if !match.range.isEmpty { - return match.range.upperBound - } - - // If the last match was an empty match, advance by one position and - // run again, unless at the end of `input`. - guard match.range.lowerBound < subjectBounds.upperBound else { - return nil - } - - switch regex.initialOptions.semanticLevel { - case .graphemeCluster: - return input.index(after: match.range.upperBound) - case .unicodeScalar: - return input.unicodeScalars.index(after: match.range.upperBound) - } - } - - struct Iterator: IteratorProtocol { - let base: RegexMatchesSequence - - // Because `RegexMatchesCollection` eagerly computes the first match for - // its `startIndex`, the iterator can use that match for its initial - // iteration. For subsequent calls to `next()`, this value is `false`, and - // `nextStart` is used to search for the next match. - var initialIteration = true - - // Set to nil when iteration is finished (because some regex can empty-match - // at the end of the subject). - var currentPosition: String.Index? - - init(_ matches: RegexMatchesSequence) { - self.base = matches - self.currentPosition = base.subjectBounds.lowerBound - } - - mutating func next() -> Regex.Match? 
{ - // `currentPosition` is `nil` when iteration has completed - guard let position = currentPosition, position <= base.searchBounds.upperBound else { - return nil - } - - // Otherwise, find the next match (if any) and compute `nextStart` - let match = try? base.regex._firstMatch( - base.input, - subjectBounds: base.subjectBounds, - searchBounds: position.. Iterator { - Iterator(self) - } -} +typealias RegexMatchesSequence = Executor.Matches extension BidirectionalCollection where SubSequence == Substring { @available(SwiftStdlib 5.7, *) @@ -99,10 +21,10 @@ extension BidirectionalCollection where SubSequence == Substring { of regex: R ) -> RegexMatchesSequence { RegexMatchesSequence( + program: regex.regex.program.loweredProgram, input: self[...].base, subjectBounds: startIndex...Match> for SE-0346 @@ -116,6 +38,7 @@ extension BidirectionalCollection where SubSequence == Substring { // FIXME: Array init calls count, which double-executes the regex :-( // FIXME: just return some Collection.Match> var result = Array.Match>() + for match in _matches(of: r) { result.append(match) } diff --git a/Sources/_StringProcessing/CMakeLists.txt b/Sources/_StringProcessing/CMakeLists.txt index 3b00470ff..0ebe2aed6 100644 --- a/Sources/_StringProcessing/CMakeLists.txt +++ b/Sources/_StringProcessing/CMakeLists.txt @@ -18,8 +18,6 @@ add_library(_StringProcessing Algorithms/Searchers/CollectionSearcher.swift Algorithms/Searchers/ZSearcher.swift Engine/Backtracking.swift - Engine/Consume.swift - Engine/Engine.swift Engine/InstPayload.swift Engine/Instruction.swift Engine/MEBuilder.swift diff --git a/Sources/_StringProcessing/Compiler.swift b/Sources/_StringProcessing/Compiler.swift index 530788266..33cffaf20 100644 --- a/Sources/_StringProcessing/Compiler.swift +++ b/Sources/_StringProcessing/Compiler.swift @@ -89,7 +89,7 @@ func _compileRegex( _ regex: String, _ syntax: SyntaxOptions = .traditional, _ semanticLevel: RegexSemanticLevel? 
= nil -) throws -> Executor { +) throws -> MEProgram { let ast = try parse(regex, syntax) let dsl: DSLTree @@ -104,7 +104,7 @@ func _compileRegex( dsl = ast.dslTree } let program = try Compiler(tree: dsl).emit() - return Executor(program: program) + return program } @_spi(RegexBenchmark) diff --git a/Sources/_StringProcessing/Engine/Consume.swift b/Sources/_StringProcessing/Engine/Consume.swift deleted file mode 100644 index 6af973919..000000000 --- a/Sources/_StringProcessing/Engine/Consume.swift +++ /dev/null @@ -1,58 +0,0 @@ -//===----------------------------------------------------------------------===// -// -// This source file is part of the Swift.org open source project -// -// Copyright (c) 2021-2022 Apple Inc. and the Swift project authors -// Licensed under Apache License v2.0 with Runtime Library Exception -// -// See https://swift.org/LICENSE.txt for license information -// -//===----------------------------------------------------------------------===// - -var checkComments = true - -extension Engine { - func makeProcessor( - input: String, bounds: Range, matchMode: MatchMode - ) -> Processor { - Processor( - program: program, - input: input, - subjectBounds: bounds, - searchBounds: bounds, - matchMode: matchMode, - isTracingEnabled: enableTracing, - shouldMeasureMetrics: enableMetrics) - } - - func makeFirstMatchProcessor( - input: String, - subjectBounds: Range, - searchBounds: Range - ) -> Processor { - Processor( - program: program, - input: input, - subjectBounds: subjectBounds, - searchBounds: searchBounds, - matchMode: .partialFromFront, - isTracingEnabled: enableTracing, - shouldMeasureMetrics: enableMetrics) - } -} - -extension Processor { - // TODO: Should we throw here? - mutating func consume() -> Input.Index? 
{ - while true { - switch self.state { - case .accept: - return self.currentPosition - case .fail: - return nil - case .inProgress: self.cycle() - } - } - } -} - diff --git a/Sources/_StringProcessing/Engine/Engine.swift b/Sources/_StringProcessing/Engine/Engine.swift deleted file mode 100644 index a5cb11bd6..000000000 --- a/Sources/_StringProcessing/Engine/Engine.swift +++ /dev/null @@ -1,37 +0,0 @@ -//===----------------------------------------------------------------------===// -// -// This source file is part of the Swift.org open source project -// -// Copyright (c) 2021-2022 Apple Inc. and the Swift project authors -// Licensed under Apache License v2.0 with Runtime Library Exception -// -// See https://swift.org/LICENSE.txt for license information -// -//===----------------------------------------------------------------------===// - -// Currently, engine binds the type and consume binds an instance. -// But, we can play around with this. -struct Engine { - - let program: MEProgram - - // TODO: Pre-allocated register banks - - var instructions: InstructionList { program.instructions } - - var enableTracing: Bool { program.enableTracing } - var enableMetrics: Bool { program.enableMetrics } - - init(_ program: MEProgram) { - self.program = program - } -} - -struct AsyncEngine { /* ... */ } - -extension Engine: CustomStringConvertible { - var description: String { - // TODO: better description - return program.description - } -} diff --git a/Sources/_StringProcessing/Engine/Processor.swift b/Sources/_StringProcessing/Engine/Processor.swift index d6b2cfe0c..57e156fff 100644 --- a/Sources/_StringProcessing/Engine/Processor.swift +++ b/Sources/_StringProcessing/Engine/Processor.swift @@ -18,7 +18,7 @@ enum MatchMode { /// A concrete CU. 
Somehow will run the concrete logic and /// feed stuff back to generic code -struct Controller { +struct Controller: Equatable { var pc: InstructionAddress mutating func step() { @@ -48,6 +48,16 @@ struct Processor { /// `input.startIndex.. + let matchMode: MatchMode + let instructions: InstructionList + + // MARK: Update-only state + + var wordIndexCache: Set? = nil + var wordIndexMaxIndex: String.Index? = nil + + // MARK: Resettable state + /// The bounds within the subject for an individual search. /// /// `searchBounds` is equal to `subjectBounds` in some cases, but can be a @@ -57,12 +67,7 @@ struct Processor { /// Anchors like `^` and `.startOfSubject` use `subjectBounds` instead of /// `searchBounds`. The "start of matching" anchor `\G` uses `searchBounds` /// as its starting point. - let searchBounds: Range - - let matchMode: MatchMode - let instructions: InstructionList - - // MARK: Resettable state + var searchBounds: Range /// The current search position while processing. /// @@ -80,9 +85,6 @@ struct Processor { var storedCaptures: Array<_StoredCapture> - var wordIndexCache: Set? = nil - var wordIndexMaxIndex: String.Index? = nil - var state: State = .inProgress var failureReason: Error? 
= nil @@ -103,9 +105,7 @@ extension Processor { input: Input, subjectBounds: Range, searchBounds: Range, - matchMode: MatchMode, - isTracingEnabled: Bool, - shouldMeasureMetrics: Bool + matchMode: MatchMode ) { self.controller = Controller(pc: 0) self.instructions = program.instructions @@ -115,8 +115,8 @@ extension Processor { self.matchMode = matchMode self.metrics = ProcessorMetrics( - isTracingEnabled: isTracingEnabled, - shouldMeasureMetrics: shouldMeasureMetrics) + isTracingEnabled: program.enableTracing, + shouldMeasureMetrics: program.enableMetrics) self.currentPosition = searchBounds.lowerBound @@ -128,8 +128,12 @@ extension Processor { _checkInvariants() } - mutating func reset(currentPosition: Position) { + mutating func reset( + currentPosition: Position, + searchBounds: Range + ) { self.currentPosition = currentPosition + self.searchBounds = searchBounds self.controller = Controller(pc: 0) @@ -149,6 +153,22 @@ extension Processor { _checkInvariants() } + // Check that resettable state has been reset. Note that `reset()` + // takes a new current position and search bounds. 
+ func isReset() -> Bool { + _checkInvariants() + guard self.controller == Controller(pc: 0), + self.savePoints.isEmpty, + self.callStack.isEmpty, + self.storedCaptures.allSatisfy({ $0.range == nil }), + self.state == .inProgress, + self.failureReason == nil + else { + return false + } + return true + } + func _checkInvariants() { assert(searchBounds.lowerBound >= subjectBounds.lowerBound) assert(searchBounds.upperBound <= subjectBounds.upperBound) diff --git a/Sources/_StringProcessing/Executor.swift b/Sources/_StringProcessing/Executor.swift index 5cf702514..6befcdbc8 100644 --- a/Sources/_StringProcessing/Executor.swift +++ b/Sources/_StringProcessing/Executor.swift @@ -11,93 +11,220 @@ internal import _RegexParser -struct Executor { - // TODO: consider let, for now lets us toggle tracing - var engine: Engine +/// `Executor` encapsulates the execution of the regex engine post-compilation. +/// It doesn't know anything about the `Regex` type or how to compile a regex. +@available(SwiftStdlib 5.7, *) +enum Executor { + static func prefixMatch( + _ program: MEProgram, + _ input: String, + subjectBounds: Range, + searchBounds: Range + ) throws -> Regex.Match? { + try Executor._run( + program, + input, + subjectBounds: subjectBounds, + searchBounds: searchBounds, + mode: .partialFromFront) + } - init(program: MEProgram) { - self.engine = Engine(program) + static func wholeMatch( + _ program: MEProgram, + _ input: String, + subjectBounds: Range, + searchBounds: Range + ) throws -> Regex.Match? { + try Executor._run( + program, + input, + subjectBounds: subjectBounds, + searchBounds: searchBounds, + mode: .wholeString) } - @available(SwiftStdlib 5.7, *) - func firstMatch( + static func firstMatch( + _ program: MEProgram, _ input: String, subjectBounds: Range, - searchBounds: Range, - graphemeSemantic: Bool + searchBounds: Range ) throws -> Regex.Match? 
{ - var cpu = engine.makeFirstMatchProcessor( + var cpu = Processor( + program: program, input: input, subjectBounds: subjectBounds, - searchBounds: searchBounds) -#if PROCESSOR_MEASUREMENTS_ENABLED - defer { if cpu.metrics.shouldMeasureMetrics { cpu.printMetrics() } } -#endif - var low = searchBounds.lowerBound - let high = searchBounds.upperBound + searchBounds: searchBounds, + matchMode: .partialFromFront) + return try Executor._firstMatch( + program, + using: &cpu) + } + + static func _firstMatch( + _ program: MEProgram, + using cpu: inout Processor + ) throws -> Regex.Match? { + let isGraphemeSemantic = program.initialOptions.semanticLevel == .graphemeCluster + + var low = cpu.searchBounds.lowerBound + let high = cpu.searchBounds.upperBound while true { - if let m: Regex.Match = try _match( - input, from: low, using: &cpu - ) { + if let m = try Executor._run(program, &cpu) { return m } - if low >= high { return nil } - if graphemeSemantic { - low = input.index( - low, offsetBy: 1, limitedBy: searchBounds.upperBound) ?? 
searchBounds.upperBound + // Fast-path for start-anchored regex + if program.canOnlyMatchAtStart { + return nil + } + if low == high { return nil } + if isGraphemeSemantic { + cpu.input.formIndex(after: &low) } else { - input.unicodeScalars.formIndex(after: &low) + cpu.input.unicodeScalars.formIndex(after: &low) + } + guard low <= high else { + return nil } - cpu.reset(currentPosition: low) + cpu.reset(currentPosition: low, searchBounds: cpu.searchBounds) + } + } +} + +@available(SwiftStdlib 5.7, *) +extension Executor { + struct Matches: Sequence { + var program: MEProgram + var input: String + var subjectBounds: Range + var searchBounds: Range + + struct Iterator: IteratorProtocol { + var program: MEProgram + var processor: Processor + var finished = false + } + + func makeIterator() -> Iterator { + Iterator( + program: program, + processor: Processor( + program: program, + input: input, + subjectBounds: subjectBounds, + searchBounds: searchBounds, + matchMode: .partialFromFront)) + } + } +} + +@available(SwiftStdlib 5.7, *) +extension Executor.Matches.Iterator { + func nextSearchIndex( + after range: Range + ) -> String.Index? { + if !range.isEmpty { + return range.upperBound + } + + // If the last match was an empty match, advance by one position and + // run again, unless at the end of `input`. + guard range.lowerBound < processor.subjectBounds.upperBound else { + return nil + } + + switch program.initialOptions.semanticLevel { + case .graphemeCluster: + return processor.input.index(after: range.upperBound) + case .unicodeScalar: + return processor.input.unicodeScalars.index(after: range.upperBound) + } + } + + mutating func next() -> Regex.Match? { + if finished { + return nil + } + guard let match = try? Executor._firstMatch( + program, using: &processor + ) else { + return nil + } + + // If there's more input to process, advance our position + // and search bounds. Otherwise, set to fail fast. 
+ if let currentPosition = nextSearchIndex(after: match.range) { + processor.reset( + currentPosition: currentPosition, + searchBounds: currentPosition..( +@available(SwiftStdlib 5.7, *) +extension Executor { + static func _run( + _ program: MEProgram, _ input: String, - in subjectBounds: Range, - _ mode: MatchMode + subjectBounds: Range, + searchBounds: Range, + mode: MatchMode ) throws -> Regex.Match? { - var cpu = engine.makeProcessor( - input: input, bounds: subjectBounds, matchMode: mode) -#if PROCESSOR_MEASUREMENTS_ENABLED - defer { if cpu.metrics.shouldMeasureMetrics { cpu.printMetrics() } } -#endif - return try _match(input, from: subjectBounds.lowerBound, using: &cpu) + var cpu = Processor( + program: program, + input: input, + subjectBounds: subjectBounds, + searchBounds: searchBounds, + matchMode: mode) + return try _run(program, &cpu) } - @available(SwiftStdlib 5.7, *) - func _match( - _ input: String, - from currentPosition: String.Index, - using cpu: inout Processor + static func _run( + _ program: MEProgram, + _ cpu: inout Processor ) throws -> Regex.Match? { - // FIXME: currentPosition is already encapsulated in cpu, don't pass in - // FIXME: cpu.consume() should return the matched range, not the upper bound - guard let endIdx = cpu.consume() else { - if let e = cpu.failureReason { - throw e - } + + let startPosition = cpu.currentPosition + guard let endIdx = try cpu.run() else { return nil } - let capList = MECaptureList( values: cpu.storedCaptures, - referencedCaptureOffsets: engine.program.referencedCaptureOffsets) + referencedCaptureOffsets: program.referencedCaptureOffsets) - let range = currentPosition.., - _ mode: MatchMode - ) throws -> Regex.Match? { - try match(input, in: subjectBounds, mode) +extension Processor { + fileprivate mutating func run() throws -> Input.Index? 
{ +#if PROCESSOR_MEASUREMENTS_ENABLED + defer { if self.metrics.shouldMeasureMetrics { self.printMetrics() } } +#endif + if self.state == .fail { + if let e = failureReason { + throw e + } + return nil + } + assert(isReset()) + while true { + switch self.state { + case .accept: + return self.currentPosition + case .fail: + if let e = failureReason { + throw e + } + return nil + case .inProgress: self.cycle() + } + } } } diff --git a/Sources/_StringProcessing/Regex/Match.swift b/Sources/_StringProcessing/Regex/Match.swift index 0b0b2e797..b4cecead2 100644 --- a/Sources/_StringProcessing/Regex/Match.swift +++ b/Sources/_StringProcessing/Regex/Match.swift @@ -109,7 +109,12 @@ extension Regex { /// - Returns: The match, if this regex matches the entirety of `string`; /// otherwise, `nil`. public func wholeMatch(in string: String) throws -> Regex.Match? { - try _match(string, in: string.startIndex.. Regex.Match? { - try _match(string, in: string.startIndex.. Regex.Match? { - try _firstMatch(string, in: string.startIndex.. Regex.Match? { - try _match(string.base, in: string.startIndex.. Regex.Match? { - try _match(string.base, in: string.startIndex.. Regex.Match? { - try _firstMatch(string.base, in: string.startIndex.., - mode: MatchMode = .wholeString - ) throws -> Regex.Match? { - let executor = Executor(program: regex.program.loweredProgram) - return try executor.match(input, in: subjectBounds, mode) - } - - func _firstMatch( - _ input: String, - in subjectBounds: Range - ) throws -> Regex.Match? { - try regex.program.loweredProgram.canOnlyMatchAtStart - ? _match(input, in: subjectBounds, mode: .partialFromFront) - : _firstMatch(input, subjectBounds: subjectBounds, searchBounds: subjectBounds) - } - - func _firstMatch( - _ input: String, - subjectBounds: Range, - searchBounds: Range - ) throws -> Regex.Match?
{ - let executor = Executor(program: regex.program.loweredProgram) - let graphemeSemantic = regex.initialOptions.semanticLevel == .graphemeCluster - return try executor.firstMatch( - input, - subjectBounds: subjectBounds, - searchBounds: searchBounds, - graphemeSemantic: graphemeSemantic) + let bounds = string.startIndex.. Executor { - let tree = ast.dslTree - let prog = try! Compiler(tree: tree).emit() - let executor = Executor(program: prog) - return executor +func compile(_ ast: AST) -> MEProgram { + try! Compiler(tree: ast.dslTree).emit() } func captureTest( @@ -184,8 +181,11 @@ func captureTest( for (input, output) in tests { let inputRange = input.startIndex...wholeMatch( + compile(ast), + input, + subjectBounds: inputRange, + searchBounds: inputRange ) else { XCTFail("No match", file: file, line: line) return diff --git a/Tests/RegexTests/CompileTests.swift b/Tests/RegexTests/CompileTests.swift index d0500847b..05212388d 100644 --- a/Tests/RegexTests/CompileTests.swift +++ b/Tests/RegexTests/CompileTests.swift @@ -154,7 +154,7 @@ extension RegexTests { ) throws { assert(!equivs.isEmpty) let progs = try equivs.map { - try _compileRegex($0).engine.program + try _compileRegex($0) } let ref = progs.first! for (prog, equiv) in zip(progs, equivs).dropFirst() { @@ -325,7 +325,7 @@ extension RegexTests { do { let prog = try _compileRegex(regex, syntax, semanticLevel) var found: Set = [] - for inst in prog.engine.instructions { + for inst in prog.instructions { let decoded = DecodedInstr.decode(inst) found.insert(decoded) From c23151d008d7eb32ecb6d778f0f768d059027177 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Fri, 11 Oct 2024 10:05:19 -0600 Subject: [PATCH 11/22] Faster processor resets Make processor reset faster by tracking dirty registers and only clearing Arrays that are non-empty. 
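A minimal sketch of the dirty-flag idea described above, assuming a simplified register file. `SketchRegisters`, `resetWork`, and the choice to clear the flag inside `reset()` are illustrative assumptions, not the actual `Processor.Registers` code:

```swift
// Hypothetical stand-in for a register file; not the real Processor.Registers.
struct SketchRegisters {
  // Set by any write through the subscript, cleared by reset() (assumption).
  private(set) var isDirty = false
  private(set) var ints: [Int]
  // Counts the resets that actually had to touch storage.
  private(set) var resetWork = 0

  init(count: Int) {
    ints = Array(repeating: 0, count: count)
  }

  subscript(i: Int) -> Int {
    get { ints[i] }
    set {
      isDirty = true
      ints[i] = newValue
    }
  }

  mutating func reset() {
    // Nothing was written since the last reset, so there is nothing to clear.
    guard isDirty else { return }
    for i in ints.indices { ints[i] = 0 }
    isDirty = false
    resetWork += 1
  }
}
```

With this shape, a register file that was never written skips the clearing loop entirely, which may help when a search is retried from many start positions.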
--- Sources/_StringProcessing/Engine/Processor.swift | 8 ++++++-- Sources/_StringProcessing/Engine/Registers.swift | 12 +++++++++++- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/Sources/_StringProcessing/Engine/Processor.swift b/Sources/_StringProcessing/Engine/Processor.swift index 57e156fff..520f6cf66 100644 --- a/Sources/_StringProcessing/Engine/Processor.swift +++ b/Sources/_StringProcessing/Engine/Processor.swift @@ -139,8 +139,12 @@ extension Processor { self.registers.reset(sentinel: searchBounds.upperBound) - self.savePoints.removeAll(keepingCapacity: true) - self.callStack.removeAll(keepingCapacity: true) + if !self.savePoints.isEmpty { + self.savePoints.removeAll(keepingCapacity: true) + } + if !self.callStack.isEmpty { + self.callStack.removeAll(keepingCapacity: true) + } for idx in storedCaptures.indices { storedCaptures[idx] = .init() diff --git a/Sources/_StringProcessing/Engine/Registers.swift b/Sources/_StringProcessing/Engine/Registers.swift index 7c0d8e2a7..43fb0b8d7 100644 --- a/Sources/_StringProcessing/Engine/Registers.swift +++ b/Sources/_StringProcessing/Engine/Registers.swift @@ -41,6 +41,8 @@ extension Processor { // MARK: writeable, resettable + var isDirty = false + // currently, useful for range-based quantification var ints: [Int] @@ -58,17 +60,22 @@ extension Processor.Registers { } subscript(_ i: IntRegister) -> Int { get { ints[i.rawValue] } - set { ints[i.rawValue] = newValue } + set { + isDirty = true + ints[i.rawValue] = newValue + } } subscript(_ i: ValueRegister) -> Any { get { values[i.rawValue] } set { + isDirty = true values[i.rawValue] = newValue } } subscript(_ i: PositionRegister) -> Input.Index { get { positions[i.rawValue] } set { + isDirty = true positions[i.rawValue] = newValue } } @@ -128,6 +135,9 @@ extension Processor.Registers { } mutating func reset(sentinel: Input.Index) { + guard isDirty else { + return + } self.ints._setAll(to: 0) self.values._setAll(to: SentinelValue()) 
self.positions._setAll(to: Processor.Registers.sentinelIndex) From 3345b0c0bf615c87abc25ef84ababc3d653f4e13 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Tue, 15 Oct 2024 13:30:53 -0600 Subject: [PATCH 12/22] Benchmark: option to output perf as csv --- Sources/RegexBenchmark/BenchmarkResults.swift | 50 ++++++++++++++++++- Sources/RegexBenchmark/CLI.swift | 13 ++++- Sources/RegexBenchmark/Utils/Time.swift | 5 ++ 3 files changed, 66 insertions(+), 2 deletions(-) diff --git a/Sources/RegexBenchmark/BenchmarkResults.swift b/Sources/RegexBenchmark/BenchmarkResults.swift index ae9c5ded2..da66183fd 100644 --- a/Sources/RegexBenchmark/BenchmarkResults.swift +++ b/Sources/RegexBenchmark/BenchmarkResults.swift @@ -21,7 +21,21 @@ extension BenchmarkRunner { self.results = result print("Loaded results from \(url.path)") } - + + /// Attempts to save results in a CSV format to the given path + func saveCSV(to savePath: String) throws { + let url = URL(fileURLWithPath: savePath, isDirectory: false) + let parent = url.deletingLastPathComponent() + if !FileManager.default.fileExists(atPath: parent.path) { + try! FileManager.default.createDirectory( + atPath: parent.path, + withIntermediateDirectories: true) + } + print("Saving result as CSV to \(url.path)") + try results.saveCSV(to: url) + + } + /// Compare this runner's results against the results stored in the given file path func compare( against compareFilePath: String, @@ -153,6 +167,12 @@ struct Measurement: Codable, CustomStringConvertible { var description: String { return "\(median) (stdev: \(Time(stdev)), N = \(samples))" } + + var asCSV: String { + """ + \(median.asCSVSeconds), \(stdev), \(samples) + """ + } } struct BenchmarkResult: Codable, CustomStringConvertible { @@ -170,6 +190,13 @@ struct BenchmarkResult: Codable, CustomStringConvertible { } return base } + + var asCSV: String { + let na = "N/A, N/A, N/A" + return """ + \(runtime.asCSV), \(compileTime?.asCSV ?? na), \(parseTime?.asCSV ?? 
na) + """ + } } extension BenchmarkResult { @@ -263,6 +290,27 @@ struct SuiteResult { } extension SuiteResult: Codable { + func saveCSV(to url: URL) throws { + var output: [(name: String, result: BenchmarkResult)] = [] + for key in results.keys { + output.append((key, results[key]!)) + } + output.sort { + $0.name < $1.name + } + var contents = """ + name,\ + runTime_median, runTime_stddev, runTime_samples,\ + compileTime_median, compileTime_stddev, compileTime_samples,\ + parseTime_median, parseTime_stddev, parseTime_samples\n + """ + for (name, result) in output { + contents.append("\(name), \(result.asCSV)\n") + } + print("Saving result as .csv to \(url.path())") + try contents.write(to: url, atomically: true, encoding: String.Encoding.utf8) + } + func save(to url: URL) throws { let encoder = JSONEncoder() let data = try encoder.encode(self) diff --git a/Sources/RegexBenchmark/CLI.swift b/Sources/RegexBenchmark/CLI.swift index 67dc4f8e2..ee3e94dae 100644 --- a/Sources/RegexBenchmark/CLI.swift +++ b/Sources/RegexBenchmark/CLI.swift @@ -32,6 +32,9 @@ struct Runner: ParsableCommand { @Option(help: "Save comparison results as csv") var saveComparison: String? + @Option(help: "Save benchmark results as csv") + var saveCSV: String?
+ @Flag(help: "Quiet mode") var quiet = false @@ -91,9 +94,14 @@ swift build -c release -Xswiftc -DPROCESSOR_MEASUREMENTS_ENABLED if let loadFile = load { try runner.load(from: loadFile) + if excludeNs { + runner.results.results = runner.results.results.filter { + !$0.key.contains("_NS") + } + } } else { if excludeNs { - runner.suite = runner.suite.filter { b in !b.name.contains("NS") } + runner.suite = runner.suite.filter { b in !b.name.contains("_NS") } } runner.run() } @@ -116,5 +124,8 @@ swift build -c release -Xswiftc -DPROCESSOR_MEASUREMENTS_ENABLED if let compareFile = compareCompileTime { try runner.compareCompileTimes(against: compareFile, showChart: showChart) } + if let csvPath = saveCSV { + try runner.saveCSV(to: csvPath) + } } } diff --git a/Sources/RegexBenchmark/Utils/Time.swift b/Sources/RegexBenchmark/Utils/Time.swift index 3fe567bda..592e18058 100644 --- a/Sources/RegexBenchmark/Utils/Time.swift +++ b/Sources/RegexBenchmark/Utils/Time.swift @@ -66,6 +66,11 @@ extension Time { } extension Time: CustomStringConvertible { + /// Normalize our time to fractions of a second for CSV output + public var asCSVSeconds: String { + return String(format: "%.3g", seconds) + } + public var description: String { if self.seconds == 0 { return "0" } if self.abs() < .attosecond { return String(format: "%.3gas", seconds * 1e18) } From a64b28522c7ba721a5cbcaf614c15b538e6cc0df Mon Sep 17 00:00:00 2001 From: Nate Cook Date: Fri, 18 Oct 2024 10:57:25 -0500 Subject: [PATCH 13/22] [NFC] Re-enable tests for capture type munging (#775) The RegexBuilder tests for working with composed regexes with capture tuple types that don't play well with DSL methods were inadvertently disabled in a previous commit (by me). This re-enables those tests. 
--- Tests/RegexBuilderTests/RegexDSLTests.swift | 4 ---- 1 file changed, 4 deletions(-) diff --git a/Tests/RegexBuilderTests/RegexDSLTests.swift b/Tests/RegexBuilderTests/RegexDSLTests.swift index 7443cae55..1cb53b83d 100644 --- a/Tests/RegexBuilderTests/RegexDSLTests.swift +++ b/Tests/RegexBuilderTests/RegexDSLTests.swift @@ -1831,7 +1831,6 @@ fileprivate let regexWithNonCapture = #/:(?:\d+):/# @available(SwiftStdlib 5.7, *) extension RegexDSLTests { func testLabeledCaptures_regularCapture() throws { - return // The output type of a regex with unlabeled captures is concatenated. let dslWithCapture = Regex { OneOrMore(.word) @@ -1846,7 +1845,6 @@ extension RegexDSLTests { } func testLabeledCaptures_labeledCapture() throws { - return guard #available(macOS 13, *) else { throw XCTSkip("Fix only exists on macOS 13") } @@ -1870,7 +1868,6 @@ extension RegexDSLTests { } func testLabeledCaptures_coalescingWithCapture() throws { - return let coalescingWithCapture = Regex { "e" as Character #/\u{301}(\d*)/# @@ -1887,7 +1884,6 @@ extension RegexDSLTests { } func testLabeledCaptures_bothCapture() throws { - return guard #available(macOS 13, *) else { throw XCTSkip("Fix only exists on macOS 13") } From 7febb364ee65d5a64254895d3ac89bc4d5176afe Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Sun, 20 Oct 2024 08:10:27 -0600 Subject: [PATCH 14/22] Treat capture 0 (i.e. the whole match) specially (#777) Rather than have the whole-match capture be a stored capture, we handle it specially. This speeds up processor resets as we do not need to reset a stored capture (especially when the regex has no other captures). It also speeds up the creation of save points and backtracking, as it's one less capture to save/restore. 
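A rough sketch of the resulting layout, assuming integer offsets in place of `String.Index`. `SketchMatch` and its members are hypothetical names; only the `n - 1` register mapping and the whole-match-first ordering are taken from this change:

```swift
// Hypothetical model: capture registers hold only the explicit captures,
// and the whole match is carried separately as a range.
struct SketchMatch {
  var matchRange: Range<Int>
  var storedCaptures: [Range<Int>?]

  // Output element 0 is synthesized from the match range; explicit
  // captures follow in order.
  var elements: [Range<Int>?] {
    var result: [Range<Int>?] = [matchRange]
    result.append(contentsOf: storedCaptures)
    return result
  }

  // Backreference `\n` in the regex refers to capture register `n - 1`,
  // because capture 0 no longer occupies a register.
  static func register(forBackreference n: Int) -> Int {
    precondition(n >= 1, "capture 0 has no register in this model")
    return n - 1
  }
}
```

In this model, a regex with no explicit captures has an empty `storedCaptures` array, so there is nothing to clear on reset or to copy into a save point.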
--- Sources/_StringProcessing/ByteCodeGen.swift | 18 +++++++-- .../_StringProcessing/Engine/MEBuilder.swift | 39 +++++++++++++++---- .../_StringProcessing/Engine/MECapture.swift | 14 ------- .../_StringProcessing/Engine/Registers.swift | 4 ++ .../Engine/Structuralize.swift | 34 +++++++++++----- Sources/_StringProcessing/Executor.swift | 19 ++++++--- Sources/_StringProcessing/Regex/Match.swift | 5 ++- Tests/RegexBuilderTests/CustomTests.swift | 32 +++++++++++++++ 8 files changed, 124 insertions(+), 41 deletions(-) diff --git a/Sources/_StringProcessing/ByteCodeGen.swift b/Sources/_StringProcessing/ByteCodeGen.swift index 5fb8e89c7..3820f5045 100644 --- a/Sources/_StringProcessing/ByteCodeGen.swift +++ b/Sources/_StringProcessing/ByteCodeGen.swift @@ -43,9 +43,18 @@ extension Compiler { extension Compiler.ByteCodeGen { mutating func emitRoot(_ root: DSLTree.Node) throws -> MEProgram { - // The whole match (`.0` element of output) is equivalent to an implicit - // capture over the entire regex. - try emitNode(.capture(name: nil, reference: nil, root)) + // If the whole regex is a matcher, then the whole-match value + // is the constructed value. Denote that the current value + // register is the processor's value output. 
+ switch root { + case .matcher: + builder.denoteCurrentValueIsWholeMatchValue() + default: + break + } + + try emitNode(root) + builder.canOnlyMatchAtStart = root.canOnlyMatchAtStart() builder.buildAccept() return try builder.assemble() @@ -149,8 +158,9 @@ fileprivate extension Compiler.ByteCodeGen { guard let i = n.value else { throw Unreachable("Expected a value") } + let cap = builder.captureRegister(forBackreference: i) builder.buildBackreference( - .init(i), isScalarMode: options.semanticLevel == .unicodeScalar) + cap, isScalarMode: options.semanticLevel == .unicodeScalar) case .named(let name): try builder.buildNamedReference( name, isScalarMode: options.semanticLevel == .unicodeScalar) diff --git a/Sources/_StringProcessing/Engine/MEBuilder.swift b/Sources/_StringProcessing/Engine/MEBuilder.swift index cd43dc764..c862cfae4 100644 --- a/Sources/_StringProcessing/Engine/MEBuilder.swift +++ b/Sources/_StringProcessing/Engine/MEBuilder.swift @@ -33,10 +33,18 @@ extension MEProgram { // Registers var nextIntRegister = IntRegister(0) - var nextCaptureRegister = CaptureRegister(0) var nextValueRegister = ValueRegister(0) var nextPositionRegister = PositionRegister(0) + // Set to non-nil when a value register holds the whole-match + // value (i.e. when a regex consists entirely of a custom matcher) + var wholeMatchValue: ValueRegister? = nil + + // Note: Capture 0 (i.e. whole-match) is handled specially + // by the engine, so `n` here refers to the regex AST's `n+1` + // capture + var nextCaptureRegister = CaptureRegister(0) + // Special addresses or instructions var failAddressToken: AddressToken? 
= nil @@ -70,6 +78,24 @@ extension MEProgram.Builder { self.second = b } } + + // Maps the AST's named capture offset to a capture register + func captureRegister(named name: String) throws -> CaptureRegister { + guard let index = captureList.indexOfCapture(named: name) else { + throw RegexCompilationError.uncapturedReference + } + return .init(index - 1) + } + + // Map an AST's backreference number to a capture register + func captureRegister(forBackreference i: Int) -> CaptureRegister { + .init(i - 1) + } + + mutating func denoteCurrentValueIsWholeMatchValue() { + assert(wholeMatchValue == nil) + wholeMatchValue = nextValueRegister + } } extension MEProgram.Builder { @@ -337,10 +363,8 @@ extension MEProgram.Builder { } mutating func buildNamedReference(_ name: String, isScalarMode: Bool) throws { - guard let index = captureList.indexOfCapture(named: name) else { - throw RegexCompilationError.uncapturedReference - } - buildBackreference(.init(index), isScalarMode: isScalarMode) + let cap = try captureRegister(named: name) + buildBackreference(cap, isScalarMode: isScalarMode) } // TODO: Mutating because of fail address fixup, drop when @@ -401,6 +425,7 @@ extension MEProgram.Builder { regInfo.transformFunctions = transformFunctions.count regInfo.matcherFunctions = matcherFunctions.count regInfo.captures = nextCaptureRegister.rawValue + regInfo.wholeMatchValue = wholeMatchValue?.rawValue return MEProgram( instructions: InstructionList(instructions), @@ -514,8 +539,8 @@ extension MEProgram.Builder { assert(preexistingValue == nil) } if let name = name { - let index = captureList.indexOfCapture(named: name) - assert(index == nextCaptureRegister.rawValue) + let cap = try? 
captureRegister(named: name) + assert(cap == nextCaptureRegister) } assert(nextCaptureRegister.rawValue < captureList.captures.count) return nextCaptureRegister diff --git a/Sources/_StringProcessing/Engine/MECapture.swift b/Sources/_StringProcessing/Engine/MECapture.swift index 3dfda6b94..e18365d66 100644 --- a/Sources/_StringProcessing/Engine/MECapture.swift +++ b/Sources/_StringProcessing/Engine/MECapture.swift @@ -84,17 +84,3 @@ extension Processor { } } } - -struct MECaptureList { - var values: Array - var referencedCaptureOffsets: [ReferenceID: Int] - - func latestUntyped(from input: String) -> Array { - values.map { - guard let range = $0.range else { - return nil - } - return input[range] - } - } -} diff --git a/Sources/_StringProcessing/Engine/Registers.swift b/Sources/_StringProcessing/Engine/Registers.swift index 43fb0b8d7..eb600e971 100644 --- a/Sources/_StringProcessing/Engine/Registers.swift +++ b/Sources/_StringProcessing/Engine/Registers.swift @@ -172,6 +172,10 @@ extension MEProgram { var positionStackAddresses = 0 var savePointAddresses = 0 var captures = 0 + + // The value register holding the whole-match value, if there + // is one + var wholeMatchValue: Int? = nil } } diff --git a/Sources/_StringProcessing/Engine/Structuralize.swift b/Sources/_StringProcessing/Engine/Structuralize.swift index df109083c..d91d0f1a9 100644 --- a/Sources/_StringProcessing/Engine/Structuralize.swift +++ b/Sources/_StringProcessing/Engine/Structuralize.swift @@ -1,20 +1,36 @@ internal import _RegexParser -extension CaptureList { - @available(SwiftStdlib 5.7, *) - func createElements( - _ list: MECaptureList +@available(SwiftStdlib 5.7, *) +extension Executor { + static func createExistentialElements( + _ program: MEProgram, + matchRange: Range, + storedCaptures: [Processor._StoredCapture], + wholeMatchValue: Any? 
) -> [AnyRegexOutput.ElementRepresentation] { - assert(list.values.count == captures.count) - + let capList = program.captureList + let capOffsets = program.referencedCaptureOffsets + + // Formal captures include the entire match + assert(storedCaptures.count + 1 == capList.captures.count) + var result = [AnyRegexOutput.ElementRepresentation]() - - for (i, (cap, meStored)) in zip(captures, list.values).enumerated() { + result.reserveCapacity(1 + capList.captures.count) + result.append( + AnyRegexOutput.ElementRepresentation( + optionalDepth: 0, + content: (matchRange, wholeMatchValue), + visibleInTypedOutput: capList.captures[0].visibleInTypedOutput) + ) + + for (i, (cap, meStored)) in zip( + capList.captures.dropFirst(), storedCaptures + ).enumerated() { let element = AnyRegexOutput.ElementRepresentation( optionalDepth: cap.optionalDepth, content: meStored.deconstructed, name: cap.name, - referenceID: list.referencedCaptureOffsets.first { $1 == i }?.key, + referenceID: capOffsets.first { $1 == i }?.key, visibleInTypedOutput: cap.visibleInTypedOutput ) diff --git a/Sources/_StringProcessing/Executor.swift b/Sources/_StringProcessing/Executor.swift index 6befcdbc8..46c03eb15 100644 --- a/Sources/_StringProcessing/Executor.swift +++ b/Sources/_StringProcessing/Executor.swift @@ -190,15 +190,22 @@ extension Executor { guard let endIdx = try cpu.run() else { return nil } - let capList = MECaptureList( - values: cpu.storedCaptures, - referencedCaptureOffsets: program.referencedCaptureOffsets) - let range = startPosition.. 
Date: Fri, 8 Nov 2024 13:11:17 -0700 Subject: [PATCH 15/22] Make generic wrappers inlinable for specialization (#784) --- Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift | 1 + Sources/_StringProcessing/Regex/Match.swift | 2 ++ 2 files changed, 3 insertions(+) diff --git a/Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift b/Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift index de0cabb0a..b8fdd514f 100644 --- a/Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift +++ b/Sources/_StringProcessing/Algorithms/Matching/FirstMatch.swift @@ -19,6 +19,7 @@ extension BidirectionalCollection where SubSequence == Substring { /// - Returns: The first match of `regex` in the collection, or `nil` if /// there isn't a match. @available(SwiftStdlib 5.7, *) + @inlinable public func firstMatch( of r: some RegexComponent ) -> Regex.Match? { diff --git a/Sources/_StringProcessing/Regex/Match.swift b/Sources/_StringProcessing/Regex/Match.swift index 3ade1f2f2..c0a3aca3c 100644 --- a/Sources/_StringProcessing/Regex/Match.swift +++ b/Sources/_StringProcessing/Regex/Match.swift @@ -303,6 +303,7 @@ extension BidirectionalCollection where SubSequence == Substring { /// - Parameter regex: The regular expression to match. /// - Returns: The match, if one is found. If there is no match, or a /// transformation in `regex` throws an error, this method returns `nil`. + @inlinable public func wholeMatch( of regex: R ) -> Regex.Match? { @@ -314,6 +315,7 @@ extension BidirectionalCollection where SubSequence == Substring { /// - Parameter regex: The regular expression to match. /// - Returns: The match, if one is found. If there is no match, or a /// transformation in `regex` throws an error, this method returns `nil`. + @inlinable public func prefixMatch( of regex: R ) -> Regex.Match? 
{ From d10a14a5bdd3411fc31caf5009a7f90cad270044 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Sat, 9 Nov 2024 09:15:35 -0700 Subject: [PATCH 16/22] Benchmark: add a path-filtering benchmark (#786) --- .../BenchmarkRegistration.swift | 1 + Sources/RegexBenchmark/Inputs/FSPaths.swift | 51 +++++++++++++++++++ .../RegexBenchmark/Suite/FSPathsRegex.swift | 16 ++++++ 3 files changed, 68 insertions(+) create mode 100644 Sources/RegexBenchmark/Inputs/FSPaths.swift create mode 100644 Sources/RegexBenchmark/Suite/FSPathsRegex.swift diff --git a/Sources/RegexBenchmark/BenchmarkRegistration.swift b/Sources/RegexBenchmark/BenchmarkRegistration.swift index e12502e99..a812b84be 100644 --- a/Sources/RegexBenchmark/BenchmarkRegistration.swift +++ b/Sources/RegexBenchmark/BenchmarkRegistration.swift @@ -20,6 +20,7 @@ extension BenchmarkRunner { self.addIpAddress() self.addURLWithWordBoundaries() + self.addFSPathsRegex() // -- end of registrations -- } } diff --git a/Sources/RegexBenchmark/Inputs/FSPaths.swift b/Sources/RegexBenchmark/Inputs/FSPaths.swift new file mode 100644 index 000000000..a2537c598 --- /dev/null +++ b/Sources/RegexBenchmark/Inputs/FSPaths.swift @@ -0,0 +1,51 @@ +// Successful match FSPaths +private let fsPathSuccess = #""" +./First/Second/Third/some/really/long/content.extension/more/stuff/OptionLeft +./First/Second/Third/some/really/long/content.extension/more/stuff/OptionRight +./First/Second/PrefixThird/some/really/long/content.extension/more/stuff/OptionLeft +./First/Second/PrefixThird/some/really/long/content.extension/more/stuff/OptionRight +"""# + +// Unsuccessful match FSPaths.
+// +// We will have far more failures than successful matches by interspersing +// this whole list between each success +private let fsPathFailure = #""" +a/b/c +/smol/path +/a/really/long/path/that/is/certainly/stored/out/of/line +./First/Second/Third/some/really/long/content.extension/more/stuff/NothingToSeeHere +./First/Second/PrefixThird/some/really/long/content.extension/more/stuff/NothingToSeeHere +./First/Second/Third/some/really/long/content.extension/more/stuff/OptionNeither +./First/Second/PrefixThird/some/really/long/content.extension/more/stuff/OptionNeither +/First/Second/Third/some/really/long/content.extension/more/stuff/OptionLeft +/First/Second/Third/some/really/long/content.extension/more/stuff/OptionRight +/First/Second/PrefixThird/some/really/long/content.extension/more/stuff/OptionLeft +/First/Second/PrefixThird/some/really/long/content.extension/more/stuff/OptionRight +./First/Second/Third/some/really/long/content/more/stuff/OptionLeft +./First/Second/Third/some/really/long/content/more/stuff/OptionRight +./First/Second/PrefixThird/some/really/long/content/more/stuff/OptionLeft +./First/Second/PrefixThird/some/really/long/content/more/stuff/OptionRight +"""# + +extension Inputs { + static let fsPathsList: [String] = { + var result: [String] = [] + let failures: [String] = fsPathFailure.split(whereSeparator: { $0.isNewline }).map { String($0) } + result.append(contentsOf: failures) + + for success in fsPathSuccess.split(whereSeparator: { $0.isNewline }) { + result.append(String(success)) + result.append(contentsOf: failures) + } + + // Scale result up a bit + result.append(contentsOf: result) + result.append(contentsOf: result) + result.append(contentsOf: result) + result.append(contentsOf: result) + + return result + + }() +} diff --git a/Sources/RegexBenchmark/Suite/FSPathsRegex.swift b/Sources/RegexBenchmark/Suite/FSPathsRegex.swift new file mode 100644 index 000000000..a0bf8f355 --- /dev/null +++ 
b/Sources/RegexBenchmark/Suite/FSPathsRegex.swift @@ -0,0 +1,16 @@ +import _StringProcessing + + +extension BenchmarkRunner { + mutating func addFSPathsRegex() { + let fsPathsRegex = + #"^\./First/Second/(Prefix)?Third/.*\.extension/.*(OptionLeft|OptionRight)$"# + let paths = CrossInputListBenchmark( + baseName: "FSPathsRegex", + regex: fsPathsRegex, + inputs: Inputs.fsPathsList + ) + paths.register(&self) + } +} + From 3a3626b07e17fb0d043fbe5c924033b961f5b748 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Fri, 8 Nov 2024 13:09:43 -0700 Subject: [PATCH 17/22] Update test dependencies --- Package.swift | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Package.swift b/Package.swift index 1f5e10f0a..e9198908d 100644 --- a/Package.swift +++ b/Package.swift @@ -96,7 +96,7 @@ let package = Package( swiftSettings: [availabilityDefinition]), .testTarget( name: "RegexTests", - dependencies: ["_StringProcessing", "TestSupport"], + dependencies: ["_StringProcessing", "RegexBuilder", "TestSupport"], swiftSettings: [ availabilityDefinition ]), From 9925df6afe9b93d041fdfce123e7b01b8ff15105 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Fri, 8 Nov 2024 09:42:31 -0700 Subject: [PATCH 18/22] Add UTF-8 byte matching optimization --- Sources/RegexBenchmark/Suite/URLRegex.swift | 1 + Sources/_StringProcessing/ByteCodeGen.swift | 24 ++++++++++ .../Engine/InstPayload.swift | 16 +++++-- .../Engine/Instruction.swift | 11 +++++ .../_StringProcessing/Engine/MEBuilder.swift | 11 +++-- .../_StringProcessing/Engine/MEProgram.swift | 2 +- .../_StringProcessing/Engine/Processor.swift | 48 +++++++++++++++++++ .../_StringProcessing/Engine/Registers.swift | 20 ++++---- .../_StringProcessing/MatchingOptions.swift | 10 +++- .../_StringProcessing/Utility/TypedInt.swift | 5 +- Tests/RegexTests/CompileTests.swift | 25 +++++++++- 11 files changed, 150 insertions(+), 23 deletions(-) diff --git a/Sources/RegexBenchmark/Suite/URLRegex.swift 
b/Sources/RegexBenchmark/Suite/URLRegex.swift index e5f00f4e7..9e28f5396 100644 --- a/Sources/RegexBenchmark/Suite/URLRegex.swift +++ b/Sources/RegexBenchmark/Suite/URLRegex.swift @@ -12,3 +12,4 @@ extension BenchmarkRunner { url.register(&self) } } + diff --git a/Sources/_StringProcessing/ByteCodeGen.swift b/Sources/_StringProcessing/ByteCodeGen.swift index 3820f5045..ae8a1c8b7 100644 --- a/Sources/_StringProcessing/ByteCodeGen.swift +++ b/Sources/_StringProcessing/ByteCodeGen.swift @@ -117,6 +117,30 @@ fileprivate extension Compiler.ByteCodeGen { } mutating func emitQuotedLiteral(_ s: String) { + // ASCII is normalization-invariant, so is the safe subset for + // us to optimize + if optimizationsEnabled, + !options.usesCanonicalEquivalence || s.utf8.allSatisfy(\._isASCII), + !s.isEmpty + { + + // TODO: Make an optimizations configuration struct, where + // we can enable/disable specific optimizations and change + // thresholds + let longThreshold = 5 + + // Longer content will be matched against UTF-8 in contiguous + // memory + // + // TODO: case-insensitive variant (just add/subtract from + // ASCII value) + if s.utf8.count >= longThreshold, !options.isCaseInsensitive { + let boundaryCheck = options.semanticLevel == .graphemeCluster + builder.buildMatchUTF8(Array(s.utf8), boundaryCheck: boundaryCheck) + return + } + } + guard options.semanticLevel == .graphemeCluster else { for char in s { for scalar in char.unicodeScalars { diff --git a/Sources/_StringProcessing/Engine/InstPayload.swift b/Sources/_StringProcessing/Engine/InstPayload.swift index 41293c6f2..550fa4915 100644 --- a/Sources/_StringProcessing/Engine/InstPayload.swift +++ b/Sources/_StringProcessing/Engine/InstPayload.swift @@ -43,7 +43,7 @@ extension Instruction.Payload { // and variables case string(StringRegister) - case sequence(SequenceRegister) + case utf8(UTF8Register) case position(PositionRegister) case optionalString(StringRegister?) 
case int(IntRegister) @@ -168,10 +168,18 @@ extension Instruction.Payload { return (scalar, caseInsensitive: caseInsensitive, boundaryCheck: boundaryCheck) } - init(sequence: SequenceRegister) { - self.init(sequence) + init(utf8: UTF8Register, boundaryCheck: Bool) { + self.init(boundaryCheck ? 1 : 0, utf8) } - var sequence: SequenceRegister { + var matchUTF8Payload: (UTF8Register, boundaryCheck: Bool) { + let pair: (UInt64, UTF8Register) = interpretPair() + return (pair.1, pair.0 == 1) + } + + init(utf8: UTF8Register) { + self.init(utf8) + } + var utf8: UTF8Register { interpret() } diff --git a/Sources/_StringProcessing/Engine/Instruction.swift b/Sources/_StringProcessing/Engine/Instruction.swift index 011174660..80bfd9b05 100644 --- a/Sources/_StringProcessing/Engine/Instruction.swift +++ b/Sources/_StringProcessing/Engine/Instruction.swift @@ -112,6 +112,17 @@ extension Instruction { /// Operands: Scalar value to match against and booleans case matchScalar + /// Match directly (binary semantics) against a series of UTF-8 bytes + /// + /// NOTE: Compiler should ensure to only emit this instruction when normalization + /// is not required. E.g., scalar-semantic mode or when the matched portion is entirely ASCII + /// (which is invariant under NFC). Similarly, this is case-sensitive. + /// + /// TODO: should we add case-insensitive? 
+ /// + /// matchUTF8(_: UTF8Register, boundaryCheck: Bool) + case matchUTF8 + /// Match a character or a scalar against a set of valid ascii values stored in a bitset /// /// matchBitset(_: AsciiBitsetRegister, isScalar: Bool) diff --git a/Sources/_StringProcessing/Engine/MEBuilder.swift b/Sources/_StringProcessing/Engine/MEBuilder.swift index c862cfae4..2225592e3 100644 --- a/Sources/_StringProcessing/Engine/MEBuilder.swift +++ b/Sources/_StringProcessing/Engine/MEBuilder.swift @@ -20,7 +20,7 @@ extension MEProgram { var enableMetrics = false var elements = TypedSetVector() - var sequences = TypedSetVector<[Input.Element], _SequenceRegister>() + var utf8Contents = TypedSetVector<[UInt8], _UTF8Register>() var asciiBitsets: [DSLTree.CustomCharacterClass.AsciiBitset] = [] var consumeFunctions: [ConsumeFunction] = [] @@ -198,6 +198,11 @@ extension MEProgram.Builder { .match, .init(element: elements.store(e), isCaseInsensitive: isCaseInsensitive))) } + mutating func buildMatchUTF8(_ utf8: Array, boundaryCheck: Bool) { + instructions.append(.init(.matchUTF8, .init( + utf8: utf8Contents.store(utf8), boundaryCheck: boundaryCheck))) + } + mutating func buildMatchScalar(_ s: Unicode.Scalar, boundaryCheck: Bool) { instructions.append(.init(.matchScalar, .init(scalar: s, caseInsensitive: false, boundaryCheck: boundaryCheck))) } @@ -416,7 +421,7 @@ extension MEProgram.Builder { var regInfo = MEProgram.RegisterInfo() regInfo.elements = elements.count - regInfo.sequences = sequences.count + regInfo.utf8Contents = utf8Contents.count regInfo.ints = nextIntRegister.rawValue regInfo.values = nextValueRegister.rawValue regInfo.positions = nextPositionRegister.rawValue @@ -430,7 +435,7 @@ extension MEProgram.Builder { return MEProgram( instructions: InstructionList(instructions), staticElements: elements.stored, - staticSequences: sequences.stored, + staticUTF8Contents: utf8Contents.stored, staticBitsets: asciiBitsets, staticConsumeFunctions: consumeFunctions, 
staticTransformFunctions: transformFunctions, diff --git a/Sources/_StringProcessing/Engine/MEProgram.swift b/Sources/_StringProcessing/Engine/MEProgram.swift index e144cf932..a5b24315c 100644 --- a/Sources/_StringProcessing/Engine/MEProgram.swift +++ b/Sources/_StringProcessing/Engine/MEProgram.swift @@ -23,7 +23,7 @@ struct MEProgram { var instructions: InstructionList var staticElements: [Input.Element] - var staticSequences: [[Input.Element]] + var staticUTF8Contents: [[UInt8]] var staticBitsets: [DSLTree.CustomCharacterClass.AsciiBitset] var staticConsumeFunctions: [ConsumeFunction] var staticTransformFunctions: [TransformFunction] diff --git a/Sources/_StringProcessing/Engine/Processor.swift b/Sources/_StringProcessing/Engine/Processor.swift index 520f6cf66..89ca29597 100644 --- a/Sources/_StringProcessing/Engine/Processor.swift +++ b/Sources/_StringProcessing/Engine/Processor.swift @@ -320,6 +320,24 @@ extension Processor { return true } + // TODO: bytes should be a Span or RawSpan + mutating func matchUTF8( + _ bytes: Array, + boundaryCheck: Bool + ) -> Bool { + guard let next = input.matchUTF8( + bytes, + at: currentPosition, + limitedBy: end, + boundaryCheck: boundaryCheck + ) else { + signalFailure() + return false + } + currentPosition = next + return true + } + // If we have a bitset we know that the CharacterClass only matches against // ascii characters, so check if the current input element is ascii then // check if it is set in the bitset @@ -542,6 +560,15 @@ extension Processor { controller.step() } + case .matchUTF8: + let (utf8Reg, boundaryCheck) = payload.matchUTF8Payload + let utf8Content = registers[utf8Reg] + if matchUTF8( + utf8Content, boundaryCheck: boundaryCheck + ) { + controller.step() + } + case .matchBitset: let (isScalar, reg) = payload.bitsetPayload let bitset = registers[reg] @@ -752,6 +779,27 @@ extension String { return idx } + func matchUTF8( + _ bytes: Array, + at pos: Index, + limitedBy end: Index, + boundaryCheck: Bool + ) 
-> Index? { + var cur = pos + for b in bytes { + guard cur < end, self.utf8[cur] == b else { return nil } + self.utf8.formIndex(after: &cur) + } + + guard cur <= end else { return nil } + + if boundaryCheck && !isOnGraphemeClusterBoundary(cur) { + return nil + } + + return cur + } + func matchASCIIBitset( _ bitset: DSLTree.CustomCharacterClass.AsciiBitset, at pos: Index, diff --git a/Sources/_StringProcessing/Engine/Registers.swift b/Sources/_StringProcessing/Engine/Registers.swift index eb600e971..15463d676 100644 --- a/Sources/_StringProcessing/Engine/Registers.swift +++ b/Sources/_StringProcessing/Engine/Registers.swift @@ -24,11 +24,9 @@ extension Processor { // Verbatim elements to compare against var elements: [Element] - // Verbatim sequences to compare against - // - // TODO: Degenericize Processor and store Strings - var sequences: [[Element]] = [] - + // Verbatim bytes to compare against + var utf8Contents: [[UInt8]] + var bitsets: [DSLTree.CustomCharacterClass.AsciiBitset] var consumeFunctions: [MEProgram.ConsumeFunction] @@ -55,9 +53,6 @@ extension Processor { extension Processor.Registers { typealias Input = String - subscript(_ i: SequenceRegister) -> [Input.Element] { - sequences[i.rawValue] - } subscript(_ i: IntRegister) -> Int { get { ints[i.rawValue] } set { @@ -82,6 +77,9 @@ extension Processor.Registers { subscript(_ i: ElementRegister) -> Input.Element { elements[i.rawValue] } + subscript(_ i: UTF8Register) -> [UInt8] { + utf8Contents[i.rawValue] + } subscript( _ i: AsciiBitsetRegister ) -> DSLTree.CustomCharacterClass.AsciiBitset { @@ -110,8 +108,8 @@ extension Processor.Registers { self.elements = program.staticElements assert(elements.count == info.elements) - self.sequences = program.staticSequences - assert(sequences.count == info.sequences) + self.utf8Contents = program.staticUTF8Contents + assert(utf8Contents.count == info.utf8Contents) self.bitsets = program.staticBitsets assert(bitsets.count == info.bitsets) @@ -156,7 +154,7 @@ 
extension MutableCollection { extension MEProgram { struct RegisterInfo { var elements = 0 - var sequences = 0 + var utf8Contents = 0 var bools = 0 var strings = 0 var bitsets = 0 diff --git a/Sources/_StringProcessing/MatchingOptions.swift b/Sources/_StringProcessing/MatchingOptions.swift index a679339e9..793c6c82d 100644 --- a/Sources/_StringProcessing/MatchingOptions.swift +++ b/Sources/_StringProcessing/MatchingOptions.swift @@ -125,7 +125,15 @@ extension MatchingOptions { ? .graphemeCluster : .unicodeScalar } - + + /// Whether matching needs to honor canonical equivalence. + /// + /// Currently, this is synonymous with grapheme-cluster semantics, but could + /// become its own option in the future + var usesCanonicalEquivalence: Bool { + semanticLevel == .graphemeCluster + } + var usesNSRECompatibleDot: Bool { stack.last!.contains(.nsreCompatibleDot) } diff --git a/Sources/_StringProcessing/Utility/TypedInt.swift b/Sources/_StringProcessing/Utility/TypedInt.swift index b85282eab..a6bae0684 100644 --- a/Sources/_StringProcessing/Utility/TypedInt.swift +++ b/Sources/_StringProcessing/Utility/TypedInt.swift @@ -122,8 +122,9 @@ enum _SavePointAddress {} typealias ElementRegister = TypedInt<_ElementRegister> enum _ElementRegister {} -typealias SequenceRegister = TypedInt<_SequenceRegister> -enum _SequenceRegister {} +/// The register number for a sequence of UTF-8 bytes +typealias UTF8Register = TypedInt<_UTF8Register> +enum _UTF8Register {} /// The register number for a stored boolean value /// diff --git a/Tests/RegexTests/CompileTests.swift b/Tests/RegexTests/CompileTests.swift index 05212388d..7ea38490a 100644 --- a/Tests/RegexTests/CompileTests.swift +++ b/Tests/RegexTests/CompileTests.swift @@ -41,6 +41,7 @@ enum DecodedInstr { case matchAnyNonNewline case matchBitset case matchBuiltin + case matchUTF8 case consumeBy case assertBy case matchBy @@ -141,6 +142,8 @@ extension DecodedInstr { return .captureValue case .matchBuiltin: return .matchBuiltin + case 
.matchUTF8: + return .matchUTF8 } } } @@ -443,10 +446,30 @@ extension RegexTests { contains: [.matchScalarUnchecked], doesNotContain: [.match, .consumeBy, .matchScalar]) expectProgram( - for: "aaa\u{301}", + for: "a\u{301}", semanticLevel: .unicodeScalar, contains: [.matchScalarUnchecked], doesNotContain: [.match, .consumeBy, .matchScalar]) + expectProgram( + for: "abcdefg", + semanticLevel: .unicodeScalar, + contains: [.matchUTF8], + doesNotContain: [.match, .consumeBy, .matchScalar]) + expectProgram( + for: "abcdefg", + semanticLevel: .graphemeCluster, + contains: [.matchUTF8], + doesNotContain: [.match, .consumeBy, .matchScalar]) + expectProgram( + for: "aaa\u{301}", + semanticLevel: .unicodeScalar, + contains: [.matchUTF8], + doesNotContain: [.match, .consumeBy, .matchScalar]) + expectProgram( + for: "aaa\u{301}", + semanticLevel: .graphemeCluster, + contains: [.match], + doesNotContain: [.matchUTF8, .consumeBy]) } func testCaseInsensitivityCompilation() { From 7842c86674fcd746bdfee5007de2d30d7f1cc044 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Sat, 9 Nov 2024 09:50:01 -0700 Subject: [PATCH 19/22] Benchmark: Add not-found and all-found path variants --- Sources/RegexBenchmark/Inputs/FSPaths.swift | 43 +++++++++++++------ .../RegexBenchmark/Suite/FSPathsRegex.swift | 19 ++++++-- 2 files changed, 47 insertions(+), 15 deletions(-) diff --git a/Sources/RegexBenchmark/Inputs/FSPaths.swift b/Sources/RegexBenchmark/Inputs/FSPaths.swift index a2537c598..78b6e4875 100644 --- a/Sources/RegexBenchmark/Inputs/FSPaths.swift +++ b/Sources/RegexBenchmark/Inputs/FSPaths.swift @@ -1,5 +1,5 @@ // Successful match FSPaths -private let fsPathSuccess = #""" +private let pathSuccess = #""" ./First/Second/Third/some/really/long/content.extension/more/stuff/OptionLeft ./First/Second/Third/some/really/long/content.extension/more/stuff/OptionRight ./First/Second/PrefixThird/some/really/long/content.extension/more/stuff/OptionLeft @@ -10,7 +10,7 @@ private let fsPathSuccess = 
#""" // // We will have far more failures than successful matches by interspersing // this whole list between each success -private let fsPathFailure = #""" +private let pathFailure = #""" a/b/c /smol/path /a/really/long/path/that/is/certainly/stored/out/of/line @@ -28,24 +28,43 @@ a/b/c ./First/Second/PrefixThird/some/really/long/content/more/stuff/OptionRight """# +private func listify(_ s: String) -> [String] { + s.split(whereSeparator: { $0.isNewline }).map { String($0) } +} + +private let pathSuccessList: [String] = { listify(pathSuccess) }() +private let pathFailureList: [String] = { listify(pathFailure) }() + +private func scale(_ input: [String]) -> [String] { + let threshold = 1_000 + var result = input + while result.count < threshold { + result.append(contentsOf: input) + } + return result +} + extension Inputs { static let fsPathsList: [String] = { - var result: [String] = [] - let failures: [String] = fsPathFailure.split(whereSeparator: { $0.isNewline }).map { String($0) } - result.append(contentsOf: failures) + var result = pathFailureList + result.append(contentsOf: pathFailureList) - for success in fsPathSuccess.split(whereSeparator: { $0.isNewline }) { + for success in pathSuccessList { result.append(String(success)) - result.append(contentsOf: failures) + result.append(contentsOf: pathFailureList) + result.append(contentsOf: pathFailureList) } // Scale result up a bit - result.append(contentsOf: result) - result.append(contentsOf: result) - result.append(contentsOf: result) - result.append(contentsOf: result) + return scale(result) + + }() - return result + static let fsPathsNotFoundList: [String] = { + scale(pathFailureList) + }() + static let fsPathsFoundList: [String] = { + scale(pathSuccessList) }() } diff --git a/Sources/RegexBenchmark/Suite/FSPathsRegex.swift b/Sources/RegexBenchmark/Suite/FSPathsRegex.swift index a0bf8f355..f029e9348 100644 --- a/Sources/RegexBenchmark/Suite/FSPathsRegex.swift +++ 
b/Sources/RegexBenchmark/Suite/FSPathsRegex.swift @@ -5,12 +5,25 @@ extension BenchmarkRunner { mutating func addFSPathsRegex() { let fsPathsRegex = #"^\./First/Second/(Prefix)?Third/.*\.extension/.*(OptionLeft|OptionRight)$"# - let paths = CrossInputListBenchmark( + + CrossInputListBenchmark( baseName: "FSPathsRegex", regex: fsPathsRegex, inputs: Inputs.fsPathsList - ) - paths.register(&self) + ).register(&self) + + CrossInputListBenchmark( + baseName: "FSPathsRegexNotFound", + regex: fsPathsRegex, + inputs: Inputs.fsPathsNotFoundList + ).register(&self) + + CrossInputListBenchmark( + baseName: "FSPathsRegexFound", + regex: fsPathsRegex, + inputs: Inputs.fsPathsFoundList + ).register(&self) + } } From b0653e2b4ba7f0a21a64533acd82f688aa974248 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Sat, 9 Nov 2024 11:14:07 -0700 Subject: [PATCH 20/22] Speed up processor initialization Moves register initialization to the program where it can be done once. Speeds up use cases where the same regex is applied to many small inputs. 
--- .../_StringProcessing/Engine/MEBuilder.swift | 43 +++++---- .../_StringProcessing/Engine/MEProgram.swift | 24 +++-- .../_StringProcessing/Engine/Processor.swift | 14 +-- .../_StringProcessing/Engine/Registers.swift | 88 ++++++------------- Sources/_StringProcessing/Executor.swift | 4 +- 5 files changed, 70 insertions(+), 103 deletions(-) diff --git a/Sources/_StringProcessing/Engine/MEBuilder.swift b/Sources/_StringProcessing/Engine/MEBuilder.swift index 2225592e3..1a26421eb 100644 --- a/Sources/_StringProcessing/Engine/MEBuilder.swift +++ b/Sources/_StringProcessing/Engine/MEBuilder.swift @@ -419,34 +419,33 @@ extension MEProgram.Builder { inst.opcode, payload) } - var regInfo = MEProgram.RegisterInfo() - regInfo.elements = elements.count - regInfo.utf8Contents = utf8Contents.count - regInfo.ints = nextIntRegister.rawValue - regInfo.values = nextValueRegister.rawValue - regInfo.positions = nextPositionRegister.rawValue - regInfo.bitsets = asciiBitsets.count - regInfo.consumeFunctions = consumeFunctions.count - regInfo.transformFunctions = transformFunctions.count - regInfo.matcherFunctions = matcherFunctions.count - regInfo.captures = nextCaptureRegister.rawValue - regInfo.wholeMatchValue = wholeMatchValue?.rawValue - - return MEProgram( + let regs = Processor.Registers( + elements: elements.stored, + utf8Contents: utf8Contents.stored, + bitsets: asciiBitsets, + consumeFunctions: consumeFunctions, + transformFunctions: transformFunctions, + matcherFunctions: matcherFunctions, + numInts: nextIntRegister.rawValue, + numValues: nextValueRegister.rawValue, + numPositions: nextPositionRegister.rawValue + ) + + let storedCaps = Array( + repeating: Processor._StoredCapture(), count: nextCaptureRegister.rawValue) + + let meProgram = MEProgram( instructions: InstructionList(instructions), - staticElements: elements.stored, - staticUTF8Contents: utf8Contents.stored, - staticBitsets: asciiBitsets, - staticConsumeFunctions: consumeFunctions, - staticTransformFunctions: 
transformFunctions, - staticMatcherFunctions: matcherFunctions, - registerInfo: regInfo, + wholeMatchValueRegister: wholeMatchValue, enableTracing: enableTracing, enableMetrics: enableMetrics, captureList: captureList, referencedCaptureOffsets: referencedCaptureOffsets, initialOptions: initialOptions, - canOnlyMatchAtStart: canOnlyMatchAtStart) + canOnlyMatchAtStart: canOnlyMatchAtStart, + registers: regs, + storedCaptures: storedCaps) + return meProgram } mutating func reset() { self = Self() } diff --git a/Sources/_StringProcessing/Engine/MEProgram.swift b/Sources/_StringProcessing/Engine/MEProgram.swift index a5b24315c..a9df6bedd 100644 --- a/Sources/_StringProcessing/Engine/MEProgram.swift +++ b/Sources/_StringProcessing/Engine/MEProgram.swift @@ -21,15 +21,7 @@ struct MEProgram { (Input, Input.Index, Range) throws -> (Input.Index, Any)? var instructions: InstructionList - - var staticElements: [Input.Element] - var staticUTF8Contents: [[UInt8]] - var staticBitsets: [DSLTree.CustomCharacterClass.AsciiBitset] - var staticConsumeFunctions: [ConsumeFunction] - var staticTransformFunctions: [TransformFunction] - var staticMatcherFunctions: [MatcherFunction] - - var registerInfo: RegisterInfo + var wholeMatchValueRegister: ValueRegister? var enableTracing: Bool var enableMetrics: Bool @@ -39,18 +31,22 @@ struct MEProgram { var initialOptions: MatchingOptions var canOnlyMatchAtStart: Bool + + // We store the initial register state in the program, so that + // processors can be spun up quicker (useful for running same regex + // over many, many smaller inputs). 
+ var registers: Processor.Registers + var storedCaptures: [Processor._StoredCapture] + } extension MEProgram: CustomStringConvertible { var description: String { + // TODO: Re-instate better pretty-printing functionality + var result = """ - Elements: \(staticElements) """ - if !staticConsumeFunctions.isEmpty { - result += "Consume functions: \(staticConsumeFunctions)" - } - // TODO: Extract into formatting code for idx in instructions.indices { diff --git a/Sources/_StringProcessing/Engine/Processor.swift b/Sources/_StringProcessing/Engine/Processor.swift index 89ca29597..0be1ba41a 100644 --- a/Sources/_StringProcessing/Engine/Processor.swift +++ b/Sources/_StringProcessing/Engine/Processor.swift @@ -49,6 +49,7 @@ struct Processor { let subjectBounds: Range let matchMode: MatchMode + let instructions: InstructionList // MARK: Update-only state @@ -100,6 +101,9 @@ extension Processor { } extension Processor { + // TODO: This has lots of retain/release traffic. We really just + // want to borrow the program and most of its static stuff. 
The only + // thing we need an actual copy of is the modifiable-resettable state init( program: MEProgram, input: Input, @@ -120,10 +124,10 @@ extension Processor { self.currentPosition = searchBounds.lowerBound - // Initialize registers with end of search bounds - self.registers = Registers(program, searchBounds.upperBound) - self.storedCaptures = Array( - repeating: .init(), count: program.registerInfo.captures) + // Initialize registers from stored starting state + self.registers = program.registers + + self.storedCaptures = program.storedCaptures _checkInvariants() } @@ -137,7 +141,7 @@ extension Processor { self.controller = Controller(pc: 0) - self.registers.reset(sentinel: searchBounds.upperBound) + self.registers.reset() if !self.savePoints.isEmpty { self.savePoints.removeAll(keepingCapacity: true) diff --git a/Sources/_StringProcessing/Engine/Registers.swift b/Sources/_StringProcessing/Engine/Registers.swift index 15463d676..b586baecf 100644 --- a/Sources/_StringProcessing/Engine/Registers.swift +++ b/Sources/_StringProcessing/Engine/Registers.swift @@ -47,6 +47,31 @@ extension Processor { var values: [Any] var positions: [Input.Index] + + init( + elements: [Element], + utf8Contents: [[UInt8]], + bitsets: [DSLTree.CustomCharacterClass.AsciiBitset], + consumeFunctions: [MEProgram.ConsumeFunction], + transformFunctions: [MEProgram.TransformFunction], + matcherFunctions: [MEProgram.MatcherFunction], + isDirty: Bool = false, + numInts: Int, + numValues: Int, + numPositions: Int + ) { + self.elements = elements + self.utf8Contents = utf8Contents + self.bitsets = bitsets + self.consumeFunctions = consumeFunctions + self.transformFunctions = transformFunctions + self.matcherFunctions = matcherFunctions + self.isDirty = isDirty + self.ints = Array(repeating: 0, count: numInts) + self.values = Array(repeating: SentinelValue(), count: numValues) + self.positions = Array( + repeating: Self.sentinelIndex, count: numPositions) + } } } @@ -97,42 +122,11 @@ extension 
Processor.Registers { } extension Processor.Registers { - static let sentinelIndex = "".startIndex - - init( - _ program: MEProgram, - _ sentinel: String.Index - ) { - let info = program.registerInfo - - self.elements = program.staticElements - assert(elements.count == info.elements) - - self.utf8Contents = program.staticUTF8Contents - assert(utf8Contents.count == info.utf8Contents) - - self.bitsets = program.staticBitsets - assert(bitsets.count == info.bitsets) - - self.consumeFunctions = program.staticConsumeFunctions - assert(consumeFunctions.count == info.consumeFunctions) - - self.transformFunctions = program.staticTransformFunctions - assert(transformFunctions.count == info.transformFunctions) - - self.matcherFunctions = program.staticMatcherFunctions - assert(matcherFunctions.count == info.matcherFunctions) - - self.ints = Array(repeating: 0, count: info.ints) - - self.values = Array( - repeating: SentinelValue(), count: info.values) - self.positions = Array( - repeating: Processor.Registers.sentinelIndex, - count: info.positions) + static var sentinelIndex: String.Index { + "".startIndex } - mutating func reset(sentinel: Input.Index) { + mutating func reset() { guard isDirty else { return } @@ -151,32 +145,6 @@ extension MutableCollection { } } -extension MEProgram { - struct RegisterInfo { - var elements = 0 - var utf8Contents = 0 - var bools = 0 - var strings = 0 - var bitsets = 0 - var consumeFunctions = 0 - var transformFunctions = 0 - var matcherFunctions = 0 - var ints = 0 - var floats = 0 - var positions = 0 - var values = 0 - var instructionAddresses = 0 - var classStackAddresses = 0 - var positionStackAddresses = 0 - var savePointAddresses = 0 - var captures = 0 - - // The value register holding the whole-match value, if there - // is one - var wholeMatchValue: Int? 
= nil - } -} - extension Processor.Registers: CustomStringConvertible { var description: String { func formatRegisters( diff --git a/Sources/_StringProcessing/Executor.swift b/Sources/_StringProcessing/Executor.swift index 46c03eb15..38c317f0e 100644 --- a/Sources/_StringProcessing/Executor.swift +++ b/Sources/_StringProcessing/Executor.swift @@ -193,8 +193,8 @@ extension Executor { let range = startPosition.. Date: Sat, 9 Nov 2024 13:32:16 -0700 Subject: [PATCH 21/22] Remove the unused callstack --- .../_StringProcessing/Engine/Backtracking.swift | 13 +------------ Sources/_StringProcessing/Engine/Processor.swift | 15 +++------------ Sources/_StringProcessing/Engine/Tracing.swift | 2 +- Sources/_StringProcessing/Utility/Protocols.swift | 3 --- Sources/_StringProcessing/Utility/Traced.swift | 10 ---------- Sources/_StringProcessing/Utility/TypedInt.swift | 5 ----- 6 files changed, 5 insertions(+), 43 deletions(-) diff --git a/Sources/_StringProcessing/Engine/Backtracking.swift b/Sources/_StringProcessing/Engine/Backtracking.swift index 11e2db0e4..9078c2c19 100644 --- a/Sources/_StringProcessing/Engine/Backtracking.swift +++ b/Sources/_StringProcessing/Engine/Backtracking.swift @@ -21,13 +21,6 @@ extension Processor { // points. We should try to separate out the concerns better. var isScalarSemantics: Bool - // The end of the call stack, so we can slice it off - // when failing inside a call. - // - // NOTE: Alternatively, also place return addresses on the - // save point stack - var stackEnd: CallStackAddress - // FIXME: Save minimal info (e.g. 
stack position and // perhaps current start) var captureEnds: [_StoredCapture] @@ -41,12 +34,11 @@ extension Processor { var destructure: ( pc: InstructionAddress, pos: Position?, - stackEnd: CallStackAddress, captureEnds: [_StoredCapture], intRegisters: [Int], PositionRegister: [Input.Index] ) { - return (pc, pos, stackEnd, captureEnds, intRegisters, posRegisters) + return (pc, pos, captureEnds, intRegisters, posRegisters) } // Whether this save point is quantified, meaning it has a range of @@ -85,7 +77,6 @@ extension Processor { pos: currentPosition, quantifiedRange: nil, isScalarSemantics: false, - stackEnd: .init(callStack.count), captureEnds: storedCaptures, intRegisters: registers.ints, posRegisters: registers.positions) @@ -99,7 +90,6 @@ extension Processor { pos: nil, quantifiedRange: nil, isScalarSemantics: false, - stackEnd: .init(callStack.count), captureEnds: storedCaptures, intRegisters: registers.ints, posRegisters: registers.positions) @@ -114,7 +104,6 @@ extension Processor { pos: nil, quantifiedRange: range, isScalarSemantics: isScalarSemantics, - stackEnd: .init(callStack.count), captureEnds: storedCaptures, intRegisters: registers.ints, posRegisters: registers.positions) diff --git a/Sources/_StringProcessing/Engine/Processor.swift b/Sources/_StringProcessing/Engine/Processor.swift index 0be1ba41a..0bf19b829 100644 --- a/Sources/_StringProcessing/Engine/Processor.swift +++ b/Sources/_StringProcessing/Engine/Processor.swift @@ -82,8 +82,6 @@ struct Processor { var savePoints: [SavePoint] = [] - var callStack: [InstructionAddress] = [] - var storedCaptures: Array<_StoredCapture> var state: State = .inProgress @@ -146,9 +144,6 @@ extension Processor { if !self.savePoints.isEmpty { self.savePoints.removeAll(keepingCapacity: true) } - if !self.callStack.isEmpty { - self.callStack.removeAll(keepingCapacity: true) - } for idx in storedCaptures.indices { storedCaptures[idx] = .init() @@ -167,7 +162,6 @@ extension Processor { _checkInvariants() guard 
self.controller == Controller(pc: 0), self.savePoints.isEmpty, - self.callStack.isEmpty, self.storedCaptures.allSatisfy({ $0.range == nil }), self.state == .inProgress, self.failureReason == nil @@ -383,10 +377,9 @@ extension Processor { state = .fail return } - let (pc, pos, stackEnd, capEnds, intRegisters, posRegisters): ( + let (pc, pos, capEnds, intRegisters, posRegisters): ( pc: InstructionAddress, pos: Position?, - stackEnd: CallStackAddress, captureEnds: [_StoredCapture], intRegisters: [Int], PositionRegister: [Input.Index] @@ -398,17 +391,15 @@ extension Processor { // pos instead of removing it if savePoints[idx].isQuantified { savePoints[idx].takePositionFromQuantifiedRange(input) - (pc, pos, stackEnd, capEnds, intRegisters, posRegisters) = savePoints[idx].destructure + (pc, pos, capEnds, intRegisters, posRegisters) = savePoints[idx].destructure } else { - (pc, pos, stackEnd, capEnds, intRegisters, posRegisters) = savePoints.removeLast().destructure + (pc, pos, capEnds, intRegisters, posRegisters) = savePoints.removeLast().destructure } - assert(stackEnd.rawValue <= callStack.count) assert(capEnds.count == storedCaptures.count) controller.pc = pc currentPosition = pos ?? 
currentPosition
-    callStack.removeLast(callStack.count - stackEnd.rawValue)
     registers.ints = intRegisters
     registers.positions = posRegisters
diff --git a/Sources/_StringProcessing/Engine/Tracing.swift b/Sources/_StringProcessing/Engine/Tracing.swift
index 90445d5ec..b67cbb6a5 100644
--- a/Sources/_StringProcessing/Engine/Tracing.swift
+++ b/Sources/_StringProcessing/Engine/Tracing.swift
@@ -128,7 +128,7 @@ extension Processor.SavePoint {
       }
     }
     return """
-      pc: \(self.pc), pos: \(posStr), stackEnd: \(stackEnd)
+      pc: \(self.pc), pos: \(posStr)
       """
   }
 }
diff --git a/Sources/_StringProcessing/Utility/Protocols.swift b/Sources/_StringProcessing/Utility/Protocols.swift
index 24ffbcf70..fc2be291c 100644
--- a/Sources/_StringProcessing/Utility/Protocols.swift
+++ b/Sources/_StringProcessing/Utility/Protocols.swift
@@ -34,9 +34,6 @@ protocol ProcessorProtocol {
   var isAcceptState: Bool { get }
   var isFailState: Bool { get }
 
-  // Provide to get call stack formatting, default empty
-  var callStack: Array { get }
-
   // Provide to get save point formatting, default empty
   var savePoints: Array { get }
 
diff --git a/Sources/_StringProcessing/Utility/Traced.swift b/Sources/_StringProcessing/Utility/Traced.swift
index 198564fe1..6f3e42b69 100644
--- a/Sources/_StringProcessing/Utility/Traced.swift
+++ b/Sources/_StringProcessing/Utility/Traced.swift
@@ -18,7 +18,6 @@ protocol Traced {
 
 protocol TracedProcessor: ProcessorProtocol, Traced {
   // Empty defaulted
-  func formatCallStack() -> String // empty default
   func formatSavePoints() -> String // empty default
   func formatRegisters() -> String // empty default
 
@@ -52,14 +51,6 @@ extension TracedProcessor {
     if isTracingEnabled { printTrace() }
   }
 
-  // Helpers for the conformers
-  func formatCallStack() -> String {
-    if !callStack.isEmpty {
-      return "call stack: \(callStack)\n"
-    }
-    return ""
-  }
-
   func formatSavePoints() -> String {
     if !savePoints.isEmpty {
       var result = "save points:\n"
@@ -158,7 +149,6 @@ extension TracedProcessor {
 
   func formatTrace() -> String {
     var result = "\n--- cycle \(cycleCount) ---\n"
-    result += formatCallStack()
     result += formatSavePoints()
     result += formatRegisters()
     result += formatInput()
diff --git a/Sources/_StringProcessing/Utility/TypedInt.swift b/Sources/_StringProcessing/Utility/TypedInt.swift
index a6bae0684..f5cf44a02 100644
--- a/Sources/_StringProcessing/Utility/TypedInt.swift
+++ b/Sources/_StringProcessing/Utility/TypedInt.swift
@@ -99,10 +99,6 @@ enum _Distance {}
 typealias InstructionAddress = TypedInt<_InstructionAddress>
 enum _InstructionAddress {}
 
-/// A position in the call stack, i.e. for save point restores
-typealias CallStackAddress = TypedInt<_CallStackAddress>
-enum _CallStackAddress {}
-
 /// A position in a position stack, i.e. for NFA simulation
 typealias PositionStackAddress = TypedInt<_PositionStackAddress>
 enum _PositionStackAddress {}
@@ -111,7 +107,6 @@ enum _PositionStackAddress {}
 typealias SavePointStackAddress = TypedInt<_SavePointAddress>
 enum _SavePointAddress {}
 
-
 // MARK: - Registers
 
 /// The register number for a stored element

From 35497fb94738a59f10e0cbfdc90e93b041593cc8 Mon Sep 17 00:00:00 2001
From: Nate Cook
Date: Tue, 7 Jan 2025 11:40:09 -0600
Subject: [PATCH 22/22] Work around change in integer literal inference (#798)

* Work around change in integer literal inference

Related to https://github.com/swiftlang/swift/issues/78371, but doesn't need to be reverted after that issue is fixed.

* Fix availability for capture tests
---
 .../Participants/HandWrittenParticipant.swift |  3 ++-
 Sources/_StringProcessing/Regex/Match.swift   |  2 ++
 Tests/RegexBuilderTests/RegexDSLTests.swift   | 13 ++++++-------
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/Sources/Exercises/Participants/HandWrittenParticipant.swift b/Sources/Exercises/Participants/HandWrittenParticipant.swift
index e7228b333..ef11d5e2b 100644
--- a/Sources/Exercises/Participants/HandWrittenParticipant.swift
+++ b/Sources/Exercises/Participants/HandWrittenParticipant.swift
@@ -60,7 +60,8 @@ private func graphemeBreakPropertyData(
   }
 
   // For testing our framework
-  if forceFailure, lower == Unicode.Scalar(0x07FD) {
+  let failureSigil = Unicode.Scalar(0x07FD as UInt32)!
+  if forceFailure, lower == failureSigil {
     return nil
   }
 
diff --git a/Sources/_StringProcessing/Regex/Match.swift b/Sources/_StringProcessing/Regex/Match.swift
index c0a3aca3c..82ed2c257 100644
--- a/Sources/_StringProcessing/Regex/Match.swift
+++ b/Sources/_StringProcessing/Regex/Match.swift
@@ -41,6 +41,8 @@ extension Regex.Match {
       from: anyRegexOutput.input
     )
     guard let output = typeErasedMatch as? Output else {
+      print(typeErasedMatch)
+      print(Output.self)
       fatalError("Internal error: existential cast failed")
     }
     return output
diff --git a/Tests/RegexBuilderTests/RegexDSLTests.swift b/Tests/RegexBuilderTests/RegexDSLTests.swift
index 1cb53b83d..3a0c63a98 100644
--- a/Tests/RegexBuilderTests/RegexDSLTests.swift
+++ b/Tests/RegexBuilderTests/RegexDSLTests.swift
@@ -1845,8 +1845,8 @@ extension RegexDSLTests {
   }
 
   func testLabeledCaptures_labeledCapture() throws {
-    guard #available(macOS 13, *) else {
-      throw XCTSkip("Fix only exists on macOS 13")
+    guard #available(macOS 14.0, *) else {
+      throw XCTSkip("Fix only exists on macOS 14")
     }
     // The output type of a regex with a labeled capture is dropped.
     let dslWithLabeledCapture = Regex {
@@ -1884,8 +1884,8 @@ extension RegexDSLTests {
   }
 
   func testLabeledCaptures_bothCapture() throws {
-    guard #available(macOS 13, *) else {
-      throw XCTSkip("Fix only exists on macOS 13")
+    guard #available(macOS 14.0, *) else {
+      throw XCTSkip("Fix only exists on macOS 14")
     }
     // Only the output type of a regex with a labeled capture is dropped,
     // outputs of other regexes in the same DSL are concatenated.
@@ -1910,9 +1910,8 @@ extension RegexDSLTests {
   }
 
   func testLabeledCaptures_tooManyCapture() throws {
-    return
-    guard #available(macOS 13, *) else {
-      throw XCTSkip("Fix only exists on macOS 13")
+    guard #available(macOS 14.0, *) else {
+      throw XCTSkip("Fix only exists on macOS 14")
     }
     // The output type of a regex with too many captures is dropped.
     // "Too many" means the left and right output types would add up to >= 10.