Implement Data reading and writing #377

parkera · 2024-01-12T18:13:00Z

Implements file reading and writing for Data. For now use String for path instead of URL.

parkera · 2024-01-12T18:13:09Z

jmschonfeld · 2024-01-12T18:14:44Z

Benchmarks/Package.swift

+        .executableTarget(
+            name: "DataIOBenchmarks",
+            dependencies: [
+                .product(name: "FoundationEssentials", package: "swift-foundation"),


I think you likely want swift-foundation-local here instead of swift-foundation

Indeed, thanks.

hassila · 2024-01-12T18:50:41Z

Benchmarks/Benchmarks/DataIO/DataIO.swift

+                teardown: {
+                    cleanup(at: readMe)
+                }
+              )


There's an initializer with trailing setup/teardown closures that you probably would prefer for readability...

See https://swiftpackageindex.com/ordo-one/package-benchmark/1.22.0/documentation/benchmark/writingbenchmarks#Customize-startupteardown-code-for-benchmarks

By the way, do you have any advice on one-target-many-benchmarks vs many-targets-one-benchmark?

That's a good question and the answer is "it depends", but a few reflections:

I think it in general makes sense to group benchmarks together in one target which you want to check together (ie. they depend on the same class, set of classes, of module / component which is under test) just as a convenience for the engineer who wants to validate performance iteratively while making changes to a specific part of the software. For CI the structure just doesn't matter as one would typically run all of the benchmarks anyway, so the structure is just optimising for a) ease-of-running during interactive development and b) ease-of-updating/changing benchmarks as they are related.

It is similar to unit tests in nature that you usually want to have tests that are related to each other in a "suite" (one target for benchmarks).

The Benchmark package supports filtering/skipping of both targets and benchmarks with regexes:

--filter <filter> Benchmarks matching the regexp filter that should be run --skip <skip> Benchmarks matching the regexp filter that should be skipped --target <target> Benchmark targets matching the regexp filter that should be run --skip-target <skip-target> Benchmark targets matching the regexp filter that should be skipped

Which makes it easy to run partial sets of targets/benchmarks if using a thoughtful naming convention too.

One difference is that you might have benchmarks that are both similar to unit tests, but also much more complex benchmarks that are more similar to integration tests, so the answer depends a bit also on the flavour of benchmarks you do.

For longer more complex benchmarks "integration test"-style, it's a good idea to do as swift-nio did and just break out the actual code under test into separate files/targets and just have the benchmark file itself be benchmark configuration and basically a driver - see e.g. https://github.com/apple/swift-nio/blob/main/Benchmarks/Benchmarks/NIOPosixBenchmarks/Benchmarks.swift

While simpler benchmarks can be done inline...

So more concretely for Foundation, I'd probably consider a structure with targets with a naming convention, something along the lines:

FoundationEssentials
FoundationEssentials-Date
FoundationEssentials-Serialization
FoundationEssentials-Predicate
FoundationInternationalization
FoundationNetworking
FoundationNetworking-URLSession

Etc, basically a common targets (e.g. FoundationEssentials) where most benchmarks for that module reside, but then having additional targets for whatever types that might need more complex sets of benchmarks (I just picked some examples above how it could look)) - this allows for regex filtering if wanting to run benchmarks for a given part of Foundation easily with e.g. swift package benchmark --target "FoundationEssentials.*" but doesn't make the benchmarks to balloon out of control in terms of size in a single file. But thats me, YMMV :-)

Thanks for that detailed response. I think the approach we'll take in the short term is to just focus on getting a base level of tests in. Once we see how/if/when we use them, we can always refactor it to match requirements for CI or easier at-desk usage.

karwa · 2024-01-14T11:32:44Z

Sources/FoundationEssentials/Data/Data+Reading.swift

+        throw CocoaError(.fileReadInvalidFileName)
+    }
+
+    let fd = try inPath.withFileSystemRepresentation { inPathFileSystemRep in


I wonder if withFileSystemRepresentation is necessary here and elsewhere, even if it only has an effect on Darwin platforms.

I actually think that method should be deprecated. Fundamentally, Foundation should not be trying to second-guess how the filesystem is going to transform a file name. Even the operating system doesn't really know what kind of transformations are going to happen, because that's handled by the (possibly third-party) filesystem driver. You just give it a file name, and if it needs to be transformed, the driver will do it.

AFAICT, other libraries which wrap the POSIX open function do not perform this transformation on Darwin platforms (e.g. NodeJS doesn't), so why is it necessary for Foundation to do it?

At the very least, I think this needs documenting, but as explained I think we should investigate avoiding this work altogether.

I added withFileSystemRepresentation in a prior PR in order to mirror the functionality of the existing APIs that Foundation provides on String/NSString (but with a swifty interface). I think that we should use withFileSystemRepresentation here and other places when calling down to C APIs that accept paths for a few reasons. First, it maintains compatibility with existing Foundation implementations. The existing Foundation framework performs this transformation and changing the implementation as we port it to swift could lead to compatibility-breaking behavior, so a safer bet is to maintain the compatibility here.

But perhaps more importantly, I think it's still the correct thing to do. You are right that the transformations can be handled by the file system driver and that different file systems can behave differently within the same OS, so this implementation isn't completely accurate. That being said, I think this is still the best choice for Darwin. While APFS handles unicode normalization, HFS+ (which is also commonly used on Darwin platforms) does not and failing to normalize strings here (using this special character table) can result in incorrect paths being used. If we don't normalize HFS+ paths, we might not find the file we're looking for due to a mismatch which could be surprising to some developers. In the worst case, we perform a normalization that something like APFS would have done for us, but in the best case we ensure that we normalize for file systems where that is a requirement.

Another benefit is that withFileSystemRepresentation is portable - for example, my understanding is that Windows should use a UTF-16 representation and this allows us to have one entry point for conversion to file system representations that can vary based on the platform.

I'm happy to add some documentation to that function as a followup to explain this for maintainers since I can see how that history/context might not be clear from the API itself.

I noticed that withFileSystemRepresentation was ported to Swift. Even though I believe it should be deprecated, there may still be value porting it to the new library since it must remain part of Foundation's ABI.

While APFS handles unicode normalization, HFS+ (which is also commonly used on Darwin platforms) does not and failing to normalize strings here (using this special character table) can result in incorrect paths being used.

I wonder if you could explain this a bit more or point to some public documentation, if there is any, because it doesn't add up to me.

If the POSIX functions don't handle this transformation internally somewhere, it means that I'd be able to use them to create a filename which is not normalised (e.g. using the open function's O_CREAT flag). And if I can do that, it means the filesystem doesn't actually normalise filenames at all -- that normalisation is not a feature of the HFS+ filesystem but rather of the Foundation framework. That seems to be what you're saying, but it contradicts the information I can find online, which says that normalisation is indeed a feature of the filesystem.

Similarly, if it is not handled by the driver, it means I can do the following:

Take a filename A

Using Foundation, create the file named A and write some data to it. Because of this transformation, Foundation will actually write to filename X

Using C/C++/NodeJS/etc, call open with the same filename A. Because they don't perform that transformation, and you're saying it's required, they won't be able to find the file.

That seems to me like a serious problem for the operating system. Essentially it makes the POSIX functions unusable, requiring that all IO happens through Foundation so that it gets this required transformation.

So yeah, can you see why this doesn't seem right to me?

Despite the surprising behavior that was my understanding of how it worked. I had some older documentation I was looking to explain this but sadly I can't seem to find that document, and writing a sample app operating on an HFS volume actually seems to prove (from what I can tell) that the HFS+ driver does automatically decompose when creating/reading from the file. So after taking a further look, it does seem that perhaps this normalization might not have been originally necessary (or perhaps this is an artifact of a time when HFS didn't perform that normalization for us). I'll see if I can gather some more info there with examples to thoroughly document it.

That being said - although it looks like I was incorrect about the necessity for interaction with HFS, the compatibility justification still stands for file systems that are normalization-preserving like APFS. If an app exists today that stores a file name and reads/writes to that file using Foundation APIs, the filename will always be the fully decomposed format. Changing this implementation to avoid the decomposition would be a behavioral difference on an APFS volume where normalization is preserved, and I think we'd likely see fallout from changing Foundations' behavior here despite it being different behavior than other libraries that might call `open without this normalization.

I'll do some more investigation on the necessity of this (in case other clients calling open might be incorrect in not localizing) but I think that can be handled separately since I think we need to land this Data implementation as-is.

Thanks for investigating. I hope that this Foundation reimplementation effort can provide an opportunity to reexamine some of these behaviours. I appreciate that this may take time, but I still like to raise the issue so people know about it and look in to it.

I'll just give a quick response to this point:

That being said - although it looks like I was incorrect about the necessity for interaction with HFS, the compatibility justification still stands for file systems that are normalization-preserving like APFS. If an app exists today that stores a file name and reads/writes to that file using Foundation APIs, the filename will always be the fully decomposed format. Changing this implementation to avoid the decomposition would be a behavioral difference on an APFS volume where normalization is preserved, and I think we'd likely see fallout from changing Foundations' behavior here despite it being different behavior than other libraries that might call `open without this normalization.

While APFS is normalisation-preserving, I believe it is still normalisation-aware. So if you save a file caf\u{00E9} (NFC), you can open it using cafe\u{0301} (NFD), and it won't allow both versions to exist in the same folder.

The only visible difference between HFS+ and APFS in this matter is if you create a file, then iterate the contents of its containing directory and look for the file using a Unicode-unaware search such as memcmp. HFS+ will have normalised the filename, so you'll need to look for the "file system representation" in order to get a match, while APFS will have stored the filename as you gave it.

The problem is that this whole process is ill-advised. Even if you can open a file using the name "X", that doesn't necessarily mean you will find a file literally named "X" if you iterate the directory's contents. The filesystem might transform the filename in any number of ways we can't predict (e.g. a case-insensitive filesystem might normalise everything to uppercase, and some filesystems need to truncate filenames). Users shouldn't be looking up files in that way.

That's why I argue that withFileSystemRepresentation should be deprecated. We should put an end to this idea that a directory will contain a particular byte-string, because it isn't true. Once applications no longer operate with that assumption, we wouldn't need to do this normalisation here, even on APFS -- because this normalisation means we aren't really respecting the APFS designers intent to have their filesystem be normalisation-preserving, and much of the effort that is put in to normalisation-aware comparisons at the OS/FS level ends up being for naught. Every subsystem is doing a lot of wasted work.

Anyway, that's enough for now. I'd be very interested to know what your investigations uncover; please do @ me here or on the Swift forums if possible.

karwa · 2024-01-14T11:44:15Z

It is unfortunate that this does not use swift-system -- that library is supposed to expose OS APIs such as open and read in their full fidelity, which platform-abstraction libraries such as Foundation can use as a safe, idiomatic interface to the OS (hence the name "system"). Its maintainers at Apple have invested a significant amount of time on it.

It would be much nicer if this patch could introduce a set of functions to integrate Data with FilePath (even as an internal implementation detail), building on the functionality already present in swift-system, and then layer its public API on top by converting strings and URLs to FilePaths.

karwa · 2024-01-14T12:49:29Z

Sources/FoundationEssentials/Data/Data+Reading.swift

+#endif
+    } else {
+        // We've verified above that fileSize will fit in `Int`
+        guard let bytes = malloc(Int(fileSize)) else {


It's unfortunate that this always allocates. Data can store its contents in an inline buffer.

It's not huge - on 64-bit systems, the inline buffer is 14 bytes, but for very small files it could be useful.

I suspect the overhead of doing the I/O is much higher than a <=14 byte malloc, but your point is still valid. I think plumbing through the possibility of using inline storage would require some internal Data initializers, which we can follow up with.

Yeah I suppose the majority of the benefit would be that resulting Data object has fixed-size storage, which potentially can allow for some more efficient operations (e.g. unrolled or SIMD comparison, simpler deinitialisation).

parkera · 2024-01-16T17:39:55Z

It is unfortunate that this does not use swift-system -- that library is supposed to expose OS APIs such as open and read in their full fidelity, which platform-abstraction libraries such as Foundation can use as a safe, idiomatic interface to the OS (hence the name "system"). Its maintainers at Apple have invested a significant amount of time on it.

It would be much nicer if this patch could introduce a set of functions to integrate Data with FilePath (even as an internal implementation detail), building on the functionality already present in swift-system, and then layer its public API on top by converting strings and URLs to FilePaths.

This is a long-term goal. Our shorter term one is to provide a drop in replacement implementation for this that works on all platforms (including Darwin), and the best way to do that is to make sure we replicate the existing functionality exactly.

Once we address some limitations in swift system (including some that are internal to Darwin builds but affect us here) we intend to figure out how to expose FilePath based API via Foundation.

parkera · 2024-01-16T17:46:57Z

@swift-ci test

parkera · 2024-01-16T19:33:41Z

@swift-ci test

parkera · 2024-01-17T20:38:01Z

@swift-ci test

parkera requested review from iCharlesHu and jmschonfeld January 12, 2024 18:13

jmschonfeld reviewed Jan 12, 2024

View reviewed changes

hassila reviewed Jan 12, 2024

View reviewed changes

karwa reviewed Jan 14, 2024

View reviewed changes

parkera added 2 commits January 17, 2024 12:37

Implement Data reading and writing

64bb8e1

Use swift-foundation-local for dependency name in benchmark

21a95a8

parkera force-pushed the parkera/dataio branch from 208afd2 to 21a95a8 Compare January 17, 2024 20:37

jmschonfeld approved these changes Jan 17, 2024

View reviewed changes

parkera merged commit 4499edd into swiftlang:main Jan 17, 2024

parkera deleted the parkera/dataio branch January 17, 2024 21:04

itingliu mentioned this pull request Mar 14, 2024

Data + IO #221

Closed

iCharlesHu mentioned this pull request Mar 15, 2024

Provide reading/writing for Data #38

Closed

Implement Data reading and writing #377

Implement Data reading and writing #377

Uh oh!

Conversation

parkera commented Jan 12, 2024

Uh oh!

parkera commented Jan 12, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

parkera Jan 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

karwa Jan 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

karwa Jan 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

karwa commented Jan 14, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

parkera commented Jan 16, 2024

Uh oh!

parkera commented Jan 16, 2024

Uh oh!

parkera commented Jan 16, 2024

Uh oh!

parkera commented Jan 17, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

parkera Jan 12, 2024 •

edited

Loading

karwa Jan 16, 2024 •

edited

Loading

karwa Jan 16, 2024 •

edited

Loading