-
Notifications
You must be signed in to change notification settings - Fork 207
Implement Data reading and writing #377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@swift-ci test |
Benchmarks/Package.swift
Outdated
| .executableTarget( | ||
| name: "DataIOBenchmarks", | ||
| dependencies: [ | ||
| .product(name: "FoundationEssentials", package: "swift-foundation"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you likely want swift-foundation-local here instead of swift-foundation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, thanks.
| teardown: { | ||
| cleanup(at: readMe) | ||
| } | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an initializer with trailing setup/teardown closures that you probably would prefer for readability...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, do you have any advice on one-target-many-benchmarks vs many-targets-one-benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question and the answer is "it depends", but a few reflections:
I think it in general makes sense to group benchmarks together in one target which you want to check together (ie. they depend on the same class, set of classes, of module / component which is under test) just as a convenience for the engineer who wants to validate performance iteratively while making changes to a specific part of the software. For CI the structure just doesn't matter as one would typically run all of the benchmarks anyway, so the structure is just optimising for a) ease-of-running during interactive development and b) ease-of-updating/changing benchmarks as they are related.
It is similar to unit tests in nature that you usually want to have tests that are related to each other in a "suite" (one target for benchmarks).
The Benchmark package supports filtering/skipping of both targets and benchmarks with regexes:
--filter <filter>
Benchmarks matching the regexp filter that should be run
--skip <skip>
Benchmarks matching the regexp filter that should be skipped
--target <target>
Benchmark targets matching the regexp filter that should be run
--skip-target <skip-target>
Benchmark targets matching the regexp filter that should be skipped
Which makes it easy to run partial sets of targets/benchmarks if using a thoughtful naming convention too.
One difference is that you might have benchmarks that are both similar to unit tests, but also much more complex benchmarks that are more similar to integration tests, so the answer depends a bit also on the flavour of benchmarks you do.
For longer more complex benchmarks "integration test"-style, it's a good idea to do as swift-nio did and just break out the actual code under test into separate files/targets and just have the benchmark file itself be benchmark configuration and basically a driver - see e.g. https://github.com/apple/swift-nio/blob/main/Benchmarks/Benchmarks/NIOPosixBenchmarks/Benchmarks.swift
While simpler benchmarks can be done inline...
So more concretely for Foundation, I'd probably consider a structure with targets with a naming convention, something along the lines:
FoundationEssentials
FoundationEssentials-Date
FoundationEssentials-Serialization
FoundationEssentials-Predicate
FoundationInternationalization
FoundationNetworking
FoundationNetworking-URLSession
Etc, basically a common targets (e.g. FoundationEssentials) where most benchmarks for that module reside, but then having additional targets for whatever types that might need more complex sets of benchmarks (I just picked some examples above how it could look)) - this allows for regex filtering if wanting to run benchmarks for a given part of Foundation easily with e.g. swift package benchmark --target "FoundationEssentials.*" but doesn't make the benchmarks to balloon out of control in terms of size in a single file. But thats me, YMMV :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for that detailed response. I think the approach we'll take in the short term is to just focus on getting a base level of tests in. Once we see how/if/when we use them, we can always refactor it to match requirements for CI or easier at-desk usage.
| throw CocoaError(.fileReadInvalidFileName) | ||
| } | ||
|
|
||
| let fd = try inPath.withFileSystemRepresentation { inPathFileSystemRep in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if withFileSystemRepresentation is necessary here and elsewhere, even if it only has an effect on Darwin platforms.
I actually think that method should be deprecated. Fundamentally, Foundation should not be trying to second-guess how the filesystem is going to transform a file name. Even the operating system doesn't really know what kind of transformations are going to happen, because that's handled by the (possibly third-party) filesystem driver. You just give it a file name, and if it needs to be transformed, the driver will do it.
AFAICT, other libraries which wrap the POSIX open function do not perform this transformation on Darwin platforms (e.g. NodeJS doesn't), so why is it necessary for Foundation to do it?
At the very least, I think this needs documenting, but as explained I think we should investigate avoiding this work altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added withFileSystemRepresentation in a prior PR in order to mirror the functionality of the existing APIs that Foundation provides on String/NSString (but with a swifty interface). I think that we should use withFileSystemRepresentation here and other places when calling down to C APIs that accept paths for a few reasons. First, it maintains compatibility with existing Foundation implementations. The existing Foundation framework performs this transformation and changing the implementation as we port it to swift could lead to compatibility-breaking behavior, so a safer bet is to maintain the compatibility here.
But perhaps more importantly, I think it's still the correct thing to do. You are right that the transformations can be handled by the file system driver and that different file systems can behave differently within the same OS, so this implementation isn't completely accurate. That being said, I think this is still the best choice for Darwin. While APFS handles unicode normalization, HFS+ (which is also commonly used on Darwin platforms) does not and failing to normalize strings here (using this special character table) can result in incorrect paths being used. If we don't normalize HFS+ paths, we might not find the file we're looking for due to a mismatch which could be surprising to some developers. In the worst case, we perform a normalization that something like APFS would have done for us, but in the best case we ensure that we normalize for file systems where that is a requirement.
Another benefit is that withFileSystemRepresentation is portable - for example, my understanding is that Windows should use a UTF-16 representation and this allows us to have one entry point for conversion to file system representations that can vary based on the platform.
I'm happy to add some documentation to that function as a followup to explain this for maintainers since I can see how that history/context might not be clear from the API itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that withFileSystemRepresentation was ported to Swift. Even though I believe it should be deprecated, there may still be value porting it to the new library since it must remain part of Foundation's ABI.
While APFS handles unicode normalization, HFS+ (which is also commonly used on Darwin platforms) does not and failing to normalize strings here (using this special character table) can result in incorrect paths being used.
I wonder if you could explain this a bit more or point to some public documentation, if there is any, because it doesn't add up to me.
If the POSIX functions don't handle this transformation internally somewhere, it means that I'd be able to use them to create a filename which is not normalised (e.g. using the open function's O_CREAT flag). And if I can do that, it means the filesystem doesn't actually normalise filenames at all -- that normalisation is not a feature of the HFS+ filesystem but rather of the Foundation framework. That seems to be what you're saying, but it contradicts the information I can find online, which says that normalisation is indeed a feature of the filesystem.
Similarly, if it is not handled by the driver, it means I can do the following:
- Take a filename A
- Using Foundation, create the file named A and write some data to it. Because of this transformation, Foundation will actually write to filename X
- Using C/C++/NodeJS/etc, call
openwith the same filename A. Because they don't perform that transformation, and you're saying it's required, they won't be able to find the file.
That seems to me like a serious problem for the operating system. Essentially it makes the POSIX functions unusable, requiring that all IO happens through Foundation so that it gets this required transformation.
So yeah, can you see why this doesn't seem right to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite the surprising behavior that was my understanding of how it worked. I had some older documentation I was looking to explain this but sadly I can't seem to find that document, and writing a sample app operating on an HFS volume actually seems to prove (from what I can tell) that the HFS+ driver does automatically decompose when creating/reading from the file. So after taking a further look, it does seem that perhaps this normalization might not have been originally necessary (or perhaps this is an artifact of a time when HFS didn't perform that normalization for us). I'll see if I can gather some more info there with examples to thoroughly document it.
That being said - although it looks like I was incorrect about the necessity for interaction with HFS, the compatibility justification still stands for file systems that are normalization-preserving like APFS. If an app exists today that stores a file name and reads/writes to that file using Foundation APIs, the filename will always be the fully decomposed format. Changing this implementation to avoid the decomposition would be a behavioral difference on an APFS volume where normalization is preserved, and I think we'd likely see fallout from changing Foundations' behavior here despite it being different behavior than other libraries that might call `open without this normalization.
I'll do some more investigation on the necessity of this (in case other clients calling open might be incorrect in not localizing) but I think that can be handled separately since I think we need to land this Data implementation as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for investigating. I hope that this Foundation reimplementation effort can provide an opportunity to reexamine some of these behaviours. I appreciate that this may take time, but I still like to raise the issue so people know about it and look in to it.
I'll just give a quick response to this point:
That being said - although it looks like I was incorrect about the necessity for interaction with HFS, the compatibility justification still stands for file systems that are normalization-preserving like APFS. If an app exists today that stores a file name and reads/writes to that file using Foundation APIs, the filename will always be the fully decomposed format. Changing this implementation to avoid the decomposition would be a behavioral difference on an APFS volume where normalization is preserved, and I think we'd likely see fallout from changing Foundations' behavior here despite it being different behavior than other libraries that might call `open without this normalization.
While APFS is normalisation-preserving, I believe it is still normalisation-aware. So if you save a file caf\u{00E9} (NFC), you can open it using cafe\u{0301} (NFD), and it won't allow both versions to exist in the same folder.
The only visible difference between HFS+ and APFS in this matter is if you create a file, then iterate the contents of its containing directory and look for the file using a Unicode-unaware search such as memcmp. HFS+ will have normalised the filename, so you'll need to look for the "file system representation" in order to get a match, while APFS will have stored the filename as you gave it.
The problem is that this whole process is ill-advised. Even if you can open a file using the name "X", that doesn't necessarily mean you will find a file literally named "X" if you iterate the directory's contents. The filesystem might transform the filename in any number of ways we can't predict (e.g. a case-insensitive filesystem might normalise everything to uppercase, and some filesystems need to truncate filenames). Users shouldn't be looking up files in that way.
That's why I argue that withFileSystemRepresentation should be deprecated. We should put an end to this idea that a directory will contain a particular byte-string, because it isn't true. Once applications no longer operate with that assumption, we wouldn't need to do this normalisation here, even on APFS -- because this normalisation means we aren't really respecting the APFS designers intent to have their filesystem be normalisation-preserving, and much of the effort that is put in to normalisation-aware comparisons at the OS/FS level ends up being for naught. Every subsystem is doing a lot of wasted work.
Anyway, that's enough for now. I'd be very interested to know what your investigations uncover; please do @ me here or on the Swift forums if possible.
|
It is unfortunate that this does not use It would be much nicer if this patch could introduce a set of functions to integrate |
| #endif | ||
| } else { | ||
| // We've verified above that fileSize will fit in `Int` | ||
| guard let bytes = malloc(Int(fileSize)) else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate that this always allocates. Data can store its contents in an inline buffer.
It's not huge - on 64-bit systems, the inline buffer is 14 bytes, but for very small files it could be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the overhead of doing the I/O is much higher than a <=14 byte malloc, but your point is still valid. I think plumbing through the possibility of using inline storage would require some internal Data initializers, which we can follow up with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I suppose the majority of the benefit would be that resulting Data object has fixed-size storage, which potentially can allow for some more efficient operations (e.g. unrolled or SIMD comparison, simpler deinitialisation).
This is a long-term goal. Our shorter term one is to provide a drop in replacement implementation for this that works on all platforms (including Darwin), and the best way to do that is to make sure we replicate the existing functionality exactly. Once we address some limitations in swift system (including some that are internal to Darwin builds but affect us here) we intend to figure out how to expose |
|
@swift-ci test |
1 similar comment
|
@swift-ci test |
208afd2 to
21a95a8
Compare
|
@swift-ci test |
Implements file reading and writing for
Data. For now useStringfor path instead ofURL.