[capabilities] add capabilities method to ObjectStore trait#732
[capabilities] add capabilities method to ObjectStore trait#732vitoordaz wants to merge 14 commits into
Conversation
|
@alamb does this change seems reasonable to you? |
|
cc: @tustvold |
|
Thank you for this -- I am not likely to have a chance to review this in the next week or so -- when I am trying to get the next release ready I hopefully can find time Between now and then perhaps someone else will be able to help review this -- specifically what I think is needed is someone to think through the implications of adding this API |
kevinjqliu
left a comment
There was a problem hiding this comment.
Thanks for the PR! The general direction LGTM. A couple of comments
I also looked up the ordering documentation and implementation details for the three stores that advertise OrderedListing in this PR:
-
GCS: The Cloud Storage docs explicitly say objects are ordered lexicographically by name:
https://docs.cloud.google.com/storage/docs/listing-objects -
Azure: The List Blobs REST API docs say blobs are listed alphabetically in the response body, with uppercase letters listed first:
https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobsHowever, there is an important caveat for accounts with hierarchical namespace (HNS) enabled. Azure documents that
/is treated as the lowest sort order, and that this difference only applies to recursive listing. SinceObjectStore::listis recursive, this may not be exactly equivalent toobject_store::Path::Ordfor all valid paths. For example, paths containing characters that sort before/in Rust string ordering, such as space, !, %, -, or ., may be ordered differently by Azure HNS recursive listing.So I think Azure should either only advertise OrderedListing when the account is not HNS-enabled, or the capability documentation should be weakened to mean “ordered according to the backend’s documented object-name ordering” rather than “ordered by Path::Ord.”
-
Memory: the implementation uses BTreeMap<Path, Entry> and list() iterates the map with range((prefix)..). Since BTreeMap iterates keys in order, the memory store should return ordered results by implementation.
Based on this, advertising ordered listing for GCS and Memory seems straightforward. Azure also provides ordered results, but HNS-enabled accounts have different documented recursive listing semantics, so I would avoid advertising this capability for Azure HNS if the contract is strict Path::Ord ordering.
|
@kevinjqliu Here is how I think about this. If user is using vanilla object store implementation than default capabilities for each type of ObjectStore should be enough and as you pointed out these APIs rarely change. If user is using ObjectStore that implements some of the public object store (for example ObjectStore that implements AWS S3 API) then they can override default values using config. There are edge cases even in case of S3, object list order depends on bucket type (for directory buckets list is not ordered). For these cases we can use conservative default values. |
yea that makes sense. i think azure blob with HNS enabled is another edge case, maybe we can be conservative here and unset OrderedListing for azure. |
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM!
I found a few good documentation pages that i think we should link for the respective object stores.
To summarize the PR, we're setting OrderedListing capability for memory and gcs, but not for aws and azure to be conservative.
awsdirectory bucket does not guarantee ordering.azureHNS enabled account does not guarantee lexicographical ordering.
We can address those as follow ups. Would be great to set the capability once we determine its not one of the edge cases
kevinjqliu
left a comment
There was a problem hiding this comment.
A couple things flagged by AI
| /// OrderedListing capability depends on the bucket type, it's not enabled for directory bucket. | ||
| /// https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html |
There was a problem hiding this comment.
| /// OrderedListing capability depends on the bucket type, it's not enabled for directory bucket. | |
| /// https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html | |
| // OrderedListing capability depends on the bucket type, it's not enabled for directory bucket. | |
| // https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html |
saw this in the commit diff, use // to align with other get_default_capabilities functions
|
i saw Andrew's comment on the original issue thread, #675 (comment) lets discuss there before proceeding with this PR |
I'm starting with a single capability for ordered list results. GCP and Azure list objects always return ordered results. For AWS it depends on a bucket type, for directory buckets results are not ordered. I'm thinking about adding new config option for indicating whether S3 bucket is a directory bucket or not. But for now we can say that AWS results are not ordered.
|
@kevinjqliu do we have decision regarding this feature? |
|
i like the current state of the repo but i would defer to #675 in general i like the idea. a couple things i'd like to see:
|
I'm starting with a single capability for ordered list results.
Memory, GCP, and Azure list objects always return ordered results. Local filesystem storage uses
WalkDirwhich relies on OSreaddirsyscall that does not guarantee lexicographic order because it depends on file system.For AWS, it depends on the bucket type; for directory buckets, the results are not ordered.
I'm thinking about adding a new config option for indicating whether the S3 bucket is a directory bucket or not. But for now, we can say that AWS results are not ordered.
Which issue does this PR close?
Rationale for this change
This will allow object store users to write more efficient code by leveraging underlying object store features.
What changes are included in this PR?
A new
capabilitiesmethod is added to the ObjectStore trait.Are there any user-facing changes?
This change is backward compatible.