[capabilities] add capabilities method to ObjectStore trait#732

Open
vitoordaz wants to merge 14 commits into
apache:mainfrom
vitoordaz:capabilities

Conversation

@vitoordaz

@vitoordaz vitoordaz commented May 23, 2026

Contributor

I'm starting with a single capability for ordered list results.

Memory, GCP, and Azure list operations always return ordered results. Local filesystem storage uses WalkDir, which relies on the OS readdir syscall; readdir does not guarantee lexicographic order because the order depends on the file system.

For AWS, it depends on the bucket type; for directory buckets, the results are not ordered.

I'm thinking about adding a new config option for indicating whether the S3 bucket is a directory bucket or not. But for now, we can say that AWS results are not ordered.

Which issue does this PR close?

Rationale for this change

This will allow object store users to write more efficient code by leveraging underlying object store features.

What changes are included in this PR?

A new capabilities method is added to the ObjectStore trait.

Are there any user-facing changes?

This change is backward compatible.
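
A minimal sketch of how a trait method like this can be added without breaking existing implementations, assuming the method has a default body that advertises nothing. The names `Capability`, `Capabilities`, `Store`, `LegacyStore`, and `InMemoryStore` are hypothetical stand-ins for illustration and are not taken from the PR diff; the real `ObjectStore` trait is async and much larger.

```rust
use std::collections::HashSet;

/// Hypothetical capability flags; only ordered listing is discussed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Capability {
    OrderedListing,
}

/// Hypothetical set of capabilities advertised by one store instance.
#[derive(Debug, Clone, Default)]
pub struct Capabilities(HashSet<Capability>);

impl Capabilities {
    /// Returns the set with `capability` added.
    pub fn with(mut self, capability: Capability) -> Self {
        self.0.insert(capability);
        self
    }

    /// True only if the capability was explicitly advertised.
    pub fn supports(&self, capability: Capability) -> bool {
        self.0.contains(&capability)
    }
}

/// Simplified stand-in for the `ObjectStore` trait (assumption: sync, one method).
pub trait Store {
    /// Default body advertises nothing, so older implementations still compile.
    fn capabilities(&self) -> Capabilities {
        Capabilities::default()
    }
}

/// An implementation written before the method existed; it needs no changes.
pub struct LegacyStore;
impl Store for LegacyStore {}

/// An implementation backed by an ordered map can opt in explicitly.
pub struct InMemoryStore;
impl Store for InMemoryStore {
    fn capabilities(&self) -> Capabilities {
        Capabilities::default().with(Capability::OrderedListing)
    }
}
```

With this shape, `LegacyStore.capabilities()` reports no capabilities while `InMemoryStore.capabilities()` reports ordered listing, which is one way the "backward compatible" claim could hold.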

@vitoordaz

Contributor Author

@alamb does this change seem reasonable to you?

@vitoordaz

Contributor Author

cc: @tustvold

@alamb

alamb commented Jun 2, 2026

Contributor

Thank you for this -- I am not likely to have a chance to review this in the next week or so -- when I am trying to get the next release ready I hopefully can find time

Between now and then perhaps someone else will be able to help review this -- specifically what I think is needed is someone to think through the implications of adding this API

@kevinjqliu kevinjqliu left a comment

Contributor

Thanks for the PR! The general direction LGTM. A couple of comments

I also looked up the ordering documentation and implementation details for the three stores that advertise OrderedListing in this PR:

  • GCS: The Cloud Storage docs explicitly say objects are ordered lexicographically by name:
    https://docs.cloud.google.com/storage/docs/listing-objects

  • Azure: The List Blobs REST API docs say blobs are listed alphabetically in the response body, with uppercase letters listed first:
    https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs

    However, there is an important caveat for accounts with hierarchical namespace (HNS) enabled. Azure documents that / is treated as the lowest sort order, and that this difference only applies to recursive listing. Since ObjectStore::list is recursive, this may not be exactly equivalent to object_store::Path::Ord for all valid paths. For example, paths containing characters that sort before / in Rust string ordering, such as space, !, %, -, or ., may be ordered differently by Azure HNS recursive listing.

    So I think Azure should either only advertise OrderedListing when the account is not HNS-enabled, or the capability documentation should be weakened to mean “ordered according to the backend’s documented object-name ordering” rather than “ordered by Path::Ord.”

  • Memory: the implementation uses BTreeMap<Path, Entry> and list() iterates the map with range((prefix)..). Since BTreeMap iterates keys in order, the memory store should return ordered results by implementation.

Based on this, advertising ordered listing for GCS and Memory seems straightforward. Azure also provides ordered results, but HNS-enabled accounts have different documented recursive listing semantics, so I would avoid advertising this capability for Azure HNS if the contract is strict Path::Ord ordering.
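
To make the HNS caveat concrete, here is a small sketch comparing plain byte-wise string ordering with an ordering in which `/` sorts lowest. The helper names are hypothetical, and mapping `/` to byte 0 is only an illustrative model of the documented behavior, not Azure's actual algorithm.

```rust
/// Sort key that models "`/` sorts lowest" by mapping `/` to byte 0.
/// Illustrative model only (assumption), not Azure's implementation.
fn slash_lowest_key(path: &str) -> Vec<u8> {
    path.bytes().map(|b| if b == b'/' { 0 } else { b }).collect()
}

/// Plain byte-wise ordering, as used when comparing Rust strings.
fn sorted_bytewise(paths: &[&str]) -> Vec<String> {
    let mut sorted: Vec<String> = paths.iter().map(|p| p.to_string()).collect();
    sorted.sort();
    sorted
}

/// Ordering under the "`/` is lowest" model.
fn sorted_slash_lowest(paths: &[&str]) -> Vec<String> {
    let mut sorted: Vec<String> = paths.iter().map(|p| p.to_string()).collect();
    sorted.sort_by_key(|p| slash_lowest_key(p));
    sorted
}
```

For the inputs `a/x`, `a-b/x`, and `a.txt`, byte-wise ordering puts `a/x` last because `-` (0x2D) and `.` (0x2E) both sort before `/` (0x2F), while the slash-lowest model puts `a/x` first. A consumer that assumes byte-wise order, for example to merge two sorted listings, would see these results as out of order.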

Comment thread src/lib.rs Outdated
Comment thread src/capabilities.rs Outdated
Comment thread src/aws/builder.rs Outdated
Comment thread src/capabilities.rs
Comment thread src/integration.rs
Comment thread src/aws/builder.rs Outdated
@vitoordaz

Contributor Author

@kevinjqliu Here is how I think about this. If a user is using a vanilla object store implementation, then the default capabilities for each type of ObjectStore should be enough and, as you pointed out, these APIs rarely change.

If a user is using an ObjectStore that implements one of the public object store APIs (for example, an ObjectStore that implements the AWS S3 API), then they can override the default values using config.

There are edge cases: even in the case of S3, object list order depends on the bucket type (for directory buckets the list is not ordered). For these cases we can use conservative default values.

@kevinjqliu

Contributor

> There are edge cases even in case of S3, object list order depends on bucket type (for directory buckets list is not ordered). For these cases we can use conservative default values.

yea that makes sense. i think azure blob with HNS enabled is another edge case, maybe we can be conservative here and unset OrderedListing for azure.
Alternatively, we can make capabilities take into account specific configurations like HNS for azure. For example, Azure would have the OrderedListing capability when HNS is disabled.
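
A rough sketch of the second option, where the advertised capabilities depend on configuration. `AzureSettings`, its `hns_enabled` field, and `azure_capabilities` are hypothetical names for illustration; no such option is confirmed to exist in the builder.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Capability {
    OrderedListing,
}

/// Hypothetical Azure settings. `hns_enabled` is `None` when the user
/// has not stated whether the account uses hierarchical namespace.
#[derive(Debug, Clone, Copy, Default)]
pub struct AzureSettings {
    pub hns_enabled: Option<bool>,
}

/// Advertise ordered listing only when HNS is known to be disabled.
/// Unknown is treated like enabled, which keeps the default conservative.
pub fn azure_capabilities(settings: AzureSettings) -> Vec<Capability> {
    match settings.hns_enabled {
        Some(false) => vec![Capability::OrderedListing],
        Some(true) | None => Vec::new(),
    }
}
```

Treating the unset case like the HNS-enabled case means a user has to opt in explicitly before the capability is advertised. The same pattern could apply to a directory-bucket flag for S3.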

@vitoordaz vitoordaz requested a review from kevinjqliu June 10, 2026 04:26

@kevinjqliu kevinjqliu left a comment

Contributor

LGTM!

I found a few good documentation pages that i think we should link for the respective object stores.
To summarize the PR, we're setting OrderedListing capability for memory and gcs, but not for aws and azure to be conservative.

  • aws directory bucket does not guarantee ordering.
  • azure HNS enabled account does not guarantee lexicographical ordering.

We can address those as follow-ups. It would be great to set the capability once we determine it's not one of the edge cases.

Comment thread src/gcp/mod.rs
Comment thread src/azure/mod.rs
Comment thread src/aws/mod.rs Outdated
Comment thread src/aws/builder.rs

@kevinjqliu kevinjqliu left a comment

Contributor

A couple things flagged by AI

Comment thread src/capabilities.rs Outdated
Comment thread src/capabilities.rs
Comment thread src/capabilities.rs Outdated

@kevinjqliu kevinjqliu left a comment

Contributor

LGTM!

Comment thread src/aws/mod.rs Outdated
Comment on lines +82 to +83
/// OrderedListing capability depends on the bucket type, it's not enabled for directory bucket.
/// https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html

Contributor

Suggested change
- /// OrderedListing capability depends on the bucket type, it's not enabled for directory bucket.
- /// https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
+ // OrderedListing capability depends on the bucket type, it's not enabled for directory bucket.
+ // https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html

saw this in the commit diff, use // to align with other get_default_capabilities functions

Comment thread src/capabilities.rs Outdated
@kevinjqliu

Contributor

i saw Andrew's comment on the original issue thread, #675 (comment)

let's discuss there before proceeding with this PR

vitoordaz added 14 commits June 10, 2026 17:05
I'm starting with a single capability for ordered list results.

GCP and Azure list objects always return ordered results. For AWS it depends on a bucket type, for directory buckets results are not ordered. I'm thinking about adding new config option for indicating whether S3 bucket is a directory bucket or not. But for now we can say that AWS results are not ordered.
@vitoordaz

Contributor Author

@kevinjqliu do we have a decision regarding this feature?

@kevinjqliu

Contributor

i like the current state of the repo but i would defer to #675

in general i like the idea. a couple things i'd like to see:

  • capabilities are conservative, instance-specific guarantees
  • Absence of a capability means “unknown or unsupported,” not “definitely unsupported.”
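
A sketch of what these two points could mean for a caller: a missing capability only selects a slower fallback path, it is never used to conclude that results are unordered. `ensure_sorted` and the `ordered_listing` flag are hypothetical; in practice the flag would come from whatever the `capabilities` method ends up returning.

```rust
/// Returns the listed paths in sorted order.
///
/// `ordered_listing` is true only when the store instance explicitly
/// advertised ordered listing. When it is false, the order is unknown,
/// so the caller sorts defensively instead of assuming anything.
pub fn ensure_sorted(mut listed: Vec<String>, ordered_listing: bool) -> Vec<String> {
    if !ordered_listing {
        // Unknown or unsupported: pay for a sort to stay correct.
        listed.sort();
    }
    // Advertised: trust the backend guarantee and skip the sort.
    listed
}
```

Under this reading, a store that is conservative and advertises nothing costs callers an extra sort, while a store that wrongly advertises the capability could produce incorrect results, which is why the defaults lean conservative.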


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose underlying object store capabilities (e.g. ordered listing, negative ranges)

3 participants