Skip to content

PLU-347: feat(box): add ACL permissions metadata to Box connector#704

Open
danielle-unstructured-io wants to merge 2 commits intomainfrom
feature/box-acl-permissions
Open

PLU-347: feat(box): add ACL permissions metadata to Box connector#704
danielle-unstructured-io wants to merge 2 commits intomainfrom
feature/box-acl-permissions

Conversation

@danielle-unstructured-io
Copy link
Copy Markdown

@danielle-unstructured-io danielle-unstructured-io commented May 5, 2026

Summary

  • Extends the Box connector to populate permissions_data on FileDataSourceMetadata, consistent with the Confluence and Google Drive implementations
  • Permissions are fetched at index time in BoxIndexer.run() so they're available to all downstream pipeline stages
  • BoxDownloader.run() retains a fallback for standalone usage (CLI, integration tests without the SND plugin layer)

What changed

unstructured_ingest/processes/connectors/fsspec/box.py

  • Added BOX_ROLE_MAPPING — maps Box collaboration roles to [read], [read, update], or [read, update, delete]; uploader excluded (write-only)
  • Added module-level _normalize_collaborations(), _get_collaborations_for_folder(), _get_permissions_for_file() helpers
  • BoxIndexer.run() override — initializes a Box SDK client once, then for each indexed file walks path_collection ancestor folders (LRU-cached, max 5 entries) plus direct file collaborations to build normalized permissions
  • BoxDownloader.run() fallback — only fetches permissions if permissions_data is None (i.e., indexer wasn't run)
  • BoxConnectionConfig.get_box_client() — returns an authenticated boxsdk.Client via JWT

Tests

  • 21 unit tests covering BOX_ROLE_MAPPING, _normalize_collaborations, and _get_permissions_for_file (all mock-based)
  • Integration test + expected-results fixtures for both top-folder and second-tier subfolder files

Test plan

  • All 21 unit tests pass
  • Verified end-to-end in SND: root-level and subfolder Box files both have permissions_data populated with correct user IDs
  • Integration tests pass in CI

Closes PLU-347

🤖 Generated with Claude Code


Note

Medium Risk
Adds Box SDK calls during indexing/downloading to fetch and attach ACL-derived permissions_data, which can impact correctness, performance, and error handling when talking to Box APIs.

Overview
Box source metadata is extended to include ACL permissions. The Box connector now derives permissions_data from Box collaborations (including inherited folder permissions) and attaches it to FileDataSourceMetadata.

Permissions are fetched at index time via a new authenticated boxsdk client and normalized through new helpers/role mappings with a small LRU cache and a configurable cap (max_num_metadata_permissions); the downloader retains a fallback to populate permissions when the indexer wasn’t used.

Adds Box integration coverage and fixtures for top-level vs nested folder ACL behavior, plus unit tests for role-to-operation mapping and permission normalization/caching. Documentation image links were also updated to use local pipeline.png/sequence.png paths.

Reviewed by Cursor Bugbot for commit 99299bf. Bugbot is set up for automated code reviews on this repo. Configure here.

danielle-unstructured-io and others added 2 commits May 1, 2026 14:42
Fetches collaborations from Box API (direct + inherited via path_collection
ancestor walk) and normalizes into permissions_data on FileDataSourceMetadata,
consistent with Confluence and Google Drive connectors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 4 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 99299bf. Configure here.

BOX_ROLE_MAPPING: dict[str, list[str]] = {
"owner": ["read", "update", "delete"],
"co-owner": ["read", "update", "delete"],
"editor": ["read", "update"],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Editor delete permission omitted

Medium Severity

editor collaborations are normalized without delete, but Box Editors can delete files and folders. permissions_data therefore understates delete access for valid Box users and groups.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 99299bf. Configure here.

type_key = entity_type + "s"
for op in operations:
normalized[op][type_key].add(entity_id)
total[0] += 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Access-only collaborations overgrant permissions

High Severity

is_access_only Box collaborations are normalized like regular permissions. During ancestor-folder walks, users or groups with access only to another nested item can be added to unrelated files’ permissions_data, overgranting downstream ACLs.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 99299bf. Configure here.

Copy link
Copy Markdown
Contributor

@awalker4 awalker4 May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have claude take a look at this one. (This being the Cursor comment about overgranted permissions)

@@ -0,0 +1,5 @@
{
"directory_structure": [
"unstructured_aqpewcxk/Billing issue - Example 1.pdf"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Randomized download paths in fixtures

Medium Severity

The Box fixtures assert captured unstructured_* temp-directory suffixes. FsspecDownloader creates a fresh random suffix each run, so directory-structure validation will fail even when downloads succeed.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 99299bf. Configure here.

try:
file_data.metadata.permissions_data = _get_permissions_for_file(
client, file_id, cache
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Permission cap is ignored

Medium Severity

BoxIndexer.run() always uses _get_permissions_for_file()’s default limit, so BoxDownloaderConfig.max_num_metadata_permissions is ignored in normal pipelines where the downloader fallback does not run.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 99299bf. Configure here.

Copy link
Copy Markdown
Contributor

@awalker4 awalker4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will just need to add a new block to the changelog for whatever the next version is. This value also goes into __version__.py

type_key = entity_type + "s"
for op in operations:
normalized[op][type_key].add(entity_id)
total[0] += 1
Copy link
Copy Markdown
Contributor

@awalker4 awalker4 May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have claude take a look at this one. (This being the Cursor comment about overgranted permissions)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants