Skip to content

feat(): Support context document import from local filesystem and GitHub#16903

Open
jjoyce0510 wants to merge 7 commits into
masterfrom
jj--support-context-document-import-oss
Open

feat(): Support context document import from local filesystem and GitHub#16903
jjoyce0510 wants to merge 7 commits into
masterfrom
jj--support-context-document-import-oss

Conversation

@jjoyce0510
Copy link
Copy Markdown
Collaborator

@jjoyce0510 jjoyce0510 commented Apr 3, 2026

Summary

Adding a PR to introduce support for uploading context docs for OSS DataHub.

This is support for ONE-TIME sync from GitHub and local filesystem. We DO produce consistent URNs, enabling you to effectively overwrite / update existing docs again if the contents change at source! The goal is to UNBLOCK folks from getting context docs onboarded.

See the UX below for the full experience.

Screenshots

Screenshot 2026-04-06 at 12 13 18 PM Screenshot 2026-04-06 at 11 44 02 AM Screenshot 2026-04-06 at 11 43 39 AM Screenshot 2026-04-06 at 11 34 40 AM Screenshot 2026-04-06 at 11 34 30 AM Screenshot 2026-04-06 at 11 34 14 AM Screenshot 2026-04-06 at 11 34 08 AM

Status

Ready for review.

@github-actions github-actions Bot added ingestion PR or Issue related to the ingestion of metadata product PR or Issue related to the DataHub UI/UX devops PR or Issue related to DataHub backend & deployment smoke_test Contains changes related to smoke tests labels Apr 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 3, 2026

Bundle Report

Changes will increase total bundle size by 528.59kB (2.33%) ⬆️. This is within the configured threshold ✅

Detailed changes
Bundle name Size Change
datahub-react-web-esm 23.24MB 528.59kB (2.33%) ⬆️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name Size Change Total Size Change (%)
assets/index-*.js 528.59kB 12.99MB 4.24%

Files in assets/index-*.js:

  • ./src/alchemy-components/components/Input/utils.ts → Total Size: 164 bytes

  • ./src/app/analytics/event.ts → Total Size: 15.63kB

  • ./src/alchemy-components/components/FileDragAndDropArea/FileDragAndDropArea.tsx → Total Size: 3.38kB

  • ./src/app/context/import/hooks/useImportFromGitHub.ts → Total Size: 1.07kB

  • ./src/app/context/import/sources/GitHubImportSource.tsx → Total Size: 2.87kB

  • ./src/app/context/import/ImportParentSelector.tsx → Total Size: 4.07kB

  • ./src/app/context/import/ImportDocumentsButton.tsx → Total Size: 1.01kB

  • ./src/app/context/ContextSidebar.tsx → Total Size: 11.22kB

  • ./src/app/context/import/ImportDocumentsModal.tsx → Total Size: 9.51kB

  • ./src/app/context/import/import.types.ts → Total Size: 419 bytes

  • ./src/app/context/import/sources/FileUploadSource.tsx → Total Size: 2.89kB

  • ./src/app/document/hooks/useLoadDocumentTree.ts → Total Size: 3.56kB

  • ./src/app/context/import/hooks/useImportFromFiles.ts → Total Size: 2.52kB

@alwaysmeticulous
Copy link
Copy Markdown

alwaysmeticulous Bot commented Apr 3, 2026

🔴 Meticulous spotted visual differences in 1 of 1606 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 4484385 feat(ingestion): add github-docs source for importing documents from Git.... This comment will update as new commits are pushed.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 3, 2026

Copy link
Copy Markdown
Collaborator

@alexsku alexsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review

Type: behavior change (new feature)
Size: +4432 / -60 across 49 files

This PR adds the ability to import context documents into DataHub from two sources: local file upload (via browser-side text extraction) and GitHub repositories (via the GitHub REST API). It introduces a new DocumentImportService backend, three new GraphQL resolvers (import from files, import from GitHub, preview from GitHub), a React modal UI with source selection, and parent document hierarchy support. It also adds ignore_above to keyword mappings in ES to prevent indexing failures on long text values.

Key invariants:

  1. Authorization: All import operations require MANAGE_DOCUMENTS platform privilege
  2. Idempotency: Documents are created with deterministic IDs derived from source IDs, so re-imports update rather than duplicate
  3. Hierarchy: GitHub folder structure is preserved as parent-child document relationships; candidates must be ordered parents-first
  4. The GitHub token is passed per-request and not stored server-side

Risk Assessment

Risk Medium
Blast radius Document entities only; ES mapping change affects all entities with TEXT/TEXT_PARTIAL fields
Rollback Safe for import feature (new code, no migrations). ES mapping change (ignore_above) is additive and backward-compatible
Rollout Ship it — import resolvers are null-guarded if factory bean is absent
CI Failing: build-and-test (except_metadata_ingestion, UTC)

Blocking Issues

None found.

High-Priority Issues

[HIGH] GitHub token passed through GraphQL — potential for logging/persistence

Category: security | Confidence: medium
Location: datahub-graphql-core/src/main/resources/documents.graphql:871githubToken: String!

The GitHub PAT is passed as a plain String! in the GraphQL input. While it's used only server-side and not stored, GraphQL query logging, APM tracing, or error serialization could inadvertently capture the token in server logs. The same ImportDocumentsFromGitHubInput type is reused for both the previewDocumentsFromGitHub query and the importDocumentsFromGitHub mutation, meaning the token is sent even for read-only preview operations.

Suggested fix: Ensure GraphQL request logging is configured to redact githubToken. Consider whether the preview query truly needs the token in the same input type (it does for private repos, so this is likely acceptable). Document the logging risk.
How to validate: Check if the DataHub GMS logging configuration logs full GraphQL variables. Search for query logging middleware.

[HIGH] No limit on number of files fetched from GitHub — unbounded server-side work

Category: performance | Confidence: high
Location: metadata-service/services/src/main/java/com/linkedin/metadata/service/docimport/GitHubDocumentSource.java:166-187fetchDocuments()

The fetchDocuments method iterates over all matching files from the GitHub tree API and fetches each file's content one-by-one via individual HTTP requests (line 167: fetchFileContent). For a large repo with thousands of matching files, this creates an unbounded number of sequential HTTP requests from the GMS server, tying up a GraphQL worker thread for potentially minutes. There's no configurable limit or pagination.

Suggested fix: Add a maxFiles parameter (defaulting to something like 500) and stop fetching after the limit. Log a warning when truncated.
How to validate: Point the import at a large repo (e.g., a docs monorepo with 1000+ .md files) and observe GMS behavior.


Other Issues

  • [MEDIUM / edge-case] GitHubDocumentSource.java:391matchesFilters uses startsWith(pathPrefix) without a / boundary. Prefix "doc" would match "document/README.md". Fix: append / to non-empty prefix before the startsWith check, or use filePath.startsWith(pathPrefix + "/").

  • [MEDIUM / edge-case] GitHubDocumentSource.java:90-92 — The GitHub tree API returns a truncated: true field when the tree is too large (>100K entries). This is not checked. If truncated, the user silently gets an incomplete file list. Fix: check tree.path("truncated").asBoolean() and warn the user.

  • [MEDIUM / performance] DocumentImportService.java:190entityClient.exists() is called for every candidate inside a loop. For a batch import of N documents, this makes N individual existence checks. Fix: consider batching existence checks or accepting that the first import will always be "create".

  • [MEDIUM / correctness] DocumentImportService.java:275-283makeDocumentId can produce collisions. Two different source IDs that differ only in characters replaced by - (e.g., "docs/a b" and "docs/a-b") will produce the same document ID. File-upload source IDs like upload.my file and upload.my-file would collide.

  • [MEDIUM / dx] ImportDocumentsModal.tsx:210-212<Text color="gray" colorLevel={1700}> passes color directly to alchemy <Text> components. Per the project's theming guidance, colors should come from theme tokens rather than being passed as props to alchemy components.

  • [LOW / operability] GitHubDocumentSource.java:362-386executeGitHubGet returns null on non-200 responses but doesn't distinguish between 401 (bad token), 403 (rate limit), and 404 (repo not found). A more specific error message would help users troubleshoot.


What's Missing

  • No limit on imported file count: The GitHub import has no cap on the number of files processed. Both the frontend and backend should enforce a maximum.
  • No cancellation support: The GitHub import is synchronous on the GraphQL thread. Consider async execution with progress tracking for large imports.
  • Bundle size increase: Codecov reports +528KB (2.33%) from the mammoth DOCX library. Consider lazy-loading it only when the import modal is opened.

Test Plan

Invariant Covered? Suggested test
Auth check on all import endpoints Yes — resolver tests verify AuthorizationException
Idempotent re-import (same URN) Yes — smoke test test_import_file_upload_idempotent
Parent-child hierarchy Yes — smoke test test_import_file_upload_with_parent
makeDocumentId collision No Unit test: verify distinct source IDs don't collide
Large repo truncation handling No Integration test: mock truncated tree response
ES ignore_above behavior Partial — FieldTypeMapperTest covers mapping generation Could add integration test indexing a >8191 char keyword

Questions for Author

  1. Is the CI build failure (build-and-test) unrelated to this PR, or does it need to be fixed before merge?
  2. Was a file count limit for GitHub imports intentionally omitted, or is it planned for a follow-up?
  3. Should the mammoth dependency be lazy-loaded to reduce the bundle impact for users who don't use the import feature?
  4. The ES ignore_above / keyword mapping changes seem orthogonal to the document import feature — was this bundled intentionally due to document text fields hitting the limit, or could it be split into a separate PR for cleaner rollback?

…GitHub repos

Adds a new Python ingestion source that imports documents from GitHub
repositories as native DataHub documents (or external references),
preserving folder hierarchy as parent-child relationships. Uses the
GitHub REST API with SecretStr token handling, stateful ingestion for
stale entity removal, and configurable path/extension filtering.

Includes unit tests (63 tests), connector documentation, and recipe.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

devops PR or Issue related to DataHub backend & deployment ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer. product PR or Issue related to the DataHub UI/UX smoke_test Contains changes related to smoke tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants