chore(source-pokeapi): CDK prerelease proof test (InferredSchemaLoader) (do not merge)#69253
Conversation
…7224 This commit updates the base Docker image to use the prerelease CDK version that includes the new InferredSchemaLoader feature. This is the first commit in a two-commit proof-of-concept test. Related: airbytehq/airbyte-python-cdk#831 Co-Authored-By: AJ Steers <aj@airbyte.io>
Original prompt from AJ Steers |
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Helpful Resources
PR Slash CommandsAirbyte Maintainers (that's you!) can execute the following slash commands on your PR:
|
This commit replaces the InlineSchemaLoader with the new InferredSchemaLoader on the pokemon stream. The InferredSchemaLoader will read 1 sample record at discover time and infer the schema using genson. This demonstrates the end-to-end functionality of the InferredSchemaLoader feature from CDK PR airbytehq/airbyte-python-cdk#831. Expected behavior: - Discover should return a non-empty schema inferred from the sample record - Schema may include nullable unions (e.g., ["string", "null"]) from genson - Read operations should continue to work with the inferred schema Related: airbytehq/airbyte-python-cdk#831 Co-Authored-By: AJ Steers <aj@airbyte.io>
|
…r config InferredSchemaLoader should reuse the stream's existing retriever, not embed a new one. The nested retriever configuration was causing JSON schema validation errors. Also increased record_sample_size from 1 to 50 for better schema coverage. Co-Authored-By: AJ Steers <aj@airbyte.io>
Show/Hide Detailed Report: CI Failure Root Cause AnalysisInvestigation SummaryI've investigated the CI test failures and identified the root cause. The prerelease Docker image is correctly built with the InferredSchemaLoader feature, but there's a schema validation issue with how the manifest is configured. Findings1. Prerelease Docker Image Verification ✅The prerelease Docker image is correctly built:
2. Configuration IssueCommit 2 (fbcb883) used this configuration: schema_loader:
type: InferredSchemaLoader
record_sample_size: 1
retriever:
type: SimpleRetriever
requester:
$ref: "#/definitions/base_requester" # ← Problem: $ref with sibling properties
path: /{{config['pokemon_name']}}
http_method: GET
record_selector:
type: RecordSelector
extractor:
type: DpathExtractor
field_path: []Commit 3 (1ea0213) incorrectly removed the retriever entirely, which made validation fail because 3. Root CauseThe issue with Commit 2 is likely the Recommended FixInline the requester configuration instead of using schema_loader:
type: InferredSchemaLoader
record_sample_size: 50
retriever:
type: SimpleRetriever
requester:
type: HttpRequester
url_base: https://pokeapi.co/api/v2/pokemon
path: /{{config['pokemon_name']}}
http_method: GET
record_selector:
type: RecordSelector
extractor:
type: DpathExtractor
field_path: []Next StepsAaron ("AJ") Steers (@aaronsteers) - Should I:
This POC has successfully validated that:
|
|
Closing this as stale — it's been open since Nov 2025 with no human activity. Feel free to reopen if needed. |
What
This is a proof-of-concept PR to test the new
InferredSchemaLoaderfeature from Python CDK prerelease version7.4.2.post15.dev19207237224. This PR demonstrates end-to-end functionality by switching source-pokeapi to use runtime schema inference instead of a static schema.Related:
How
Three-commit strategy:
7.4.2.post15.dev19207237224InlineSchemaLoaderwithInferredSchemaLoaderon the pokemon stream (initial attempt with configuration error)Key changes:
metadata.yaml: Updated baseImage from7.4.1to prerelease7.4.2.post15.dev19207237224manifest.yaml: Switched pokemon stream from InlineSchemaLoader (static schema reference) to InferredSchemaLoader (runtime inference)record_sample_size: 50to infer schema from up to 50 sample recordsConfiguration Fix (Commit 3):
The initial InferredSchemaLoader configuration incorrectly nested a
retrieverinside theschema_loader, causing JSON schema validation errors. InferredSchemaLoader should reuse the stream's existing retriever, not embed a new one.Review guide
Critical items to verify:
CI Test Results - Standard tests may fail with schema validation errors if the test harness uses a stable airbyte-cdk that doesn't include InferredSchemaLoader. This is expected for a prerelease POC. Check if:
discoverandcheckcommands work (they use the prerelease CDK in the image)Schema Quality - Compare the inferred schema against the original static schema:
airbyte-integrations/connectors/source-pokeapi/manifest.yaml(lines 28-30): New InferredSchemaLoader config#/schemas/pokemon(removed in this PR)["string", "null"]record_sample_size: 50and PokeAPI returning single records, the inferred schema may miss optional fieldsSingle Record Limitation - PokeAPI's pokemon endpoint (
/{{config['pokemon_name']}}) returns a single Pokemon by name, not a list. This means:record_sample_size: 50somewhat meaninglessFiles to review:
airbyte-integrations/connectors/source-pokeapi/metadata.yaml- CDK version bump (line 18)airbyte-integrations/connectors/source-pokeapi/manifest.yaml- Schema loader replacement (lines 28-30)User Impact
For this proof-of-concept:
If this approach were adopted in production:
Can this PR be safely reverted and rolled back?
This is a proof-of-concept PR on a single connector with no production impact. The changes are isolated to source-pokeapi and use a prerelease CDK version.