Commit 8be3ff4

Merge pull request #2888 from oracle-devrel/lakehouse-update
AI Skills added for Lakehouse
2 parents b32b9ff + 9dbaa22 commit 8be3ff4

53 files changed

Lines changed: 3098 additions & 0 deletions

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
Copyright (c) 2026 Oracle and/or its affiliates.

The Universal Permissive License (UPL), Version 1.0

Subject to the condition set forth below, permission is hereby granted to any
person obtaining a copy of this software, associated documentation and/or data
(collectively the "Software"), free of charge and under any and all copyright
rights in the Software, and any and all patent rights owned or freely
licensable by each licensor hereunder covering either (i) the unmodified
Software as contributed to or provided by such licensor, or (ii) the Larger
Works (as defined below), to deal in both

(a) the Software, and
(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
one is included with the Software (each a "Larger Work" to which the Software
is contributed by such licensors),

without restriction, including without limitation the rights to copy, create
derivative works of, display, perform, and distribute the Software and make,
use, sell, offer for sale, import, export, have made, and have sold the
Software and the Larger Work(s), and to sublicense the foregoing rights on
either these or other terms.

This license is subject to the following condition:
The above copyright notice and either this complete permission notice or at
a minimum a reference to the UPL must be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Autonomous AI Lakehouse Skills

This repository contains two community-driven Skills designed to accelerate common workflows when working with Oracle Autonomous AI Lakehouse:

- **AI Lakehouse Ops Skill**
- **AI Lakehouse Data Loader Skill**

These Skills help users automate and simplify operational and data-loading tasks related to AI Lakehouse environments.

**IMPORTANT NOTE:** These Skills are not official Oracle products and are provided as community-driven resources to assist in expediting AI Lakehouse-related workflows. While they aim to be helpful, they are not guaranteed to cover all scenarios or provide complete accuracy. Users are strongly encouraged to consult the official Oracle documentation for definitive guidance and support.

Reviewed: 06.05.2026

# Included Skills

## AI Lakehouse Ops Skill

The **AI Lakehouse Ops Skill** is designed to assist with operational tasks related to Oracle Autonomous AI Lakehouse environments.

It helps users review operational inputs, summarize environment status, identify potential issues, and generate guidance for troubleshooting or validating AI Lakehouse-related configurations.

Typical use cases include:

- Reviewing AI Lakehouse operational logs or diagnostic information
- Summarizing environment status
- Identifying possible configuration or runtime issues
- Generating operational reports
- Providing troubleshooting guidance
- Suggesting next steps for validation or investigation

This Skill is intended to support operations, enablement, and troubleshooting activities, especially in non-production or test environments.

## AI Lakehouse Data Loader Skill

The **AI Lakehouse Data Loader Skill** is designed to assist with loading, preparing, and validating data for AI Lakehouse workflows.

It helps users understand data-loading requirements, prepare data ingestion steps, review source data inputs, and generate guidance for moving data into an AI Lakehouse environment.

Typical use cases include:

- Preparing data-loading workflows
- Reviewing source files, metadata, or schemas
- Generating data ingestion guidance
- Identifying possible data preparation issues
- Creating step-by-step data loading instructions
- Summarizing loaded or pending datasets
- Supporting demos, prototypes, and enablement scenarios

This Skill is intended to accelerate data onboarding and ingestion-related tasks for AI Lakehouse use cases.
# When to use this asset?

Use these Skills when you are working with Oracle Autonomous AI Lakehouse and need assistance with either operational tasks or data-loading workflows.

Use the **AI Lakehouse Ops Skill** when you need to:

- Review operational information
- Analyze logs or configuration snippets
- Generate an operations-oriented summary
- Troubleshoot AI Lakehouse-related issues
- Document findings or recommended next steps

Use the **AI Lakehouse Data Loader Skill** when you need to:

- Prepare data for AI Lakehouse usage
- Review data-loading inputs
- Generate ingestion steps
- Summarize datasets, schemas, or metadata
- Validate data readiness before loading

These Skills are intended for advisory and productivity purposes. They should not replace official Oracle documentation, formal validation, testing, or Oracle support.

# How to use this asset?

Each Skill can be used independently, depending on the task. The Skills are built to work with any agent, such as OpenAI or Claude.
Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
---
name: autonomous-data-loader
description: generate and safely execute oracle autonomous ai lakehouse data loading and oci object storage lakehouse access workflows using dbms_cloud. use when the user wants to list oci object storage files, choose files or prefixes to load, create conservative csv staging tables, generate or run copy_data or copy_collection, tune dbms_cloud format options, load json documents into soda collections, create external tables to query apache iceberg data stored in oci object storage using direct metadata.json or hadoop catalog patterns, monitor user_load_operations or dba_load_operations, inspect logfile_table or badfile_table, troubleshoot rejected rows, reconcile loads, or profile staged data after loading. this skill is mcp-first with generate-only fallback and is scoped to oci object storage and dbms_cloud-based workflows.
---

# Autonomous AI Lakehouse Data Loader

## Purpose

Use this skill to help users load data from OCI Object Storage into Oracle Autonomous AI Lakehouse with `DBMS_CLOUD`, and to create external-table access to Apache Iceberg data stored in OCI Object Storage. The skill is designed for a portable Agent Skill workflow: it can generate SQL/PLSQL for manual execution, or execute through an available MCP SQL tool when connected to the target Autonomous database.
## Core Scope

Handle these workflows:

- Discover objects in OCI Object Storage with `DBMS_CLOUD.LIST_OBJECTS` (see the discovery sketch at the end of this section).
- Normalize Object Storage URIs and choose a file, selected file list, prefix, wildcard, or regex pattern.
- Prefer existing `DBMS_CLOUD` credential names and never request secrets in chat.
- Check whether target tables exist before generating `COPY_DATA`.
- Generate and optionally execute `DBMS_CLOUD.COPY_DATA` for supported file loads into existing relational tables.
- Generate and optionally execute `DBMS_CLOUD.COPY_COLLECTION` for JSON documents into SODA collections.
- For CSV without an existing target table, offer conservative staging from the CSV header using `VARCHAR2(4000)` columns.
- Generate format options for CSV, JSON, Parquet, ORC, and Avro. Treat XML as version-specific and verify official documentation before generating XML load workflows.
- Create and validate external tables that query Apache Iceberg data stored in OCI Object Storage, using only the direct `metadata.json` and HadoopCatalog-on-OCI patterns documented for Autonomous AI Database.
- Monitor and reconcile loads with native `USER_LOAD_OPERATIONS` or `DBA_LOAD_OPERATIONS`.
- Inspect `LOGFILE_TABLE` and `BADFILE_TABLE` after failures or rejected rows.
- Profile staged data after load and propose curated DDL, clearly marked as a proposal.

Do not make Data Pump or `DBMS_CLOUD_PIPELINE` part of any default workflow. Do not add non-OCI Iceberg providers such as Unity, Polaris, AWS Glue, S3, Azure, or GCS to the default workflow.
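As a minimal sketch of the discovery step, assuming an existing credential named `OBJ_STORE_CRED` and a placeholder bucket URI (both illustrative):

```sql
-- Read-only discovery: list candidate objects under a prefix before planning a load.
-- Replace the credential name and URI with real values.
select object_name, bytes, last_modified
  from dbms_cloud.list_objects(
         credential_name => 'OBJ_STORE_CRED',
         location_uri    => 'https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/<bucket>/o/raw/sales/'
       )
 where bytes > 0
 order by object_name;
```

Filtering on `bytes > 0` mirrors the guardrail about skipping zero-byte marker files.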
## Execution Model

Default to MCP-enabled execution when a SQL execution tool is available. If no MCP SQL tool is available, use generate-only mode.

### MCP-enabled mode

- Use the available MCP SQL execution tool for read-only inspection queries.
- Do not assume a specific tool name. Prefer the SQL tool connected to the target Autonomous AI Lakehouse database.
- Execute read-only checks directly when useful: dictionary queries, `LIST_OBJECTS`, load-history queries, log and badfile inspection, and Iceberg external-table sanity checks such as `COUNT(*)` (see the sketch after this list).
- For mutating operations, generate the SQL/PLSQL first, explain the impact, and require approval before execution.
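For example, a minimal read-only pre-check of target table existence (the table name is illustrative):

```sql
-- Dictionary check: does the intended COPY_DATA target already exist in this schema?
select table_name
  from user_tables
 where table_name = upper('STG_SALES_CSV');
```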
### Generate-only mode

- Generate SQL/PLSQL and ask the user to execute it manually in their preferred Oracle client.
- Ask the user to paste results back when the next step depends on inspection output.

## Approval Policy

Support two approval styles:

- **Strict approval**: ask before every mutating operation. This is the default.
- **Batch approval**: show the complete non-destructive mutating plan first, then execute the approved plan. Use only when the user asks for batch approval or clearly approves the entire plan.

Always require strict approval for destructive operations, even when batch approval is active.

Mutating operations include:

- `CREATE TABLE`, `ALTER TABLE`, `CREATE COLLECTION` patterns, and similar DDL.
- `DBMS_CLOUD.COPY_DATA`.
- `DBMS_CLOUD.COPY_COLLECTION` (it may create a missing SODA collection, so treat it as mutating even before any rows or documents are loaded).
- `DBMS_CLOUD.CREATE_CREDENTIAL`.
- `DBMS_CLOUD.CREATE_EXTERNAL_TABLE` for Iceberg access.
- `DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE` for Iceberg Object Storage ACL setup.
- `INSERT`, `UPDATE`, `DELETE`, `MERGE`.

Destructive operations include:

- `DROP TABLE`.
- `TRUNCATE TABLE`.
- `ALTER TABLE DROP COLUMN`.
- `DELETE` without a narrowly scoped predicate.
- Replacing, truncating, or recreating an existing staging table.

Prefer non-destructive alternatives, such as a new staging table name, before recommending destructive cleanup.
## Guardrails

- Never ask users to paste secrets, API keys, auth tokens, private keys, or passwords into the prompt.
- Prefer an existing `DBMS_CLOUD` credential name.
- If a credential is missing, generate a `CREATE_CREDENTIAL` template with placeholders (see the template after this list) and warn users to replace the placeholders outside the chat.
- Do not infer a final CSV business schema from a filename, bucket, or folder alone.
- For CSV without a target table, ask whether the user wants conservative staging, a user-provided schema, or profiling first.
- Do not mix formats in a single `COPY_DATA` operation.
- Do not load from a whole prefix until object discovery shows the files are homogeneous enough.
- Exclude marker/control files such as `_SUCCESS`, `.crc`, manifests, readme files, and zero-byte files unless the user explicitly requests otherwise.
- Treat generated curated DDL as proposed until the user approves it.
- For Iceberg workflows, keep the scope to OCI Object Storage only and generate only external table access patterns; do not treat Iceberg as a `COPY_DATA` load.
- For Iceberg direct metadata, warn that the table points to a specific `metadata.json` snapshot and may need to be recreated after snapshot or schema changes.
- For Iceberg HadoopCatalog on OCI, require the lakehouse folder URI and `iceberg_table_path`.
- Warn users about documented Iceberg limitations before creating an external table: fixed external-table schema, no query-time time travel, unsupported merge-on-read delete files, and provider/version-specific restrictions.
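A hedged template for the missing-credential case; every angle-bracket value is a placeholder that the user fills in outside the chat:

```sql
-- Template only: never paste real secrets into the conversation.
-- Replace the placeholders in a trusted SQL client before running.
begin
  dbms_cloud.create_credential(
    credential_name => '<credential_name>',
    username        => '<oci_username>',
    password        => '<auth_token_created_in_the_oci_console>'
  );
end;
/
```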
## Workflow Decision Tree

1. Identify the source request:

- bucket or prefix discovery: use `references/object-discovery-and-selection.md`.
- direct relational table load: use `references/copy-data.md`.
- JSON document collection load: use `references/copy-collection-json.md`.
- Apache Iceberg data stored in OCI Object Storage: use `references/iceberg-oci-object-storage.md`.
- failed load or rejected rows: use `references/monitoring-and-troubleshooting.md`.
- CSV with no target table: use `references/csv-staging-and-profiling.md`.

2. Collect minimum inputs:

- `credential_name` or instruction to create one.
- OCI Object Storage URI, bucket/prefix, or exact file URI.
- target table or collection name, unless the user wants discovery/planning only.
- format or enough evidence to infer the format from selected object names.
- for Iceberg: external table name, credential name, OCI Object Storage URI for `metadata.json` or lakehouse folder, and optionally `iceberg_table_path` for HadoopCatalog.

3. Run read-only pre-checks when MCP is available:

- list object candidates with `DBMS_CLOUD.LIST_OBJECTS`.
- check target table or collection existence.
- inspect target columns when loading into a relational table.
- inspect recent load history when troubleshooting.
- inspect Iceberg metadata file or lakehouse folder candidates when building an Iceberg external table.

4. Plan the load or access pattern:

- choose exact file list, prefix/wildcard, or regex pattern.
- select `COPY_DATA`, `COPY_COLLECTION`, or `CREATE_EXTERNAL_TABLE` for Iceberg query access.
- select format options or Iceberg access protocol configuration.
- decide direct load versus user-named staging.

5. For mutating operations:

- present the SQL/PLSQL.
- explain the risk.
- ask for strict or batch approval.
- execute only after approval if MCP is available.

6. After execution:

- query `USER_LOAD_OPERATIONS` or `DBA_LOAD_OPERATIONS` for load operations (see the monitoring sketch after this list).
- for Iceberg external tables, run a read-only sanity check such as `SELECT COUNT(*)` and inspect table columns.
- reconcile status, operation ID, start/update time, log table, badfile table, and row counts where possible.
- if failed or rejected rows are present, switch to troubleshooting.
- if CSV staging was used, offer post-load profiling and a curated DDL proposal.
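As a sketch of the post-execution check, assuming the load targeted a table named `STG_SALES_CSV` (illustrative):

```sql
-- Read-only reconciliation of recent load operations for one target table.
select id, type, status, start_time, update_time,
       rows_loaded, logfile_table, badfile_table
  from user_load_operations
 where table_name = 'STG_SALES_CSV'
 order by start_time desc
 fetch first 5 rows only;
```

The `logfile_table` and `badfile_table` values name per-operation tables that can be queried directly when rows are rejected.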
## Response Style

Be flexible and concise. Do not force every answer into a rigid template. For mutating or destructive operations, always clearly show:

- what will change,
- the SQL/PLSQL involved,
- whether approval is required,
- how to monitor the result,
- and how to troubleshoot failures.

For Iceberg external-table access, clearly state that the operation creates query access to data in Object Storage; it does not copy the Iceberg data into Autonomous.
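A hedged sketch of the direct `metadata.json` pattern; the table name, credential, and URI are placeholders, and the exact format options should be verified against `references/iceberg-oci-object-storage.md` and the Oracle documentation before use:

```sql
-- Creates query access only; no Iceberg data is copied into the database.
begin
  dbms_cloud.create_external_table(
    table_name      => 'SALES_ICEBERG_EXT',
    credential_name => 'OBJ_STORE_CRED',
    file_uri_list   => 'https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/<bucket>/o/sales/metadata/v2.metadata.json',
    format          => '{"access_protocol": {"protocol_type": "iceberg"}}'
  );
end;
/

-- Read-only sanity check after creation.
select count(*) from SALES_ICEBERG_EXT;
```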
## References

Use these files when relevant:

- `references/oracle-docs-index.md` for official Oracle documentation links.
- `references/version-notes.md` for the v0.1 scope and release notes.
- `references/minimum-inputs.md` for minimum required inputs by workflow.
- `references/object-discovery-and-selection.md` for object listing and file selection.
- `references/source-and-credentials.md` for OCI Object Storage credentials and URI patterns.
- `references/copy-data.md` for relational table loads with `DBMS_CLOUD.COPY_DATA`.
- `references/copy-collection-json.md` for JSON document loads into SODA collections.
- `references/format-options.md` for format option guidance.
- `references/iceberg-oci-object-storage.md` for querying Iceberg data in OCI Object Storage with external tables.
- `references/csv-staging-and-profiling.md` for CSV staging and post-load profiling.
- `references/monitoring-and-troubleshooting.md` for load monitoring, logs, badfiles, and retry guidance.
- `references/mcp-execution.md` for MCP-first execution behavior.

Use examples in `examples/` only when the user needs a concrete pattern.
@@ -0,0 +1,3 @@
interface:
  display_name: "Autonomous AI Lakehouse Data Loader"
  short_description: "Load OCI Object Storage data and create Iceberg query access with DBMS_CLOUD."
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# Example: CSV conservative staging

User request:

> There is a CSV in this bucket. I do not have a table. Load it.

Recommended response:

- Explain that CSV does not provide reliable types.
- List objects first.
- Read or ask for the header.
- Ask the user for a staging table name.
- Create a staging table with all CSV columns as `VARCHAR2(4000)`.
- Load with `COPY_DATA` and `skipheaders => 1`.
- Profile after load.

Assume header:

```text
order_id,customer_id,order_date,amount,currency
```

Proposed staging table:

```sql
create table STG_SALES_CSV (
  ORDER_ID    varchar2(4000),
  CUSTOMER_ID varchar2(4000),
  ORDER_DATE  varchar2(4000),
  AMOUNT      varchar2(4000),
  CURRENCY    varchar2(4000)
);
```

Load:

```sql
declare
  l_operation_id number;
begin
  dbms_cloud.copy_data(
    table_name      => 'STG_SALES_CSV',
    credential_name => 'OBJ_STORE_CRED',
    file_uri_list   => 'https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/<bucket>/o/raw/sales/sales.csv',
    format          => json_object(
      'type' value 'csv',
      'skipheaders' value 1,
      'delimiter' value ',',
      'quote' value '"',
      'blankasnull' value 'true',
      'rejectlimit' value 100,
      'enablelogs' value 'true',
      'logretention' value 7
    ),
    operation_id    => l_operation_id
  );
  dbms_output.put_line('operation_id=' || l_operation_id);
end;
/
```

Post-load profiling:

```sql
select count(*) as row_count from STG_SALES_CSV;

select
  max(length(ORDER_ID)) as order_id_max_len,
  max(length(CUSTOMER_ID)) as customer_id_max_len,
  max(length(CURRENCY)) as currency_max_len
from STG_SALES_CSV;
```
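If profiling shows clean values, a curated table can then be proposed; the types below are purely illustrative and must be confirmed against the profiled data before approval:

```sql
-- Proposal only: review and approve before creating the curated table and reloading.
create table SALES (
  ORDER_ID    number,
  CUSTOMER_ID number,
  ORDER_DATE  date,
  AMOUNT      number(12,2),
  CURRENCY    varchar2(3 char)
);
```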
