Commit 7a265d7

Merge pull request #80 from marancibia/main
New delta sharing example
2 parents a61f0d0 + 8f6551d commit 7a265d7

3 files changed

+295
-1
lines changed


apex/Ask-Oracle/README.md

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@ models.
 
 ## Ask Oracle overview
 
-Now in version 3, the Ask Oracle chatbot supports a broad range of
+Now in version 4, the Ask Oracle chatbot supports a broad range of
 functionality for Select AI:
 
 - **Chat** -- direct interaction with the LLM specified in your selected

delta-sharing/README.md

Lines changed: 44 additions & 0 deletions

@@ -0,0 +1,44 @@
# Delta Sharing

## Overview

Oracle Autonomous Database supports versioned shares through the open Delta Sharing protocol. Providers publish data from Autonomous Database, and recipients access shares using a JSON profile and query Parquet data for a selected version window. Oracle also provides a fully scriptable sharing workflow through the `DBMS_SHARE` package.

## Files

- `./change-data-feed/Oracle Delta Sharing CDF.ipynb` — Python code that compares two versions of a Delta Share and prints the raw change rows returned for that version window.

## What the notebook shows

This notebook demonstrates a file-based CDF-style workflow for versioned Delta Shares published by Oracle Autonomous Database.

The notebook:

- authenticates with a Delta Sharing profile
- requests changes between `START_VERSION` and `END_VERSION`
- downloads only the Parquet/action files returned for that version window
- displays raw rows together with `_commit_version` and `_change_type`

This makes it useful for validating what changed between two published versions of a share without scanning the full share.
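The table coordinates above combine into a single URL in the `<profile>#<share>.<schema>.<table>` form defined by the Delta Sharing profile convention. A minimal sketch, with placeholder names standing in for real share coordinates:

```python
# Delta Sharing table URL convention: <profile path>#<share>.<schema>.<table>
# All names below are placeholders for illustration, not real coordinates.
PROFILE_PATH = "/Workspace/Users/me/oracle.share"
SHARE_NAME = "MY_SHARE"
SCHEMA_NAME = "MY_SCHEMA"
TABLE_NAME = "MY_TABLE"

table_url = f"{PROFILE_PATH}#{SHARE_NAME}.{SCHEMA_NAME}.{TABLE_NAME}"
print(table_url)
# /Workspace/Users/me/oracle.share#MY_SHARE.MY_SCHEMA.MY_TABLE
```

This is the URL format the `delta_sharing` Python client accepts for table-level operations.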
## Important behavior

This sample operates at the file level.

When a file changes between two versions:

- rows from the previous file can appear as `delete`
- rows from the replacement file can appear as `insert`

As a result, unchanged rows inside a replaced file can appear as matching delete/insert pairs. The notebook intentionally shows the raw output so downstream logic can derive the net inserts, deletes, and updates.
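The pair cancellation described above can be sketched in pandas. The column names follow the notebook's `_change_type` output; the sample data is invented for illustration (inserting one new row into a two-row file):

```python
import pandas as pd

# Raw file-level CDF output: replacing a 2-row file to add 1 row yields
# 2 deletes (old file) + 3 inserts (new file). Data here is illustrative.
raw = pd.DataFrame(
    {
        "id": [1, 2, 1, 2, 3],
        "val": ["a", "b", "a", "b", "c"],
        "_change_type": ["delete", "delete", "insert", "insert", "insert"],
    }
)

key_cols = ["id", "val"]  # data columns that identify a row's content
deletes = raw[raw["_change_type"] == "delete"][key_cols]
inserts = raw[raw["_change_type"] == "insert"][key_cols]

# Cancel matching delete/insert pairs: rows present on both sides are
# file-replacement artifacts, not real changes.
net_inserts = inserts.merge(deletes, on=key_cols, how="left", indicator=True)
net_inserts = net_inserts[net_inserts["_merge"] == "left_only"][key_cols]

net_deletes = deletes.merge(inserts, on=key_cols, how="left", indicator=True)
net_deletes = net_deletes[net_deletes["_merge"] == "left_only"][key_cols]

print(net_inserts)  # only the genuinely new row: id=3, val="c"
print(net_deletes)  # empty: nothing was really deleted
```

Note this anti-join approach assumes rows are unique within a file; duplicated rows would need count-aware cancellation.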
In practice, this means that if a share is large but only a small incremental change was published, the notebook reads only the files returned for the requested version window rather than scanning the full share.

## References

- [Overview of the Data Share Tool](https://docs.oracle.com/en/cloud/paas/autonomous-database/serverless/adbsb/overview-adp-share.html)
- [Manage Shares with DBMS_SHARE](https://docs.oracle.com/en-us/iaas/autonomous-database-serverless/doc/manage-shares.html)
- [DBMS_SHARE Constants](https://docs.oracle.com/en/cloud/paas/autonomous-database/serverless/adbsb/dbms-share-package-constants.html)
- [High-Level Steps for Receiving Shares for Versioned Data](https://docs.oracle.com/en/database/oracle/sql-developer-web/sdwfd/high-level-steps-recieving-data-shares-versioned-data.html)
- [Sharing Data from On-Premise Oracle Databases](https://blogs.oracle.com/autonomous-ai-database/sharing-data-from-onpremise-oracle-databases)
- [Seamless, Open Data Sharing Between Oracle Autonomous Database and Databricks](https://blogs.oracle.com/autonomous-ai-database/open-data-sharing-between-oracle-and-databricks)
delta-sharing/change-data-feed/Oracle Delta Sharing CDF.ipynb

Lines changed: 250 additions & 0 deletions

@@ -0,0 +1,250 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {},
          "inputWidgets": {},
          "nuid": "47d95cd3-9819-4136-8c6e-da12f04a96ac",
          "showTitle": true,
          "tableResultSettingsMap": {},
          "title": "Overview"
        }
      },
      "source": [
        "# Oracle Delta Sharing CDF Smoke Test\n",
        "\n",
        "This notebook reads **Change Data Feed (CDF)** from an Oracle Autonomous Database\n",
        "that publishes a Delta Sharing endpoint, and displays the raw change rows.\n",
        "\n",
        "## Why a custom REST approach?\n",
        "\n",
        "| Problem | Root cause | Workaround |\n",
        "|---|---|---|\n",
        "| `spark.read.format(\"deltaSharing\")` throws `InvocationTargetException` | Spark's Java Delta Sharing connector is incompatible with Oracle endpoints on serverless compute | Use the Python `delta_sharing` REST client instead |\n",
        "| `load_table_changes_as_pandas()` throws `KeyError: '_commit_timestamp'` | Oracle's file-level CDF omits the `_commit_timestamp` column that the library expects | Call the REST API directly and parse with `DeltaSharingReader._to_pandas()` |\n",
        "| `spark.read.parquet(*urls)` throws `UNSUPPORTED_FILE_SYSTEM` | Spark can't read HTTPS pre-signed URLs from Oracle object storage | Download via HTTP with `_to_pandas()`, then convert to Spark DataFrame |\n",
        "\n",
        "## Oracle's file-level CDF\n",
        "\n",
        "Oracle implements CDF at the **file level**, not the row level. When *any* row in a\n",
        "data file changes, Oracle replaces the **entire file**. The CDF response therefore\n",
        "shows:\n",
        "- **All rows from the old file** as `delete`\n",
        "- **All rows from the new file** as `insert`\n",
        "\n",
        "Unchanged rows appear as matching DELETE + INSERT pairs (file-level artifacts).\n",
        "To derive the true net changes, cancel out identical pairs — this notebook shows\n",
        "the raw output so you can observe this behavior directly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {},
          "inputWidgets": {},
          "nuid": "55fe12eb-a206-4f2b-a37d-97fe0d113123",
          "showTitle": true,
          "tableResultSettingsMap": {},
          "title": "Install dependencies"
        }
      },
      "outputs": [],
      "source": [
        "%pip install delta-sharing -q"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": {},
          "inputWidgets": {},
          "nuid": "351d8f42-1508-4d2d-9973-22ef9849f51f",
          "showTitle": true,
          "tableResultSettingsMap": {},
          "title": "Read CDF and display raw changes"
        }
      },
      "outputs": [],
      "source": [
        "# ---------------------------------------------------------------------------\n",
        "# Oracle Delta Sharing CDF Smoke Test\n",
        "# ---------------------------------------------------------------------------\n",
        "# Reads raw CDF (Change Data Feed) for a single version window from an\n",
        "# Oracle Autonomous Database Delta Sharing endpoint and prints the rows.\n",
        "#\n",
        "# This script does NOT write anything to a Delta table — it is purely\n",
        "# diagnostic, useful for inspecting what Oracle's file-level CDF returns\n",
        "# before applying changes downstream.\n",
        "# ---------------------------------------------------------------------------\n",
        "\n",
        "import json\n",
        "import time\n",
        "\n",
        "import pandas as pd\n",
        "import delta_sharing\n",
        "from delta_sharing.protocol import DeltaSharingProfile, Table\n",
        "from delta_sharing.reader import CdfOptions, DeltaSharingReader, to_converters\n",
        "from delta_sharing.rest_client import DataSharingRestClient\n",
        "from pyspark.sql import functions as F\n",
        "\n",
        "# ---------------------------------------------------------------------------\n",
        "# Configuration — replace placeholders with your own values\n",
        "# ---------------------------------------------------------------------------\n",
        "\n",
        "# Path to the .share profile file on your Databricks workspace.\n",
        "# The profile contains the Oracle sharing endpoint URL and credentials.\n",
        "# See: https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#profile-file-format\n",
        "PROFILE_PATH = \"/Workspace/Users/<your-email>/<your-profile>.share\"\n",
        "\n",
        "# Coordinates of the shared table (as listed by delta_sharing.SharingClient).\n",
        "SHARE_NAME = \"<YOUR_SHARE>\"\n",
        "SCHEMA_NAME = \"<YOUR_SCHEMA>\"\n",
        "TABLE_NAME = \"<YOUR_TABLE>\"\n",
        "\n",
        "# CDF version window to inspect.\n",
        "# Set both to the same value to see a single version's changes.\n",
        "START_VERSION = 1\n",
        "END_VERSION = 2\n",
        "\n",
        "# How many sample rows to print.\n",
        "SHOW_SAMPLE_ROWS = 50\n",
        "\n",
        "# ---------------------------------------------------------------------------\n",
        "# Read CDF via Delta Sharing REST API\n",
        "# ---------------------------------------------------------------------------\n",
        "# We bypass the high-level library functions (load_table_changes_as_pandas,\n",
        "# spark.read.format(\"deltaSharing\")) because:\n",
        "#   1. Oracle omits _commit_timestamp → KeyError in the library\n",
        "#   2. Spark's Java connector throws InvocationTargetException on serverless\n",
        "#   3. Pre-signed HTTPS URLs can't be read by spark.read.parquet()\n",
        "#\n",
        "# Instead we:\n",
        "#   a) Call list_table_changes() on the REST client to get pre-signed URLs\n",
        "#   b) Download each Parquet file via HTTP with _to_pandas()\n",
        "#   c) Concatenate into a single pandas DataFrame, then convert to Spark\n",
        "# ---------------------------------------------------------------------------\n",
        "\n",
        "table_url = f\"{PROFILE_PATH}#{SHARE_NAME}.{SCHEMA_NAME}.{TABLE_NAME}\"\n",
        "print(f\"[info] table_url={table_url}\")\n",
        "print(f\"[info] requested CDF window={START_VERSION} -> {END_VERSION}\")\n",
        "\n",
        "t0 = time.time()\n",
        "\n",
        "profile = DeltaSharingProfile.read_from_file(PROFILE_PATH)\n",
        "rest = DataSharingRestClient(profile)\n",
        "table_obj = Table(name=TABLE_NAME, share=SHARE_NAME, schema=SCHEMA_NAME)\n",
        "\n",
        "cdf_opts = CdfOptions(starting_version=START_VERSION, ending_version=END_VERSION)\n",
        "response = rest.list_table_changes(table_obj, cdf_opts)\n",
        "\n",
        "# Parse the schema returned by Oracle so _to_pandas() can cast columns.\n",
        "schema_json = json.loads(response.metadata.schema_string)\n",
        "converters = to_converters(schema_json)\n",
        "\n",
        "# Download every Parquet action file returned in the CDF response.\n",
        "# Each action corresponds to one physical file on Oracle object storage.\n",
        "pdfs = []\n",
        "for action in response.actions:\n",
        "    pdfs.append(\n",
        "        DeltaSharingReader._to_pandas(\n",
        "            action,\n",
        "            converters,\n",
        "            True,  # for_cdf — adds _change_type / _commit_version columns\n",
        "            None,  # limit\n",
        "            True,  # use_delta_format\n",
        "        )\n",
        "    )\n",
        "\n",
        "elapsed = time.time() - t0\n",
        "\n",
        "# ---------------------------------------------------------------------------\n",
        "# Display results\n",
        "# ---------------------------------------------------------------------------\n",
        "\n",
        "if not pdfs:\n",
        "    print(f\"[info] elapsed_seconds={elapsed:.2f}\")\n",
        "    print(\"[result] empty CDF response for requested version window\")\n",
        "else:\n",
        "    cdf_pdf = pd.concat(pdfs, ignore_index=True)\n",
        "    df = spark.createDataFrame(cdf_pdf)\n",
        "\n",
        "    # Safety filter: ensure we only look at the requested version range.\n",
        "    if \"_commit_version\" in df.columns:\n",
        "        df = df.filter(\n",
        "            (F.col(\"_commit_version\") >= F.lit(START_VERSION))\n",
        "            & (F.col(\"_commit_version\") <= F.lit(END_VERSION))\n",
        "        )\n",
        "\n",
        "    raw_count = df.count()\n",
        "    elapsed = time.time() - t0\n",
        "\n",
        "    print(f\"[info] elapsed_seconds={elapsed:.2f}\")\n",
        "    print(f\"[info] raw_row_count={raw_count:,}\")\n",
        "    print(f\"[info] columns={df.columns}\")\n",
        "\n",
        "    if raw_count == 0:\n",
        "        print(\"[result] empty CDF response for requested version window\")\n",
        "    else:\n",
        "        # --- Summary: row counts by version and change type ----------------\n",
        "        if \"_commit_version\" in df.columns and \"_change_type\" in df.columns:\n",
        "            print(\"[result] rows by commit version and change type\")\n",
        "            (\n",
        "                df.groupBy(\"_commit_version\", \"_change_type\")\n",
        "                .count()\n",
        "                .orderBy(\"_commit_version\", \"_change_type\")\n",
        "                .show(100, truncate=False)\n",
        "            )\n",
        "\n",
        "        # --- Sample rows (truncated for readability) -----------------------\n",
        "        # NOTE: Because Oracle uses file-level CDF, you will typically see\n",
        "        # more rows than the actual number of changed rows. For example,\n",
        "        # inserting 1 row into a file that already has 2 rows produces:\n",
        "        #   2 deletes (old file: all existing rows)\n",
        "        #   3 inserts (new file: existing rows + the new row)\n",
        "        # The 2 matching DELETE+INSERT pairs are unchanged — only the\n",
        "        # extra INSERT is the real change.\n",
        "        print(\"[result] sample rows\")\n",
        "        order_cols = []\n",
        "        if \"_commit_version\" in df.columns:\n",
        "            order_cols.append(F.col(\"_commit_version\").desc())\n",
        "        if \"_change_type\" in df.columns:\n",
        "            order_cols.append(F.col(\"_change_type\"))\n",
        "\n",
        "        if order_cols:\n",
        "            df.orderBy(*order_cols).show(SHOW_SAMPLE_ROWS, truncate=True)\n",
        "        else:\n",
        "            df.show(SHOW_SAMPLE_ROWS, truncate=True)"
      ]
    }
  ],
  "metadata": {
    "application/vnd.databricks.v1+notebook": {
      "computePreferences": null,
      "dashboards": [],
      "environmentMetadata": null,
      "inputWidgetPreferences": null,
      "language": "python",
      "notebookMetadata": {
        "pythonIndentUnit": 4
      },
      "notebookName": "Oracle Delta Sharing CDF Smoke Test GitHub",
      "widgets": {}
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
