
Commit b3ac038

xinlian12 and Copilot authored

Add throughput bucket samples for Cosmos Spark connector (#48734)

Add Python (.ipynb) and Scala sample notebooks demonstrating server-side throughput bucket configuration as an alternative to SDK-based global throughput control. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1 parent 822df77 commit b3ac038

File tree

2 files changed: +626 −0 lines changed
Lines changed: 348 additions & 0 deletions
@@ -0,0 +1,348 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "**Secrets**\n\nThe secrets below, like the Cosmos account key, are retrieved from a secret scope. If you haven't defined a secret scope for the Cosmos account you want to use with this sample, you can find instructions on how to create one here:\n- [Create a new secret scope](./#secrets/createScope) for the current Databricks workspace\n  - See how to create an [Azure Key Vault backed secret scope](https://docs.microsoft.com/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope)\n  - See how to create a [Databricks backed secret scope](https://docs.microsoft.com/azure/databricks/security/secrets/secret-scopes#create-a-databricks-backed-secret-scope)\n- Learn how to [add secrets to your Spark configuration](https://docs.microsoft.com/azure/databricks/security/secrets/secrets#read-a-secret)\n\nIf you don't want to use secrets at all, you can of course also just assign the values in clear text below - but for obvious reasons we recommend using secrets."
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "cosmosEndpoint = spark.conf.get(\"spark.cosmos.accountEndpoint\")\ncosmosMasterKey = spark.conf.get(\"spark.cosmos.accountKey\")"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "b2c3d4e5-f6a7-8901-bcde-f12345678901"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Preparation - creating the Cosmos DB container to ingest the data into**"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "c3d4e5f6-a7b8-9012-cdef-123456789012"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Configure the Catalog API to be used"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "d4e5f6a7-b8c9-0123-defa-234567890123"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import uuid\nspark.conf.set(\"spark.sql.catalog.cosmosCatalog\", \"com.azure.cosmos.spark.CosmosCatalog\")\nspark.conf.set(\"spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint\", cosmosEndpoint)\nspark.conf.set(\"spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey\", cosmosMasterKey)\nspark.conf.set(\"spark.sql.catalog.cosmosCatalog.spark.cosmos.views.repositoryPath\", \"/viewDefinitions\" + str(uuid.uuid4()))\n"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "e5f6a7b8-c9d0-1234-efab-345678901234"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now execute the command to create the new container with a throughput of up to 100,000 RU (autoscale, so 10,000 - 100,000 RU based on scale) and only system properties (like /id) being indexed.\n\n**Note:** Unlike SDK-based throughput control, throughput buckets do NOT require a separate metadata container (ThroughputControl) because they are managed server-side."
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "f6a7b8c9-d0e1-2345-fabc-456789012345"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%sql\nCREATE DATABASE IF NOT EXISTS cosmosCatalog.SampleDatabase;\n\nCREATE TABLE IF NOT EXISTS cosmosCatalog.SampleDatabase.GreenTaxiRecords\nUSING cosmos.oltp\nTBLPROPERTIES(partitionKeyPath = '/id', autoScaleMaxThroughput = '100000', indexingPolicy = 'OnlySystemProperties');\n\nCREATE TABLE IF NOT EXISTS cosmosCatalog.SampleDatabase.GreenTaxiRecordsCFSink\nUSING cosmos.oltp\nTBLPROPERTIES(partitionKeyPath = '/id', autoScaleMaxThroughput = '100000', indexingPolicy = 'OnlySystemProperties');"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "a7b8c9d0-e1f2-3456-abcd-567890123456"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Preparation - loading data source \"[NYC Taxi & Limousine Commission - green taxi trip records](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/)\"**\n\nThe green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. This data set has over 80 million records (>8 GB) of data and is available via a publicly accessible Azure Blob Storage Account located in the East-US Azure region."
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "b8c9d0e1-f2a3-4567-bcde-678901234567"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import datetime\nimport time\nimport uuid\nfrom pyspark.sql.functions import udf\nfrom pyspark.sql.types import StringType, LongType\n\nprint(\"Starting preparation: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\n# Azure storage access info\nblob_account_name = \"azureopendatastorage\"\nblob_container_name = \"nyctlc\"\nblob_relative_path = \"green\"\nblob_sas_token = r\"\"\n# Allow Spark to read from Blob storage remotely\nwasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)\nspark.conf.set(\n  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),\n  blob_sas_token)\nprint('Remote blob path: ' + wasbs_path)\n# Spark reads the parquet lazily - no data is loaded at this point\n# NOTE - if you want to experiment with larger dataset sizes, consider switching to Option B\n# (commenting the code for Option A/uncommenting the code for Option B below) or increase the\n# value passed into the limit function restricting the dataset size\n\n#------------------------------------------------------------------------------------\n# Option A - with limited dataset size\n#------------------------------------------------------------------------------------\ndf_rawInputWithoutLimit = spark.read.parquet(wasbs_path)\npartitionCount = df_rawInputWithoutLimit.rdd.getNumPartitions()\ndf_rawInput = df_rawInputWithoutLimit.limit(1_000_000).repartition(partitionCount)\ndf_rawInput.persist()\n\n#------------------------------------------------------------------------------------\n# Option B - entire dataset\n#------------------------------------------------------------------------------------\n#df_rawInput = spark.read.parquet(wasbs_path)\n\n# Adding an id column with unique values and an insertedAt timestamp\nuuidUdf = udf(lambda: str(uuid.uuid4()), StringType())\nnowUdf = udf(lambda: int(time.time() * 1000), LongType())\ndf_input_withId = df_rawInput \\\n  .withColumn(\"id\", uuidUdf()) \\\n  .withColumn(\"insertedAt\", nowUdf())\n\nprint('Register the DataFrame as a SQL temporary view: source')\ndf_input_withId.createOrReplaceTempView('source')\nprint(\"Finished preparation: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "c9d0e1f2-a3b4-5678-cdef-789012345678"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Sample - ingesting the NYC Green Taxi data into Cosmos DB using a throughput bucket**\n\nThroughput buckets provide server-side throughput control. Instead of using the SDK-based global throughput control (which requires a separate metadata container), you configure a `throughputBucket` value between 1 and 5.\n\nThis is simpler to configure because it does not require a separate throughput control metadata container.\nFor more information, see [Throughput Buckets](https://learn.microsoft.com/azure/cosmos-db/throughput-buckets?tabs=dotnet)."
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "d0e1f2a3-b4c5-6789-defa-890123456789"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import uuid\nimport datetime\n\nprint(\"Starting ingestion: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\n\nwriteCfg = {\n  \"spark.cosmos.accountEndpoint\": cosmosEndpoint,\n  \"spark.cosmos.accountKey\": cosmosMasterKey,\n  \"spark.cosmos.database\": \"SampleDatabase\",\n  \"spark.cosmos.container\": \"GreenTaxiRecords\",\n  \"spark.cosmos.write.strategy\": \"ItemOverwrite\",\n  \"spark.cosmos.write.bulk.enabled\": \"true\",\n  \"spark.cosmos.throughputControl.enabled\": \"true\",\n  \"spark.cosmos.throughputControl.name\": \"NYCGreenTaxiDataIngestion\",\n  \"spark.cosmos.throughputControl.throughputBucket\": \"5\",\n}\n\ndf_NYCGreenTaxi_Input = spark.sql('SELECT * FROM source')\n\ndf_NYCGreenTaxi_Input \\\n  .write \\\n  .format(\"cosmos.oltp\") \\\n  .mode(\"Append\") \\\n  .options(**writeCfg) \\\n  .save()\n\nprint(\"Finished ingestion: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "e1f2a3b4-c5d6-7890-efab-901234567890"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Getting the reference record count**"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "f2a3b4c5-d6e7-8901-fabc-012345678901"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "count_source = spark.sql('SELECT * FROM source').count()\nprint(\"Number of records in source: \", count_source)"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "a3b4c5d6-e7f8-9012-abcd-123456789012"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Sample - validating the record count via query**"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "b4c5d6e7-f8a9-0123-bcde-234567890123"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from pyspark.sql.types import *\nimport pyspark.sql.functions as F\n\nprint(\"Starting validation via query: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\nreadCfg = {\n  \"spark.cosmos.accountEndpoint\": cosmosEndpoint,\n  \"spark.cosmos.accountKey\": cosmosMasterKey,\n  \"spark.cosmos.database\": \"SampleDatabase\",\n  \"spark.cosmos.container\": \"GreenTaxiRecords\",\n  \"spark.cosmos.read.partitioning.strategy\": \"Restrictive\",\n  \"spark.cosmos.read.inferSchema.enabled\": \"false\",\n  \"spark.cosmos.read.customQuery\": \"SELECT COUNT(0) AS Count FROM c\"\n}\n\ncount_query_schema = StructType(fields=[StructField(\"Count\", LongType(), True)])\nquery_df = spark.read.format(\"cosmos.oltp\").schema(count_query_schema).options(**readCfg).load()\ncount_query = query_df.select(F.sum(\"Count\").alias(\"TotalCount\")).first()[\"TotalCount\"]\nprint(\"Number of records retrieved via query: \", count_query)\nprint(\"Finished validation via query: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\n\nassert count_source == count_query"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "c5d6e7f8-a9b0-1234-cdef-345678901234"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Sample - validating the record count via change feed**"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "d6e7f8a9-b0c1-2345-defa-456789012345"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"Starting validation via change feed: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\nchangeFeedCfg = {\n  \"spark.cosmos.accountEndpoint\": cosmosEndpoint,\n  \"spark.cosmos.accountKey\": cosmosMasterKey,\n  \"spark.cosmos.database\": \"SampleDatabase\",\n  \"spark.cosmos.container\": \"GreenTaxiRecords\",\n  \"spark.cosmos.read.partitioning.strategy\": \"Restrictive\",\n  \"spark.cosmos.read.inferSchema.enabled\": \"false\",\n  \"spark.cosmos.changeFeed.startFrom\": \"Beginning\",\n  \"spark.cosmos.changeFeed.mode\": \"Incremental\"\n}\nchangeFeed_df = spark.read.format(\"cosmos.oltp.changeFeed\").options(**changeFeedCfg).load()\ncount_changeFeed = changeFeed_df.count()\nprint(\"Number of records retrieved via change feed: \", count_changeFeed)\nprint(\"Finished validation via change feed: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\n\nassert count_source == count_changeFeed"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "e7f8a9b0-c1d2-3456-efab-567890123456"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Sample - bulk deleting documents with a throughput bucket and validating the document count afterwards**"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "f8a9b0c1-d2e3-4567-fabc-678901234567"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import math\n\nprint(\"Starting to identify to-be-deleted documents: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\nreadCfg = {\n  \"spark.cosmos.accountEndpoint\": cosmosEndpoint,\n  \"spark.cosmos.accountKey\": cosmosMasterKey,\n  \"spark.cosmos.database\": \"SampleDatabase\",\n  \"spark.cosmos.container\": \"GreenTaxiRecords\",\n  \"spark.cosmos.read.partitioning.strategy\": \"Restrictive\",\n  \"spark.cosmos.read.inferSchema.enabled\": \"false\",\n}\n\ntoBeDeleted_df = spark.read.format(\"cosmos.oltp\").options(**readCfg).load().limit(100_000)\nprint(\"Number of records to be deleted: \", toBeDeleted_df.count())\n\nprint(\"Starting to bulk delete documents: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\ndeleteCfg = {\n  \"spark.cosmos.accountEndpoint\": cosmosEndpoint,\n  \"spark.cosmos.accountKey\": cosmosMasterKey,\n  \"spark.cosmos.database\": \"SampleDatabase\",\n  \"spark.cosmos.container\": \"GreenTaxiRecords\",\n  \"spark.cosmos.write.strategy\": \"ItemDelete\",\n  \"spark.cosmos.write.bulk.enabled\": \"true\",\n  \"spark.cosmos.throughputControl.enabled\": \"true\",\n  \"spark.cosmos.throughputControl.name\": \"NYCGreenTaxiDataDelete\",\n  \"spark.cosmos.throughputControl.throughputBucket\": \"1\",\n}\ntoBeDeleted_df \\\n  .write \\\n  .format(\"cosmos.oltp\") \\\n  .mode(\"Append\") \\\n  .options(**deleteCfg) \\\n  .save()\nprint(\"Finished deleting documents: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\n\nprint(\"Starting count validation via query: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\ncount_query_schema = StructType(fields=[StructField(\"Count\", LongType(), True)])\nreadCfg[\"spark.cosmos.read.customQuery\"] = \"SELECT COUNT(0) AS Count FROM c\"\nquery_df = spark.read.format(\"cosmos.oltp\").schema(count_query_schema).options(**readCfg).load()\ncount_query = query_df.select(F.sum(\"Count\").alias(\"TotalCount\")).first()[\"TotalCount\"]\nprint(\"Number of records retrieved via query: \", count_query)\nprint(\"Finished count validation via query: \", datetime.datetime.utcnow().strftime(\"%Y-%m-%d %H:%M:%S.%f\"))\n\nassert max(0, count_source - 100_000) == count_query"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "a9b0c1d2-e3f4-5678-abcd-789012345678"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Sample - showing the existing containers**"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "b0c1d2e3-f4a5-6789-bcde-890123456789"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%sql\nSHOW TABLES FROM cosmosCatalog.SampleDatabase"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "c1d2e3f4-a5b6-7890-cdef-901234567890"
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "code",
      "source": [
        "df_Tables = spark.sql('SHOW TABLES FROM cosmosCatalog.SampleDatabase')\nassert df_Tables.count() == 2"
      ],
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "title": "",
          "showTitle": false,
          "inputWidgets": {},
          "nuid": "d2e3f4a5-b6c7-8901-defa-012345678901"
        }
      },
      "outputs": [],
      "execution_count": 0
    }
  ],
  "metadata": {
    "application/vnd.databricks.v1+notebook": {
      "notebookName": "04_ThroughputBucket",
      "dashboards": [],
      "notebookMetadata": {
        "pythonIndentUnit": 2
      },
      "language": "python",
      "widgets": {},
      "notebookOrigID": 86486029782770
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
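For quick experimentation outside the notebook, the throughput-bucket write options it uses can be assembled and sanity-checked in plain Python before being handed to `df.write.format("cosmos.oltp")`. This is a minimal sketch: the helper name `build_bucket_write_config` is hypothetical (not part of the connector), while the option keys and the 1-5 bucket range are the ones the notebook itself relies on.

```python
def build_bucket_write_config(endpoint, key, database, container,
                              group_name, bucket):
    """Build the Spark write options for a server-side throughput bucket."""
    # Throughput buckets are numbered 1-5; fail fast on anything else
    # instead of erroring at write time.
    if bucket not in range(1, 6):
        raise ValueError(f"throughputBucket must be 1-5, got {bucket}")
    return {
        "spark.cosmos.accountEndpoint": endpoint,
        "spark.cosmos.accountKey": key,
        "spark.cosmos.database": database,
        "spark.cosmos.container": container,
        "spark.cosmos.write.strategy": "ItemOverwrite",
        "spark.cosmos.write.bulk.enabled": "true",
        # Server-side throughput control: no metadata container needed.
        "spark.cosmos.throughputControl.enabled": "true",
        "spark.cosmos.throughputControl.name": group_name,
        "spark.cosmos.throughputControl.throughputBucket": str(bucket),
    }

cfg = build_bucket_write_config(
    "https://myaccount.documents.azure.com:443/", "<account-key>",
    "SampleDatabase", "GreenTaxiRecords", "NYCGreenTaxiDataIngestion", 5)
print(cfg["spark.cosmos.throughputControl.throughputBucket"])  # prints: 5
```

On a cluster with the Cosmos Spark connector installed, the resulting dict would then be applied exactly as in the notebook: `df.write.format("cosmos.oltp").mode("Append").options(**cfg).save()`.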
