
Commit 3daeca4: review and changed the databricks notebooks (#1324)

1 parent 25a338d
4 files changed: 10 additions & 29 deletions


examples/databricks/01-SettingUp-Zingg.ipynb

Lines changed: 7 additions & 26 deletions
@@ -13,7 +13,7 @@
 }
 },
 "source": [
-"#Part 1 of the Zingg notebook\n",
+"# Part 1 of the Zingg notebook\n",
 "## It is responsible for initializing the Zingg environment, which includes the following steps:\n",
 "- **Environment Setup:** Loads all necessary libraries and dependencies required for Zingg to run.\n",
 "- **Path Setup:** Defines and sets up all relevant file paths, such as model directory, input data locations, output directories.\n",
@@ -33,20 +33,20 @@
 }
 },
 "source": [
-"## Example Notebook For Training and Running Zingg Enterprise Entity Resolution Workflow on Databricks\n",
+"## Example Notebook For Training and Running Zingg Entity Resolution Workflow on Databricks\n",
 "This notebook runs the Zingg Febrl Example on Databricks. Please refer to the\n",
 "\n",
-"- Enterprise Zingg Python API\n",
+"- Zingg Python API\n",
 "- Zingg Official Documentation for details.\n",
 "\n",
-"_This notebook has been tested on 16.4 LTS DBR version (Spark 3.5.2, scala 2.12)_\n",
+"_This notebook has been tested on 16.4 LTS DBR version (Spark 3.5.5, scala 2.12)_\n",
 "\n",
 "## Create a Spark Cluster and Install Zingg\n",
 "# \n",
-"- Go to the Clusters tab, hit Create Cluster, and give it a name like “Zingg-Enterprise.”\n",
+"- Go to the Clusters tab, hit Create Cluster, and give it a name like “Zingg.”\n",
 "- Set the runtime version to a current LTS (Long-Term Support) version for compatibility.\n",
-"- Next, you’ll need to install Zingg. For this, we will be need the latest Zingg JAR file and the license.\n",
-"- Create a Volume (managed volume) inside the schema and add the zingg-opensource-spark-0.8.0.jar and the zingg_license.jar to it.\n",
+"- Next, you’ll need to install Zingg. For this, we will need the latest Zingg JAR file.\n",
+"- Create a Volume (managed volume) inside the schema and add the zingg-0.6.0-spark-3.5.5.jar to it.\n",
 "- Upload the file: Open the cluster details, navigate to the Libraries section, and click Install New.\n",
 "- Select the Volumes option and upload JAR from the specific path -> /Volumes/catalog_name/schema_name/volume_name/path_to_file (Zingg JAR)\n",
 "\n",
@@ -445,25 +445,6 @@
 "args.setOutput(outputPipe)"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {
-"application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
-"inputWidgets": {},
-"nuid": "b96dc60f-089d-4c83-9961-568c2100fbd6",
-"showTitle": false,
-"tableResultSettingsMap": {},
-"title": ""
-}
-},
-"source": [
-"## Configure the statistics output path\n",
-"Here we configure the stats path\n",
-"\n",
-"Please make sure the path/name contains the placeholder \"**_$ZINGG_DYNAMIC_STAT_NAME_**\""
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {
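
The install steps above reference Unity Catalog Volume paths of the form `/Volumes/catalog_name/schema_name/volume_name/path_to_file`. A minimal sketch of composing such paths; the catalog, schema, volume, and model-id values below are illustrative placeholders, not values from this commit:

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Compose a Databricks Volume path: /Volumes/<catalog>/<schema>/<volume>/..."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])

# Hypothetical names for illustration only.
jar_path = volume_path("main", "zingg", "artifacts", "zingg-0.6.0-spark-3.5.5.jar")
model_dir = volume_path("main", "zingg", "artifacts", "models", "100")

print(jar_path)   # /Volumes/main/zingg/artifacts/zingg-0.6.0-spark-3.5.5.jar
print(model_dir)  # /Volumes/main/zingg/artifacts/models/100
```

The `jar_path` is what you would point the cluster's Libraries > Install New > Volumes dialog at, per the notebook's instructions.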

examples/databricks/02-LabelTrainingData.ipynb

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@
 }
 },
 "source": [
-"#Part 2: FindTrainingData and Label Phase\n",
+"# Part 2: FindTrainingData and Label Phase\n",
 "## We have completed setting up Zingg in the previous step. In this part, we will run the **_FindTrainingData_** and **_Label_** phases. \n",
 "This involves generating candidate record pairs for training, presenting them for manual labeling, and saving the labeled data for use in model training. This step is essential for building a high-quality training dataset for entity resolution.\n",
 "\n",
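
As the cell describes, the two phases alternate: findTrainingData proposes candidate pairs and label collects human judgments, typically over several rounds until enough pairs are marked. A stdlib-only stand-in sketch of that loop; `run_phase` is a placeholder for the real Zingg client call, and the pair counts and the threshold of 40 are invented for illustration:

```python
def run_phase(phase: str, state: dict) -> None:
    # Placeholder for the actual Zingg invocation; numbers are made up.
    if phase == "findTrainingData":
        state["candidates"] = 20                     # pretend 20 pairs proposed
    elif phase == "label":
        state["labeled"] += state.pop("candidates", 0)  # pretend all got labeled

state = {"labeled": 0}
while state["labeled"] < 40:     # repeat until enough labeled training pairs
    run_phase("findTrainingData", state)
    run_phase("label", state)

print(state["labeled"])  # 40
```

In practice you stop iterating once Zingg has enough labeled matches and non-matches to train a reliable model.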

examples/databricks/04-GenerateDocument.ipynb renamed to examples/databricks/03-GenerateDocument.ipynb

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 }
 },
 "source": [
-"# Part 4: Documenting the model\n",
+"# Part 3: Documenting the model\n",
 "## We have completed setting up Zingg and labeled the training data in the previous steps. In this part, we will run the **_generateDocs_** phase. \n",
 "#### This phase processes the labeled data to create the readable documentation about the training data, including those marked as matches, as well as non-matches. \n",
 "\n",

examples/databricks/03-TrainAndMatch.ipynb renamed to examples/databricks/04-TrainAndMatch.ipynb

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 }
 },
 "source": [
-"# Part 3: Train and Match Phase\n",
+"# Part 4: Train and Match Phase\n",
 "## We have completed setting up Zingg, labeled the training data, and generated the model documents in the previous steps. In this part, we will run the **_Train_** and **_Match_** phases. \n",
 "#### This involves training the entity resolution model using the labeled data and then applying the trained model to match records in your dataset. This step is crucial for identifying and matching similar entities across your data sources."
 ]
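
After the renames in this commit, the notebooks run the workflow in the order findTrainingData/label, generateDocs, then train and match. A hedged sketch assembling the corresponding CLI invocations; the `zingg.sh --phase ... --conf ...` shape follows Zingg's usual command line, but the script path and `config.json` name are assumptions, not taken from this commit:

```python
# Assumed CLI shape: zingg.sh --phase <phase> --conf <config>.
# Verify against the Zingg documentation for your version.
PHASES = ["findTrainingData", "label", "generateDocs", "train", "match"]

def zingg_cmd(phase: str, conf: str = "config.json") -> str:
    return f"zingg.sh --phase {phase} --conf {conf}"

commands = [zingg_cmd(p) for p in PHASES]
print(commands[-1])  # zingg.sh --phase match --conf config.json
```

On Databricks, each notebook part corresponds to one or two of these phases rather than a shell invocation, but the phase ordering is the same.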
