Commit 3ab5f7c

Merge pull request #1 from databrickslabs/0.1_release
Release 0.1
2 parents adf911f + 89a9e82 commit 3ab5f7c

15 files changed

Lines changed: 741 additions & 245 deletions

CONTRIBUTING.md

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1 +1,2 @@
1-
We happily welcome contributions to *PROJECT NAME*. We use GitHub Issues to track community reported issues and GitHub Pull Requests for accepting changes.
1+
We happily welcome contributions to *Databricks Labs - Rules Engine*.
2+
We use GitHub Issues to track community reported issues and GitHub Pull Requests for accepting changes.

NOTICE

Lines changed: 3 additions & 9 deletions
@@ -1,11 +1,5 @@
1-
[Project Name]
2-
3-
Copyright (2018) Databricks, Inc.
4-
5-
6-
This Software includes software developed at Databricks (https://www.databricks.com/) and its use is subject to the included LICENSE file.
1+
Databricks Labs - Rules Engine
72

3+
Copyright (2018) Databricks, Inc.
84

9-
Additionally, this Software contains code from the following open source projects:
10-
11-
[Project Name - License]
5+
This Software includes software developed at Databricks (https://www.databricks.com/) and its use is subject to the included LICENSE file.

README.md

Lines changed: 175 additions & 14 deletions
@@ -1,23 +1,184 @@
1-
# PROJECT NAME
2-
Standard Project Template for Databricks Labs Projects
1+
# Rules Engine
2+
Simplified Validation for Production Workloads
33

44
## Project Description
5-
Short description of project's purpose
5+
As pipelines move from bronze to gold, it's very common that some level of governance be performed in
6+
Silver or at various places in the pipeline. The need for business rule validation is very common.
7+
Databricks recognizes this and, as such, is building Delta Pipelines with Expectations, which are coming soon
8+
and will likely reduce the need for a rules engine like this but it's possible that simple rules are needed
9+
where Delta Pipelines are a bit overkill or not in line with the overall workload. After the release
10+
of Delta Pipelines and Expectations the code base will be reviewed and adjusted appropriately. That may mean we
11+
extend Expectations, add a simplified wrapper, do both, or do neither. We'll have to wait and see what Delta
12+
Pipelines and Expectations look like when they're released.
613

7-
## Project Support
8-
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
14+
Introducing Databricks Labs - Rules Engine, a simple solution for validating data in dataframes before you
15+
move the data to production and/or in-line (coming soon).
916

10-
Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
17+
![Alt Text](images/Rules_arch.png)
18+
## Using The Rules Engine In Your Project
19+
* Pull the latest release from the releases
20+
* Add it as a dependency (will be in Maven eventually)
21+
* Reference it in your imports
1122

23+
## Getting Started
24+
A list of usage examples is available in the `demo` folder of this repo in [html](demo/Rules_Engine_Examples.html)
25+
and as a [Databricks Notebook DBC](demo/Rules_Engine_Examples.dbc).
1226

13-
## Building the Project
14-
Instructions for how to build the project
27+
The process is simple:
28+
* Define Rules
29+
* Build a RuleSet from your DataFrame using the Rules you defined
30+
```scala
31+
import com.databricks.labs.validation.utils.Structures._
32+
import com.databricks.labs.validation._
33+
```
34+
35+
As of version 0.1, there are three primary rule types:
36+
* Boundary Rules
37+
* Categorical Rules (Strings and Numerical)
38+
* Date Rules (in progress)
39+
40+
Rules can be composed of:
41+
* simple column references `col("my_column_name")`
42+
* complex columns `col("Revenue") - col("Cost")`
43+
* aggregate columns `min("ColumnName")`
44+
45+
Rules can be applied to simple DataFrames or grouped Dataframes. To use a grouped dataframe simply pass
46+
your dataframe into the RuleSet and pass one or more columns in as `by` columns. This will apply the rule
47+
at the group level which can be helpful at times.
48+
49+
### Simple Rule
50+
`val validateRetailPrice = Rule("Retail_Price_Validation", col("retail_price"), Bounds(0.0, 6.99))`
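
To make the idea of a boundary rule concrete, here is a minimal plain-Scala sketch that does not use Spark or the library itself. `SketchBounds` and `BoundsSketch` are hypothetical names, and the inclusive comparison and infinite defaults are assumptions made for illustration; the library's own `Bounds` may behave differently.

```scala
// Hypothetical stand-in for the library's Bounds type; not the real implementation.
final case class SketchBounds(
  lower: Double = Double.NegativeInfinity,
  upper: Double = Double.PositiveInfinity
) {
  // Inclusive edges are an assumption made for this sketch.
  def contains(value: Double): Boolean = value >= lower && value <= upper
}

object BoundsSketch {
  // Counts the values that fall outside the bounds, as a row-level rule would.
  def invalidCount(values: Seq[Double], bounds: SketchBounds): Int =
    values.count(v => !bounds.contains(v))

  def main(args: Array[String]): Unit = {
    val retailPrices = Seq(9.32, 19.99, 0.99, 0.98)
    // Two of the four prices exceed the 6.99 upper bound.
    println(invalidCount(retailPrices, SketchBounds(0.0, 6.99)))
  }
}
```

Leaving one side unset, as in `SketchBounds(upper = 90.0)`, mirrors the one-sided `Bounds(upper = 90.0)` usage shown elsewhere in this README.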
51+
52+
### List of Rules
53+
NOTE: While validations can be performed on aggregate columns (whether the DF is grouped or not), aggregate columns
54+
only return a single value. As such, the failed count is set to 1 for failures, so for aggregate columns
55+
the `Invalid_Count` is of little use. Better granularity can be seen in the report when not using
56+
aggregates.
57+
```scala
58+
val specializedRules = Array(
59+
// Example of aggregate column
60+
Rule("Reasonable_sku_counts", count(col("sku")), Bounds(lower = 20.0, upper = 200.0)),
61+
// Example of calculated column from optimized UDF
62+
Rule("Max_allowed_discount",
63+
max(getDiscountPercentage(col("retail_price"), col("scan_price"))),
64+
Bounds(upper = 90.0)),
65+
// Example distinct values rule
66+
Rule("Unique_Skus", countDistinct("sku"), Bounds(upper = 1.0))
67+
)
68+
```
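
The note about `Invalid_Count` can be illustrated with plain Scala collections. This is only a sketch of the reasoning, not the library's code: `AggregateVsRowSketch` and its methods are hypothetical, and the prices are made-up illustration values.

```scala
object AggregateVsRowSketch {
  // Row-level rule: every value is checked, so the failure count is meaningful.
  def rowLevelFailures(values: Seq[Double], lower: Double, upper: Double): Int =
    values.count(v => v < lower || v > upper)

  // Aggregate rule: the column collapses to a single value (here the max)
  // before the check, so the failure count can only be 0 or 1.
  def aggregateFailures(values: Seq[Double], upper: Double): Int =
    if (values.nonEmpty && values.max > upper) 1 else 0

  def main(args: Array[String]): Unit = {
    val scanPrices = Seq(8.99, 16.49, 129.99, 150.0) // illustrative values
    println(rowLevelFailures(scanPrices, 0.0, 29.99)) // two rows are out of range
    println(aggregateFailures(scanPrices, 29.99))     // the aggregate fails once
  }
}
```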
69+
70+
### MinMax Rules
71+
It's very common to build rules to validate min and max allowable values so there's a helper function
72+
to speed up this process. It really only makes sense to use minmax when specifying both an upper and a lower bound
73+
in the Bounds object. Using this method, the example below requires only three lines of code instead of the six needed
74+
if each rule were built manually.
75+
```scala
76+
val minMaxPriceDefs = Array(
77+
MinMaxRuleDef("MinMax_Sku_Price", col("retail_price"), Bounds(0.0, 29.99)),
78+
MinMaxRuleDef("MinMax_Scan_Price", col("scan_price"), Bounds(0.0, 29.99)),
79+
MinMaxRuleDef("MinMax_Cost", col("cost"), Bounds(0.0, 12.0))
80+
)
81+
82+
// Generate the array of Rules from the minmax generator
83+
val minMaxPriceRules = RuleSet.generateMinMaxRules(minMaxPriceDefs: _*)
84+
```
85+
OR -- simply add the list of minmax rules or simple individual rule definitions
86+
to an existing RuleSet (if not using builder pattern)
87+
```scala
88+
val someRuleSet = RuleSet(df)
89+
someRuleSet.addMinMaxRules(minMaxPriceDefs: _*)
90+
someRuleSet.addMinMaxRules("Retail_Price_Validation", col("retail_price"), Bounds(0.0, 6.99))
91+
```
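
As a rough mental model of why three definitions stand in for six rules, the sketch below expands each definition into one `min` rule and one `max` rule. Everything here is hypothetical: `SketchRule`, `SketchMinMaxDef`, the `_min`/`_max` name suffixes, and the use of plain strings for columns are assumptions for illustration, not the behavior of `RuleSet.generateMinMaxRules`.

```scala
// Hypothetical, Spark-free stand-ins for a rule and a min/max definition.
final case class SketchRule(name: String, aggregate: String, column: String,
                            lower: Double, upper: Double)
final case class SketchMinMaxDef(name: String, column: String,
                                 lower: Double, upper: Double)

object MinMaxSketch {
  // Each definition yields two rules: one on min(column), one on max(column).
  def expand(defs: SketchMinMaxDef*): Seq[SketchRule] =
    defs.flatMap { d =>
      Seq(
        SketchRule(s"${d.name}_min", "min", d.column, d.lower, d.upper),
        SketchRule(s"${d.name}_max", "max", d.column, d.lower, d.upper)
      )
    }

  def main(args: Array[String]): Unit = {
    val rules = expand(
      SketchMinMaxDef("MinMax_Sku_Price", "retail_price", 0.0, 29.99),
      SketchMinMaxDef("MinMax_Scan_Price", "scan_price", 0.0, 29.99),
      SketchMinMaxDef("MinMax_Cost", "cost", 0.0, 12.0)
    )
    rules.foreach(println)
  }
}
```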
92+
93+
### Categorical Rules
94+
There are two types of categorical rules which are used to validate against a pre-defined list of valid
95+
values. As of 0.1, the accepted categorical types are String, Double, Int, and Long.
96+
```scala
97+
val catNumerics = Array(
98+
Rule("Valid_Stores", col("store_id"), Lookups.validStoreIDs),
99+
Rule("Valid_Skus", col("sku"), Lookups.validSkus)
100+
)
101+
102+
val catStrings = Array(
103+
Rule("Valid_Regions", col("region"), Lookups.validRegions)
104+
)
105+
```
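
The `Lookups` object referenced above is not shown in this README. The sketch below assumes it is simply a holder of allowed values and shows the membership check a categorical rule implies. `SketchLookups`, its contents, and `CategoricalSketch` are hypothetical, and the use of `Set` is an assumption; the real `Lookups` may use a different collection type.

```scala
// Hypothetical lookup values for illustration only.
object SketchLookups {
  val validStoreIDs: Set[Int] = Set(1001, 1002)
  val validRegions: Set[String] = Set("Northwest", "Southwest")
}

object CategoricalSketch {
  // Returns the values that are not in the allowed list.
  def invalidValues[T](values: Seq[T], allowed: Set[T]): Seq[T] =
    values.filterNot(allowed.contains)

  def main(args: Array[String]): Unit = {
    // A misspelled region and an unknown store id are both flagged.
    println(invalidValues(Seq("Northwest", "Northwst"), SketchLookups.validRegions))
    println(invalidValues(Seq(1001, 1002, 1003), SketchLookups.validStoreIDs))
  }
}
```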
15106

16-
## Deploying / Installing the Project
17-
Instructions for how to deploy the project, or install it
107+
### Validation
108+
Now that you have some rules built up... it's time to build the ruleset and validate it. As mentioned above,
109+
the dataframe can be simple or groupBy column[s] can be passed in (as string) to perform validation at the
110+
grouped level.
111+
```scala
112+
val (rulesReport, passed) = RuleSet(df)
113+
.add(specializedRules)
114+
.add(minMaxPriceRules)
115+
.add(catNumerics)
116+
.add(catStrings)
117+
.validate()
18118

19-
## Releasing the Project
20-
Instructions for how to release a version of the project
119+
val (rulesReport, passed) = RuleSet(df, Array("store_id"))
120+
.add(specializedRules)
121+
.add(minMaxPriceRules)
122+
.add(catNumerics)
123+
.add(catStrings)
124+
.validate()
125+
```
126+
The validation returns two items: the `rulesReport` and a boolean (true/false) indicating whether all rules passed. If a single rule
127+
fails, the `passed` value above will be false. The `rulesReport` is a summary of which rules failed and,
128+
if the input column was not an aggregate column, the number of failed records. An image of the report is below.
129+
![Alt Text](images/rulesReport.png)
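
The relationship between the report and the overall result can be sketched without Spark. `SketchReportRow` and its fields are hypothetical (only `Invalid_Count` is named in this README), so treat this as an illustration of the "one failure fails the set" logic rather than the report's actual schema.

```scala
// Hypothetical, simplified shape of one report row.
final case class SketchReportRow(ruleName: String, passed: Boolean, invalidCount: Long)

object ReportSketch {
  // The overall result is true only when every rule passed.
  def allPassed(rows: Seq[SketchReportRow]): Boolean = rows.forall(_.passed)

  // Names of the rules that failed, useful for logging or alerting.
  def failedRules(rows: Seq[SketchReportRow]): Seq[String] =
    rows.filterNot(_.passed).map(_.ruleName)

  def main(args: Array[String]): Unit = {
    val report = Seq(
      SketchReportRow("Valid_Regions", passed = false, invalidCount = 1L),
      SketchReportRow("Valid_Stores", passed = true, invalidCount = 0L)
    )
    if (!allPassed(report)) println(s"Failed rules: ${failedRules(report).mkString(", ")}")
  }
}
```

In a pipeline, a check like `allPassed` is a natural place to stop a write to production when validation fails.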
21130

22-
## Using the Project
23-
Simple examples on how to use the project
131+
## Next Steps
132+
Clearly, this is just a start. This is a small package and, as such, a GREAT place to start if you've never
133+
contributed to a project before. Please feel free to fork the repo and/or submit PRs. I'd love to see what
134+
you come up with. If you're not much of a developer or don't have the time you can still contribute! Please
135+
post your ideas in the issues and label them appropriately (e.g. bug/enhancement) and someone will review them
136+
and add them as soon as possible.
137+
138+
Some ideas of great adds are:
139+
* Add a Python wrapper
140+
* Refactor Rule and/or Validator to implement an Abstract class or trait
141+
* There's a clear opportunity to abstract away some of the redundancy between rule types.
142+
* Implement a fast runner
143+
* Optimize performance by failing fast for big data. Smart sampling could be implemented to review subsets
144+
of columns/records and look for failures to enable a faster failure.
145+
* Implement tests
146+
* Yeah, I know...I should have done this on day 0...but...time is always an issue. I plan to come back and add
147+
tests but if you'd like to add tests, that's a great way to learn the code base (especially one this small)
148+
* Implement the date time rule (or some other custom rule)
149+
* The date time rule has already been scaffolded; it just needs to be built out
150+
* What kind of complex rules does your business require that aren't possible here?
151+
* Add a quarantine pattern
152+
* Enable a configuration to a Ruleset to identify records that didn't pass the validations and add
153+
them to a predefined quarantine zone.
154+
* Add logic to attempt to auto-handle certain types of failures based on common business patterns
155+
156+
157+
## Legal Information
158+
This software is provided as-is and is not officially supported by Databricks through customer technical support channels.
159+
Support, questions, and feature requests can be submitted through the Issues page of this repo.
160+
Please see the [legal agreement](LICENSE.txt) and understand that issues with the use of this code will
161+
not be answered or investigated by Databricks Support.
162+
163+
## Core Contribution team
164+
* Lead Developer: [Daniel Tomes](https://www.linkedin.com/in/tomes/), Practice Leader, Databricks
165+
* Developer: <b> your name here </b> Contribute to the project
166+
167+
168+
## Project Support
169+
Please note that all projects in the /databrickslabs github account are provided for your exploration only,
170+
and are not formally supported by Databricks with Service Level Agreements (SLAs).
171+
They are provided AS-IS and we do not make any guarantees of any kind.
172+
Please do not submit a support ticket relating to any issues arising from the use of these projects.
173+
174+
Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo.
175+
They will be reviewed as time permits, but there are no formal SLAs for support.
176+
177+
178+
## Building the Project
179+
To build the project: <br>
180+
```
181+
cd Downloads
182+
git pull repo
183+
sbt clean package
184+
```

build.sbt

Lines changed: 0 additions & 1 deletion
@@ -8,7 +8,6 @@ scalaVersion := "2.11.12"
88
scalacOptions ++= Seq("-Xmax-classfile-name", "78")
99

1010
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"
11-
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.4.0"
1211
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
1312

1413
lazy val commonSettings = Seq(

demo/Example.scala

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
1+
package com.databricks.labs.validation
2+
3+
import com.databricks.labs.validation.utils.{Lookups, SparkSessionWrapper}
4+
import com.databricks.labs.validation.utils.Structures._
5+
import org.apache.spark.sql.Column
6+
import org.apache.spark.sql.functions._
7+
8+
object Example extends App with SparkSessionWrapper {
9+
import spark.implicits._
10+
11+
/**
12+
* Validation example
13+
* Passing pre-built array of rules into a RuleSet and validating a non-grouped dataframe
14+
*/
15+
16+
/**
17+
* Example of a proper UDF to simplify rules logic. Simplification UDFs should take in zero or many
18+
* columns and return one column
19+
* @param retailPrice column 1
20+
* @param scanPrice column 2
21+
* @return result column of applied logic
22+
*/
23+
def getDiscountPercentage(retailPrice: Column, scanPrice: Column): Column = {
24+
(retailPrice - scanPrice) / retailPrice
25+
}
26+
27+
// Example of creating array of custom rules
28+
val specializedRules = Array(
29+
Rule("Reasonable_sku_counts", count(col("sku")), Bounds(lower = 20.0, upper = 200.0)),
30+
Rule("Max_allowed_discount",
31+
max(getDiscountPercentage(col("retail_price"), col("scan_price"))),
32+
Bounds(upper = 90.0)),
33+
Rule("Retail_Price_Validation", col("retail_price"), Bounds(0.0, 6.99)),
34+
Rule("Unique_Skus", countDistinct("sku"), Bounds(upper = 1.0))
35+
)
36+
37+
// It's common to generate many min/max boundaries. These can be generated easily
38+
// The generator function can easily be extended or overridden to satisfy more complex requirements
39+
val minMaxPriceDefs = Array(
40+
MinMaxRuleDef("MinMax_Sku_Price", col("retail_price"), Bounds(0.0, 29.99)),
41+
MinMaxRuleDef("MinMax_Scan_Price", col("scan_price"), Bounds(0.0, 29.99)),
42+
MinMaxRuleDef("MinMax_Cost", col("cost"), Bounds(0.0, 12.0))
43+
)
44+
45+
val minMaxPriceRules = RuleSet.generateMinMaxRules(minMaxPriceDefs: _*)
46+
val someRuleSet = RuleSet(df)
47+
someRuleSet.addMinMaxRules(minMaxPriceDefs: _*)
48+
someRuleSet.addMinMaxRules("Retail_Price_Validation", col("retail_price"), Bounds(0.0, 6.99))
49+
50+
51+
val catNumerics = Array(
52+
Rule("Valid_Stores", col("store_id"), Lookups.validStoreIDs),
53+
Rule("Valid_Skus", col("sku"), Lookups.validSkus)
54+
)
55+
56+
val catStrings = Array(
57+
Rule("Valid_Regions", col("region"), Lookups.validRegions)
58+
)
59+
60+
//TODO - validate datetime
61+
// Test, example data frame
62+
val df = sc.parallelize(Seq(
63+
("Northwest", 1001, 123456, 9.32, 8.99, 4.23, "2020-02-01 00:00:00.000"),
64+
("Northwest", 1001, 123256, 19.99, 16.49, 12.99, "2020-02-01"),
65+
("Northwest", 1001, 123456, 0.99, 0.99, 0.10, "2020-02-01"),
66+
("Northwest", 1001, 123456, 0.98, 0.90, 0.10, "2020-02-01"), // non_distinct sku
67+
("Northwst", 1001, 123456, 0.99, 0.99, 0.10, "2020-02-01"), // Misspelled Region
68+
("Northwest", 1002, 122987, 9.99, 9.49, 6.49, "2021-02-01"), // Invalid Date/Timestamp
69+
("Northwest", 1002, 173544, 1.29, 0.99, 1.23, "2020-02-01"),
70+
("Northwest", 1002, 168212, 3.29, 1.99, 1.23, "2020-02-01"),
71+
("Northwest", 1002, 365423, 1.29, 0.99, 1.23, "2020-02-01"),
72+
("Northwest", 1002, 3897615, 14.99, 129.99, 1.23, "2020-02-01"),
73+
("Northwest", 1003, 163212, 3.29, 1.99, 1.23, "2020-02-01") // Invalid numeric store_id groupby test
74+
)).toDF("region", "store_id", "sku", "retail_price", "scan_price", "cost", "create_ts")
75+
.withColumn("create_ts", 'create_ts.cast("timestamp"))
76+
.withColumn("create_dt", 'create_ts.cast("date"))
77+
78+
// Doing the validation
79+
// The validate method will return the rules report dataframe which breaks down which rules passed and which
80+
// rules failed and how/why. The second return value returns a boolean to determine whether or not all tests passed
81+
// val (rulesReport, passed) = RuleSet(df, Array("store_id"))
82+
val (rulesReport, passed) = RuleSet(df)
83+
.add(specializedRules)
84+
.add(minMaxPriceRules)
85+
.add(catNumerics)
86+
.add(catStrings)
87+
.validate(2)
88+
89+
rulesReport.show(200, false)
90+
// rulesReport.printSchema()
91+
92+
93+
}

demo/Rules_Engine_Examples.dbc

5.08 KB
Binary file not shown.

demo/Rules_Engine_Examples.html

Lines changed: 42 additions & 0 deletions
Large diffs are not rendered by default.

images/Rules_arch.png

168 KB

images/rulesReport.png

65.8 KB
