Commit e5d2807

docs: update README.md to include predefined metrics for data quality checks
1 parent e1da237 commit e5d2807

2 files changed

Lines changed: 58 additions & 2 deletions

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -5,6 +5,13 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+
+## [1.2.1] - 2025-09-24
+
+### Added
+
+- Support for data quality metrics that align with ODCS 3.1
+
 ## [1.2.0] - 2025-07-05
 
 ### Added

README.md

Lines changed: 51 additions & 2 deletions
@@ -848,16 +848,17 @@ Data can be verified by executing these checks through a data quality engine.
 
 Quality attributes can be:
 - A text in natural language that describes the quality of the data.
+- A predefined metric from the library of commonly used metrics
 - An individual SQL query that returns a single value that can be compared.
 - Engine-specific types: Pre-defined quality checks, as defined by data quality libraries. Currently, the engines `soda` and `great-expectations` are supported.
 
-A quality object can be specified on field level and on model level.
+A quality object can be specified on the field level and on the model level.
 The top-level quality object is deprecated.
 
 #### Description Text
 
 A description in natural language that defines the expected quality of the data.
-This is useful to express requirements or expectation when discussing the data contract with stakeholders.
+This is useful to express requirements or expectations when discussing the data contract with stakeholders.
 Later in the development process, these might be translated into an executable check (such as `sql`).
 It can also be used as a prompt to check the data with an AI engine.
 
@@ -929,6 +930,54 @@ models:
 SQL queries allow powerful checks for custom business logic.
 A SQL query should run not longer than 10 minutes.
 
+#### Library / Metrics
+
+A set of predefined metrics commonly used in data quality checks, designed to be compatible with all major data quality engines. This simplifies the work for data engineers by eliminating the need to manually write SQL queries.
+These metrics are aligned with ODCS 3.1.
+
+| Field                  | Type                  | Description                                                                      |
+|------------------------|-----------------------|----------------------------------------------------------------------------------|
+| type                   | `string`              | `library` (can be omitted, if `metric` is defined)                               |
+| metric                 | `string`              | `nullValues`, `missingValues`, `invalidValues`, `duplicateValues`, or `rowCount` |
+| arguments              | `object`              | Some metrics require additional arguments                                        |
+| description            | `string`              | A plain text describing the quality of the data.                                 |
+| mustBe                 | `integer`             | The threshold to check the return value of the query                             |
+| mustNotBe              | `integer`             | The threshold to check the return value of the query                             |
+| mustBeGreaterThan      | `integer`             | The threshold to check the return value of the query                             |
+| mustBeGreaterOrEqualTo | `integer`             | The threshold to check the return value of the query                             |
+| mustBeLessThan         | `integer`             | The threshold to check the return value of the query                             |
+| mustBeLessOrEqualTo    | `integer`             | The threshold to check the return value of the query                             |
+| mustBeBetween          | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive.  |
+| mustNotBeBetween       | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive.  |
+| unit                   | `string`              | `rows` (default) or `percent`                                                    |
+
+
+Metrics:
+
+| Metric | Level | Description | Arguments | Arguments Example |
+|--------|--------|----------------------------------------------------------------|------------------------------------------------------------------|----------------------------------------------------------------------|
+| `nullValues` | Property | Counts null values in a column/field | None | |
+| `missingValues` | Property | Counts values considered as missing (empty strings, N/A, etc.) | `missingValues`: Array of values considered missing | `missingValues: [null, '', 'N/A']` |
+| `invalidValues` | Property | Counts values that don't match valid criteria | `validValues`: Array of valid values<br>`pattern`: Regex pattern | `validValues: ['pounds', 'kg']`<br>`pattern: '^[A-Z]{2}[0-9]{2}...'` |
+| `duplicateValues` | Property | Counts duplicate values in a column | None | |
+| `duplicateValues` | Schema | Counts duplicate values across multiple columns | `properties`: Array of property names | `properties: ['tenant_id', 'order_id']` |
+| `rowCount` | Schema | Counts total number of rows in a table/object store | None | |
+
+
+Example:
+
+```yaml
+properties:
+  - name: email_address
+    quality:
+      - metric: missingValues
+        arguments:
+          missingValues: [null, '', 'N/A', 'n/a']
+        mustBeLessThan: 5
+        unit: percent # rows (default) or percent
+```
+
+
 #### Custom
 
 You can define custom quality attributes that are specific to a data quality engine.
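
As a reading aid for the `Library / Metrics` section added in this commit, here is a minimal Python sketch of how a `missingValues` check with `unit: percent` might be evaluated. It is not taken from the project or from any data quality engine. The helper names `count_missing`, `to_unit`, and `check_threshold` are hypothetical, and reading `mustNotBeBetween` as "outside the inclusive range" is an assumption based only on the table's "Boundaries are inclusive" note.

```python
from typing import Any, Mapping, Sequence


def count_missing(values: Sequence[Any], missing_values: Sequence[Any]) -> int:
    """Count the values that appear in the configured list of missing markers."""
    return sum(1 for v in values if v in missing_values)


def to_unit(count: int, total: int, unit: str = "rows") -> float:
    """Express a raw count as rows (the default) or as a percentage of all rows."""
    if unit == "rows":
        return count
    if unit == "percent":
        # Assumption: an empty dataset is reported as 0 percent.
        return 100.0 * count / total if total else 0.0
    raise ValueError(f"unsupported unit: {unit!r}")


# One comparison per threshold field listed in the README table.
_THRESHOLDS = {
    "mustBe": lambda v, t: v == t,
    "mustNotBe": lambda v, t: v != t,
    "mustBeGreaterThan": lambda v, t: v > t,
    "mustBeGreaterOrEqualTo": lambda v, t: v >= t,
    "mustBeLessThan": lambda v, t: v < t,
    "mustBeLessOrEqualTo": lambda v, t: v <= t,
    "mustBeBetween": lambda v, t: t[0] <= v <= t[1],
    # Assumption: "not between" means outside the inclusive range.
    "mustNotBeBetween": lambda v, t: not (t[0] <= v <= t[1]),
}


def check_threshold(value: float, quality: Mapping[str, Any]) -> bool:
    """Return True if the value satisfies every threshold key present."""
    return all(
        compare(value, quality[key])
        for key, compare in _THRESHOLDS.items()
        if key in quality
    )


# The quality object from the README example, written as a Python dict.
quality = {
    "metric": "missingValues",
    "arguments": {"missingValues": [None, "", "N/A", "n/a"]},
    "mustBeLessThan": 5,
    "unit": "percent",
}

# Made-up sample data: 3 of 100 values are missing markers.
emails = ["a@example.com"] * 97 + [None, "", "N/A"]

missing = count_missing(emails, quality["arguments"]["missingValues"])
measured = to_unit(missing, len(emails), quality["unit"])
passed = check_threshold(measured, quality)
```

With this sample data, 3 percent of the values are missing, so the `mustBeLessThan: 5` check passes. A real engine would compute the count in SQL or in its own check language rather than in memory.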
