Commit ec470c0

Merge pull request #133 from datacontract/develop/1.2.1
Develop/1.2.1
2 parents e1da237 + cd8b846 commit ec470c0

11 files changed

Lines changed: 2356 additions & 23 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -5,6 +5,16 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+
+## [1.2.1] - 2025-09-24
+
+### Added
+
+- Support for data quality metrics that align with ODCS 3.1
+
+### Changed
+- Replaced threshold operators mustBeGreaterThanOrEqualTo with mustBeGreaterOrEqualTo and mustBeLessThanOrEqualTo with mustBeLessOrEqualTo to align with ODCS 3.1, even if it feels wrong...
+
 ## [1.2.0] - 2025-07-05
 
 ### Added
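
Reviewer note: contracts written against 1.2.0 may still use the two operator names that this release replaces. The rename is mechanical, so it can be sketched in a few lines. This is a minimal illustration and not part of the commit: the helper name `migrate_thresholds` and the plain-dict representation of a parsed quality object are assumptions.

```python
# Hypothetical migration helper; not part of the specification or the CLI.
# Maps the operator names deprecated in 1.2.1 to their replacements.
RENAMED_OPERATORS = {
    "mustBeGreaterThanOrEqualTo": "mustBeGreaterOrEqualTo",
    "mustBeLessThanOrEqualTo": "mustBeLessOrEqualTo",
}


def migrate_thresholds(quality: dict) -> dict:
    """Return a copy of a quality object with deprecated operator names replaced."""
    migrated = {}
    for key, value in quality.items():
        new_key = RENAMED_OPERATORS.get(key, key)
        # If both spellings are present, the new spelling wins.
        if key in RENAMED_OPERATORS and new_key in migrated:
            continue
        migrated[new_key] = value
    return migrated


old_check = {"type": "sql", "mustBeGreaterThanOrEqualTo": 10}
print(migrate_thresholds(old_check))
```

The schema in this commit keeps the old names as deprecated properties, so a migration like this is optional for validation purposes.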

README.md

Lines changed: 53 additions & 4 deletions
@@ -35,15 +35,15 @@ The specification comes along with the [Data Contract CLI](https://github.com/da
 Version
 ---
 
-1.2.0([Changelog](CHANGELOG.md))
+1.2.1([Changelog](CHANGELOG.md))
 
 Example
 ---
 
 View in [Data Contract Catalog](https://datacontract.com/examples/index.html)
 
 ```yaml
-dataContractSpecification: 1.2.0
+dataContractSpecification: 1.2.1
 id: orders-latest
 info:
   title: Orders Latest
@@ -848,16 +848,17 @@ Data can be verified by executing these checks through a data quality engine.
 
 Quality attributes can be:
 - A text in natural language that describes the quality of the data.
+- A predefined metric from the library of commonly used metrics
 - An individual SQL query that returns a single value that can be compared.
 - Engine-specific types: Pre-defined quality checks, as defined by data quality libraries. Currently, the engines `soda` and `great-expectations` are supported.
 
-A quality object can be specified on field level and on model level.
+A quality object can be specified on the field level and on the model level.
 The top-level quality object is deprecated.
 
 #### Description Text
 
 A description in natural language that defines the expected quality of the data.
-This is useful to express requirements or expectation when discussing the data contract with stakeholders.
+This is useful to express requirements or expectations when discussing the data contract with stakeholders.
 Later in the development process, these might be translated into an executable check (such as `sql`).
 It can also be used as a prompt to check the data with an AI engine.
 
@@ -929,6 +930,54 @@ models:
 SQL queries allow powerful checks for custom business logic.
 A SQL query should run not longer than 10 minutes.
 
+#### Library / Metrics
+
+A set of predefined metrics commonly used in data quality checks, designed to be compatible with all major data quality engines. This simplifies the work for data engineers by eliminating the need to manually write SQL queries.
+These metrics are aligned with ODCS 3.1.
+
+| Field                  | Type                  | Description                                                                      |
+|------------------------|-----------------------|----------------------------------------------------------------------------------|
+| type                   | `string`              | `library` (can be omitted, if `metric` is defined)                               |
+| metric                 | `string`              | `nullValues`, `missingValues`, `invalidValues`, `duplicateValues`, or `rowCount` |
+| arguments              | `object`              | Some metrics require additional arguments                                        |
+| description            | `string`              | A plain text describing the quality of the data.                                 |
+| mustBe                 | `integer`             | The threshold to check the return value of the query                             |
+| mustNotBe              | `integer`             | The threshold to check the return value of the query                             |
+| mustBeGreaterThan      | `integer`             | The threshold to check the return value of the query                             |
+| mustBeGreaterOrEqualTo | `integer`             | The threshold to check the return value of the query                             |
+| mustBeLessThan         | `integer`             | The threshold to check the return value of the query                             |
+| mustBeLessOrEqualTo    | `integer`             | The threshold to check the return value of the query                             |
+| mustBeBetween          | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive.  |
+| mustNotBeBetween       | array of two integers | The threshold to check the return value of the query. Boundaries are inclusive.  |
+| unit                   | `string`              | `rows` (default) or `percent`                                                    |
+
+
+Metrics:
+
+| Metric | Level | Description | Arguments | Arguments Example |
+|--------|--------|----------------------------------------------------------------|------------------------------------------------------------------|----------------------------------------------------------------------|
+| `nullValues` | Property | Counts null values in a column/field | None | |
+| `missingValues` | Property | Counts values considered as missing (empty strings, N/A, etc.) | `missingValues`: Array of values considered missing | `missingValues: [null, '', 'N/A']` |
+| `invalidValues` | Property | Counts values that don't match valid criteria | `validValues`: Array of valid values<br>`pattern`: Regex pattern | `validValues: ['pounds', 'kg']`<br>`pattern: '^[A-Z]{2}[0-9]{2}...'` |
+| `duplicateValues` | Property | Counts duplicate values in a column | None | |
+| `duplicateValues` | Schema | Counts duplicate values across multiple columns | `properties`: Array of property names | `properties: ['tenant_id', 'order_id']` |
+| `rowCount` | Schema | Counts total number of rows in a table/object store | None | |
+
+
+Example:
+
+```yaml
+properties:
+  - name: email_address
+    quality:
+      - metric: missingValues
+        arguments:
+          missingValues: [null, '', 'N/A', 'n/a']
+        mustBeLessThan: 5
+        unit: percent # rows (default) or percent
+```
+
+
 #### Custom
 
 You can define custom quality attributes that are specific to a data quality engine.
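
Reviewer note: the README addition lists the threshold fields and a `missingValues` example but does not show how an engine might evaluate them. The sketch below is one possible reading, not the behavior of any specific engine. The function names `measure_missing` and `check_threshold` are hypothetical. It assumes that `unit: percent` means the count divided by the total number of values, times 100, and that `mustNotBeBetween` rejects values inside the closed interval.

```python
# Hypothetical evaluation of the library threshold operators from the README table.
OPERATORS = {
    "mustBe": lambda v, t: v == t,
    "mustNotBe": lambda v, t: v != t,
    "mustBeGreaterThan": lambda v, t: v > t,
    "mustBeGreaterOrEqualTo": lambda v, t: v >= t,
    "mustBeLessThan": lambda v, t: v < t,
    "mustBeLessOrEqualTo": lambda v, t: v <= t,
    # Boundaries are inclusive, as stated in the README table.
    "mustBeBetween": lambda v, t: t[0] <= v <= t[1],
    "mustNotBeBetween": lambda v, t: not (t[0] <= v <= t[1]),
}


def measure_missing(values, missing_values, unit="rows"):
    """Count values listed as missing; return a row count or a percentage."""
    count = sum(1 for v in values if v in missing_values)
    if unit == "percent":
        # Assumption: percent is relative to the total number of values.
        return 100.0 * count / len(values) if values else 0.0
    return count


def check_threshold(quality, measured):
    """Return True if the measured value satisfies every threshold present."""
    return all(
        op(measured, quality[name])
        for name, op in OPERATORS.items()
        if name in quality
    )


# The quality object from the README example, as a parsed dict.
quality = {
    "metric": "missingValues",
    "arguments": {"missingValues": [None, "", "N/A", "n/a"]},
    "mustBeLessThan": 5,
    "unit": "percent",
}
emails = ["a@example.com"] * 48 + [None, "N/A"]
measured = measure_missing(emails, quality["arguments"]["missingValues"], quality["unit"])
print(measured, check_threshold(quality, measured))
```

With 2 missing values out of 50, the measured value is 4 percent, which satisfies `mustBeLessThan: 5`.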

datacontract.init.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-dataContractSpecification: 1.2.0
+dataContractSpecification: 1.2.1
 id: my-data-contract-id
 info:
   title: My Data Contract

datacontract.schema.json

Lines changed: 41 additions & 8 deletions
@@ -7,6 +7,7 @@
 "type": "string",
 "title": "DataContractSpecificationVersion",
 "enum": [
+  "1.2.1",
   "1.2.0",
   "1.1.0",
   "0.9.3",
@@ -1816,13 +1817,21 @@
 "mustBeGreaterThan": {
   "type": "number"
 },
-"mustBeGreaterThanOrEqualTo": {
+"mustBeGreaterOrEqualTo": {
   "type": "number"
 },
+"mustBeGreaterThanOrEqualTo": {
+  "type": "number",
+  "deprecated": true
+},
 "mustBeLessThan": {
   "type": "number"
 },
 "mustBeLessThanOrEqualTo": {
+  "type": "number",
+  "deprecated": true
+},
+"mustBeLessOrEqualTo": {
   "type": "number"
 },
 "mustBeBetween": {
@@ -1849,18 +1858,42 @@
 },
 {
   "if": {
-    "properties": {
-      "type": {
-        "const": "library"
+    "anyOf": [
+      {
+        "properties": {
+          "type": {
+            "const": "library"
+          }
+        }
+      },
+      {
+        "properties": {
+          "metric": {
+            "type": "string"
+          }
+        },
+        "required": ["metric"]
       }
-    }
+    ]
   },
   "then": {
     "properties": {
+      "metric": {
+        "type": "string",
+        "description": "The DataQualityLibrary metric to use for the quality check.",
+        "enum": ["nullValues", "missingValues", "invalidValues", "duplicateValues", "rowCount"]
+      },
       "rule": {
         "type": "string",
-        "description": "Define a data quality check based on the predefined rules as per ODCS.",
-        "examples": ["duplicateCount", "validValues", "rowCount"]
+        "deprecated": true,
+        "description": "Deprecated. Use metric instead"
+      },
+      "arguments": {
+        "type": "object",
+        "description": "Additional metric-specific parameters for the quality check.",
+        "additionalProperties": {
+          "type": ["string", "number", "boolean", "array", "object"]
+        }
       },
       "mustBe": {
         "description": "Must be equal to the value to be valid. When using numbers, it is equivalent to '='."
@@ -1906,7 +1939,7 @@
       }
     },
     "required": [
-      "rule"
+      "metric"
     ]
   }
 },
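
Reviewer note: the new `if`/`anyOf` condition and the `then` block can be hard to read in diff form. The sketch below mirrors them in plain Python, following standard JSON Schema semantics, where `properties` only constrains a key that is present. The function names are hypothetical and only the `metric` requirement is modeled. It is not a substitute for running a real JSON Schema validator against `datacontract.schema.json`.

```python
# Hypothetical mirror of the `if`/`then` pair added in this hunk.
METRICS = ("nullValues", "missingValues", "invalidValues", "duplicateValues", "rowCount")


def matches_if(quality: dict) -> bool:
    """Mirror the `if` condition: anyOf(type is library, metric is a string)."""
    # Branch 1: `properties` without `required` is also satisfied when
    # `type` is absent, so only a different `type` value fails this branch.
    branch_type = "type" not in quality or quality["type"] == "library"
    # Branch 2: `metric` must be present and must be a string.
    branch_metric = isinstance(quality.get("metric"), str)
    return branch_type or branch_metric


def validate_library_check(quality: dict) -> list:
    """Return error messages for the modeled part of the `then` block."""
    errors = []
    if not matches_if(quality):
        return errors  # `then` does not apply
    if "metric" not in quality:
        errors.append("'metric' is required")
    elif quality["metric"] not in METRICS:
        errors.append("'metric' must be one of: " + ", ".join(METRICS))
    return errors


print(validate_library_check({"metric": "rowCount", "mustBeGreaterThan": 0}))
print(validate_library_check({"type": "library"}))
```

One consequence worth checking during review: read in isolation, the first `anyOf` branch also matches a quality object that has no `type` key at all, which would then require `metric`. Whether that case can occur depends on parts of the schema not shown in this hunk.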

examples/orders-latest/datacontract.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-dataContractSpecification: 1.2.0
+dataContractSpecification: 1.2.1
 id: orders-latest
 info:
   title: Orders Latest

examples/time-example/datacontract.html

Lines changed: 3 additions & 3 deletions
@@ -3,7 +3,7 @@
 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Time Data Type Example - Data Contract Specification v1.2.0</title>
+    <title>Time Data Type Example - Data Contract Specification v1.2.1</title>
     <style>
         body {
             font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
@@ -106,12 +106,12 @@
 </head>
 <body>
     <div class="container">
-        <div class="version-badge">Data Contract Specification v1.2.0</div>
+        <div class="version-badge">Data Contract Specification v1.2.1</div>
         <h1>Time Data Type Example</h1>
 
         <div class="info-section">
             <h2>Overview</h2>
-            <p>This example demonstrates the usage of the new <strong>time</strong> data type introduced in Data Contract Specification v1.2.0. The time data type is specifically designed for storing time values without date information, making it perfect for business hours, schedules, and time-based data.</p>
+            <p>This example demonstrates the usage of the new <strong>time</strong> data type introduced in Data Contract Specification v1.2.1. The time data type is specifically designed for storing time values without date information, making it perfect for business hours, schedules, and time-based data.</p>
         </div>
 
         <div class="warning">

examples/time-example/datacontract.yaml

Lines changed: 3 additions & 3 deletions
@@ -1,11 +1,11 @@
-dataContractSpecification: 1.2.0
+dataContractSpecification: 1.2.1
 id: time-demo
 info:
   title: Time Data Type Example
   version: 1.0.0
   description: |
     Example demonstrating the usage of the time data type
-    introduced in Data Contract Specification v1.2.0.
+    introduced in Data Contract Specification v1.2.1.
   owner: Data Contract Team
   contact:
     name: Data Contract Team
@@ -127,6 +127,6 @@ models:
 tags:
   - example
   - time
-  - v1.2.0
+  - v1.2.1
 links:
   datacontractCli: https://cli.datacontract.com

examples/variant-json-example/datacontract.yaml

Lines changed: 3 additions & 3 deletions
@@ -1,11 +1,11 @@
-dataContractSpecification: 1.2.0
+dataContractSpecification: 1.2.1
 id: variant-json-demo
 info:
   title: Variant and JSON Data Types Example
   version: 1.0.0
   description: |
     Example demonstrating the usage of variant and json data types
-    introduced in Data Contract Specification v1.2.0.
+    introduced in Data Contract Specification v1.2.1.
   owner: Data Contract Team
   contact:
     name: Data Contract Team
@@ -70,6 +70,6 @@ tags:
   - example
   - variant
   - json
-  - v1.2.0
+  - v1.2.1
 links:
   datacontractCli: https://cli.datacontract.com
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+dataContractSpecification: 1.2.0
+id: my-data-contract-id
+info:
+  title: My Data Contract
+  version: 0.0.1
+# description:
+# owner:
+# contact:
+# name:
+# url:
+# email:
+
+
+### servers
+
+#servers:
+# production:
+# type: s3
+# location: s3://
+# format: parquet
+# delimiter: new_line
+
+### terms
+
+#terms:
+# usage:
+# limitations:
+# billing:
+# noticePeriod:
+
+
+### models
+
+# models:
+# my_model:
+# description:
+# type:
+# fields:
+# my_field:
+# type:
+# description:
+
+
+### definitions
+
+# definitions:
+# my_field:
+# domain:
+# name:
+# title:
+# type:
+# description:
+# example:
+# pii:
+# classification:
+
+
+### servicelevels
+
+#servicelevels:
+# availability:
+# description: The server is available during support hours
+# percentage: 99.9%
+# retention:
+# description: Data is retained for one year because!
+# period: P1Y
+# unlimited: false
+# latency:
+# description: Data is available within 25 hours after the order was placed
+# threshold: 25h
+# sourceTimestampField: orders.order_timestamp
+# processedTimestampField: orders.processed_timestamp
+# freshness:
+# description: The age of the youngest row in a table.
+# threshold: 25h
+# timestampField: orders.order_timestamp
+# frequency:
+# description: Data is delivered once a day
+# type: batch # or streaming
+# interval: daily # for batch, either or cron
+# cron: 0 0 * * * # for batch, either or interval
+# support:
+# description: The data is available during typical business hours at headquarters
+# time: 9am to 5pm in EST on business days
+# responseTime: 1h
+# backup:
+# description: Data is backed up once a week, every Sunday at 0:00 UTC.
+# interval: weekly
+# cron: 0 0 * * 0
+# recoveryTime: 24 hours
+# recoveryPoint: 1 week
