@@ -6,136 +6,9 @@ It is meant for cases where we cannot access real data, only schema and cardinal
6
6
7
7
Based on the table schema(s) and a query, it generates random data that respects field types, foreign keys defined in the database, and foreign keys inferred from the query pattern (planned: existing cardinalities and distributions).
8
8
9
-
Notice:
10
-
This is early stage
11
-
12
9
## Usage
13
10
`random-data-load run --engine=(mysql|pg) --rows=INT-64 (--query=SELECT ...|--table=table_name) [options...]`
14
11
15
-
## Supported fields:
16
-
17
-
|Field type|Generated values|
18
-
|----------|----------------|
19
-
|bool|false ~ true|
20
-
|tinyint|0 ~ 0xFF|
21
-
|smallint|0 ~ 0XFFFF|
22
-
|mediumint|0 ~ 0xFFFFFF|
23
-
|int - integer|0 ~ 0xFFFFFFFF|
24
-
|bigint|0 ~ 0xFFFFFFFFFFFFFFFF|
25
-
|float|0 ~ 1e8|
26
-
|decimal(m,n)|0 ~ 10^(m-n)|
27
-
|double|0 ~ 1000|
28
-
|char(n)|up to n random chars|
29
-
|varchar(n)|up to n random chars|
30
-
|date|between --min-generated-time and --max-generated-time|
31
-
|datetime|between --min-generated-time and --max-generated-time|
32
-
|timestamp|between --min-generated-time and --max-generated-time|
33
-
|time|00:00:00 ~ 23:59:59|
34
-
|year|Current year - 1 ~ current year|
35
-
|tinyblob|up to 100 chars random paragraph|
36
-
|tinytext|up to 100 chars random paragraph|
37
-
|blob|up to --max-text-size chars random paragraph|
38
-
|text|up to --max-text-size chars random paragraph|
39
-
|mediumblob|up to --max-text-size chars random paragraph|
40
-
|mediumtext|up to --max-text-size chars random paragraph|
41
-
|longblob|up to --max-text-size chars random paragraph|
42
-
|longtext|up to --max-text-size chars random paragraph|
43
-
|enum|A random item from the valid items list|
44
-
|set|A random item from the valid items list|
45
-
46
-
Valuable types currently not implemented:
47
-
- JSONs
48
-
- Geospatial
49
-
- Vectors
50
-
51
-
## Options
52
-
53
-
Common options:
54
-
55
-
|Option|Description|
56
-
|------|-----------|
57
-
|--engine|mysql/pg|
58
-
|--host|Host name/ip|
59
-
|--user|Username|
60
-
|--password|Password|
61
-
|--port|Port number|
62
-
|--quiet|Do not print progress bar|
63
-
|--dry-run|Print queries to the standard output instead of inserting them into the db|
64
-
|--debug|Show some debug information|
65
-
|--pprof|Generate pprof trace at --cpu-prof-path. Also opens port 6060 for pprof go tool|
66
-
|--version|Show version and exit|
67
-
|--rows-per-table|Number of rows to insert per-table. Will have priority over --rows|
68
-
|--bulk-size|Number of rows per INSERT statement (Default: 1000)|
69
-
|--workers|how many workers to spawn. Only the random generation and sampling are parallelized. Insert queries are executed one at a time (Default: 3)|
70
-
|--table|Table to insert to. When using --query, --table will be used to restrict the tables to insert to.|
71
-
|--query|Providing a query will analyze its schema usage, insert recursively into tables, and identify implicit joins|
72
-
|--no-skip-fields|Disable field whitelist system. When using a --query, it will get the list of fields being used as a whitelist in order to generate the minimal sets of fields required, unless --no-skip-fields is being used or any * has been found.|
73
-
|--null-freq|Define how frequent nullable fields should be NULL|
74
-
|--null-freq-map|Define how frequent nullable fields should be NULL for a given column. Will have priority over --null-freq. The format is \"--null-freq-map=t1.c1=73;t1.c2=4\" to set 73% or 4% of NULL for respective columns|
75
-
|--values-freq-map|Inject arbitrary values at fixed frequencies. The format is "--values-freq-map=t1.c1=val1:0.75,val2:0.23;t1.c2=10:0.99" so that val1 will be on 75% of rows and val2 on 23% for column c1|
76
-
|--min-generated-time|Generated timestamps will be after this date. Format is RFC3339. Will default to --max-generated-time - 1 year|
77
-
|--max-generated-time|Generated timestamps will be before this date. Format is RFC3339. Will default to now()|
78
-
79
-
Foreign key sampling options:
80
-
|Option|Description|
81
-
|------|-----------|
82
-
|--add-fk|Add foreign keys, if they are not explicitely created in the table schema. It can complement the foreign keys guessed from the --query, or be used to manually define foreign keys when using --no-fk-guess too. Format: --add-fk="parent_table.col1[,col2...]=child_table.colx[,coly...][; additional fk ]". Example: --add-fk="customers.id,created_at=purchases.customer_id,created_at;purchases.id=items.purchase_id"|
83
-
|--no-fk-guess|Do not try to guess foreign keys from the --query missing in the schema. When a query is provided, it will analyze the expected JOINs and try to respect dependencies even when foreign keys are not explicitely created in the database objects. This flag will make the tool stick to the constraints defined in the database only, unless you add foreign keys manually with --add-foreign-keys.|
84
-
|--default-relationship|Will define the default foreign-key relationship to apply. Possible values: binomial,sequential. The default relation can be overriden with other parameters --binomial or --sequential|
85
-
|--binomial|Defines a 1-N foreign key relationships using repeated coin flips. Postgres' tablesamples Bernouilli or mysql RAND() < 0.1 (can be tuned with --coin-flip-percent). Format should be "parent_table=child_table". E.g: --binomial="customers=orders;orders=items"|
86
-
|--coin-flip-percent|When used with --binomial, it will set the likeliness of each rows to be sampled or not. 10 would mean each rows have only 10% chance to be selected when sampling a parent table. Using large values will favor hot rows: the coin flips are done with a table full scan, with a limit set at --bulk-size, so with a large percent chance most of the time the first rows will be selected. No effects when used with --sequential (Default: 1)|
87
-
|--sequential|Defines a sequential foreign key links relationships. Format should be "parent_table=child_table". E.g: --sequential="citizens=ssns"|
88
-
|--normal|Defines a 1-N foreign key relationships using box-muller transformation to provide normal distribution. Slow method needing full table scans for each samples.|
89
-
|--normal-stddev|Standard deviation to the normal law. Will default to 1/10 of the table size|
90
-
|--normal-mean|Mean of the normal law. Will default to the middle of the table, --rows/2|
91
-
|--pareto|Defines a 1-N foreign key relationships using zipf (pareto) distribution. Slow method needing full table scans for each samples|
92
-
|--pareto-s|Zipf slope parameter. Must be above 1. Higher value will mean faster decay, so first rows will be hotter|
93
-
|--pareto-v|Must be >=1. Directly map to V, https://pkg.go.dev/math/rand#Zipf.|
94
-
95
-
## Foreign keys support
96
-
If a field has foreign key constraints, `random-data-load` samples the referenced tables in order to insert valid values for that field.
97
-
To enforce a stable order, an arbitrary 'ORDER BY 1' is applied. This lets --sequential create 1-1 relationships and gives better control over the resulting distribution of --binomial.
98
-
99
-
Composite foreign keys are supported.
100
-
With a very low chance of sampling each row, a single pass might return too few rows. The tool loops until it has sampled enough rows to fill the next bulk insert.
101
-
102
-
**1.** sequential relationships will sample with LIMIT and OFFSET:
103
-
```
104
-
SELECT <field[, field2]> FROM <referenced schema>.<referenced table> ORDER BY 1 LIMIT <--bulk-size> OFFSET y
105
-
```
106
-
This isn't the fastest method but it works for every types. The value of the current OFFSET is protected by mutex to prevents frequent duplicates.
107
-
108
-
**2.** binomial relations will sample differently between postgres and mysql
109
-
110
-
**2.1** For postgres it relies on TABLESAMPLE
111
-
```
112
-
SELECT <field[, field2]> FROM <referenced schema>.<referenced table> TABLESAMPLE BERNOULLI (<--coin-flip-percent>) ORDER BY 1 LIMIT <--bulk-size>
113
-
```
114
-
115
-
**2.2** For mysql, it relies on RAND()
116
-
```
117
-
SELECT <field[, field2]> FROM <referenced schema>.<referenced table> WHERE rand() < (<--coin-flip-percent>/100) ORDER BY 1 LIMIT <--bulk-size>
118
-
```
119
-
120
-
## Guessing implicit foreign keys from queries
121
-
If no foreign keys are explicitly defined in the schema, but the query uses JOINs with an "ON" clause, `random-data-load` will infer the foreign keys and insert valid values so that the JOINs work.
It will skip guessing foreign keys in these cases:
130
-
- JOINs relying on subqueries instead of tables
131
-
- JOINs made implicitly, without JOIN keywords or "ON" clauses
132
-
- (limitation) JOINs whose ON clause is wrapped in parentheses are currently mistaken for subqueries and skipped
133
-
- JOIN conditions using ambiguous columns that do not state which table they belong to. Example: `FROM x JOIN y ON apple=pear` instead of `FROM x JOIN y ON x.apple=y.pear`
134
-
135
-
## Skipping fields that are not relevant to the query
136
-
When using --query, `random-data-load` will avoid generating or sampling fields that are not necessary for the query to run.
137
-
It can be disabled with --no-skip-fields.
138
-
It also disables itself if it encounters any `*`, since the full row length affects query execution.
|--dry-run|Print queries to the standard output instead of inserting them into the db|
108
+
|--debug|Show some debug information|
109
+
|--pprof|Generate pprof trace at --cpu-prof-path. Also opens port 6060 for pprof go tool|
110
+
|--version|Show version and exit|
111
+
|--rows-per-table|Number of rows to insert per-table. Will have priority over --rows|
112
+
|--bulk-size|Number of rows per INSERT statement (Default: 1000)|
113
+
|--workers|Number of workers to spawn. Only random generation and sampling are parallelized. Insert queries are executed one at a time (Default: 3)|
114
+
|--table|Table to insert to. When using --query, --table will be used to restrict the tables to insert to.|
115
+
|--query|Providing a query will analyze its schema usage, insert recursively into tables, and identify implicit joins|
116
+
|--no-skip-fields|Disable field whitelist system. When using a --query, it will get the list of fields being used as a whitelist in order to generate the minimal sets of fields required, unless --no-skip-fields is being used or any * has been found.|
117
+
|--null-freq|Define how frequent nullable fields should be NULL|
118
+
|--null-freq-map|Define how frequently a given nullable column should be NULL. Takes priority over --null-freq. The format is "--null-freq-map=t1.c1=73;t1.c2=4" to set 73% and 4% of NULLs for the respective columns|
119
+
|--values-freq-map|Inject arbitrary values at fixed frequencies. The format is "--values-freq-map=t1.c1=val1:0.75,val2:0.23;t1.c2=10:0.99" so that val1 will be on 75% of rows and val2 on 23% for column c1|
120
+
|--min-generated-time|Generated timestamps will be after this date. Format is RFC3339. Will default to --max-generated-time - 1 year|
121
+
|--max-generated-time|Generated timestamps will be before this date. Format is RFC3339. Will default to now()|
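
The two map formats above can be illustrated with a small parsing sketch. This is not the tool's own code: the function names `parse_null_freq_map` and `parse_values_freq_map` are hypothetical, and the sketch assumes values contain no `;` or `,` characters.

```python
def _split_column(item: str) -> tuple[tuple[str, str], str]:
    """Split 't1.c1=rest' into (('t1', 'c1'), 'rest')."""
    column, sep, rest = item.partition("=")
    table, dot, name = column.strip().partition(".")
    if not sep or not dot or not table or not name:
        raise ValueError(f"expected table.column=..., got {item!r}")
    return (table, name), rest


def parse_null_freq_map(value: str) -> dict[tuple[str, str], float]:
    """Parse 't1.c1=73;t1.c2=4' into {('t1', 'c1'): 73.0, ('t1', 'c2'): 4.0}."""
    result = {}
    for item in value.split(";"):
        if item.strip():
            key, percent = _split_column(item)
            result[key] = float(percent)
    return result


def parse_values_freq_map(value: str) -> dict[tuple[str, str], dict[str, float]]:
    """Parse 't1.c1=val1:0.75,val2:0.23' into per-column value frequencies."""
    result = {}
    for item in value.split(";"):
        if not item.strip():
            continue
        key, pairs = _split_column(item)
        freqs = {}
        for pair in pairs.split(","):
            # rpartition keeps any ':' inside the value itself intact.
            val, sep, freq = pair.rpartition(":")
            if not sep:
                raise ValueError(f"expected value:frequency, got {pair!r}")
            freqs[val] = float(freq)
        result[key] = freqs
    return result
```

Note the different scales taken from the examples in the table: --null-freq-map uses percentages (73), while --values-freq-map uses fractions (0.75).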
122
+
123
+
Foreign key sampling options:
124
+
|Option|Description|
125
+
|------|-----------|
126
+
|--add-fk|Add foreign keys that are not explicitly created in the table schema. It can complement the foreign keys guessed from the --query, or be used to manually define foreign keys when using --no-fk-guess. Format: --add-fk="parent_table.col1[,col2...]=child_table.colx[,coly...][; additional fk ]". Example: --add-fk="customers.id,created_at=purchases.customer_id,created_at;purchases.id=items.purchase_id"|
127
+
|--no-fk-guess|Do not try to guess foreign keys that the --query implies but the schema lacks. When a query is provided, the tool analyzes the expected JOINs and tries to respect dependencies even when foreign keys are not explicitly created in the database objects. This flag makes the tool stick to the constraints defined in the database only, unless you add foreign keys manually with --add-fk.|
128
+
|--default-relationship|Defines the default foreign key relationship to apply. Possible values: binomial,sequential. The default relationship can be overridden per table pair with --binomial or --sequential|
129
+
|--binomial|Defines 1-N foreign key relationships using repeated coin flips: Postgres' TABLESAMPLE BERNOULLI or MySQL's RAND() < 0.1 (can be tuned with --coin-flip-percent). Format should be "parent_table=child_table". E.g: --binomial="customers=orders;orders=items"|
130
+
|--coin-flip-percent|When used with --binomial, sets the likelihood of each row being sampled. 10 means each row has only a 10% chance of being selected when sampling a parent table. Large values favor hot rows: the coin flips are done with a full table scan limited to --bulk-size rows, so with a high percentage the first rows will be selected most of the time. No effect when used with --sequential (Default: 1)|
131
+
|--sequential|Defines sequential foreign key relationships. Format should be "parent_table=child_table". E.g: --sequential="citizens=ssns"|
132
+
|--normal|Defines 1-N foreign key relationships using the Box-Muller transform to provide a normal distribution. Slow method that needs a full table scan for each sample.|
133
+
|--normal-stddev|Standard deviation of the normal distribution. Defaults to 1/10 of the table size|
134
+
|--normal-mean|Mean of the normal distribution. Defaults to the middle of the table, --rows/2|
135
+
|--pareto|Defines 1-N foreign key relationships using a Zipf (Pareto) distribution. Slow method that needs a full table scan for each sample|
136
+
|--pareto-s|Zipf slope parameter. Must be above 1. A higher value means faster decay, so the first rows will be hotter|
137
+
|--pareto-v|Must be >=1. Maps directly to V, https://pkg.go.dev/math/rand#Zipf.|
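
To make the --add-fk format concrete, here is a minimal parsing sketch. It is an illustration only, not the tool's implementation: the `ForeignKey` type and `parse_add_fk` function are hypothetical names, and the check that both sides list the same number of columns is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForeignKey:
    parent_table: str
    parent_columns: tuple[str, ...]
    child_table: str
    child_columns: tuple[str, ...]


def _split_side(side: str) -> tuple[str, tuple[str, ...]]:
    """Split 'table.col1,col2' into ('table', ('col1', 'col2'))."""
    table, sep, columns = side.strip().partition(".")
    if not sep or not table or not columns:
        raise ValueError(f"expected table.col[,col...], got {side!r}")
    return table, tuple(c.strip() for c in columns.split(","))


def parse_add_fk(value: str) -> list[ForeignKey]:
    """Parse 'parent.c1[,c2]=child.cx[,cy][;...]' into ForeignKey entries."""
    keys = []
    for item in value.split(";"):
        if not item.strip():
            continue
        parent, sep, child = item.partition("=")
        if not sep:
            raise ValueError(f"expected parent=child, got {item!r}")
        parent_table, parent_cols = _split_side(parent)
        child_table, child_cols = _split_side(child)
        # Assumption: a composite key needs the same column count on both sides.
        if len(parent_cols) != len(child_cols):
            raise ValueError(f"column count mismatch in {item!r}")
        keys.append(ForeignKey(parent_table, parent_cols, child_table, child_cols))
    return keys
```

Applied to the example from the table, this yields one composite key from `customers` to `purchases` and one single-column key from `purchases` to `items`.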
221
138
139
+
### Example
222
140
141
+
Continuing the example with orders, products and order_items:
142
+
```
223
143
-- how many times products are present in order_items
224
144
postgres=# select oi.product_no, count(*) from order_items oi group by 1 order by 2 desc limit 10;
225
145
product_no | count
@@ -349,7 +269,7 @@ With very low chances to sample rows, we might sample too little. The tool will
349
269
```
350
270
SELECT <field[, field2]> FROM <referenced schema>.<referenced table> ORDER BY 1 LIMIT <--bulk-size> OFFSET y
351
271
```
352
-
This isn't the fastest method but it works for every types. The value of the current OFFSET is protected by mutex to prevents frequent duplicates.
272
+
This isn't the fastest method, but it works for every type and for compound primary keys. The current OFFSET value is protected by a mutex to prevent frequent duplicates.
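
The mutex-protected OFFSET can be sketched as a shared cursor that hands each worker its own window. This is a simplified illustration, not the tool's code: the `OffsetCursor` class is a hypothetical name, and what happens once the offset passes the end of the table is left out.

```python
import threading


class OffsetCursor:
    """Hands out non-overlapping OFFSET values to concurrent workers."""

    def __init__(self, bulk_size: int) -> None:
        self._bulk_size = bulk_size
        self._offset = 0
        self._lock = threading.Lock()

    def next_offset(self) -> int:
        # Read and advance under the lock so two workers never get the
        # same window of rows.
        with self._lock:
            current = self._offset
            self._offset += self._bulk_size
            return current


def build_query(fields: str, table: str, bulk_size: int, offset: int) -> str:
    """Fill in the sequential sampling template shown above."""
    return f"SELECT {fields} FROM {table} ORDER BY 1 LIMIT {bulk_size} OFFSET {offset}"
```

Without the lock, two workers could read the same offset before either advances it, and both would sample the same parent rows.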
353
273
354
274
**2.** binomial relations will sample differently between postgres and mysql
SELECT <field[, field2]> FROM <referenced schema>.<referenced table> WHERE rand() < (<--coin-flip-percent>/100) ORDER BY 1 LIMIT <--bulk-size>
364
284
```
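
The effect of --coin-flip-percent on which rows get picked can be simulated without a database. The sketch below is an assumption-laden model, not the tool's code: it treats the table as row numbers 1..N in key order, flips one coin per row, and stops at the bulk size, which mirrors `WHERE rand() < p ORDER BY 1 LIMIT n`. The name `binomial_sample` is hypothetical.

```python
import random


def binomial_sample(table_size: int, coin_flip_percent: float,
                    bulk_size: int, rng: random.Random) -> list[int]:
    """Return the row numbers kept by one coin-flip pass over the table."""
    picked = []
    for row in range(1, table_size + 1):  # ORDER BY 1: rows in key order
        if rng.random() < coin_flip_percent / 100:
            picked.append(row)
            if len(picked) == bulk_size:  # LIMIT <--bulk-size>
                break
    return picked


if __name__ == "__main__":
    rng = random.Random(0)
    for percent in (1, 10, 90):
        rows = binomial_sample(100_000, percent, 1000, rng)
        print(f"{percent}%: {len(rows)} rows, highest row number {max(rows)}")
```

With a high percentage the limit is reached within the first rows of the table, so the same "hot" parents are sampled on every pass. With a low percentage the picks spread across the whole table, but one pass may return fewer rows than the bulk size.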
365
285
286
+
**3.** Pareto and normal distribution
287
+
Both methods select rows by row number.
288
+
Postgres uses row_number()
289
+
```
290
+
select <fields,..> from (SELECT columns, ROW_NUMBER() OVER (ORDER BY <fields...>) as rownumber FROM table ) f where rownumber IN (x1, x2, ...) and <checking fields not to be null> order by 1 limit <--bulk-size>
291
+
```
292
+
MySQL still uses user variables, to retain MySQL 5.7 compatibility
293
+
```
294
+
select <fields,...> from table, (SELECT @rownumber := 0) f where (@rownumber := @rownumber + 1) IN (x1, x2, ...) and <checking fields not to be null> order by 1 limit <--bulk-size>
295
+
```
296
+
297
+
**3.1** Pareto
298
+
"pareto" actually uses Zipf random number generation. The slope can be tuned with --pareto-s: a higher value means faster decay. The other parameter, --pareto-v, is not documented in its related Go stdlib package for now.
299
+
First rows will be hotter and sampled far more commonly, but it will nonetheless retain a long "tail" over the whole table.
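
A rough way to see the "hot head, long tail" shape is to draw row numbers from a finite Zipf-like distribution. The sketch below is not the tool's code: it assumes P(k) is proportional to (v + k)^(-s) for k in [0, rows - 1], which is how Go's `math/rand` `NewZipf` describes its generator. The function names and the defaults `s=1.5`, `v=1.0` are hypothetical.

```python
import random
from collections import Counter


def zipf_weights(imax: int, s: float, v: float) -> list[float]:
    """Unnormalized weights (v + k) ** (-s) for k in [0, imax]."""
    if s <= 1 or v < 1:
        raise ValueError("expected s > 1 and v >= 1")
    return [(v + k) ** (-s) for k in range(imax + 1)]


def zipf_row_numbers(rows: int, count: int, s: float = 1.5,
                     v: float = 1.0, seed: int = 0) -> list[int]:
    """Draw `count` 1-based row numbers; low row numbers are the hottest."""
    rng = random.Random(seed)
    weights = zipf_weights(rows - 1, s, v)
    # k is 0-based in the distribution; row numbers start at 1.
    return [k + 1 for k in rng.choices(range(rows), weights=weights, k=count)]


if __name__ == "__main__":
    sample = zipf_row_numbers(rows=1000, count=10_000)
    print(Counter(sample).most_common(5))
    print("distinct rows sampled:", len(set(sample)))
```

Raising `s` concentrates more of the draws on the first few rows, matching the "faster decay" described for --pareto-s.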
300
+
301
+
**3.2** Normal
302
+
"normal" is implemented using the Box-Muller transform (producing a normally distributed value from 2 uniformly random floats between 0.0 and 1.0)
303
+
It mostly samples rows around --normal-mean, with a spread set by --normal-stddev, and very few rows in the tails.
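
The Box-Muller transform itself is short enough to sketch. The code below is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, the defaults follow the documented ones (mean at rows/2, standard deviation at 1/10 of the table size), and clamping out-of-range draws to the first or last row is an assumption.

```python
import math
import random
from typing import Optional


def box_muller(rng: random.Random) -> float:
    """Turn two uniform samples into one standard-normal sample."""
    u1 = 1.0 - rng.random()  # shift to (0.0, 1.0] so log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def normal_row_numbers(rows: int, count: int, mean: Optional[float] = None,
                       stddev: Optional[float] = None, seed: int = 0) -> list[int]:
    """Draw `count` 1-based row numbers centred on `mean`."""
    rng = random.Random(seed)
    mean = rows / 2 if mean is None else mean
    stddev = rows / 10 if stddev is None else stddev
    picked = []
    for _ in range(count):
        value = round(mean + stddev * box_muller(rng))
        # Assumption: draws outside the table are clamped to its edges.
        picked.append(min(max(value, 1), rows))
    return picked
```

For a 1000-row table with the defaults, roughly two thirds of the draws land between rows 400 and 600, which is the usual share within one standard deviation of the mean.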
304
+
366
305
## Guessing implicit foreign keys from queries
367
306
If no foreign keys are explicitly defined in the schema, but the query uses JOINs with an "ON" clause, `random-data-load` will infer the foreign keys and insert valid values so that the JOINs work.
368
307
Can be disabled with --no-fk-guess
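
As a rough illustration of the idea, the sketch below pulls table-qualified equality conditions out of `ON` clauses with a regular expression. It is not the tool's parser: the pattern and the `guess_join_pairs` name are hypothetical, and the sketch ignores aliases, schemas and quoted identifiers.

```python
import re

# Matches "ON t1.col1 = t2.col2". Unqualified columns (ON apple=pear) and
# parenthesised ON clauses do not match, so they are skipped.
_ON_CONDITION = re.compile(
    r"\bON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)",
    re.IGNORECASE,
)


def guess_join_pairs(query: str) -> list[tuple[str, str, str, str]]:
    """Return (table_a, column_a, table_b, column_b) for each qualified ON."""
    return [match.groups() for match in _ON_CONDITION.finditer(query)]
```

The pattern alone cannot tell which side is the parent and which is the child, so a real implementation needs more context than this sketch uses.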
@@ -416,6 +355,42 @@ Very, very minimal for now, based on simple regexes.
416
355
They will use an associated gofakeit generator, https://github.com/brianvoe/gofakeit
417
356
418
357
358
+
## Supported fields:
359
+
360
+
|Field type|Generated values|
361
+
|----------|----------------|
362
+
|bool|false ~ true|
363
+
|tinyint|0 ~ 0xFF|
364
+
|smallint|0 ~ 0xFFFF|
365
+
|mediumint|0 ~ 0xFFFFFF|
366
+
|int - integer|0 ~ 0xFFFFFFFF|
367
+
|bigint|0 ~ 0xFFFFFFFFFFFFFFFF|
368
+
|float|0 ~ 1e8|
369
+
|decimal(m,n)|0 ~ 10^(m-n)|
370
+
|double|0 ~ 1000|
371
+
|char(n)|up to n random chars|
372
+
|varchar(n)|up to n random chars|
373
+
|date|between --min-generated-time and --max-generated-time|
374
+
|datetime|between --min-generated-time and --max-generated-time|
375
+
|timestamp|between --min-generated-time and --max-generated-time|
376
+
|time|00:00:00 ~ 23:59:59|
377
+
|year|Current year - 1 ~ current year|
378
+
|tinyblob|up to 100 chars random paragraph|
379
+
|tinytext|up to 100 chars random paragraph|
380
+
|blob|up to --max-text-size chars random paragraph|
381
+
|text|up to --max-text-size chars random paragraph|
382
+
|mediumblob|up to --max-text-size chars random paragraph|
383
+
|mediumtext|up to --max-text-size chars random paragraph|
384
+
|longblob|up to --max-text-size chars random paragraph|
385
+
|longtext|up to --max-text-size chars random paragraph|
386
+
|enum|A random item from the valid items list|
387
+
|set|A random item from the valid items list|
388
+
389
+
Valuable types currently not implemented:
390
+
- JSONs
391
+
- Geospatial
392
+
- Vectors
393
+
419
394
## How to download the precompiled binaries
420
395
421
396
Precompiled binaries for Linux and Darwin are available for each version in the releases tab: