README.md
+25 -81 (25 additions & 81 deletions)
@@ -27,9 +27,9 @@ This is early stage
 |double|0 ~ 1000|
 |char(n)|up to n random chars|
 |varchar(n)|up to n random chars|
-|date|NOW() - 1 year ~ NOW()|
-|datetime|NOW() - 1 year ~ NOW()|
-|timestamp|NOW() - 1 year ~ NOW()|
+|date|between --min-generated-time and --max-generated-time|
+|datetime|between --min-generated-time and --max-generated-time|
+|timestamp|between --min-generated-time and --max-generated-time|
 |time|00:00:00 ~ 23:59:59|
 |year|Current year - 1 ~ current year|
 |tinyblob|up to 100 chars random paragraph|
@@ -49,18 +49,38 @@ Valuable types currently not implemented:
 - Vectors
 
 ## Options
+
+Common options:
+
 |Option|Description|
 |------|-----------|
 |--engine|mysql/pg|
 |--host|Host name/ip|
 |--user|Username|
 |--password|Password|
 |--port|Port number|
+|--quiet|Do not print progress bar|
+|--dry-run|Print queries to the standard output instead of inserting them into the db|
+|--debug|Show some debug information|
+|--pprof|Generate pprof trace at --cpu-prof-path. Also opens port 6060 for pprof go tool|
+|--version|Show version and exit|
 |--rows-per-table|Number of rows to insert per-table. Will have priority over --rows|
 |--bulk-size|Number of rows per INSERT statement (Default: 1000)|
 |--workers|how many workers to spawn. Only the random generation and sampling are parallelized. Insert queries are executed one at a time (Default: 3)|
 |--table|Table to insert to. When using --query, --table will be used to restrict the tables to insert to.|
 |--query|Providing a query will analyze its schema usage, insert recursively into tables, and identify implicit joins|
+|--no-skip-fields|Disable field whitelist system. When using a --query, it will get the list of fields being used as a whitelist in order to generate the minimal sets of fields required, unless --no-skip-fields is being used or any * has been found.|
+|--null-freq|Define how frequent nullable fields should be NULL|
+|--null-freq-map|Define how frequent nullable fields should be NULL for a given column. Will have priority over --null-freq. The format is \"--null-freq-map=t1.c1=73;t1.c2=4\" to set 73% or 4% of NULL for respective columns|
+|--values-freq-map|Inject arbitrary values at fixed frequencies. The format is "--values-freq-map=t1.c1=val1:0.75,val2:0.23;t1.c2=10:0.99" so that val1 will be on 75% of rows and val2 on 23% for column c1|
+|--min-generated-time|Generated timestamps will be after this date. Format is RFC3339. Will default to --max-generated-time - 1 year|
+|--max-generated-time|Generated timestamps will be before this date. Format is RFC3339. Will default to now()|
+
+Foreign key sampling options:
+|Option|Description|
+|------|-----------|
+|--add-fk|Add foreign keys, if they are not explicitely created in the table schema. It can complement the foreign keys guessed from the --query, or be used to manually define foreign keys when using --no-fk-guess too. Format: --add-fk="parent_table.col1[,col2...]=child_table.colx[,coly...][; additional fk ]". Example: --add-fk="customers.id,created_at=purchases.customer_id,created_at;purchases.id=items.purchase_id"|
+|--no-fk-guess|Do not try to guess foreign keys from the --query missing in the schema. When a query is provided, it will analyze the expected JOINs and try to respect dependencies even when foreign keys are not explicitely created in the database objects. This flag will make the tool stick to the constraints defined in the database only, unless you add foreign keys manually with --add-foreign-keys.|
 |--default-relationship|Will define the default foreign-key relationship to apply. Possible values: binomial,sequential. The default relation can be overriden with other parameters --binomial or --sequential|
 |--binomial|Defines a 1-N foreign key relationships using repeated coin flips. Postgres' tablesamples Bernouilli or mysql RAND() < 0.1 (can be tuned with --coin-flip-percent). Format should be "parent_table=child_table". E.g: --binomial="customers=orders;orders=items"|
 |--coin-flip-percent|When used with --binomial, it will set the likeliness of each rows to be sampled or not. 10 would mean each rows have only 10% chance to be selected when sampling a parent table. Using large values will favor hot rows: the coin flips are done with a table full scan, with a limit set at --bulk-size, so with a large percent chance most of the time the first rows will be selected. No effects when used with --sequential (Default: 1)|
@@ -71,17 +91,6 @@ Valuable types currently not implemented:
 |--pareto|Defines a 1-N foreign key relationships using zipf (pareto) distribution. Slow method needing full table scans for each samples|
 |--pareto-s|Zipf slope parameter. Must be above 1. Higher value will mean faster decay, so first rows will be hotter|
 |--pareto-v|Must be >=1. Directly map to V, https://pkg.go.dev/math/rand#Zipf.|
-|--add-fk|Add foreign keys, if they are not explicitely created in the table schema. It can complement the foreign keys guessed from the --query, or be used to manually define foreign keys when using --no-fk-guess too. Format: --add-fk="parent_table.col1[,col2...]=child_table.colx[,coly...][; additional fk ]". Example: --add-fk="customers.id,created_at=purchases.customer_id,created_at;purchases.id=items.purchase_id"|
-|--no-fk-guess|Do not try to guess foreign keys from the --query missing in the schema. When a query is provided, it will analyze the expected JOINs and try to respect dependencies even when foreign keys are not explicitely created in the database objects. This flag will make the tool stick to the constraints defined in the database only, unless you add foreign keys manually with --add-foreign-keys.|
-|--no-skip-fields|Disable field whitelist system. When using a --query, it will get the list of fields being used as a whitelist in order to generate the minimal sets of fields required, unless --no-skip-fields is being used or any * has been found.|
-|--null-freq|Define how frequent nullable fields should be NULL|
-|--null-freq-map|Define how frequent nullable fields should be NULL for a given column. Will have priority over --null-freq. The format is \"--null-freq-map=t1.c1=73;t1.c2=4\" to set 73% or 4% of NULL for respective columns|
-|--values-freq-map|Inject arbitrary values at fixed frequencies. The format is "--values-freq-map=t1.c1=val1:0.75,val2:0.23;t1.c2=10:0.99" so that val1 will be on 75% of rows and val2 on 23% for column c1|
-|--quiet|Do not print progress bar|
-|--dry-run|Print queries to the standard output instead of inserting them into the db|
-|--debug|Show some debug information|
-|--pprof|Generate pprof trace at --cpu-prof-path. Also opens port 6060 for pprof go tool|
-|--version|Show version and exit|
 
 ## Foreign keys support
 If a field has Foreign Keys constraints, `random-data-load` will get samples from the referenced tables in order to insert valid values for the field.
@@ -329,72 +338,6 @@ postgres=# select oi.product_no, count(*) from order_items oi group by 1 order b
 
 ```
 
-
-## Options
-|Option|Description|
-|------|-----------|
-|--engine|mysql/pg|
-|--host|Host name/ip|
-|--user|Username|
-|--password|Password|
-|--port|Port number|
-|--bulk-size|Number of rows per INSERT statement (Default: 1000)|
-|--workers|how many workers to spawn. Only the random generation and sampling are parallelized. Insert queries are executed one at a time (Default: 3)|
-|--table|Table to insert to. When using --query, --table will be used to restrict the tables to insert to.|
-|--query|Providing a query will analyze its schema usage, insert recursively into tables, and identify implicit joins|
-|--default-relationship|Will define the default foreign-key relationship to apply. Possible values: binomial,sequential. The default relation can be overriden with other parameters --binomial or --sequential|
-|--binomial|Defines a 1-N foreign key relationships using repeated coin flips. Postgres' tablesamples Bernouilli or mysql RAND() < 0.1 (can be tuned with --coin-flip-percent). Format should be "parent_table=child_table". E.g: --binomial="customers=orders;orders=items"|
-|--coin-flip-percent|When used with --binomial, it will set the likeliness of each rows to be sampled or not. 10 would mean each rows have only 10%% chance to be selected when sampling a parent table. Using large values will favor hot rows: the coin flips are done with a table full scan, with a limit set at --bulk-size, so with a large percent chance most of the time the first rows will be selected. No effects when used with --sequential (Default: 1)|
-|--sequential|Defines a sequential foreign key links relationships. Format should be "parent_table=child_table". E.g: --sequential="citizens=ssns"|
-|--add-foreign-keys|Add foreign keys, if they are not explicitely created in the table schema. The format must be parent_table.col1=child_table.col2. It can complement the foreign keys guessed from the --query, or be used to manually define foreign keys when using --no-fk-guess too. Example --add-foreign-keys="customers.id=purchases.customer_id;purchases.id=items.purchase_id"|
-|--no-fk-guess|Do not try to guess foreign keys from the --query missing in the schema. When a query is provided, it will analyze the expected JOINs and try to respect dependencies even when foreign keys are not explicitely created in the database objects. This flag will make the tool stick to the constraints defined in the database only, unless you add foreign keys manually with --add-foreign-keys.|
-|--no-skip-fields|Disable field whitelist system. When using a --query, it will get the list of fields being used as a whitelist in order to generate the minimal sets of fields required, unless --no-skip-fields is being used or any * has been found.|
-|--null-freq|Define how frequent nullable fields should be NULL by default|
-|--values-freq-map|Define how frequent nullable fields should be NULL for a given column. Will have priority over --null-freq. The format is "--null-freq-map=t1.c1=73;t1.c2=4" to set 73%% or 4%% of NULL for respective columns
-|--query-param-freq|Frequency at which to insert arbitrary values guessed from the query parameters. = and IN operators are handled. Can be disabled when set to 0.0.|
-|--quiet|Do not print progress bar|
-|--dry-run|Print queries to the standard output instead of inserting them into the db|
-|--debug|Show some debug information|
-|--pprof|Generate pprof trace at --cpu-prof-path. Also opens port 6060 for pprof go tool|
-|--version|Show version and exit|
-
-
-## Supported fields:
-|Field type|Generated values|
-|----------|----------------|
-|bool|false ~ true|
-|tinyint|0 ~ 0xFF|
-|smallint|0 ~ 0XFFFF|
-|mediumint|0 ~ 0xFFFFFF|
-|int - integer|0 ~ 0xFFFFFFFF|
-|bigint|0 ~ 0xFFFFFFFFFFFFFFFF|
-|float|0 ~ 1e8|
-|decimal(m,n)|0 ~ 10^(m-n)|
-|double|0 ~ 1000|
-|char(n)|up to n random chars|
-|varchar(n)|up to n random chars|
-|date|NOW() - 1 year ~ NOW()|
-|datetime|NOW() - 1 year ~ NOW()|
-|timestamp|NOW() - 1 year ~ NOW()|
-|time|00:00:00 ~ 23:59:59|
-|year|Current year - 1 ~ current year|
-|tinyblob|up to 100 chars random paragraph|
-|tinytext|up to 100 chars random paragraph|
-|blob|up to --max-text-size chars random paragraph|
-|text|up to --max-text-size chars random paragraph|
-|mediumblob|up to --max-text-size chars random paragraph|
-|mediumtext|up to --max-text-size chars random paragraph|
-|longblob|up to --max-text-size chars random paragraph|
-|longtext|up to --max-text-size chars random paragraph|
-|enum|A random item from the valid items list|
-|set|A random item from the valid items list|
-
-Valuable types currently not implemented:
-- JSONs
-- Geospatial
-- Vectors
-
-
 ## Foreign keys support
 If a field has Foreign Keys constraints, `random-data-load` will get samples from the referenced tables in order to insert valid values for the field.
 To enforce orders, an arbitrary 'ORDER BY 1' is made. This is so that --sequential can create 1-1 relationship, and to better master the eventual distribution of --binomial.
--[] better datetime random generation. It should be flexible over its range
+-[x] better datetime random generation. It should be flexible over its range
 -[x] use more gofakeit generators with regexes to generate "legit" data when possible
 -[ ] helpers to get schema (generate pgdump/mysqldump commands, get index stats, ...)
 -[x] protect against foreign key cycles. Both explicits and implicits (avoid generating implicits that would end up causing loops)
@@ -498,6 +441,7 @@ Stepping stones to fully reproduce cardinalities:
 -[x] table-per-table override for --rows, --null-frequency
 -[ ] coin-flip-percent per relationship basis. Current thought: adding it to --binomial this way --binomial="parent=child:70" to set the coinflip to 70 for this link
 -[ ] parse col/index stats (cardinality + most_common_elems + most_common_freqs for postgres, cardinalities for MySQL)
+-[ ] estimate/decide sampling method+tuning based on stats
 
 Without clear plan:
 -[x] More random algorithms (as of now, no good implementations has been found for pareto that wouldn't provoke huge runtime and/or huge memory consumption, unless implemented fields are restricted to integers)
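
The `--values-freq-map` format documented in the README hunk (`t1.c1=val1:0.75,val2:0.23;t1.c2=10:0.99`) can be illustrated with a small parser. This is only a sketch of how such a string might be split, not the tool's actual parsing code: `valueFreq` and `parseValuesFreqMap` are hypothetical names, and rejecting frequencies outside 0..1 or totals above 1 is an assumption of this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// valueFreq pairs an injected value with the fraction of rows that should carry it.
// Hypothetical type, introduced for this illustration only.
type valueFreq struct {
	Value string
	Freq  float64
}

// parseValuesFreqMap is a hypothetical helper that splits a string such as
// "t1.c1=val1:0.75,val2:0.23;t1.c2=10:0.99" into one list of values per column.
// Values containing ',' or ';' are not supported by this sketch.
func parseValuesFreqMap(s string) (map[string][]valueFreq, error) {
	out := make(map[string][]valueFreq)
	for _, entry := range strings.Split(s, ";") {
		if entry == "" {
			continue // tolerate a trailing ';'
		}
		column, list, ok := strings.Cut(entry, "=")
		if !ok {
			return nil, fmt.Errorf("missing '=' in %q", entry)
		}
		total := 0.0
		for _, pair := range strings.Split(list, ",") {
			// Split on the last ':' so that values containing ':' still parse.
			i := strings.LastIndex(pair, ":")
			if i < 0 {
				return nil, fmt.Errorf("missing ':' in %q", pair)
			}
			freq, err := strconv.ParseFloat(pair[i+1:], 64)
			if err != nil {
				return nil, fmt.Errorf("bad frequency in %q: %w", pair, err)
			}
			if freq < 0 || freq > 1 {
				return nil, fmt.Errorf("frequency out of range in %q", pair)
			}
			total += freq
			out[column] = append(out[column], valueFreq{Value: pair[:i], Freq: freq})
		}
		// Assumption: the frequencies of one column cannot exceed 100% of rows.
		if total > 1+1e-9 {
			return nil, fmt.Errorf("frequencies for %s sum to more than 1", column)
		}
	}
	return out, nil
}

func main() {
	m, err := parseValuesFreqMap("t1.c1=val1:0.75,val2:0.23;t1.c2=10:0.99")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for column, values := range m {
		fmt.Println(column, values)
	}
}
```

Rows not covered by any listed value (2% of `t1.c1` in the README example) would presumably still receive randomly generated data.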
cmd/run.go
+21 -11 (21 additions & 11 deletions)
@@ -6,6 +6,7 @@ import (
 	"regexp"
 	"slices"
 	"strings"
+	"time"
 
 	"github.com/apoorvam/goterminal"
 	"github.com/pkg/errors"
@@ -19,16 +20,18 @@ import (
 type RunCmd struct {
 	DB db.Config `embed:""`
 
-	Table        string           `help:"Table to insert to. When using --query, --table will be used to restrict the tables to insert to."`
-	Rows         int64            `name:"rows" required:"true" help:"Number of rows to insert"`
-	RowsPerTable map[string]int64 `name:"rows-per-table" help:"Number of rows to insert per-table. Will have priority over --rows. Format is \"{table}=X\"" default:""`
-	BulkSize     int64            `name:"bulk-size" help:"Number of rows per insert statement" default:"1000"`
-	DryRun       bool             `name:"dry-run" help:"Print queries to the standard output instead of inserting them into the db"`
-	Quiet        bool             `name:"quiet" help:"Do not print progress bar"`
-	WorkersCount int              `name:"workers" help:"How many workers to spawn. Only the random generation and sampling are parallelized. Insert queries are executed one at a time" default:"3"`
-	MaxTextSize  int64            `help:"Limit the maximum size of long text, varchar and blob fields." default:"65535"`
-	UUIDVersion  int              `name:"uuid-version" help:"UUID v4 or v7 for uuid datatypes" default:"4" enum:"4,7"`
-	Query        string           `help:"Providing a query will enable to automatically discover the schema, insert recursively into tables, enforce implicit joins."`
+	Table            string           `help:"Table to insert to. When using --query, --table will be used to restrict the tables to insert to."`
+	Rows             int64            `name:"rows" required:"true" help:"Number of rows to insert"`
+	RowsPerTable     map[string]int64 `name:"rows-per-table" help:"Number of rows to insert per-table. Will have priority over --rows. Format is \"{table}=X\"" default:""`
+	BulkSize         int64            `name:"bulk-size" help:"Number of rows per insert statement" default:"1000"`
+	DryRun           bool             `name:"dry-run" help:"Print queries to the standard output instead of inserting them into the db"`
+	Quiet            bool             `name:"quiet" help:"Do not print progress bar"`
+	WorkersCount     int              `name:"workers" help:"How many workers to spawn. Only the random generation and sampling are parallelized. Insert queries are executed one at a time" default:"3"`
+	MaxTextSize      int64            `help:"Limit the maximum size of long text, varchar and blob fields." default:"65535"`
+	UUIDVersion      int              `name:"uuid-version" help:"UUID v4 or v7 for uuid datatypes" default:"4" enum:"4,7"`
+	MinGeneratedTime time.Time        `help:"Generated timestamps will be after this date. Format is RFC3339. Will default to --max-generated-time - 1 year"`
+	MaxGeneratedTime time.Time        `help:"Generated timestamps will be before this date. Format is RFC3339. Will default to now()"`
+	Query            string           `help:"Providing a query will enable to automatically discover the schema, insert recursively into tables, enforce implicit joins."`
 
 	generate.ForeignKeyLinks
 	AddForeignKeys query.VirtualJoins `name:"add-fk" help:"Add foreign keys, if they are not explicitely created in the table schema. It can complement the foreign keys guessed from the --query, or be used to manually define foreign keys when using --no-fk-guess too. Format: --add-fk=\"parent_table.col1[,col2...]=child_table.colx[,coly...][; additional fk ]\". Example: --add-fk=\"customers.id,created_at=purchases.customer_id,created_at;purchases.id=items.purchase_id\""`
log.Info().Msgf("Increasing --coin-flip-percent to %.10f due to low --rows to ensure we can at least sample and get half of --bulk-size at a time", cmd.CoinFlipPercent)