@@ -16,9 +16,62 @@ The column types can also be configured to override the default type mapping, us
1616 diagram (see the [Getting started](getting_started.md) page for directions on how to visualize data models) and
1717 then adapt the configuration if need be.
1818
19- Configuration options are described below.
19+ Configuration options are described below. Some options can be set at the model level, others at the table level and
20+ others at the field level. The general structure of the configuration dict is the following:
21+
22+ ```py title="Model config general structure" linenums="1"
23+ {
24+     "document_tree_hook": None,
25+     "document_tree_node_hook": None,
26+     "row_numbers": False,
27+     "as_columnstore": False,
28+     "metadata_columns": None,
29+     "tables": {
30+         "table1": {
31+             "reuse": True,
32+             "choice_transform": False,
33+             "as_columnstore": False,
34+             "fields": {
35+                 "my_column": {
36+                     "type": None  # default type
37+                 }
38+             },
39+             "extra_args": [],
40+         }
41+     }
42+ }
43+ ```
44+
45+ ## Model configuration
2046
21- ## Field level config
47+ The following options can be passed as top-level keys of the model configuration `dict`:
48+
49+ * `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
50+ access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
51+ for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
52+ should of course stay compatible with the data model.
53+ * `document_tree_node_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It is
54+ similar to `document_tree_hook`, but it is called as soon as a node is completed, without waiting for the entire parsing to
55+ finish. It is especially useful if you intend to filter out some nodes and reduce the memory footprint while parsing.
56+ * `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationship tables, or directly to data tables when
57+ deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not
58+ always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
59+ default value is `False` (disabled).
60+ * `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can also be set at
61+ the table level for each table. However, for `n-n` relationship tables, this option is the only way to configure the
62+ clustered columnstore indexes. The default value is `False` (disabled).
63+ * `metadata_columns` (`list`): a list of extra columns that you want to add to the root table of your model. This is
64+ useful, for instance, to add the name of the file which has been parsed, or a timestamp. Columns should be specified
65+ as dicts; the only required keys are `name` and `type` (a SQLAlchemy type object), and other keys will be passed directly
66+ as keyword arguments to `sqlalchemy.Column`. Actual values need to be passed to
67+ [`Document.insert_into_target_tables`](api/document.md#xml2db.document.Document.insert_into_target_tables) for each
68+ parsed document, as a `dict`, using the `metadata` argument.
69+ * `record_hash_column_name`: the name of the column used to store record hashes (defaults to `xml2db_record_hash`).
70+ * `record_hash_constructor`: a function used to build a hash, with a signature similar to `hashlib` constructor
71+ functions (defaults to `hashlib.sha1`).
72+ * `record_hash_size`: the byte size of the record hash (defaults to 20, which is the size of a SHA-1 hash).
73+
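As a rough illustration of how the three record hash options fit together, here is a minimal sketch that swaps the default SHA-1 for SHA-256. It assumes that `record_hash_size` should match the digest size of the chosen constructor (the text above only states the default pairing of `hashlib.sha1` and 20 bytes), and the column name `record_hash` is a made-up value used for illustration.

```python
import hashlib

# Assumption: record_hash_size should equal the digest size (in bytes) of the
# chosen constructor; the documentation only states the default pair sha1 / 20.
hash_constructor = hashlib.sha256

model_config = {
    "record_hash_column_name": "record_hash",  # hypothetical column name
    "record_hash_constructor": hash_constructor,
    # Derive the size from the constructor instead of hard-coding it
    "record_hash_size": hash_constructor().digest_size,
}

# The documented defaults are consistent with each other: SHA-1 digests are 20 bytes
assert hashlib.sha1().digest_size == 20
```

Deriving the size from the constructor keeps the two options in sync if the hash function is changed later.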
74+ ## Fields configuration
2275
2376These configuration options are defined for a specific field of a specific table. A "field" refers to a column in the
2477table, or a child table.
@@ -140,7 +193,7 @@ timeInterval_end[1, 1]: string
140193 }
141194 ```
142195
143- ## Table level config
196+ ## Tables configuration
144197
145198### Simplify "choice groups"
146199
@@ -226,20 +279,22 @@ With MS SQL Server database backend, `xml2db` can create
226279on tables. However, for `n-n` relationship tables, this option needs to be set globally (see below). The default value
227280is `False` (disabled).
228281
229- Configuration: ` "as_columnstore": ` ` False ` (default) or ` True `
282+ ### Extra arguments
230283
231- ## Global options
284+ Extra arguments can be passed to `sqlalchemy.Table` constructors, for instance if you want to customize indexes. These
285+ must be passed in an iterable (e.g. `tuple` or `list`), which is simply unpacked into the `sqlalchemy.Table`
286+ constructor when building the table.
232287
233- These options can be passed as a top-level keys of the model configuration ` dict ` :
288+ Configuration: `"extra_args": []` (default)
234289
235- * ` document_tree_hook ` ( ` Callable ` ): sets a hook function which can modify the data extracted from the XML. It gives direct
236- access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
237- for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
238- should of course stay compatible with the data model.
239- * ` row_numbers ` ( ` bool ` ): adds ` xml2db_row_number ` columns either to ` n-n ` relationships tables, or directly to data tables when
240- deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not
241- always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
242- default value is ` False ` (disabled).
243- * ` as_columnstore ` ( ` bool ` ): for MS SQL Server, create clustered columnstore indexes on all tables. This can be also set up at
244- the table level for each table. However, for ` n-n ` relationships tables, this option is the only way to configure the
245- clustered columnstore indexes. The default value is ` False ` (disabled).
290+ !!! example
291+     Adding an index on specific columns:
292+     ```python
293+     model_config = {  # assumes `import sqlalchemy` earlier in the module
294+         "tables": {
295+             "my_table": {
296+                 "extra_args": [sqlalchemy.Index("my_index", "my_column1", "my_column2")],  # iterable of extra arguments
297+             }
298+         }
299+     }
300+     ```