Commit f97eb2f

Enable loading several documents at the same time (#7)
1 parent ec5ef05 commit f97eb2f

13 files changed

Lines changed: 120 additions & 38 deletions

.github/workflows/integration-tests-mssql.yml

Lines changed: 2 additions & 2 deletions
@@ -23,10 +23,10 @@ jobs:
       - name: Check out repository code
         uses: actions/checkout@v4

-      - name: Set up Python 3.11
+      - name: Set up Python 3.12
         uses: actions/setup-python@v5
         with:
-          python-version: 3.11
+          python-version: 3.12

       - name: Install dependencies
         run: |

.github/workflows/integration-tests-mysql.yml

Lines changed: 2 additions & 2 deletions
@@ -32,10 +32,10 @@ jobs:
       - name: Check out repository code
         uses: actions/checkout@v4

-      - name: Set up Python 3.11
+      - name: Set up Python 3.12
         uses: actions/setup-python@v5
         with:
-          python-version: 3.11
+          python-version: 3.12

       - name: Install dependencies
         run: |

.github/workflows/integration-tests-postgres.yml

Lines changed: 3 additions & 3 deletions
@@ -10,7 +10,7 @@ on:
 jobs:
   integration-tests:
     runs-on: ubuntu-latest
-    container: python:3.11-bookworm
+    container: python:3.12-bookworm
     services:
       postgres:
         image: postgres
@@ -29,10 +29,10 @@ jobs:
       - name: Check out repository code
         uses: actions/checkout@v4

-      - name: Set up Python 3.11
+      - name: Set up Python 3.12
         uses: actions/setup-python@v5
         with:
-          python-version: 3.11
+          python-version: 3.12

       - name: Install dependencies
         run: |

.github/workflows/publish-to-gh-pages.yml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ jobs:
           git config user.email 41898282+github-actions[bot]@users.noreply.github.com
       - uses: actions/setup-python@v5
         with:
-          python-version: 3.x
+          python-version: 3.12
       - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
       - uses: actions/cache@v4
         with:

.github/workflows/publish-to-pypi.yml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ jobs:
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
-          python-version: "3.11"
+          python-version: "3.12"
       - name: Install pypa/build
         run: >-
           python3 -m

docs/configuring.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ clustered columnstore indexes. The default value is `False` (disabled).
 useful for instance to add the name of the file which has been parsed, or a timestamp, etc. Columns should be specified
 as dicts, the only required keys are `name` and `type` (a SQLAlchemy type object); other keys will be passed directly
 as keyword arguments to `sqlalchemy.Column`. Actual values need to be passed to
-[`Document.insert_into_target_tables`](api/document.md#xml2db.document.Document.insert_into_target_tables) for each
+[`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for each
 parsed document, as a `dict`, using the `metadata` argument.
 * `record_hash_column_name`: the column name to use to store records hash data (defaults to `xml2db_record_hash`).
 * `record_hash_constructor`: a function used to build a hash, with a signature similar to `hashlib` constructor
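
As an illustration of how `metadata_columns` and the `metadata` argument relate, here is a minimal, self-contained sketch that mirrors the metadata-writing loop this commit adds to `_extract_node`. It does not import xml2db: `apply_metadata` and the sample column definitions are hypothetical names used only for this example, and the `type` values are plain strings standing in for SQLAlchemy type objects.

```python
def apply_metadata(record, metadata, metadata_columns):
    """Copy configured metadata values into a root-table record (illustrative helper)."""
    # Mirrors the isinstance(metadata, dict) guard: no metadata means nothing to copy.
    if not isinstance(metadata, dict):
        return record
    for meta_col in metadata_columns:
        name = meta_col["name"]
        # Only columns declared in the config are copied; other keys are ignored.
        if name in metadata:
            record[name] = metadata[name]
    return record


# Hypothetical column definitions; strings stand in for SQLAlchemy types.
metadata_columns = [
    {"name": "input_file_path", "type": "String(256)"},
    {"name": "loaded_at", "type": "DateTime"},
]

root_record = {"id": 1}
apply_metadata(
    root_record,
    {"input_file_path": "orders.xml", "not_configured": "x"},
    metadata_columns,
)
print(root_record)
```

In this sketch, a key that is not declared in `metadata_columns` is silently dropped, and a declared column with no value in `metadata` is simply left unset on the record.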

docs/getting_started.md

Lines changed: 19 additions & 0 deletions
@@ -121,6 +121,25 @@ troubleshooting if need be.
 [`metadata_columns` option](configuring.md#model-configuration) and create additional columns in the root table.
 It can be used for instance to save file name or loading timestamp.

+Actual values need to be passed to [`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for
+each parsed document, as a `dict`, using the `metadata` argument.
+
+!!! note
+    You can also load multiple documents at the same time to the database, which could make the process faster if you
+    have a lot of small XML files to load:
+    ``` py
+    data = None
+    for xml_file in files:
+        document = data_model.parse_xml(
+            xml_file=xml_file,
+            flat_data=data,
+        )
+        data = document.data
+    document.insert_into_target_tables()
+    ```
+
+
+
 ## Getting back the data into XML

 You can extract the data from the database into XML files. This was implemented primarily to be able to test the package
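
The accumulation pattern described in the note (feed each call the data returned by the previous one, then insert once at the end) can be sketched without a database. This is a simplified stand-in, not xml2db itself: `StubDocument` and `parse_stub` are hypothetical names, and the flat data here is a plain dict of lists rather than the library's actual table structure.

```python
class StubDocument:
    """Stand-in for a parsed document: holds flat data keyed by table name."""

    def __init__(self, data):
        self.data = data


def parse_stub(rows, flat_data=None):
    """Append one file's rows to an existing flat-data dict, or start a new one."""
    tables = flat_data if flat_data else {}
    tables.setdefault("root", []).extend(rows)
    return StubDocument(tables)


# Hypothetical inputs: file name -> rows that parsing would produce.
files = {"a.xml": ["a1", "a2"], "b.xml": ["b1"]}

data = None
document = None
for name, rows in files.items():
    document = parse_stub(rows, flat_data=data)
    data = document.data  # carry the accumulated tables into the next call

# After the loop, the last document holds the rows of every file.
print(document.data)
```

Because every call extends the same dict, only the last `document` needs to be inserted; it already contains the records of all previously parsed files.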

docs/stylesheets/extra.css

Lines changed: 4 additions & 0 deletions
@@ -6,4 +6,8 @@
   --md-accent-fg-color: #116baa;
   --md-accent-fg-color--light: #2a8cd0;
   --md-accent-fg-color--dark: #116baa;
+}
+
+.md-typeset .admonition, .md-typeset details {
+  font-size: .75rem;
 }

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "xml2db"
-version = "0.11.0"
+version = "0.12.0"
 authors = [
   { name="Commission de régulation de l'énergie", email="opensource@cre.fr" },
 ]

src/xml2db/document.py

Lines changed: 27 additions & 17 deletions
@@ -37,9 +37,11 @@ def __init__(self, model: "DataModel"):
     def parse_xml(
         self,
         xml_file: Union[str, BytesIO],
+        metadata: dict = None,
         skip_validation: bool = True,
         iterparse: bool = True,
         recover: bool = False,
+        flat_data: dict = None,
     ) -> None:
         """Parse an XML document and apply transformation corresponding to the target data model

@@ -50,9 +52,13 @@ def parse_xml(

         Args:
             xml_file: The path or the file object of an XML file to parse
+            metadata: A dict of metadata values to add to the root table (a value for each key defined in
+                `metadata_columns` passed to model config)
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
+            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
+                a new one
         """
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else "<stream>"

@@ -69,7 +75,11 @@ def parse_xml(
             document_tree = self.model.model_config["document_tree_hook"](document_tree)

         logger.info(f"Adding records to data model for {self.xml_file_path}")
-        self.data = self.doc_tree_to_flat_data(document_tree)
+        self.data = self.doc_tree_to_flat_data(
+            document_tree,
+            metadata=metadata,
+            flat_data=flat_data,
+        )

         logger.debug(self.__repr__())

@@ -90,11 +100,16 @@ def to_xml(
         converter.document_tree = self.flat_data_to_doc_tree()
         return converter.to_xml(out_file=out_file, nsmap=nsmap, indent=indent)

-    def doc_tree_to_flat_data(self, document_tree: tuple) -> dict:
+    def doc_tree_to_flat_data(
+        self, document_tree: tuple, metadata: dict = None, flat_data: dict = None
+    ) -> dict:
         """Convert document tree (nested dict) to flat tables data model to prepare database import

         Args:
             document_tree: A tuple (node_type, content, hash) containing the document tree
+            metadata: A dict of metadata values to add to the root table (a value for each key defined in
+                `metadata_columns` passed to model config)
+            flat_data: A dict to store the flat data into

         Returns:
             A dict containing flat tables
@@ -108,6 +123,7 @@ def _extract_node(
             Args:
                 node: A tuple (node_type, content, hash) containing a node of the document tree
                 pk_parent_node: The primary key of its parent node
+                row_number: The row number of the record
                 data_model: The dict to write output to

             Returns:
@@ -196,6 +212,12 @@ def _extract_node(
                     else:
                         record[f"temp_{rel.field_name}"] = None

+            # write metadata if it is the root table
+            if pk_parent_node == 0 and isinstance(metadata, dict):
+                for meta_col in self.model.model_config.get("metadata_columns", []):
+                    if meta_col["name"] in metadata:
+                        record[meta_col["name"]] = metadata[meta_col["name"]]
+
             record[self.model.model_config["record_hash_column_name"]] = node_hash

             # add n-n relationship data for reused children nodes
@@ -231,7 +253,7 @@ def _extract_node(

             return record_pk

-        flat_tables = {}
+        flat_tables = flat_data if flat_data else {}
         _extract_node(document_tree, 0, 0, flat_tables)

         return flat_tables
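
A note on the expression `flat_data if flat_data else {}` in this hunk: it tests truthiness, so an empty dict passed by the caller is treated like `None` and replaced by a fresh dict. The sketch below contrasts that with an explicit `is not None` check. The functions `collect` and `collect_strict` are hypothetical stand-ins written for this illustration and are not part of xml2db.

```python
def collect(rows, flat_data=None):
    """Same fallback expression as in the hunk above (truthiness test)."""
    tables = flat_data if flat_data else {}
    tables.setdefault("root", []).extend(rows)
    return tables


def collect_strict(rows, flat_data=None):
    """Variant that only falls back to a new dict when flat_data is None."""
    tables = flat_data if flat_data is not None else {}
    tables.setdefault("root", []).extend(rows)
    return tables


shared = {}
result = collect(["r1"], flat_data=shared)
# An empty dict is falsy, so a new dict was created and `shared` is untouched.
print(result is shared, shared)

shared_strict = {}
result_strict = collect_strict(["r1"], flat_data=shared_strict)
# With the explicit None check, the caller's dict is filled in place.
print(result_strict is shared_strict, shared_strict)
```

For the usage documented in this commit (start with `None`, then pass back the returned data) both variants behave the same. The difference only matters if a caller pre-creates an empty dict and expects it to be populated in place.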
@@ -346,17 +368,13 @@ def _build_node(node_type: str, node_pk: int) -> tuple:
             int(list(data_index[self.model.root_table]["records"].keys())[0]),
         )

-    def insert_into_temp_tables(
-        self, max_lines: int = -1, metadata: dict = None
-    ) -> None:
+    def insert_into_temp_tables(self, max_lines: int = -1) -> None:
         """Insert data into temporary tables

         (Re)creates temp tables before inserting data.

         Args:
             max_lines: The maximum number of lines to insert in a single statement
-            metadata: A dict of metadata values to add to the root table (a value for each key defined in
-                `metadata_columns` passed to model config)
         """
         logger.info(f"Dropping temp tables if exist for {self.xml_file_path}")
         self.model.drop_all_temp_tables()
@@ -365,11 +383,6 @@ def insert_into_temp_tables(
         self.model.create_all_tables(temp=True)

         logger.info(f"Inserting data into temporary tables from {self.xml_file_path}")
-        # write metadata into the root table data
-        root_data = self.data[self.model.root_table]["records"][0]
-        for meta_col in self.model.model_config.get("metadata_columns", []):
-            if meta_col["name"] in metadata:
-                root_data[meta_col["name"]] = metadata[meta_col["name"]]
         # insert data (order does not really matter)
         for tb in self.model.fk_ordered_tables:
             for query, data in tb.get_insert_temp_records_statements(
@@ -418,7 +431,6 @@ def insert_into_target_tables(
         self,
         single_transaction: bool = True,
         max_lines: int = -1,
-        metadata: dict = None,
     ) -> int:
         """Insert and merge data into the database

@@ -429,8 +441,6 @@ def insert_into_target_tables(
                 scope required to ensure database consistency?
             max_lines: The maximum number of lines to insert in a single statement when loading data to the temporary
                 tables
-            metadata: A dict of metadata values to add to the root table (a value for each key defined in
-                `metadata_columns` passed to model config)

         Returns:
             The number of inserted rows
@@ -444,7 +454,7 @@ def insert_into_target_tables(
             logger.error(e)
             raise
         try:
-            self.insert_into_temp_tables(max_lines, metadata)
+            self.insert_into_temp_tables(max_lines)
         except Exception as e:
             logger.error(
                 f"Error while importing into temporary tables from {self.xml_file_path}"