The idea is to extract more data from each page's XML. I'm mainly looking for these fields:
- Wikipedia article id ->
<id>39</id>
- Revision id ->
<revision><id>1338813175</id></revision>
- Revision timestamp ->
<revision><timestamp>2026-02-17T10:56:58Z</timestamp></revision>
I want to use the revision timestamp to check when an article was last updated, and the article id for more reliable indexing than titles. I don't mean to make the article id the key of the pages table, just to keep it as an extra field for other uses.
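As a rough sketch of the extraction step, something like the following could pull the three fields out of a single page element. It uses only the standard library. The function name extract_page_metadata and the returned keys are hypothetical, not anything that exists in the project, and I'm assuming the page XML may or may not carry the MediaWiki export namespace, so the namespace is stripped before matching tag names.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def _local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def extract_page_metadata(page_xml):
    """Hypothetical helper: read article id, revision id and revision
    timestamp from the XML of one <page> element."""
    page = ET.fromstring(page_xml)
    meta = {"article_id": None, "revision_id": None, "revision_timestamp": None}
    # Only look at direct children, so <contributor><id> is not mistaken
    # for the article id or the revision id.
    for child in page:
        name = _local(child.tag)
        if name == "id":
            meta["article_id"] = int(child.text)
        elif name == "revision":
            for rev_child in child:
                rev_name = _local(rev_child.tag)
                if rev_name == "id":
                    meta["revision_id"] = int(rev_child.text)
                elif rev_name == "timestamp":
                    # Dump timestamps look like 2026-02-17T10:56:58Z (UTC).
                    parsed = datetime.strptime(rev_child.text, "%Y-%m-%dT%H:%M:%SZ")
                    meta["revision_timestamp"] = parsed.replace(tzinfo=timezone.utc)
    return meta
```

Calling it on the XML of one page would give back a dict with the three values, which could then be passed along to whatever stores the page.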
The idea would be to add them in parse_dump_xml and to the parameters of add_page, then to the table. If touching those is too much, another way to implement it would be a separate article metadata model.
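For the separate metadata option, here is a minimal sketch of what I have in mind. I'm assuming SQLite via the standard sqlite3 module purely for illustration. The table name article_metadata, its columns, and the helper functions are all hypothetical and would need to be adapted to how the pages table is actually defined.

```python
import sqlite3


def create_metadata_table(conn):
    """Hypothetical side table, keyed by title so the pages table is untouched."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS article_metadata (
            title TEXT PRIMARY KEY,
            article_id INTEGER,
            revision_id INTEGER,
            revision_timestamp TEXT
        )
        """
    )
    # Secondary index so lookups by Wikipedia article id stay fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_article_metadata_article_id "
        "ON article_metadata (article_id)"
    )


def save_metadata(conn, title, article_id, revision_id, revision_timestamp):
    """Insert the metadata for a page, or overwrite the existing row."""
    conn.execute(
        "INSERT OR REPLACE INTO article_metadata "
        "(title, article_id, revision_id, revision_timestamp) VALUES (?, ?, ?, ?)",
        (title, article_id, revision_id, revision_timestamp),
    )


def is_newer(conn, title, revision_timestamp):
    """True if the given timestamp is later than the stored one, or if the
    page is unknown. Timestamps in the dump's fixed-width UTC format
    (e.g. 2026-02-17T10:56:58Z) sort correctly as plain strings."""
    row = conn.execute(
        "SELECT revision_timestamp FROM article_metadata WHERE title = ?",
        (title,),
    ).fetchone()
    return row is None or revision_timestamp > row[0]
```

Keeping the title as the key here matches the point above about not changing the key of the pages table; the article id is only indexed as an extra lookup path.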
I can look into making the PR myself if this is approved.