Thank you for sharing this corpus.
Creating ground truth (GT) is not an easy job. I looked at a random sample of the PAGE files in the pageXmlTranskribusCorrected folders.
I noticed the following problems:
- The entire text of a line was encoded at the Word level, as a single Word.
Solution: convert the Word into a line.
- Drop capitals are often annotated as Graphic.
- Many separators are so-called fake separators and should be corrected.
- A wish: Transkribus does not create valid PAGE instances. Of course, annotations such as
<TranskribusMetadata docId="188203" .../> can be commented out,
but the following should be corrected too:
- empty type="" attributes
- empty id="" attributes
- The ALTO files contain very deeply structured data; unfortunately, this information was not carried over when converting to PAGE XML.
I would be very happy to help you improve the data as far as I am able.
Thanks again for everything.