Summary
When a page on PACER (or elsewhere) contains characters outside the set of valid XML characters, lxml's html5 parser fails.
This is not hypothetical: I was scraping a docket at the Ohio Northern Bankruptcy Court (ohnb), and docketreport.parse() failed because the response contained invalid XML characters.
Tasks
- update the code in juriscraper/lib/html_utils.py to escape these characters, probably using a regex so we don't lose too much speed.
- capture the raw response of the parsed docket, and include it in the test suite.
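
A rough sketch of what the regex approach could look like. This is only an illustration, not existing juriscraper code: the function name `strip_invalid_xml_chars` is hypothetical, and it removes the offending characters rather than escaping them, which may or may not be the behavior we want. The character class is based on the XML 1.0 rules (control characters other than tab, newline, and carriage return, plus lone surrogates and U+FFFE/U+FFFF).

```python
import re

# Characters that XML 1.0 does not allow in a document:
# C0 control characters except tab (\x09), LF (\x0a) and CR (\x0d),
# lone surrogates, and the noncharacters U+FFFE / U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that are invalid in XML 1.0 removed.

    Hypothetical helper: the name and the choice to drop (rather than
    escape) the characters are assumptions made for this sketch.
    """
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    raw = "Docket\x00 No.\x0b 1\t2\n"
    print(repr(strip_invalid_xml_chars(raw)))
```

The pattern is compiled once at import time, so the per-page cost is a single pass over the response text before it is handed to the parser.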
Questions
- has anyone seen this type of error coming from a PACER scrape? You would have seen an "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters" traceback bubble up the stack.
- any opposition to having someone (possibly me) work on a patch for html_utils so that it is protected against this type of data?