Invalid XML characters break docket parsers #348

@cgdeboer-toptal

Description

Summary

When a page on PACER (or elsewhere) contains characters that are not in the list of valid XML characters, lxml's html5 parser will fail.

This is not hypothetical: I was scraping a docket at the Ohio Northern Bankruptcy Court (ohnb), and docketreport.parse() failed because the response contained invalid XML characters.

Tasks

  • update the code in juriscraper/lib/html_utils.py to escape these characters, probably with a regex so we don't lose too much speed.
  • capture the raw response of the docket that failed to parse, and include it in the test suite.
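
A minimal sketch of the first task, under some assumptions: it strips the offending characters rather than escaping them, it operates on already-decoded text, and the names `_INVALID_XML_CHARS` and `strip_invalid_xml_chars` are hypothetical (they are not existing juriscraper functions). The character ranges follow the XML 1.0 definition of a valid character.

```python
import re

# Matches any character that is NOT allowed by the XML 1.0 "Char" production:
# tab, newline, carriage return, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
# Hypothetical name; compiled once so repeated calls stay cheap.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that are invalid in XML 1.0 removed.

    This drops NULL bytes and other control characters; whether to drop
    them or replace them with a placeholder is an open design choice.
    """
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    dirty = "Case\x00 No.\x0b 12\x1f-345"
    print(repr(strip_invalid_xml_chars(dirty)))
```

Something like this could be applied to the response text before it is handed to the parser. Replacing the matches with a visible placeholder instead of an empty string would make the cleanup easier to spot in test fixtures.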

Questions

  • has anyone seen this type of error coming from a PACER scrape? You would have seen an "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters" traceback bubble up the stack.
  • any opposition to having someone (possibly me) work on a patch for html_utils to guard against this type of data?
