- Fixed a bug where Unicode escapes in CSS were not properly decoded before security checks. This prevents attackers from bypassing filters using escape sequences. (CVE-2026-28348)
- Fixed a security issue where <base> tags could be used for URL hijacking attacks. The <base> tag is now automatically removed whenever the <head> tag is removed (via page_structure=True or manual configuration), as <base> must be inside <head> according to HTML specifications. (CVE-2026-28350)
- Tests updated to work correctly with new lxml and libxml2 releases.
- Python 3.6 and 3.7 are no longer tested.
- Improved documentation about CSS removal behavior.
- lxml_html_clean now correctly handles HTML input as bytes as it did before the 0.2.0 release.
- Removed superfluous debug prints.
- The Cleaner() now scans for hidden JavaScript code embedded within CSS comments. In certain contexts, such as within <svg> or <math> tags, <style> tags may lose their intended function, allowing comments like /* foo */ to potentially be executed by the browser. If suspicious content is detected, only the comment is removed. (CVE-2024-52595)
- Do not parse URL addresses when it is not necessary.
- Parsing of URL addresses has been enhanced and Cleaner removes ambiguous URLs.
- sdist now includes all test files and changelog.
- Memory efficiency is now much better for HTML pages where cleaner removes a lot of elements. (#14)
- ASCII control characters (except HT, VT, CR and LF) are now removed from string inputs before they're parsed by lxml/libxml2.
- Regular expression for image data URLs now supports multiple data URLs on a single line.
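
The control-character stripping noted in the list above can be illustrated with a short standalone sketch. This is not the library's actual code: the helper name `strip_control_chars` is hypothetical, and the character range is an assumption (0x00-0x1F plus DEL, minus HT, LF, VT and CR).

```python
import re

# Assumption: "ASCII control characters" means 0x00-0x1F plus DEL (0x7F).
# HT (0x09), LF (0x0A), VT (0x0B) and CR (0x0D) are kept, as the entry states.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Return text with the disallowed ASCII control characters removed."""
    return _CONTROL_CHARS.sub("", text)


# A NUL byte hidden inside a scheme name disappears; CR, LF, HT and VT survive.
dirty = "<a href='java\x00script:alert(1)'>x</a>\r\n\tdone\x0b"
clean = strip_control_chars(dirty)
```

Stripping these characters before parsing means later checks see the same text the parser does, so a scheme name split by a stray control byte cannot slip past them.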
First official release of the split project.
This part lists releases of the lxml project that contain important changes related to HTML Cleaner functionality.
- The HTML Cleaner() interpreted an accidentally provided string parameter for the host_whitelist as a list of characters and silently failed to reject any hosts. Passing a non-collection is now rejected.
- A memory leak in lxml.html.clean was resolved by switching to Cython 0.29.34+.
- URL checking in the HTML cleaner was improved. Patch by Tim McCormack.
- A vulnerability (GHSL-2021-1038) in the HTML cleaner allowed sneaking script content through SVG images (CVE-2021-43818).
- A vulnerability (GHSL-2021-1037) in the HTML cleaner allowed sneaking script content through CSS imports and other crafted constructs (CVE-2021-43818).
- A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 formaction attribute.
- A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.
- A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.
- Cleaner() now validates that only known configuration options can be set.
- Cleaner.clean_html() discarded comments and PIs regardless of the corresponding configuration option, if remove_unknown_tags was set.
- JavaScript URLs that used URL escaping were not removed by the HTML cleaner. Security problem found by Omar Eissa. (CVE-2018-19787)
- The modules lxml.builder, lxml.html.diff and lxml.html.clean are also compiled using Cython in order to speed them up.
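
The URL-escaping entry above (CVE-2018-19787) comes down to decoding a URL before testing its scheme. The following is a minimal sketch of that idea using only the standard library; it is not the cleaner's actual implementation, and the helper name `is_script_url` and the scheme list are assumptions made for illustration.

```python
import re
from urllib.parse import unquote

# Assumption: the schemes treated as script-capable in this sketch.
_SCRIPT_SCHEME = re.compile(r"^(?:javascript|vbscript):", re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def is_script_url(url: str) -> bool:
    """Return True if url starts with a script scheme once escapes are decoded."""
    # Decode percent-escapes first so "java%73cript:" becomes "javascript:".
    decoded = unquote(url)
    # Tabs and newlines inside a URL are dropped by browsers; this sketch
    # conservatively drops all whitespace before matching.
    compact = _WHITESPACE.sub("", decoded)
    return bool(_SCRIPT_SCHEME.match(compact))


escaped = is_script_url("java%73cript:alert(1)")
harmless = is_script_url("https://example.com/?q=javascript%3A")
```

The pattern is anchored to the start of the string, so a script scheme that only appears later, for example inside a query string, is not flagged.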