This tutorial walks Python students through building a small HTML sanitizer like the one in the HTML-Sanitizer project.
You will build a command-line Python app that can:
- remove dangerous HTML like <script> and <style>
- remove inline event handlers like onclick
- validate href and src values
- offer two output modes:
  - plain: return text only
  - safe: keep a small allowlist of safe tags
- generate a structured report of what was removed
- include automated tests with pytest
This project is excellent for learning about Python strings, regular expressions, command-line tools, and test-driven development.
It is not a production-grade HTML sanitizer. Browsers parse HTML with real parsers, not regex. For hostile input in real web apps, use a parser-backed sanitizer and add other defenses like output encoding and CSP.
Create a new folder and open it in your editor.
mkdir HTML-Sanitizer
cd HTML-Sanitizer
mkdir tests

Create these files:
HTML-Sanitizer/
├── html_sanitizer.py
├── errors.py
├── requirements.txt
└── tests/
    └── test_html_sanitizer.py

Create a virtual environment:
python -m venv .venv

Activate it.
On macOS or Linux:
source .venv/bin/activate
On Windows PowerShell:
.venv\Scripts\Activate.ps1

Install test dependencies (after you have filled in requirements.txt, shown later in this tutorial):
pip install -r requirements.txt

Create errors.py:

"""Shared DataGuard exceptions."""


class DataGuardError(Exception):
    """Base exception for friendly CLI failures."""


class InputError(DataGuardError):
    """Raised when input cannot be read or decoded."""


class ParseError(DataGuardError):
    """Raised when data cannot be parsed."""


class ValidationError(DataGuardError):
    """Raised when validation fails in a non-fatal way."""

Custom exceptions keep your program cleaner. Instead of raising generic exceptions everywhere, you define errors that match your app.
At the top of html_sanitizer.py, import the tools you need:
from __future__ import annotations
import argparse
import html
import json
import re
import sys
from pathlib import Path
from errors import InputError

- argparse: builds the CLI
- html: decodes and escapes HTML entities
- json: prints structured reports
- re: powers regex matching
- sys: writes to stdout and stderr
- Path: reads and writes files cleanly
Add the core patterns:
COMMENT_PATTERN = re.compile(
    r"(?:<!--\[if[\s\S]*?<!\[endif\]-->|<!--[\s\S]*?-->)",
    re.IGNORECASE | re.DOTALL,
)
SCRIPT_PATTERN = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
STYLE_PATTERN = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
BLOCKED_TAGS_PATTERN = re.compile(
    r"<(?:iframe|object|embed|applet|form|base)\b[^>]*>.*?</(?:iframe|object|embed|applet|form|base)\s*>|<(?:iframe|object|embed|applet|form|base)\b[^>]*/?>",
    re.IGNORECASE | re.DOTALL,
)
META_REFRESH_PATTERN = re.compile(
    r"<meta\b[^>]*http-equiv\s*=\s*['\"]?refresh['\"]?[^>]*>",
    re.IGNORECASE,
)
EVENT_HANDLER_PATTERN = re.compile(
    r"\s+on[a-z0-9_-]+\s*=\s*(?:\"[^\"]*\"|'[^']*'|`[^`]*`|[^\s>]+)",
    re.IGNORECASE,
)
STYLE_DANGER_PATTERN = re.compile(
    r"expression\s*\(|url\s*\(\s*javascript:|url\s*\(\s*['\"]?data:|-moz-binding|behavior\s*:",
    re.IGNORECASE,
)
TAG_PATTERN = re.compile(
    r"<(?P<closing>/)?(?P<tag>[A-Za-z0-9]+)(?P<attrs>[^>]*)>",
    re.IGNORECASE | re.DOTALL,
)
ATTRIBUTE_PATTERN = re.compile(
    r"([^\s=<>'\"`/]+)(?:\s*=\s*(?:\"([^\"]*)\"|'([^']*)'|`([^`]*)`|([^\s>]+)))?",
    re.DOTALL,
)

Each regex solves one narrow problem. This is easier to debug than trying to build one giant pattern.
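A quick way to see two of these patterns at work is to run them on a small sample. The sketch below copies SCRIPT_PATTERN and EVENT_HANDLER_PATTERN so it runs on its own; the sample string and the variable names are made up for illustration.

```python
import re

# Copies of two patterns from the tutorial, so this snippet is self-contained.
SCRIPT_PATTERN = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
EVENT_HANDLER_PATTERN = re.compile(
    r"\s+on[a-z0-9_-]+\s*=\s*(?:\"[^\"]*\"|'[^']*'|`[^`]*`|[^\s>]+)",
    re.IGNORECASE,
)

# Illustrative input: mixed-case tag names and a script body spanning lines.
sample = '<p onclick="steal()">Hi</p><SCRIPT type="text/javascript">\nalert(1)\n</SCRIPT >'

# IGNORECASE handles <SCRIPT>, DOTALL lets .*? cross the newlines.
without_script = SCRIPT_PATTERN.sub("", sample)
# The handler pattern removes the attribute together with its leading space.
without_handlers = EVENT_HANDLER_PATTERN.sub("", without_script)

print(without_script)    # <p onclick="steal()">Hi</p>
print(without_handlers)  # <p>Hi</p>
```

Because each pattern does one job, you can test it in isolation like this before wiring it into the full pipeline.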
Add a small set of allowed tags and attributes:
DEFAULT_SAFE_TAGS = {
    "p": [],
    "b": [],
    "i": [],
    "u": [],
    "strong": [],
    "em": [],
    "br": [],
    "hr": [],
    "ul": [],
    "ol": [],
    "li": [],
    "h1": [],
    "h2": [],
    "h3": [],
    "h4": [],
    "h5": [],
    "h6": [],
    "blockquote": [],
    "pre": [],
    "code": [],
    "a": ["href"],
    "img": ["src", "alt"],
}

Void elements in this allowlist are br, hr, and img. The real module does not use a separate constant for them; when rebuilding tags, they use the same HTML5 start-tag shape as everything else (<br>, <img …>), not XML-style />.
A blocklist says “remove known bad things.” An allowlist says “keep only these approved things.”
For security work, allowlists are usually safer.
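The difference is easiest to see with a tag that nobody thought to list. The following is a minimal sketch; both tag sets and both helper functions are hypothetical and exist only for this comparison.

```python
# Hypothetical tag sets, chosen only to illustrate the difference.
BLOCKED_TAGS = {"script", "iframe", "object"}
ALLOWED_TAGS = {"p", "b", "i", "a"}


def passes_blocklist(tag: str) -> bool:
    # Anything not explicitly listed as bad gets through.
    return tag.lower() not in BLOCKED_TAGS


def passes_allowlist(tag: str) -> bool:
    # Anything not explicitly approved is rejected.
    return tag.lower() in ALLOWED_TAGS


for tag in ("p", "script", "svg"):
    print(tag, passes_blocklist(tag), passes_allowlist(tag))
# p True True
# script False False
# svg True False
```

The svg tag was never added to the blocklist, so it slips through, while the allowlist rejects it by default. Forgetting an entry in an allowlist costs you a feature; forgetting one in a blocklist can cost you security.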
You want your app to explain what it removed.
def line_number_for_position(text: str, position: int) -> int:
    return text.count("\n", 0, max(position, 0)) + 1


def add_finding(findings: list[dict], text: str, position: int, category: str, severity: str, message: str) -> None:
    findings.append(
        {
            "severity": severity,
            "category": category,
            "line": line_number_for_position(text, position),
            "message": message,
        }
    )

Students often stop at “it works.” A better tool also explains what happened.
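To see how the line lookup behaves, here is a small self-contained sketch. It copies line_number_for_position from above; the sample text is invented for illustration.

```python
def line_number_for_position(text: str, position: int) -> int:
    # Count the newlines before the position; line numbers start at 1.
    return text.count("\n", 0, max(position, 0)) + 1


# Illustrative three-line input with a script tag on the last line.
sample = "<p>one</p>\n<p>two</p>\n<script>alert(1)</script>"
position = sample.index("<script")

print(line_number_for_position(sample, position))  # 3
print(line_number_for_position(sample, 0))         # 1
print(line_number_for_position(sample, -5))        # 1 (negative positions clamp to 0)
```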
Instead of repeating the same substitution logic over and over, create one reusable function:
def replace_pattern(
    text: str,
    pattern: re.Pattern,
    replacement: str,
    findings: list[dict],
    category: str,
    severity: str,
    message_template: str,
) -> str:
    def replacement_function(match: re.Match) -> str:
        # Record every match before it is replaced.
        add_finding(
            findings,
            text,
            match.start(),
            category,
            severity,
            message_template.format(content=match.group(0)[:80]),
        )
        return replacement

    return pattern.sub(replacement_function, text)

A single helper reduces duplication and makes later maintenance easier.
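The key trick here is that re.sub accepts a function instead of a replacement string and calls it once per match. A stripped-down sketch of that idea follows; the IFRAME pattern, the removed list, and record_and_remove are illustrative names, not part of the tutorial's module.

```python
import re

# Illustrative pattern; the tutorial's BLOCKED_TAGS_PATTERN covers more tags.
IFRAME = re.compile(r"<iframe\b[^>]*>.*?</iframe\s*>", re.IGNORECASE | re.DOTALL)

removed: list[str] = []


def record_and_remove(match: re.Match) -> str:
    # Called once per match: log a short preview, then delete the match.
    removed.append(match.group(0)[:80])
    return ""


cleaned = IFRAME.sub(record_and_remove, 'a<iframe src="x"></iframe>b<IFRAME></IFRAME>c')

print(cleaned)  # abc
print(removed)  # ['<iframe src="x"></iframe>', '<IFRAME></IFRAME>']
```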
A sanitizer should not keep dangerous links.
def validate_url(raw_value: str, attribute_name: str) -> tuple[str, bool]:
    # Decode entities and drop null bytes so disguised schemes become visible.
    decoded = html.unescape(raw_value).replace("\x00", "").strip()
    # Remove all whitespace and lowercase before checking the scheme.
    compact = re.sub(r"\s+", "", decoded).lower()
    allowed = ("http:", "https:", "mailto:")
    if compact.startswith(("javascript:", "data:", "vbscript:")):
        return "#removed", False
    if compact.startswith("//"):
        return "#removed", False
    if ":" in compact and not compact.startswith(allowed):
        return "#removed", False
    if attribute_name == "src" and compact.startswith("mailto:"):
        return "#removed", False
    return decoded, True

This rejects values such as:
- javascript:alert(1)
- data:text/html,...
- vbscript:
- protocol-relative URLs like //evil.example
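The normalization step matters because attackers disguise the scheme with mixed case, whitespace, entities, or null bytes. The sketch below isolates that step; compact_for_scheme_check is a hypothetical helper that repeats the first two lines of validate_url, and the disguised values are made-up examples.

```python
import html
import re


def compact_for_scheme_check(raw_value: str) -> str:
    # Same normalization steps as validate_url: decode entities, drop null
    # bytes, trim, remove all whitespace, lowercase.
    decoded = html.unescape(raw_value).replace("\x00", "").strip()
    return re.sub(r"\s+", "", decoded).lower()


# Four spellings that a naive startswith("javascript:") check would miss.
disguised = [
    "  JaVaScRiPt:alert(1)",
    "java\tscript:alert(1)",
    "java&#115;cript:alert(1)",
    "java\x00script:alert(1)",
]
normalized = [compact_for_scheme_check(value) for value in disguised]

print(normalized)  # every entry is 'javascript:alert(1)'
```

Note that the compact form is used only for the check. validate_url returns the decoded value, not the compacted one, so legitimate URLs keep their original spelling.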
Now write a function that filters attributes on safe tags.
def sanitize_allowed_attributes(
    tag_name: str,
    raw_attrs: str,
    findings: list[dict],
    source_text: str,
    start_position: int,
    safe_tags: dict,
) -> str:
    kept = []
    allowed_attributes = safe_tags.get(tag_name, [])
    for match in ATTRIBUTE_PATTERN.finditer(raw_attrs):
        attribute_name = match.group(1).lower()
        raw_value = next((group for group in match.groups()[1:] if group is not None), "")
        if attribute_name.startswith("on"):
            add_finding(findings, source_text, start_position, "event_handler", "high", f"Removed event handler {attribute_name}.")
            continue
        if attribute_name == "style":
            continue
        if attribute_name not in allowed_attributes:
            if attribute_name not in {"", "/"}:
                add_finding(findings, source_text, start_position, "attribute_strip", "low", f"Removed attribute {attribute_name} from <{tag_name}>.")
            continue
        if attribute_name in {"href", "src"}:
            safe_value, is_safe = validate_url(raw_value, attribute_name)
            if not is_safe:
                add_finding(findings, source_text, start_position, "dangerous_url", "critical", f"Replaced dangerous {attribute_name} value on <{tag_name}>.")
            kept.append(f'{attribute_name}="{html.escape(safe_value, quote=True)}"')
            continue
        kept.append(f'{attribute_name}="{html.escape(raw_value, quote=True)}"')
    return (" " + " ".join(kept)) if kept else ""

A safe tag is not enough. Safe attributes matter too.
<a> is often okay. <a href="javascript:..."> is not.
In safe mode, you do not want to keep everything. You want to rebuild only what passed the rules.
def rebuild_safe_html(text: str, safe_tags: dict, findings: list[dict]) -> str:
    def replacement_function(match: re.Match) -> str:
        tag_name = match.group("tag").lower()
        closing = bool(match.group("closing"))
        raw_attrs = match.group("attrs") or ""
        if tag_name not in safe_tags:
            add_finding(findings, text, match.start(), "tag_strip", "low", f"Removed disallowed tag <{tag_name}>.")
            return ""
        if closing:
            return f"</{tag_name}>"
        safe_attrs = sanitize_allowed_attributes(tag_name, raw_attrs, findings, text, match.start(), safe_tags)
        # Void elements (br, hr, img) use the same HTML5 start-tag form; no separate branch needed.
        return f"<{tag_name}{safe_attrs}>"

    return TAG_PATTERN.sub(replacement_function, text)

An older version sometimes split “self-closing” vs “normal” branches, but both returned the same string. Keeping one return avoids dead code; browsers treat <br>, <hr>, and <img …> correctly without />.
Rebuilding is safer. You only keep content that passed the checks.
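As a rough illustration of the rebuild idea, the sketch below is a simplified variant that drops every attribute rather than filtering them. SAFE and rebuild are hypothetical names for this example; TAG_PATTERN is copied from the tutorial.

```python
import re

TAG_PATTERN = re.compile(
    r"<(?P<closing>/)?(?P<tag>[A-Za-z0-9]+)(?P<attrs>[^>]*)>",
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical, deliberately tiny allowlist.
SAFE = {"p", "b"}


def rebuild(text: str) -> str:
    def replacement_function(match: re.Match) -> str:
        tag = match.group("tag").lower()
        if tag not in SAFE:
            # Disallowed tag: remove the tag itself but keep its text content.
            return ""
        # Emit a fresh tag from scratch instead of copying the original markup.
        # This simplified version discards all attributes.
        return f"</{tag}>" if match.group("closing") else f"<{tag}>"

    return TAG_PATTERN.sub(replacement_function, text)


print(rebuild('<P class="x">Hi <blink>there</blink> <b>you</b></P>'))
# <p>Hi there <b>you</b></p>
```

Nothing from the original tag survives unless the code writes it back out, which is why the output tag names are lowercase and the class attribute is gone.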
Plain mode is simpler. Remove all tags and return only text.
def strip_all_tags(text: str) -> str:
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)

Plain mode is useful for:
- logs
- summaries
- exporting text only
- pipelines that should not keep HTML formatting
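Here is a short self-contained run of strip_all_tags, copied from above, on two invented inputs.

```python
import html
import re


def strip_all_tags(text: str) -> str:
    # Remove anything that looks like a tag, then decode entities.
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)


print(strip_all_tags("<p>Tom &amp; <b>Jerry</b></p>"))  # Tom & Jerry
print(strip_all_tags("<p>1 &lt; 2</p>"))                # 1 < 2
```

The second result shows that plain output is text, not HTML: because entities are decoded after the tags are removed, it can contain a literal < character. If you ever place that text back into a web page, escape it again with html.escape.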
A danger score gives users a quick sense of how risky the input looked.

def danger_score(findings: list[dict]) -> int:
    weights = {"critical": 25, "high": 15, "medium": 8, "low": 2, "info": 1}
    return min(sum(weights.get(item.get("severity", "info"), 1) for item in findings), 100)

The score turns many individual findings into one easy summary number, capped at 100.
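A worked example makes the arithmetic concrete. The function below is copied from above; the sample findings are invented and carry only the severity key, since that is all the score reads.

```python
def danger_score(findings: list[dict]) -> int:
    weights = {"critical": 25, "high": 15, "medium": 8, "low": 2, "info": 1}
    return min(sum(weights.get(item.get("severity", "info"), 1) for item in findings), 100)


# Invented findings for illustration.
sample_findings = [
    {"severity": "critical"},  # 25
    {"severity": "high"},      # 15
    {"severity": "low"},       # 2
    {"severity": "low"},       # 2
]

print(danger_score(sample_findings))                # 25 + 15 + 2 + 2 = 44
print(danger_score([{"severity": "critical"}] * 5)) # 125 is capped at 100
print(danger_score([{"severity": "unknown"}, {}]))  # unknown or missing severity counts as 1 each: 2
```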
Now connect all the pieces.
def sanitize_html(input_text: str, mode: str = "plain", allowed_tags: list[str] | None = None) -> tuple[str, list[dict], dict]:
    findings = []
    safe_tags = DEFAULT_SAFE_TAGS.copy()
    if allowed_tags is not None:
        safe_tags = {tag: DEFAULT_SAFE_TAGS.get(tag, []) for tag in allowed_tags if tag}
    text = input_text
    # Decode entities first so encoded payloads are visible to the patterns.
    decoded = html.unescape(text)
    if decoded != text:
        findings.append({"severity": "low", "category": "entity_decode", "line": 1, "message": "Decoded HTML entities before scanning."})
        text = decoded
    if "\x00" in text:
        findings.append({"severity": "medium", "category": "null_byte", "line": 1, "message": "Removed null bytes used for obfuscation."})
        text = text.replace("\x00", "")
    # Repeat until the text stops changing (at most 5 passes), because
    # removing one match can expose another.
    previous = None
    loop_count = 0
    while text != previous and loop_count < 5:
        previous = text
        loop_count += 1
        text = replace_pattern(text, COMMENT_PATTERN, "", findings, "comment", "low", "Removed HTML comment.")
        text = replace_pattern(text, SCRIPT_PATTERN, "", findings, "script_tag", "critical", "Removed script tag and contents.")
        text = replace_pattern(text, STYLE_PATTERN, "", findings, "style_block", "medium", "Removed style block.")
        text = replace_pattern(text, BLOCKED_TAGS_PATTERN, "", findings, "blocked_tag", "critical", "Removed blocked tag container.")
        text = replace_pattern(text, META_REFRESH_PATTERN, "", findings, "meta_refresh", "critical", "Removed meta refresh tag.")
        text = replace_pattern(text, EVENT_HANDLER_PATTERN, "", findings, "event_handler", "high", "Removed inline event handler: {content}")
    if mode == "safe":
        output = rebuild_safe_html(text, safe_tags, findings)
    else:
        output = strip_all_tags(text)
    stats = {
        "mode": mode,
        "before_characters": len(input_text),
        "after_characters": len(output),
        "characters_removed": max(len(input_text) - len(output), 0),
        "danger_score": danger_score(findings),
        "passes": loop_count,
    }
    return output, findings, stats

This function is the pipeline:
- normalize input
- remove dangerous patterns
- choose output mode
- return both cleaned output and structured metadata
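The removal step runs in a loop for a reason: deleting one match can splice the surrounding text into a new dangerous tag. The sketch below shows this with SCRIPT_PATTERN alone; strip_scripts_until_stable is a hypothetical helper that mirrors the loop in sanitize_html, and the nested input is a made-up example.

```python
import re

SCRIPT_PATTERN = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts_until_stable(text: str, max_passes: int = 5) -> tuple[str, int]:
    # Keep substituting until a pass changes nothing or the pass limit is hit.
    previous = None
    passes = 0
    while text != previous and passes < max_passes:
        previous = text
        passes += 1
        text = SCRIPT_PATTERN.sub("", text)
    return text, passes


# A script tag hidden inside the pieces of another script tag.
nested = "<scr<script>x</script>ipt>alert(1)</scr<script>x</script>ipt>"

single_pass = SCRIPT_PATTERN.sub("", nested)
stable, passes = strip_scripts_until_stable(nested)

print(repr(single_pass))  # '<script>alert(1)</script>'
print(repr(stable))       # ''
print(passes)             # 3 (two passes change the text, the third confirms it is stable)
```

A single pass leaves a working script tag behind. The pass limit keeps the loop bounded even on adversarial input.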
def run(input_text: str, config: dict | None = None) -> dict:
    config = config or {}
    mode = config.get("mode", "plain")
    allowed_tags = config.get("allowed_tags")
    output, findings, stats = sanitize_html(input_text, mode=mode, allowed_tags=allowed_tags)
    summary = (
        f"Sanitized HTML in {mode} mode. Removed {stats['characters_removed']} characters "
        f"with danger score {stats['danger_score']}."
    )
    return {
        "module_name": "html",
        "title": "DataGuard HTML Sanitizer Report",
        "output": output,
        "findings": findings,
        "warnings": [] if not findings else [f"Removed or modified {len(findings)} HTML threat indicators."],
        "errors": [],
        "stats": stats,
        "metadata": {"source": config.get("source_name", "<input>"), "mode": mode},
        "summary": summary,
    }

The run() wrapper provides a stable API for other programs to call.
Add these helper functions and a main() entry point.
def _parse_allowed_tags(raw: str | None) -> list[str] | None:
    if raw is None or not raw.strip():
        return None
    return [t.strip().lower() for t in raw.split(",") if t.strip()]

Then create the CLI:

def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description="Strip dangerous HTML; optional allowlisted safe HTML mode.")
    src = parser.add_mutually_exclusive_group(required=True)
    src.add_argument("--input", "-i", help="HTML string to sanitize")
    src.add_argument("--file", "-f", type=Path, help="Path to a .html or .txt file (UTF-8)")
    parser.add_argument("--mode", choices=("plain", "safe"), default="plain")
    parser.add_argument("--allow", help="Comma-separated tag names for safe mode")
    parser.add_argument("--output", "-o", type=Path, help="Write cleaned output to this file")
    parser.add_argument("--report", action="store_true", help="Print JSON findings and stats to stderr")
    parser.add_argument("--show-diff", action="store_true", help="Print before/after character counts to stderr")
    args = parser.parse_args(argv)
    if args.input is not None:
        input_text = args.input
    else:
        try:
            input_text = args.file.read_text(encoding="utf-8")
        except OSError as exc:
            raise InputError(f"Cannot read file: {args.file}") from exc
    allowed_tags = _parse_allowed_tags(args.allow)
    result = run(
        input_text,
        {
            "mode": args.mode,
            "allowed_tags": allowed_tags,
            "source_name": str(args.file) if args.file else "<--input>",
        },
    )
    out = result["output"]
    if args.output is not None:
        try:
            args.output.write_text(out, encoding="utf-8")
        except OSError as exc:
            raise InputError(f"Cannot write file: {args.output}") from exc
    else:
        sys.stdout.write(out)
        if out and not out.endswith("\n"):
            sys.stdout.write("\n")
    if args.show_diff:
        st = result["stats"]
        print(
            f"Characters: {st['before_characters']} -> {st['after_characters']} "
            f"(removed {st['characters_removed']}, danger_score={st['danger_score']}, passes={st['passes']})",
            file=sys.stderr,
        )
    if args.report:
        payload = {"findings": result["findings"], "stats": result["stats"], "summary": result["summary"]}
        print(json.dumps(payload, indent=2), file=sys.stderr)
    return 0

Finish the file with:

if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except InputError as exc:
        print(str(exc), file=sys.stderr)
        raise SystemExit(1) from exc

Create requirements.txt:
pytest>=8.0.0
Create tests/test_html_sanitizer.py.
Start with these core tests:
from __future__ import annotations

import json
import subprocess
import sys
from pathlib import Path

import pytest

from html_sanitizer import main, run, sanitize_html, validate_url

ROOT = Path(__file__).resolve().parent.parent


def test_comment_removed_in_safe_mode() -> None:
    out, findings, _ = sanitize_html("<p>a</p><!-- evil -->", mode="safe")
    assert "<!--" not in out
    assert "evil" not in out
    assert "<p>a</p>" == out
    assert any(f.get("category") == "comment" for f in findings)


def test_script_stripped_and_finding() -> None:
    out, findings, _ = sanitize_html('<p>x</p><script>alert(1)</script>', mode="safe")
    assert "script" not in out.lower()
    assert any(f.get("category") == "script_tag" for f in findings)


def test_javascript_href_replaced() -> None:
    out, findings, _ = sanitize_html('<a href="javascript:alert(1)">x</a>', mode="safe")
    assert 'href="#removed"' in out or "#removed" in out
    assert any(f.get("category") == "dangerous_url" for f in findings)


def test_plain_mode_strips_tags() -> None:
    out, _, _ = sanitize_html("<p>hello <b>w</b></p>", mode="plain")
    assert "<" not in out
    assert "hello" in out and "w" in out

Then add CLI tests:

def test_main_cli_plain_input(capsys: pytest.CaptureFixture[str]) -> None:
    code = main(["--input", "<em>z</em>", "--mode", "plain"])
    assert code == 0
    captured = capsys.readouterr()
    assert "z" in captured.out
    assert "<em>" not in captured.out


def test_cli_missing_file_exit_code() -> None:
    proc = subprocess.run(
        [sys.executable, str(ROOT / "html_sanitizer.py"), "--file", str(ROOT / "missing.txt")],
        cwd=str(ROOT),
        capture_output=True,
        text=True,
        check=False,
    )
    assert proc.returncode == 1
    assert "Cannot read" in proc.stderr

Add more tests for:
- protocol-relative URLs
- data: URLs
- event handlers
- custom allowlists
- file round-trips
- JSON reports
- null bytes
- encoded payloads like &lt;script&gt;
Run the tests:
python -m pytest

A passing test suite means the behaviors you tested work as expected.

Try plain mode:
python html_sanitizer.py --input "<p>Hello</p><script>alert(1)</script>" --mode plain

Expected output:
Hello

Try safe mode:
python html_sanitizer.py --input "<p>Hello</p><script>alert(1)</script>" --mode safe

Expected output:
<p>Hello</p>

Print a JSON report to stderr:
python html_sanitizer.py --input "<a href='javascript:alert(1)'>x</a>" --mode safe --report

Sanitize a file and write the result to another file:
python html_sanitizer.py --file sample.html --mode safe --output clean.html

This app is a good beginner-to-intermediate Python project because it teaches:
- regex pattern design
- string normalization
- allowlist thinking
- CLI design with argparse
- returning structured data
- test writing with pytest
- error handling with custom exceptions
It is also small enough that one student can understand the whole codebase.
Students should understand what this app does well and what it does not do well.
It is a good fit for:
- learning core Python
- quick cleanup in a pipeline
- removing obvious dangerous markup
- creating security reports
It is not a good fit for:
- production-grade browser-safe sanitization
- handling every malformed HTML edge case
- matching browser parsing behavior
- protecting a real web app by itself
After students finish the base version, they can improve it.
- Add more tests.
- Handle Unicode decode errors more gracefully.
- Add a --json flag to print the whole report to stdout.
- Add docstrings to every function.
- Add packaging with pyproject.toml.
- Add GitHub Actions for tests.
- Split regex patterns into a separate config file.
- Track counts per category in the report.
- Replace regex sanitization with a parser-backed approach.
- Add configurable attribute allowlists.
- Add fuzz tests for malformed HTML.
- Add benchmarks for large inputs.
Ask yourself:
- Why is an allowlist safer than a blocklist?
- Why is javascript: dangerous in href?
- Why should reports go to stderr and cleaned output go to stdout?
- Why is regex useful for learning but risky for full HTML parsing?
- What new tests would make this app more trustworthy?
This project is a strong learning app because it is small, readable, testable, and realistic enough to teach useful software engineering habits.
If your goal is to learn Python app design, this is a very good project to build. If your goal is to secure a real browser-facing production app, treat this as a teaching prototype and move to a parser-backed sanitizer.