Commit 6301833

Document HTML sanitation policy
Recommend JustHTML as an HTML sanitizer with bleach and nh3 as alternatives. Reorganize cli.md for clarity. Resolves #1479.
1 parent 7f29f1a commit 6301833

7 files changed

+266
-76
lines changed

.spell-dict

Lines changed: 4 additions & 0 deletions
@@ -54,6 +54,7 @@ implementers
5454
InlineProcessor
5555
Jiryu
5656
JSON
57+
JustHTML
5758
keepachangelog
5859
Kjell
5960
Krech
@@ -111,6 +112,7 @@ rST
111112
ryneeverett
112113
sanitizer
113114
sanitizers
115+
sanitization
114116
Sauder
115117
schemeless
116118
setuptools
@@ -135,6 +137,7 @@ svn
135137
Swartz
136138
Szakmeister
137139
Takhteyev
140+
templating
138141
Tiago
139142
toc
140143
tokenized
@@ -168,6 +171,7 @@ workflow
168171
Xanthakis
169172
XHTML
170173
xhtml
174+
XSS
171175
YAML
172176
Yunusov
173177
inline

docs/cli.md

Lines changed: 110 additions & 67 deletions
@@ -12,6 +12,8 @@ Generally, you will want to have the Markdown library fully installed on your
1212
system to run the command line script. See the
1313
[Installation instructions](install.md) for details.
1414

15+
## Basic Usage
16+
1517
Python-Markdown's command line script takes advantage of Python's `-m` flag.
1618
Therefore, assuming the python executable is on your system path, use the
1719
following format:
@@ -28,92 +30,62 @@ At its most basic usage, one would simply pass in a file name as the only argume
2830
python -m markdown input_file.txt
2931
```
3032

31-
Piping input and output (on `STDIN` and `STDOUT`) is fully supported as well.
32-
For example:
33-
34-
```bash
35-
echo "Some **Markdown** text." | python -m markdown > output.html
36-
```
37-
38-
Use the `--help` option for a list all available options and arguments:
33+
Use the `--help` option for a list of all available options and arguments:
3934

4035
```bash
4136
python -m markdown --help
4237
```
4338

44-
If you don't want to call the python executable directly (using the `-m` flag),
45-
follow the instructions below to use a wrapper script:
46-
47-
Setup
48-
-----
49-
50-
Upon installation, the `markdown_py` script will have been copied to
51-
your Python "Scripts" directory. Different systems require different methods to
52-
ensure that any files in the Python "Scripts" directory are on your system
53-
path.
54-
55-
* **Windows**:
56-
57-
Assuming a default install of Python on Windows, your "Scripts" directory
58-
is most likely something like `C:\\Python37\Scripts`. Verify the location
59-
of your "Scripts" directory and add it to you system path.
39+
!!! warning
6040

61-
Calling `markdown_py` from the command line will call the wrapper batch
62-
file `markdown_py.bat` in the `"Scripts"` directory created during install.
63-
64-
* __*nix__ (Linux, OSX, BSD, Unix, etc.):
41+
The Python-Markdown library does ***not*** sanitize its HTML output. If
42+
you are processing Markdown input from an untrusted source, it is your
43+
responsibility to ensure that it is properly sanitized. For more
44+
information see [Sanitizing HTML Output](sanitization.md).
6545

66-
As each \*nix distribution is different and we can't possibly document all
67-
of them here, we'll provide a few helpful pointers:
68-
69-
* Some systems will automatically install the script on your path. Try it
70-
and see if it works. Just run `markdown_py` from the command line.
71-
72-
* Other systems may maintain a separate "Scripts" ("bin") directory which
73-
you need to add to your path. Find it (check with your distribution) and
74-
either add it to your path or make a symbolic link to it from your path.
46+
## Piping Input and Output
7547

76-
* If you are sure `markdown_py` is on your path, but it still is not being
77-
found, check the permissions of the file and make sure it is executable.
78-
79-
As an alternative, you could just `cd` into the directory which contains
80-
the source distribution, and run it from there. However, remember that your
81-
markdown text files will not likely be in that directory, so it is much
82-
more convenient to have `markdown_py` on your path.
83-
84-
!!!Note
85-
Python-Markdown uses `"markdown_py"` as a script name because the Perl
86-
implementation has already taken the more obvious name "markdown".
87-
Additionally, the default Python configuration on some systems would cause a
88-
script named `"markdown.py"` to fail by importing itself rather than the
89-
markdown library. Therefore, the script has been named `"markdown_py"` as a
90-
compromise. If you prefer a different name for the script on your system, it
91-
is suggested that you create a symbolic link to `markdown_py` with your
92-
preferred name.
93-
94-
Usage
95-
-----
96-
97-
To use `markdown_py` from the command line, run it as
48+
Piping input and output (on `STDIN` and `STDOUT`) is fully supported.
49+
For example:
9850

9951
```bash
100-
markdown_py input_file.txt
52+
echo "Some **Markdown** text." | python -m markdown > output.html
10153
```
10254

103-
or
55+
The above command would generate a file named `output.html` with the following content:
56+
```html
57+
<p>Some <strong>Markdown</strong> text.</p>
58+
```
59+
60+
As Python-Markdown only ever outputs HTML fragments (no `<html>`, `<head>`,
61+
and `<body>` tags), it is generally expected that the command line interface
62+
will be used to pipe output to a templating engine. In the event that
63+
no additional content is needed and the output only needs to be wrapped in
64+
otherwise empty `<html>`, `<head>`, and `<body>` tags,
65+
[JustHTML](https://emilstenstrom.github.io/justhtml/) can do that with
66+
a single command:
10467

10568
```bash
106-
markdown_py input_file.txt > output_file.html
69+
echo "Some **Markdown** text." | python -m markdown | justhtml - --fragment > output.html
10770
```
10871

109-
For a complete list of options, run
72+
The above command would generate a file named `output.html` with the following content:
11073

111-
```bash
112-
markdown_py --help
74+
```html
75+
<html>
76+
<head></head>
77+
<body>
78+
<p>Some <strong>Markdown</strong> text.</p>
79+
</body>
80+
</html>
11381
```
11482

115-
Using Extensions
116-
----------------
83+
If you don't need or want JustHTML's HTML sanitization, you can disable it with the
84+
`--unsafe` flag, although that is not recommended. See JustHTML's
85+
[Command Line Interface](https://emilstenstrom.github.io/justhtml/cli.html)
86+
documentation for details.
87+
88+
## Using Extensions
11789

11890
To load a Python-Markdown extension from the command line use the `-x`
11991
(or `--extension`) option. The extension module must be on your `PYTHONPATH`
@@ -187,3 +159,74 @@ dependencies. The format of your configuration file is automatically detected.
187159
[JSON]: https://json.org/
188160
[PyYAML]: https://pyyaml.org/
189161
[2.5 release notes]: change_log/release-2.5.md
162+
163+
## Using the `markdown_py` Command
164+
165+
If you don't want to call the python executable directly (using the `-m` flag),
166+
follow the instructions below to use a wrapper script:
167+
168+
### Setup `markdown_py`
169+
170+
Upon installation, the `markdown_py` script will have been copied to
171+
your Python "Scripts" directory. Different systems require different methods to
172+
ensure that any files in the Python "Scripts" directory are on your system
173+
path.
174+
175+
* **Windows**:
176+
177+
Assuming a default install of Python on Windows, your "Scripts" directory
178+
is most likely something like `C:\Python37\Scripts`. Verify the location
179+
of your "Scripts" directory and add it to your system path.
180+
181+
Calling `markdown_py` from the command line will call the wrapper batch
182+
file `markdown_py.bat` in the `"Scripts"` directory created during install.
183+
184+
* __*nix__ (Linux, OSX, BSD, Unix, etc.):
185+
186+
As each \*nix distribution is different and we can't possibly document all
187+
of them here, we'll provide a few helpful pointers:
188+
189+
* Some systems will automatically install the script on your path. Try it
190+
and see if it works. Just run `markdown_py` from the command line.
191+
192+
* Other systems may maintain a separate "Scripts" ("bin") directory which
193+
you need to add to your path. Find it (check with your distribution) and
194+
either add it to your path or make a symbolic link to it from your path.
195+
196+
* If you are sure `markdown_py` is on your path, but it still is not being
197+
found, check the permissions of the file and make sure it is executable.
198+
199+
As an alternative, you could just `cd` into the directory which contains
200+
the source distribution, and run it from there. However, remember that your
201+
markdown text files will not likely be in that directory, so it is much
202+
more convenient to have `markdown_py` on your path.
203+
204+
!!!Note
205+
Python-Markdown uses `"markdown_py"` as a script name because the Perl
206+
implementation has already taken the more obvious name "markdown".
207+
Additionally, the default Python configuration on some systems would cause a
208+
script named `"markdown.py"` to fail by importing itself rather than the
209+
markdown library. Therefore, the script has been named `"markdown_py"` as a
210+
compromise. If you prefer a different name for the script on your system, it
211+
is suggested that you create a symbolic link to `markdown_py` with your
212+
preferred name.
213+
214+
### Using `markdown_py`
215+
216+
To use `markdown_py` from the command line, run it as
217+
218+
```bash
219+
markdown_py input_file.txt
220+
```
221+
222+
or
223+
224+
```bash
225+
markdown_py input_file.txt > output_file.html
226+
```
227+
228+
For a complete list of options, run
229+
230+
```bash
231+
markdown_py --help
232+
```

docs/reference.md

Lines changed: 38 additions & 4 deletions
@@ -25,7 +25,16 @@ instance of the `markdown.Markdown` class and pass multiple documents through
2525
it. If you do use a single instance though, make sure to call the `reset`
2626
method appropriately ([see below](#convert)).
2727

28-
### markdown.markdown(text [, **kwargs]) {: #markdown data-toc-label='markdown.markdown' }
28+
### `markdown.markdown(text [, **kwargs])` {: #markdown data-toc-label='markdown.markdown' }
29+
30+
!!! warning
31+
32+
The Python-Markdown library does ***not*** sanitize its HTML output. If
33+
you are processing Markdown input from an untrusted source, it is your
34+
responsibility to ensure that it is properly sanitized. For more
35+
information see [Sanitizing HTML Output].
36+
37+
[Sanitizing HTML Output]: sanitization.md
2938

3039
The following options are available on the `markdown.markdown` function:
3140

@@ -179,6 +188,15 @@ __tab_length__{: #tab_length }:
179188

180189
### `markdown.markdownFromFile (**kwargs)` {: #markdownFromFile data-toc-label='markdown.markdownFromFile' }
181190

191+
!!! warning
192+
193+
The Python-Markdown library does ***not*** sanitize its HTML output. As
194+
`markdown.markdownFromFile` writes directly to the file system, there is
195+
no easy way to sanitize the output from Python code. Therefore, it is
196+
recommended that the `markdown.markdownFromFile` function not be used on
197+
input from an untrusted source. For more information see [Sanitizing HTML
198+
Output].
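
One way to work around this is to read and write the files yourself and run a sanitizer in between. The following is a minimal sketch of that pattern, not part of the library: `convert_file_sanitized` is a hypothetical helper, and the `convert` and `sanitize` callables are assumptions that you would supply (for example, `markdown.markdown` and a wrapper around your chosen sanitizer).

```python
from pathlib import Path
from typing import Callable


def convert_file_sanitized(
    src: str,
    dst: str,
    convert: Callable[[str], str],
    sanitize: Callable[[str], str],
    encoding: str = "utf-8",
) -> str:
    """Read Markdown from `src`, convert and sanitize it, then write HTML to `dst`.

    Hypothetical helper for illustration. `convert` turns Markdown text into
    HTML and `sanitize` cleans that HTML; both are provided by the caller.
    """
    text = Path(src).read_text(encoding=encoding)
    html = convert(text)        # unsanitized HTML fragment
    safe_html = sanitize(html)  # sanitized before anything touches the disk
    Path(dst).write_text(safe_html, encoding=encoding)
    return safe_html
```

Because the conversion result passes through `sanitize` before it is written, unsanitized HTML never reaches the output file. A call might look like `convert_file_sanitized("input.md", "output.html", markdown.markdown, my_sanitizer)`, where `my_sanitizer` is a function of your own.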
199+
182200
With a few exceptions, `markdown.markdownFromFile` accepts the same options as
183201
`markdown.markdown`. It does **not** accept a `text` (or Unicode) string.
184202
Instead, it accepts the following required options:
@@ -216,7 +234,7 @@ __encoding__{: #encoding }
216234
meet your specific needs, it is suggested that you write your own code
217235
to handle your encoding/decoding needs.
218236

219-
### markdown.Markdown([**kwargs]) {: #Markdown data-toc-label='markdown.Markdown' }
237+
### `markdown.Markdown([**kwargs])` {: #Markdown data-toc-label='markdown.Markdown' }
220238

221239
The same options are available when initializing the `markdown.Markdown` class
222240
as on the [`markdown.markdown`](#markdown) function, except that the class does
@@ -229,7 +247,14 @@ string must be passed to one of two instance methods.
229247
the thread they were created in. A single instance should not be accessed
230248
from multiple threads.
231249

232-
#### Markdown.convert(source) {: #convert data-toc-label='Markdown.convert' }
250+
#### `Markdown.convert(source)` {: #convert data-toc-label='Markdown.convert' }
251+
252+
!!! warning
253+
254+
The Python-Markdown library does ***not*** sanitize its HTML output. If
255+
you are processing Markdown input from an untrusted source, it is your
256+
responsibility to ensure that it is properly sanitized. For more
257+
information see [Sanitizing HTML Output].
233258

234259
The `source` text must meet the same requirements as the [`text`](#text)
235260
argument of the [`markdown.markdown`](#markdown) function.
@@ -258,7 +283,16 @@ To make this easier, you can also chain calls to `reset` together:
258283
html3 = md.reset().convert(text3)
259284
```
260285

261-
#### Markdown.convertFile(**kwargs) {: #convertFile data-toc-label='Markdown.convertFile' }
286+
#### `Markdown.convertFile(**kwargs)` {: #convertFile data-toc-label='Markdown.convertFile' }
287+
288+
!!! warning
289+
290+
The Python-Markdown library does ***not*** sanitize its HTML output. As
291+
`Markdown.convertFile` writes directly to the file system, there is no
292+
easy way to sanitize the output from Python code. Therefore, it is
293+
recommended that the `Markdown.convertFile` method not be used on input
294+
from an untrusted source. For more information see [Sanitizing HTML
295+
Output].
262296

263297
The arguments of this method are identical to the arguments of the same
264298
name on the `markdown.markdownFromFile` function ([`input`](#input),

docs/sanitization.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
1+
title: Sanitization and Security
2+
3+
# Sanitizing HTML Output
4+
5+
The Python-Markdown library does ***not*** sanitize its HTML output. If you
6+
are processing Markdown input from an untrusted source, it is your
7+
responsibility to ensure that it is properly sanitized. See _[Markdown and
8+
XSS]_ for an overview of some of the dangers and _[Improper markup sanitization
9+
in popular software]_ for notes on best practices to ensure HTML is properly
10+
sanitized. With those concerns in mind, some recommendations are provided
11+
below to ensure that any input from an untrusted source is properly
12+
sanitized.
13+
14+
That said, if you fully trust the source of your input, you may choose to do
15+
nothing. Conversely, you may find solutions other than those suggested here.
16+
However, you do so at your own risk.
17+
18+
## Using `JustHTML`
19+
20+
[`JustHTML`][JustHTML] is recommended as a sanitizer on the output of `markdown.markdown`
21+
or `Markdown.convert`. When you pass HTML output through `JustHTML`, it is
22+
sanitized by default according to a strict [allow list policy]. The policy
23+
can be [customized] if necessary.
24+
25+
``` python
26+
import markdown
27+
from justhtml import JustHTML
28+
29+
html = markdown.markdown(text)
30+
safe_html = JustHTML(html, fragment=True).to_html()
31+
```
32+
33+
## Using `nh3` or `bleach`
34+
35+
If you cannot use `JustHTML` for some reason, some alternatives include [`nh3`][nh3] or
36+
[`bleach`][bleach][^1]. However, be aware that these libraries will not be sufficient
37+
in themselves and will require customization. Some useful lists of allowed
38+
tags and attributes can be found in the [`bleach-allowlist`]
39+
[bleach-allowlist] library, which should work with both `nh3` and `bleach` as `nh3`
40+
mirrors `bleach`'s API.
41+
42+
``` python
43+
import markdown
44+
import bleach
45+
from bleach_allowlist import markdown_tags, markdown_attrs
46+
47+
html = markdown.markdown(text)
48+
safe_html = bleach.clean(html, markdown_tags, markdown_attrs)
49+
```
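
When customizing an allow list, it can help to check in your test suite that sanitized output contains only the tags you expect. The sketch below uses only the standard library and is **not** a sanitizer: it inspects tag names only, ignoring attributes and URL schemes. The `ALLOWED_TAGS` set, `TagAuditor`, and `unexpected_tags` are hypothetical names chosen for illustration and do not reflect the policy of any library mentioned here.

```python
from html.parser import HTMLParser

# Hypothetical allow list for illustration only.
ALLOWED_TAGS = {"p", "em", "strong", "a", "code", "pre", "ul", "ol", "li"}


class TagAuditor(HTMLParser):
    """Collect the names of start tags that are not on the allow list."""

    def __init__(self, allowed):
        super().__init__()
        self.allowed = set(allowed)
        self.unexpected = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag not in self.allowed:
            self.unexpected.append(tag)


def unexpected_tags(html, allowed=ALLOWED_TAGS):
    """Return the disallowed tag names found in `html`, in document order."""
    auditor = TagAuditor(allowed)
    auditor.feed(html)
    auditor.close()
    return auditor.unexpected
```

In a test, asserting that `unexpected_tags(safe_html)` is an empty list gives a quick signal when a policy change lets an unwanted element through. It complements a real sanitizer and does not replace one.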
50+
51+
[^1]: The [`bleach`][bleach] project has been [deprecated](https://github.com/mozilla/bleach/issues/698).
52+
However, it may be the only option for some users as `nh3` is a set of Python bindings to a Rust library.
53+
54+
## Sanitizing on the Command Line
55+
56+
Both Python-Markdown and `JustHTML` provide command line interfaces which read
57+
from `STDIN` and write to `STDOUT`. Therefore, they can be used together to
58+
ensure that the output from untrusted input is properly sanitized.
59+
60+
```sh
61+
echo "Some **Markdown** text." | python -m markdown | justhtml - --fragment > safe_output.html
62+
```
63+
64+
For more information on `JustHTML`'s Command Line Interface, see the
65+
[documentation][JustHTML_CLI]. Use the `--help` option for a list of all available
66+
options and arguments to the `markdown` command.
67+
68+
[Markdown and XSS]: https://michelf.ca/blog/2010/markdown-and-xss/
69+
[Improper markup sanitization in popular software]: https://github.com/ChALkeR/notes/blob/master/Improper-markup-sanitization.md
70+
[JustHTML]: https://emilstenstrom.github.io/justhtml/
71+
[allow list policy]: https://emilstenstrom.github.io/justhtml/html-cleaning.html#default-sanitization-policy
72+
[customized]: https://emilstenstrom.github.io/justhtml/html-cleaning.html#use-a-custom-sanitization-policy
73+
[nh3]: https://nh3.readthedocs.io/en/latest/
74+
[bleach]: http://bleach.readthedocs.org/en/latest/
75+
[bleach-allowlist]: https://github.com/yourcelf/bleach-allowlist
76+
[JustHTML_CLI]: https://emilstenstrom.github.io/justhtml/cli.html
