Commit 28a5fe8

author
Jason Thorpe
committed
initial commit
1 parent 1e388fb commit 28a5fe8

28 files changed

Lines changed: 2739 additions & 0 deletions

.pylintrc

Lines changed: 564 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 160 additions & 0 deletions
@@ -1,3 +1,163 @@
# Overview

DOCX files are complex, and their complexity makes scraping documents
for their content difficult. The aim of this package is to simplify
`.docx` files to just the components which carry meaning, thereby easing the
process of document identification and scraping by converting a `.docx`
file into a predictable and *human readable* JSON file.

Simplifying a complex document down to its *meaningful* parts of course
requires taking a position on what does and does not convey meaning in a
document. Generally, this package takes the stance that the document
structure (body, paragraphs, tables, etc.) is meaningful, as is the text
itself, whereas text styling (font, font-weight, etc.) is ignored almost
entirely, with the exception of paragraph indentation and numbering, which
are often used to create lists, block quotes, etc. The
opinions expressed by this package are explained in the Options section
below and can be changed to suit your needs.

# Usage

```python
import docx
from simplify_docx import simplify

# read in a document
my_doc = docx.Document("/path/to/my/favorite/file.docx")

# coerce to JSON using the standard options
my_doc_as_json = simplify(my_doc)

# or with non-standard options
my_doc_as_json = simplify(my_doc, {"remove-leading-white-space": False})
```

# Installation

This project relies on the `python-docx` package, which can be installed via
`pip install python-docx`. **However**, as of this writing, if you wish to
scrape documents which contain (A) form fields such as drop-down lists,
checkboxes and text inputs or (B) nested documents (subdocs, altChunks,
etc.), you'll need to clone [this fork](https://github.com/jdthorpe/python-docx) of the python-docx package.

# Options

### General

* **"friendly-names"**: (*Default = `True`*): Use user-friendly type names
  such as "table-cell" instead of standard element names like "CT_Tc".

### Ignoring Invisible things

* **"ignore-empty-paragraphs"**: (*Default = `True`*): Empty paragraphs are
  often used for styling purposes and rarely have significance in the
  meaning of the document.
* **"ignore-empty-text"**: (*Default = `True`*): Empty text runs can make an
  otherwise empty paragraph appear to contain data.
* **"remove-leading-white-space"**: (*Default = `True`*): Leading white-space
  at the start of a paragraph is occasionally used for styling purposes
  and rarely has significance in the interpretation of a document.
* **"remove-trailing-white-space"**: (*Default = `True`*): Trailing white-space
  at the end of a paragraph rarely has significance in the interpretation
  of a document.
* **"flatten-inner-spaces"**: (*Default = `False`*): Collapse multiple
  space characters between words to a single space.
* **"ignore-joiners"**: (*Default = `True`*): Zero width joiner and non-joiner
  characters are special characters used to create ligatures in displayed
  text and don't typically convey meaning (at least in alphabet-based
  languages).

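As a rough illustration of what the white-space options above do to a paragraph's text, the following sketch applies the same ideas to a plain string. The `clean_whitespace` helper and its keyword arguments are hypothetical and are not part of this package, which operates on document elements rather than on strings.

```python
import re

# Zero width non-joiner (U+200C) and zero width joiner (U+200D)
JOINERS = "\u200c\u200d"


def clean_whitespace(text, leading=True, trailing=True, inner=False, joiners=False):
    """Hypothetical helper mirroring the white-space options on a plain string."""
    if joiners:
        # analogous to "ignore-joiners": drop the joiner characters
        text = text.translate({ord(ch): None for ch in JOINERS})
    if leading:
        # analogous to "remove-leading-white-space"
        text = text.lstrip()
    if trailing:
        # analogous to "remove-trailing-white-space"
        text = text.rstrip()
    if inner:
        # analogous to "flatten-inner-spaces": collapse runs of spaces
        text = re.sub(r" {2,}", " ", text)
    return text


print(repr(clean_whitespace("  Hello   world  ")))
print(repr(clean_whitespace("  Hello   world  ", inner=True)))
```

Stripping happens before the inner spaces are collapsed, so the two steps do not interfere with each other.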
### Special symbols

* **"dumb-quotes"**: (*Default = `True`*): Replace smart quotes with
  dumb quotes.
* **"dumb-hyphens"**: (*Default = `True`*): Replace en-dash, em-dash,
  figure-dash, horizontal bar, and non-breaking hyphens with ordinary hyphens.
* **"dumb-spaces"**: (*Default = `True`*): Replace zero width spaces, hair
  spaces, thin spaces, punctuation spaces, figure spaces, six per em
  spaces, four per em spaces, three per em spaces, em spaces, en spaces,
  em quad spaces, and en quad spaces with ordinary spaces.
* **"special-characters-as-text"**: (*Default = `True`*): Coerce special
  characters into text equivalents according to the following table:

| Character | Text Equivalent |
| --------- | --------------- |
| CarriageReturn | `\n` |
| Break | `\r` |
| TabChar | `\t` |
| PositionalTab | `\t` |
| NoBreakHyphen | `-` |
| SoftHyphen | `-` |

* **"symbol-as-text"**: (*Default = `True`*): Special symbols often carry
  meaning other than the underlying Unicode character, especially when
  the font is a special font such as `Wingdings`. If `True`, these are
  included as ordinary text and their font information is omitted.
* **"empty-as-text"**: (*Default = `False`*): There are a variety of "Empty"
  tags such as the `<w:yearLong>` tag, which cause the current year to
  be inserted into the document text. If `True`, include these as text
  formatted as `"[yearLong]"`.
* **"ignore-left-to-right-mark"**: (*Default = `False`*): Ignore the left-to-right
  mark, which is not writeable by Python's csv writer.
* **"ignore-right-to-left-mark"**: (*Default = `False`*): Ignore the right-to-left
  mark, which is not writeable by Python's csv writer.

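The "dumb" replacements described above amount to a character-for-character translation. The sketch below shows one way this could be done on a plain string with `str.translate`. The tables and the `dumb_down` function are hypothetical and the package's own mappings may differ.

```python
# Hypothetical translation tables built from the characters listed above.
DUMB_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

# non-breaking hyphen, figure dash, en dash, em dash, horizontal bar
DUMB_HYPHENS = {ch: "-" for ch in "\u2011\u2012\u2013\u2014\u2015"}

# U+2000 (en quad) through U+200B (zero width space)
DUMB_SPACES = {chr(cp): " " for cp in range(0x2000, 0x200C)}

TABLE = str.maketrans({**DUMB_QUOTES, **DUMB_HYPHENS, **DUMB_SPACES})


def dumb_down(text):
    """Replace smart quotes, dashes and special spaces with ASCII equivalents."""
    return text.translate(TABLE)


print(dumb_down("\u201cHi\u201d \u2013 it\u2019s"))
```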
### Paragraph style:

Paragraph style markup is one exception to the styling vs. content
dichotomy. For example, block quotes are often indicated by indenting whole
paragraphs, and ordered lists, unordered lists and nesting of lists are
often used to divide sections of a document into logical components.

* **"include-paragraph-indent"**: (*Default = `True`*): Include the
  indentation markup on paragraph (`CT_P`) elements. Indentation is
  measured in twips (twentieths of a point).
* **"include-paragraph-numbering"**: (*Default = `True`*): Include the
  numbering styles, which are included in the `CT_P.pPr.numPr` element.
  The `ilvl` attribute indicates the level of nesting (zero-based index)
  and the `numId` attribute refers to a specific numbering style
  included in the document's internal style sheet.

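Since indentation is reported in twips, it can help to convert it to more familiar units. A twip is one twentieth of a point, and there are 1440 twips to an inch. The helper names below are hypothetical and are not provided by this package.

```python
TWIPS_PER_POINT = 20
TWIPS_PER_INCH = 1440


def twips_to_points(twips):
    """Convert a length in twips to points."""
    return twips / TWIPS_PER_POINT


def twips_to_inches(twips):
    """Convert a length in twips to inches."""
    return twips / TWIPS_PER_INCH


# A half-inch indent, which is a common default for block quotes and lists
print(twips_to_points(720), twips_to_inches(720))
```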
### Form Elements

* **"simplify-dropdown"**: (*Default = `True`*): Include just the selected
  and default values, the available options, and the name and label attributes in the form element.
* **"simplify-textinput"**: (*Default = `True`*): Include just the current
  and default values, and the name and label attributes in the form element.
* **"greedy-text-input"**: (*Default = `True`*): Continue consuming run
  elements when the text-input has not ended at the end of a paragraph,
  and the next block level element is also a paragraph. This typically
  occurs when the user presses the return key while editing a text input
  field.
* **"simplify-checkbox"**: (*Default = `True`*): Include just the current
  and default values, and the name and label attributes in the form element.
* **"use-checkbox-default"**: (*Default = `True`*): If the checkbox has no
  `value` attribute (typically because the user has not interacted with
  it), report the default value as the checkbox value.
* **"checkbox-as-text"**: (*Default = `False`*): Coerce the value of the
  checkbox to text, represented as either `"[CheckBox:True]"` or `"[CheckBox:False]"`.
* **"dropdown-as-text"**: (*Default = `False`*): Coerce the value of the
  drop-down to text, represented as `"[DropDown:<selected value>]"`.
* **"trim-dropdown-options"**: (*Default = `True`*): Remove white-space on
  the left and right of drop-down option items.
* **"flatten-generic-field"**: (*Default = `True`*): `generic-fields` are
  `CT_FldChar` runs which are not marked as a drop-down, text-input, or
  checkbox. These may include special instructions which apply special
  formatting to a text run (e.g. a hyperlink). If `True`, the contents
  of generic-fields are included in the normal flow of text.

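To make the interaction between "use-checkbox-default" and the two "as-text" options concrete, here is a minimal sketch using plain Python values. The function names are hypothetical, and `None` is assumed here to stand for a missing `value` attribute. The package itself works on form field elements.

```python
def checkbox_value(value, default, use_default=True):
    """Resolve a checkbox value in the way "use-checkbox-default" describes.

    `value` is None when the checkbox has no `value` attribute (assumption).
    """
    if value is None and use_default:
        return default
    return value


def checkbox_as_text(value):
    """Render a checkbox value in the "[CheckBox:...]" text form."""
    return "[CheckBox:{}]".format(bool(value))


def dropdown_as_text(selected):
    """Render a drop-down selection in the "[DropDown:...]" text form."""
    return "[DropDown:{}]".format(selected)


print(checkbox_as_text(checkbox_value(None, default=True)))
print(dropdown_as_text("Option A"))
```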
### Special content

* **"merge-consecutive-text"**: (*Default = `True`*): Sentences and even single
  words can be represented by multiple text elements. If `True`,
  concatenate consecutive text elements into a single text element.
* **"flatten-hyperlink"**: (*Default = `True`*): Flatten hyperlinks, including
  their contents in the flow of normal text.
* **"flatten-smartTag"**: (*Default = `True`*): Flatten smartTag elements,
  including their contents in the flow of normal text.
* **"flatten-customXml"**: (*Default = `True`*): Flatten customXml elements,
  including their contents in the flow of normal text.
* **"flatten-simpleField"**: (*Default = `True`*): Flatten simpleField elements,
  including their contents in the flow of normal text.
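
The idea behind "merge-consecutive-text" can be sketched on a list of simple dictionaries. The `"TYPE"` and `"VALUE"` keys and the `merge_consecutive_text` function are assumptions made for this illustration and may not match the structure this package actually emits.

```python
def merge_consecutive_text(elements):
    """Concatenate neighbouring text elements, leaving other elements as-is."""
    merged = []
    for element in elements:
        if merged and element["TYPE"] == "text" and merged[-1]["TYPE"] == "text":
            # extend the previous text element instead of appending a new one
            merged[-1] = {
                "TYPE": "text",
                "VALUE": merged[-1]["VALUE"] + element["VALUE"],
            }
        else:
            merged.append(dict(element))
    return merged


runs = [
    {"TYPE": "text", "VALUE": "Hel"},
    {"TYPE": "text", "VALUE": "lo"},
    {"TYPE": "tab"},
    {"TYPE": "text", "VALUE": "!"},
]
print(merge_consecutive_text(runs))
```

Non-text elements act as boundaries, so text on either side of them is never joined.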

# Contributing

setup.py

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
"""
Package installation via setup()
"""
import codecs
import os
import re
from setuptools import setup


# Allow single version in source file to be used here
# From https://packaging.python.org/guides/single-sourcing-package-version/
def read(*parts):
    """Return the contents of the file at `parts`, relative to this directory."""
    # intentionally *not* adding an encoding option to open
    # see here: https://github.com/pypa/virtualenv/issues/201#issuecomment-3145690
    here = os.path.abspath(os.path.dirname(__file__))
    # use a context manager so the file handle is closed promptly
    with codecs.open(os.path.join(here, *parts), 'r') as file_handle:
        return file_handle.read()


def find_version(*file_paths):
    """Extract the `__version__` string from the given source file."""
    version_file = read(*file_paths)
    version_match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]",
                              version_file, re.M)
    if version_match:
        return version_match.group(1)
    raise RuntimeError("Unable to find version string.")


setup(name="simplify-docx",
      version=find_version('simplify_docx', '__init__.py'),
      description="A utility for simplifying python-docx document objects",
      author="Microsoft Research",
      packages=['simplify_docx'],
      license='UNLICENSED',
      install_requires=["python-docx"])
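
The version lookup in `setup.py` relies on a multi-line regular expression. The following sketch applies the same pattern to an in-memory string rather than a file, to show what it does and does not match. `find_version_in` and `sample` are hypothetical names used only for this illustration.

```python
import re

# Same pattern as in setup.py: a top-of-line `__version__ = "..."` assignment
VERSION_PATTERN = r"^__version__ = ['\"]([^'\"]*)['\"]"


def find_version_in(source):
    """Return the version string found in `source`, or raise RuntimeError."""
    match = re.search(VERSION_PATTERN, source, re.M)
    if match:
        return match.group(1)
    raise RuntimeError("Unable to find version string.")


sample = '"""module docstring"""\n\n__version__ = "0.1.0"\n'
print(find_version_in(sample))
```

Because of `re.M`, the `^` anchor matches at the start of any line, so the assignment does not need to be the first line of the file, but it must not be indented.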

simplify_docx/__init__.py

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
"""
Coerce Docx Documents to JSON

Not thread safe! (but could be if build_iterators returned the built iterator
definitions and passed them around...)
"""

from typing import Union, Dict, Optional, Type, Any
from .types.fragment import documentPart
from .utils.walk import walk
from .utils.friendly_names import apply_friendly_names
from .elements import document
from .utils.set_options import set_options as __set_options__

__version__ = "0.1.0"

# --------------------------------------------------
# Main API
# --------------------------------------------------
def simplify(doc: documentPart, options: Optional[Dict[str, Any]] = None):
    """
    Coerce Docx Documents to JSON
    """

    # SET OPTIONS
    _options: Dict[str, Any]
    if options:
        _options = dict(__default_options__, **options)
    else:
        _options = __default_options__
    __set_options__(_options)

    out = document(doc.element).to_json(doc, _options)

    # the option key is "friendly-names" (plural), as in __default_options__
    if _options.get("friendly-names", True):
        apply_friendly_names(out)

    return out


# --------------------------------------------------
# Default Options
# --------------------------------------------------
__default_options__: Dict[str, Union[str, bool, int, float]] = {
    # general
    "friendly-names": True,
    # flattening special content
    "flatten-hyperlink": True,
    "flatten-smartTag": True,
    "flatten-customXml": True,
    "flatten-simpleField": True,
    "merge-consecutive-text": True,
    "flatten-inner-spaces": False,
    # possibly meaningful style:
    "include-paragraph-indent": True,
    "include-paragraph-numbering": True,
    # ignoring invisible things
    "ignore-joiners": True,
    "ignore-left-to-right-mark": False,
    "ignore-right-to-left-mark": False,
    "ignore-empty-table-description": True,
    "ignore-empty-table-caption": True,
    "ignore-empty-paragraphs": True,
    "ignore-empty-text": True,
    "remove-trailing-white-space": True,
    "remove-leading-white-space": True,
    # forms
    "use-checkbox-default": True,
    "greedy-text-input": True,
    "checkbox-as-text": False,
    "dropdown-as-text": False,
    "simplify-dropdown": True,
    "simplify-textinput": True,
    "simplify-checkbox": True,
    "flatten-generic-field": True,
    "trim-dropdown-options": True,
    # special symbols
    "empty-as-text": False,
    "symbol-as-text": True,
    "special-characters-as-text": True,
    "dumb-quotes": True,
    "dumb-hyphens": True,
    "dumb-spaces": True,
}
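
`simplify` layers user-supplied options over these defaults with `dict(defaults, **options)`. The sketch below isolates that merge using a two-entry stand-in for the defaults. `DEFAULTS` and `merge_options` are hypothetical names, and unlike `simplify`, this version always returns a fresh dictionary, even when no options are given.

```python
# A two-entry stand-in for the full defaults dictionary
DEFAULTS = {
    "friendly-names": True,
    "remove-leading-white-space": True,
}


def merge_options(options=None):
    """Return the defaults overridden by any user-supplied options."""
    # dict(a, **b) copies `a`, then applies the entries of `b` on top
    return dict(DEFAULTS, **(options or {}))


merged = merge_options({"remove-leading-white-space": False})
print(merged)
```

Keys that the caller does not mention keep their default values, and the defaults dictionary itself is left unmodified.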

simplify_docx/elements/__init__.py

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
"""
Docx element objects
"""
# from .blocks import smartTag, customXml, fldSimple, hyperlink, paragraph_list, paragraph
from .base import el, container, IncompatibleTypeError

from .body import body
from .document import document, altChunk, subDoc, contentPart
from .table import table, tr, tc
from .run_contents import text, simpleTextElement, SymbolChar, empty
from .form import fldChar, checkBox, ddList, textInput, ffData
from .paragraph import (
    EG_PContent,
    paragraph,
    hyperlink,
    fldSimple,
    customXml,
    smartTag,
)
