| 1 | +# Overview |
| 2 | + |
| 3 | +DOCX files are complex, and their complexity makes scraping documents |
| 4 | +for their content difficult. The aim of this package is to simplify |
| 5 | +`.docx` files to just the components which carry meaning, thereby easing the |
| 6 | +process of document identification and scraping by converting a `.docx` |
| 7 | +file into a predictable and *human readable* JSON file. |
| 8 | + |
| 9 | +Simplifying a complex document down to its *meaningful* parts of course |
| 10 | +requires taking a position on what does and does not convey meaning in a |
| 11 | +document. Generally, this package takes the stance that the document |
| 12 | +structure (body, paragraphs, tables, etc.) is meaningful, as is the text |
| 13 | +itself, whereas text styling (font, font-weight, etc.) is ignored almost |
| 14 | +entirely, with the exception of paragraph indentation and numbering, which |
| 15 | +are often used to create lists, block quotes, etc. The |
| 16 | +opinions expressed by this package are explained in the Options section |
| 17 | +below and can be changed to suit your needs. |
| 18 | + |
| 19 | +# Usage |
| 20 | +```python |
| 21 | +import docx |
| 22 | +from simplify_docx import simplify |
| 23 | + |
| 24 | +# read in a document |
| 25 | +my_doc = docx.Document("/path/to/my/favorite/file.docx") |
| 26 | + |
| 27 | +# coerce to JSON using the standard options |
| 28 | +my_doc_as_json = simplify(my_doc) |
| 29 | + |
| 30 | +# or with non-standard options |
| 31 | +my_doc_as_json = simplify(my_doc, {"remove-leading-white-space": False}) |
| 32 | +``` |
| 33 | + |
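The usage snippet above stops at the in-memory result. As a minimal sketch of the next step, assuming the value returned by `simplify` is an ordinary JSON-serializable structure (nested dicts and lists), it can be written out with the standard `json` module. The helper name `to_json_text` is hypothetical and not part of the package.

```python
import json


def to_json_text(simplified):
    """Serialize a simplified document (assumed to be dicts/lists) to text.

    `to_json_text` is an illustrative helper, not part of simplify_docx.
    """
    # indent=2 keeps the output human readable; ensure_ascii=False keeps
    # any non-ASCII document text legible instead of \u-escaped.
    return json.dumps(simplified, indent=2, ensure_ascii=False)


# Hypothetical usage with the result of simplify(my_doc):
# with open("file.json", "w", encoding="utf-8") as handle:
#     handle.write(to_json_text(my_doc_as_json))
```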
| 34 | +# Installation |
| 35 | + |
| 36 | +This project relies on the `python-docx` package which can be installed via |
| 37 | +`pip install python-docx`. **However**, as of this writing, if you wish to |
| 38 | +scrape documents which contain (A) form fields such as drop down lists, |
| 39 | +checkboxes and text inputs or (B) nested documents (subdocs, altChunks, |
| 40 | +etc.), you'll need to clone [this fork](https://github.com/jdthorpe/python-docx) of the python-docx package. |
| 41 | + |
| 42 | +# Options |
| 43 | + |
| 44 | +### General |
| 45 | + |
| 46 | +* **"friendly-names"**: (*Default = `True`*): Use user-friendly type names |
| 47 | + such as "table-cell", over standard element names like "CT_Tc" |
| 48 | + |
| 49 | +### Ignoring Invisible things |
| 50 | + |
| 51 | +* **"ignore-empty-paragraphs"**: (*Default = `True`*): Empty paragraphs are |
| 52 | +   often used for styling purposes and rarely have significance in the |
| 53 | + meaning of the document. |
| 54 | +* **"ignore-empty-text"**: (*Default = `True`*): Empty text runs can make an |
| 55 | + otherwise empty paragraph appear to contain data. |
| 56 | +* **"remove-leading-white-space"**: (*Default = `True`*): Leading white-space |
| 57 | +   at the start of a paragraph is occasionally used for styling purposes |
| 58 | + and rarely has significance in the interpretation of a document. |
| 59 | +* **"remove-trailing-white-space"**: (*Default = `True`*): Trailing white-space |
| 60 | + at the end of a paragraph rarely has significance in the interpretation |
| 61 | + of a document. |
| 62 | +* **"flatten-inner-spaces"**: (*Default = `False`*): Collapse multiple |
| 63 | + space characters between words to a single space. |
| 64 | +* **"ignore-joiners"**: (*Default = `False`*): Zero width joiner and non-joiner |
| 65 | + characters are special characters used to create ligatures in displayed |
| 66 | + text and don't typically convey meaning (at least in alphabet based |
| 67 | + languages). |
| 68 | + |
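To make the whitespace options above concrete, here is a rough sketch of their effect on a single paragraph string. It is an illustration only and not the package's implementation: the function name `tidy_paragraph_text` is hypothetical, and the real package operates on document elements rather than bare strings.

```python
import re


def tidy_paragraph_text(text, options=None):
    """Mimic the documented whitespace options on a plain string (sketch)."""
    # Defaults copied from the option list above.
    opts = {
        "remove-leading-white-space": True,
        "remove-trailing-white-space": True,
        "flatten-inner-spaces": False,
        "ignore-joiners": False,
    }
    opts.update(options or {})

    if opts["ignore-joiners"]:
        # U+200D is the zero width joiner, U+200C the zero width non-joiner.
        text = text.replace("\u200d", "").replace("\u200c", "")
    if opts["remove-leading-white-space"]:
        text = text.lstrip()
    if opts["remove-trailing-white-space"]:
        text = text.rstrip()
    if opts["flatten-inner-spaces"]:
        # Collapse runs of two or more space characters to a single space.
        text = re.sub(r" {2,}", " ", text)
    return text
```

With the defaults, `tidy_paragraph_text("  Hello   world  ")` keeps the inner run of spaces and only trims the ends; passing `{"flatten-inner-spaces": True}` also collapses the inner run.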
| 69 | +### Special symbols |
| 70 | + |
| 71 | +* **"dumb-quotes"**: (*Default = `True`*): Replace smart quotes with |
| 72 | + dumb quotes. |
| 73 | +* **"dumb-hyphens"**: (*Default = `True`*): Replace en-dash, em-dash, |
| 74 | + figure-dash, horizontal bar, and non-breaking hyphens with ordinary hyphens. |
| 75 | +* **"dumb-spaces"**: (*Default = `True`*): Replace zero width spaces, hair |
| 76 | + spaces, thin spaces, punctuation spaces, figure spaces, six per em |
| 77 | + spaces, four per em spaces, three per em spaces, em spaces, en spaces, |
| 78 | + em quad spaces, and en quad spaces with ordinary spaces. |
| 79 | +* **"special-characters-as-text"**: (*Default = `True`*): Coerce special |
| 80 | + characters into text equivalents according to the following table: |
| 81 | + |
| 82 | +| Character | Text Equivalent | |
| 83 | +| --------- | --------------- | |
| 84 | +| CarriageReturn | `\n` | |
| 85 | +| Break | `\r` | |
| 86 | +| TabChar | `\t` | |
| 87 | +| PositionalTab | `\t` | |
| 88 | +| NoBreakHyphen | `-` | |
| 89 | +| SoftHyphen | `-` | |
| 90 | + |
| 91 | +* **"symbol-as-text"**: (*Default = `True`*): Special symbols often carry |
| 92 | +   meaning other than the underlying Unicode character, especially when |
| 93 | +   the font is a special font such as `Wingdings`. If `True`, these are |
| 94 | +   included as ordinary text and their font information is omitted. |
| 95 | +* **"empty-as-text"**: (*Default = `False`*): There are a variety of "Empty" |
| 96 | +   tags such as the `<w:yearLong>` tag, which cause the current year to |
| 97 | +   be inserted into the document text. If `True`, include these as text |
| 98 | +   formatted as `"[yearLong]"`. |
| 99 | +* **"ignore-left-to-right-mark"**: (*Default = `False`*): Ignore the left-to-right |
| 100 | +   mark, which is not writeable by Python's csv writer. |
| 101 | +* **"ignore-right-to-left-mark"**: (*Default = `False`*): Ignore the right-to-left |
| 102 | +   mark, which is not writeable by Python's csv writer. |
| 103 | + |
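The `dumb-quotes`, `dumb-hyphens`, and `dumb-spaces` options describe character substitutions. The sketch below shows one way such a mapping could be expressed with `str.translate`, using the Unicode code points for the characters named in the list above. It is an assumption-laden illustration, not the package's own code, and `dumb_down` is a hypothetical name.

```python
# Smart quotes -> straight quotes.
QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

# Non-breaking hyphen, figure dash, en-dash, em-dash, horizontal bar -> "-".
HYPHENS = dict.fromkeys(["\u2011", "\u2012", "\u2013", "\u2014", "\u2015"], "-")

# U+2000..U+200A cover the en quad through hair space; U+200B is the
# zero width space. All are mapped to an ordinary space here.
SPACES = dict.fromkeys([chr(code) for code in range(0x2000, 0x200C)], " ")

# One translation table combining the three groups.
_TABLE = str.maketrans({**QUOTES, **HYPHENS, **SPACES})


def dumb_down(text):
    """Replace typographic quotes, dashes, and spaces with ASCII (sketch)."""
    return text.translate(_TABLE)
```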
| 104 | +### Paragraph style: |
| 105 | + |
| 106 | +Paragraph style markup is one exception to the styling vs. content |
| 107 | +dichotomy. For example, block quotes are often indicated by indenting whole |
| 108 | +paragraphs, and ordered lists, unordered lists, and nested lists are |
| 109 | +often used to divide sections of a document into logical components. |
| 110 | + |
| 111 | +* **"include-paragraph-indent"**: (*Default = `True`*): Include the |
| 112 | +   indentation markup on paragraph (`CT_P`) elements. Indentation is |
| 113 | +   measured in twips (twentieths of a point). |
| 114 | +* **"include-paragraph-numbering"**: (*Default = `True`*): Include the |
| 115 | + numbering styles, which are included in the `CT_P.pPr.numPr` element. |
| 116 | + The `ilvl` attribute indicates the level of nesting (zero based index) |
| 117 | + and the `numId` attribute refers to a specific numbering style |
| 118 | + included in the document's internal styles sheet. |
| 119 | + |
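Since indentation is reported in twips, a small conversion helper can make the values easier to interpret. A twip is one twentieth of a point, so there are 1440 twips per inch. The function names below are hypothetical and only illustrate the arithmetic.

```python
TWIPS_PER_POINT = 20
TWIPS_PER_INCH = 1440  # 72 points per inch * 20 twips per point


def twips_to_points(twips):
    """Convert a twip measurement to points."""
    return twips / TWIPS_PER_POINT


def twips_to_inches(twips):
    """Convert a twip measurement to inches."""
    return twips / TWIPS_PER_INCH
```

For example, an indent of 720 twips corresponds to half an inch, a common block quote or list indent.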
| 120 | +### Form Elements |
| 121 | + |
| 122 | +* **"simplify-dropdown"**: (*Default = `True`*): Include just the selected |
| 123 | + and default values, the available options, and the name and label attributes in the form element. |
| 124 | +* **"simplify-textinput"**: (*Default = `True`*): Include just the current |
| 125 | + and default values, and the name and label attributes in the form element. |
| 126 | +* **"greedy-text-input"**: (*Default = `True`*): Continue consuming run |
| 127 | + elements when the text-input has not ended at the end of a paragraph, |
| 128 | + and the next block level element is also a paragraph. This typically |
| 129 | +   occurs when the user presses the return key while editing a text input |
| 130 | + field. |
| 131 | +* **"simplify-checkbox"**: (*Default = `True`*): Include just the current |
| 132 | + and default values, and the name and label attributes in the form element. |
| 133 | +* **"use-checkbox-default"**: (*Default = `True`*): If the checkbox has no |
| 134 | + `value` attribute (typically because the user has not interacted with |
| 135 | + it), report the default value as the checkbox value. |
| 136 | +* **"checkbox-as-text"**: (*Default = `False`*): Coerce the value of the |
| 137 | +   checkbox to text, represented as either `"[CheckBox:True]"` or `"[CheckBox:False]"`. |
| 138 | +* **"dropdown-as-text"**: (*Default = `False`*): Coerce the value of the |
| 139 | +   drop-down to text, represented as `"[DropDown:<selected value>]"`. |
| 140 | +* **"trim-dropdown-options"**: (*Default = `True`*): Remove white-space on |
| 141 | + the left and right of drop down option items. |
| 142 | +* **"flatten-generic-field"**: (*Default = `True`*): `generic-fields` are |
| 143 | + `CT_FldChar` runs which are not marked as a drop-down, text-input, or |
| 144 | + checkbox. These may include special instructions which apply special |
| 145 | +   formatting to a text run (e.g. a hyperlink). If `True`, the contents |
| 146 | +   of generic-fields are included in the normal flow of text. |
| 147 | + |
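The text representations named by `checkbox-as-text` and `dropdown-as-text`, together with the fallback described for `use-checkbox-default`, can be sketched as follows. The function names and the use of `None` for "no `value` attribute" are assumptions made for illustration; they do not describe the package's internals.

```python
def effective_checkbox_value(value, default, use_default=True):
    """Return the value to report for a checkbox (sketch).

    `value` is None when the checkbox has no `value` attribute; in that
    case the default is reported if `use_default` is set.
    """
    if value is None and use_default:
        return default
    return value


def checkbox_as_text(checked):
    """Format a boolean checkbox value as "[CheckBox:True]" or "[CheckBox:False]"."""
    return "[CheckBox:{}]".format(bool(checked))


def dropdown_as_text(selected):
    """Format the selected drop-down value as "[DropDown:<selected value>]"."""
    return "[DropDown:{}]".format(selected)
```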
| 148 | +### Special content |
| 149 | + |
| 150 | +* **"merge-consecutive-text"**: (*Default = `True`*): Sentences and even single |
| 151 | + words can be represented by multiple text elements. If `True`, |
| 152 | + concatenate consecutive text elements into a single text element. |
| 153 | +* **"flatten-hyperlink"**: (*Default = `True`*): Flatten hyperlinks, including |
| 154 | + their contents in the flow of normal text. |
| 155 | +* **"flatten-smartTag"**: (*Default = `True`*): Flatten smartTag elements, |
| 156 | + including their contents in the flow of normal text. |
| 157 | +* **"flatten-customXml"**: (*Default = `True`*): Flatten customXml elements, |
| 158 | + including their contents in the flow of normal text. |
| 159 | +* **"flatten-simpleField"**: (*Default = `True`*): Flatten simpleField elements, |
| 160 | + including their contents in the flow of normal text. |
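As an illustration of what `merge-consecutive-text` describes, the sketch below concatenates adjacent text elements in a list. The element shape used here (dicts with `"TYPE"` and `"VALUE"` keys) is an assumption made for the example, and `merge_consecutive_text` is a hypothetical helper rather than the package's own function.

```python
def merge_consecutive_text(elements):
    """Concatenate runs of adjacent text elements into single elements (sketch)."""
    merged = []
    for element in elements:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous.get("TYPE") == "text"
            and element.get("TYPE") == "text"
        ):
            # Extend the previous text element instead of appending a new one.
            previous["VALUE"] += element["VALUE"]
        else:
            # Copy so the caller's input elements are not mutated.
            merged.append(dict(element))
    return merged
```

A word split across two runs, such as `"Hel"` followed by `"lo"`, comes back as a single `"Hello"` element, while non-text elements between runs keep the surrounding text separate.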
1 | 161 |
|
2 | 162 | # Contributing |
3 | 163 |
|
|