|
2 | 2 |
|
3 | 3 | DOCX files are complex, and their complexity makes scraping documents |
4 | 4 | for their content difficult. The aim of this package is to simplify |
5 | | -`.docx` files to just the components which carry meaning thereby easing the |
6 | | -process of document identification and scraping by converting a `.docx` |
7 | | -file into a predictable an *human readable* JSON file. |
| 5 | +`.docx` files to just the components which carry meaning, thereby easing the |
| 6 | +process of pattern matching and data extraction by converting a `.docx` |
| 7 | +file into a predictable and *human readable* JSON file. |
8 | 8 |
|
9 | 9 | Simplifying a complex document down to its *meaningful* parts of course |
10 | 10 | requires taking a position on what does and does not convey meaning in a |
@@ -43,9 +43,13 @@ etc.), you'll need to clone [this fork](https://github.com/jdthorpe/python-docx) |
43 | 43 |
|
44 | 44 | ### General |
45 | 45 |
|
46 | | -* **"friendly-names"**: (*Default = `True`*): Use user-friendly type names |
| 46 | +* **"friendly-name"**: (*Default = `True`*): Use user-friendly type names |
47 | 47 | such as "table-cell", over standard element names like "CT_Tc" |
48 | 48 |
|
| 49 | +* **"merge-consecutive-text"**: (*Default = `True`*): Sentences and even single |
| 50 | + words can be represented by multiple text elements. If `True`, |
| 51 | + concatenate consecutive text elements into a single text element. |
| 52 | + |
49 | 53 | ### Ignoring Invisible things |
50 | 54 |
|
51 | 55 | * **"ignore-empty-paragraphs"**: (*Default = `True`*): Empty paragraphs are |
@@ -147,9 +151,6 @@ often used to divide sections of a document into logical components. |
147 | 151 |
|
148 | 152 | ### Special content |
149 | 153 |
|
150 | | -* **"merge-consecutive-text"**: (*Default = `True`*): Sentences and even single |
151 | | - words can be represented by multiple text elements. If `True`, |
152 | | - concatenate consecutive text elements into a single text element. |
153 | 154 | * **"flatten-hyperlink"**: (*Default = `True`*): Flatten hyperlinks, including |
154 | 155 | their contents in the flow of normal text. |
155 | 156 | * **"flatten-smartTag"**: (*Default = `True`*): Flatten smartTag elements, |
|
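For reviewers, here is a minimal sketch of the behaviour the relocated **"merge-consecutive-text"** option describes: consecutive text elements are concatenated into a single text element. It is not the package's actual implementation. The `merge_consecutive_text` helper, the `"TYPE"`/`"VALUE"` keys, and the `"tab"` element are assumptions made for illustration only.

```python
from typing import Any, Dict, List

# Hypothetical element shape: a dict with a "TYPE" key and, for text
# elements, a string "VALUE". This layout is an assumption, not the
# package's documented schema.
Element = Dict[str, Any]


def merge_consecutive_text(elements: List[Element]) -> List[Element]:
    """Concatenate runs of adjacent text elements into one text element."""
    merged: List[Element] = []
    for element in elements:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous.get("TYPE") == "text"
            and element.get("TYPE") == "text"
        ):
            # Extend the text element already collected.
            previous["VALUE"] += element["VALUE"]
        else:
            # Copy so the caller's input elements are never mutated.
            merged.append(dict(element))
    return merged


# A single word split across several text elements, then a non-text element.
runs = [
    {"TYPE": "text", "VALUE": "Hel"},
    {"TYPE": "text", "VALUE": "lo, "},
    {"TYPE": "text", "VALUE": "world"},
    {"TYPE": "tab"},  # hypothetical non-text element that breaks the run
    {"TYPE": "text", "VALUE": "!"},
]
result = merge_consecutive_text(runs)
```

Only adjacent text elements are joined; any other element type ends the current run, so the text after the hypothetical `"tab"` stays a separate element.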