Commit 4606c41

Author: Jason Thorpe
Parents: e690e78 + 0a1f432

3 files changed

Lines changed: 25 additions & 24 deletions

README.md

Lines changed: 8 additions & 7 deletions
@@ -2,9 +2,9 @@
 
 DOCX files are complex, and their complexity makes scraping documents
 for their content difficult. The aim of this package is to simplify
-`.docx` files to just the components which carry meaning thereby easing the
-process of document identification and scraping by converting a `.docx`
-file into a predictable an *human readable* JSON file.
+`.docx` files to just the components which carry meaning, thereby easing the
+process of pattern matching and data extraction by converting a `.docx`
+file into a predictable and *human readable* JSON file.
 
 Simplifying a complex document down to it's *meaningful* parts of course
 requires taking a position on what does and does-not convey meaning in a
@@ -43,9 +43,13 @@ etc.), you'll need to clone [this fork](https://github.com/jdthorpe/python-docx)
 
 ### General
 
-* **"friendly-names"**: (*Default = `True`*): Use user-friendly type names
+* **"friendly-name"**: (*Default = `True`*): Use user-friendly type names
 such as "table-cell", over standard element names like "CT_Tc"
 
+* **"merge-consecutive-text"**: (*Default = `True`*): Sentences and even single
+words can be represented by multiple text elements. If `True`,
+concatenate consecutive text elements into a single text element.
+
 ### Ignoring Invisible things
 
 * **"ignore-empty-paragraphs"**: (*Default = `True`*): Empty paragraphs are
@@ -147,9 +151,6 @@ often used to divide sections of a document into logical components.
 
 ### Special content
 
-* **"merge-consecutive-text"**: (*Default = `True`*): Sentences and even single
-words can be represented by multiple text elements. If `True`,
-concatenate consecutive text elements into a single text element.
 * **"flatten-hyperlink"**: (*Default = `True`*): Flatten hyperlinks, including
 their contents in the flow of normal text.
 * **"flatten-smartTag"**: (*Default = `True`*): Flatten smartTag elements,
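
The `merge-consecutive-text` option moved in this README diff can be illustrated with a minimal sketch. This is not the package's implementation: the function name `merge_consecutive_text` and the `{"TYPE": ..., "VALUE": ...}` element shape are assumptions chosen for illustration only.

```python
from typing import Any, Dict, List


def merge_consecutive_text(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Concatenate runs of adjacent text elements into single text elements.

    The {"TYPE": ..., "VALUE": ...} element shape is an assumption made for
    this sketch, not a documented contract of the package.
    """
    merged: List[Dict[str, Any]] = []
    for element in elements:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous.get("TYPE") == "text"
            and element.get("TYPE") == "text"
        ):
            # Extend the previous text element instead of appending a new one.
            previous["VALUE"] += element["VALUE"]
        else:
            # Copy so the caller's input elements are left untouched.
            merged.append(dict(element))
    return merged


run = [
    {"TYPE": "text", "VALUE": "Hel"},
    {"TYPE": "text", "VALUE": "lo"},
    {"TYPE": "tab"},
    {"TYPE": "text", "VALUE": "world"},
]
result = merge_consecutive_text(run)
print(result)
```

Only directly adjacent text elements are joined here, so the non-text element in the middle keeps the two pieces of text apart.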

src/simplify_docx/elements/run_contents.py

Lines changed: 14 additions & 14 deletions
@@ -75,23 +75,23 @@ def to_json(
             )
             _value = _value.replace(u"\u201c", '"').replace(u"\u201d", '"')
 
-        if options.get("dumb-hyphens", True):
+        if options.get("dumb-spaces", True):
             _value = (
-                _value.replace(u"\u2000", "-")
-                .replace(u"\u2001", "-")
-                .replace(u"\u2002", "-")
-                .replace(u"\u2003", "-")
-                .replace(u"\u2004", "-")
-                .replace(u"\u2005", "-")
-                .replace(u"\u2006", "-")
-                .replace(u"\u2007", "-")
-                .replace(u"\u2008", "-")
-                .replace(u"\u2009", "-")
-                .replace(u"\u200A", "-")
-                .replace(u"\u201B", "-")
+                _value.replace(u"\u2000", " ")
+                .replace(u"\u2001", " ")
+                .replace(u"\u2002", " ")
+                .replace(u"\u2003", " ")
+                .replace(u"\u2004", " ")
+                .replace(u"\u2005", " ")
+                .replace(u"\u2006", " ")
+                .replace(u"\u2007", " ")
+                .replace(u"\u2008", " ")
+                .replace(u"\u2009", " ")
+                .replace(u"\u200A", " ")
+                .replace(u"\u201B", " ")
             )
 
-        if options.get("dumb-spaces", True):
+        if options.get("dumb-hyphens", True):
             _value = (
                 _value.replace(u"\u2010", "-")
                 .replace(u"\u2011", "-")
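
The effect of this hunk (typographic spaces become plain spaces under `dumb-spaces`, typographic hyphens become `-` under `dumb-hyphens`) can be sketched with `str.translate` in place of the chained `.replace()` calls. This is an illustrative alternative, not the code in the commit: the `normalize` function and the table names are hypothetical, and only code points visible in the hunk are covered. U+201B, which the hunk also maps to a space, is left out here because Unicode defines it as a quotation mark rather than a space.

```python
# Code points taken from the hunk; the option names mirror the diff.
SPACE_CODEPOINTS = range(0x2000, 0x200B)  # U+2000 through U+200A
HYPHEN_CODEPOINTS = (0x2010, 0x2011)      # only those visible in the hunk

# str.translate accepts a dict mapping code points to replacement strings.
SPACE_TABLE = {cp: " " for cp in SPACE_CODEPOINTS}
HYPHEN_TABLE = {cp: "-" for cp in HYPHEN_CODEPOINTS}


def normalize(value: str, options: dict) -> str:
    """Hypothetical helper: apply the two normalizations, each on by default."""
    if options.get("dumb-spaces", True):
        value = value.translate(SPACE_TABLE)
    if options.get("dumb-hyphens", True):
        value = value.translate(HYPHEN_TABLE)
    return value


sample = "a\u2003b\u2011c"  # EM SPACE and NON-BREAKING HYPHEN
print(normalize(sample, {}))
print(normalize(sample, {"dumb-spaces": False}))
```

A translation table makes a single pass over the string per option and keeps the list of code points in one place, which makes a mislabelled option easier to spot.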

src/simplify_docx/utils/paragrapy_style.py

Lines changed: 3 additions & 3 deletions
@@ -6,7 +6,7 @@ def get_pStyle(p, doc):
     """
     Get the referenced style element for a paragraph with a p.pPr.pStyle
     """
-    if p.pPr is not None and \
+    if getattr(p, "pPr", None) is not None and \
             p.pPr.pStyle is not None:
         return doc.styles.element.find("w:style[@w:styleId='%s']" % p.pPr.pStyle.val,
                                        doc.styles.element.nsmap)
@@ -17,7 +17,7 @@ def get_num_style(p, doc):
     """
     The the paragraph's Numbering style
     """
-    if p.pPr is not None \
+    if getattr(p, "pPr", None) is not None \
             and p.pPr.numPr is not None\
             and p.pPr.numPr.numId is not None:
         # the numbering style doc
@@ -47,7 +47,7 @@ def get_paragraph_ind(p, doc):
     * Direct Formatting
     """
 
-    if p.pPr is not None and\
+    if getattr(p, "pPr", None) is not None and\
             p.pPr.ind is not None:
         return p.pPr.ind
 
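
The `getattr(p, "pPr", None)` guard introduced in `paragrapy_style.py` can be exercised with stand-in objects. The sketch below is a simplified, hypothetical variant of `get_paragraph_ind` (it drops the `doc` argument and the style lookups) and uses `SimpleNamespace` in place of real paragraph elements. It shows that an object with no `pPr` attribute at all is now treated the same as one whose `pPr` is `None`.

```python
from types import SimpleNamespace


def get_paragraph_ind(p):
    """Return p.pPr.ind, or None when pPr is absent or unset.

    Simplified for illustration: the function in the diff also takes `doc`.
    """
    pPr = getattr(p, "pPr", None)  # tolerates objects without a pPr attribute
    if pPr is not None and pPr.ind is not None:
        return pPr.ind
    return None


with_ind = SimpleNamespace(pPr=SimpleNamespace(ind="indent"))
ppr_is_none = SimpleNamespace(pPr=None)
no_ppr_attr = SimpleNamespace()  # plain `p.pPr` would raise AttributeError here

for candidate in (with_ind, ppr_is_none, no_ppr_attr):
    print(get_paragraph_ind(candidate))
```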