nbehrnd/pdf_rewriter

Background

Depending on the pdf creator/engine used, a .pdf file may include content of little relevance for a human reader. Examples include bitmap illustrations with a resolution greater than 300 dpi, or color images where an illustration in grayscale is good enough. Reprinting the .pdf as a pdf with ghostscript may address this and yield a smaller file that is easier to store or to transfer (for instance, attached to an email). Text-only files printed into postscript (e.g., with a2ps [1]) and then distilled into a pdf can equally benefit from such an optimization.

This bash script does not claim to be the first to collect the nuts and bolts to address the issue. Rather, it serves as an aide-mémoire of finds encountered earlier, and as a means to steer ghostscript in Linux accordingly. Within reason, the snippets were joined as provided; thus, the credit belongs to those already in the field.
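For orientation, a color-retaining reprint with ghostscript can be sketched as below. This is a minimal sketch and not necessarily the command line pdf_rewrite.sh issues; the function name gs_reprint_cmd, the file names, and the choice of the /ebook preset for -dPDFSETTINGS are assumptions for illustration. The function only prints the command, so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: print a ghostscript command which reprints a pdf as a pdf.
# Assumption: the /ebook preset; the script itself may use other settings.
gs_reprint_cmd() {
    input="$1"
    output="$2"
    printf 'gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="%s" "%s"\n' "$output" "$input"
}

# Print the command for a hypothetical pair of file names.
gs_reprint_cmd input.pdf output.pdf
```

Writing to a separate output file, rather than directly over the input, leaves the original untouched until the result has been checked.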

Intended Use

To use the script, add the executable bit (chmod +x pdf_rewrite.sh) and access its functionality through an alias. Debian's default shell is bash; thus, the alias can be set in file /etc/bash.bashrc for any user, in file ~/.bashrc for the current user, or in file /etc/skel/.bashrc as a template for any (new) user's own file ~/.bashrc, in a pattern of

alias pdf_rewrite="/path/to/pdf_rewrite.sh"
  • Presuming the alias set for pdf_rewrite.sh is pdf_rewrite, run any one of the following commands to reprint the .pdf while retaining the color:

    pdf_rewrite --reprint input.pdf
    pdf_rewrite -r input.pdf
    pdf_rewrite --colour input.pdf
    pdf_rewrite --color input.pdf
    pdf_rewrite -c input.pdf

    If the reprint is smaller in size than the original file, the reprint replaces the original file. In addition, a brief note states the percentage of the savings achieved. If wanted, you may repeat the reprint; eventually, either the savings are insignificant in comparison to the remaining file size, or the script itself will report no change and thus retain the original file.

    The credit for the underlying approach and implementation belongs to Evan Langlois [2].

  • Often, a reprint in grayscale is sufficient. Use any one of the following commands

    pdf_rewrite --gray input.pdf
    pdf_rewrite --grey input.pdf
    pdf_rewrite -g input.pdf

    to overwrite file input.pdf accordingly. The credit for this approach belongs to user slm on the Unix stackexchange [3].

  • To process multiple .pdf, you may consider a for loop as in

    for file in *.pdf
    do
        echo "$file"
        bash ./pdf_rewrite.sh -r "$file"
    done

    This approach also provides a brief progress report.
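The rule "keep the reprint only if it is smaller, and report the savings" can be sketched as follows. This is an illustration under assumptions, not the code of pdf_rewrite.sh: the helper names file_size and replace_if_smaller are hypothetical, sizes are read in bytes with wc -c, and the percentage is truncated to an integer.

```shell
#!/bin/sh
# Sketch: replace the original by the reprint only if the reprint is smaller.
# Hypothetical helper names; not taken from pdf_rewrite.sh.
file_size() {
    # size in bytes; tr strips the padding some wc implementations add
    wc -c < "$1" | tr -d ' '
}

replace_if_smaller() {
    original="$1"
    reprint="$2"
    old=$(file_size "$original")
    new=$(file_size "$reprint")
    if [ "$new" -lt "$old" ]; then
        # integer percentage of the bytes saved
        saved=$(( (old - new) * 100 / old ))
        mv -- "$reprint" "$original"
        echo "saved ${saved} % (${old} -> ${new} bytes)"
    else
        # no gain: discard the reprint, retain the original
        rm -f -- "$reprint"
        echo "no change, original file retained"
    fi
}
```

Called as replace_if_smaller input.pdf reprint.pdf, the function either moves the reprint over the original or removes the reprint.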

Note that a reprint in grayscale may render illustrations less intelligible. When preparing a document, a service like https://colorbrewer2.org/ may guide your selection of color palettes suitable for this kind of "photocopying", or help identify a palette safe for the color blind.
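To preview what a grayscale reprint does to the illustrations without touching the original, ghostscript can write to a separate file. The sketch below only prints such a command; the function name gs_gray_cmd and the file names are assumptions, and the options shown are the commonly cited ones for a gray conversion with the pdfwrite device, which may differ from what pdf_rewrite.sh passes.

```shell
#!/bin/sh
# Sketch: print a ghostscript command for a grayscale reprint.
# Assumption: options may differ from those used by pdf_rewrite.sh.
gs_gray_cmd() {
    input="$1"
    output="$2"
    printf 'gs -sDEVICE=pdfwrite -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -sOutputFile="%s" "%s"\n' "$output" "$input"
}

# Print the command for a hypothetical pair of file names.
gs_gray_cmd input.pdf input_gray.pdf
```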

Keep a backup of the .pdf to be processed. The script may report problems while processing the data (or even crash, which may destroy the .pdf), but it is not a PDF validator such as veraPDF [4].
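One way to keep such a backup is a small helper run before the reprint. This is a minimal sketch; the function name backup_pdf and the suffix .bak are assumptions and not part of the script.

```shell
#!/bin/sh
# Sketch: copy a pdf to <name>.bak before processing it.
# Hypothetical helper; refuses to overwrite an already existing backup.
backup_pdf() {
    src="$1"
    if [ -e "${src}.bak" ]; then
        echo "backup ${src}.bak already exists, nothing copied" >&2
        return 1
    fi
    # -p retains the time stamps and permissions of the original
    cp -p -- "$src" "${src}.bak"
}

# Usage inside the loop shown earlier (not executed here):
#   backup_pdf "$file" && bash ./pdf_rewrite.sh -r "$file"
```

Refusing to overwrite an existing backup matters when the reprint is repeated, since a second run would otherwise replace the copy of the true original by an already processed file.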

Benchmark

Initially written for Linux Xubuntu 18.04.3 LTS and ghostscript (version 9.26), the script is known to work well with, for instance, Debian 13/trixie and GPL Ghostscript (version 10.05.1).

  • File link2web.pdf included in the project was compiled with pdfLaTeX based on an example provided by www.texample.net. This .pdf contains a color figure and a link to an external reference. Note that the simplification into half-tones (option -g) affects the document as printed; depending on the pdf viewer used, the highlighting box around the link may remain colored when displayed on screen.

  • The performance of the utility was tested on a set of recent publications in chemistry with gs (10.07.0) as currently provided in Debian 14/forky (currently branch testing). To ease replication, only open access publications were used for the benchmark.

The table below compares the file size before and after processing with either option, after a single run of optimization.

| journal                | publisher | original | with -r | saved % | with -g | saved % |
|------------------------|-----------|----------|---------|---------|---------|---------|
| 2023ACR3654            | ACS       | 2.8 MB   | 1.6 MB  | 42.9    | 1.6 MB  | 42.9    |
| 2026ACR1414            | ACS       | 9.1 MB   | 1.9 MB  | 79.1    | 1.8 MB  | 80.2    |
| 2023CrystGrowthDes8469 | ACS       | 3.7 MB   | 0.7 MB  | 81.1    | 0.7 MB  | 81.1    |
| 2026CrystGrowthDes2939 | ACS       | 8.2 MB   | 2.0 MB  | 75.6    | 1.8 MB  | 78.0    |
| 2023CRV13291           | ACS       | 25.5 MB  | 3.5 MB  | 86.3    | 3.3 MB  | 87.1    |
| 2026CRV4375            | ACS       | 12.6 MB  | 2.6 MB  | 79.4    | 2.5 MB  | 80.2    |
| 2023JCE4728            | ACS       | 3.1 MB   | 1.0 MB  | 67.7    | 1.0 MB  | 67.7    |
| 2026JCE1723            | ACS       | 3.2 MB   | 1.2 MB  | 62.5    | 1.2 MB  | 62.5    |
| 2023JOC16719           | ACS       | 9.9 MB   | 2.3 MB  | 76.8    | 2.0 MB  | 79.8    |
| 2026JOC5520            | ACS       | 5.4 MB   | 2.0 MB  | 63.0    | 1.8 MB  | 66.7    |
| 2023OL9243             | ACS       | 2.2 MB   | 1.4 MB  | 36.4    | 1.3 MB  | 40.9    |
| 2026OL5021             | ACS       | 2.5 MB   | 1.4 MB  | 44.0    | 1.4 MB  | 44.0    |
| 2026Tetrahedron135290  | Elsevier  | 2.4 MB   | 2.2 MB  | 8.3     | 1.4 MB  | 41.7    |
| 2026Tetrahedron135286  | Elsevier  | 1.1 MB   | 1.0 MB  | 9.1     | 0.9 MB  | 18.2    |
| 2026TL156046           | Elsevier  | 498 kB   | 374 kB  | 24.9    | 286 kB  | 42.6    |
| 2026TL156057           | Elsevier  | 2.1 MB   | 2.0 MB  | 4.8     | 0.5 MB  | 76.2    |
| 2024PCCP770            | RSC       | 2.3 MB   | 1.3 MB  | 43.5    | 0.9 MB  | 60.9    |
| 2026PCCP9840           | RSC       | 5.3 MB   | 1.9 MB  | 64.2    | 1.3 MB  | 75.5    |
| 2024TheorChemAcc4      | Springer  | 1.8 MB   | 0.7 MB  | 61.1    | 0.7 MB  | 61.1    |
| 2026TheorChemAcc25     | Springer  | 1.5 MB   | 0.9 MB  | 40.0    | 0.9 MB  | 40.0    |
| 2023Synthesis3777      | Thieme    | 976 kB   | 949 kB  | 2.8     | 540 kB  | 44.7    |
| 2026Synthesis910       | Thieme    | 759 kB   | 438 kB  | 42.3    | 436 kB  | 42.6    |
| 2024ACIEe202314446     | Wiley     | 2.5 MB   | 1.3 MB  | 48.0    | 0.8 MB  | 68.0    |
| 2026ACIEe26144         | Wiley     | 5.3 MB   | 4.0 MB  | 24.5    | 2.9 MB  | 45.3    |
| 2023HCAe202300110      | Wiley     | 10.4 MB  | 1.2 MB  | 88.5    | 1.2 MB  | 88.5    |
| 2026HCA0:e00224        | Wiley     | 4.1 MB   | 1.3 MB  | 68.3    | 1.3 MB  | 68.3    |
| 2023JApplCryst1639     | Wiley     | 1.1 MB   | 0.7 MB  | 36.4    | 0.7 MB  | 36.4    |
| 2026JApplCryst291      | Wiley     | 7.3 MB   | 1.3 MB  | 82.2    | 0.7 MB  | 90.4    |
| 2026BJoc672            | Beilstein | 1.6 MB   | 0.5 MB  | 68.8    | 0.3 MB  | 81.2    |
| 2026BJoc620            | Beilstein | 461 kB   | 461 kB  | 0.0     | 350 kB  | 24.1    |
| 2026Molecules1499      | MDPI      | 771 kB   | 558 kB  | 27.6    | 328 kB  | 57.5    |
| 2026Molecules1495      | MDPI      | 2.8 MB   | 0.8 MB  | 71.4    | 0.6 MB  | 78.6    |
| 2026JOSS0825           | JOSS      | 213 kB   | 84 kB   | 60.6    | 84 kB   | 60.6    |
| 2026JOSS09890          | JOSS      | 921 kB   | 392 kB  | 57.4    | 246 kB  | 73.3    |
| arXiv:2605.00564v1     | arxiv     | 3.1 MB   | 1.4 MB  | 54.8    | 0.9 MB  | 71.0    |
| arXiv:2605.00149v1     | arxiv     | 371 kB   | 232 kB  | 37.5    | 232 kB  | 37.5    |
| link2web.pdf           | pdflatex  | 38.9 kB  | 10.7 kB | 72.5    | 10.6 kB | 72.8    |
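The column "saved %" is consistent with the relative reduction (original - processed) / original × 100, rounded to one decimal. The sketch below recomputes two entries of the table; the function name saved_percent is an assumption, and both sizes must be given in the same unit.

```shell
#!/bin/sh
# Sketch: relative file size reduction in percent, one decimal.
# Hypothetical helper; both arguments must share the same unit.
saved_percent() {
    LC_ALL=C awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

saved_percent 9.1 1.9    # 2026ACR1414 with -r, prints 79.1
saved_percent 498 286    # 2026TL156046 with -g, prints 42.6
```

Because the table lists rounded sizes, a value recomputed this way may differ in the last digit from one obtained from the exact byte counts.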

Disclaimer

As inspected with the utilities pdfinfo and exiftool, the rewrite overwrites pdf metadata such as Producer (which can be an entry like LaTeX with hyperref), CreationDate, and ModDate, while others are lost for good. Since metadata such as TITLE, SUBJECT, KEYWORDS, and AUTHOR are typically retained, a reference manager like zotero (tested with version 9.0.1) can still collect complementary bibliographic metadata.
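If a copy of the original was kept, the metadata changes can be checked by comparing the output of pdfinfo for both files. This is a minimal sketch; the function name metadata_diff and the file names are assumptions, and pdfinfo must be installed.

```shell
#!/bin/sh
# Sketch: show the pdfinfo lines which differ between two pdf files.
# Hypothetical helper; returns the exit status of diff (1 if they differ).
metadata_diff() {
    before=$(mktemp) || return 2
    after=$(mktemp) || { rm -f "$before"; return 2; }
    pdfinfo "$1" > "$before"
    pdfinfo "$2" > "$after"
    diff "$before" "$after"
    status=$?
    rm -f "$before" "$after"
    return "$status"
}

# Usage (not executed here), with a backup named input.pdf.bak:
#   metadata_diff input.pdf.bak input.pdf
```

Lines prefixed by < stem from the first file, lines prefixed by > from the second. Entries such as file size naturally differ as well.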

A conversion to grayscale is more likely to succeed if the pdf of interest is converted directly. This seems especially the case if the document includes ligatures (like fl, ae, oe) or umlauts in the Latin script; if the (intermediate) color-retaining reprint failed to properly define these, a subsequent reprint constrained to grayscale may yield a gap. This issue depends both on the version of ghostscript installed and on the font / pdf engine of the pdf to be processed; recent journal publications (e.g., by ACS, a member of the STIX project [5]) tend to be less frequently affected by this. This pdf reprinter is not tested on pdf documents predominantly written in scripts other than Latin.

Footnotes

  1. https://www.gnu.org/software/a2ps/

  2. https://tex.stackexchange.com/questions/18987/how-to-make-the-pdfs-produced-by-pdflatex-smaller?rq=1

  3. https://unix.stackexchange.com/questions/93959/how-to-convert-a-color-pdf-to-black-white

  4. https://openpreservation.org/tools/verapdf/

  5. https://en.wikipedia.org/wiki/STIX_Fonts_project

About

Optimize .pdf by «reprint to pdf» with ghostscript in color or gray; if present, text layer, internal crosslinks (e.g., TOC) and hyperlinks (e.g., to websites) may be preserved.
