bugfix: stop ignoring first line of imported email #484
Open
skrobul wants to merge 1 commit into
I stumbled upon a bug where almost all of the restored emails were
corrupted. The affected emails had almost the same text as the originals,
but the formatting was all over the place and some of the words were
badly mangled. HTML tables were broken.
Upon investigation I noticed that the actual body of the email in the
original .eml file and the one downloaded via Google's "Download message"
were practically identical, with the exception of a few headers. One of
those headers was `Content-Transfer-Encoding`, which happened to be the
very first line of each corrupted email.
Example diff:
```
$ diff docker_meetup_email_after_gyb_restore.eml original_docker_email.eml
1,11c1
< Authentication-Results: mx.google.com;
< dkim=neutral (body hash did not verify) header.i=@meetup.com header.s=s1 header.b="Ivpq/sFe"
< X-Google-Smtp-Source: AGHT+IGEy4ty3doaMTjFqiOkSsSCpk9NLEy/NCs28XMnDJUGPy5CZ54yo2foi5usb9P4cI1hNo4Fqzyh56Lj5OK1xw==
< Received: from 777146845227
< named unknown
< by gmailapi.google.com
< with HTTPREST;
< Fri, 1 Nov 2024 20:03:11 +0000
---
> Content-Transfer-Encoding: quoted-printable
46a37
> X-Google-Smtp-Source: AAOMgpen7PSnhPReh9WOrpUPOxq9IhkBBjd6pokoxWeGNf9xtEIQtwHrvjIF7wax5u3067qhdJYI
$
```
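To illustrate why losing that one header mangles the rendered message, here is a minimal sketch using Python's standard `email` package. The sample message and the `decoded_body` helper are made up for this illustration and are not taken from the affected emails or from GYB itself.

```python
from email import message_from_string

# Hypothetical sample: a tiny quoted-printable HTML message whose first
# line is the Content-Transfer-Encoding header, as in the corrupted emails.
RAW = (
    "Content-Transfer-Encoding: quoted-printable\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Subject: demo\n"
    "\n"
    '<td width=3D"100">caf=C3=A9</td>\n'
)


def decoded_body(raw: str) -> str:
    """Decode the body according to whatever headers are present."""
    msg = message_from_string(raw)
    payload = msg.get_payload(decode=True)  # honours Content-Transfer-Encoding
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset)


# With the header present, the quoted-printable escapes are decoded.
intact = decoded_body(RAW)

# With the first line dropped, the escapes are left in the body as-is,
# which is what breaks attributes such as width=3D"100" in HTML tables.
stripped = decoded_body(RAW.split("\n", 1)[1])
```

Without the header, a mail client has no reason to decode `=3D` or `=C3=A9`, so the raw escapes end up in the displayed HTML.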
After looking into the source code of `fmbox.py` I noticed that the
constructor of `class fmbox()` advances `self._file` when
initialising `_last_from_line` but does not rewind it, which
effectively produces a message stripped of its first line.
Presumably this is not a problem when a message starts with a `From`
line, but it is when it starts with anything else.
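The behaviour described above can be reproduced in isolation. The sketch below is a simplified stand-in, not the actual `fmbox.py` code: the `EagerReader` class, its `rewind` flag, and `read_message` are hypothetical names used only to show how a `readline()` in a constructor consumes the first line unless the file position is reset.

```python
import io


class EagerReader:
    """Hypothetical stand-in for the pattern described above (not fmbox)."""

    def __init__(self, fileobj, rewind=False):
        self._file = fileobj
        # Reading here moves the file position past the first line.
        self._last_from_line = self._file.readline()
        if rewind:
            # Assumed fix: go back to the start so nothing is lost.
            self._file.seek(0)

    def read_message(self):
        # Returns everything from the current file position onwards.
        return self._file.read()


EML = b"Content-Transfer-Encoding: quoted-printable\nSubject: demo\n\nbody\n"

# Without rewinding, the first header never reaches the caller.
broken = EagerReader(io.BytesIO(EML)).read_message()

# With rewinding, the message is returned intact.
fixed = EagerReader(io.BytesIO(EML), rewind=True).read_message()
```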
At this point I am not sure whether this is provider-specific, but
for context, my .eml files were created by the Proton Mail export
tool. The same emails were imported from Google Takeout to Proton a few
years earlier, if that matters.
I have tested the fix by importing about 500 messages, and they all
display correctly.
Member
Hmmm... So this is a difference between mbox files, where the first line is the From delimiter (see https://en.wikipedia.org/wiki/Mbox), and .eml files, where the first line is an email header. The proper fix here would be to examine that first line and, if it's actually "From " (notice no :), remove that first line.
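A rough sketch of that suggestion, assuming the input is a binary file object positioned at the start of a single message. The function name `read_message_bytes` and the sample inputs are hypothetical and only illustrate the check; they are not part of GYB.

```python
import io


def read_message_bytes(fileobj):
    """Return the message, dropping a leading mbox "From " delimiter if any."""
    start = fileobj.tell()
    first_line = fileobj.readline()
    # An mbox delimiter is "From " followed by a space-separated envelope;
    # a "From:" header has a colon instead, so it does not match here.
    if not first_line.startswith(b"From "):
        # Not a delimiter: put the line back by restoring the position.
        fileobj.seek(start)
    return fileobj.read()


MBOX_STYLE = (
    b"From sender@example.com Fri Nov  1 20:03:11 2024\n"
    b"Subject: demo\n\nbody\n"
)
EML_STYLE = b"From: sender@example.com\nSubject: demo\n\nbody\n"

from_mbox = read_message_bytes(io.BytesIO(MBOX_STYLE))
from_eml = read_message_bytes(io.BytesIO(EML_STYLE))
```

The delimiter line is removed from the mbox-style input, while the `.eml`-style input keeps its `From:` header and every other first-line header such as `Content-Transfer-Encoding`.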
This is also likely related to the problem @infovations has seen in #148 as well as #157