Dovecot Sieve for Pre-Fetching and Inlining Images
I’ve been self-hosting email for at least three years now, and since the start I’ve wanted one particular feature. Many may remember when, years ago, Google announced that Gmail would start prefetching images and serving them from its own servers. The feature was announced as privacy protecting, but in classic Google fashion it really just meant they wanted to increase the value of your data by preventing others from also tracking it. Google’s awful privacy track record aside, it’s still a pretty good idea. I wanted that feature. I wanted Dovecot to automagically fetch remote images to a local server and rewrite the URLs. That approach had a couple of downsides: running a server, and a small risk of someone randomly guessing image URLs. What I ended up settling on instead of serving them from my own server was to simply inline the images with base64 encoding, attached as a multi-part section.
Sieves
I’ve checked out Sieves a couple of times throughout the years, particularly trying to solve this problem, but couldn’t get them to work properly. I believe this was due to a misconfiguration in my mail server setup, which seems to have self-corrected during an update or rebuild.
Sieve scripts are pretty neat and allow you to do all sorts of customizations, and I am excited to start using them more. For this particular problem, though, the bulk is written in Python, with only a ‘one-liner’ Sieve script to invoke it.
require ["vnd.dovecot.filter"];
filter "image_stripper.py";
NOTE: I personally found the naming and documentation for the Dovecot extprograms confusing. I would have expected filter to work purely on accept/reject and pipe to alter messages. I still never quite figured out what pipe was for, but whatever content is returned from a filter call will replace the contents of the email. This includes all of the email headers.
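To make the note above concrete, here is a minimal sketch of a filter program, assuming the message arrives on standard input and that whatever is written to standard output replaces it. The function names and the X-Image-Filter header are hypothetical and only for illustration; this is not the actual image_stripper.py.

```python
import io
import sys
from email import message_from_bytes, policy


def filter_message(raw: bytes) -> bytes:
    """Parse the raw message and return the bytes that should replace it."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Hypothetical marker header, only here to make the change visible.
    # The output replaces the whole message, so headers are re-emitted too.
    msg["X-Image-Filter"] = "passed"
    return msg.as_bytes()


def main(stdin=None, stdout=None):
    """Read the message from stdin and write the replacement to stdout."""
    if stdin is None:
        stdin = sys.stdin.buffer
    if stdout is None:
        stdout = sys.stdout.buffer
    stdout.write(filter_message(stdin.read()))


# Demonstration with in-memory streams; an installed script would call main().
demo_in = io.BytesIO(b"From: a@example.com\nSubject: hi\n\nbody\n")
demo_out = io.BytesIO()
main(demo_in, demo_out)
```

Because the output replaces the entire message, a filter that prints only a modified body would drop every header, which is why the sketch serializes the full message object.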
The script uses only built-in libraries, starting with email to parse the incoming message. From the parsed message we need a couple of important things: the message body and the domain of the From address. The body gets rewritten at the end, but it also gets cleared out early on. The body string is passed to the built-in html parser library, which is used to find all the image tags. If a tag's source starts with http, the image is fetched with urllib.request.
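A rough sketch of the tag-finding and fetching steps described above, using only html.parser and urllib.request. The class and function names are made up for illustration, and hashing the URL to produce an identifier is an assumption; the original script may derive its hashes differently.

```python
import hashlib
import urllib.request
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag whose URL starts with http."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags also end up here by default.
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if src.startswith("http"):
            self.urls.append(src)


def fetch_image(url, timeout=10):
    """Return (content_id, data, content_type) for one image URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
        content_type = resp.headers.get_content_type()
    # Assumption: a short hash of the URL serves as the identifier.
    content_id = hashlib.sha256(url.encode()).hexdigest()[:16]
    return content_id, data, content_type


collector = ImageCollector()
collector.feed(
    '<p><img src="https://example.com/a.png">'
    '<img src="cid:already-inline"></p>'
)
print(collector.urls)
```

Images that are already inline (cid: or data: sources) are skipped by the startswith check, so only remote images are queued for fetching.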
When the html parsing is done, we’re left with a list of hashes, image URLs, the images themselves and their content types. Then the main loop can process these, creating a new MIMEPart for each image in base64 encoding with a Content-ID matching the one created during fetching. That’s much of the work. I’ve been running this off and on for a week now, and for the most part it seems to be working. I have seen a particular set of glitchy messages, but as it’s Craigslist saved searches... yeah, that could be anything. More testing to come.
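For reference, the main loop described above might look roughly like the sketch below. It assumes a single-part HTML message and a list of (content_id, url, data, content_type) tuples from the fetch step; the inline_images name and the tuple layout are hypothetical, and real mail (multipart/alternative and so on) would need more care.

```python
from email.message import EmailMessage, MIMEPart


def inline_images(msg, html, images):
    """Rewrite image URLs to cid: references and attach the image data."""
    for content_id, url, _data, _ctype in images:
        html = html.replace(url, f"cid:{content_id}")
    # set_content clears the old body before writing the rewritten HTML.
    msg.set_content(html, subtype="html")
    # Wrap the HTML in multipart/related so the images can sit beside it.
    msg.make_related()
    for content_id, _url, data, content_type in images:
        maintype, subtype = content_type.split("/", 1)
        part = MIMEPart()
        # Bytes content is base64 encoded by default.
        part.set_content(
            data,
            maintype=maintype,
            subtype=subtype,
            cid=f"<{content_id}>",
            disposition="inline",
        )
        msg.attach(part)
    return msg


msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["Subject"] = "demo"
html = '<p><img src="https://example.com/a.gif"></p>'
images = [("0123abcd", "https://example.com/a.gif", b"GIF89a", "image/gif")]
inline_images(msg, html, images)
```

The Content-ID header carries angle brackets while the cid: URL in the HTML does not, which is the usual way the two are matched up by mail clients.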