I'm curious as to why you choose to break out specific headers in the schema.
For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.
If it's performance related, you can still have headers as a single json blob and then use generated columns for the specific fields.
For example
CREATE TABLE IF NOT EXISTS "messages" (
"id" INTEGER NOT NULL PRIMARY KEY, -- internal id
"message_id" TEXT NOT NULL, -- Gmail message id
"thread_id" TEXT NOT NULL, -- Gmail thread id
"headers" JSON NOT NULL, -- JSON object of { "header": value },
"subject" TEXT GENERATED ALWAYS AS (json_extract("headers", '$.Subject')) VIRTUAL NOT NULL)
...
);
CREATE INDEX subjectidx on messages(subject);
I've found this model really powerful, as it allows users to just alter table to add indexed generated columns as they need for their specific queries. For example, if I wanted to query dkim status, it's as simple as
ALTER TABLE messages ADD dkim TEXT GENERATED ALWAYS AS (json_extract("headers", '$."Dkim-Signature"')) VIRTUAL NOT NULL);
CREATE INDEX dkimidx on messages(dkim);
SELECT dkim, COUNT(0) FROM messages GROUP BY dkim;
Hey this is really neat! It's like those disk usage visualizers, except that it seems to focus on the total volume of the mail rather than the disk usage.
Is there a size option too? To see which senders are using most of my storage.
(Also your website's SSL certificate has expired.)
No currently not. It would be easy to add though. I haven't updated the tool in a while (after using it to clean up my Gmail inbox). Thanks for pointing out the certificate!
I really lament that you cannot sign in even with an application specific password any more and you need to get an oauth client and go through an oauth flow. It’s my email, but Google takes away an open standard even for myself to access it.
Given the amount of spam I receive on my free Gmail addresses (compared to my paid for freelance one), and the amount of spam I receive from Gmail servers on my non Gmail-E-Mail accounts I get more and more inclined towards degoogling myself.
Especially as I receive more and more information that my freelance e-mail is put into spam by recipient systems.
Not sure how to get rid of my Google ecosystem routines, though. Feels daunting.
FWIW, for several years I've tried backuping my gmail account with imap (including some stuff made specifically for gmail): It never succeeded. The best syncer were running for one month, and after one month it hit some mails that it simply couldn't retrieve? Like I guess it was in too cold storage and timeout-ed? I don't know.
So I can understand why using Google's proprietary API might work better (or not, I don't know)
Anyway, as a sibling says, nowadays Google Takeout includes mbox and work properly (and is pretty fast, like half a day), but doesn't allow continuous update.
And I migrated to another mail provider (infomaniak), and I've thanked myself for using my own mail domain name years earlier.
This isn't exactly what you're asking for, but Google offers a service called Takeout that lets you request and download backups of all your data from their services, including Gmail.
I have a reminder to trigger this every few months and update my local backup. If I recall it comes as a gzipped mbox file.
* Google collects vast amounts of personal data, specifically through receiving all of your email and analyizing it.
* It builds elaborate user profiles and uses them to target you with ads designed to better influence you.
* Its hold on information (from different sources) has made it excessively powerful economically, and thus also politically.
* Google/Alphabet has long started to affect legislation, including through direct registered lobbying: ~15 Million USD in 2024 (opensecrets.org).
* It has been known to pass, and likely still passes, the information it collects - including copies of your email correspondence - on to the US government (Edward Snowden leaks).
and finally:
* There are multiple email providers, many of them quite good - both for pay and gratis. Naturally most of the gratis ones have their own interests, but nothing like Google.
This should be seen as a encouraging to switch to something else rather than defeatist. Many of my communications do not touch Google services, professionally it has been judged as too risky, personally I keep a google account but also others.
Edit: You can create groups of people that are not affected by Google/Apple/Facebook, this should be seen as a goal.
I'm curious as to why you choose to break out specific headers in the schema.
For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.
If it's performance related, you can still have headers as a single json blob and then use generated columns for the specific fields.
For example
I've found this model really powerful, as it allows users to just alter table to add indexed generated columns as they need for their specific queries. For example, if I wanted to query dkim status, it's as simple as or whatever you want.Note that you don't actually need the generated column either, SQLite supports indexes on expressions, so you can do, for example,
and it will use this index anywhere you reference that expression.I find it useful to create indexes like this, then create VIEWs using these expressions instead of ALTER'ing the main table with generated columns.
TIL, thanks a lot!
You can also create indices directly on expressions, including json_extract etc.
I see that you defined the `dkim` column as NOT NULL. So what happens when an email message does not contain the Dkim-Signature header?
Probably something like
because, unlike MySQL, SQLite apparently returns SQL NULL for JSON null value.As hun3 said, it would throw an error. My mistake in the quick example I put together. I just noticed an unbalanced () as well. Whoops.
I build something to visualize huge amounts of email (such as from Gmail) some years ago:
https://github.com/terhechte/postsack
Looks interesting, the link to gmvault in your readme is now a dead end, is this it https://github.com/gaubert/gmvault. Thanks!
Hey this is really neat! It's like those disk usage visualizers, except that it seems to focus on the total volume of the mail rather than the disk usage.
Is there a size option too? To see which senders are using most of my storage.
(Also your website's SSL certificate has expired.)
No currently not. It would be easy to add though. I haven't updated the tool in a while (after using it to clean up my Gmail inbox). Thanks for pointing out the certificate!
I really lament that you cannot sign in even with an application specific password any more and you need to get an oauth client and go through an oauth flow. It’s my email, but Google takes away an open standard even for myself to access it.
Given the amount of spam I receive on my free Gmail addresses (compared to my paid for freelance one), and the amount of spam I receive from Gmail servers on my non Gmail-E-Mail accounts I get more and more inclined towards degoogling myself.
Especially as I receive more and more information that my freelance e-mail is put into spam by recipient systems.
Not sure how to get rid of my Google ecosystem routines, though. Feels daunting.
step 1: extract data step 2: just dont use google shit anymore. Deal with it.
you dont get it done by moping about it, but by doing
Shouldn't this be "imap to sqlite" or something? Why tie it to one specific email provider?
FWIW, for several years I've tried backuping my gmail account with imap (including some stuff made specifically for gmail): It never succeeded. The best syncer were running for one month, and after one month it hit some mails that it simply couldn't retrieve? Like I guess it was in too cold storage and timeout-ed? I don't know.
So I can understand why using Google's proprietary API might work better (or not, I don't know)
Anyway, as a sibling says, nowadays Google Takeout includes mbox and work properly (and is pretty fast, like half a day), but doesn't allow continuous update.
And I migrated to another mail provider (infomaniak), and I've thanked myself for using my own mail domain name years earlier.
Because _it is_ specific to Gmail. It's using OAuth and presumable API access.
IMAP is much harder, and much slower, and is bound by Google's bandwidth limits.
Doing a mbox export with Google Takeout from gmail is pretty fast.
Would be nice to enable fulltext search as well
Yes! I find gmail’s full text search surprisingly bad given it’s run by a search company.
But not as bad as Outlook 365's search...
Outlook must be the worst email client there is. Something about least common denominator.
Boot as bad as the Mail.app from iOS and macOS
Agreed! One of the reasons we started working on Marco.
https://marcoapp.io
What's the best open-source GMail backup software that exists? Someone has setup something like that? (also archiving attachments, etc)
https://github.com/GAM-team/got-your-back
- Open source
- Resume (so backups/restores will eventually complete)
Honorable mention: https://www.mailstore.com/en/products/mailstore-home/
- Not open source
- GUI with index: nice for searching mail locally
- Resume only for backup (so large restores generally fail)
This isn't exactly what you're asking for, but Google offers a service called Takeout that lets you request and download backups of all your data from their services, including Gmail.
I have a reminder to trigger this every few months and update my local backup. If I recall it comes as a gzipped mbox file.
I would have preferred a script that parses the mail backup Google sends you.
I think it's a big eml file.
Google Takeout regularly fails to complete for me. Syncing via the API seems like a reasonable alternative.
this is great if only there was a tool for whatsapp to sqlite it would make my data so much more useful
https://medium.com/@Med1um1/extracting-whatsapp-messages-fro...
Would love a comparison to gbackup-rs[0].
To me having to install a tool through Python is a show-stopper.
[0] https://github.com/djipko/gbackup-rs
Let us stop using GMail:
* Google collects vast amounts of personal data, specifically through receiving all of your email and analyizing it.
* It builds elaborate user profiles and uses them to target you with ads designed to better influence you.
* Its hold on information (from different sources) has made it excessively powerful economically, and thus also politically.
* Google/Alphabet has long started to affect legislation, including through direct registered lobbying: ~15 Million USD in 2024 (opensecrets.org).
* It has been known to pass, and likely still passes, the information it collects - including copies of your email correspondence - on to the US government (Edward Snowden leaks).
and finally:
* There are multiple email providers, many of them quite good - both for pay and gratis. Naturally most of the gratis ones have their own interests, but nothing like Google.
https://mako.cc/copyrighteous/google-has-most-of-my-email-be...
This should be seen as a encouraging to switch to something else rather than defeatist. Many of my communications do not touch Google services, professionally it has been judged as too risky, personally I keep a google account but also others.
Edit: You can create groups of people that are not affected by Google/Apple/Facebook, this should be seen as a goal.
I was with you from day one and never started using gmail.