Joel H. Simplex ([info]njyoder) wrote in [info]computerhelp,
@ 2008-05-12 22:59:00
After fetching all of my Gmail messages via POP3 into an mbox file, I decided to check that the number of messages in the mbox file matches the number Gmail reports.

So I ran this command to count the number of messages:
$ egrep -c "^From \w+@\w+\.\w+ " gmail-archive/gmail-backup.mbox
48858


I have about 11,800 messages in my inbox, according to Gmail, so something is obviously wrong. Either the command I used is wrong, the way I downloaded the mail is wrong, or both. Which is it?
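
One way to check the command itself is to count the separator lines several ways and compare. The sketch below is only an illustration: the function name count_messages is made up, and it assumes Python's standard mailbox module is an acceptable second opinion. It counts lines matching the same pattern as the egrep command, lines that simply start with "From ", and the number of messages mailbox.mbox finds.

```python
import mailbox
import re

# Same pattern as the egrep command: "From " followed by a simple
# address such as user@example.com and then a space.
STRICT_FROM = re.compile(rb"^From \w+@\w+\.\w+ ")


def count_messages(path):
    """Count messages three ways so the numbers can be compared."""
    strict = 0  # lines matching the egrep-style pattern
    loose = 0   # any line that starts with "From "
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b"From "):
                loose += 1
                if STRICT_FROM.match(line):
                    strict += 1
    # mailbox.mbox uses its own rules to find message boundaries.
    box = mailbox.mbox(path, create=False)
    try:
        parsed = len(box)
    finally:
        box.close()
    return {"strict": strict, "loose": loose, "parsed": parsed}
```

If "strict" comes out lower than "loose", the pattern is probably skipping separator lines whose addresses contain dots, hyphens, or extra domain parts, which \w+ does not match.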

Also, the mbox file is 186.6 MiB, while Gmail says that only 165 MB of my account is in use. What's up with that?

I used the getmail program to fetch the email in many batches (Gmail only allows about 1,000 messages to be fetched at a time), if that matters.

ETA[0]: What's a good way to either edit the mbox file or extract just certain messages from it that meet certain criteria? Specifically, I want just the messages within a certain time frame. I can write my own script, but would prefer something simpler.
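
For reference, a "write my own script" version of the time-frame filter could look roughly like the sketch below. It assumes Python's standard mailbox module and filters on each message's Date header (not on the "From " separator line). The names message_date and extract_range are made up for illustration, and messages with a missing or unparsable Date are simply skipped.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def message_date(msg):
    """Return the Date header as an aware UTC datetime, or None."""
    raw = msg.get("Date")
    if raw is None:
        return None
    try:
        when = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return None  # unparsable Date header
    if when.tzinfo is None:
        # Assumption: treat dates without a usable zone as UTC.
        when = when.replace(tzinfo=timezone.utc)
    return when.astimezone(timezone.utc)


def extract_range(src_path, dst_path, start, end):
    """Copy messages with start <= Date < end into dst_path.

    If dst_path already exists, messages are appended to it.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    kept = 0
    try:
        for msg in src:
            when = message_date(msg)
            if when is not None and start <= when < end:
                dst.add(msg)
                kept += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return kept
```

Usage would be something like extract_range("gmail-archive/gmail-backup.mbox", "subset.mbox", datetime(2008, 1, 1, tzinfo=timezone.utc), datetime(2008, 5, 1, tzinfo=timezone.utc)), where subset.mbox is a hypothetical output name. The source file is only read, never modified.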

ETA[1]: I tried grepmail and it gave discouraging results:
$ grepmail . -r gmail-archive/gmail-backup.mbox
gmail-archive/gmail-backup.mbox: 50205

$ mboxgrep.exe -c . gmail-archive/gmail-backup.mbox
51601


This could point to a problem with how getmail fetched the messages. I'm also a bit concerned that no two tools give me the same count.

ETA[2]: Duplicates filtered:
$ mboxgrep.exe -c -nd . gmail-archive/gmail-backup.mbox
48807
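
As a cross-check on the duplicate count, here is a rough sketch that counts messages by Message-ID using Python's standard mailbox module. Treating two messages with the same Message-ID as duplicates is an assumption of this sketch; mboxgrep's -nd option may use a different criterion, so the numbers need not agree. The name duplicate_report is made up.

```python
import mailbox
from collections import Counter


def duplicate_report(path):
    """Return (total, unique, missing_id) counts for an mbox file."""
    ids = Counter()
    missing = 0
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            msg_id = msg.get("Message-ID")
            if msg_id is None:
                missing += 1
            else:
                ids[str(msg_id).strip()] += 1
    finally:
        box.close()
    total = sum(ids.values()) + missing
    # Messages without a Message-ID are each counted as unique.
    unique = len(ids) + missing
    return total, unique, missing
```

A large gap between total and unique would suggest that some batches were fetched more than once.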




[info]compwizrd
2008-05-13 03:24 am UTC (link)
my guess is you're picking up From: lines that are in the body of the emails, not just the headers.

grepmail looks like it has a feature that will work, it mentions using -br to count, so -hr should work on headers.. check the man page

also, you can try mboxgrep with the -H option, though it can't deal with a mailbox bigger than 2 GB unless you redirect input into the program



[info]njyoder
2008-05-13 03:28 am UTC (link)
No, those have a colon in them ("From: "), but the "From " lines that denote a new message in mbox format don't. Whatever it is, those other programs will work better. Thanks for the recommendations.

EDIT: I tried grepmail and it gave discouraging results:
$ grepmail . -r gmail-archive/gmail-backup.mbox
gmail-archive/gmail-backup.mbox: 50205

$ mboxgrep -c . gmail-archive/gmail-backup.mbox
51601


I tried mboxgrep as well and I get inconsistent counts on all fronts. That is a bit concerning. This also means that something is probably weird with the way getmail fetches messages from gmail. Any ideas?

Edited at 2008-05-13 08:06 am UTC



[info]compwizrd
2008-05-13 01:05 pm UTC (link)
You can just copy the mbox file and stick it in Thunderbird, and Thunderbird will happily read it (it stores its own mail in mbox format).

That will also tell you once and for all how many emails it thinks are there.

Also, there's a plugin for Thunderbird, though I can't remember the name, that finds duplicate emails and lets you delete them.

Oh, and grepmail can filter by date, once you do get this working



