Re: mailing list archiver chewing patches

From: Matteo Beccati <php(at)beccati(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Dave Page <dpage(at)pgadmin(dot)org>, Abhijit Menon-Sen <ams(at)toroid(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tim Bunce <Tim(dot)Bunce(at)pobox(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mailing list archiver chewing patches
Date: 2010-01-12 19:56:29
Message-ID: 4B4CD3ED.5040207@beccati.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-www

Il 12/01/2010 10:30, Magnus Hagander ha scritto:
> The problem is usually with strange looking emails with 15 different
> MIME types. If we can figure out the proper way to render that, the
> rest really is just a SMOP.

Yeah, I was expecting some, but all the message I've looked at seemed to
be working ok.

> (BTW, for something to actually be used In Production (TM), we want
> something that uses one of our existing frameworks. So don't go
> overboard in code-wise implementations on something else - proof of
> concept on something else is always ok, of course)

OK, that's something I didn't know, even though I expected some kind of
limitations. Could you please elaborate a bit more (i.e. where to find
info)?

Having played with it, here's my feedback about AOX:

pros:
- seemed to be working reliably;
- does most of the dirty job of parsing emails, splitting parts, etc
- highly normalized schema
- thread support (partial?)

cons:
- directly publishing the live email feed might not be desirable
- queries might end up being a bit complicate for simple tasks
- might be not easy to add additional processing in the workflow

>>> So just to put this into perspective and give anyone paying attention
>>> an idea of the pain that lies ahead should they decide to work on
>>> this:
>>>
>>> - We need to import the old archives (of which there are hundreds of
>>> thousands of messages, the first few years of which have, umm, minimal
>>> headers.
>>
>> Anyone having a local copy of this in his mailboxes? At some point there
>> were some NNTP gateway, so maybe there's a copy this way.
>
> We have MBOX files.
>
> IIRC, aox has an import function that can read MBOX files. The
> interesting thing is what happens with the really old files that don't
> have complete headers.
>
> I don't think you can trust the NNTP gateway now or in the past,
> messages are sometimes lost there. The mbox files are as complete as
> anything we'll ever get.

Importing the whole pgsql-www archive with a perl script that bounces
messages via SMTP took about 30m. Maybe there's even a way to skip SMTP,
I haven't looked into it that much.

>>> - We need to generate thread indexes
>>
>> We have CTEs :)
>
> Right. We still need the threading information, so we have something
> to use our CTEs on :-)
>
> But I assume that AOX already does this?

there are thread related tables and they seem to get filled when a SORT
IMAP command is issued, however I haven't found a way to get the
hierarchy out of them.

What that means is that we'd need some kind of post processing to
populate a thread hierarchy.

If there isn't a fully usable thread hierarchy I was more thinking to
ltree, mainly because I've successfully used it in past and I haven't
had enough time yet to look at CTEs. But if performance is comparable I
don't see a reason why we shouldn't use them.

>>> - We need to re-generate the original URLs for backwards compatibility
>>
>> I guess the message-id one ain't the tricky one... and it should be
>> possible to fill a relation table like
>> monharc_compat(message_id, list, year, month, message_number);
>
> Yeah. It's not so hard, you can just screen-scrape the current
> archives the same way the search server does.

Definitely an easy enough task.

With all that said, I can't promise anything as it all depends on how
much spare time I have, but I can proceed with the evaluation if you
think it's useful. I have a feeling that AOX is not truly the right tool
for the job, but we might be able to customise it to suit our needs. Are
there any other requirements that weren't specified?

Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Matteo Beccati 2010-01-12 19:58:09 Re: mailing list archiver chewing patches
Previous Message Teodor Sigaev 2010-01-12 19:21:03 Re: KNNGiST for knn-search (WIP)

Browse pgsql-www by date

  From Date Subject
Next Message Matteo Beccati 2010-01-12 19:58:09 Re: mailing list archiver chewing patches
Previous Message Magnus Hagander 2010-01-12 18:54:27 Re: mailing list archiver chewing patches