Skip site navigation (1) Skip section navigation (2)

Re: mailing list archiver chewing patches

From: Matteo Beccati <php(at)beccati(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Dave Page <dpage(at)pgadmin(dot)org>, Abhijit Menon-Sen <ams(at)toroid(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tim Bunce <Tim(dot)Bunce(at)pobox(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mailing list archiver chewing patches
Date: 2010-01-12 19:56:29
Message-ID: 4B4CD3ED.5040207@beccati.com (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-www
Il 12/01/2010 10:30, Magnus Hagander ha scritto:
> The problem is usually with strange looking emails with 15 different
> MIME types. If we can figure out the proper way to render that, the
> rest really is just a SMOP.

Yeah, I was expecting some, but all the message I've looked at seemed to 
be working ok.

> (BTW, for something to actually be used In Production (TM), we want
> something that uses one of our existing frameworks. So don't go
> overboard in code-wise implementations on something else - proof of
> concept on something else is always ok, of course)

OK, that's something I didn't know, even though I expected some kind of 
limitations. Could you please elaborate a bit more (i.e. where to find 
info)?

Having played with it, here's my feedback about AOX:

pros:
- seemed to be working reliably;
- does most of the dirty job of parsing emails, splitting parts, etc
- highly normalized schema
- thread support (partial?)

cons:
- directly publishing the live email feed might not be desirable
- queries might end up being a bit complicate for simple tasks
- might be not easy to add additional processing in the workflow


>>> So just to put this into perspective and give anyone paying attention
>>> an idea of the pain that lies ahead should they decide to work on
>>> this:
>>>
>>> - We need to import the old archives (of which there are hundreds of
>>> thousands of messages, the first few years of which have, umm, minimal
>>> headers.
>>
>> Anyone having a local copy of this in his mailboxes? At some point there
>> were some NNTP gateway, so maybe there's a copy this way.
>
> We have MBOX files.
>
> IIRC, aox has an import function that can read MBOX files. The
> interesting thing is what happens with the really old files that don't
> have complete headers.
>
> I don't think you can trust the NNTP gateway now or in the past,
> messages are sometimes lost there. The mbox files are as complete as
> anything we'll ever get.

Importing the whole pgsql-www archive with a perl script that bounces 
messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, 
I haven't looked into it that much.

>>> - We need to generate thread indexes
>>
>> We have CTEs :)
>
> Right. We still need the threading information, so we have something
> to use our CTEs on :-)
>
> But I assume that AOX already does this?

there are thread related tables and they seem to get filled when a SORT 
IMAP command is issued, however I haven't found a way to get the 
hierarchy out of them.

What that means is that we'd need some kind of post processing to 
populate a thread hierarchy.

If there isn't a fully usable thread hierarchy I was more thinking to 
ltree, mainly because I've successfully used it in past and I haven't 
had enough time yet to look at CTEs. But if performance is comparable I 
don't see a reason why we shouldn't use them.

>>> - We need to re-generate the original URLs for backwards compatibility
>>
>> I guess the message-id one ain't the tricky one... and it should be
>> possible to fill a relation table like
>>   monharc_compat(message_id, list, year, month, message_number);
>
> Yeah. It's not so hard, you can just screen-scrape the current
> archives the same way the search server does.

Definitely an easy enough task.

With all that said, I can't promise anything as it all depends on how 
much spare time I have, but I can proceed with the evaluation if you 
think it's useful. I have a feeling that AOX is not truly the right tool 
for the job, but we might be able to customise it to suit our needs. Are 
there any other requirements that weren't specified?


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/

In response to

Responses

pgsql-www by date

Next:From: Matteo BeccatiDate: 2010-01-12 19:58:09
Subject: Re: mailing list archiver chewing patches
Previous:From: Magnus HaganderDate: 2010-01-12 18:54:27
Subject: Re: mailing list archiver chewing patches

pgsql-hackers by date

Next:From: Matteo BeccatiDate: 2010-01-12 19:58:09
Subject: Re: mailing list archiver chewing patches
Previous:From: Teodor SigaevDate: 2010-01-12 19:21:03
Subject: Re: KNNGiST for knn-search (WIP)

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group