Re: mailing list archiver chewing patches

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Dave Page <dpage(at)pgadmin(dot)org>, Matteo Beccati <php(at)beccati(dot)com>, Abhijit Menon-Sen <ams(at)toroid(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tim Bunce <Tim(dot)Bunce(at)pobox(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mailing list archiver chewing patches
Date: 2010-01-12 09:30:41
Message-ID: 9837222c1001120130n67516f0fj317b5656b435f964@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-www

On Tue, Jan 12, 2010 at 10:05, Dimitri Fontaine <dfontaine(at)hi-media(dot)com> wrote:
> Dave Page <dpage(at)pgadmin(dot)org> writes:
>> 2010/1/12 Matteo Beccati <php(at)beccati(dot)com>:
>>> So, I've decided to spend a bit more time on this and here is a proof of concept web app that displays mailing list archives reading from the AOX database:
>>>
>>> http://archives.beccati.org/
>>
>> Seems to work.
>
> Hehe, nice a beginning!

The problem is usually with strange looking emails with 15 different
MIME types. If we can figure out the proper way to render that, the
rest really is just a SMOP.

(BTW, for something to actually be used In Production (TM), we want
something that uses one of our existing frameworks. So don't go
overboard in code-wise implementations on something else - proof of
concept on something else is always ok, of course)

>> So just to put this into perspective and give anyone paying attention
>> an idea of the pain that lies ahead should they decide to work on
>> this:
>>
>> - We need to import the old archives (of which there are hundreds of
>> thousands of messages, the first few years of which have, umm, minimal
>> headers.
>
> Anyone having a local copy of this in his mailboxes? At some point there
> were some NNTP gateway, so maybe there's a copy this way.

We have MBOX files.

IIRC, aox has an import function that can read MBOX files. The
interesting thing is what happens with the really old files that don't
have complete headers.

I don't think you can trust the NNTP gateway now or in the past,
messages are sometimes lost there. The mbox files are as complete as
anything we'll ever get.

>> - We need to generate thread indexes
>
> We have CTEs :)

Right. We still need the threading information, so we have something
to use our CTEs on :-)

But I assume that AOX already does this?

>> - We need to re-generate the original URLs for backwards compatibility
>
> I guess the message-id one ain't the tricky one... and it should be
> possible to fill a relation table like
>  monharc_compat(message_id, list, year, month, message_number);

Yeah. It's not so hard, you can just screen-scrape the current
archives the same way the search server does.

> Then we'd need some help from the webserver (rewrite rules I guess) so
> that the current URL is transformed to call a catch-all script:
>   http://archives.postgresql.org/pgsql-xxx/YYYY-MM/msg01234.php
> -> http://archives.postgresql.org/compat.php?l=xxx&y=YYYY&m=MM&n=01234

Or just a trivial regexp catch in any modern app platform.

> In that compat.php script you then issue the following query or the like
> to get the message_id, then use the newer infrastructure to get to
> display it:
>
>  SELECT message_id
>    FROM monharc_compat
>   WHERE list = ? and year = ? and month = ? and message_number = ?;

I'd rather see it redirect it to the new style URL, but it's the same
query, yes :-)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2010-01-12 09:31:17 Re: Hot Standy introduced problem with query cancel behavior
Previous Message Peter Eisentraut 2010-01-12 09:28:41 Re: Typed tables

Browse pgsql-www by date

  From Date Subject
Next Message Magnus Hagander 2010-01-12 09:33:46 Re: mailing list archiver chewing patches
Previous Message Dimitri Fontaine 2010-01-12 09:05:56 Re: mailing list archiver chewing patches