Re: mailing list archiver chewing patches

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Matteo Beccati <php(at)beccati(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Dave Page <dpage(at)pgadmin(dot)org>, Abhijit Menon-Sen <ams(at)toroid(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tim Bunce <Tim(dot)Bunce(at)pobox(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mailing list archiver chewing patches
Date: 2010-01-12 20:04:27
Message-ID: 9837222c1001121204m52adbf21k6260609f1c8768b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-www

On Tue, Jan 12, 2010 at 20:56, Matteo Beccati <php(at)beccati(dot)com> wrote:
> Il 12/01/2010 10:30, Magnus Hagander ha scritto:
>>
>> The problem is usually with strange looking emails with 15 different
>> MIME types. If we can figure out the proper way to render that, the
>> rest really is just a SMOP.
>
> Yeah, I was expecting some, but all the message I've looked at seemed to be
> working ok.

Have you been looking at old or new messages? Try grabbing a couple of
MBOX files off archives.postgresql.org from several years back, you're
more likely to find weird MUAs then I think.

>> (BTW, for something to actually be used In Production (TM), we want
>> something that uses one of our existing frameworks. So don't go
>> overboard in code-wise implementations on something else - proof of
>> concept on something else is always ok, of course)
>
> OK, that's something I didn't know, even though I expected some kind of
> limitations. Could you please elaborate a bit more (i.e. where to find
> info)?

Well, the framework we're moving towards is built on top of django, so
that would be a good first start.

There is also whever the commitfest thing is built on, but I'm told
that's basically no framework.

> Having played with it, here's my feedback about AOX:
>
> pros:
> - seemed to be working reliably;
> - does most of the dirty job of parsing emails, splitting parts, etc
> - highly normalized schema
> - thread support (partial?)

A killer will be if that thread support is enough. If we have to build
that completely ourselves, it'll take a lot more work.

> cons:
> - directly publishing the live email feed might not be desirable

Why not?

> - queries might end up being a bit complicate for simple tasks

As long as we don't have to hit them too often, which is solve:able
with caching. And we do have a pretty good RDBMS to run the queries on
:)

>> I don't think you can trust the NNTP gateway now or in the past,
>> messages are sometimes lost there. The mbox files are as complete as
>> anything we'll ever get.
>
> Importing the whole pgsql-www archive with a perl script that bounces
> messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, I
> haven't looked into it that much.

Um, yes. There is an MBOX import tool.

>>>> - We need to generate thread indexes
>>>
>>> We have CTEs :)
>>
>> Right. We still need the threading information, so we have something
>> to use our CTEs on :-)
>>
>> But I assume that AOX already does this?
>
> there are thread related tables and they seem to get filled when a SORT IMAP
> command is issued, however I haven't found a way to get the hierarchy out of
> them.
>
> What that means is that we'd need some kind of post processing to populate a
> thread hierarchy.
>
> If there isn't a fully usable thread hierarchy I was more thinking to ltree,
> mainly because I've successfully used it in past and I haven't had enough
> time yet to look at CTEs. But if performance is comparable I don't see a
> reason why we shouldn't use them.

I'd favor CTEs if they are fast enough. Great flexibility.

>>>> - We need to re-generate the original URLs for backwards compatibility
>>>
>>> I guess the message-id one ain't the tricky one... and it should be
>>> possible to fill a relation table like
>>>  monharc_compat(message_id, list, year, month, message_number);
>>
>> Yeah. It's not so hard, you can just screen-scrape the current
>> archives the same way the search server does.
>
> Definitely an easy enough task.
>
> With all that said, I can't promise anything as it all depends on how much
> spare time I have, but I can proceed with the evaluation if you think it's
> useful. I have a feeling that AOX is not truly the right tool for the job,
> but we might be able to customise it to suit our needs. Are there any other
> requirements that weren't specified?

Well, I think we want to avoid customizing it. Using a custom
frontend, sure. But we don't want to end up customizing the
parser/backend. That's the road to unmaintainability.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marko Kreen 2010-01-12 20:06:10 Re: Streaming replication status
Previous Message Matteo Beccati 2010-01-12 19:58:09 Re: mailing list archiver chewing patches

Browse pgsql-www by date

  From Date Subject
Next Message Aidan Van Dyk 2010-01-12 20:16:47 Re: mailing list archiver chewing patches
Previous Message Matteo Beccati 2010-01-12 19:58:09 Re: mailing list archiver chewing patches