Re: mailing list archiver chewing patches

From: Matteo Beccati <php(at)beccati(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Dave Page <dpage(at)pgadmin(dot)org>, Abhijit Menon-Sen <ams(at)toroid(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tim Bunce <Tim(dot)Bunce(at)pobox(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mailing list archiver chewing patches
Date: 2010-01-12 20:37:50
Message-ID: 4B4CDD9E.7010204@beccati.com
Lists: pgsql-hackers pgsql-www

On 12/01/2010 21:04, Magnus Hagander wrote:
> On Tue, Jan 12, 2010 at 20:56, Matteo Beccati<php(at)beccati(dot)com> wrote:
>> On 12/01/2010 10:30, Magnus Hagander wrote:
>>>
>>> The problem is usually with strange looking emails with 15 different
>>> MIME types. If we can figure out the proper way to render that, the
>>> rest really is just a SMOP.
>>
>> Yeah, I was expecting some, but all the messages I've looked at seemed to be
>> working ok.
>
> Have you been looking at old or new messages? Try grabbing a couple of
> MBOX files off archives.postgresql.org from several years back, you're
> more likely to find weird MUAs then I think.

Both. pgsql-hackers and -general are subscribed and receiving new emails,
while pgsql-www is just an import of the archives:

http://archives.beccati.org/pgsql-www/by/date (sorry, no paging)

(just fixed a 500 error that was caused by the fact that I've been
playing with the db a bit and a required helper table was missing)

>>> (BTW, for something to actually be used In Production (TM), we want
>>> something that uses one of our existing frameworks. So don't go
>>> overboard in code-wise implementations on something else - proof of
>>> concept on something else is always ok, of course)
>>
>> OK, that's something I didn't know, even though I expected some kind of
>> limitations. Could you please elaborate a bit more (i.e. where to find
>> info)?
>
> Well, the framework we're moving towards is built on top of django, so
> that would be a good first start.
>
> There is also whatever the commitfest thing is built on, but I'm told
> that's basically no framework.

I'm afraid that's outside of my expertise. But I can get as far as
having a proof of concept and the required queries / PHP code.

>> Having played with it, here's my feedback about AOX:
>>
>> pros:
>> - seemed to be working reliably;
>> - does most of the dirty job of parsing emails, splitting parts, etc
>> - highly normalized schema
>> - thread support (partial?)
>
> A killer will be if that thread support is enough. If we have to build
> that completely ourselves, it'll take a lot more work.

Looks like we need to populate a helper table with hierarchy
information, unless Abhijit has a better idea and knows how to get it
from the aox main schema.
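
To illustrate what such a helper table could contain, here is a minimal
sketch in Python. It assumes each message is reduced to a
(message_id, in_reply_to) pair taken from the mail headers; the function
name and the row layout (message_id, parent_id, root_id, depth) are
hypothetical and do not reflect the actual aox schema.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical row layout for the helper table:
# (message_id, parent_id, root_id, depth)
ThreadRow = Tuple[str, Optional[str], str, int]


def build_thread_rows(
    messages: Iterable[Tuple[str, Optional[str]]]
) -> List[ThreadRow]:
    """Derive thread hierarchy rows from (message_id, in_reply_to) pairs."""
    # Map every archived message to the message it claims to reply to.
    parent: Dict[str, Optional[str]] = {}
    for message_id, in_reply_to in messages:
        parent[message_id] = in_reply_to

    rows: List[ThreadRow] = []
    for message_id, reply_to in parent.items():
        depth = 0
        root = message_id
        seen = {message_id}  # guards against reply loops in broken headers
        current = reply_to
        # Walk up the chain for as long as the parent is in the archive.
        while current is not None and current in parent and current not in seen:
            seen.add(current)
            root = current
            depth += 1
            current = parent[current]
        # A parent that is not in the archive is stored as NULL (None).
        known_parent = reply_to if reply_to in parent else None
        rows.append((message_id, known_parent, root, depth))
    return rows
```

Messages whose parent never reached the archive end up as thread roots
here; a real implementation would probably also look at the References
header before giving up.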

>> cons:
>> - directly publishing the live email feed might not be desirable
>
> Why not?

The scenario I was thinking of was the creation of a static snapshot and
the potential inconsistencies that might occur if threads get updated
during that time.

>> - queries might end up being a bit complicated for simple tasks
>
> As long as we don't have to hit them too often, which is solvable
> with caching. And we do have a pretty good RDBMS to run the queries on
> :)

True :)

>>> I don't think you can trust the NNTP gateway now or in the past,
>>> messages are sometimes lost there. The mbox files are as complete as
>>> anything we'll ever get.
>>
>> Importing the whole pgsql-www archive with a perl script that bounces
>> messages via SMTP took about 30m. Maybe there's even a way to skip SMTP, I
>> haven't looked into it that much.
>
> Um, yes. There is an MBOX import tool.

Cool.
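
For checking the old archives for odd MIME structures without going
through SMTP at all, an mbox file can also be read directly. The sketch
below is not the AOX import tool; it only uses Python's standard
`mailbox` module, and the function name is hypothetical. It counts the
content types of all non-multipart parts found in one mbox file.

```python
import mailbox
from collections import Counter
from typing import Dict


def mime_type_counts(mbox_path: str) -> Dict[str, int]:
    """Count the content types of all leaf MIME parts in an mbox file."""
    counts: Counter = Counter()
    # create=False: raise an error instead of creating a missing file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # walk() yields the message itself and every nested part.
            for part in message.walk():
                if not part.is_multipart():
                    counts[part.get_content_type()] += 1
    finally:
        box.close()
    return dict(counts)
```

Running this over a few old mbox files should show quickly which
content types a renderer has to cope with.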

>> With all that said, I can't promise anything as it all depends on how much
>> spare time I have, but I can proceed with the evaluation if you think it's
>> useful. I have a feeling that AOX is not truly the right tool for the job,
>> but we might be able to customise it to suit our needs. Are there any other
>> requirements that weren't specified?
>
> Well, I think we want to avoid customizing it. Using a custom
> frontend, sure. But we don't want to end up customizing the
> parser/backend. That's the road to unmaintainability.

Sure. I guess my wording wasn't right... I was thinking more about
adding new tables, materialized views or whatever else might be missing
to make it fit our purpose.

Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/
