Re: mailing list archiver chewing patches

From: Matteo Beccati <php(at)beccati(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Joe Conway <mail(at)joeconway(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, David Fetter <david(at)fetter(dot)org>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Dave Page <dpage(at)pgadmin(dot)org>, Abhijit Menon-Sen <ams(at)toroid(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tim Bunce <Tim(dot)Bunce(at)pobox(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mailing list archiver chewing patches
Date: 2010-02-13 12:34:56
Message-ID: 4B769C70.8060106@beccati.com
Lists: pgsql-hackers pgsql-www

On 01/02/2010 17:28, Tom Lane wrote:
> Matteo Beccati<php(at)beccati(dot)com> writes:
>> My main concern is that we'd need to overcomplicate the thread detection
>> algorithm so that it better deals with delayed messages: as it currently
>> works, the replies to a missing message get linked to the
>> "grand-parent". Injecting the missing message afterwards will put it at
>> the same level as its replies. If it happens only once in a while I
>> guess we can live with it, but definitely not if it happens tens of
>> times a day.
>
> That's quite common unfortunately --- I think you're going to need to
> deal with the case. Even getting a direct feed from the mail relays
> wouldn't avoid it completely: consider cases like
>
> * A sends a message
> * B replies, cc'ing A and the list
> * B's reply to list is delayed by greylisting
> * A replies to B's reply (cc'ing list)
> * A's reply goes through immediately
> * B's reply shows up a bit later
>
> That happens pretty frequently IME.

I've improved the threading algorithm by keeping an ordered backlog of
unresolved references, i.e. when a message arrives:

1. Search for a parent message using:

1a. In-Reply-To header. If the referenced message is not found, insert
its Message-Id into the backlog table with position 0

1b. References header. For each missing referenced message, insert its
Message-Id into the backlog table with position N

1c. MS Exchange Thread-Index and Thread-Topic headers

2. Message is stored along with its parent ID, if any.

3. Look up the new message's Message-Id in the backlog table. Update the
parent field of every message that references it and clean up positions
>= N in the references table.
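
A minimal in-memory sketch of steps 1 to 3, replayed on the greylisting
case quoted above. It is not the archiver's actual code: the Message and
Archive names, the set used as the backlog, and the choice to number
References from the most recent entry backwards are all assumptions made
for illustration. It also assumes the table cleaned up in step 3 is the
backlog itself, and it leaves out the Thread-Index / Thread-Topic
fallback from 1c.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    message_id: str
    in_reply_to: Optional[str] = None
    references: list = field(default_factory=list)  # oldest first
    parent: Optional[str] = None


class Archive:
    """Hypothetical in-memory stand-in for the message and backlog tables."""

    def __init__(self):
        self.messages = {}
        # Backlog rows: (missing_id, waiting_id, position)
        self.backlog = set()

    def _candidates(self, msg):
        # Closest ancestor first: In-Reply-To, then References from the
        # last entry to the first. Duplicates and self-references dropped.
        ids = [msg.in_reply_to] if msg.in_reply_to else []
        ids += list(reversed(msg.references))
        seen = {msg.message_id}
        ordered = []
        for mid in ids:
            if mid not in seen:
                seen.add(mid)
                ordered.append(mid)
        return list(enumerate(ordered))

    def add(self, msg):
        if msg.message_id in self.messages:
            return  # already archived
        # Step 1: take the closest known ancestor as parent; every closer
        # ancestor that is still missing goes to the backlog.
        for position, ancestor_id in self._candidates(msg):
            if ancestor_id in self.messages:
                msg.parent = ancestor_id
                break
            self.backlog.add((ancestor_id, msg.message_id, position))
        # Step 2: store the message with its parent, if any.
        self.messages[msg.message_id] = msg
        # Step 3: re-parent messages that were waiting for this one and
        # drop their backlog rows at this position or further away.
        hits = [row for row in self.backlog if row[0] == msg.message_id]
        for _, waiting_id, position in hits:
            self.messages[waiting_id].parent = msg.message_id
            self.backlog = {
                row for row in self.backlog
                if not (row[1] == waiting_id and row[2] >= position)
            }


# A sends a1, B's reply b1 is delayed, A's reply a2 to b1 arrives first.
archive = Archive()
archive.add(Message("a1"))
archive.add(Message("a2", in_reply_to="b1", references=["a1", "b1"]))
parent_before = archive.messages["a2"].parent  # linked to the grand-parent
archive.add(Message("b1", in_reply_to="a1", references=["a1"]))
parent_after = archive.messages["a2"].parent   # re-linked under b1
```

Rows with a lower position than the one just resolved are kept, so a
closer ancestor that shows up even later can still take over as parent.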

Now I just need some time to do a final cleanup, and then I'll be ready
to publish the code, which hopefully will be clearer than my words ;)

Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/
