Re: strange buildfarm failures

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: alvherre(at)commandprompt(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange buildfarm failures
Date: 2007-04-26 04:44:51
Message-ID: 46302E43.4020509@kaltenbrunner.cc
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>>> Stefan Kaltenbrunner wrote:
>>>> two of my buildfarm members had different but pretty weird looking
>>>> failures lately:
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-04-25%2002:03:03
>>>> and
>>>>
>>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=emu&dt=2007-04-24%2014:35:02
>>>>
>>>> any ideas on what might be causing those?
>
> Just for the record, quagga and emu failures don't seem related to the
> report below. They don't crash; the regression.diffs contains data that
> suggests that there may be data corruption of some sort.
>
> INSERT INTO INET_TBL (c, i) VALUES ('192.168.1.2/30', '192.168.1.226');
> ERROR: invalid cidr value: "%{"
>
> This doesn't seem to make much sense.

Yeah, on further reflection it looks like the failures from emu and
quagga are unrelated to the issue lionfish is experiencing.

>
>
>>> lionfish just failed too:
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-25%2005:30:09
>> And had a similar failure a few days ago. The curious thing is that
>> what we get in the postmaster log is
>>
>> LOG: server process (PID 23405) was terminated by signal 6: Aborted
>> LOG: terminating any other active server processes
>>
>> You would think SIGABRT would come from an assertion failure, but
>> there's no preceding assertion message in the log. The other
>> characteristic of these crashes is that *all* of the failing regression
>> instances report "terminating connection because of crash of another
>> server process", which suggests strongly that the crash was in an
>> autovacuum process (if it were bgwriter or stats collector the
>> postmaster would've said so). So I think the recent autovac patches
>> are at fault. I spent a bit of time trolling for a spot where the code
>> might abort() without having printed anything, but didn't find one.
>
> Hmm. I kept an eye on the buildfarm for a few days, but saw nothing
> that could be connected to autovacuum so I neglected it.
>
> This is the other failure:
>
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2007-04-20%2005:30:14
>
> It shows the same pattern. I am baffled -- I don't understand how it
> can die without reporting the error.

I should have mentioned this initially, but I think the failure from
2007-04-20 is not related at all.
The failure from 2007-04-20 was very likely caused by the kernel
running totally out of memory (lionfish is a very resource-starved box
with only 48MB of RAM and 128MB of swap at that time - do we have a recent
patch that is increasing memory usage quite a lot?).
I immediately added another 128MB of swap after that, and I don't think
the failure from yesterday is the same (at least there are no kernel
logs that indicate a similar issue).
>
> Apparently it crashes rather frequently, so it shouldn't be too
> difficult to reproduce on manual runs. If we could get it to run with a
> higher debug level, it might prove helpful to further pinpoint the
> problem.

A manual run of the buildfarm script takes ~4.5 hours on lionfish ;-)

>
> The core file would be much better obviously (first and foremost to
> confirm that it's autovacuum that's crashing ... )

I will see what I can come up with ...

Stefan
