Re: thread safety on clients

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>, Greg Smith <greg(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: thread safety on clients
Date: 2009-12-11 18:08:18
Message-ID: 20810.1260554898@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> The crash is real --- I've replicated it here. Still trying to figure
> out what is the real cause.

Okay, I've sussed it. The crash itself is due to a memory management
mistake in the recently-rewritten EvalPlanQual code (pfree'ing a tuple
that we still have live Datum references to). The reason the "broken"
pgbench is able to trigger it is that it generates concurrent updates
for the same row in pgbench_accounts with much higher frequency than
normal. This forces the backends through the EvalPlanQual code, which
is broken and crashes. QED.

I'll commit a fix for that shortly, but this reminds me once again that
the EvalPlanQual logic is desperately under-tested in our normal
regression testing. We really need some kind of testing infrastructure
that would let us exercise concurrent-update cases a bit more than "not
at all".

Also, the reason why Bruce's mistake exposed this is interesting.
Omitting the #define for ENABLE_THREAD_SAFETY does not actually break
"pgbench -j" at all -- it has a fallback strategy that uses multiple
subprocesses instead of multiple threads. However, it has only one
srandom() call, which occurs in the parent process before forking.
This means that the subprocesses all start with the same random number
state, which means they generate the same sequence of "random" account
IDs to update, which is why the probability of an update collision
requiring EvalPlanQual resolution is so high, especially at the start
of the run (and the crash does happen pretty much immediately after
starting pgbench).

I'm inclined to think this is bad, and we should fix pgbench to
re-randomize after forking. If we don't, we'll have synchronized
behavior on some platforms and not others, which doesn't seem like a
good idea. On the other hand, if you did want that type of behavior,
it's hard to see how you'd get it. Is it worth trying to provide that
as a (probably non-default) mode in pgbench? If so, how would you
do it on threaded platforms?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-12-11 18:12:27 Re: [PATCH] dtrace probes for memory manager
Previous Message Zdenek Kotala 2009-12-11 18:04:31 Re: [PATCH] dtrace probes for memory manager