Skip site navigation (1) Skip section navigation (2)

Re: systable_getnext_ordered

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
Cc: pgsql-novice(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: systable_getnext_ordered
Date: 2011-01-27 00:12:52
Message-ID: 29246.1296087172@sss.pgh.pa.us (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-novice
I wrote:
> yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi) writes:
>> after systable_getnext_ordered returned NULL, is it ok to call it again?

> I wouldn't rely on it working.

>> i'm wondering because inv_truncate seems to do it and expecting NULL.

> Hmm, that may well be a bug.  Have you tested it?

I looked at this a bit more closely, and basically the answer is that
inv_truncate is accidentally failing to fail.  What will actually happen
if systable_getnext_ordered is called another time, at least when using
a btree index, is that it will start the same requested scan over again,
ie deliver all the tuples it did the first time (which is none, in this
case).  That's an implementation artifact, but since the behavior is
undefined in the first place, it's not wrong.

Now, if inv_truncate's initial call on systable_getnext_ordered returns
NULL (ie, the truncation point is past the current EOF page), it will
fall through to the "Write a brand new page" code, which will create and
insert a partial page at the truncation point.  It then goes to the
delete-all-remaining-pages loop.  Because that starts a fresh scan with
the very same scan key conditions, you might expect that it would find
and delete the page it just inserted --- causing the apparent EOF of the
blob to be wrong afterwards.  It accidentally fails to do that because
the new tuple postdates the snapshot it's scanning with.  So the loop
terminates having found no matching tuples, and all is well.

So this code is confusing, inefficient (performing a useless search of
the index), only works because of an obscure consideration not explained
in the comments, and sets a bad precedent for people to follow.  I'm
going to go change it to explicitly not do the final loop if the initial
search failed.  It's not a bug, exactly, but it's sure lousy coding.
Thanks for pointing it out.

			regards, tom lane

In response to

Responses

pgsql-novice by date

Next:From: Josh KupershmidtDate: 2011-01-27 04:10:34
Subject: Re: new to postgreSql
Previous:From: Tom LaneDate: 2011-01-26 19:22:56
Subject: Re: decimal seperator

pgsql-hackers by date

Next:From: Bruce MomjianDate: 2011-01-27 00:18:16
Subject: Re: new compiler warnings
Previous:From: Bruce MomjianDate: 2011-01-27 00:06:41
Subject: Re: Re: In pg_test_fsync, use K(1024) rather than k(1000) for write size units.

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group