keys_are_unique optimization causes out-of-buffers failure

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Joseph Shraibman <jks(at)selectacast(dot)net>
Subject: keys_are_unique optimization causes out-of-buffers failure
Date: 2003-03-21 21:36:37
Message-ID: 15534.1048282597@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've just tracked down the reason for this failure:
http://archives.postgresql.org/pgsql-general/2003-03/msg01052.php
It seems the reason I didn't see it in my previous test was that I
didn't happen to declare the index as unique. A many-way indexscan
on unique indexes *will* run 7.3.* and CVS tip out of buffers.

The problem is the optimization I added to index_getnext() to skip
calling the index access method after the first call, if the index key
is known unique. With that optimization in place, index_getnext() can
return NULL while still holding a buffer pin on the heap page containing
the single interesting tuple. (Furthermore, the underlying AM is still
holding a pin on the corresponding index page.) This is no big problem
in the normal case --- but when we are doing a multiple-indexscan plan,
nodeIndexscan.c will then move on to the next indexscan in the plan. If
that acts the same, two more buffers will be pinned when nodeIndexscan.c
advances to the third indexscan. Etc. Sooner or later you run out of
buffers.

This isn't a resource leak in the classic sense, because all these
buffers will be released when the plan is closed down; but it's
effectively an intraquery leak, and it will cause problems for many-way
indexscans.

This is the second major bug found in the keys_are_unique hack (the
first being that it didn't cope with backwards scanning). At this point
I'm a little bit disgusted with the hack, and would be willing to
consider ripping it out entirely. But it does offer some performance
benefits, and probably we ought to look first to see if it can be saved.

A minimal-change solution could be made by disabling the keys_are_unique
optimization in the context of a multiple-index scan. This is pretty
grotty though, both from a performance standpoint and because it
introduces coupling between levels that ought to be independent.

A better way to proceed would be to get index_getnext() to release all
the resources involved before returning NULL. I believe this is safe
even though index_getnext() might later need to back up and re-fetch
the heap tuple --- if it returned the tuple once, then the tuple is
visible to the current transaction and can't get deleted from under us,
even if we release the pin on its page. A bit of ugliness is that the
only available way to get the index AM to release its index-level pins
is to call index_rescan, which would normally reset index_getnext's own
state.

So this seems doable but ugly. Can anyone see a better way? Are there
(still) holes in this plan? Anyone want to lobby for dropping the
keys_are_unique code completely?

A related bug has been latent all along: because "mark" positions are
remembered per-indexscan, it'd be possible for successive ExecMarkPos
calls applied to a multi-indexscan plan node to build up an indefinitely
large number of marks and thereby run the system out of buffers.
I don't see any way to fix this without adding a new index-AM API call
to clear out a remembered mark position. However, this problem cannot
actually occur in the current system or any readily-foreseeable version,
because ExecMarkPos is only used by merge joins and a multi-indexscan
couldn't be a direct child of a merge join (since a multi-indexscan
would never be thought to produce any useful sort order). So I'm going
to ignore the potential future problem, but I thought I'd mention it for
the archives.

regards, tom lane

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2003-03-21 22:00:03 Re: [Fwd: Bug#184566: security threat to postgresql
Previous Message Barry Lind 2003-03-21 21:32:29 Re: [INTERFACES] Roadmap for FE/BE protocol redesign