Re: Optimize single tuple fetch from nbtree index

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize single tuple fetch from nbtree index
Date: 2019-08-05 11:34:26
Message-ID: 1565004866513.6952@Optiver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.

> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case wouldn't come up more often by other users though. We have many large tables with timeseries data and it seems to me that with timeseries data, two of the most common queries are:
(1) What is the state of { a1,a2, a3 ...} at timepoint t (but you don't know that there's an update *exactly* at timepoint t - so you're left with trying to find the latest update smaller than t)
(2) What is the state of { a } at timepoints { t1, t2, t3 ... }
Given that a1,a2,a3... are indepedently updating, but similar time series (eg. sensor a1 and sensor a2, but both provide a temperature value and update independently from each other).
Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just doing a nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself versus the sampling rate of your query though, for high frequency time series and/or low frequency sampling the LIMIT 1 approach is much faster).

Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called _bt_read_closest was developed which also partially reads the page. A Skip Scan has a very similar access pattern to the use case I describe here, because it's also very likely to just require one tuple from the page. Even though the implementation in that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit faster if it had a correct implementation of this partial page-read and it wouldn't have to read the full page every time.

I have one further question about these index offsets. There are several comments in master that indicate that it's impossible that an item moves 'left' on a page, if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:

* Note that if we hold a pin on the target page continuously from initially
* reading the items until applying this function, VACUUM cannot have deleted
* any items from the page, and so there is no need to search left from the
* recorded offset. (This observation also guarantees that the item is still
* the right one to delete, which might otherwise be questionable since heap
* TIDs can get recycled.) This holds true even if the page has been modified
* by inserts and page splits, so there is no need to consult the LSN.

Still, exactly this case happens in practice. In my tests I was able to get behavior like:
1) pin + lock a page in _bt_first
2) read a tuple, record indexOffset (for example offset=100) and heap tid
3) unlock page, but *keep* the pin (end of _bt_first of my patch)
4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap Tid that we registered earlier
6) we find that we can now find this tuple at indexOffset=98, eg. it moved left. This should not be possible.
This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset.

However, this is in contradiction with the comments (and code) of _bt_killitems.
Is the comment incorrect/outdated or is there a bug in vacuum or any other part of Postgres that might move index items left even though there are others holding a pin?

-Floris

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Igrishin 2019-08-05 11:38:40 psql's meta command \d is broken as of 12 beta2.
Previous Message Laurenz Albe 2019-08-05 11:30:43 Re: Identity columns should own only one sequence