Re: Minmax indexes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minmax indexes
Date: 2014-08-08 08:29:00
Message-ID: 53E48A4C.7010401@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/06/2014 05:35 AM, Alvaro Herrera wrote:
> FWIW I think I haven't responded appropriately to the points raised by
> Heikki. Basically, as I see it there are three main items:
>
> 1. the revmap physical-to-logical mapping is too complex; let's use
> something else.
>
> We had revmap originally in a separate fork. The current approach grew
> out of the necessity of putting it in the main fork while ensuring that
> fast access to individual pages is possible. There are of course many
> ways to skin this cat; Heikki's proposal is to have it always occupy the
> first few physical pages, rather than require a logical-to-physical
> mapping table. To implement this he proposes to move other pages out of
> the way as the index grows. I don't really have much love for this
> idea. We can change how this is implemented later in the cycle, if we
> find that a different approach is better than my proposal. I don't want
> to spend endless time meddling with this (and I definitely don't want to
> have this delay the eventual commit of the patch.)

Please also note that LockTuple is pretty expensive, compared to
lightweight locks. Remember how Robert made hash indexes signifcantly
faster a couple of years ago (commit 76837c15) by removing the need for
heavy-weight locks during queries. To demonstrate that, I applied your
patch, and ran a very simple test:

create table numbers as select i*1000+j as n from generate_series(0,
19999) i, generate_series(1, 1000) j;
create index number_minmax on numbers using minmax (n) with
(pages_per_range=1);

I ran "explain analyze select * from numbers where n = 10;" a few times
under "perf" profiler. The full profile is attached, but here's the top 10:

Samples: 3K of event 'cycles', Event count (approx.): 2332550418

+ 24.15% postmaster postgres [.] hash_search_with_hash_value
+ 10.55% postmaster postgres [.] LWLockAcquireCommon
+ 7.12% postmaster postgres [.] hash_any
+ 6.77% postmaster postgres [.] minmax_deform_tuple
+ 6.67% postmaster postgres [.] LWLockRelease
+ 5.55% postmaster postgres [.] AllocSetAlloc
+ 4.37% postmaster postgres [.] SetupLockInTable.isra.2
+ 2.79% postmaster postgres [.] LockRelease
+ 2.67% postmaster postgres [.] LockAcquireExtended
+ 2.54% postmaster postgres [.] mmgetbitmap

If you drill into those functions, you'll see that most of the time
spent in hash_search_with_hash_value, LWLockAcquireCommon and hash_any
are coming from heavy-weight lock handling. At a rough estimate, about
1/3 of the CPU time is spent on LockTuple/UnlockTuple.

Maybe we don't care because it's fast enough anyway, but it just seems
like we're leaving a lot of money on the table. Because of that, and all
the other reasons already discussed, I strongly feel that this design
should be changed.

> 3. avoid MMTuple as it is just unnecessary extra complexity.
>
> The main thing that MMTuple adds is not the fact that we save 2 bytes
> by storing BlockNumber as is instead of within a TID field. Instead,
> it's that we can construct and deconstruct using our own design, which
> means we can use however many Datum entries we want and however many
> "null" flags. In normal heap and index tuples, there are always the
> same number of datum/nulls. In minmax, the number of nulls is twice the
> number of indexed columns; the number of datum values is determined by
> how many datum values are stored per opclass ("sortable" opclasses
> store 2 columns, but geometry would store only one).

Hmm. Why is the number of null bits 2x the number of indexed columns? I
would expect there to be one null bit per stored Datum.

(/me looks at the patch):

> /*
> * We need a double-length bitmap on an on-disk minmax index tuple;
> * the first half stores the "allnulls" bits, the second stores
> * "hasnulls".
> */

So, one bit means whether there are any heap tuples with a NULL in the
indexed column, and the other bit means if the value stored for that
column is a NULL. Does that mean that it's not possible to store a NULL
minimum, but non-NULL maximum, for a single column? I can't immediately
think of an example where you'd want to do that, but I'm also not
convinced that no opclass would ever want that. Individual bits are
cheap, so I'm inclined to rather have too many of them than regret later.

In any case, it should be documented in minmax_tuple.h what those
null-bits are and how they're laid out in the bitmap. The comment there
currently just says that there are "two null bits for each value stored"
(which isn't actually wrong, because you're storing two bits per indexed
column, not two bits per value stored (but I just suggested changing
that, after which the comment would be correct)).

PS. Please add regression tests. It would also be good to implement at
least one other opclass than the b-tree based ones, to make sure that
the code actually works with something else too. I'd suggest
implementing the bounding box opclass for points, that seems simple.

- Heikki

Attachment Content-Type Size
minmax-locktuple-is-expensive-profile text/plain 140.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2014-08-08 09:01:44 Re: Minmax indexes
Previous Message Jeff Davis 2014-08-08 08:16:32 Re: 9.5: Better memory accounting, towards memory-bounded HashAgg