Re: B-Tree support function number 3 (strxfrm() optimization)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-12-02 21:21:34
Message-ID: CAM3SWZTO7MzGLKSfvVXTN_wEt=cOJdXfgW2JqOsFu+_xOk4UqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 2, 2014 at 1:00 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I'd prefer not to have a #define in pg_config_manual.h. Only stuff
> that we expect a reasonably decent number of users to need to change
> should be in that file, and this is too marginal for that. If anybody
> other than the developers of the feature is disabling this, the whole
> thing is going to be ripped back out.

I agree. The patch either works well in general and it goes in, or it
doesn't and should be rejected. That doesn't mean that the standard
applied is that regressions are absolutely unacceptable, but the
standard shouldn't be too far off that. I feel pretty confident that
we'll be able to meet that standard, though, because database luminary
Jim Gray recommended this technique in about 1995. Even if the details
of what I have here could stand to be tweaked, there is no getting
around the fact that a good sort routine needs to strongly consider
locality. That was apparent even in 1995, but now it's a very major
trend.

Incidentally, I think that an under-appreciated possible source of
regressions here is that attributes abbreviated have a strong
physical/logical correlation. I could see a small regression for one
such case even though my cost model indicated that it should be very
profitable. On the other hand, on other occasions my cost model (i.e.
considering how good a proxy for full key cardinality abbreviated key
cardinality is) was quite pessimistic. Although, at least it was only
a small regression, even though the correlation was something like
0.95. And at least the sort will be very fast in any case.

You'll recall that Heikki's test case involved correlation like that,
even though it was mostly intended to make a point about the entropy
in abbreviated keys. Correlation was actually the most important
factor there. I think it might be generally true that it's the most
important factor, in practice more important even than capturing
sufficient entropy in the abbreviated key representation.

>> I'm not sure about that. I'd prefer to have tuplesort (and one or two
>> other sites) set the "abbreviation is possible in principle" flag.
>> Otherwise, sortsupport needs to assume that the leading attribute is
>> going to be the abbreviation-applicable one, which might not always be
>> true. Still, it's not as if I feel strongly about it.
>
> When wouldn't that be true?

It just feels a bit wrong to me. There might be a future in which we
want to use the datum1 field for a non-leading attribute. For example,
when it is known ahead of time that there are low cardinality integers
in the leading key/attribute. Granted, that's pretty speculative, but
then it's not as if I'm insisting that it must be done that way. I
defer to you.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2014-12-02 21:50:01 Re: using custom scan nodes to prototype parallel sequential scan
Previous Message Emre Hasegeli 2014-12-02 21:14:00 Re: Selectivity estimation for inet operators