Re: B-Tree support function number 3 (strxfrm() optimization)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-09-03 21:18:54
Message-ID: CA+Tgmoa+OjyVHV1aXHgnWTMY5s6ZsQogS9UrUy9J4RpaO-6E_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 2, 2014 at 4:41 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> HyperLogLog isn't sample-based - it's useful for streaming a set and
> accurately tracking its cardinality with fixed overhead.

OK.

>> Is it the right decision to suppress the abbreviated-key optimization
>> unconditionally on 32-bit systems and on Darwin? There's certainly
>> more danger, on those platforms, that the optimization could fail to
>> pay off. But it could also win big, if in fact the first character or
>> two of the string is enough to distinguish most rows, or if Darwin
>> improves their implementation in the future. If the other defenses
>> against pathological cases in the patch are adequate, I would think
>> it'd be OK to remove the hard-coded checks here and let those cases
>> use the optimization or not according to its merits in particular
>> cases. We'd want to look at what the impact of that is, of course,
>> but if it's bad, maybe those other defenses aren't adequate anyway.
>
> I'm not sure. Perhaps the Darwin thing is a bad idea because no one is
> using Macs to run real database servers. Apple haven't had a server
> product in years, and typically people only use Postgres on their Macs
> for development. We might as well have coverage of the new code for
> the benefit of Postgres hackers that favor Apple machines. Or, to look
> at it another way, the optimization is so beneficially that it's
> probably worth the risk, even for more marginal cases.
>
> 8 primary weights (the leading 8 bytes, frequently isomorphic to the
> first 8 Latin characters, regardless of whether or not they have
> accents/diacritics, or punctuation/whitespace) is twice as many as 4.
> But every time you add a byte of space to the abbreviated
> representation that can resolve a comparison, the number of
> unresolvable-without-tiebreak comparisons (in general) is, I imagine,
> reduced considerably. Basically, 8 bytes is way better than twice as
> good as 4 bytes in terms of its effect on the proportion of
> comparisons that are resolved only with abbreviated keys. Even still,
> I suspect it's still worth it to apply the optimization with only 4.
>
> You've seen plenty of suggestions on assessing the applicability of
> the optimization from me. Perhaps you have a few of your own.

My suggestion is to remove the special cases for Darwin and 32-bit
systems and see how it goes.

> That wouldn't be harmless - it would probably result in incorrect
> answers in practice, and would certainly be unspecified. However, I'm
> not reading uninitialized bytes. I call memset() so that in the event
> of the final strxfrm() blob being less than 8 bytes (which can happen
> even on glibc with en_US.UTF-8). It cannot be harmful to memcmp()
> every Datum byte if the remaining bytes are always initialized to NUL.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2014-09-03 21:19:29 Re: PL/pgSQL 1.2
Previous Message Robert Haas 2014-09-03 21:17:17 Re: Need Multixact Freezing Docs