Re: Abbreviated keys for Numeric

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Subject: Re: Abbreviated keys for Numeric
Date: 2015-03-21 03:58:10
Message-ID: CAM3SWZTLZT8oNLd2TniXsX1SQ65_gMiOnFSn92X3X+cWz6nq1A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 20, 2015 at 7:10 PM, Andrew Gierth
<andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:
> Peter> Other than that, I've tried to keep things closer to the text
> Peter> opclass. For example, the cost model now has a few debugging
> Peter> traces (disabled by default). I have altered the ad-hoc cost
> Peter> model so that it no longer concerns itself with NULL inputs,
> Peter> which seemed questionable (not least since the abbreviation
> Peter> conversion function isn't actually called for NULL inputs. Any
> Peter> attempt to track the full count within numeric code therefore
> Peter> cannot work.).

> This is simply wrong. The reason why the cost model (in my version)
> tracks non-null values by having its own counter is precisely BECAUSE
> the passed-in memtupcount includes nulls, and therefore the code will
> UNDERESTIMATE the fraction of successfully abbreviated values if the
> comparison is based on memtupcount.

Oh, right. It's the other way around in your original.

I don't really buy it, either way. In what sense is a NULL value ever
abbreviated? It isn't. Whatever about the cost model, that's the truth
of the matter. There is always going to be a sort of tension in any
cost model, between whether or not it's worth making it more
sophisticated, and the extent to which tweaking the model is chasing
diminishing returns.

> In your version, if there are 9999
> null values at the start of the input, then on the first non-null value
> after that, memtupcount will be 10000 and there will be only 1 distinct
> abbreviated value, causing abbreviation to spuriously abort.

By what objective standard is that spurious? As things stand, I
hesitate to get these ad-hoc cost models into the business of worrying
about sorts with many NULL values, because of the additional
complexity. Sorts with many NULL values, such as your example, have
always been very fast, but also very rare in the real world.

Sure, things might change in the event of many NULLs, and so you might
consider that it's worth hanging on at no extra cost. But these cost
models are really about preventing the very worst case. It seems
worthwhile to not over-complicate them. They're based on the
assumption that a sample of the first n values are representative of
the whole, which, in general, could certainly be false. We do what we
can. Your example has one abbreviated key in it, which is exactly
worthless among 9999 NULL values. If it is representative of the next
10K rows, or the next 100K, then we probably should abort. Maybe that
isn't exactly the right thing here, but if so that's only because
numeric abbreviation is relatively cheap. In general, amortizing the
cost of comparisons through encoding is a lousy strategy when there
will be so few real comparisons.

I'd like to hear what other people think. We could certainly consider
adding that back, since it isn't especially complicated. Perhaps I was
hasty there.

> The test to clamp the estimate to 1.0 is just nonsensical and serves no
> purpose whatever, and the comment for it is wrong.
>
> You should fix the text abbreviation code, not propagate your mistakes
> further.
>
> (BTW, there's an outright typo in your code, ';;' for ';' at the end of
> a line. Sloppy.)

We're really going to call out minor typos like that as sloppy? If so,
let me name a few of yours:

* Wrong ordering of header includes

* Trailing whitespace

* "...but this time is it he original weight in digit..." (not "it is"?)

I also think that your explanation of the encoding schemes was
perfunctory. And, the VARATT_IS_SHORT() hack that you added seemed
wholly unnecessary.

You'd better remind the committer that's going to "consider my
[Andrew's] original version unchanged in preference to this" to go
over these points again. Or you could try to work it out with me,
the reviewer, rather than behaving so petulantly.

> Peter> I also now allocate a buffer of scratch memory separately from
> Peter> the main sortsupport object - doing one allocation for all
> Peter> sortsupport state, bunched together as a buffer seemed like a
> Peter> questionable micro-optimization.
>
> It's yet another cache line... I admit I did not benchmark that choice,
> but then neither did you.

You're right, I didn't. There are two reasons why:

1) It doesn't work that way. I am not required to make sure that a
patch I'm reviewing doesn't take advantage of every possible
micro-optimization. Clarity is a more pressing concern. If I changed
existing code in the master branch, that would be another story.
You're the patch author here, remember? If it's such a loss, then
prove it.

2) This patch is extremely effective in general. Well done! It seemed
silly to worry about a micro-optimization like that, especially given
the current time pressures for *both* of us. It can always be
revisited.

> Peter> It seemed unwise to silently disable abbreviation when someone
> Peter> happened to build with DEC_DIGITS != 4. A static assertion now
> Peter> gives these unusual cases the opportunity to make an informed
> Peter> decision about either disabling abbreviation or not changing
> Peter> DEC_DIGITS in light of the performance penalty, in a
> Peter> self-documenting way.
>
> A) Nobody in their right minds is ever going to do that anyway
>
> B) Anybody who does that is either not concerned about performance or is
> concerned only about performance of the low-level numeric ops, and
> abbreviation is the last thing they're going to be worried about in
> either case.

You're probably right about that, but then if we do things my way, we
don't have to plaster #ifdef rubbish all over the place. That seems
like another distinct advantage to me (it's mostly what I was thinking
of, actually) - it makes sense to reduce the places that have defenses
against this to one if we can, because, as you point out, it'll
probably never come up anyway.

Get off your high horse. I was perfectly respectful in my comments to
you. I expect the same in return. If you insist on being so abrasive
at the slightest perceived provocation, no one will want to work with
you. This process is consensus driven. You can make objections without
being a jerk.
--
Peter Geoghegan
