Re: Abbreviated keys for Numeric

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Subject: Re: Abbreviated keys for Numeric
Date: 2015-03-23 19:04:34
Message-ID: CA+TgmoYwShrH=8Szk52V3eGY+PQQtLsDvwdLSUiWq9XeTGjz+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 21, 2015 at 2:58 AM, Andrew Gierth
<andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:
> Peter> I don't really buy it, either way. In what sense is a NULL value
> Peter> ever abbreviated? It isn't. Whatever about the cost model,
> Peter> that's the truth of the matter. There is always going to be a
> Peter> sort of tension in any cost model, between whether or not it's
> Peter> worth making it more sophisticated, and the extent to which
> Peter> tweaking the model is chasing diminishing returns.
>
> Comparisons between nulls and nulls, or between nulls and non-nulls, are
> cheap; only comparisons between non-nulls and non-nulls can be
> expensive.
>
> The purpose of abbreviation is to replace expensive comparisons by cheap
> ones where possible, and therefore the cost model used for abbreviation
> should ignore nulls entirely; all that matters is the number of non-null
> values and the probability of saving time by abbreviating them.
>
> So if you're sorting a million rows of which 900,000 are null and
> 100,000 contain 50 different non-null values, then the absolute time
> saved (not the proportion) by doing abbreviation should be on the same
> order as the absolute time saved by abbreviation when sorting just the
> 100,000 non-null rows.
>
> But if the cost model does 1,000,000/50 and gets 20,000, and decides
> "that's worse than my 1 in 10,000 target, I'll abort abbreviations",
> then you have sacrificed the time gain for no reason at all. This is
> what I mean by "spurious". This is why the cost model must compute the
> fraction as 100,000/50, ignoring the null inputs, if it's going to
> perform anything like optimally in the presence of nulls.

I think Andrew is right.

> Peter> I also think that your explanation of the encoding schemes was
> Peter> perfunctory.
>
> I'm interested in other opinions on that, because I find your
> replacement for it both confusingly organized and a bit misleading (for
> example saying the top bit is "wasted" is wrong, it's reserved because
> we need it free for the sign).
>
> (It is true that mine assumes that the reader knows what "excess-N"
> means, or can find out.)
>
> Here's mine, which is given as a single block comment:
>
> [ long explanatory comment ]
>
> Peter's, inline with the code (omitted here):
>
> [ long explanatory comment ]

In my opinion, Andrew's version is far clearer. Peter's version is
full of jargon that I can't understand. I could probably figure it
out with a few hours and a search engine, but that really shouldn't be
necessary.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Gierth 2015-03-23 19:06:51 Re: Exposing PG_VERSION_NUM in pg_config
Previous Message Fabien COELHO 2015-03-23 19:00:04 Re: PATCH: pgbench - merging transaction logs