Re: Abbreviated keys for Numeric

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Abbreviated keys for Numeric
Date: 2015-02-21 18:57:29
Message-ID: CAM3SWZR0eWv9zix4pAnTFP3=4ajyoXio5txcGUsZV9W+10X0Cg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 20, 2015 at 9:18 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> The gains for text are also very nice, although in this case that only
> happens for the smallest scale (1M rows), and for larger scales it's
> actually slower than current master :-(

That's odd. I have a hard time thinking of why the datum sort patch
could be at fault, though. I bet the cost model of the text
sortsupport routine is somehow hitting a snag on those larger sized
sets. They should be just as accelerated, and probably more so, than
your 1M sized set that was sped up 4x here.

Can you see what is output with debugging of text abbreviation turned
on? Put "#define DEBUG_ABBREV_KEYS" at the top of varlena.c and
rebuild. Report on the debug1 output, and see if and when abbreviation
is aborted.

I suspected that the cost model was too conservative (or, more
lightly, just too simplistic). I ought to revisit my patch to give the
ad-hoc cost model a sense of proportion about how far along we are,
which was previously deferred [1]. When there is a strong
physical/logical correlation, that can be essential.

Did you first index the text field, and then run CLUSTER for the
larger sized sets on that index (to test abbreviation)? That would
cause there to be a lot of abbreviated keys that seemed to poorly
capture the entropy of their underlying values, when in fact that was
entirely down to our only considering the first 10 tuples in a 100
million tuple set. Having some patience is important there, and a hint
at how far in we are gives the ad-hoc cost model a much better sense
of proportion...it then has a sense of how patient it should be.

[1] http://www.postgresql.org/message-id/CA+TgmoaSXpD73cOj-vSFRfk0nmxjAN6WOQ_Hd9SkmZbOTi+6CQ@mail.gmail.com
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2015-02-21 19:15:21 Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0
Previous Message Kevin Grittner 2015-02-21 18:34:51 Re: Query Rewrite with Postgres' materialized views