Re: Disk-based hash aggregate's cost model

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Disk-based hash aggregate's cost model
Date: 2020-09-03 00:35:13
Message-ID: CAH2-WzmSOS9O_ko_pkgHJS0WfA-SOMWATUZuaVGc_ktPoK_DQg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> create table text10m(t text collate "C.UTF-8", i int, n numeric);
> insert into text10m select s.g::text, s.g, s.g::numeric from (select
> (random()*1000000000)::int as g from generate_series(1,10000000)) s;
> explain analyze select distinct t from text10m;

Note that you won't get what Postgres considers to be the C collation
unless you specify "collate C" -- "C.UTF-8" is the C collation exposed
by glibc. The difference matters a lot, because only the former can
use abbreviated keys (unless you manually #define TRUST_STRXFRM). And
even without abbreviated keys, "collate C" is probably still
significantly faster for other reasons.
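
For illustration, a variant of the quoted test that uses the built-in
C collation might look like the sketch below. The table name text10m_c
is made up for this example, and turning off enable_hashagg is only an
assumed way of steering the planner toward the sort-based plan so that
the two collations can be compared; it is not part of the original test.

```sql
-- Same data shape as the quoted test, but with Postgres' built-in "C"
-- collation instead of glibc's "C.UTF-8".
-- text10m_c is a hypothetical table name used only for this sketch.
create table text10m_c(t text collate "C", i int, n numeric);
insert into text10m_c select s.g::text, s.g, s.g::numeric from (select
(random()*1000000000)::int as g from generate_series(1,10000000)) s;
analyze text10m_c;

-- Assumption: disable hash aggregation so that DISTINCT is implemented
-- with a sort, which is where abbreviated keys come into play.
set enable_hashagg = off;
explain analyze select distinct t from text10m_c;
reset enable_hashagg;
```

Running the same explain analyze against the original text10m table
(with enable_hashagg off as well) should give a rough idea of how much
of the sort time is attributable to the collation.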

This doesn't undermine your point, because we don't take the
difference into account in cost_sort() -- even though abbreviated keys
will regularly make text sorts 2x-3x faster. My point is only that it
would be more accurate to say that the costing unfairly boosts sorts
on text with a non-C collation specifically. Though maybe not when an
ICU collation is used, since abbreviated keys are generally enabled
there.

--
Peter Geoghegan
