Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Maxim Ivanov <hi(at)yamlcoder(dot)me>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations
Date: 2020-03-22 09:32:58
Message-ID: CAFj8pRB4RfW52KyRoApBS2zcUEh7UTGgA4jf6E+uwRs8NWwDmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

ne 22. 3. 2020 v 10:12 odesílatel Maxim Ivanov <hi(at)yamlcoder(dot)me> napsal:

> Hi All,
>
> It is known, that collation "C" significantly speeds up string
> comparisons and as a result sorting. I was wondering, whether it is
> possible to use it regardless of collation set on a column in sorts not
> visible to users?
>
> Example I have in mind is sorting performed for GroupAggregate. Purpose
> of that sort is to bring equal values next to each other, so as long as:
> 1) user didn't request ORDER BY in addition to GROUP BY
> 2) source column has any deterministic collation (as per docs all
> builtin collations are deterministic)
>
> it seems to be possible to do sorting with any deterministic collation,
> regardless of what user specifid for the column being sorted. "C" collation
> is deterministic and fastest.
>
> In other words, couldn't PostgreSQL convert this:
>
> -> GroupAggregate (cost=15726557.87..22944558.69 rows=7200001 width=176)
> (actual time=490103.209..771536.389 rows=36000000 loops=1)
> Group Key: ec_180days.msn, ec_180days.to_date_time
> -> Sort (cost=15726557.87..15906557.89 rows=72000008 width=113)
> (actual time=490094.849..524854.662 rows=72000000 loops=1)
> Sort Key: ec_180days.msn, ec_180days.to_date_time
> Sort Method: external merge Disk: 7679136kB
>
> To this:
>
> -> GroupAggregate (cost=14988274.87..22206275.69 rows=7200001 width=155)
> (actual time=140497.729..421510.001 rows=36000000 loops=1)
> Group Key: ec_180days.msn, ec_180days.to_date_time
> -> Sort (cost=14988274.87..15168274.89 rows=72000008 width=92)
> (actual time=140489.807..174228.722 rows=72000000 loops=1)
> Sort Key: ec_180days.msn COLLATE "C", ec_180days.to_date_time
> Sort Method: external merge Disk: 7679136kB
>
>
> which is 3 times faster in my tests.
>

I had a same idea. It is possible only if default collation is
deterministic. Probably it will be less important if abbreviate sort will
be enabled, but it is disabled now.

p.s. can be interesting repeat your tests with ICU locale where abbreviate
sort is enabled.

Regards

Pavel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2020-03-22 10:34:38 Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM
Previous Message Maxim Ivanov 2020-03-22 09:11:45 optimisation? collation "C" sorting for GroupAggregate for all deterministic collations