RE: BUG #16031: Group by returns duplicate groups

From: David Raymond <David(dot)Raymond(at)tomtom(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: BUG #16031: Group by returns duplicate groups
Date: 2019-10-01 14:55:37
Message-ID: VI1PR07MB6029A84A61C5A88328B50AC3879D0@VI1PR07MB6029.eurprd07.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

psql output for those two queries, along with explain output, pasted below.

testing=> set enable_indexscan = off;
SET
Time: 0.265 ms
testing=> set enable_bitmapscan = off;
SET
Time: 0.236 ms
testing=> select count(*) c0,
testing-> count(*) filter (where bttextcmp(name,'DK') < 0) c1_lt,
testing-> count(*) filter (where bttextcmp(name,'DK') = 0) c1_eq,
testing-> count(*) filter (where bttextcmp(name,'DK') > 0) c1_gt,
testing-> count(*) filter (where bttextcmp('DK',name) > 0) c2_lt,
testing-> count(*) filter (where bttextcmp('DK',name) = 0) c2_eq,
testing-> count(*) filter (where bttextcmp('DK',name) < 0) c2_gt
testing-> from big_table;
c0 | c1_lt | c1_eq | c1_gt | c2_lt | c2_eq | c2_gt
-------------+------------+-------+------------+------------+-------+------------
108,565,086 | 27,900,023 | 25 | 80,665,038 | 27,900,023 | 25 | 80,665,038
(1 row)

Time: 311710.895 ms (05:11.711)
testing=> explain (analyze, verbose, costs, buffers, timing, summary) select count(*) c0,
testing-> count(*) filter (where bttextcmp(name,'DK') < 0) c1_lt,
testing-> count(*) filter (where bttextcmp(name,'DK') = 0) c1_eq,
testing-> count(*) filter (where bttextcmp(name,'DK') > 0) c1_gt,
testing-> count(*) filter (where bttextcmp('DK',name) > 0) c2_lt,
testing-> count(*) filter (where bttextcmp('DK',name) = 0) c2_eq,
testing-> count(*) filter (where bttextcmp('DK',name) < 0) c2_gt
testing-> from big_table;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=4119436.80..4119436.81 rows=1 width=56) (actual time=315815.338..315815.338 rows=1 loops=1)
Output: count(*), count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) < 0)), count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) = 0)), count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) > 0)), count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) > 0)), count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) = 0)), count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) < 0))
Buffers: shared hit=258 read=1517140
-> Gather (cost=4119436.55..4119436.76 rows=2 width=56) (actual time=315814.594..315834.986 rows=3 loops=1)
Output: (PARTIAL count(*)), (PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) < 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) = 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) > 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) > 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) = 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) < 0)))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=258 read=1517140
-> Partial Aggregate (cost=4118436.55..4118436.56 rows=1 width=56) (actual time=315571.326..315571.326 rows=1 loops=3)
Output: PARTIAL count(*), PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) < 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) = 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) > 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) > 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) = 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) < 0))
Buffers: shared hit=258 read=1517140
Worker 0: actual time=315491.284..315491.284 rows=1 loops=1
Buffers: shared hit=80 read=324165
Worker 1: actual time=315411.542..315411.542 rows=1 loops=1
Buffers: shared hit=81 read=592546
-> Parallel Seq Scan on name_stuff.big_table (cost=0.00..1969752.53 rows=45235453 width=22) (actual time=8.834..245716.269 rows=36188362 loops=3)
Output: id_1, field_2, name, field_4, field_5, field_6, field_7, field_8, arr_field_1, arr_field_2, arr_field_3, arr_field_4, arr_field_5, arr_field_6, field_15, field_16, id_2
Buffers: shared hit=258 read=1517140
Worker 0: actual time=13.873..213354.668 rows=23076339 loops=1
Buffers: shared hit=80 read=324165
Worker 1: actual time=0.232..260652.945 rows=42461913 loops=1
Buffers: shared hit=81 read=592546
Planning Time: 1.746 ms
Execution Time: 315835.098 ms
(24 rows)

Time: 315860.795 ms (05:15.861)
testing=> with sd as (select name, row_number() over (order by name) rnum
testing(> from big_table)
testing-> select name from sd
testing-> where rnum >= (select min(rnum) from sd where name='DK')
testing-> and rnum <= (select max(rnum) from sd where name='DK')
testing-> and name <> 'DK';
name
-----------------------------------
Dk'bus Marine
Dk's Auto's
Dk's Bar & Grill
Dk's Barbers & Stylist
Dk's Beach Boutique
Dk's Cabinets & Countertops
Dk's Cleaning Service
Dk's Clothing
Dk's Communications
Dk's Dancewear & Fitnesswear
Dk's Dancewear Boutique
Dk's Discount Dance & Fitnesswear
DK's Drywall Service
DK'S DUSTBUSTERS
Dk's Family Five Star Trophies
DK's Family Five Star Trophies
Dk's Food Mart
Dk'S Group Pte. Ltd.
Dk's Hair Designs
Dk's Hair Happenings
Dk's Hair Supply
Dk's Home Decor
DK's Informática
Dk's Janitorial
DK's Liquors
Dk's Market
Dk's Moda Masculina
Dk's Nails And Spa
DK's Pawn Shop
Dk's Pet Grooming
DK's Quality Service
DK's Restoration
Dk's Sports Center
Dk's Statuary
Dk's Style Hut
Dk's Temiskaming Shore Taxi
Dk's Towing
DK's Travel
Dk'Style
DK'Z Car Wash
Dk-
(41 rows)

Time: 848889.096 ms (14:08.889)
testing=> explain (analyze, verbose, costs, buffers, timing, summary) with sd as (select name, row_number() over (order by name) rnum
testing(> from big_table)
testing-> select name from sd
testing-> where rnum >= (select min(rnum) from sd where name='DK')
testing-> and rnum <= (select max(rnum) from sd where name='DK')
testing-> and name <> 'DK';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on sd (cost=30560557.18..33546097.10 rows=540111 width=516) (actual time=794958.814..805397.008 rows=41 loops=1)
Output: sd.name
Filter: ((sd.rnum >= $1) AND (sd.rnum <= $2) AND ((sd.name)::text <> 'DK'::text))
Rows Removed by Filter: 108565045
Buffers: shared hit=386 read=1517012, temp read=3202642 written=2621799
CTE sd
-> WindowAgg (cost=23772525.03..25672414.07 rows=108565088 width=30) (actual time=668354.380..753149.449 rows=108565086 loops=1)
Output: big_table.name, row_number() OVER (?)
Buffers: shared hit=386 read=1517012, temp read=2035717 written=2038337
-> Sort (cost=23772525.03..24043937.75 rows=108565088 width=22) (actual time=668354.361..724059.209 rows=108565086 loops=1)
Output: big_table.name
Sort Key: big_table.name
Sort Method: external merge Disk: 3453888kB
Buffers: shared hit=386 read=1517012, temp read=2035717 written=2038337
-> Seq Scan on name_stuff.big_table (cost=0.00..2603048.88 rows=108565088 width=22) (actual time=18.886..106731.751 rows=108565086 loops=1)
Output: big_table.name
Buffers: shared hit=386 read=1517012
InitPlan 2 (returns $1)
-> Aggregate (cost=2444071.54..2444071.55 rows=1 width=8) (actual time=111513.923..111513.923 rows=1 loops=1)
Output: min(sd_1.rnum)
Buffers: temp read=431506 written=583461
-> CTE Scan on sd sd_1 (cost=0.00..2442714.48 rows=542825 width=8) (actual time=28796.843..111513.916 rows=25 loops=1)
Output: sd_1.name, sd_1.rnum
Filter: ((sd_1.name)::text = 'DK'::text)
Rows Removed by Filter: 108565061
Buffers: temp read=431506 written=583461
InitPlan 3 (returns $2)
-> Aggregate (cost=2444071.54..2444071.55 rows=1 width=8) (actual time=12135.464..12135.464 rows=1 loops=1)
Output: max(sd_2.rnum)
Buffers: temp read=583462
-> CTE Scan on sd sd_2 (cost=0.00..2442714.48 rows=542825 width=8) (actual time=2785.283..12135.457 rows=25 loops=1)
Output: sd_2.name, sd_2.rnum
Filter: ((sd_2.name)::text = 'DK'::text)
Rows Removed by Filter: 108565061
Buffers: temp read=583462
Planning Time: 0.159 ms
Execution Time: 807095.184 ms
(37 rows)

Time: 807103.763 ms (13:27.104)
testing=>

-----Original Message-----
From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Sent: Tuesday, October 1, 2019 1:41 AM
To: David Raymond <David(dot)Raymond(at)tomtom(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>; pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16031: Group by returns duplicate groups

>>>>> "David" == David Raymond <David(dot)Raymond(at)tomtom(dot)com> writes:

David> Looking possibly like indexing is part of the issue at the
David> moment.

Your original EXPLAIN didn't show any index scans being used...?

I can think of a possible explanation if there's some other value in the
big table which, due to some collation bug, is not consistently being
compared as < 'DK' or > 'DK'.

Unfortunately, we have two code paths for comparison, and one of them
can't easily be exercised directly from SQL, since it is only used for
sorts (and therefore also index builds).

Can you try these queries:

set enable_indexscan=off;
set enable_bitmapscan=off;

select count(*) c0,
count(*) filter (where bttextcmp(name,'DK') < 0) c1_lt,
count(*) filter (where bttextcmp(name,'DK') = 0) c1_eq,
count(*) filter (where bttextcmp(name,'DK') > 0) c1_gt,
count(*) filter (where bttextcmp('DK',name) > 0) c2_lt,
count(*) filter (where bttextcmp('DK',name) = 0) c2_eq,
count(*) filter (where bttextcmp('DK',name) < 0) c2_gt
from big_table;

with sd as (select name, row_number() over (order by name) rnum
from big_table)
select name from sd
where rnum >= (select min(rnum) from sd where name='DK')
and rnum <= (select max(rnum) from sd where name='DK')
and name <> 'DK';

--
Andrew (irc:RhodiumToad)

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andrew Gierth 2019-10-01 15:21:20 Re: BUG #16031: Group by returns duplicate groups
Previous Message Naima Shaikh 2019-10-01 12:54:31 psql: FATAL: database “<text>” does not exist