| From: | David Raymond <David(dot)Raymond(at)tomtom(dot)com> | 
|---|---|
| To: | Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> | 
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org> | 
| Subject: | RE: BUG #16031: Group by returns duplicate groups | 
| Date: | 2019-10-01 14:55:37 | 
| Message-ID: | VI1PR07MB6029A84A61C5A88328B50AC3879D0@VI1PR07MB6029.eurprd07.prod.outlook.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-bugs | 
psql output for those two queries, along with explain output, pasted below.
testing=> set enable_indexscan = off;
SET
Time: 0.265 ms
testing=> set enable_bitmapscan = off;
SET
Time: 0.236 ms
testing=> select count(*) c0,
testing->        count(*) filter (where bttextcmp(name,'DK') < 0) c1_lt,
testing->        count(*) filter (where bttextcmp(name,'DK') = 0) c1_eq,
testing->        count(*) filter (where bttextcmp(name,'DK') > 0) c1_gt,
testing->        count(*) filter (where bttextcmp('DK',name) > 0) c2_lt,
testing->        count(*) filter (where bttextcmp('DK',name) = 0) c2_eq,
testing->        count(*) filter (where bttextcmp('DK',name) < 0) c2_gt
testing->   from big_table;
     c0      |   c1_lt    | c1_eq |   c1_gt    |   c2_lt    | c2_eq |   c2_gt
-------------+------------+-------+------------+------------+-------+------------
 108,565,086 | 27,900,023 |    25 | 80,665,038 | 27,900,023 |    25 | 80,665,038
(1 row)
Time: 311710.895 ms (05:11.711)
testing=> explain (analyze, verbose, costs, buffers, timing, summary) select count(*) c0,
testing->        count(*) filter (where bttextcmp(name,'DK') < 0) c1_lt,
testing->        count(*) filter (where bttextcmp(name,'DK') = 0) c1_eq,
testing->        count(*) filter (where bttextcmp(name,'DK') > 0) c1_gt,
testing->        count(*) filter (where bttextcmp('DK',name) > 0) c2_lt,
testing->        count(*) filter (where bttextcmp('DK',name) = 0) c2_eq,
testing->        count(*) filter (where bttextcmp('DK',name) < 0) c2_gt
testing->   from big_table;
                                                                                                                                                                                                                                                    QUERY PLAN                                                                                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=4119436.80..4119436.81 rows=1 width=56) (actual time=315815.338..315815.338 rows=1 loops=1)
   Output: count(*), count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) < 0)), count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) = 0)), count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) > 0)), count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) > 0)), count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) = 0)), count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) < 0))
   Buffers: shared hit=258 read=1517140
   ->  Gather  (cost=4119436.55..4119436.76 rows=2 width=56) (actual time=315814.594..315834.986 rows=3 loops=1)
         Output: (PARTIAL count(*)), (PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) < 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) = 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) > 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) > 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) = 0))), (PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) < 0)))
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=258 read=1517140
         ->  Partial Aggregate  (cost=4118436.55..4118436.56 rows=1 width=56) (actual time=315571.326..315571.326 rows=1 loops=3)
               Output: PARTIAL count(*), PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) < 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) = 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp((name)::text, 'DK'::text) > 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) > 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) = 0)), PARTIAL count(*) FILTER (WHERE (bttextcmp('DK'::text, (name)::text) < 0))
               Buffers: shared hit=258 read=1517140
               Worker 0: actual time=315491.284..315491.284 rows=1 loops=1
                 Buffers: shared hit=80 read=324165
               Worker 1: actual time=315411.542..315411.542 rows=1 loops=1
                 Buffers: shared hit=81 read=592546
               ->  Parallel Seq Scan on name_stuff.big_table  (cost=0.00..1969752.53 rows=45235453 width=22) (actual time=8.834..245716.269 rows=36188362 loops=3)
                     Output: id_1, field_2, name, field_4, field_5, field_6, field_7, field_8, arr_field_1, arr_field_2, arr_field_3, arr_field_4, arr_field_5, arr_field_6, field_15, field_16, id_2
                     Buffers: shared hit=258 read=1517140
                     Worker 0: actual time=13.873..213354.668 rows=23076339 loops=1
                       Buffers: shared hit=80 read=324165
                     Worker 1: actual time=0.232..260652.945 rows=42461913 loops=1
                       Buffers: shared hit=81 read=592546
 Planning Time: 1.746 ms
 Execution Time: 315835.098 ms
(24 rows)
Time: 315860.795 ms (05:15.861)
testing=> with sd as (select name, row_number() over (order by name) rnum
testing(>               from big_table)
testing-> select name from sd
testing->  where rnum >= (select min(rnum) from sd where name='DK')
testing->    and rnum <= (select max(rnum) from sd where name='DK')
testing->    and name <> 'DK';
               name
-----------------------------------
 Dk'bus Marine
 Dk's Auto's
 Dk's Bar & Grill
 Dk's Barbers & Stylist
 Dk's Beach Boutique
 Dk's Cabinets & Countertops
 Dk's Cleaning Service
 Dk's Clothing
 Dk's Communications
 Dk's Dancewear & Fitnesswear
 Dk's Dancewear Boutique
 Dk's Discount Dance & Fitnesswear
 DK's Drywall Service
 DK'S DUSTBUSTERS
 Dk's Family Five Star Trophies
 DK's Family Five Star Trophies
 Dk's Food Mart
 Dk'S Group Pte. Ltd.
 Dk's Hair Designs
 Dk's Hair Happenings
 Dk's Hair Supply
 Dk's Home Decor
 DK's Informática
 Dk's Janitorial
 DK's Liquors
 Dk's Market
 Dk's Moda Masculina
 Dk's Nails And Spa
 DK's Pawn Shop
 Dk's Pet Grooming
 DK's Quality Service
 DK's Restoration
 Dk's Sports Center
 Dk's Statuary
 Dk's Style Hut
 Dk's Temiskaming Shore Taxi
 Dk's Towing
 DK's Travel
 Dk'Style
 DK'Z Car Wash
 Dk-
(41 rows)
Time: 848889.096 ms (14:08.889)
testing=> explain (analyze, verbose, costs, buffers, timing, summary) with sd as (select name, row_number() over (order by name) rnum
testing(>               from big_table)
testing-> select name from sd
testing->  where rnum >= (select min(rnum) from sd where name='DK')
testing->    and rnum <= (select max(rnum) from sd where name='DK')
testing->    and name <> 'DK';
                                                                          QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on sd  (cost=30560557.18..33546097.10 rows=540111 width=516) (actual time=794958.814..805397.008 rows=41 loops=1)
   Output: sd.name
   Filter: ((sd.rnum >= $1) AND (sd.rnum <= $2) AND ((sd.name)::text <> 'DK'::text))
   Rows Removed by Filter: 108565045
   Buffers: shared hit=386 read=1517012, temp read=3202642 written=2621799
   CTE sd
     ->  WindowAgg  (cost=23772525.03..25672414.07 rows=108565088 width=30) (actual time=668354.380..753149.449 rows=108565086 loops=1)
           Output: big_table.name, row_number() OVER (?)
           Buffers: shared hit=386 read=1517012, temp read=2035717 written=2038337
           ->  Sort  (cost=23772525.03..24043937.75 rows=108565088 width=22) (actual time=668354.361..724059.209 rows=108565086 loops=1)
                 Output: big_table.name
                 Sort Key: big_table.name
                 Sort Method: external merge  Disk: 3453888kB
                 Buffers: shared hit=386 read=1517012, temp read=2035717 written=2038337
                 ->  Seq Scan on name_stuff.big_table  (cost=0.00..2603048.88 rows=108565088 width=22) (actual time=18.886..106731.751 rows=108565086 loops=1)
                       Output: big_table.name
                       Buffers: shared hit=386 read=1517012
   InitPlan 2 (returns $1)
     ->  Aggregate  (cost=2444071.54..2444071.55 rows=1 width=8) (actual time=111513.923..111513.923 rows=1 loops=1)
           Output: min(sd_1.rnum)
           Buffers: temp read=431506 written=583461
           ->  CTE Scan on sd sd_1  (cost=0.00..2442714.48 rows=542825 width=8) (actual time=28796.843..111513.916 rows=25 loops=1)
                 Output: sd_1.name, sd_1.rnum
                 Filter: ((sd_1.name)::text = 'DK'::text)
                 Rows Removed by Filter: 108565061
                 Buffers: temp read=431506 written=583461
   InitPlan 3 (returns $2)
     ->  Aggregate  (cost=2444071.54..2444071.55 rows=1 width=8) (actual time=12135.464..12135.464 rows=1 loops=1)
           Output: max(sd_2.rnum)
           Buffers: temp read=583462
           ->  CTE Scan on sd sd_2  (cost=0.00..2442714.48 rows=542825 width=8) (actual time=2785.283..12135.457 rows=25 loops=1)
                 Output: sd_2.name, sd_2.rnum
                 Filter: ((sd_2.name)::text = 'DK'::text)
                 Rows Removed by Filter: 108565061
                 Buffers: temp read=583462
 Planning Time: 0.159 ms
 Execution Time: 807095.184 ms
(37 rows)
Time: 807103.763 ms (13:27.104)
testing=>
-----Original Message-----
From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> 
Sent: Tuesday, October 1, 2019 1:41 AM
To: David Raymond <David(dot)Raymond(at)tomtom(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>; pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16031: Group by returns duplicate groups
>>>>> "David" == David Raymond <David(dot)Raymond(at)tomtom(dot)com> writes:
 David> Looking possibly like indexing is part of the issue at the
 David> moment.
Your original EXPLAIN didn't show any index scans being used...?
I can think of a possible explanation if there's some other value in the
big table which, due to some collation bug, is not consistently being
compared as < 'DK' or > 'DK'.
Unfortunately, we have two code paths for comparison, and one of them
can't easily be exercised directly from SQL, since it is only used for
sorts (and therefore also index builds).
Can you try these queries:
set enable_indexscan=off;
set enable_bitmapscan=off;
select count(*) c0,
       count(*) filter (where bttextcmp(name,'DK') < 0) c1_lt,
       count(*) filter (where bttextcmp(name,'DK') = 0) c1_eq,
       count(*) filter (where bttextcmp(name,'DK') > 0) c1_gt,
       count(*) filter (where bttextcmp('DK',name) > 0) c2_lt,
       count(*) filter (where bttextcmp('DK',name) = 0) c2_eq,
       count(*) filter (where bttextcmp('DK',name) < 0) c2_gt
  from big_table;
with sd as (select name, row_number() over (order by name) rnum
              from big_table)
select name from sd
 where rnum >= (select min(rnum) from sd where name='DK')
   and rnum <= (select max(rnum) from sd where name='DK')
   and name <> 'DK';
-- 
Andrew (irc:RhodiumToad)
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Andrew Gierth | 2019-10-01 15:21:20 | Re: BUG #16031: Group by returns duplicate groups | 
| Previous Message | Naima Shaikh | 2019-10-01 12:54:31 | psql: FATAL: database “<text>” does not exist |