RE: BUG #16031: Group by returns duplicate groups

From: David Raymond <David(dot)Raymond(at)tomtom(dot)com>
To: "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: BUG #16031: Group by returns duplicate groups
Date: 2019-10-02 18:36:33
Message-ID: AM6PR07MB6022C91A384768FAB540B4C0879C0@AM6PR07MB6022.eurprd07.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

As an update, I've currently got a dump that consistently shows weirdness when loaded. It's just the "name" field, has 1.3 million records, is 15 MB zipped, and has things garbled enough that I don't mind sending it.

How small does it need to be before it's good to send to someone?

Output after re-loading the table:

testing=> \d+ weird_grouping
Table "name_stuff.weird_grouping"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+------------------------+-----------+----------+---------+----------+--------------+-------------
name | character varying(254) | | not null | | extended | |

testing=> select count(*), count(distinct name) from weird_grouping;
count | count
-----------+-----------
1,297,265 | 1,176,103
(1 row)

Time: 7616.186 ms (00:07.616)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) select count(*), count(distinct name) from weird_grouping;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=10187.10..10187.11 rows=1 width=16) (actual time=7805.463..7805.463 rows=1 loops=1)
Output: count(*), count(DISTINCT name)
Buffers: shared hit=2080 read=6236, temp read=5484 written=5506
-> Seq Scan on name_stuff.weird_grouping (cost=0.00..9563.40 rows=124740 width=516) (actual time=0.031..128.203 rows=1297265 loops=1)
Output: name
Buffers: shared hit=2080 read=6236
Planning Time: 0.027 ms
Execution Time: 7805.483 ms
(8 rows)

Time: 7805.822 ms (00:07.806)

testing=> with foo as (select name from weird_grouping group by name) select name from foo group by name having count(*) > 1;
name
------
(0 rows)

Time: 1950.401 ms (00:01.950)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) with foo as (select name from weird_grouping group by name) select name from foo group by name having count(*) > 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=9882.25..9884.75 rows=67 width=516) (actual time=1466.738..1466.738 rows=0 loops=1)
Output: foo.name
Group Key: foo.name
Filter: (count(*) > 1)
Rows Removed by Filter: 1176101
Buffers: shared hit=2144 read=6172, temp written=4533
CTE foo
-> HashAggregate (cost=9875.25..9877.25 rows=200 width=516) (actual time=494.343..734.005 rows=1176101 loops=1)
Output: weird_grouping.name
Group Key: weird_grouping.name
Buffers: shared hit=2144 read=6172
-> Seq Scan on name_stuff.weird_grouping (cost=0.00..9563.40 rows=124740 width=516) (actual time=0.283..120.898 rows=1297265 loops=1)
Output: weird_grouping.name
Buffers: shared hit=2144 read=6172
-> CTE Scan on foo (cost=0.00..4.00 rows=200 width=516) (actual time=494.346..1035.185 rows=1176101 loops=1)
Output: foo.name
Buffers: shared hit=2144 read=6172, temp written=4533
Planning Time: 0.464 ms
Execution Time: 2230.238 ms
(19 rows)

Time: 2231.291 ms (00:02.231)

testing=> analyze verbose weird_grouping;
INFO: analyzing "name_stuff.weird_grouping"
INFO: "weird_grouping": scanned 8316 of 8316 pages, containing 1297265 live rows and 0 dead rows; 30000 rows in sample, 1297265 estimated total rows
ANALYZE
Time: 474.963 ms

testing=> select count(*), count(distinct name) from weird_grouping;
count | count
-----------+-----------
1,297,265 | 1,176,103
(1 row)

Time: 7449.983 ms (00:07.450)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) select count(*), count(distinct name) from weird_grouping;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=27774.98..27774.99 rows=1 width=16) (actual time=7754.578..7754.578 rows=1 loops=1)
Output: count(*), count(DISTINCT name)
Buffers: shared hit=2240 read=6076, temp read=5484 written=5506
-> Seq Scan on name_stuff.weird_grouping (cost=0.00..21288.65 rows=1297265 width=20) (actual time=0.036..102.144 rows=1297265 loops=1)
Output: name
Buffers: shared hit=2240 read=6076
Planning Time: 0.031 ms
Execution Time: 7754.598 ms
(8 rows)

Time: 7754.902 ms (00:07.755)

testing=> with foo as (select name from weird_grouping group by name) select name from foo group by name having count(*) > 1;
name
-------
DCT
DELTA
(2 rows)

Time: 9561.382 ms (00:09.561)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) with foo as (select name from weird_grouping group by name) select name from foo group by name having count(*) > 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=209783.95..209786.45 rows=67 width=516) (actual time=8591.210..8686.132 rows=2 loops=1)
Output: foo.name
Group Key: foo.name
Filter: (count(*) > 1)
Rows Removed by Filter: 1176099
Buffers: shared hit=2304 read=6012, temp read=7800 written=12363
CTE foo
-> Group (cost=179613.72..186100.05 rows=947356 width=20) (actual time=6416.900..7842.634 rows=1176103 loops=1)
Output: weird_grouping.name
Group Key: weird_grouping.name
Buffers: shared hit=2304 read=6012, temp read=7800 written=7830
-> Sort (cost=179613.72..182856.89 rows=1297265 width=20) (actual time=6416.899..7665.158 rows=1297265 loops=1)
Output: weird_grouping.name
Sort Key: weird_grouping.name
Sort Method: external merge Disk: 39048kB
Buffers: shared hit=2304 read=6012, temp read=7800 written=7830
-> Seq Scan on name_stuff.weird_grouping (cost=0.00..21288.65 rows=1297265 width=20) (actual time=0.294..137.486 rows=1297265 loops=1)
Output: weird_grouping.name
Buffers: shared hit=2304 read=6012
-> CTE Scan on foo (cost=0.00..18947.12 rows=947356 width=516) (actual time=6416.902..8105.771 rows=1176103 loops=1)
Output: foo.name
Buffers: shared hit=2304 read=6012, temp read=7800 written=12363
Planning Time: 0.258 ms
Execution Time: 10305.891 ms
(24 rows)

Time: 10306.990 ms (00:10.307)

testing=>

-----Original Message-----
From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Sent: Tuesday, October 1, 2019 3:27 PM
To: David Raymond <David(dot)Raymond(at)tomtom(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16031: Group by returns duplicate groups

>>>>> "David" == David Raymond <David(dot)Raymond(at)tomtom(dot)com> writes:

David> I checked for other places where there was weirdness going on
David> and there are more.

You could try something like,

select name, lname
from (select name, lag(name) over (order by name) as lname
from big_table) s
where name < lname;

That should show all cases where the sort order was inconsistent with
the < operator (which obviously should never happen).

However, there's not much more I can do to help with this, since I don't
use Windows myself and have no useful access to any Windows system. You
might try and cut down the data to the smallest set that shows
inconsistencies using the above; particularly relevant is whether the
problem only shows up for external merge sorts or whether it happens for
in-memory sorts too.

--
Andrew (irc:RhodiumToad)

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2019-10-03 06:38:33 BUG #16035: STATEMENT_TIMEOUT not working when we have single quote usage inside CTE which is used in inner sql
Previous Message Daniel Verite 2019-10-02 17:47:49 Re: BUG #16034: `\set ECHO all` doesn't work for \e command