BUG #16031: Group by returns duplicate groups

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: david(dot)raymond(at)tomtom(dot)com
Subject: BUG #16031: Group by returns duplicate groups
Date: 2019-09-30 18:12:27
Message-ID: 16031-4ef55395a5fbb687@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16031
Logged by: David Raymond
Email address: david(dot)raymond(at)tomtom(dot)com
PostgreSQL version: 11.5
Operating system: Windows 10
Description:

I have a large table which I'm running a query on: grouping on a single
field, returning only that one field, and filtering with a HAVING clause.
The results I'm getting back, however, contain a duplicate entry for one of
the returned values, which I believe should never be able to happen.

I'm working on dropping as many extra fields and records as I can while
still getting the bad result, so that it's small enough to send as a test
case, but figured I'd post this in the meantime to get any advice. I've
done a dump of the table and then restored it to a new table, and I still
get the same weird results. On the original version of the table I had
clustered it on the index that starts with the field being grouped on, but
it's still giving the bad groups after a dump and restore.

I'm running 11.5 on Windows 10 (Enterprise DB installer) and unfortunately
don't have the resources to build a new version myself if you come up with
a patch (which is also why I'm hoping to shrink this down to where others
can test it).
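In case it helps in the meantime, the shape of the reduced test case I'm aiming for would be something like the following (table and column names are the anonymized ones from the transcript below; the data that actually triggers the duplicate is what I'm still shrinking):

```sql
-- Sketch of the reduced test case: schema only, with just the
-- grouped column and a couple of the array-member columns.
CREATE TABLE small_repro (
    name        varchar(254) NOT NULL,
    arr_field_1 varchar(254),
    arr_field_2 varchar(254)
);

-- Same query shape as the failing one: group on a single column,
-- filter with HAVING on a count(DISTINCT ARRAY[...]) aggregate.
SELECT name
FROM small_repro
GROUP BY name
HAVING count(DISTINCT ARRAY[arr_field_1, arr_field_2]) > 1;
```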

Here's output from psql:

testing=> select version();
version
------------------------------------------------------------
PostgreSQL 11.5, compiled by Visual C++ build 1914, 64-bit
(1 row)

Time: 0.272 ms

testing=> \d+ big_table
                                          Table "name_stuff.big_table"
   Column    |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
-------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 id_1        | uuid                   |           | not null |         | plain    |              |
 field_2     | uuid                   |           |          |         | plain    |              |
 name        | character varying(254) |           | not null |         | extended |              |
 field_4     | character varying(254) |           |          |         | extended |              |
 field_5     | numeric(2,0)           |           | not null |         | main     |              |
 field_6     | character varying(4)   |           |          |         | extended |              |
 field_7     | character varying(3)   |           | not null |         | extended |              |
 field_8     | character varying(3)   |           |          |         | extended |              |
 arr_field_1 | character varying(254) |           |          |         | extended |              |
 arr_field_2 | character varying(254) |           |          |         | extended |              |
 arr_field_3 | character varying(254) |           |          |         | extended |              |
 arr_field_4 | character varying(254) |           |          |         | extended |              |
 arr_field_5 | character varying(254) |           |          |         | extended |              |
 arr_field_6 | character varying(254) |           |          |         | extended |              |
 field_15    | boolean                |           |          |         | plain    |              |
 field_16    | boolean                |           |          |         | plain    |              |
 id_2        | text                   |           | not null |         | extended |              |
Indexes:
"big_table_pkey" PRIMARY KEY, btree (id_1, id_2)
"big_table_name_id_1_id_2_idx" btree (name, id_1, id_2)

testing=> select count(*) from big_table;
count
-------------
108,565,086
(1 row)

Time: 273770.205 ms (04:33.770)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) create
table bad_groups_1 as select name from big_table group by name having
count(distinct array[arr_field_1, arr_field_2, arr_field_3, arr_field_4,
arr_field_5, arr_field_6]) > 1;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=29317810.81..30149735.19 rows=487417 width=22) (actual time=758501.326..895894.082 rows=745024 loops=1)
   Output: name
   Group Key: big_table.name
   Filter: (count(DISTINCT ARRAY[big_table.arr_field_1, big_table.arr_field_2, big_table.arr_field_3, big_table.arr_field_4, big_table.arr_field_5, big_table.arr_field_6]) > 1)
   Rows Removed by Filter: 80610652
   Buffers: shared hit=2325 read=1515081, temp read=2464481 written=2467410
   ->  Sort  (cost=29317810.81..29589026.23 rows=108486168 width=57) (actual time=758493.476..819035.136 rows=108565086 loops=1)
         Output: name, arr_field_1, arr_field_2, arr_field_3, arr_field_4, arr_field_5, arr_field_6
         Sort Key: big_table.name
         Sort Method: external merge  Disk: 4174488kB
         Buffers: shared hit=2317 read=1515081, temp read=2464481 written=2467410
         ->  Seq Scan on name_stuff.big_table  (cost=0.00..2602259.68 rows=108486168 width=57) (actual time=23.216..119113.708 rows=108565086 loops=1)
               Output: name, arr_field_1, arr_field_2, arr_field_3, arr_field_4, arr_field_5, arr_field_6
               Buffers: shared hit=2317 read=1515081
Planning Time: 0.196 ms
Execution Time: 897276.109 ms
(16 rows)

Time: 897285.808 ms (14:57.286)

testing=> \d+ bad_groups_1
                               Table "name_stuff.bad_groups_1"
 Column |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
--------+------------------------+-----------+----------+---------+----------+--------------+-------------
 name   | character varying(254) |           |          |         | extended |              |

testing=> select count(*), count(distinct name) from bad_groups_1;
count | count
---------+---------
745,024 | 745,023
(1 row)

Time: 899.273 ms

testing=> select name from bad_groups_1 group by name having count(*) > 1;
name
------
DK
(1 row)

Time: 337.663 ms
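One more thing I can check while shrinking this down (just a diagnostic guess on my part): whether the two 'DK' group values are actually byte-for-byte identical, to rule out invisible differences such as trailing spaces or different byte sequences that merely display the same:

```sql
-- If octet_length or md5 differ between the two rows, the "duplicate"
-- groups are really two distinct byte strings; if they match, the
-- grouping itself produced a genuine duplicate.
SELECT name, octet_length(name), md5(name)
FROM bad_groups_1
WHERE name = 'DK';
```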

testing=>
