Re: Memory Issue with array_agg?

From: Robert Sosinski <rsosinski(at)ticketevolution(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org >> PG-General Mailing List" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Memory Issue with array_agg?
Date: 2013-08-21 15:56:43
Message-ID: CAH1hji1ESS-B3iwXn2XQ5WpTtUHbYYXgV+bvMnsntzzqK9tcNg@mail.gmail.com
Lists: pgsql-general

Hi Pavel,

Here are the explains you asked for:

explain analyze select string_agg(id::text,',') from things group by guid;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=400357.78..433784.93 rows=1337086 width=37) (actual time=41434.485..53195.185 rows=2378626 loops=1)
   ->  Sort  (cost=400357.78..403700.49 rows=1337086 width=37) (actual time=41434.433..44992.736 rows=2378626 loops=1)
         Sort Key: guid
         Sort Method: quicksort  Memory: 284135kB
         ->  Seq Scan on things  (cost=0.00..264304.86 rows=1337086 width=37) (actual time=0.027..21429.179 rows=2378626 loops=1)
 Total runtime: 56295.362 ms
(6 rows)

explain analyze select array_agg(id::text) from things group by guid;
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=400357.78..433784.93 rows=1337086 width=37) (actual time=23953.922..38157.059 rows=2378626 loops=1)
   ->  Sort  (cost=400357.78..403700.49 rows=1337086 width=37) (actual time=23953.847..27527.316 rows=2378626 loops=1)
         Sort Key: guid
         Sort Method: quicksort  Memory: 284135kB
         ->  Seq Scan on things  (cost=0.00..264304.86 rows=1337086 width=37) (actual time=0.007..4941.752 rows=2378626 loops=1)
 Total runtime: 41280.897 ms
(6 rows)

These queries seem to run fine on the machine now, and memory is not
inflating. However, I just ran this one, and it blew up:

explain with t as (
  select id, guid, md5(concat_ws(':', fields -> 'a', fields -> 'b',
    fields -> 'c', fields -> 'd', fields -> 'e', foo_id::text))
  from things
)
select md5, count(id), array_agg(id) from t group by 1 having count(id) > 1;
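
For comparison, here is the same check with string_agg swapped in (a
sketch only; it is the same swap that stopped the memory growth in the
earlier queries, not something I have re-run on this exact box):

explain with t as (
  select id, guid, md5(concat_ws(':', fields -> 'a', fields -> 'b',
    fields -> 'c', fields -> 'd', fields -> 'e', foo_id::text))
  from things
)
select md5, count(id), string_agg(id::text, ',') from t group by 1
having count(id) > 1;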

-Robert

On Tue, Aug 20, 2013 at 1:53 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:

> Can you send an EXPLAIN result for both use cases?
>
> Pavel
>
>
> 2013/8/19 Robert Sosinski <rsosinski(at)ticketevolution(dot)com>
>
>> At the moment, all guids are distinct; however, before I zapped the
>> duplicates, there were 280 of them.
>>
>> Currently, there are over 2 million distinct guids.
>>
>> -Robert
>>
>>
>> On Mon, Aug 19, 2013 at 11:12 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>>
>>>
>>>
>>>
>>> 2013/8/19 Robert Sosinski <rsosinski(at)ticketevolution(dot)com>
>>>
>>>> Hi Pavel,
>>>>
>>>> What kind of example do you need? I can't give you the actual data I
>>>> have in the table, but I can give you an example query and the schema
>>>> attached below. From there, I would just put in 2 million rows' worth
>>>> of data, about 1.2 GB. Average sizes of the extended columns (using
>>>> the pg_column_size function), in bytes, are:
>>>>
>>>> guid: 33
>>>> name: 2.41
>>>> currency: 4
>>>> fields: 120.32
>>>>
>>>> example query:
>>>>
>>>> -- find duplicate records using a guid
>>>> select guid, array_agg(id) from orders group by guid;
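>>>>
>>>> To restrict the output to just the duplicated guids, the same query
>>>> can take a HAVING clause (a minimal sketch of the same pattern):
>>>>
>>>> select guid, array_agg(id) from orders group by guid
>>>> having count(id) > 1;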
>>>>
>>>
>>> How many distinct guids are there, and how many duplicates?
>>>
>>> regards
>>>
>>> Pavel
>>>
>>>
>>>
>>>>
>>>> example schema:
>>>> Table "public.things"
>>>>
>>>>    Column   |            Type             |                      Modifiers                       | Storage  | Stats target | Description
>>>> ------------+-----------------------------+------------------------------------------------------+----------+--------------+-------------
>>>>  id         | integer                     | not null default nextval('things_id_seq'::regclass)  | plain    |              |
>>>>  version    | integer                     | not null                                              | plain    |              |
>>>>  created_at | timestamp without time zone | not null                                              | plain    |              |
>>>>  updated_at | timestamp without time zone | not null                                              | plain    |              |
>>>>  foo_id     | integer                     | not null                                              | plain    |              |
>>>>  bar_id     | integer                     | not null                                              | plain    |              |
>>>>  baz_id     | integer                     | not null                                              | plain    |              |
>>>>  guid       | character varying           | not null                                              | extended |              |
>>>>  name       | character varying           | not null                                              | extended |              |
>>>>  price      | numeric(12,2)               | not null                                              | main     |              |
>>>>  currency   | character varying           | not null                                              | extended |              |
>>>>  amount     | integer                     | not null                                              | plain    |              |
>>>>  the_date   | date                        | not null                                              | plain    |              |
>>>>  fields     | hstore                      |                                                       | extended |              |
>>>> Indexes:
>>>> "things_pkey" PRIMARY KEY, btree (id)
>>>> "things_foo_id_idx" btree (foo_id)
>>>> "things_bar_id_idx" btree (bar_id)
>>>> "things_baz_id_idx" btree (baz_id)
>>>> "things_guid_uidx" UNIQUE, btree (guid)
>>>> "things_lpad_lower_name_eidx" btree (lpad(lower(name::text), 10,
>>>> '0'::text))
>>>> "things_price_idx" btree (price)
>>>>
>>>> Foreign-key constraints:
>>>> "things_foo_id_fkey" FOREIGN KEY (foo_id) REFERENCES foos(id)
>>>> "things_bar_id_fkey" FOREIGN KEY (bar_id) REFERENCES bars(id)
>>>> "things_baz_id_fkey" FOREIGN KEY (baz_id) REFERENCES bazs(id)
>>>> Triggers:
>>>> timestamps_trig BEFORE INSERT OR UPDATE ON things FOR EACH ROW
>>>> EXECUTE PROCEDURE timestamps_tfun()
>>>>
>>>> Let me know if you need anything else.
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Mon, Aug 19, 2013 at 3:29 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> please, can you send some example or test?
>>>>>
>>>>> Regards
>>>>>
>>>>> Pavel Stehule
>>>>>
>>>>>
>>>>> 2013/8/19 Robert Sosinski <rsosinski(at)ticketevolution(dot)com>
>>>>>
>>>>>> When using array_agg on a large table, memory usage seems to spike up
>>>>>> until Postgres crashes with the following error:
>>>>>>
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [2] WARNING: terminating connection
>>>>>> because of crash of another server process
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [3] DETAIL: The postmaster has
>>>>>> commanded this server process to roll back the current transaction and
>>>>>> exit, because another server process exited abnormally and possibly
>>>>>> corrupted shared memory.
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [4] HINT: In a moment you should be
>>>>>> able to reconnect to the database and repeat your command.
>>>>>>
>>>>>> I've definitely isolated it down to using array_agg, as when I
>>>>>> changed the query to use string_agg, it worked fine. I also tried using
>>>>>> array_agg on a few different queries, all yielding the same issue.
>>>>>> Swapping in string_agg fixed the issue once more.
>>>>>>
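>>>>>> The swap itself was just the aggregate call, along these lines (a
>>>>>> sketch of the pattern, not the exact production query):
>>>>>>
>>>>>> -- spikes memory until the backend crashes:
>>>>>> select guid, array_agg(id) from things group by guid;
>>>>>> -- completes fine:
>>>>>> select guid, string_agg(id::text, ',') from things group by guid;
>>>>>>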
>>>>>> This particular table has over 2 million rows and is 1.2 Gigs, and
>>>>>> when I ran the query while viewing htop, the virtual size of the
>>>>>> Postgres process ballooned to 13.9G until crashing.
>>>>>>
>>>>>> The version of Postgres I am using is: PostgreSQL 9.2.4 on
>>>>>> x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1)
>>>>>> 4.7.2, 64-bit
>>>>>>
>>>>>> Any help would be much appreciated, thanks!
>>>>>>
>>>>>> -Robert
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
