Skip site navigation (1) Skip section navigation (2)

Re: Memory Issue with array_agg?

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Robert Sosinski <rsosinski(at)ticketevolution(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org >> PG-General Mailing List" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Memory Issue with array_agg?
Date: 2013-08-20 17:49:13
Message-ID: CAFj8pRBv0Wms3WL-aLiFBhXtE4Sg565SnHB84L-PRDFg4SxkFQ@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-general
Hello

It is strange. I am trying to simulate it without success. On 1 M rows
where every id is 2 times duplicated

processing string_agg .. cca 30MB

processing array_agg cca 32MB

postgres=# create table foo(a int, b varchar);
CREATE TABLE
postgres=# insert into foo select i, md5(i::text) from
generate_series(1,1000000) g(i);
INSERT 0 1000000
postgres=# insert into foo select i, md5(i::text) from
generate_series(1,1000000) g(i);
INSERT 0 1000000
postgres=# CREATE INDEX on foo(b);
CREATE INDEX
postgres=# ANALYZE foo;
ANALYZE
postgres=# explain analyze select string_agg(a::text,',') from foo group by
b;
                                                         QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=410045.19..447831.37 rows=1022895 width=37) (actual
time=10195.972..14993.493 rows=1000000 loops=1)
   ->  Sort  (cost=410045.19..415045.19 rows=2000000 width=37) (actual
time=10195.944..13659.985 rows=2000000 loops=1)
         Sort Key: b
         Sort Method: external merge  Disk: 97768kB
         ->  Seq Scan on foo  (cost=0.00..36667.00 rows=2000000 width=37)
(actual time=0.018..321.197 rows=2000000 loops=1)
 Total runtime: 15066.397 ms
(6 rows)

postgres=# explain analyze select array_agg(a::text) from foo group by b;
                                                         QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=410045.19..447831.37 rows=1022895 width=37) (actual
time=10062.095..15697.755 rows=1000000 loops=1)
   ->  Sort  (cost=410045.19..415045.19 rows=2000000 width=37) (actual
time=10062.059..13613.300 rows=2000000 loops=1)
         Sort Key: b
         Sort Method: external merge  Disk: 97768kB
         ->  Seq Scan on foo  (cost=0.00..36667.00 rows=2000000 width=37)
(actual time=0.029..311.423 rows=2000000 loops=1)
 Total runtime: 15799.226 ms
(6 rows)

Regards

Pavel


2013/8/19 Robert Sosinski <rsosinski(at)ticketevolution(dot)com>

> At the moment, all guids are distinct, however before I zapped the
> duplicates, there were 280 duplicates.
>
> Currently, there are over 2 million distinct guids.
>
> -Robert
>
>
> On Mon, Aug 19, 2013 at 11:12 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>wrote:
>
>>
>>
>>
>> 2013/8/19 Robert Sosinski <rsosinski(at)ticketevolution(dot)com>
>>
>>> Hi Pavel,
>>>
>>> What kind of example do you need?  I cant give you the actual data I
>>> have in the table, but I can give you an example query and the schema
>>> attached below.  From there, I would just put in 2 million rows worth 1.2
>>> Gigs of data.  Average size of the the extended columns (using the
>>> pg_column_size function) in bytes are:
>>>
>>> guid: 33
>>> name: 2.41
>>> currency: 4
>>> fields: 120.32
>>>
>>> example query:
>>>
>>> -- find duplicate records using a guid
>>> select guid, array_agg(id) from orders group by guid;
>>>
>>
>> how much distinct guid is there, and how much duplicates
>>
>> ??
>>
>> regards
>>
>> Pavel
>>
>>
>>
>>>
>>> example schema:
>>>                                      Table "public.things"
>>>
>>>    Column   |            Type             |
>>>  Modifiers                      | Storage  | Stats target | Description
>>>
>>> ------------+-----------------------------+-----------------------------------------------------+----------+--------------+-------------
>>>  id         | integer                     | not null default
>>> nextval('things_id_seq'::regclass) | plain    |              |
>>>  version    | integer                     | not null
>>>                        | plain    |              |
>>>  created_at | timestamp without time zone | not null
>>>                        | plain    |              |
>>>  updated_at | timestamp without time zone | not null
>>>                        | plain    |              |
>>>  foo_id     | integer                     | not null
>>>                        | plain    |              |
>>>  bar_id     | integer                     | not null
>>>                        | plain    |              |
>>>  baz_id     | integer                     | not null
>>>                        | plain    |              |
>>>  guid       | character varying           | not null
>>>                        | extended |              |
>>>  name       | character varying           | not null
>>>                        | extended |              |
>>>  price      | numeric(12,2)               | not null
>>>                        | main     |              |
>>>  currency   | character varying           | not null
>>>                        | extended |              |
>>>  amount     | integer                     | not null
>>>                        | plain    |              |
>>>  the_date   | date                        | not null
>>>                        | plain    |              |
>>>  fields     | hstore                      |
>>>                         | extended |              |
>>> Indexes:
>>>     "things_pkey" PRIMARY KEY, btree (id)
>>>     "things_foo_id_idx" btree (foo_id)
>>>     "things_bar_id_idx" btree (bar_id)
>>>     "things_baz_id_idx" btree (baz_id)
>>>     "things_guid_uidx" UNIQUE, btree (guid)
>>>     "things_lpad_lower_name_eidx" btree (lpad(lower(name::text), 10,
>>> '0'::text))
>>>     "things_price_idx" btree (price)
>>>
>>> Foreign-key constraints:
>>>     "things_foo_id_fkey" FOREIGN KEY (foo_id) REFERENCES foos(id)
>>>     "things_bar_id_fkey" FOREIGN KEY (bar_id) REFERENCES bars(id)
>>>     "things_baz_id_fkey" FOREIGN KEY (baz_id) REFERENCES bazs(id)
>>> Triggers:
>>>     timestamps_trig BEFORE INSERT OR UPDATE ON things FOR EACH ROW
>>> EXECUTE PROCEDURE timestamps_tfun()
>>>
>>> Let me know if you need anything else.
>>>
>>> Thanks,
>>>
>>>
>>> On Mon, Aug 19, 2013 at 3:29 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>wrote:
>>>
>>>> Hello
>>>>
>>>> please, can you send some example or test?
>>>>
>>>> Regards
>>>>
>>>> Pavel Stehule
>>>>
>>>>
>>>> 2013/8/19 Robert Sosinski <rsosinski(at)ticketevolution(dot)com>
>>>>
>>>>> When using array_agg on a large table, memory usage seems to spike up
>>>>> until Postgres crashes with the following error:
>>>>>
>>>>> 2013-08-17 18:41:02 UTC [2716]: [2] WARNING: terminating connection
>>>>> because of crash of another server process
>>>>> 2013-08-17 18:41:02 UTC [2716]: [3] DETAIL: The postmaster has
>>>>> commanded this server process to roll back the current transaction and
>>>>> exit, because another server process exited abnormally and possibly
>>>>> corrupted shared memory.
>>>>> 2013-08-17 18:41:02 UTC [2716]: [4] HINT: In a moment you should be
>>>>> able to reconnect to the database and repeat your command.
>>>>>
>>>>> I've definitely isolated it down to using array_agg, as when I changed
>>>>> the query to use string_agg, it worked fine.  I also tried using array_agg
>>>>> on a few different queries, all yielding the same issue.  Swapping in
>>>>> string_agg fixed the issue once more.
>>>>>
>>>>> This particular table has over 2 million rows and is 1.2 Gigs, and
>>>>> when I ran the query while viewing htop, the virtual size of the
>>>>> Postgres process ballooned to 13.9G until crashing.
>>>>>
>>>>> The version of Postgres I am using is: PostgreSQL 9.2.4 on
>>>>> x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1)
>>>>> 4.7.2, 64-bit
>>>>>
>>>>> Any help would be much appreciated, thanks!
>>>>>
>>>>> -Robert
>>>>>
>>>>
>>>>
>>>
>>
>

In response to

pgsql-general by date

Next:From: Pavel StehuleDate: 2013-08-20 17:53:29
Subject: Re: Memory Issue with array_agg?
Previous:From: Adrian KlaverDate: 2013-08-20 13:23:05
Subject: Re: Handling of tz-aware literals in non-tz-aware fields

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group