From: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
To: Björn Wittich <Bjoern_Wittich(at)gmx(dot)de>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: query a table with lots of coulmns
Date: 2014-09-20 10:36:34
Message-ID: B6F6FD62F2624C4C9916AC0175D56D8828ACF793@jenmbs01.ad.intershop.net
Lists: pgsql-performance
>First of all, thanks for your fast and comprehensive help.
>
>The structure of my cache table is
>
>a text, b text NOT NULL, c text, d text, e timestamp without
>time zone DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer
>DEFAULT 0, s3 integer DEFAULT 0, ..., s512 DEFAULT 0
>additional constraints: primary key (b) , Unique(b), Unique(a)
>Indexes : Index on a, Index on b
This looks redundant, e.g. you don't need a separate unique constraint on b if you already have a primary key on it.
Can you post the complete table definition?
...
>One remark which might help: overall, 90-95% of the s1-s512 values
>are 0. I am only interested in values not equal to 0. Perhaps it would
>make sense to use an array or JSON and enumerate only the values not equal to 0.
Could you change that to replace the 0 values with NULLs?
This would greatly reduce your table space, as Postgres is very efficient about NULL storage:
it marks all null values in a bitmap within the row header, so you need only about one bit per NULL
instead of 4 bytes for each zero, and hence get rid of your I/O issue.
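For illustration, a minimal sketch of that change, assuming the cache table is literally named cachetable and the value columns s1 .. s512 (only the first three are spelled out here):

-- turn existing zeros into NULLs; NULLIF returns NULL when both
-- arguments are equal. Repeat (or generate) the assignment for all of s1 .. s512.
UPDATE cachetable
SET s1 = NULLIF(s1, 0),
    s2 = NULLIF(s2, 0),
    s3 = NULLIF(s3, 0);

-- new rows should then default to NULL rather than 0:
ALTER TABLE cachetable ALTER COLUMN s1 DROP DEFAULT;

Note that after such a full-table UPDATE the old row versions still take up space, so the table would need a VACUUM FULL (or a rewrite into a new table) before its size actually shrinks.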
regards,
Marc Mamin
________________________________________
From: pgsql-performance-owner(at)postgresql(dot)org [pgsql-performance-owner(at)postgresql(dot)org] on behalf of Björn Wittich [Bjoern_Wittich(at)gmx(dot)de]
Sent: Saturday, 20 September 2014 09:19
To: Josh Berkus; pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] query a table with lots of coulmns
First of all, thanks for your fast and comprehensive help.
The structure of my cache table is
a text, b text NOT NULL, c text, d text, e timestamp without
time zone DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer
DEFAULT 0, s3 integer DEFAULT 0, ..., s512 DEFAULT 0
additional constraints: primary key (b) , Unique(b), Unique(a)
Indexes : Index on a, Index on b
This table has 30 million rows (it will increase to 50 million in the future).
My working table is
b text, g integer
Indexes on b and g
This table has 5 million rows
Scenario:
What I want to achieve:
SELECT s1,s2,s3,...s512,g,d from <worktable> INNER JOIN <cachetable>
USING(b) ORDER BY g
The inner join will match at least 95% of the rows of the smaller
worktable, in this example 4.75 million rows.
Running this query takes several hours until I receive the first
results. Analyzing the query shows that the execution plan does two
sequential table scans, on the cache and the work table.
When I divide this huge statement into
SELECT s1,s2,s3,...s512,g,d from <worktable> INNER JOIN <cachetable>
USING(b) WHERE g BETWEEN 1 and 10000 ORDER BY g, SELECT
s1,s2,s3,...s512,g,d from <worktable> INNER JOIN <cachetable> USING(b)
WHERE g BETWEEN 10001 and 20000 ORDER BY g, ....
(I can do this because g is a unique and continuous id from 1 to N)
The result is fast, but firing parallel requests (4-8 in parallel)
slows down the retrieval.
The execution plan changes to use the indexes when "BETWEEN 1 AND 10000" is added.
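For reference, that chunking can be expressed once as a parameterized statement and then executed per range; a sketch, assuming the placeholders <worktable>/<cachetable> stand for real tables named worktable and cachetable:

PREPARE chunk(int, int) AS
SELECT s1, s2, s3, /* ... s512, */ g, d
FROM worktable INNER JOIN cachetable USING (b)
WHERE g BETWEEN $1 AND $2
ORDER BY g;

EXECUTE chunk(1, 10000);
EXECUTE chunk(10001, 20000);
-- ... and so on up to max(g)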
One remark which might help: overall, 90-95% of the s1-s512 values
are 0. I am only interested in values not equal to 0. Perhaps it would
make sense to use an array or JSON and enumerate only the values not equal to 0.
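A rough sketch of that idea, assuming the cache table is named cachetable and a PostgreSQL version that has jsonb_build_object and jsonb_strip_nulls (newer than what was current in this thread):

SELECT b,
       jsonb_strip_nulls(jsonb_build_object(
           's1', NULLIF(s1, 0),
           's2', NULLIF(s2, 0),
           's3', NULLIF(s3, 0)   -- ... continued through s512
       )) AS s_nonzero           -- keeps only the non-zero values, keyed by column name
FROM cachetable;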
Statistics on the large table:
table size: 80 GB
TOAST table size: 37 GB
size of indexes: 17 GB
Thanks for your help and ideas
Björn
On 2014-09-19 23:40, Josh Berkus wrote:
> On 09/19/2014 04:51 AM, Björn Wittich wrote:
>> I am relatively new to postgres. I have a table with 500 columns and
>> about 40 million rows. I call this a cache table, where one column is a unique
>> key (indexed) and the 499 columns (type integer) are some values
>> belonging to this key.
>>
>> Now I have a second (temporary) table (only 2 columns, one is the key of
>> my cache table) and I want to do an inner join between my temporary table
>> and the large cache table and export all matching rows. I found out
>> that the performance increases when I limit the join to lots of small
>> parts.
>> But it seems that the database needs a lot of disk I/O to gather all 499
>> data columns.
>> Is there a possibility to tell the database that all these columns are
>> always treated as tuples and I always want to get the whole row? Perhaps
>> the disk organization could then be optimized?
> PostgreSQL is already a row store, which means by default you're getting
> all of the columns, and the columns are stored physically adjacent to
> each other.
>
> If requesting only one or two columns is faster than requesting all of
> them, that's pretty much certainly due to transmission time, not disk
> IO. Otherwise, please post your schema (well, a truncated version) and
> your queries.
>
> BTW, in cases like yours I've used an INT array instead of 500 columns to
> good effect; it works slightly better with PostgreSQL's compression.
>
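A sketch of that suggestion applied to the schema above (names are assumed; s_values[1] would play the role of s1, and so on):

CREATE TABLE cachetable_arr (
    b        text PRIMARY KEY,
    a        text UNIQUE,
    c        text,
    d        text,
    e        timestamp without time zone DEFAULT now(),
    f        text,
    s_values integer[]   -- 512 elements; long, zero-heavy arrays are TOASTed and compressed
);

-- individual values remain reachable by position:
SELECT b, s_values[1] AS s1, s_values[2] AS s2 FROM cachetable_arr;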
--
Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance