Re: Hash join on int takes 8..114 seconds

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: Andrus <kobruleht2(at)hot(dot)ee>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, PFC <lists(at)peufeu(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash join on int takes 8..114 seconds
Date: 2008-11-23 23:39:48
Message-ID: 4929E9C4.1010706@fuzzy.cz
Lists: pgsql-performance

> Scott,
>
> thank you.
>
>>> > work_mem = 512
>> This is very easy to try. You can change work_mem for just a single
>> session, and this can in some cases help performance quite a bit, and in
>> others not at all.
>> I would not recommend having it lower than at least 4MB on a server like
>> that unless you have a lot of concurrently active queries / connections.
>> To try it, simply use the SET command. To try out 32MB, just do:
>> SET work_mem = '32MB';
>
> 8.1.4 requires an int value.
> SET work_mem = 33554432;
> causes:
> ERROR: 33554432 is outside the valid range for parameter "work_mem" (64 .. 2097151)
>
> So max allowed value seems to be 2 MB
> I tested it while this server was idle, running both queries several
> times after a VACUUM FULL had completed.
>
> SET work_mem = 2097151;

No, not really. The work_mem value is specified in kilobytes, so you've
set it to 2GB :-)
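
For example, to get 32MB for the current session only, you'd do something
like this (just a sketch - on 8.1 the value is a plain integer, in kB):

    SET work_mem = 32768;   -- 32MB; 8.1 takes the value in kilobytes
    SHOW work_mem;          -- check what the server actually uses
    RESET work_mem;         -- drop back to the postgresql.conf default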

> set search_path to firma2,public;
> explain analyze
> SELECT sum(1)
> FROM dok JOIN rid USING (dokumnr)
> JOIN toode USING (toode)
> WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'
> "Aggregate (cost=177291.36..177291.37 rows=1 width=0) (actual
> time=5153.856..5153.859 rows=1 loops=1)"
> " -> Nested Loop (cost=73607.45..177229.96 rows=24561 width=0) (actual
> time=1395.935..5071.247 rows=21541 loops=1)"
> " -> Index Scan using toode_pkey on toode (cost=0.00..6.01 rows=1
> width=24) (actual time=0.078..0.087 rows=1 loops=1)"
> " Index Cond: ('X05'::bpchar = toode)"
> " -> Hash Join (cost=73607.45..176978.33 rows=24561 width=24)
> (actual time=1395.836..4911.425 rows=21541 loops=1)"
> " Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
> " -> Bitmap Heap Scan on rid (cost=4083.10..101802.05
> rows=270316 width=28) (actual time=238.578..2367.189 rows=278247 loops=1)"
> " Recheck Cond: (toode = 'X05'::bpchar)"
> " -> Bitmap Index Scan on rid_toode_idx
> (cost=0.00..4083.10 rows=270316 width=0) (actual time=150.868..150.868
> rows=278248 loops=1)"
> " Index Cond: (toode = 'X05'::bpchar)"
> " -> Hash (cost=69242.72..69242.72 rows=112651 width=4)
> (actual time=1146.081..1146.081 rows=110170 loops=1)"
> " -> Bitmap Heap Scan on dok (cost=1492.28..69242.72
> rows=112651 width=4) (actual time=46.210..696.803 rows=110170 loops=1)"
> " Recheck Cond: (kuupaev >= '2008-09-01'::date)"
> " -> Bitmap Index Scan on dok_kuupaev_idx
> (cost=0.00..1492.28 rows=112651 width=0) (actual time=33.938..33.938
> rows=110232 loops=1)"
> " Index Cond: (kuupaev >=
> '2008-09-01'::date)"
> "Total runtime: 5154.911 ms"
>
>
> SET work_mem to default;
> set search_path to firma2,public;
> explain analyze
> SELECT sum(1)
> FROM dok JOIN rid USING (dokumnr)
> JOIN toode USING (toode)
> WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'
> "Aggregate (cost=181869.36..181869.37 rows=1 width=0) (actual
> time=7807.867..7807.871 rows=1 loops=1)"
> " -> Nested Loop (cost=74048.45..181807.96 rows=24561 width=0) (actual
> time=2607.429..7728.138 rows=21541 loops=1)"
> " -> Index Scan using toode_pkey on toode (cost=0.00..6.01 rows=1
> width=24) (actual time=0.079..0.091 rows=1 loops=1)"
> " Index Cond: ('X05'::bpchar = toode)"
> " -> Hash Join (cost=74048.45..181556.33 rows=24561 width=24)
> (actual time=2607.332..7569.612 rows=21541 loops=1)"
> " Hash Cond: ("outer".dokumnr = "inner".dokumnr)"
> " -> Bitmap Heap Scan on rid (cost=4083.10..101802.05
> rows=270316 width=28) (actual time=1147.071..4528.845 rows=278247 loops=1)"
> " Recheck Cond: (toode = 'X05'::bpchar)"
> " -> Bitmap Index Scan on rid_toode_idx
> (cost=0.00..4083.10 rows=270316 width=0) (actual time=1140.337..1140.337
> rows=278248 loops=1)"
> " Index Cond: (toode = 'X05'::bpchar)"
> " -> Hash (cost=69242.72..69242.72 rows=112651 width=4)
> (actual time=1240.988..1240.988 rows=110170 loops=1)"
> " -> Bitmap Heap Scan on dok (cost=1492.28..69242.72
> rows=112651 width=4) (actual time=68.053..769.448 rows=110170 loops=1)"
> " Recheck Cond: (kuupaev >= '2008-09-01'::date)"
> " -> Bitmap Index Scan on dok_kuupaev_idx
> (cost=0.00..1492.28 rows=112651 width=0) (actual time=61.358..61.358
> rows=110232 loops=1)"
> " Index Cond: (kuupaev >=
> '2008-09-01'::date)"
> "Total runtime: 7808.174 ms"
>
> In both cases vmstat 2 shows only cpu activity when queries are running:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff   cache    si   so    bi    bo   in    cs us sy id wa
> ...
>  0  0    232 123692      0 1891044     0    0     0     0  252    17  0  0 100  0
>  0  0    232 123692      0 1891044     0    0     0     0  252    17  0  0 100  0
>  0  0    232 123684      0 1891044     0    0     0   162  254    22  0  0 100  0
>  0  0    232 123684      0 1891044     0    0     0     0  252    18  0  0 100  0
>  1  0    232 123056      0 1891444     0    0     0    13  254    21 62  5 34  0  <---- start of slower query
>  1  0    232 102968      0 1911060     0    0     0    16  252    18 26 75  0  0
>  1  0    232  77424      0 1936996     0    0     0     0  252    18 37 63  0  0
>  1  0    232  71464      0 1941928     0    0     0    73  260    34 38 62  0  0
>  0  0    232 123420      0 1891044     0    0     0    32  257    31  8 15 77  0  <-------- end of slower query
>  0  0    232 123420      0 1891044     0    0     0    25  255    24  0  0 100  0
>  0  0    232 123420      0 1891044     0    0     0    28  255    27  0  0 100  0
>

Well, this parameter specifies how much memory each in-memory sort or
hash table may use. If more memory is needed, the sort or hash is done
on disk using temporary files, which is much slower.

The difference in speed here is not huge, so I guess the hash table
used for the join mostly fits into the default 1024kB anyway.
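
If you want to find the point where the hash stops fitting in memory,
you can simply bisect on work_mem with the same query (a sketch of the
experiment, using your query from above):

    SET work_mem = 1024;    -- the default, 1MB
    EXPLAIN ANALYZE SELECT sum(1)
      FROM dok JOIN rid USING (dokumnr)
      JOIN toode USING (toode)
      WHERE rid.toode = 'X05' AND dok.kuupaev >= '2008-09-01';
    -- repeat with e.g. SET work_mem = 4096; 16384; 65536;
    -- and watch how the Hash Join timings change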

> Is it safe to set
>
> work_mem = 2097151
>
> in postgresql.conf file ?

Each sort or hash table in a query uses a separate memory area, up to
work_mem each. So if you have 10 sessions running a query with 2 sorts,
that may occupy

10 * 2 * work_mem

of memory. Let's suppose you set a reasonable value, say 8192 (i.e.
8MB), instead of 2GB. That gives about 160MB in this scenario.

Anyway, it depends on your workload - if a lot of your queries are slow
because of on-disk sorts / hash tables, use a higher value; otherwise
leave it as it is.
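
So in postgresql.conf something like this would be much saner than 2GB
(the number is just an illustration, not a recommendation for your box):

    # postgresql.conf - on 8.1 the value is a plain integer, in kB
    work_mem = 8192        # 8MB per sort/hash, per session

and for the few reporting queries that really need more, raise it per
session with SET work_mem, as shown above.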

>> First, demonstrate that it is all or mostly in memory -- use iostat or
>> other tools to ensure that there is not much disk activity during the
>> query. If your system doesn't have iostat installed, it should be
>> installed. It is a very useful tool.
>
> # iostat
> bash: iostat: command not found
> # locate iostat
> /usr/src/linux-2.6.16-gentoo-r9/Documentation/iostats.txt
>
> I have little experience with Linux. No idea how to install or compile
> iostat on this system.
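
On Gentoo, iostat comes with the sysstat package, so something like this
should work (a guess - I don't have a Gentoo box at hand):

    # emerge app-admin/sysstat
    # iostat -x 2     # extended per-device stats every 2 seconds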
>
>> If it is all cached in memory, you may want to ensure that your
>> shared_buffers is a reasonable size so that there is less shuffling
>> of data from the kernel to postgres and back. Generally,
>> shared_buffers works best between 5% and 25% of system memory.
>
> currently shared_buffers = 15000

That's 120MB, i.e. about 6% of the memory. It might be a bit higher,
but it seems reasonable.
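
If you decide to raise it, keep in mind that in 8.1 shared_buffers is a
number of 8kB pages, changing it requires a server restart, and you may
also need to raise the kernel's shmmax. Just as an illustration:

    # postgresql.conf - 8.1: value is in 8kB pages, restart needed
    shared_buffers = 25000   # 25000 * 8kB = ~195MB, i.e. ~10% of 2GB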

>> If it is completely CPU bound then the work done for the query has to be
>> reduced by altering the plan to a more optimal one or making the work it
>> has to do at each step easier. Most of the ideas in this thread revolve
>> around those things.
>
> When running on a loaded server, even after VACUUM FULL, the response
> time with the original work_mem is longer, probably because blocks must
> be fetched from disk.
>
>> Based on the time it took to do the vacuum, I suspect your disk subsystem
>> is a bit slow. If it can be determined that there is much disk I/O in
>> your
>> use cases, there are generally several things that can be done to tune
>> Linux I/O. The main ones in my experience are the 'readahead' value for
>> each disk which helps sequential reads significantly, and trying out the
>> linux 'deadline' scheduler and comparing it to the more commonly used
>> 'cfq'
>> scheduler. If the system is configured with the anticipatory scheduler,
>> absolutely switch to cfq or deadline as the anticipatory scheduler will
>> perform horribly poorly for a database.
>
> This is a 3-year-old cheap server.
> No idea what to configure.

Well, how could we know that if you don't?

Anyway, the options Scott Carey mentioned are Linux kernel settings, so
changing them shouldn't be a problem regardless of the hardware.
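
For example, checking and switching the I/O scheduler and readahead
looks roughly like this (the device name is a guess - adjust it to your
disk):

    # cat /sys/block/sda/queue/scheduler    # current one shown in [brackets]
    # echo deadline > /sys/block/sda/queue/scheduler
    # blockdev --getra /dev/sda             # readahead, in 512B sectors
    # blockdev --setra 4096 /dev/sda        # bump it to 2MB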

> There is also another similar server which has 1.2 GB more usable memory.
> No idea whether it is worth switching to it.
> After some years the sales data will exceed that extra memory anyway.

Given that the performance issues are caused by bloated tables and/or a
slow I/O subsystem, I guess moving to a similar system won't help.

regards
Tomas
