Re: hash join vs nested loop join

From: Huan Ruan <huan(dot)ruan(dot)it(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)mail(dot)com>
Cc: Huan Ruan <leohuanruan(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: hash join vs nested loop join
Date: 2012-12-13 23:51:27
Message-ID: CAD1stZuZX0aGwF1LRsHsD=SZ3jy6qHMnU2LLBKj87E03NFL8Cg@mail.gmail.com
Lists: pgsql-performance

Hi Kevin

Again, many thanks for your time and help.

On 14 December 2012 02:26, Kevin Grittner <kgrittn(at)mail(dot)com> wrote:

> Huan Ruan wrote:
>
> > Hash 1st run
>
> > "Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
> > (actual time=2182.450..88158.645 rows=48257 loops=1)"
>
> > " -> Seq Scan on invtran bigtable (cost=0.00..4730787.28
> > rows=168121728 width=108) (actual time=0.051..32581.052
> > rows=168121657 loops=1)"
>
> 194 nanoseconds per row suggests 100% cache hits.
>
> > NL 1st run
>
> > "Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
> > time=0.056..551.438 rows=48257 loops=1)"
>
> > " -> Index Scan using pk_invtran on invtran bigtable
> > (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
> > rows=1 loops=48261)"
>
> 10 microseconds per index scan (each index scan requiring multiple
> "random" accesses) also suggests 100% cache hits.
>

Interesting to see how you derived 100% cache hits. I assume by 'cache' you
mean the PG shared buffers plus the OS cache, because the table is 23GB while
shared_buffers is only 6GB? Even then, I'm not completely convinced, because
the total RAM is just 24GB and part of that has to hold other data and
indexes.
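
As a rough sanity check on my side, I suppose I can look at
pg_statio_user_tables for the shared_buffers hit ratio of that table (it only
covers shared_buffers, not the OS cache, and the counters are cumulative
since the last stats reset), e.g.:

  -- shared_buffers hit ratio for invtran (OS cache hits are invisible here)
  SELECT heap_blks_read,
         heap_blks_hit,
         round(heap_blks_hit::numeric
               / nullif(heap_blks_hit + heap_blks_read, 0), 4) AS heap_hit_ratio
  FROM pg_statio_user_tables
  WHERE relname = 'invtran';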

I read somewhere that setting shared_buffers too big can hurt performance,
and that it's better to leave the rest of the caching to the OS. I'm not sure
why, but for now I've just configured shared_buffers to be 1/4 of the total
RAM.
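
If it's useful, I might also install the pg_buffercache contrib module to see
how much of invtran is actually sitting in shared_buffers at a given moment.
A rough sketch, assuming the default 8kB block size:

  CREATE EXTENSION IF NOT EXISTS pg_buffercache;

  -- portion of invtran currently held in shared_buffers
  SELECT pg_size_pretty(count(*) * 8192) AS invtran_in_shared_buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
  WHERE b.reldatabase = (SELECT oid FROM pg_database
                         WHERE datname = current_database())
    AND c.relname = 'invtran';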

> > I originally reduced random_page_cost to 2 to achieve the nested
> > loop join. Now I set cpu_tuple_cost to 0.05 and reset
> > random_page_cost back to 4, I can also achieve a nested loop
> > join.
> >
> > I'm still new in Postgres, but I'm worried about random_page_cost
> > being 2 is too low, so maybe increasing cpu_tuple_cost is a
> > better choice.
>
> If these are typical of what you would expect in production, then
> the fact that with default cost factors the costs are barely
> different (by 0.6%) for actual run times which differ by two orders
> of magnitude (the chosen plan is 160 times slower) means that the
> modeling of cost factors is off by a lot.
>
> If you expect the active portion of your database to be fully
> cached like this, it makes sense to reduce random_page_cost to be
> equal to seq_page_cost. But that only adjusts the costs by at most
> a factor of four, and we've established that in the above query
> they're off by a factor of 160. To help make up the difference, it
> makes sense to de-emphasize page access compared to cpu-related
> costs by reducing both page costs to 0.1. Combined, these
> adjustments still can't compensate for how far off the estimate
> was.

> In my experience default cpu_tuple_cost is understated compared to
> other cpu-related costs, so I would do the above *plus* a boost to
> cpu_tuple_cost. Personally, I have never seen a difference between
> plans chosen with that set to 0.03 and 0.05, so I can't say where
> in that range is the ideal value; you should feel free to
> experiment if there is a query which seems to be choosing a bad
> plan. If the above results really do represent cache hit levels you
> expect in production, the combination of the above changes should
> come reasonably close to modeling costs realistically, resulting in
> better plan choice.
>

In production, the RAM will hold roughly 60% of the database, but all of the
actively used data should fit in it entirely. On the test server I'm playing
with now, RAM is only 8% of the database size. Nonetheless, I will play with
these parameters as you suggested.
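
For the record, this is roughly what I intend to try, session-level first so
nothing goes into postgresql.conf until I'm comfortable with it (the join
query itself is omitted here):

  -- de-emphasise page access relative to CPU costs, as suggested
  SET seq_page_cost = 0.1;
  SET random_page_cost = 0.1;
  SET cpu_tuple_cost = 0.03;       -- will also try 0.05

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT ...;                      -- our original invtran join goes here

  -- back to the server-wide values afterwards
  RESET seq_page_cost;
  RESET random_page_cost;
  RESET cpu_tuple_cost;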

I was also wondering: on our production server, where effective_cache_size
will be much bigger, will Postgres then assume that most of the data is
cached anyway and lean towards a nested loop join rather than a sequential
scan feeding a hash join?

And even on a test server where the cache hit rate is much smaller, under
what circumstances would a hash join actually perform better than a nested
loop join for a big table like this?
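
To answer that for myself on the test box, I assume I can just toggle the
planner's enable_* switches in one session and time both plans for the same
query, e.g.:

  -- time the hash join plan by disabling nested loops
  SET enable_nestloop = off;
  EXPLAIN (ANALYZE, BUFFERS) SELECT ...;   -- the same invtran join as above
  RESET enable_nestloop;

  -- and the nested loop plan by disabling hash joins
  SET enable_hashjoin = off;
  EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
  RESET enable_hashjoin;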

>
> If you don't expect such high cache hit ratios in production, you
> probably don't want to go so low with page costs.
>
> >>> - shared_buffers = 6GB
> >>> - effective_cache_size = 18GB
> >>> - work_mem = 10MB
> >>> - maintenance_work_mem = 3GB
>
> > Can you see any obvious issues with the other memory settings I
> > changed?
>
> I might bump up work_mem to 20MB to 60MB, as long as you're not
> going crazy with max_connections. I would probably take
> maintenance_work_mem down to 1GB to 2GB -- you can have several of
> these allocations at one time, and you don't want to blow away your
> cache. (I think it might actually be adjusted down to 2GB
> internally anyway; but I would need to check.)
>

Yes, I had bumped work_mem up to 80MB yesterday to speed up another big
GROUP BY query. I assumed this memory is only allocated if a query needs it
and is released as soon as the query finishes, so it shouldn't be too much of
an issue as long as I don't have too many concurrent sorting queries running
(which is true in our production). Is this correct?
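
In hindsight, I suppose I could have raised it just for that one session
instead of globally, along these lines:

  -- raise work_mem only for the session running the big GROUP BY
  SET work_mem = '80MB';
  SELECT ...;                      -- the large GROUP BY query
  RESET work_mem;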

I originally increased maintenance_work_mem to speed up index creation when
I first pumped the data in. In the production environment we don't create
indexes at run time, so I think only VACUUM and ANALYZE will consume this
memory?
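
If that's right, I guess the safer pattern is to keep the server-wide setting
at the 1GB to 2GB you suggest and only raise it in the session doing a
one-off bulk index build (the index definition below is just a placeholder):

  -- raise maintenance_work_mem only for the bulk build after the data load
  SET maintenance_work_mem = '3GB';
  CREATE INDEX ... ON invtran (...);   -- column list omitted
  RESET maintenance_work_mem;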

Thanks
Huan

>
> -Kevin
>
