Re: speedup tidbitmap patch: cache page

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speedup tidbitmap patch: cache page
Date: 2014-12-14 11:58:07
Message-ID: CAApHDvp_+c1hPRiVfKcT1ty__6q5wnz2cMc_tNyVx6mhbbzSAg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 23 October 2014 at 00:52, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
>
> In specific workload postgres could spend a lot of time in
> tbm_add_tuples, up to 50% of query time. hash_search call is expensive and
> called twice for each ItemPointer to insert. Suggested patch tries to cache
> last PagetableEntry pointer in hope that next ItemPointer will be on the
> same relation page.
>
> Patch is rather trivial and I don't believe that it could cause
> performance degradation. But it could increase performance on some workload.
>
> An example:
> # create extension btree_gin;
> # select (v / 10)::int4 as i into t from generate_series(1, 5000000) as v;
> # create index idx on t using gin (i);
> # set enable_seqscan = off;
>
> # explain analyze select * from t where i >= 0;
> without patch: Execution time: 2427.692 ms
> with patch: Execution time: 2058.718 ms
>
> # explain analyze select * from t where i >= 100 and i<= 100;
> without patch: Execution time: 524.441 ms
> with patch: Execution time: 195.381 ms
>
>
This seems like quite a promising performance boost.

I've been having a look at this and I'm wondering about a certain scenario:

In tbm_add_tuples, if tbm_page_is_lossy() returns true for a given block,
and on the next iteration of the loop we have the same block again, have
you benchmarked any caching code to store if tbm_page_is_lossy() returned
true for that block on the previous iteration of the loop? This would save
from having to call tbm_page_is_lossy() again for the same block. Or are
you just expecting that tbm_page_is_lossy() returns true so rarely that
you'll end up caching the page most of the time, and gain on skipping both
hash lookups on the next loop, since page will be set in this case?

It would be nice to see a comment to explain why it might be a good idea to
cache the page lookup. Perhaps something like:

+ /*
+ * We'll cache this page as it's quite likely that
on the next loop
+ * we'll be seeing the same page again. This will
save from having
+ * to lookup the page in the hashtable again.
+ */
+ page = tbm_get_pageentry(tbm, blk);

I also wondered if there might be a small slowdown in the case where the
index only finds 1 matching tuple. So I tried the following:

select v::int4 as i into t1 from generate_series(1, 5000000) as v;
create index t1_idx on t1 using gin (i);

select count(*) from t1 where i >= 0;

In this case tbm_add_tuples is called with ntids == 1, so there's only 1
loop performed per call. I wondered if the page == NULL check would add
much overhead the function.

Times are in milliseconds:

Patch Master
2413.183 2339.979
2352.564 2428.025
2369.118 2359.498
2338.442 2350.974
2373.330 2400.942
2383.883 2342.461
2393.804 2359.49
2334.998 2359.884
2428.156 2520.842
2336.978 2356.995
avg. 2372.4456 2381.909 99.6%
med. 2371.224 2359.494 100.5%

It appears that if it does, then it's not very much.

Regards

David Rowley

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Cave-Ayland 2014-12-14 14:35:17 Re: Commitfest problems
Previous Message David Rowley 2014-12-14 08:03:29 Confusing comment in tidbitmap.c