Re: Proposal: Improve bitmap costing for lossy pages

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Improve bitmap costing for lossy pages
Date: 2017-07-26 17:05:44
Message-ID: CA+TgmoYW3dkA=XiThkGQRegTbEjZOF3GG4gg5YEqZaX2RWncOA@mail.gmail.com

On Thu, Jun 8, 2017 at 10:44 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Thu, May 18, 2017 at 8:07 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Thanks for the feedback and sorry for the delayed response.
>
>> You might need to adjust effective_cache_size.
>
> You are right. But, effective_cache_size will have the impact on the
> number of pages_fetched when it's used as parameterized path (i.e
> inner side of the nested loop). But for our case where we see the
> wrong number of pages got estimated (Q10), it was for the
> non-parameterized path.

Ah, oops. My mistake.

One thing to keep in mind is that this is just an estimate. It's not
going to be right 100% of the time no matter what you do. The goal is
to make the estimates better than they are today, and the patch can
succeed in being better overall even if there are some cases where
things get worse. Have you tried to analyze what is causing the bad
estimate in this one case?

The formula that compute_bitmap_pages is using here to compute the
number of page fetches is (2.0 * T * tuples_fetched) / (2.0 * T +
tuples_fetched), where T is the number of pages in the table. Now the
idea here is that when tuples_fetched is small, the number of pages
fetched is likely to be almost equal to the number of tuples fetched,
because probably all of the tuples will be on separate pages. As the
number of tuples grows larger, we assume it's likely that sometimes
two or more of them will be on the same page, so pages_fetched grows
more slowly. When tuples_fetched = T, that is, the number of tuples
equals the number of pages, we estimate that we're fetching 2/3 of the
table, because some pages will have no tuples to fetch at all, while
others have more than one. When tuples_fetched = 2 * T or greater, we
assume we'll fetch every page in the table.
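
As a quick sanity check, that saturating curve can be evaluated directly. This is a standalone sketch of the formula as described above, not the actual compute_bitmap_pages code:

```python
def estimate_pages_fetched(T, tuples_fetched):
    """Mirror the curve described above: close to tuples_fetched when
    few tuples are fetched, bending toward T as tuples_fetched grows."""
    return (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched)

T = 300                                  # pages in the table
few = estimate_pages_fetched(T, 3)       # ~3: roughly one page per tuple
at_T = estimate_pages_fetched(T, T)      # 200: 2/3 of the table
at_2T = estimate_pages_fetched(T, 2 * T) # 300: the whole table
```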

But this could be wrong. If there are 100 tuples per page, we could
have tuples_fetched = 2 * T but actually fetch only T / 50 pages
rather than T pages, if all the rows we need to fetch are tightly
clustered. That would be a 50x estimation error; the one you're
seeing is about 3.8x. And my guess is that it's exactly this problem:
the TIDs being fetched are not spread out evenly through the whole
table, but are rather all clustered, but you could try to verify that
through some experimentation. I'm not sure we have the statistics to
solve that problem in a principled way. It seems loosely related to
the physical-to-logical correlation which we do store, but not so
closely that any way of using that information directly is obvious.

Instead of trying to immediately improve things on the optimizer side,
I'm wondering whether our first step should be to try to improve
things on the executor side - i.e. reduce the number of pages that
actually get lossified. tbm_lossify says:

* XXX Really stupid implementation: this just lossifies pages in
* essentially random order. We should be paying some attention to the
* number of bits set in each page, instead.

As the comment says, the current implementation is really stupid,
which means we're lossifying more pages than really necessary. There
is some previous discussion of this topic here:

https://www.postgresql.org/message-id/flat/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de

There are two main considerations here. One, it's better to lossify
pages with many bits set than with few bits set, because the
additional work we thereby incur is less. Two, it's better to lossify
pages that are in the same chunk as other pages which we are also
going to lossify, because that's how we actually save memory. The
current code will cheerfully lossify a chunk that contains no other
pages, or will lossify one page from a chunk but not the others,
saving no memory but hurting performance.

Maybe the simplest change here would be to make it so that when we
decide to lossify a chunk, we lossify all pages in the chunk, but only
if there's more than one. In other words, given a proposed page P to
lossify, probe the hash table for all keys in the same chunk as P and
build up a words[] array for the proposed chunk. If that words[]
array will end up with only 1 bit set, then forget the whole thing;
otherwise, delete all of the entries for the individual pages and
insert the new chunk instead.
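
Modeling the page table as a plain dict, that simplest change might look something like this. This is a toy sketch of the idea, not the real tbm_lossify machinery: the entry layout, the helper name, and the flat chunk-bitmap representation are all simplifications:

```python
PAGES_PER_CHUNK = 64  # pages covered by one lossy chunk entry

def lossify_chunk_of(page_table, chunks, p):
    """Given a proposed page P to lossify, lossify every exact page in
    P's chunk, but only if the chunk would cover more than one page.
    Returns the number of page table entries saved."""
    chunk = p // PAGES_PER_CHUNK
    members = [pg for pg in page_table if pg // PAGES_PER_CHUNK == chunk]
    if len(members) <= 1:
        return 0                       # would save nothing; forget it
    words = 0
    for pg in members:
        words |= 1 << (pg % PAGES_PER_CHUNK)  # build the chunk bitmap
        del page_table[pg]             # drop the individual page entries
    chunks[chunk] = words              # insert the single chunk entry
    return len(members) - 1
```

For example, with exact pages {3, 5, 70}: lossifying page 3 merges pages 3 and 5 into chunk 0 and saves one entry, while lossifying page 70 is refused because it shares its chunk with no other page.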

A further refinement would be to try to do a better job picking which
chunks to lossify in the first place. I don't have a clear idea of
how we could go about doing that. There's an unused padding byte
available inside PageTableEntry, and really it's more like 28 bits,
because status only needs 2 bits and ischunk and recheck only need 1
bit each. So without increasing the memory usage at all, we could use
those bits to store some kind of information that would give us a clue
as to whether a certain entry was likely to be a good candidate for
lossification. What to store there is a little less clear, but one
idea is to store the number of page table entries that could be saved
by lossifying the chunk. We could iterate through the hash table once
and work out the correct value for every chunk, storing 0 for any
pages that wouldn't be chunk headers. As we go, we could keep a
separate array that counts the number of times we found an opportunity
for lossification that would save N pages; that is, maintain an array
chances_to_save[PAGES_PER_CHUNK] such that after looping through the
whole page table, chances_to_save[i] is equal to the number of page
table entries for which lossifying all pages in the chunk would save i
entries. Then, based on how many entries we need to save, we could
cheaply compute a threshold value and lossify all chunks that save at
least that many pages. This isn't perfect because it ignores the
number of bits set for each individual page, and it adds some cost
because you have to iterate through the hash table twice, but it seems
pretty good -- you lossify the chunks where it saves the most storage
instead of picking randomly.
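
In outline, the two-pass histogram-and-threshold idea could work like this. This is a rough sketch of the proposal with the page table reduced to a list of page numbers; the real implementation would walk the hash table and stash the per-chunk counts in the spare PageTableEntry bits:

```python
PAGES_PER_CHUNK = 64

def pick_lossify_threshold(pages, entries_to_save):
    """Pass 1: count, per chunk, how many entries lossifying that whole
    chunk would save (pages in chunk minus one).  Then walk the savings
    histogram from densest to sparsest to find the smallest per-chunk
    savings value that frees at least entries_to_save entries."""
    per_chunk = {}
    for pg in pages:
        c = pg // PAGES_PER_CHUNK
        per_chunk[c] = per_chunk.get(c, 0) + 1
    # chances_to_save[i]: number of chunks whose lossification saves i entries
    chances_to_save = [0] * PAGES_PER_CHUNK
    for n in per_chunk.values():
        chances_to_save[n - 1] += 1
    saved = 0
    for i in range(PAGES_PER_CHUNK - 1, 0, -1):  # best savings first
        saved += chances_to_save[i] * i
        if saved >= entries_to_save:
            return i
    return 1
```

With pages {0..4, 64, 65, 128}, the three chunks would save 4, 1, and 0 entries respectively; needing to save 4 entries yields a threshold of 4 (lossify only the dense chunk), while needing 5 drops the threshold to 1.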

As far as the patch itself is concerned, tbm_calculate_exact_pages()
is computing the number of "exact pages" which will fit into the
TIDBitmap, but I think that instead of tbm_calculate_exact_pages() you
should have something like tbm_calculate_entries() that just returns
nbuckets, and then let the caller work out how many entries are going
to be exact and how many are going to be inexact. An advantage of
that approach is that the new function could be used by tbm_create()
instead of duplicating the logic. I'm not sure that the way you are
doing the rest of the calculation is wrong, but I've got no confidence
that it's right, either: the way WORDS_PER_CHUNK is used looks pretty
random, and the comments aren't enough for me to figure it out.

It's unclear what assumptions we should make while trying to estimate
the number of lossy pages. The effectiveness of lossification depends
on the average number of pages that get folded into a chunk; but how
many will that be? If we made some of the improvements proposed
above, it would probably be higher than it is now, but either way it's
not clear what number to use. One possible assumption is that the
pages that get lossified are exactly average, so:

    double entries_saved_per_lossy_page = Max(heap_pages_fetched /
        tbm_max_entries - 1, 1.0);
    lossy_pages = (heap_pages_fetched - tbm_max_entries) /
        entries_saved_per_lossy_page;
    exact_pages = heap_pages_fetched - lossy_pages;

If the TBM fits into work_mem, heap_pages_fetched / tbm_max_entries is
the average number of entries per chunk, so one less than that value
is the number of pages we expect to save by lossifying an average
chunk and all of its entries. This might even be too optimistic given
the way tbm_lossify() works today, since there's currently no
guarantee we'd save anything at all; we might lossify a bunch of extra
stuff just for fun.
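
Plugging numbers into that formula makes the behavior concrete. A sketch under the "lossified pages are exactly average" assumption, with the function name chosen here just for illustration:

```python
def estimate_lossy_pages(heap_pages_fetched, tbm_max_entries):
    """Assume lossified chunks are average: each folds together
    heap_pages_fetched / tbm_max_entries pages, so lossifying one
    saves one less than that many entries."""
    entries_saved_per_lossy_page = max(
        heap_pages_fetched / tbm_max_entries - 1, 1.0)
    lossy_pages = ((heap_pages_fetched - tbm_max_entries)
                   / entries_saved_per_lossy_page)
    exact_pages = heap_pages_fetched - lossy_pages
    return lossy_pages, exact_pages

# e.g. 10000 pages fetched, room for 2000 entries: 5 pages per chunk
# on average, so each lossy page saves 4 entries; the 8000 excess
# entries become 2000 lossy pages, leaving 8000 exact pages.
```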

Another possible assumption is that the pages that get lossified are
particularly good candidates for lossification -- they are, say, twice
as dense as the typical page. To reflect such an assumption, you'd
just make entries_saved_per_lossy_page bigger e.g. by inserting "2 *"
at the front of the formula.

There could be other ways of computing this, too -- you've got one! --
but I'm not sure that WORDS_PER_CHUNK should be involved at all. The
number of entries saved per lossy page will only be WORDS_PER_CHUNK -
1 in the really fortunate case where not only does the algorithm
always pick the chunk with the most pages as the next one to lossify,
but also that chunk always has the maximum number of possible pages in
it. That isn't likely on real data distributions.

Curious to hear more of your (or anyone's) thoughts on this. This is
a tricky problem and the performance gains you've gotten seem to show
this area is clearly worth some effort.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
