Re: Stats collector's idx_blks_hit value is highly misleading in practice

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Stats collector's idx_blks_hit value is highly misleading in practice
Date: 2020-10-31 01:46:17
Message-ID: 20201031014617.ukvqjlis7tkdo5ib@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 16, 2020 at 03:35:51PM -0700, Peter Geoghegan wrote:
>It occurs to mean that statistics collector stats such as
>pg_statio_*_tables.idx_blks_hit are highly misleading in practice
>because they fail to take account of the difference between internal
>pages and leaf pages in B-Tree indexes. These two types of pages are
>in fundamentally different categories, and I think that failing to
>recognize that at the level of these system views makes them much less
>useful. Somebody should probably write a patch that makes this
>difference clear from the system views. Possibly by using some
>generalized notion of "record" pages instead of leaf pages, and
>"metadata" pages instead of internal pages. That would even work with
>hash indexes, I think.
>
>Consider the following example, which is based on a standard nbtree
>index, but could work in almost the same way with other index access
>methods:
>
>We have a pgbench_accounts pkey after initialization by pgbench at
>scale 1500. It has 409,837 leaf pages and 1,451 internal pages,
>meaning that about one third of one percent of all pages in the index
>are internal pages. Occasionally, with indexes on large text strings
>we might notice that as many as 1% of all index pages are internal
>pages, but that's very much on the high side. Generally speaking,
>we're virtually guaranteed to have *all* internal pages in
>shared_buffers once a steady state has been reached. Once the cache
>warms up, point lookups (like the queries pgbench performs) will only
>have to access one leaf page at most, which amounts to only one I/O at
>most. (This asymmetry is the main reason why B-Trees are generally
>very effective when buffered in a buffer cache.)
>
>If we run the pgbench queries against this database/example index
>we'll find that we have to access 4 index pages per query execution --
>the root, two additional internal pages, plus a leaf page. Based on
>the reasonable assumptions I'm making, 3 out of 4 of those pages will
>be hits when steady state is reached with pgbench's SELECT-only
>workload, regardless of how large shared_buffers is or how bloated the
>index is (we only need 1451 buffers for that, and those are bound to
>get hot quickly).
>
>The overall effect is idx_blks_hit changes over time in a way that
>makes no sense -- even to an expert. Let's say we start with this
>entire 3213 MB pgbench index in shared_buffers. We should only get
>increments in idx_blks_hit, never increments in idx_blks_read - that
>much makes sense. If we then iteratively shrink shared_buffers (or
>equivalently, make the index grow without adding a new level), the
>proportion of page accesses that increment idx_blks_read (rather than
>incrementing idx_blks_hit) goes up roughly linearly as misses increase
>linearly - which also makes sense. But here is the silly part: we
>cannot really have a hit rate of less than 75% if you compare
>idx_blks_hit to idx_blks_read, unless and until we can barely even fit
>1% of the index in memory (at which point it's hard to distinguish
>from noise). So if we naively consume the current view we'll see a hit
>rate that starts at 100%, and very slowly shrinks to 75%, which is
>where we bottom out (more or less, roughly speaking). This behavior
>seems pretty hard to defend to me.
>

Yeah. The behavior is technically correct, but it's not very useful for
practical purposes. And most people don't even realize it behaves like
this :-( It's possible to compensate for this effect and estimate the
actually "interesting" hit rate, but if we could have it directly that
would be great.

>If somebody fixed this by putting internal pages into their own bucket
>in the system view, then motivated users would quickly learn that
>internal page stats aren't really useful -- they are only included for
>completeness. They're such a small contributor to the overall hit rate
>that they can simply be ignored completely. The thing that users ought
>to focus on is leaf page hit rate. Now index hit rate (by which I mean
>leaf page hit rate) actually makes sense. Note that Heroku promoted
>simple heuristics like this for many years.
>
>I suppose that a change like this could end up affecting other things,
>such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages
>separately for bitmap scans at the moment, so maybe it could be fairly
>well targeted. And, maybe this is unappealing given the current
>statistics collector limitations. I'm not volunteering to work on it
>right now, but it would be nice to fix this. Please don't wait for me
>to do it.
>

It seems to me this should not be a particularly difficult patch in
principle, so suitable for new contributors. The main challenge would be
passing information about what page we're dealing with (internal/leaf)
to the place actually calling pgstat_count_buffer_(read|hit). That
happens in ReadBufferExtended, which just has no idea what page it's
dealing with. Not sure how to do that cleanly ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2020-10-31 02:08:01 Re: bulk typos
Previous Message Tomas Vondra 2020-10-31 01:20:53 Re: [PATCH] Add extra statistics to explain for Nested Loop