From: Greg Stark <stark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Explain buffers display units.
Date: 2010-02-16 10:36:50
Message-ID: 407d949e1002160236v4e05497ehcdb8a8e796edc94d@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 16, 2010 at 2:48 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Multiplying by the block size makes it sound as if all the
> memory was read or used, which is simply not the case - especially for
> things like buffer hits, which don't actually read or allocate any
> memory at all.

In which case it represents how much data would have had to be read
if it hadn't been in the buffer cache, which is a perfectly reasonable
measurement. It's exactly what a cache profiler should be measuring.
These are figures that users have to compare with their buffer cache
size and with the output of iostat or other tools, and presenting them
in arbitrary internal units makes that difficult.

> We certainly do that for GUCs, and in that context it seems to me to
> make sense.  If you set your shared buffers to a gigabyte, PG will use
> an additional GB of memory.  But if you hit a "gigabyte" of shared
> buffers, you may be examining anywhere from one 8K block over and over
> again all the way up to a full GB of memory.  Block hits and reads
> just don't add in the same way that actual memory allocations do.

Accessing the same 8kB of memory 131,072 times is 1GB of memory
bandwidth. The output of explain doesn't give you enough information
to distinguish that from accessing 1GB of different data, which is too
bad, but there's a limit to how much information we can fit in a
reasonable amount of space. But 1GB of memory bandwidth is still an
interesting figure even if it's the same 8kB block a hundred thousand
times over. I think it's a lot more meaningful to a human reader than
"131072".
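
(To spell out the arithmetic: 131,072 block accesses x 8kB per block =
1,048,576 kB = 1GB.)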

> And at any rate, what we DON'T do for GUCs is produce differing output
> formats for the same parameter based on the magnitude of the output
> value, as you've done here.

No, that's *exactly* what we do:

postgres=# set work_mem = 64;
SET
postgres=# show work_mem;
 work_mem
----------
 64kB
(1 row)

postgres=# set work_mem = 1024;
SET
postgres=# show work_mem;
 work_mem
----------
 1MB
(1 row)

postgres=# set work_mem = 1048576;
SET
postgres=# show work_mem;
 work_mem
----------
 1GB
(1 row)

> We accept input in several different
> formats, but there is only one canonical output format for any
> particular GUC, which is furthermore always chosen in such a way that
> the exact value of the setting is preserved (again, unlike what you've
> done here).

I don't think the use case for GUCs is the same as for empirical
measurements. Empirical results are never going to come out as a round
number of megabytes, so only switching to larger units when the value
divides evenly would be useless here. In the case of GUCs I assume the
argument was that someone should be able to copy the output into
another postgresql.conf and get the same value, something which is
irrelevant for empirical measurements.

In any case the machine-readable form of GUC settings is not this one
canonical format you describe for SHOW:

postgres=# select name,setting,unit,min_val,max_val,boot_val,reset_val
from pg_settings where name = 'work_mem';
   name   | setting | unit | min_val | max_val | boot_val | reset_val
----------+---------+------+---------+---------+----------+-----------
 work_mem | 1048576 | kB   | 64      | 2097151 | 1024     | 1024
(1 row)

This is similar to how I think the XML output should work. It should
have the raw internal values with enough metadata that a tool can
figure out how to display them or work with them.
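
Purely as a sketch of what I mean -- these element and attribute names
are invented for illustration, not what the patch actually emits:

<Shared-Hit-Blocks units="blocks" block-size="8192">131072</Shared-Hit-Blocks>
<Shared-Read-Blocks units="blocks" block-size="8192">2048</Shared-Read-Blocks>

With the units and block size carried in-band, a tool can scale and
pretty-print the values however it likes without hard-coding anything
about the server's internals.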

> So, you're saying we shouldn't look at the way that the pg_stat
> functions format the output because somebody might write a view over
> it that formats it in some different way that may or may not match
> what you've done for the EXPLAIN output?  What makes you think that
> people don't just look at the raw numbers?  I certainly have, and
> there's no suggestion in the documentation that users should do
> anything else.

I'm not sure users need to be told that they can format the data in
whatever way they want. We still have to document the programmatic
interface they use to get the raw data.

> pg_stat_statements doesn't do what you're suggesting either; it, too,
> presents raw numbers, and lets the user make of it what they will.
> They might, for example, want to compute a hit ratio, as in the
> example provided in the docs.  In the case of EXPLAIN of an index
> scan, they might want to estimate the number of seeks, on the theory
> that an inner-indexscan is going to be all random IO.

You can compute the hit ratio just fine from measurements with units.
And if you're doing it in an automated way you'll want to use
machine-readable output, rather than parsing the formatted text.
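
The ratio comes out the same whichever units you present, since the
block size just cancels out:

hit_ratio = hit_bytes / (hit_bytes + read_bytes)
          = (hit_blocks * 8kB) / ((hit_blocks + read_blocks) * 8kB)
          = hit_blocks / (hit_blocks + read_blocks)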

> This doesn't seem to be a very carefully thought out proposal, because
> you haven't explained how it would work for JSON or YAML output.  A
> format-neutral solution which we've already used for sort and hash
> information (and for GUCs) is to include the unit designator in the
> output.  But I generally think that trying to make the EXPLAIN output
> self-documenting to the point where programs don't need any specific
> knowledge of the meanings of the various properties is rather
> hopeless.

I think there's a lot of room between "provide the units for the
measurement" and "100% self-documenting". Most tools are just going to
display the measurement and will want to know how to format the result
nicely anyway.

A fixed out-of-band schema isn't going to be able to handle the more
flexible cases either. Say we add dtrace support for counting
arbitrary probes and including them in the explain output. The schema
can't possibly list every possible probe, and some of them might well
be quantities of memory or time or in some other unit.

I *do* think we should have a schema description for the structural
tags -- e.g. <join>, <append>, <sort> and the like. Tools should be
able to know which tags they should expect to have subnodes and how
many to expect, and they should be able to know which ones to expect
to have measurements.
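
Even something as simple as a DTD fragment would do for that. A sketch
only -- here <plan> is a made-up stand-in for whatever the generic
subnode element ends up being called:

<!ELEMENT join   (plan, plan)> <!-- a join always has exactly two subnodes -->
<!ELEMENT append (plan+)>      <!-- append has one or more -->
<!ELEMENT sort   (plan)>       <!-- sort has exactly one -->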

> Upon further review, I also notice that this patch seems to have
> falsified the EXPLAIN documentation - both the description of the
> BUFFERS option and the description of the FORMAT option are no longer
> accurate

Oops. Well, I would like to know if I'm in the minority and have to
roll this back before I fix that.

--
greg
