From: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Cc: Gregory Bulloch <gregory(dot)bulloch(at)enterprisedb(dot)com>, Shan Shan Zhao <shanshan(dot)zhao(at)enterprisedb(dot)com>
Subject: Should we document the cost of pg_database_size()? Alternatives?
Date: 2025-07-18 00:54:47
Message-ID: CAGRY4nz94+q_zVxj+dnk7zqm-McBz4mSza_wALKiw2==23MiGQ@mail.gmail.com
Lists: pgsql-general
Hi all
(Long time!)
I recently ran into an issue where a monitoring component was calling
pg_catalog.pg_database_size() at a ~15s interval on a large
schema-sharded database. It took so long that the query was timing out
before the next 15s scrape interval would start.
The resulting filesystem metadata activity was also hammering the
inode cache etc., affecting normal operations.
================
pg_database_size() is expensive
================
The pg_database_size() function will stat every fork of every
relation, which is pretty expensive on a big DB with a lot of extents
and high relation counts. In my toy database:
test=# select pg_database_size('craig');
pg_database_size
------------------
338084643
(1 row)
# strace -c -p 1009203
[...running...]
test=# select pg_database_size('craig');
pg_database_size
------------------
338084643
(1 row)
The output of strace is:
strace: Process 1009203 attached
^Cstrace: Process 1009203 detached
 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.18    0.001148           2       393           newfstatat
  3.42    0.000043          14         3           sendto
  1.59    0.000020           6         3         1 openat
  1.35    0.000017           4         4           getdents64
  1.27    0.000016           8         2           close
  0.95    0.000012           6         2         1 recvfrom
  0.24    0.000003           1         2           fstat
  0.00    0.000000           0         1           epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00    0.001259           3       410         2 total
# find base/16385 | wc -l
394
So it's fstatat()ing (almost) every file, including all forks,
indexes, extents, etc.
No surprise it was taking forever on a very complex DB.
================
Docs for pg_database_size() don't mention performance impact
================
When I checked the docs to verify my recollection that
pg_database_size() stat()s every relfilenode (all extents, all forks
including maps, etc.), I was surprised to see that they say very
little about it:
https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT
> pg_database_size ( oid ) → bigint
>
> Computes the total disk space used by the database with the specified name or OID. To use this function, you must have CONNECT privilege on the specified database (which is granted by default) or have privileges of the pg_read_all_stats role.
It strikes me that we should probably warn about the cost of this
function on large DBs in the docs.
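Strawman wording only, and I'd have to check which admonition element
func.sgml favours, but something along these lines:

```
<caution>
 <para>
  This function determines total disk usage by calling
  <function>stat</function> on every file of every relation in the
  database, so it can be slow and I/O-intensive on databases with many
  relations.  Avoid calling it at short intervals from monitoring
  systems.
 </para>
</caution>
```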
================
Lower-impact alternatives?
================
Which leads to ... what's the alternative, then, if we should tell
users not to poll pg_database_size() from monitoring systems?
FS-based sizing isn't really enough
----------------
Asking users to monitor at the filesystem level works, kind-of, but
it'll lead to confusion due to WAL and temp files in simple installs.
To get decent results they will need to have a separate dedicated
volume for pg_wal. And which temp files are counted will differ; IIRC
pg_database_size() does not count extents created by an in-progress
REINDEX etc., but DOES count temp table sizes, for example. FS-based
monitoring will also include things like reorder buffers spilled to
disk under pg_replslot, which can be considerable and aren't reasonably
considered part of the "database size" or included in
pg_database_size(). And of course it can see only the sum of all
database sizes on a multi-database postgres instance unless the user
has one volume per database using distinct tablespaces. So
filesystem-based monitoring is not really a proper replacement.
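(For concreteness, by FS-level monitoring I mean something like the
below; 16385 is just my toy database's OID, from pg_database.)

```shell
# Whole cluster: includes WAL, temp files, slot spill files, every DB:
du -sh "$PGDATA"

# One database's heap/index/toast files in the default tablespace; get
# the OID with:  SELECT oid FROM pg_database WHERE datname = 'craig';
du -sh "$PGDATA/base/16385"

# WAL on its own, so it can be subtracted out:
du -sh "$PGDATA/pg_wal"
```

Note that the base/ measurement also misses anything placed in a
non-default tablespace under pg_tblspc/.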
Using relpages from pg_class is closer but problematic
-----------------
Should the user aggregate relpages from pg_class, multiply by the page
size, and call it kind-of-good-enough?
That seems awfully crude, and user-unfriendly at that, though it works
in trivial cases:
test=# select pg_size_pretty(pg_database_size('test'));
pg_size_pretty
----------------
309 MB
(1 row)
test=# select pg_size_pretty(sum(relpages)*(1024*8)) from pg_class;
pg_size_pretty
----------------
308 MB
(1 row)
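For what it's worth, a slightly less crude form of that query avoids
hard-coding the block size (current_setting('block_size') covers
non-default builds); relations without storage report relpages = 0, so
no relkind filter is strictly needed:

```sql
SELECT pg_size_pretty(
         sum(relpages)::bigint
         * current_setting('block_size')::bigint
       ) AS approx_db_size
FROM pg_class;
```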
However, it breaks down pretty easily with anything less trivial. For
one thing, relpages is only updated by VACUUM (and ANALYZE), so
quickly growing or changing tables and temp tables generally won't be
captured. E.g. run
CREATE TEMPORARY TABLE bar AS
SELECT repeat('foofoo',x) AS filler1, repeat('morefillerhere',x) AS filler2
FROM generate_series(1,10000) x;
... then pg_database_size() will reflect the temp table, but summing
relpages won't, even if you run it in the session that created the
temp table: relpages for bar stays 0 until the table is VACUUMed.
For example, I got a pg_database_size() of 322MB, an initial relpages
sum based estimate of 308MB (unchanged from before temp table
creation) and a relpages sum estimate after VACUUM of 319MB. The
on-disk size of the temp table is 11MB.
Meanwhile the actual size of base/16385 is 323MB according to `du` so
... good enough.
But the overall size of the whole datadir is 1093MB at the moment due
to WAL. Excluding WAL it's 356MB, mostly due to other DBs in the same
instance's base/.
==============
Any other alternatives?
==============
I'm interested in any other options anyone might suggest on better
ways to track DB size.
Obviously there's going to be a trade-off between freshness and cost;
pg_database_size() is expensive because it's immediate and current,
whereas using relpages is inaccurate because it waits for VACUUM.
Any suggestions for a middle ground?
If using relpages is considered good enough, would anyone be
interested if I were to cook up a patch for a more user-friendly
interface, like pg_database_size_cached(), to present it to users?
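As a strawman for what I mean (name, behaviour, and whether it belongs
in core all up for debate; note that unlike pg_database_size(), a
pg_class-based version can only see the current database):

```sql
-- Hypothetical sketch, not an existing function:
CREATE FUNCTION pg_database_size_cached() RETURNS bigint
LANGUAGE sql STABLE AS $$
  SELECT sum(relpages)::bigint
         * current_setting('block_size')::bigint
  FROM pg_class
$$;
```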
Whew. /novel.
--
Craig Ringer
EnterpriseDB