
From: Andres Freund <andres(at)anarazel(dot)de>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: tomas(dot)vondra(at)2ndquadrant(dot)com, alvherre(at)2ndquadrant(dot)com, ah(at)cybertec(dot)at, magnus(at)hagander(dot)net, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2019-02-07 21:10:08
Message-ID: 20190207211008.nc3axviivmcoaluq@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 7eed5866d2..e52ae54821 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
> &sync_secs, &sync_usecs);
>
> /* Accumulate checkpoint timing summary data, in milliseconds. */
> - BgWriterStats.m_checkpoint_write_time +=
> + BgWriterStats.checkpoint_write_time +=
> write_secs * 1000 + write_usecs / 1000;
> - BgWriterStats.m_checkpoint_sync_time +=
> + BgWriterStats.checkpoint_sync_time +=
> sync_secs * 1000 + sync_usecs / 1000;

Why does this patch do renames like this in the same patch as actual
functional changes?

> @@ -1273,16 +1276,22 @@ do_start_worker(void)
> break;
> }
> }
> - if (skipit)
> - continue;
> + if (!skipit)
> + {
> + /* Remember the db with oldest autovac time. */
> + if (avdb == NULL ||
> + tmp->adw_entry->last_autovac_time <
> + avdb->adw_entry->last_autovac_time)
> + {
> + if (avdb)
> + pfree(avdb->adw_entry);
> + avdb = tmp;
> + }
> + }
>
> - /*
> - * Remember the db with oldest autovac time. (If we are here, both
> - * tmp->entry and db->entry must be non-null.)
> - */
> - if (avdb == NULL ||
> - tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
> - avdb = tmp;
> + /* Immediately free it if not used */
> + if(avdb != tmp)
> + pfree(tmp->adw_entry);
> }

This looks like another unrelated refactoring. Don't do this.

> /* Transfer stats counts into pending pgstats message */
> - BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
> - BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
> + BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
> + BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;

More unrelated renaming. I'll stop mentioning this in the rest of this
patch, but really don't do this; it makes such a large patch
unnecessarily harder to review.

pgstat.c needs a header comment explaining the architecture of the
approach.

> +/*
> + * Operation mode of pgstat_get_db_entry.
> + */
> +#define PGSTAT_FETCH_SHARED 0
> +#define PGSTAT_FETCH_EXCLUSIVE 1
> +#define PGSTAT_FETCH_NOWAIT 2
> +
> +typedef enum
> +{

Please don't create anonymous enums that are then typedef'd to a
name. The enum tag and the typedef'd name should be the same.
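A minimal sketch of the difference (with hypothetical names, not ones
from the patch):

```c
/* Discouraged: an anonymous enum that only exists via the typedef. */
typedef enum
{
	PGSTAT_BAD_MISS,
	PGSTAT_BAD_HIT
} PgStatBadExample;

/* Preferred: the enum tag and the typedef'd name are the same. */
typedef enum PgStat_FetchResult
{
	PGSTAT_FETCH_MISS,
	PGSTAT_FETCH_HIT
} PgStat_FetchResult;
```

With the latter, `enum PgStat_FetchResult` and `PgStat_FetchResult`
name the same type, so the type's name stays visible (e.g. in
debuggers) regardless of which spelling is used.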

> +/*
> + * report withholding facility.
> + *
> + * some report items are withholded if required lock is not acquired
> + * immediately.
> + */

This comment needs polishing. The variables are named _pending_, but the
comment talks about withholding - which doesn't seem like an apt name.

> /*
> * Structures in which backends store per-table info that's waiting to be
> @@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
> * Hash table for O(1) t_id -> tsa_entry lookup
> */
> static HTAB *pgStatTabHash = NULL;
> +static HTAB *pgStatPendingTabHash = NULL;
>
> /*
> * Backends store per-function info that's waiting to be sent to the collector
> * in this hash table (indexed by function OID).
> */
> static HTAB *pgStatFunctions = NULL;
> -
> -/*
> - * Indicates if backend has some function stats that it hasn't yet
> - * sent to the collector.
> - */
> -static bool have_function_stats = false;
> +static HTAB *pgStatPendingFunctions = NULL;

So this patch leaves us with a pgStatFunctions that has a comment
explaining it's about "waiting to be sent" stats, and then additionally
a pgStatPendingFunctions?

> /* ------------------------------------------------------------
> * Public functions used by backends follow
> @@ -802,41 +436,107 @@ allow_immediate_pgstat_restart(void)
> * pgstat_report_stat() -
> *
> * Must be called by processes that performs DML: tcop/postgres.c, logical
> - * receiver processes, SPI worker, etc. to send the so far collected
> - * per-table and function usage statistics to the collector. Note that this
> - * is called only when not within a transaction, so it is fair to use
> - * transaction stop time as an approximation of current time.
> - * ----------
> + * receiver processes, SPI worker, etc. to apply the so far collected
> + * per-table and function usage statistics to the shared statistics hashes.
> + *
> + * This requires taking some locks on the shared statistics hashes and some

Weird mix of different indentation.

> + * of updates may be withholded on lock failure. Pending updates are
> + * retried in later call of this function and finally cleaned up by calling
> + * this function with force = true or PGSTAT_STAT_MAX_INTERVAL milliseconds
> + * was elapsed since last cleanup. On the other hand updates by regular

s/was/has/

> +
> + /* Forecibly update other stats if any. */

s/Forecibly/Forcibly/

Typo aside, what does forcibly mean here?

> /*
> - * Scan through the TabStatusArray struct(s) to find tables that actually
> - * have counts, and build messages to send. We have to separate shared
> - * relations from regular ones because the databaseid field in the message
> - * header has to depend on that.
> + * XX: We cannot lock two dshash entries at once. Since we must keep lock

Typically we spell that XXX, with three X's (or alternatively NB:).

> + * while tables stats are being updated we have no choice other than
> + * separating jobs for shared table stats and that of egular tables.

s/egular/regular/

> + * Looping over the array twice isapparently ineffcient and more efficient
> + * way is expected.
> */

s/isapparently/is apparently/
s/ineffcient/inefficient/

But I don't know what this sentence is trying to say precisely. Are you
saying this cannot be committed unless this is fixed?

Nor do I understand why it's actually relevant that we cannot lock two
dshash entries at once. The same table is never shared and unshared,
no?

> +/*
> + * Subroutine for pgstat_update_stat.
> + *
> + * Appies table stats in table status array merging with pending stats if any.

s/Appies/Applies/

> + * If force is true waits until required locks to be acquired. Elsewise stats

s/elsewise/otherwise/

> + * merged stats as pending sats and it will be processed in the next chance.

s/sats/stats/
s/in the next chance/at the next chance/

> + /* if pending update exists, it should be applied along with */
> + if (pgStatPendingTabHash != NULL)

Why is any of this done if there's no pending data?

> {
> - pgstat_send_tabstat(this_msg);
> - this_msg->m_nentries = 0;
> + pentry = hash_search(pgStatPendingTabHash,
> + (void *) entry, HASH_FIND, NULL);
> +
> + if (pentry)
> + {
> + /* merge new update into pending updates */
> + pgstat_merge_tabentry(pentry, entry, false);
> + entry = pentry;
> + }
> + }
> +
> + /* try to apply the merged stats */
> + if (pgstat_apply_tabstat(cxt, entry, !force))
> + {
> + /* succeeded. remove it if it was pending stats */
> + if (pentry && entry != pentry)
> + hash_search(pgStatPendingTabHash,
> + (void *) pentry, HASH_REMOVE, NULL);

Huh, how can entry != pentry in the case of pending stats? They're
literally set to the same value above?

> + else if (!pentry)
> + {
> + /* failed and there was no pending entry, create new one. */
> + bool found;
> +
> + if (pgStatPendingTabHash == NULL)
> + {
> + HASHCTL ctl;
> +
> + memset(&ctl, 0, sizeof(ctl));
> + ctl.keysize = sizeof(Oid);
> + ctl.entrysize = sizeof(PgStat_TableStatus);
> + pgStatPendingTabHash =
> + hash_create("pgstat pending table stats hash",
> + TABSTAT_QUANTUM,
> + &ctl,
> + HASH_ELEM | HASH_BLOBS);
> + }
> +
> + pentry = hash_search(pgStatPendingTabHash,
> + (void *) entry, HASH_ENTER, &found);
> + Assert (!found);
> +
> + *pentry = *entry;
> }
> }
> - /* zero out TableStatus structs after use */
> - MemSet(tsa->tsa_entries, 0,
> - tsa->tsa_used * sizeof(PgStat_TableStatus));
> - tsa->tsa_used = 0;
> + }

I don't understand why we do this at all.

> + if (cxt->tabhash)
> + dshash_detach(cxt->tabhash);

Huh, why do we detach here?

> +/*
> + * pgstat_apply_tabstat: update shared stats entry using given entry
> + *
> + * If nowait is true, just returns false on lock failure. Dshashes for table
> + * and function stats are kept attached and stored in ctx. The caller must
> + * detach them after use.
> + */
> +bool
> +pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
> + PgStat_TableStatus *entry, bool nowait)
> +{
> + Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> + int table_mode = PGSTAT_FETCH_EXCLUSIVE;
> + bool updated = false;
> +
> + if (nowait)
> + table_mode |= PGSTAT_FETCH_NOWAIT;
> +
> + /*
> + * We need to keep lock on dbentries for regular tables to avoid race
> + * condition with drop database. So we hold it in the context variable. We
> + * don't need that for shared tables.
> + */
> + if (!cxt->dbentry)
> + cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);

Oh, wait, what? *That's* the reason why we need to hold a lock on a
second entry?

Uhm, how can this actually be an issue? If we apply pending stats, we're
connected to the database, it therefore cannot be dropped while we're
applying stats, no?

> + /* attach shared stats table if not yet */
> + if (!cxt->tabhash)
> + {
> + /* apply database stats */
> + if (!entry->t_shared)
> + {
> + /* Update database-wide stats */
> + cxt->dbentry->n_xact_commit += pgStatXactCommit;
> + cxt->dbentry->n_xact_rollback += pgStatXactRollback;
> + cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
> + cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
> + pgStatXactCommit = 0;
> + pgStatXactRollback = 0;
> + pgStatBlockReadTime = 0;
> + pgStatBlockWriteTime = 0;
> + }

Uh, this seems to have nothing to do with "attach shared stats table if
not yet".

> + /* create hash if not yet */
> + if (dbentry->functions == DSM_HANDLE_INVALID)
> + {
> + funchash = dshash_create(area, &dsh_funcparams, 0);
> + dbentry->functions = dshash_get_hash_table_handle(funchash);
> + }
> + else
> + funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);

Why is this created on-demand?

> + /*
> + * First, we empty the transaction stats. Just move numbers to pending
> + * stats if any. Elsewise try to directly update the shared stats but
> + * create a new pending entry on lock failure.
> + */
> + if (pgStatFunctions)

I don't understand why we have both pgStatFunctions and
pgStatPendingFunctions (and the same for other such pairs). That
seems to make no sense to me. The comments for the former literally are:
/*
* Backends store per-function info that's waiting to be sent to the collector
* in this hash table (indexed by function OID).
*/

> static HTAB *
> pgstat_collect_oids(Oid catalogid)
> @@ -1241,62 +1173,54 @@ pgstat_collect_oids(Oid catalogid)
> /* ----------
> * pgstat_drop_database() -
> *
> - * Tell the collector that we just dropped a database.
> - * (If the message gets lost, we will still clean the dead DB eventually
> - * via future invocations of pgstat_vacuum_stat().)
> + * Remove entry for the database that we just dropped.
> + *
> + * If some stats update happens after this, this entry will re-created but
> + * we will still clean the dead DB eventually via future invocations of
> + * pgstat_vacuum_stat().
> * ----------
> */
> +
> void
> pgstat_drop_database(Oid databaseid)
> {

Mixed indentation, added newline.

> +/*
> + * snapshot_statentry() - Find an entriy from source dshash.
> + *

s/entriy/entry/

Ok, getting too tired now. Two AM in an airport lounge is not the
easiest place and time to concentrate...

I don't think this is all that close to being committable :(

Greetings,

Andres Freund
