Re: shared-memory based stats collector

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: andres(at)anarazel(dot)de
Cc: tomas(dot)vondra(at)2ndquadrant(dot)com, alvherre(at)2ndquadrant(dot)com, ah(at)cybertec(dot)at, magnus(at)hagander(dot)net, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2019-02-15 08:29:00
Message-ID: 20190215.172900.84235698.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello. Thank you for the comment.

At Thu, 7 Feb 2019 13:10:08 -0800, Andres Freund <andres(at)anarazel(dot)de> wrote in <20190207211008(dot)nc3axviivmcoaluq(at)alap3(dot)anarazel(dot)de>
> Hi,
>
> On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > index 7eed5866d2..e52ae54821 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
> > &sync_secs, &sync_usecs);
> >
> > /* Accumulate checkpoint timing summary data, in milliseconds. */
> > - BgWriterStats.m_checkpoint_write_time +=
> > + BgWriterStats.checkpoint_write_time +=
> > write_secs * 1000 + write_usecs / 1000;
> > - BgWriterStats.m_checkpoint_sync_time +=
> > + BgWriterStats.checkpoint_sync_time +=
> > sync_secs * 1000 + sync_usecs / 1000;
>
> Why does this patch do renames like this in the same entry as actual
> functional changes?

Just because it is no longer "messages". I'm ok to preserve them
as historcal names. Reverted.

> > @@ -1273,16 +1276,22 @@ do_start_worker(void)
> > break;
> > }
> > }
> > - if (skipit)
> > - continue;
> > + if (!skipit)
> > + {
> > + /* Remember the db with oldest autovac time. */
> > + if (avdb == NULL ||
> > + tmp->adw_entry->last_autovac_time <
> > + avdb->adw_entry->last_autovac_time)
> > + {
> > + if (avdb)
> > + pfree(avdb->adw_entry);
> > + avdb = tmp;
> > + }
> > + }
> >
> > - /*
> > - * Remember the db with oldest autovac time. (If we are here, both
> > - * tmp->entry and db->entry must be non-null.)
> > - */
> > - if (avdb == NULL ||
> > - tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
> > - avdb = tmp;
> > + /* Immediately free it if not used */
> > + if(avdb != tmp)
> > + pfree(tmp->adw_entry);
> > }
>
> This looks like another unrelated refactoring. Don't do this.

Rewrited it to less invasive way.

> > /* Transfer stats counts into pending pgstats message */
> > - BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
> > - BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
> > + BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
> > + BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
>
> More unrelated renaming. I'll stop mentioning this in the rest of this
> patch, but really don't do this, makes such a large unnecessary harder
> to review.

AFAICS it is done only to the struct. I reverted all of them.

> pgstat.c needs a header comment explaining the architecture of the
> approach.

I write some. Please check it.

> > +/*
> > + * Operation mode of pgstat_get_db_entry.
> > + */
> > +#define PGSTAT_FETCH_SHARED 0
> > +#define PGSTAT_FETCH_EXCLUSIVE 1
> > +#define PGSTAT_FETCH_NOWAIT 2
> > +
> > +typedef enum
> > +{
>
> Please don't create anonymous enums that are then typedef'd to a
> name. The underlying name and the one not scoped to typedefs shoudl be
> the same.

Sorry. I named the struct PgStat_TableLookupState and typedef'ed
with the same name.

> > +/*
> > + * report withholding facility.
> > + *
> > + * some report items are withholded if required lock is not acquired
> > + * immediately.
> > + */
>
> This comment needs polishing. The variables are named _pending_, but the
> comment talks about withholding - which doesn't seem like an apt name.

The comment was updated at v12, but it still doesn't
good. Rewrite as follows:

| * variables signals that the backend has some numbers that are waiting to be
| * written to shared stats.

> > /*
> > * Structures in which backends store per-table info that's waiting to be
> > @@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
> > * Hash table for O(1) t_id -> tsa_entry lookup
> > */
> > static HTAB *pgStatTabHash = NULL;
> > +static HTAB *pgStatPendingTabHash = NULL;
> >
> > /*
> > * Backends store per-function info that's waiting to be sent to the collector
> > * in this hash table (indexed by function OID).
> > */
> > static HTAB *pgStatFunctions = NULL;
> > -
> > -/*
> > - * Indicates if backend has some function stats that it hasn't yet
> > - * sent to the collector.
> > - */
> > -static bool have_function_stats = false;
> > +static HTAB *pgStatPendingFunctions = NULL;
>
> So this patch leaves us with a pgStatFunctions that has a comment
> explaining it's about "waiting to be sent" stats, and then additionally
> a pgStatPendingFunctions?

Mmm. Thanks . I changed the comment and separated pgSTatPending*
stuff from there and merged with pgstat_pending_*. And unified
the naming.

> > /* ------------------------------------------------------------
> > * Public functions used by backends follow
> > @@ -802,41 +436,107 @@ allow_immediate_pgstat_restart(void)
> > * pgstat_report_stat() -
> > *
> > * Must be called by processes that performs DML: tcop/postgres.c, logical
> > - * receiver processes, SPI worker, etc. to send the so far collected
> > - * per-table and function usage statistics to the collector. Note that this
> > - * is called only when not within a transaction, so it is fair to use
> > - * transaction stop time as an approximation of current time.
> > - * ----------
> > + * receiver processes, SPI worker, etc. to apply the so far collected
> > + * per-table and function usage statistics to the shared statistics hashes.
> > + *
> > + * This requires taking some locks on the shared statistics hashes and some
>
> Weird mix of different indentation.

Fixed. Unifed to tabs.

> > + * of updates may be withholded on lock failure. Pending updates are
> > + * retried in later call of this function and finally cleaned up by calling
> > + * this function with force = true or PGSTAT_STAT_MAX_INTERVAL milliseconds
> > + * was elapsed since last cleanup. On the other hand updates by regular
>
> s/was/has/

Ugg. Fixed.

> > +
> > + /* Forecibly update other stats if any. */
>
> s/Forecibly/Forcibly/
>
> Typo aside, what does forcibly mean here?

Meant that it should wait for lock to be acquired. But I don't
recall why. Changed it to follow "force" flag.

> > /*
> > - * Scan through the TabStatusArray struct(s) to find tables that actually
> > - * have counts, and build messages to send. We have to separate shared
> > - * relations from regular ones because the databaseid field in the message
> > - * header has to depend on that.
> > + * XX: We cannot lock two dshash entries at once. Since we must keep lock
>
> Typically we use three XXX (or alternatively NB:).

Fixed. I thought that the number 'X's represents how bad it is.
(Just kidding).

> > + * while tables stats are being updated we have no choice other than
> > + * separating jobs for shared table stats and that of egular tables.
>
> s/egular/regular/

Fixed

> > + * Looping over the array twice isapparently ineffcient and more efficient
> > + * way is expected.
> > */
>
> s/isapparently/is apparently/
> s/ineffcient/inefficient/
>
> But I don't know what this sentence is trying to say precisely. Are you
> saying this cannot be committed unless this is fixed?

Just I wanted to say I'd happy if could do all in a
loop. So.. I rewote it as the follows.

| * Flush pending stats separately for regular tables and shared tables
| * since we cannot hold locks on two dshash entries at once.

> Nor do I understand why it's actually relevant that we cannot lock two
> dshash entries at once. The same table is never shared and unshared,
> no?

Ah, I understood. A backend acculunates statistics on shared
tables, and tables of the connecting database. Shared table
entries are stored under database entry with id = 0 so we need to
use database entries of id = 0 and id = MyDatabaseId stored in
db_stats dshash.....

Ah, I found now that database entry for shared tables necessarily
in the db_stats dshash. Okey. I'll separate shared dbentry from
db_stats hash in the next version.

> > +/*
> > + * Subroutine for pgstat_update_stat.
> > + *
> > + * Appies table stats in table status array merging with pending stats if any.
>
> s/Appies/Applies/
>
>
> > + * If force is true waits until required locks to be acquired. Elsewise stats
>
> s/elsewise/ohterwise/
>
> > + * merged stats as pending sats and it will be processed in the next chance.
>
> s/sats/stats/
> s/in the next change/at the next chance/

Sorry for many silly (but even difficult for me) mistakes. All
fixed. I'll check all by ispell later.

> > + /* if pending update exists, it should be applied along with */
> > + if (pgStatPendingTabHash != NULL)
>
> Why is any of this done if there's no pending data?

Sorry, but I don't follow it. We cannot do anything to what doesn't exist.

> > {
> > - pgstat_send_tabstat(this_msg);
> > - this_msg->m_nentries = 0;
> > + pentry = hash_search(pgStatPendingTabHash,
> > + (void *) entry, HASH_FIND, NULL);
> > +
> > + if (pentry)
> > + {
> > + /* merge new update into pending updates */
> > + pgstat_merge_tabentry(pentry, entry, false);
> > + entry = pentry;
> > + }
> > + }
> > +
> > + /* try to apply the merged stats */
> > + if (pgstat_apply_tabstat(cxt, entry, !force))
> > + {
> > + /* succeeded. remove it if it was pending stats */
> > + if (pentry && entry != pentry)
> > + hash_search(pgStatPendingTabHash,
> > + (void *) pentry, HASH_REMOVE, NULL);
>
> Huh, how can entry != pentry in the case of pending stats? They're
> literally set to the same value above?

Seems right. Removed. I might have done something complex before..

> > + else if (!pentry)
> > + {
> > + /* failed and there was no pending entry, create new one. */
> > + bool found;
> > +
> > + if (pgStatPendingTabHash == NULL)
> > + {
> > + HASHCTL ctl;
> > +
> > + memset(&ctl, 0, sizeof(ctl));
> > + ctl.keysize = sizeof(Oid);
> > + ctl.entrysize = sizeof(PgStat_TableStatus);
> > + pgStatPendingTabHash =
> > + hash_create("pgstat pending table stats hash",
> > + TABSTAT_QUANTUM,
> > + &ctl,
> > + HASH_ELEM | HASH_BLOBS);
> > + }
> > +
> > + pentry = hash_search(pgStatPendingTabHash,
> > + (void *) entry, HASH_ENTER, &found);
> > + Assert (!found);
> > +
> > + *pentry = *entry;
> > }
> > }
> > - /* zero out TableStatus structs after use */
> > - MemSet(tsa->tsa_entries, 0,
> > - tsa->tsa_used * sizeof(PgStat_TableStatus));
> > - tsa->tsa_used = 0;
> > + }
>
> I don't understand why we do this at all.

If we didn't have pending stats of the table, and failed to apply
the stats in TSA, we should move it into pending stats hash.

Though we could merge numbers in TSA into pending hash, then
flush pending hash, I prefer to avoid useless relocation of stats
numbers from TSA to pending stats hash. Does it make sense?

>
> > + if (cxt->tabhash)
> > + dshash_detach(cxt->tabhash);
>
> Huh, why do we detach here?

To release the lock on cxt->dbentry. It may be destroyed.

> > +/*
> > + * pgstat_apply_tabstat: update shared stats entry using given entry
> > + *
> > + * If nowait is true, just returns false on lock failure. Dshashes for table
> > + * and function stats are kept attached and stored in ctx. The caller must
> > + * detach them after use.
> > + */
> > +bool
> > +pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
> > + PgStat_TableStatus *entry, bool nowait)
> > +{
> > + Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> > + int table_mode = PGSTAT_FETCH_EXCLUSIVE;
> > + bool updated = false;
> > +
> > + if (nowait)
> > + table_mode |= PGSTAT_FETCH_NOWAIT;
> > +
> > + /*
> > + * We need to keep lock on dbentries for regular tables to avoid race
> > + * condition with drop database. So we hold it in the context variable. We
> > + * don't need that for shared tables.
> > + */
> > + if (!cxt->dbentry)
> > + cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
>
> Oh, wait, what? *That's* the reason why we need to hold a lock on a
> second entry?

Yeah, one of the reasons.

> Uhm, how can this actually be an issue? If we apply pending stats, we're
> connected to the database, it therefore cannot be dropped while we're
> applying stats, no?

Ah!!!!uch. You're right. I'll consider that in the next version
soon. Thnaks for the insight.

> > + /* attach shared stats table if not yet */
> > + if (!cxt->tabhash)
> > + {
> > + /* apply database stats */
> > + if (!entry->t_shared)
> > + {
> > + /* Update database-wide stats */
> > + cxt->dbentry->n_xact_commit += pgStatXactCommit;
> > + cxt->dbentry->n_xact_rollback += pgStatXactRollback;
> > + cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
> > + cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
> > + pgStatXactCommit = 0;
> > + pgStatXactRollback = 0;
> > + pgStatBlockReadTime = 0;
> > + pgStatBlockWriteTime = 0;
> > + }
>
> Uh, this seems to have nothing to do with "attach shared stats table if
> not yet".

It's because the database stats needs to be applied once and
attaching tabhash happens once for a database. But, actually it
looks somewhat strange. In other words, I used cxt->tabhash as
the flat that indicates whether applying database stats is
requried. I rewote the comment there as the follows. It is
acceptable?

| * If we haven't attached the tabhash, we didn't apply database stats
| * yet. So apply it now..

> > + /* create hash if not yet */
> > + if (dbentry->functions == DSM_HANDLE_INVALID)
> > + {
> > + funchash = dshash_create(area, &dsh_funcparams, 0);
> > + dbentry->functions = dshash_get_hash_table_handle(funchash);
> > + }
> > + else
> > + funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
>
> Why is this created on-demand?

The reason is that function stats are optional, but one dshash
takes 1MB memory at creation time.

> > + /*
> > + * First, we empty the transaction stats. Just move numbers to pending
> > + * stats if any. Elsewise try to directly update the shared stats but
> > + * create a new pending entry on lock failure.
> > + */
> > + if (pgStatFunctions)
>
> I don't understand why we have both pgStatFunctions and pgStatFunctions
> and pgStatPendingFunctions (and the same for other such pairs). That
> seems to make no sense to me. The comments for the former literally are:
> /*
> * Backends store per-function info that's waiting to be sent to the collector
> * in this hash table (indexed by function OID).
> */

I should have did that naively comparing to pgStatTabHash. I'll
remove pgStatPendingFunctions in the next version.

> > static HTAB *
> > pgstat_collect_oids(Oid catalogid)
> > @@ -1241,62 +1173,54 @@ pgstat_collect_oids(Oid catalogid)
> > /* ----------
> > * pgstat_drop_database() -
> > *
> > - * Tell the collector that we just dropped a database.
> > - * (If the message gets lost, we will still clean the dead DB eventually
> > - * via future invocations of pgstat_vacuum_stat().)
> > + * Remove entry for the database that we just dropped.
> > + *
> > + * If some stats update happens after this, this entry will re-created but
> > + * we will still clean the dead DB eventually via future invocations of
> > + * pgstat_vacuum_stat().
> > * ----------
> > */
> > +
> > void
> > pgstat_drop_database(Oid databaseid)
> > {
>
> Mixed indentation, added newline.

Fixed.

> > +/*
> > + * snapshot_statentry() - Find an entriy from source dshash.
> > + *
>
> s/entriy/entry/
>
>
> Ok, getting too tired now. Two AM in an airport lounge is not the
> easiest place and time to concentrate...

Thank you very much for reviewing and sorry for the slow
response.

> I don't think this is all that close to being committable :(

I'm going to work harder on this. The remaining taks just now are
the follows:

- Separte shared database stats from db_stats hash.

- Consider relaxing dbentry locking.

- Try removing pgStatPendingFunctions

- ispell on it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v13-0001-sequential-scan-for-dshash.patch text/x-patch 10.6 KB
v13-0002-Add-conditional-lock-feature-to-dshash.patch text/x-patch 5.0 KB
v13-0003-Make-archiver-process-an-auxiliary-process.patch text/x-patch 12.0 KB
v13-0004-Shared-memory-based-stats-collector.patch text/x-patch 452.9 KB
v13-0005-Remove-the-GUC-stats_temp_directory.patch text/x-patch 10.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2019-02-15 11:09:49 Re: Problems with plan estimates in postgres_fdw
Previous Message Michael Paquier 2019-02-15 08:17:11 Re: 2019-03 CF Summary / Review - Tranche #1