Quick Links

Re: shared-memory based stats collector

From:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To:	andres(at)anarazel(dot)de
Cc:	sfrost(at)snowman(dot)net, alvherre(at)2ndquadrant(dot)com, michael(at)paquier(dot)xyz, thomas(dot)munro(at)gmail(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, ah(at)cybertec(dot)at, magnus(at)hagander(dot)net, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: shared-memory based stats collector
Date:	2020-10-01 00:07:22
Message-ID:	20201001.090722.322196928507744460.horikyota.ntt@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

At Fri, 25 Sep 2020 09:27:26 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> Thanks for reviewing!
>
> At Mon, 21 Sep 2020 19:47:04 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in
> > Hi,
> >
> > On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
> > > Locks on the shared statistics is acquired by the units of such like
> > > tables, functions so the expected chance of collision are not so high.
> >
> > I can't really parse that...
>
> Mmm... Is the following readable?
>
> Shared statistics locks are acquired by units such as tables,
> functions, etc., so the chances of an expected collision are not so
> high.
>
> Anyway, this is found to be wrong, so I removed it.

01: (Fixed?)
> > > Furthermore, until 1 second has elapsed since the last flushing to
> > > shared stats, lock failure postpones stats flushing so that lock
> > > contention doesn't slow down transactions.
> >
> > I think I commented on that before, but to me 1s seems way too low to
> > switch to blocking lock acquisition. What's the reason for such a low
> > limit?
>
> It was 0.5 seconds previously. I don't have a clear idea of a
> reasonable value for it. One possible rationale might be to have 1000
> clients each have a writing time slot of 10ms.. So, 10s as the minimum
> interval. I set maximum interval to 60, and retry interval to
> 1s. (Fixed?)

02: (I'd appreciate it if you could suggest the appropriate one.)
> > > /*
> > > - * Clean up any dead statistics collector entries for this DB. We always
> > > + * Clean up any dead activity statistics entries for this DB. We always
> > > * want to do this exactly once per DB-processing cycle, even if we find
> > > * nothing worth vacuuming in the database.
> > > */
> >
> > What is "activity statistics"?
>
> I don't get your point. It is formally the replacement word for
> "statistics collector". The "statistics collector (process)" no longer
> exists, so I had to invent a name for the successor mechanism that is
> distinguishable with data/column statistics. If it is not the proper
> wording, I'd appreciate it if you could suggest the appropriate one.

03: (Fixed. Replaced with far simpler cache implement.)
> > > @@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
> > > }
> > >
> > > /* fetch the pgstat table entry */
> > > - tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
> > > - shared, dbentry);
> > > + tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
> > > + relid);
> >
> > Why do all of these places deal with a snapshot? For most it seems to
> > make much more sense to just look up the entry and then copy that into
> > local memory? There may be some place that need some sort of snapshot
> > behaviour that's stable until commit / pgstat_clear_snapshot(). But I
> > can't reallly see many?
>
> Ok, I reread this thread and agree that there's a (vague) consensus to
> remove the snapshot stuff. Backend-statistics (bestats) still are
> stable during a transaction.

If we nuked the snapshot stuff completely, pgstatfuns.c needed many
additional pfree()s since it calls pgstat_fetch* many times for the
same object. I choosed to make pgstat_fetch_stat_*() functions return
a result stored in static variables. It doesn't work a transactional
way as before but keeps the last result for a while then invalidated
by transaction end time at most.

04: (Fixed.)
> > > +#define PGSTAT_MIN_INTERVAL 1000 /* Minimum interval of stats data
> > >
> > > +#define PGSTAT_MAX_INTERVAL 10000 /* Longest interval of stats data
> > > + * updates */
> >
> > These don't really seem to be in line with the commit message...
>
> Oops! Sorry. Fixed both of this value and the commit message (and the
> file comment).

05: (The struct is gone.)
> > > + * dshash pgStatSharedHash
> > > + * -> PgStatHashEntry (dshash entry)
> > > + * (dsa_pointer)-> PgStatEnvelope (dsa memory block)
> >
> > I don't like 'Envelope' that much. If I understand you correctly that's
> > a common prefix that's used for all types of stat objects, correct? If
> > so, how about just naming it PgStatEntryBase or such? I think it'd also
> > be useful to indicate in the "are stored as" part that PgStatEnvelope is
> > just the common prefix for an allocation.
>
> The name makes sense. Thanks! (But the struct is now gone..)

06: (Fixed.)
> > > -typedef struct TabStatHashEntry
> > > +static size_t pgstat_entsize[] =
> >
> > > +/* Ditto for local statistics entries */
> > > +static size_t pgstat_localentsize[] =
> > > +{
> > > + 0, /* PGSTAT_TYPE_ALL: not an entry */
> > > + sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
> > > + sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
> > > + sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
> > > +};
> >
> > These probably should be const as well.
>
> Right. Fixed.

07: (Fixed.)
> > > /*
> > > - * Backends store per-function info that's waiting to be sent to the collector
> > > - * in this hash table (indexed by function OID).
> > > + * Stats numbers that are waiting for flushing out to shared stats are held in
> > > + * pgStatLocalHash,
> > > */
> > > -static HTAB *pgStatFunctions = NULL;
> > > +typedef struct PgStatHashEntry
> > > +{
> > > + PgStatHashEntryKey key; /* hash key */
> > > + dsa_pointer env; /* pointer to shared stats envelope in DSA */
> > > +} PgStatHashEntry;
> > > +
> > > +/* struct for shared statistics entry pointed from shared hash entry. */
> > > +typedef struct PgStatEnvelope
> > > +{
> > > + PgStatTypes type; /* statistics entry type */
> > > + Oid databaseid; /* databaseid */
> > > + Oid objectid; /* objectid */
> >
> > Do we need this information both here and in PgStatHashEntry? It's
> > possible that it's worthwhile, but I am not sure it is.
>
> Same key values were stored in PgStatEnvelope, PgStat(Local)HashEntry,
> and PgStat_Stats*Entry. And I thought the same while developing. After
> some thoughts, I managed to remove the duplicate values other than
> PgStat(Local)HashEntry. Fixed.

08: (Fixed.)
> > > + size_t len; /* length of body, fixed per type. */
> >
> > Why do we need this? Isn't that something that can easily be looked up
> > using the type?
>
> Not only they are virtually fixed values, but they were found to be
> write-only variables. Removed.

09: (Fixed. "Envelope" is embeded in stats entry structs.)
> > > + LWLock lock; /* lightweight lock to protect body */
> > > + int body[FLEXIBLE_ARRAY_MEMBER]; /* statistics body */
> > > +} PgStatEnvelope;
> >
> > What you're doing here with 'body' doesn't provide enough guarantees
> > about proper alignment. E.g. if one of the entry types wants to store a
> > double, this won't portably work, because there's platforms that have 4
> > byte alignment for ints, but 8 byte alignment for doubles.
> >
> >
> > Wouldn't it be better to instead embed PgStatEnvelope into the struct
> > that's actually stored? E.g. something like
> >
> > struct PgStat_TableStatus
> > {
> > PgStatEnvelope header; /* I'd rename the type */
> > TimestampTz vacuum_timestamp; /* user initiated vacuum */
> > ...
> > }
> >
> > or if you don't want to do that because it'd require declaring
> > PgStatEnvelope in the header (not sure that'd really be worth avoiding),
> > you could just get rid of the body field and just do the calculation
> > using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))
>
> As the result of the modification so far, there is only one member,
> lock, left in the PgStatEnvelope (or PgStatEntryBase) struct. I chose
> to embed it to each PgStat_Stat*Entry structs as
> PgStat_StatEntryHeader.

10: (Fixed. Same as #03)
> > > + * Snapshot is stats entry that is locally copied to offset stable values for a
> > > + * transaction.
> ...
> > The amount of code needed for this snapshot stuff seems unreasonable to
> > me, especially because I don't see why we really need it. Is this just
> > so that there's no skew between all the columns of pg_stat_all_tables()
> > etc?
> >
> > I think this needs a lot more comments explaining what it's trying to
> > achieve.
>
> I don't insist on keeping the behavior. Removed snapshot stuff only
> of pgstat stuff. (beentry snapshot is left alone.)

11: (Fixed. Per-entry-type initialize is gone.)
> > > +/*
> > > + * Newly created shared stats entries needs to be initialized before the other
> > > + * processes get access it. get_stat_entry() calls it for the purpose.
> > > + */
> > > +typedef void (*entry_initializer) (PgStatEnvelope * env);
> >
> > I think we should try to not need it, instead declaring that all fields
> > are zero initialized. That fits well together with my suggestion to
> > avoid duplicating the database / object ids.
>
> Now that entries don't have type-specific fields that need a special
> care, I removed that stuff altogether.

12: (Fixed. Global stats memories are merged.)
> > > +static void
> > > +attach_shared_stats(void)
> > > +{
> ...
> > > + shared_globalStats = (PgStat_GlobalStats *)
> > > + dsa_get_address(area, StatsShmem->global_stats);
> > > + shared_archiverStats = (PgStat_ArchiverStats *)
> > > + dsa_get_address(area, StatsShmem->archiver_stats);
> > > +
> > > + shared_SLRUStats = (PgStatSharedSLRUStats *)
> > > + dsa_get_address(area, StatsShmem->slru_stats);
> > > + LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
> >
> > I don't think it makes sense to use dsa allocations for any of the fixed
> > size stats (global_stats, archiver_stats, ...). They should just be
> > direct members of StatsShmem? Then we also don't need the shared_*
> > helper variables
>
> I intended to reduce the amount of fixed-allocated shared memory, or
> make maximum use of DSA. However, you're right. Now they are members
> of StatsShmem.

13: (I couldn't address this fully..)
> > > + /* Load saved data if any. */
> > > + pgstat_read_statsfiles();
> >
> > Hm. Is it a good idea to do this as part of the shmem init function?
> > That's a lot more work than we normally do in these.
> >
> > > +/* ----------
> > > + * detach_shared_stats() -
> > > + *
> > > + * Detach shared stats. Write out to file if we're the last process and told
> > > + * to do so.
> > > + * ----------
> > > */
> > > static void
> > > -pgstat_reset_remove_files(const char *directory)
> > > +detach_shared_stats(bool write_stats)
> >
> > I think it'd be better to have an explicit call in the shutdown sequence
> > somewhere to write out the data, instead of munging detach and writing
> > stats out together.
>
> It is actually strange that attach_shared_stats reads file in a
> StatsLock section while it attaches existing shared memory area
> deliberately outside the same lock section. So I moved the call to
> pg_stat_read/write_statsfiles() out of StatsLock section as the first
> step. But I couldn't move pgstat_write_stats_files() out of (or,
> before or after) detach_shared_stats(), because I didn't find a way to
> reliably check if the exiting process is the last detacher by a
> separate function from detach_shared_stats().
>
> (continued)
> =====

14: (I believe it is addressed.)
> > + if (nowait)
> > + {
> > + /*
> > + * Don't flush stats too frequently. Return the time to the next
> > + * flush.
> > + */
>
> I think it's confusing to use nowait in the if when you actually mean
> !force.

Agreed. I'm hovering between using !force to the parameter "nowait"
of flush_tabstat() or using the relabeled variable nowait. I choosed
to use nowait in the attached.

15: (Not addressed.)
> > - for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
> > + if (pgStatLocalHash)
> > {
> > - for (i = 0; i < tsa->tsa_used; i++)
> > + /* Step 1: flush out other than database stats */
...
> > + case PGSTAT_TYPE_DB:
> > + if (ndbentries >= dbentlistlen)
> > + {
> > + dbentlistlen *= 2;
> > + dbentlist = repalloc(dbentlist,
> > + sizeof(PgStatLocalHashEntry *) *
> > + dbentlistlen);
> > + }
> > + dbentlist[ndbentries++] = lent;
> > + break;
>
> Why do we need this special behaviour for database statistics?

Some of the table stats numbers are also counted as database stats
numbers. It is currently added at stats-sending time (in
pgstat_recv_tabstat()) and this follows that design. If we add such
table stats numbers to database stats before flushing out table stats,
we need to remember whether that number are already added to database
stats or not yet.

16: (Fixed. Used List.)
> If we need it,it'd be better to just use List here rather than open
> coding a replacement (List these days basically has the same complexity
> as what you do here).

Agreed. (I noticed that lappend is faster than lcons now.) Fixed.

17: (Fixed. case-default is removed, and PGSTAT_TYPE_ALL is removed by #28)
> > + case PGSTAT_TYPE_TABLE:
> > + if (flush_tabstat(lent->env, nowait))
> > + remove = true;
> > + break;
> > + case PGSTAT_TYPE_FUNCTION:
> > + if (flush_funcstat(lent->env, nowait))
> > + remove = true;
> > + break;
> > + default:
> > + Assert(false);
>
> Adding a default here prevents the compiler from issuing a warning when
> new types of stats are added...

Agreed. Another instance of switch on the same enum doesn't have
default:. (Fixed.)

18: (Fixed.)
> > + /* Remove the successfully flushed entry */
> > + pfree(lent->env);
>
> Probably worth zeroing the pointer here, to make debugging a little
> easier.

Agreed. I did the same to another instance of freeing a memory chunk
pointed from non-block-local pointers.

19: (Fixed. LWLocks is replaced with atmoic update.)
> > + /* Publish the last flush time */
> > + LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > + if (shared_globalStats->stats_timestamp < now)
> > + shared_globalStats->stats_timestamp = now;
> > + LWLockRelease(StatsLock);
>
> Ugh, that seems like a fairly unnecessary global lock acquisition. What
> do we need this timestamp for? Not clear to me that it's still
> needed. If it is needed, it'd probably worth making this an atomic and
> doing a compare-exchange loop instead.

The value is exposed via a system view. I used pg_atomic but I didn't
find a clean way to store TimestampTz into pg_atomic_u64.

20: (Wrote a comment to explain the reason.)
> > /*
> > - * Send partial messages. Make sure that any pending xact commit/abort
> > - * gets counted, even if there are no table stats to send.
> > + * If we have pending local stats, let the caller know the retry interval.
> > */
> > - if (regular_msg.m_nentries > 0 ||
> > - pgStatXactCommit > 0 || pgStatXactRollback > 0)
> > - pgstat_send_tabstat(&regular_msg);
> > - if (shared_msg.m_nentries > 0)
> > - pgstat_send_tabstat(&shared_msg);
> > + if (HAVE_ANY_PENDING_STATS())
>
> I think this needs a comment explaining why we still may have pending
> stats.

Does the following work?

| * Some of the local stats may have not been flushed due to lock
| * contention. If we have such pending local stats here, let the caller
| * know the retry interval.

21: (Fixed. Local cache of shared stats entry is added.)
> > + * flush_tabstat - flush out a local table stats entry
> > + *
...
> Could we cache the address of the shared entry in the local entry for a
> while? It seems we have a bunch of contention (that I think you're
> trying to address in a prototoype patch posted since) just because we
> will over and over look up the same address in the shared hash table.
>
> If we instead kept the local hashtable alive for longer and stored a
> pointer to the shared entry in it, we could make this a lot
> cheaper. There would be some somewhat nasty edge cases probably. Imagine
> a table being dropped for which another backend still has pending
> stats. But that could e.g. be addressed with a refcount.

Yeah, I noticed that and did that in the previous version (with a
silly bug..) The cache is based on the simple hash. All the entries
were dropped after a vacuum removed at least one shared stats entry in
the previous version. However, this version uses refcount and drops
only the entries actually needed to be dropped.

22: (vacuum/analyze immediately writes to shared stats according to #34)
> > + /* retrieve the shared table stats entry from the envelope */
> > + shtabstats = (PgStat_StatTabEntry *) &shenv->body;
> > +
> > + /* lock the shared entry to protect the content, skip if failed */
> > + if (!nowait)
> > + LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > + else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > + return false;
> > +
> > + /* add the values to the shared entry. */
> > + shtabstats->numscans += lstats->t_counts.t_numscans;
> > + shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
> > + shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
> > + shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
> > + shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
> > + shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
> > + shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
> > +
> > + /*
> > + * If table was truncated or vacuum/analyze has ran, first reset the
> > + * live/dead counters.
> > + */
> > + if (lstats->t_counts.t_truncated ||
> > + lstats->t_counts.vacuum_count > 0 ||
> > + lstats->t_counts.analyze_count > 0 ||
> > + lstats->t_counts.autovac_vacuum_count > 0 ||
> > + lstats->t_counts.autovac_analyze_count > 0)
> > + {
> > + shtabstats->n_live_tuples = 0;
> > + shtabstats->n_dead_tuples = 0;
> > + }
>
> > + /* clear the change counter if requested */
> > + if (lstats->t_counts.reset_changed_tuples)
> > + shtabstats->changes_since_analyze = 0;
>
> I know this is largely old code, but it's not obvious to me that there's
> no race conditions here / that the race condition didn't get worse. What
> prevents other backends to since have done a lot of inserts into this
> table? Especially in case the flushes were delayed due to lock
> contention.

# I noticed that I carelessly dropped inserts_since_vacuum code.

Well. if vacuum report is delayed after a massive insert commit, the
massive insert would be omitted. It seems to me that your suggestion
in #34 below gets the point.

> > + /*
> > + * Update vacuum/analyze timestamp and counters, so that the values won't
> > + * goes back.
> > + */
> > + if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
> > + shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
>
> It seems to me that if these branches are indeed a necessary branches,
> my concerns above are well founded...

I'm not sure it is simply a talisman against evil or basing on an
actual trouble, but I don't believe it's possible that a vacuum ends
after another vacuum that started later ends...

23: (ids are no longer stored in duplicate.)
> > +init_tabentry(PgStatEnvelope * env)
> > {
> > - int n;
> > - int len;
> > + PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
> > +
> > + /*
> > + * If it's a new table entry, initialize counters to the values we just
> > + * got.
> > + */
> > + Assert(env->type == PGSTAT_TYPE_TABLE);
> > + tabent->tableid = env->objectid;
>
> It seems over the top to me to have the object id stored in yet another
> place. It's now in the hash entry, in the envelope, and the type
> specific part.

Agreed, and fixed. (See #11 above)

24: (Fixed. Don't check for all-zero of a function stats entry at flush.)
> > +/*
> > + * flush_funcstat - flush out a local function stats entry
> > + *
> > + * If nowait is true, this function returns false on lock failure. Otherwise
> > + * this function always returns true.
> > + *
> > + * Returns true if the entry is successfully flushed out.
> > + */
> > +static bool
> > +flush_funcstat(PgStatEnvelope * env, bool nowait)
> > +{
> > + /* we assume this inits to all zeroes: */
> > + static const PgStat_FunctionCounts all_zeroes;
> > + PgStat_BackendFunctionEntry *localent; /* local stats entry */
> > + PgStatEnvelope *shenv; /* shared stats envelope */
> > + PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
> > + bool found;
> > +
> > + Assert(env->type == PGSTAT_TYPE_FUNCTION);
> > + localent = (PgStat_BackendFunctionEntry *) &env->body;
> > +
> > + /* Skip it if no counts accumulated for it so far */
> > + if (memcmp(&localent->f_counts, &all_zeroes,
> > + sizeof(PgStat_FunctionCounts)) == 0)
> > + return true;
>
> Why would we have an entry in this case?

Right. A function entry was zeroed out in master but the entry is not
created in that case with this patch. Removed it. (Fixed)

25: (Perhaps fixed. I'm not confident, though.)
> > + /* find shared table stats entry corresponding to the local entry */
> > + shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
> > + nowait, init_funcentry, &found);
> > + /* skip if dshash failed to acquire lock */
> > + if (shenv == NULL)
> > + return false; /* failed to acquire lock, skip */
> > +
> > + /* retrieve the shared table stats entry from the envelope */
> > + sharedent = (PgStat_StatFuncEntry *) &shenv->body;
> > +
> > + /* lock the shared entry to protect the content, skip if failed */
> > + if (!nowait)
> > + LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > + else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > + return false; /* failed to acquire lock, skip */
>
> It doesn't seem great that we have a separate copy of all of this logic
> again. It seems to me that most of the code here is (or should be)
> exactly the same as in table case. I think only the the below should be
> in here, rather than in common code.

I failed to get the last phrase, but I guess you suggested that I
should factor-out the common code.

> > +/*
> > + * flush_dbstat - flush out a local database stats entry
> > + *
> > + * If nowait is true, this function returns false on lock failure. Otherwise
...
> > + /* lock the shared entry to protect the content, skip if failed */
> > + if (!nowait)
> > + LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > + else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > + return false;
>
> Dito re duplicating all of this.

26: (Fixed. Now all stats are saved in one file.)
> > +/*
> > + * Create the filename for a DB stat file; filename is output parameter points
> > + * to a character buffer of length len.
> > + */
> > +static void
> > +get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
> > +{
> > + int printed;
> > +
> > + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
> > + printed = snprintf(filename, len, "%s/db_%u.%s",
> > + PGSTAT_STAT_PERMANENT_DIRECTORY,
> > + databaseid,
> > + tempname ? "tmp" : "stat");
> > + if (printed >= len)
> > + elog(ERROR, "overlength pgstat path");
> > }
>
> Do we really want database specific storage after all of these changes?
> Seems like there's no point anymore?

Sounds reasonable. Since we no longer keep the file format,
pgstat_read/write_statsfiles() gets far simpler. (Fixed)

27: (Fixed. added CFI to the same kind of loops.)
> > + dshash_seq_init(&hstat, pgStatSharedHash, false);
> > + while ((p = dshash_seq_next(&hstat)) != NULL)
> > {
> > - Oid tabid = tabentry->tableid;
> > -
> > - CHECK_FOR_INTERRUPTS();
> > -
>
> Given that this could take a while on a database with a lot of objects
> it might worth keeping the CHECK_FOR_INTERRUPTS().

Agreed. It seems like a mistake. (Fixed pstat_read/write_statsfile()).

28: (Fixed. collect_stat_entries is removed along with PGSTAT_TYPE_ALL.)
> > /* ----------
> > - * pgstat_vacuum_stat() -
> > + * collect_stat_entries() -
> > *
> > - * Will tell the collector about objects he can get rid of.
> > + * Collect the shared statistics entries specified by type and dbid. Returns a
> > + * list of pointer to shared statistics in palloc'ed memory. If type is
> > + * PGSTAT_TYPE_ALL, all types of statistics of the database is collected. If
> > + * type is PGSTAT_TYPE_DB, the parameter dbid is ignored and collect all
> > + * PGSTAT_TYPE_DB entries.
> > * ----------
> > */
> > -void
> > -pgstat_vacuum_stat(void)
> > +static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
> > {
>
> > - if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
> > + if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
> > + (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
> > continue;
>
> I don't like this interface much. Particularly not that it requires
> adding a PGSTAT_TYPE_ALL that's otherwise not needed. And the thing
> where PGSTAT_TYPE_DB doesn't actually works as one would expect isn't
> nice either.

Sounds reasonable. It was annoying that dbid=InvalidOid is a valid
value for this interface. But now that the function is called only
from two places and it is now simpler to use dshash seqscan
directly. The function and the enum item PGSTAT_TYPE_ALL are gone.
(Fixed)

29: (Fixed. collect_stat_entries is gone.)
> > + if (n >= listlen - 1)
> > + listlen *= 2;
> > + envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
> > + envlist[n++] = dsa_get_address(area, p->env);
> > }
>
> I'd use List here as well.

So the function no longer exists. (Fixed)

30: (...)
> > + dshash_seq_term(&hstat);
>
> Hm, I didn't immediately see which locking makes this safe? Is it just
> that nobody should be attached at this point?

I'm not sure I get your point, but I try to elaborate.

All the callers of collect_stat_entries have been replaced with a bare
loop of dshash_seq_next.

There are two levels of lock here. One is dshash partition lock that
is needed to continue in-partition scan safely. Another is a lock of
stats entry that is pointed from a dshash entry.

---
((PgStatHashEntry) shent).body -(dsa_get_address)-+-> PgStat_StatEntryHeader
|
((PgStatLocalHashEntry) lent).body ---------------^
---

Dshash scans are used for dropping and resetting stats entries. Entry
dropping is performed in the following steps.

(delete_current_stats_entry())
- Drop the dshash entry (needs exlock of dshash partition).

- If refcount of the stats entry body is already zero, free the memory
immediately .

- If not, set the "dropped" flag of the body. No lock is required
because the "dropped" flag won't be even referred to by other
backends until the next step is done.

- Increment deletion count of the shared hash. (this is used as the
"age" of local pointer cache hash (pgstat_cache).

(get_stat_entry())

- If dshash deletion count is different from the local cache age, scan
over the local cache hash to find "dropped" entries.

- Decrements refcount of the dropped entry and free the shared entry
if it is no longer referenced. Apparently no lock is required.

pgstat_drop_database() and pgstat_vacuum_stat() have concurrent
backends so the locks above are required. pgstat_write_statsfile() is
guaranteed to run alone so it doesn't matter either taking locks or
not.

pgstat_reset_counters() doesn't drop or modify dshash entries so
dshash scan requires shared lock. The stats entry body is updated so it
needs exclusive lock.

31: (Fixed. Use List instead of the open coding.)
> > +void
> > +pgstat_vacuum_stat(void)
> > +{
...
> > + /* collect victims from shared stats */
> > + arraylen = 16;
> > + victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
> > + nvictims = 0;
>
> Same List comment as before.

The function uses a list now. (Fixed)

32: (Fixed.)
> > void
> > pgstat_reset_counters(void)
> > {
> > - PgStat_MsgResetcounter msg;
> > + PgStatEnvelope **envlist;
> > + PgStatEnvelope **p;
> >
> > - if (pgStatSock == PGINVALID_SOCKET)
> > - return;
> > + /* Lookup the entries of the current database in the stats hash. */
> > + envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
> > + for (p = envlist; *p != NULL; p++)
> > + {
> > + PgStatEnvelope *env = *p;
> > + PgStat_StatDBEntry *dbstat;
> >
> > - pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
> > - msg.m_databaseid = MyDatabaseId;
> > - pgstat_send(&msg, sizeof(msg));
> > + LWLockAcquire(&env->lock, LW_EXCLUSIVE);
> > +
>
> What locking prevents this entry from being freed between the
> collect_stat_entries() and this LWLockAcquire?

Mmm. They're not protected. The attached version no longer uses the
intermediate list and the fetched dshash entry is protected by dshash
partition lock. (Fixed)

33: (Will keep the current code.)
> > /* ----------
> > @@ -1440,48 +1684,63 @@ pgstat_reset_slru_counter(const char *name)
> > void
> > pgstat_report_autovac(Oid dboid)
> > {
> > - PgStat_MsgAutovacStart msg;
> > + PgStat_StatDBEntry *dbentry;
> > + TimestampTz ts;
> >
> > - if (pgStatSock == PGINVALID_SOCKET)
> > + /* return if activity stats is not active */
> > + if (!area)
> > return;
> >
> > - pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
> > - msg.m_databaseid = dboid;
> > - msg.m_start_time = GetCurrentTimestamp();
> > + ts = GetCurrentTimestamp();
> >
> > - pgstat_send(&msg, sizeof(msg));
> > + /*
> > + * Store the last autovacuum time in the database's hash table entry.
> > + */
> > + dbentry = get_local_dbstat_entry(dboid);
> > + dbentry->last_autovac_time = ts;
> > }
>
> Why did you introduce the local ts variable here?

The function used to assign the timestamp within a LWLock section. In
the last version it writes to local entry so the lock was useless but
the amendment following to the comment #34 just below introduces
LWLocks again.

34: (Fixed. Vacuum/analyze write shared stats instantly.)
> > /* --------
> > * pgstat_report_analyze() -
> > *
> > - * Tell the collector about the table we just analyzed.
> > + * Report about the table we just analyzed.
> > *
> > * Caller must provide new live- and dead-tuples estimates, as well as a
> > * flag indicating whether to reset the changes_since_analyze counter.
> > @@ -1492,9 +1751,10 @@ pgstat_report_analyze(Relation rel,
> > PgStat_Counter livetuples, PgStat_Counter deadtuples,
> > bool resetcounter)
> > {
> > }
>
> It seems to me that the analyze / vacuum cases would be much better
> dealth with by synchronously operating on the shared entry, instead of
> going through the local hash table. ISTM that that'd make it a lot

Blocking at the beginning and end of such operations doesn't
matter. Sounds reasonbale.

> going through the local hash table. ISTM that that'd make it a lot
> easier to avoid most of the ordering issues.

Agreed. That avoid at least the case of delayed vacuum report (#22).

35: (Fixed, needing a change of how relcache uses local stats.)
> > +static PgStat_TableStatus *
> > +get_local_tabstat_entry(Oid rel_id, bool isshared)
> > +{
> > + PgStatEnvelope *env;
> > + PgStat_TableStatus *tabentry;
> > + bool found;
> >
> > - /*
> > - * Now we can fill the entry in pgStatTabHash.
> > - */
> > - hash_entry->tsa_entry = entry;
> > + env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
> > + isshared ? InvalidOid : MyDatabaseId,
> > + rel_id, true, &found);
> >
> > - return entry;
> > + tabentry = (PgStat_TableStatus *) &env->body;
> > +
> > + if (!found)
> > + {
> > + tabentry->t_id = rel_id;
> > + tabentry->t_shared = isshared;
> > + tabentry->trans = NULL;
> > + MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
> > + tabentry->vacuum_timestamp = 0;
> > + tabentry->autovac_vacuum_timestamp = 0;
> > + tabentry->analyze_timestamp = 0;
> > + tabentry->autovac_analyze_timestamp = 0;
> > + }
> > +
>
> As with shared entries, I think this should just be zero initialized
> (and we should try to get rid of the duplication of t_id/t_shared).

Ah! Yeah, they are removable since we already converted them into the
key of hash entry. Removed oids and the intialization code from all
types of local stats entry types.

One annoyance doing that was pgstat_initstats, which assumes the
pgstat_info linked from relation won't be freed. Finally I tightned
up the management of pgstat_info link. The link between relcache and
table stats entry is now a bidirectional link and explicitly de-linked
by a new function pgstat_delinkstats().

36: (Perhaps fixed. I'm not confident, though.)
> > + return tabentry;
> > }
> >
> > +
> > /*
> > * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
> > *
> > - * If no entry, return NULL, don't create a new one
> > + * Find any existing PgStat_TableStatus entry for rel from the current
> > + * database then from shared tables.
>
> What do you mean with "from the current database then from shared
> tables"?

It is rewritten as the following, is this readable?

| * Find any existing PgStat_TableStatus entry for rel_id in the current
| * database. If not found, try finding from shared tables.

37: (Maybe fixed.)
> > void
> > -pgstat_send_archiver(const char *xlog, bool failed)
> > +pgstat_report_archiver(const char *xlog, bool failed)
> > {
..
> > + if (failed)
> > + {
> > + /* Failed archival attempt */
> > + LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > + ++shared_archiverStats->failed_count;
> > + memcpy(shared_archiverStats->last_failed_wal, xlog,
> > + sizeof(shared_archiverStats->last_failed_wal));
> > + shared_archiverStats->last_failed_timestamp = now;
> > + LWLockRelease(StatsLock);
> > + }
> > + else
> > + {
> > + /* Successful archival operation */
> > + LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > + ++shared_archiverStats->archived_count;
> > + memcpy(shared_archiverStats->last_archived_wal, xlog,
> > + sizeof(shared_archiverStats->last_archived_wal));
> > + shared_archiverStats->last_archived_timestamp = now;
> > + LWLockRelease(StatsLock);
> > + }
> > }
>
> Huh, why is this duplicating near equivalent code?

To avoid branches within a lock section, or since it is simply
expanded from the master. They can be reset by backends so I couldn't
change it to use changecount protocol. So it still uses LWLock but the
common code is factored out in the attached version.

In connection with this, While I was looking at bgwriter and
checkpointer to see if the statistics of the two could be split, I
found the following comment in checkpoiner.c.

| * Send off activity statistics to the activity stats facility. (The
| * reason why we re-use bgwriter-related code for this is that the
| * bgwriter and checkpointer used to be just one process. It's
| * probably not worth the trouble to split the stats support into two
| * independent stats message types.)

So I split the two to try getting rid of LWLock for the global stats,
but resetting counter prevented me from doing that. In the attached
version, I left it as it is because I've done it..

38: (Haven't addressed.)
> > /* ----------
> > * pgstat_write_statsfiles() -
> > - * Write the global statistics file, as well as requested DB files.
> > - *
> > - * 'permanent' specifies writing to the permanent files not temporary ones.
> > - * When true (happens only when the collector is shutting down), also remove
> > - * the temporary files so that backends starting up under a new postmaster
> > - * can't read old data before the new collector is ready.
> > - *
> > - * When 'allDbs' is false, only the requested databases (listed in
> > - * pending_write_requests) will be written; otherwise, all databases
> > - * will be written.
> > + * Write the global statistics file, as well as DB files.
> > * ----------
> > */
> > -static void
> > -pgstat_write_statsfiles(bool permanent, bool allDbs)
> > +void
> > +pgstat_write_statsfiles(void)
> > {
>
> Whats the locking around this?

No locking is used there. The code is (currently) guaranteed to be the
only process that reads it. Added a comment and an assertion. I did
the same to pgstat_read_statsfile().

39: (Fixed.)
> > -pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> > +pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
> > {
> > - HASH_SEQ_STATUS tstat;
> > - HASH_SEQ_STATUS fstat;
> > - PgStat_StatTabEntry *tabentry;
> > - PgStat_StatFuncEntry *funcentry;
> > + PgStatEnvelope **envlist;
> > + PgStatEnvelope **penv;
> > FILE *fpout;
> > int32 format_id;
> > Oid dbid = dbentry->databaseid;
> > @@ -5048,8 +4974,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> > char tmpfile[MAXPGPATH];
> > char statfile[MAXPGPATH];
> >
> > - get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
> > - get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
> > + get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
> > + get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
> >
> > elog(DEBUG2, "writing stats file \"%s\"", statfile);
> >
> > @@ -5076,24 +5002,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> > /*
> > * Walk through the database's access stats per table.
> > */
> > - hash_seq_init(&tstat, dbentry->tables);
> > - while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
> > + envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
> > + for (penv = envlist; *penv != NULL; penv++)
>
> In several of these collect_stat_entries() callers it really bothers me
> that we basically allocate an array as large as the number of objects
> in the database (That's fine for databases, but for tables...). Without
> much need as far as I can see.

collect_stat_entries() is removed (#28) and the callers now handles
entries directly in the dshash_seq_next loop.

40: (Fixed.)
> > {
> > + PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
> > +
> > fputc('T', fpout);
> > rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
> > (void) rc; /* we'll check for error with ferror */
> > }
> > + pfree(envlist);
> >
> > /*
> > * Walk through the database's function stats table.
> > */
> > - hash_seq_init(&fstat, dbentry->functions);
> > - while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
> > + envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
> > + for (penv = envlist; *penv != NULL; penv++)
> > {
> > + PgStat_StatFuncEntry *funcentry =
> > + (PgStat_StatFuncEntry *) &(*penv)->body;
> > +
> > fputc('F', fpout);
> > rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
> > (void) rc; /* we'll check for error with ferror */
> > }
> > + pfree(envlist);
>
> Why do we need separate loops for every type of object here?

Just to keep the file format. But we decided to change it (#26) and it
is now a juble of all kinds of stats
entries. pgstat_write/read_statsfile() become far simpler.

41: (Fixed.)
> > +/* ----------
> > + * create_missing_dbentries() -
> > + *
> > + * There may be the case where database entry is missing for the database
> > + * where object stats are recorded. This function creates such missing
> > + * dbentries so that so that all stats entries can be written out to files.
> > + * ----------
> > + */
> > +static void
> > +create_missing_dbentries(void)
> > +{
>
> In which situation is this necessary?

It is because the old file format required that entries. It is no
longer needed and removed in #26.

42: (Sorry, but I didn't get your point..)
> > +static PgStatEnvelope *
> > +get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
> > + bool nowait, entry_initializer initfunc, bool *found)
> > +{
>
> > + bool create = (initfunc != NULL);
> > + PgStatHashEntry *shent;
> > + PgStatEnvelope *shenv = NULL;
> > + PgStatHashEntryKey key;
> > + bool myfound;
> > +
> > + Assert(type != PGSTAT_TYPE_ALL);
> > +
> > + key.type = type;
> > + key.databaseid = dbid;
> > + key.objectid = objid;
> > + shent = dshash_find_extended(pgStatSharedHash, &key,
> > + create, nowait, create, &myfound);
> > + if (shent)
> > {
> > - get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
> > + if (create && !myfound)
> > + {
> > + /* Create new stats envelope. */
> > + size_t envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
> > + dsa_pointer chunk = dsa_allocate0(area, envsize);
>
> > + /*
> > + * The lock on dshsh is released just after. Call initializer
> > + * callback before it is exposed to other process.
> > + */
> > + if (initfunc)
> > + initfunc(shenv);
> > +
> > + /* Link the new entry from the hash entry. */
> > + shent->env = chunk;
> > + }
> > + else
> > + shenv = dsa_get_address(area, shent->env);
> > +
> > + dshash_release_lock(pgStatSharedHash, shent);
>
> Doesn't this mean that by this time the entry could already have been
> removed by a concurrent backend, and the dsa allocation freed?

Does "by this time" mean before the dshash_find_extended, or after it
and until dshash_release_lock?

We can create an entry for a just droppted object but it should be
removed again by the next vacuum.

The newly created entry (or its partition) is exclusively locked so no
concurrent backend does not find it until the dshash_release_lock.

The shenv could be removed until the caller accesses it. But since the
function is requested for an existing object, that cannot be removed
until the first vacuum after the transaction end. I added a comment
just before the dshash_release_lock in get_stat_entry().

43: (Fixed. But has a side effect.)
> > Subject: [PATCH v36 7/7] Remove the GUC stats_temp_directory
> >
> > The GUC used to specify the directory to store temporary statistics
> > files. It is no longer needed by the stats collector but still used by
> > the programs in bin and contrib, and maybe other extensions. Thus this
> > patch removes the GUC but some backing variables and macro definitions
> > are left alone for backward compatibility.
>
> I don't see what this achieves? Which use of those variables / macros
> would would be safe? I think it'd be better to just remove them.

pg_stat_statements used PG_STAT_TMP directory to store a temporary
file. I just replaced it with PGSTAT_STAT_PERMANENT_DIRECTORY. As the
result basebackup copies the temporary file of pg_stat_statements.

By the way, basebackup exludes pg_stat_tmp diretory but sends pg_stat
direcoty. On the other hand when we start a server from a base backup,
it starts crash recovery first and removes stats files anyway. Why
does basebackup send pg_stat direcoty then? (Added as 0007.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
v38-0001-sequential-scan-for-dshash.patch	text/x-patch	8.8 KB
v38-0002-Add-conditional-lock-feature-to-dshash.patch	text/x-patch	6.2 KB
v38-0003-Make-archiver-process-an-auxiliary-process.patch	text/x-patch	17.7 KB
v38-0004-Shared-memory-based-stats-collector.patch	text/x-patch	268.5 KB
v38-0005-Doc-part-of-shared-memory-based-stats-collector.patch	text/x-patch	20.7 KB
v38-0006-Remove-the-GUC-stats_temp_directory.patch	text/x-patch	13.6 KB
v38-0007-Exclude-pg_stat-directory-from-base-backup.patch	text/x-patch	1.2 KB

In response to

Re: shared-memory based stats collector at 2020-09-25 00:27:26 from Kyotaro Horiguchi

Responses

Re: shared-memory based stats collector at 2020-10-06 01:06:44 from Kyotaro Horiguchi

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2020-10-01 00:24:30	Re: BUG #16419: wrong parsing BC year in to_date() function
Previous Message	Fujii Masao	2020-10-01 00:05:19	Re: New statistics for tuning WAL buffer size