Re: shared-memory based stats collector

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: ah(at)cybertec(dot)at
Cc: andres(at)anarazel(dot)de, magnus(at)hagander(dot)net, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared-memory based stats collector
Date: 2018-11-08 11:42:22
Message-ID: 20181108.204222.203165659.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Thank you for the comments.

This message contains the whole refactored patch set.

At Mon, 29 Oct 2018 15:10:10 +0100, Antonin Houska <ah(at)cybertec(dot)at> wrote in <28855(dot)1540822210(at)localhost>
> Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
> > This is more saner version of previous v5-0008, which didn't pass
> > regression test. v6-0008 to v6-0010 are attached and they are
> > applied on top of v5-0001-0007.
> >
> > - stats collector has been removed.
> >
> > - modified dshash further so that deletion is allowed during
> > sequential scan.
> >
> > - I'm not sure about the following existing comment at the
> > beginning of pgstat.c
> >
> > * - Add a pgstat config column to pg_database, so this
> > * entire thing can be enabled/disabled on a per db basis.
>
> Following is the next handful of my comments:
>
> * If you remove the stats collector, I think the remaining code in pgstat.c
> does no longer fit into the backend/postmaster/ directory.

I hadn't considered that, but I still can't find a nice place among
the existing directories. backend/statistics might be a candidate,
but I feel somewhat uneasy about it. In the end I split pgstat.c
into two files (as suggested in the file comment) and put both of
them into a new directory, backend/statmon. One part is the backend
status facility, named bestatus (BackEnd Status), and the other is
pgstat, the access statistics part. The last patch, 0008, does
that. I tried to move it earlier in the series but it was a bit
tough.

> * I'm not sure it's o.k. to call pgstat_write_statsfiles() from
> postmaster.c:reaper(): the function can raise ERROR (I see at least one code
> path: pgstat_write_statsfile() -> get_dbstat_filename()) and, as reaper() is
> a signal handler, it's hard to imagine the consequences. Maybe a reason to
> leave some functionality in a separate worker, although the "stats
> collector" would have to be changed.

I was careless about that. longjmp() from a signal handler is
prohibited, so we mustn't emit ERROR there. In the first place,
those calls were in the wrong places from several perspectives. I
changed the way stats are loaded and stored: in the attached patch
(0003), loading is performed while initializing shared memory in
postmaster, and storing is done in a shutdown hook in postmaster.
Since the shared memory area is inherited by all children, no
process does the initial attach any longer. In addition, the
archiver process became an auxiliary process (0004) since it now
writes to the shared stats.
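
To illustrate the constraint (a minimal sketch only, not taken from
the patch; the handler name is made up, and it needs postgres.h,
miscadmin.h and storage/latch.h): the safe pattern is for the
signal handler to only record the event and set the latch, and to
do anything that may raise ERROR, such as writing the stats files,
later outside the handler.

    static volatile sig_atomic_t shutdown_requested = false;

    static void
    handle_shutdown_signal(SIGNAL_ARGS)
    {
        int     save_errno = errno;

        shutdown_requested = true;   /* just remember the request */
        SetLatch(MyLatch);           /* wake the main loop */

        errno = save_errno;
    }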

> * Question still remains whether all the statistics should be loaded into
> shared memory, see the note on paging near the bottom of [1].

Even counting possible page-in latency when writing stats, I agree
with what Robert said in that message: on average we win for users
who don't create very many databases. If a stats page to be written
has been paged out, the related heap pages would likely also have
been evicted from shared buffers (or the buffer pages themselves
paged out), and every other resource that can be swapped out has
probably been swapped out as well. So I don't think it becomes a
serious problem. As for reading stats, we currently read a file,
sometimes after waiting for an up-to-date file to be written; I
think that case needs no further discussion.

For a cluster with very many databases, a backend running in one
database will mainly see only the stats for the current database
(and for shared tables), so we can split the stats by that
criterion in a next step.

> * if dshash_seq_init() is passed consistent=false, shouldn't we call
> ensure_valid_bucket_pointers() also from dshash_seq_next()? If the scan
> needs to access the next partition and the old partition lock got released,
> the table can be resized before the next partition lock is acquired, and
> thus the backend-local copy of buckets becomes obsolete.

Oops, you're right. In addition, resizing can happen while
dshash_seq_next() moves the lock to the next partition; a resize
happening in the middle breaks the sequential scan semantics. I
added ensure_valid_bucket_pointers() after the initial acquisition
of the partition lock, and the lock is now handed over seamlessly
during the scan. (0001)
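
For reference, the caller-side pattern I have in mind looks roughly
like this (a sketch only; the status/entry types and the meaning of
the boolean arguments are assumptions based on the call sites
quoted later in this mail):

    dshash_seq_status status;
    PgStat_StatDBEntry *dbentry;

    /*
     * consistent = false: only one partition lock is held at a time
     * and it is handed over at partition boundaries, so the table may
     * be resized in between; the backend-local bucket pointers are
     * revalidated when that happens.
     */
    dshash_seq_init(&status, db_stats, false, true);
    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&status)) != NULL)
    {
        /* inspect or delete the entry here */
    }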

> * Neither snapshot_statentry_all() nor backend_snapshot_all_db_entries() seems
> to be used in the current patch version.

Thanks. They are no longer used since we concluded that we don't
need strict consistency in the stats numbers. Removed. (0003)

> * pgstat_initstats(): I think WaitLatch() should be used instead of sleep().

Bgwriter and checkpointer used to wait for postmaster to finish
loading the stats files, but since I changed the startup sequence
(as mentioned above) the wait became useless. Most of those waits
are replaced with Assert. (0003)
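
(For completeness, had the wait stayed, the WaitLatch() form you
suggest would look roughly like this; a sketch only, not part of
the patch, and stats_files_loaded() is a hypothetical readiness
check:)

    for (;;)
    {
        int     rc;

        if (stats_files_loaded())
            break;

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       100L /* ms */, WAIT_EVENT_PG_SLEEP);
        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);
        ResetLatch(MyLatch);
    }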

> * pgstat_get_db_entry(): "return false" should probably be "return NULL".

I couldn't find that. (Wouldn't the compiler catch it?) Perhaps you
mean "found = false"? (That one might be a bit tricky.)

> * Is the PGSTAT_TABLE_WRITE flag actually used? Unlike PGSTAT_TABLE_CREATE, I
> couldn't find a place where it's value is tested.

Thank you for finding that. As you pointed out, PGSTAT_TABLE_WRITE
ends up unused since WRITE is always accompanied by CREATE in the
patch. I think WRITE is more readable than CREATE there, so I
removed CREATE instead. While fixing this I renamed all the
PGSTAT_TABLE_ symbols as follows.

PGSTAT_TABLE_READ -> PGSTAT_FETCH_SHARED
PGSTAT_TABLE_WRITE -> PGSTAT_FETCH_EXCLUSIVE
PGSTAT_TABLE_NOWAIT -> PGSTAT_FETCH_NOWAIT

PGSTAT_TABLE_NOT_FOUND -> PGSTAT_ENTRY_NOT_FOUND
PGSTAT_TABLE_FOUND -> PGSTAT_ENTRY_FOUND
PGSTAT_TABLE_LOCK_FAILED -> PGSTAT_LOCK_FAILED

> * dshash_seq_init(): does it need to be called with consistent=true from
> pgstat_vacuum_stat() when the the entries returned by the scan are just
> dropped?
>
> dshash_seq_init(&dshstat, db_stats, true, true);
>
> I suspect this is a thinko because another call from the same function looks
> like
>
> dshash_seq_init(&dshstat, dshtable, false, true);

It's a leftover that was simply carried over from the previous (v4)
snapshot-based code. Snapshots had that kind of consistency, but it
no longer looks useful. (0003)

As a result, consistent=false at all call sites, so I could remove
the parameter, but I'm leaving it alone for a while.

> * I'm not sure about usefulness of dshash_get_num_entries(). It passes
> consistent=false to dshash_seq_init(), so the number of entries can change
> during the execution. And even if the function requested a "consistent
> scan", entries can be added / removed as soon as the scan is over.
>
> If the returned value is used to decide whether the hashtable should be
> scanned or not, I think you don't break anything if you simply start the
> scan unconditionally and see if you find some entries.
> And if you really need to count the entries, I suggest that you use the
> per-partition counts (dshash_partition.count) instead of scanning individual
> entries.

It was mainly there to avoid a useless call to
pgstat_collect_oid(); the shortcut is useful because function stats
are usually not collected. Instead, I removed the function and now
create the function-stats dshash on demand, and changed the
condition from "dshash_get_num_entries() > 0" to
"dbentry->functions != DSM_HANDLE_INVALID". (0003)

> [1] https://www.postgresql.org/message-id/CA+TgmobQVbz4K_+RSmiM9HeRKpy3vS5xnbkL95gSEnWijzprKQ@mail.gmail.com

The new version of this patch set is attached to my reply to
Tomas's message.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
