Re: About to add WAL write/fsync statistics to pg_stat_wal view

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject: Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date: 2021-03-12 05:25:27
Message-ID: 98541fbe-efa0-1d50-0a1a-f0ead6bd3858@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021/03/11 23:33, Fujii Masao wrote:
>
>
> On 2021/03/11 21:29, Masahiro Ikeda wrote:
>> On 2021-03-11 11:52, Fujii Masao wrote:
>>> On 2021/03/11 9:38, Masahiro Ikeda wrote:
>>>> On 2021-03-10 17:08, Fujii Masao wrote:
>>>>> On 2021/03/10 14:11, Masahiro Ikeda wrote:
>>>>>> On 2021-03-09 17:51, Fujii Masao wrote:
>>>>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...).  This is also
>>>>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>>>>> statistics.  This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite.  Additionally, I observe that
>>>>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>>>>> receiver path does not.  It seems technically straight-forward to
>>>>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>>>>> writing the WAL).  Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead.  In either case, this
>>>>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, I fixed it.
>>>>>>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>>>>
>>>>>>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>>>>>>> +    if (enableFsync &&
>>>>>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>>>>> +    {
>>>>>>>>>>>>
>>>>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>>>>>>> to simplify the code more?
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the comments.
>>>>>>>>>>> I added the above code in v12 patch.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> +        /*
>>>>>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>>>>>> +         */
>>>>>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>>>>>> +            pgstat_send_wal();
>>>>>>>>>>>>
>>>>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>>>>
>>>>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>>>>
>>>>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>>>>
>>>>>>>>>>> +               /*
>>>>>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>>>>> +                * msec since we last sent one
>>>>>>>>>>> +                */
>>>>>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>>>>> +               {
>>>>>>>>>>> +                       pgstat_send_wal();
>>>>>>>>>>> +                       last_report = now;
>>>>>>>>>>> +               }
>>>>>>>>>>> +
>>>>>>>>>>>
>>>>>>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>>>>>>
>>>>>>>>> Agreed.
>>>>>>>>>
>>>>>>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>>>>
>>>>>>>>> I think that we can do that. What about the attached patch?
>>>>>>>>
>>>>>>>> Thanks, I thought it's better.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> I forgot to remove an unused variable.
>>>>>>>>>> The attached v13 patch is fixed.
>>>>>>>>>
>>>>>>>>> Thanks for updating the patch!
>>>>>>>>>
>>>>>>>>> +        w.wal_write,
>>>>>>>>> +        w.wal_write_time,
>>>>>>>>> +        w.wal_sync,
>>>>>>>>> +        w.wal_sync_time,
>>>>>>>>>
>>>>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>>>>> each other? That is, what about the following order of columns?
>>>>>>>>>
>>>>>>>>> wal_write
>>>>>>>>> wal_sync
>>>>>>>>> wal_write_time
>>>>>>>>> wal_sync_time
>>>>>>>>
>>>>>>>> Yes, I fixed it.
>>>>>>>>
>>>>>>>>> -        case SYNC_METHOD_OPEN:
>>>>>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>>> -            /* write synced it already */
>>>>>>>>> -            break;
>>>>>>>>>
>>>>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>>>>> here, as follows. Thought?
>>>>>>>>>
>>>>>>>>> +        case SYNC_METHOD_OPEN:
>>>>>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>>> +            /* not reachable */
>>>>>>>>> +            Assert(false);
>>>>>>>>
>>>>>>>> I agree.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>>>>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>>>>>>> But ISTM that this issue existed from before, for example checkpointer
>>>>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>>>>> this issue in this patch?
>>>>>>>>
>>>>>>>> Thanks, I thought it's better to do so.
>>>>>>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>>>>>>
>>>>>>> Thanks for 0003 patch!
>>>>>>>
>>>>>>> Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
>>>>>>> just send the stats only when ShutdownRequestPending is true in the walwriter
>>>>>>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>>>>>>> If we do this, we cannot send the stats when walwriter throws FATAL error.
>>>>>>> But that's ok because FATAL error on walwriter causes the server to crash.
>>>>>>> Thought?
>>>>>>
>>>>>> Thanks for your comments!
>>>>>> Yes, I agree.
>>>>>>
>>>>>>
>>>>>>> Also ISTM that we don't need to use the callback for that purpose in
>>>>>>> checkpointer because of the same reason. That is, we can send the stats
>>>>>>> just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
>>>>>>> Thought?
>>>>>>
>>>>>> Yes, I think so too.
>>>>>>
>>>>>> Since ShutdownXLOG() may create restartpoint or checkpoint,
>>>>>> it might generate WAL records.
>>>>>>
>>>>>>
>>>>>>> I'm now not sure how much useful these changes are. As far as I read pgstat.c,
>>>>>>> when shutdown is requested, the stats collector seems to exit even when
>>>>>>> there are outstanding stats messages. So if checkpointer and walwriter send
>>>>>>> the stats in their last cycles, those stats might not be collected.
>>>>>>>
>>>>>>> On the other hand, I can think that sending the stats in the last cycles would
>>>>>>> improve the situation a bit than now. So I'm inclined to apply those changes...
>>>>>>
>>>>>> I didn't notice that. I agree this is an important aspect.
>>>>>> I understood there is a case that the stats collector exits before the checkpointer
>>>>>> or the walwriter exits and some stats might not be collected.
>>>>>
>>>>> IIUC the stats collector basically exits after checkpointer and walwriter exit.
>>>>> But there seems no guarantee that the stats collector processes
>>>>> all the messages that other processes have sent during the shutdown of
>>>>> the server.
>>>>
>>>> Thanks, I understood the above postmaster behaviors.
>>>>
>>>> PMState manages the status and after checkpointer is exited, the postmaster sends
>>>> SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
>>>> (IIUC, although the postmaster kill the walsender, the archiver and
>>>> the stats collector at the same time, it's ok because the walsender
>>>> and the archiver doesn't send stats to the stats collector now.)
>>>>
>>>> But, there might be a corner case to lose stats sent by background workers like
>>>> the checkpointer before they exit (although this is not implemented yet.)
>>>>
>>>> For example,
>>>>
>>>> 1. checkpointer send the stats before it exit
>>>> 2. stats collector receive the signal and break before processing
>>>>     the stats message from checkpointer. In this case, 1's message is lost.
>>>> 3. stats collector writes the stats in the statsfiles and exit
>>>>
>>>> Why don't you recheck the coming message is zero just before the 2th procedure?
>>>> (v17-0004-guarantee-to-collect-last-stats-messages.patch)
>>>
>>> Yes, I was thinking the same. This is the straight-forward fix for this issue.
>>> The stats collector should process all the outstanding messages when
>>> normal shutdown is requested, as the patch does. On the other hand,
>>> if immediate shutdown is requested or emergency bailout (by postmaster death)
>>> is requested, maybe the stats collector should skip those processings
>>> and exit immediately.
>>>
>>> But if we implement that, we would need to teach the stats collector
>>> the shutdown type (i.e., normal shutdown or immediate one). Because
>>> currently SIGQUIT is sent to the collector whichever shutdown is requested,
>>> and so the collector cannot distinguish the shutdown type. I'm afraid that
>>> change is a bit overkill for now.
>>>
>>> BTW, I found that the collector calls pgstat_write_statsfiles() even at
>>> emergency bailout case, before exiting. It's not necessary to save
>>> the stats to the file in that case because subsequent server startup does
>>> crash recovery and clears that stats file. So it's better to make
>>> the collector exit immediately without calling pgstat_write_statsfiles()
>>> at emergency bailout case? Probably this should be discussed in other
>>> thread because it's different topic from the feature we're discussing here,
>>> though.
>>
>> IIUC, only the stats collector has another hander for SIGQUIT although
>> other background processes have a common hander for it and just call _exit(2).
>> I thought to guarantee when TerminateChildren(SIGTERM) is invoked, don't make stats
>> collector shutdown before other background processes are shutdown.
>>
>> I will make another thread to discuss that the stats collector should know the shutdown type or not.
>> If it should be, it's better to make the stats collector exit as soon as possible if the shutdown type
>> is an immediate, and avoid losing the remaining stats if it's normal.
>
> +1
>
>
>>
>>
>>
>>>> I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
>>>> I don't have any strong opinion to handle this case since this result shows to receive and processes
>>>> the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
>>>> signal in 5msec(099->104) after the checkpointer process exits.
>>>
>>> Agreed.
>>>
>>>>
>>>> ```
>>>> 1615421204.556 [checkpointer] DEBUG:  received shutdown request signal
>>>> 1615421208.099 [checkpointer] DEBUG:  proc_exit(-1): 0 callbacks to make              # exit and send the messages
>>>> 1615421208.099 [stats collector] DEBUG:  process BGWRITER stats message              # receive and process the messages
>>>> 1615421208.099 [stats collector] DEBUG:  process WAL stats message
>>>> 1615421208.104 [postmaster] DEBUG:  reaping dead processes
>>>> 1615421208.104 [stats collector] DEBUG:  received shutdown request signal             # receive shutdown request from the postmaster
>>>> ```
>>>>
>>>>>>> Of course, there is another direction; we can improve the stats collector so
>>>>>>> that it guarantees to collect all the sent stats messages. But I'm afraid
>>>>>>> this change might be big.
>>>>>>
>>>>>> For example, implement to manage background process status in shared memory and
>>>>>> the stats collector collects the stats until another background process exits?
>>>>>>
>>>>>> In my understanding, the statistics are not required high accuracy,
>>>>>> it's ok to ignore them if the impact is not big.
>>>>>>
>>>>>> If we guarantee high accuracy, another background process like autovacuum launcher
>>>>>> must send the WAL stats because it accesses the system catalog and might generate
>>>>>> WAL records due to HOT update even though the possibility is low.
>>>>>>
>>>>>> I thought the impact is small because the time uncollected stats are generated is
>>>>>> short compared to the time from startup. So, it's ok to ignore the remaining stats
>>>>>> when the process exists.
>>>>>
>>>>> I agree that it's not worth changing lots of code to collect such stats.
>>>>> But if we can implement that very simply, isn't it more worth doing
>>>>> that than current situation because we may be able to collect more
>>>>> accurate stats.
>>>>
>>>> Yes, I agree.
>>>> I attached the patch to send the stats before the wal writer and the checkpointer exit.
>>>> (v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)
>>>
>>> Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.
>>
>> Thanks for your comments and for making patches.
>>
>>
>>> Here is the review comments for 0001 patch.
>>>
>>> +/* Prototypes for private functions */
>>> +static void HandleWalWriterInterrupts(void);
>>>
>>> HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
>>> So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
>>> we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
>>> if ShutdownRequestPending is true in the main loop. Attached is the patch
>>> I implemented that way. Thought?
>>
>> I thought there is a corner case that can't send the stats like
>
> You're right! So IMO your patch (v17-0001-send-stats-for-walwriter-when-shutdown.patch) is better.
>
>
>>
>> ```
>> // First, ShutdownRequstPending = false
>>
>>      if (ShutdownRequestPending)    // don't send the stats
>>          pgstat_send_wal(true);
>>
>> // receive signal and ShutdownRequestPending became true
>>
>>      HandleMainLoopInterrupts();   // proc exit without sending the stats
>>
>> ```
>>
>> Is it ok because it almost never occurs?
>>
>>
>>> Here is the review comments for 0002 patch.
>>>
>>> +static void pgstat_send_checkpointer(void);
>>>
>>> I'm inclined to avoid adding the function with the prefix "pgstat_" outside
>>> pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
>>> pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.
>>
>> Thanks. I agree.
>
> Thanks for the review!
>
>
> So, barring any objection, I will commit the changes for
> walwriter and checkpointer. That is,
>
> v17-0001-send-stats-for-walwriter-when-shutdown.patch
> v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch

I pushed these two patches.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message tsunakawa.takay@fujitsu.com 2021-03-12 05:26:02 RE: [Patch] Optimize dropping of relation buffers using dlist
Previous Message Justin Pryzby 2021-03-12 05:15:06 Re: [HACKERS] Custom compression methods