Re: About to add WAL write/fsync statistics to pg_stat_wal view

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject: Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date: 2021-03-11 14:33:40
Message-ID: f91a1486-e20c-5679-4543-4e4255eaaa08@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021/03/11 21:29, Masahiro Ikeda wrote:
> On 2021-03-11 11:52, Fujii Masao wrote:
>> On 2021/03/11 9:38, Masahiro Ikeda wrote:
>>> On 2021-03-10 17:08, Fujii Masao wrote:
>>>> On 2021/03/10 14:11, Masahiro Ikeda wrote:
>>>>> On 2021-03-09 17:51, Fujii Masao wrote:
>>>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...).  This is also
>>>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>>>> statistics.  This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite.  Additionally, I observe that
>>>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>>>> receiver path does not.  It seems technically straight-forward to
>>>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>>>> writing the WAL).  Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead.  In either case, this
>>>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>>>
>>>>>>>>>>>> OK, I fixed it.
>>>>>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>>>
>>>>>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>>>>>> +    if (enableFsync &&
>>>>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>>>> +    {
>>>>>>>>>>>
>>>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>>>>>> to simplify the code more?
>>>>>>>>>>
>>>>>>>>>> Thanks for the comments.
>>>>>>>>>> I added the above code in v12 patch.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +        /*
>>>>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>>>>> +         */
>>>>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>>>>> +            pgstat_send_wal();
>>>>>>>>>>>
>>>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>>>>
>>>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>>>
>>>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>>>
>>>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>>>
>>>>>>>>>> +               /*
>>>>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>>>> +                * msec since we last sent one
>>>>>>>>>> +                */
>>>>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>>>> +               {
>>>>>>>>>> +                       pgstat_send_wal();
>>>>>>>>>> +                       last_report = now;
>>>>>>>>>> +               }
>>>>>>>>>> +
>>>>>>>>>>
>>>>>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>>>>>
>>>>>>>> Agreed.
>>>>>>>>
>>>>>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>>>
>>>>>>>> I think that we can do that. What about the attached patch?
>>>>>>>
>>>>>>> Thanks, I thought it's better.
>>>>>>>
>>>>>>>
>>>>>>>>> I forgot to remove an unused variable.
>>>>>>>>> The attached v13 patch is fixed.
>>>>>>>>
>>>>>>>> Thanks for updating the patch!
>>>>>>>>
>>>>>>>> +        w.wal_write,
>>>>>>>> +        w.wal_write_time,
>>>>>>>> +        w.wal_sync,
>>>>>>>> +        w.wal_sync_time,
>>>>>>>>
>>>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>>>> each other? That is, what about the following order of columns?
>>>>>>>>
>>>>>>>> wal_write
>>>>>>>> wal_sync
>>>>>>>> wal_write_time
>>>>>>>> wal_sync_time
>>>>>>>
>>>>>>> Yes, I fixed it.
>>>>>>>
>>>>>>>> -        case SYNC_METHOD_OPEN:
>>>>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>> -            /* write synced it already */
>>>>>>>> -            break;
>>>>>>>>
>>>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>>>> here, as follows. Thought?
>>>>>>>>
>>>>>>>> +        case SYNC_METHOD_OPEN:
>>>>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>> +            /* not reachable */
>>>>>>>> +            Assert(false);
>>>>>>>
>>>>>>> I agree.
>>>>>>>
>>>>>>>
>>>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>>>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>>>>>> But ISTM that this issue existed from before, for example checkpointer
>>>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>>>> this issue in this patch?
>>>>>>>
>>>>>>> Thanks, I thought it's better to do so.
>>>>>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>>>>>
>>>>>> Thanks for 0003 patch!
>>>>>>
>>>>>> Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
>>>>>> just send the stats only when ShutdownRequestPending is true in the walwriter
>>>>>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>>>>>> If we do this, we cannot send the stats when walwriter throws FATAL error.
>>>>>> But that's ok because FATAL error on walwriter causes the server to crash.
>>>>>> Thought?
>>>>>
>>>>> Thanks for your comments!
>>>>> Yes, I agree.
>>>>>
>>>>>
>>>>>> Also ISTM that we don't need to use the callback for that purpose in
>>>>>> checkpointer because of the same reason. That is, we can send the stats
>>>>>> just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
>>>>>> Thought?
>>>>>
>>>>> Yes, I think so too.
>>>>>
>>>>> Since ShutdownXLOG() may create restartpoint or checkpoint,
>>>>> it might generate WAL records.
>>>>>
>>>>>
>>>>>> I'm now not sure how much useful these changes are. As far as I read pgstat.c,
>>>>>> when shutdown is requested, the stats collector seems to exit even when
>>>>>> there are outstanding stats messages. So if checkpointer and walwriter send
>>>>>> the stats in their last cycles, those stats might not be collected.
>>>>>>
>>>>>> On the other hand, I can think that sending the stats in the last cycles would
>>>>>> improve the situation a bit than now. So I'm inclined to apply those changes...
>>>>>
>>>>> I didn't notice that. I agree this is an important aspect.
>>>>> I understood there is a case that the stats collector exits before the checkpointer
>>>>> or the walwriter exits and some stats might not be collected.
>>>>
>>>> IIUC the stats collector basically exits after checkpointer and walwriter exit.
>>>> But there seems no guarantee that the stats collector processes
>>>> all the messages that other processes have sent during the shutdown of
>>>> the server.
>>>
>>> Thanks, I understood the above postmaster behaviors.
>>>
>>> PMState manages the status and after checkpointer is exited, the postmaster sends
>>> SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
>>> (IIUC, although the postmaster kill the walsender, the archiver and
>>> the stats collector at the same time, it's ok because the walsender
>>> and the archiver doesn't send stats to the stats collector now.)
>>>
>>> But, there might be a corner case to lose stats sent by background workers like
>>> the checkpointer before they exit (although this is not implemented yet.)
>>>
>>> For example,
>>>
>>> 1. checkpointer send the stats before it exit
>>> 2. stats collector receive the signal and break before processing
>>>     the stats message from checkpointer. In this case, 1's message is lost.
>>> 3. stats collector writes the stats in the statsfiles and exit
>>>
>>> Why don't you recheck the coming message is zero just before the 2th procedure?
>>> (v17-0004-guarantee-to-collect-last-stats-messages.patch)
>>
>> Yes, I was thinking the same. This is the straight-forward fix for this issue.
>> The stats collector should process all the outstanding messages when
>> normal shutdown is requested, as the patch does. On the other hand,
>> if immediate shutdown is requested or emergency bailout (by postmaster death)
>> is requested, maybe the stats collector should skip those processings
>> and exit immediately.
>>
>> But if we implement that, we would need to teach the stats collector
>> the shutdown type (i.e., normal shutdown or immediate one). Because
>> currently SIGQUIT is sent to the collector whichever shutdown is requested,
>> and so the collector cannot distinguish the shutdown type. I'm afraid that
>> change is a bit overkill for now.
>>
>> BTW, I found that the collector calls pgstat_write_statsfiles() even at
>> emergency bailout case, before exiting. It's not necessary to save
>> the stats to the file in that case because subsequent server startup does
>> crash recovery and clears that stats file. So it's better to make
>> the collector exit immediately without calling pgstat_write_statsfiles()
>> at emergency bailout case? Probably this should be discussed in other
>> thread because it's different topic from the feature we're discussing here,
>> though.
>
> IIUC, only the stats collector has another hander for SIGQUIT although
> other background processes have a common hander for it and just call _exit(2).
> I thought to guarantee when TerminateChildren(SIGTERM) is invoked, don't make stats
> collector shutdown before other background processes are shutdown.
>
> I will make another thread to discuss that the stats collector should know the shutdown type or not.
> If it should be, it's better to make the stats collector exit as soon as possible if the shutdown type
> is an immediate, and avoid losing the remaining stats if it's normal.

+1

>
>
>
>>> I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
>>> I don't have any strong opinion to handle this case since this result shows to receive and processes
>>> the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
>>> signal in 5msec(099->104) after the checkpointer process exits.
>>
>> Agreed.
>>
>>>
>>> ```
>>> 1615421204.556 [checkpointer] DEBUG:  received shutdown request signal
>>> 1615421208.099 [checkpointer] DEBUG:  proc_exit(-1): 0 callbacks to make              # exit and send the messages
>>> 1615421208.099 [stats collector] DEBUG:  process BGWRITER stats message              # receive and process the messages
>>> 1615421208.099 [stats collector] DEBUG:  process WAL stats message
>>> 1615421208.104 [postmaster] DEBUG:  reaping dead processes
>>> 1615421208.104 [stats collector] DEBUG:  received shutdown request signal             # receive shutdown request from the postmaster
>>> ```
>>>
>>>>>> Of course, there is another direction; we can improve the stats collector so
>>>>>> that it guarantees to collect all the sent stats messages. But I'm afraid
>>>>>> this change might be big.
>>>>>
>>>>> For example, implement to manage background process status in shared memory and
>>>>> the stats collector collects the stats until another background process exits?
>>>>>
>>>>> In my understanding, the statistics are not required high accuracy,
>>>>> it's ok to ignore them if the impact is not big.
>>>>>
>>>>> If we guarantee high accuracy, another background process like autovacuum launcher
>>>>> must send the WAL stats because it accesses the system catalog and might generate
>>>>> WAL records due to HOT update even though the possibility is low.
>>>>>
>>>>> I thought the impact is small because the time uncollected stats are generated is
>>>>> short compared to the time from startup. So, it's ok to ignore the remaining stats
>>>>> when the process exists.
>>>>
>>>> I agree that it's not worth changing lots of code to collect such stats.
>>>> But if we can implement that very simply, isn't it more worth doing
>>>> that than current situation because we may be able to collect more
>>>> accurate stats.
>>>
>>> Yes, I agree.
>>> I attached the patch to send the stats before the wal writer and the checkpointer exit.
>>> (v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)
>>
>> Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.
>
> Thanks for your comments and for making patches.
>
>
>> Here is the review comments for 0001 patch.
>>
>> +/* Prototypes for private functions */
>> +static void HandleWalWriterInterrupts(void);
>>
>> HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
>> So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
>> we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
>> if ShutdownRequestPending is true in the main loop. Attached is the patch
>> I implemented that way. Thought?
>
> I thought there is a corner case that can't send the stats like

You're right! So IMO your patch (v17-0001-send-stats-for-walwriter-when-shutdown.patch) is better.

>
> ```
> // First, ShutdownRequstPending = false
>
>     if (ShutdownRequestPending)    // don't send the stats
>         pgstat_send_wal(true);
>
> // receive signal and ShutdownRequestPending became true
>
>     HandleMainLoopInterrupts();   // proc exit without sending the stats
>
> ```
>
> Is it ok because it almost never occurs?
>
>
>> Here is the review comments for 0002 patch.
>>
>> +static void pgstat_send_checkpointer(void);
>>
>> I'm inclined to avoid adding the function with the prefix "pgstat_" outside
>> pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
>> pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.
>
> Thanks. I agree.

Thanks for the review!

So, barring any objection, I will commit the changes for
walwriter and checkpointer. That is,

v17-0001-send-stats-for-walwriter-when-shutdown.patch
v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hywel Carver 2021-03-11 14:39:55 Re: Self-join optimisation
Previous Message Matthias van de Meent 2021-03-11 14:32:16 Re: Self-join optimisation