Re: About to add WAL write/fsync statistics to pg_stat_wal view

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject: Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date: 2021-03-08 10:42:37
Message-ID: 18725ca54edbf6af2c0e500055a66fd9@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021-03-08 13:44, Fujii Masao wrote:
> On 2021/03/05 19:54, Masahiro Ikeda wrote:
>> On 2021-03-05 12:47, Fujii Masao wrote:
>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>
>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during
>>>>>>>>>>>> an
>>>>>>>>>>>> <function>XLogFlush</function> request (see ...).  This is
>>>>>>>>>>>> also
>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>
>>>>>>>>>>>> ("which normally called" should be "which is normally
>>>>>>>>>>>> called" or
>>>>>>>>>>>> "which normally is called" if you want to keep true to the
>>>>>>>>>>>> original)
>>>>>>>>>>>> You missed the adding the space before an opening
>>>>>>>>>>>> parenthesis here and
>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>
>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>> "This parameter is off by default as it will repeatedly
>>>>>>>>>>>> query the
>>>>>>>>>>>> operating system..."
>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>
>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This
>>>>>>>>>>>> is also
>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>
>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL
>>>>>>>>>>> receiver
>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>
>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this
>>>>>>>>>>>> event is
>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable
>>>>>>>>>>>> because ..."
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>
>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>>>>>>>> explicitly
>>>>>>>>>>>> computing the sync statistics but does require computing the
>>>>>>>>>>>> write
>>>>>>>>>>>> statistics.  This is because of the presence of
>>>>>>>>>>>> issue_xlog_fsync but
>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite.  Additionally, I
>>>>>>>>>>>> observe that
>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while
>>>>>>>>>>>> the WAL
>>>>>>>>>>>> receiver path does not.  It seems technically
>>>>>>>>>>>> straight-forward to
>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the
>>>>>>>>>>>> two places,
>>>>>>>>>>>> though I suspect there may be a trade-off for not adding
>>>>>>>>>>>> another
>>>>>>>>>>>> function call to the stack given the importance of WAL
>>>>>>>>>>>> processing
>>>>>>>>>>>> (though that seems marginalized compared to the cost of
>>>>>>>>>>>> actually
>>>>>>>>>>>> writing the WAL).  Or, as Fujii noted, go the other way and
>>>>>>>>>>>> don't have
>>>>>>>>>>>> any shared code between the two but instead implement the
>>>>>>>>>>>> WAL receiver
>>>>>>>>>>>> one to use pg_stat_wal_receiver instead.  In either case,
>>>>>>>>>>>> this
>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>
>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver
>>>>>>>>>>> stats.
>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>
>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I added the infrastructure code to communicate the WAL
>>>>>>>>>>> receiver stats messages between the WAL receiver and the
>>>>>>>>>>> stats collector, and
>>>>>>>>>>> the stats for WAL receiver is counted in
>>>>>>>>>>> pg_stat_wal_receiver.
>>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> On second thought, this idea seems not good. Because those
>>>>>>>>>> stats are
>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver
>>>>>>>>>> process running
>>>>>>>>>> at that moment. IOW, it seems strange that some values show
>>>>>>>>>> dynamic
>>>>>>>>>> stats and the others show collected stats, even though they
>>>>>>>>>> are in
>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>
>>>>>>>>> OK, I fixed it.
>>>>>>>>> The stats collected in the WAL receiver is exposed in
>>>>>>>>> pg_stat_wal view in v11 patch.
>>>>>>>>
>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>
>>>>>>>> +    /* Check whether the WAL file was synced to disk right now
>>>>>>>> */
>>>>>>>> +    if (enableFsync &&
>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>> +    {
>>>>>>>>
>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>> if enableFsync is off, sync_method is open_sync or
>>>>>>>> open_data_sync,
>>>>>>>> to simplify the code more?
>>>>>>>
>>>>>>> Thanks for the comments.
>>>>>>> I added the above code in v12 patch.
>>>>>>>
>>>>>>>>
>>>>>>>> +        /*
>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has
>>>>>>>> elapsed to minimize
>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>> +         */
>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>> +            pgstat_send_wal();
>>>>>>>>
>>>>>>>> On second thought, this change means that it always takes
>>>>>>>> wal_writer_delay
>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush()
>>>>>>>> is called.
>>>>>>>> For example, if wal_writer_delay is set to several seconds, some
>>>>>>>> values in
>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those
>>>>>>>> seconds.
>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to
>>>>>>>> send
>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>
>>>>>>> Thanks, I didn't notice that.
>>>>>>>
>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>
>>>>> Yeah, if wal_writer_delay is set to very small value, there is a
>>>>> risk
>>>>> that the WAL stats are sent too frequently. I agree that's a
>>>>> problem.
>>>>>
>>>>>>>
>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>
>>>>>>> +               /*
>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>> +                * msec since we last sent one
>>>>>>> +                */
>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>> +               {
>>>>>>> +                       pgstat_send_wal();
>>>>>>> +                       last_report = now;
>>>>>>> +               }
>>>>>>> +
>>>>>>>
>>>>>>> Although I worried that it's better to add the check code in
>>>>>>> pgstat_send_wal(),
>>>>>
>>>>> Agreed.
>>>>>
>>>>>>> I didn't do so because to avoid to double check
>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already
>>>>>>> checks the
>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>
>>>>> I think that we can do that. What about the attached patch?
>>>>
>>>> Thanks, I thought it's better.
>>>>
>>>>
>>>>>> I forgot to remove an unused variable.
>>>>>> The attached v13 patch is fixed.
>>>>>
>>>>> Thanks for updating the patch!
>>>>>
>>>>> +        w.wal_write,
>>>>> +        w.wal_write_time,
>>>>> +        w.wal_sync,
>>>>> +        w.wal_sync_time,
>>>>>
>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>> each other? That is, what about the following order of columns?
>>>>>
>>>>> wal_write
>>>>> wal_sync
>>>>> wal_write_time
>>>>> wal_sync_time
>>>>
>>>> Yes, I fixed it.
>>>>
>>>>> -        case SYNC_METHOD_OPEN:
>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>> -            /* write synced it already */
>>>>> -            break;
>>>>>
>>>>> IMO it's better to add Assert(false) here to ensure that we never
>>>>> reach
>>>>> here, as follows. Thought?
>>>>>
>>>>> +        case SYNC_METHOD_OPEN:
>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>> +            /* not reachable */
>>>>> +            Assert(false);
>>>>
>>>> I agree.
>>>>
>>>>
>>>>> Even when a backend exits, it sends the stats via
>>>>> pgstat_beshutdown_hook().
>>>>> On the other hand, walwriter doesn't do that. Walwriter also should
>>>>> send
>>>>> the stats even at its exit? Otherwise some stats can fail to be
>>>>> collected.
>>>>> But ISTM that this issue existed from before, for example
>>>>> checkpointer
>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill
>>>>> to fix
>>>>> this issue in this patch?
>>>>
>>>> Thanks, I thought it's better to do so.
>>>> I added the shutdown hook for the walwriter and the checkpointer in
>>>> v14-0003 patch.
>>>
>>> Thanks!
>>>
>>> Seems you forgot to include the changes of expected/rules.out in 0001
>>> patch,
>>> and which caused the regression test to fail. Attached is the updated
>>> version
>>> of the patch. I included expected/rules.out in it.
>>
>> Sorry.
>>
>>> +    PgStat_Counter m_wal_write_time;    /* time spend writing wal
>>> records in
>>> +                                         * micro seconds */
>>> +    PgStat_Counter m_wal_sync_time; /* time spend syncing wal
>>> records in micro
>>> +                                     * seconds */
>>>
>>> IMO "spend" should be "spent". Also "micro seconds" should be
>>> "microseconds"
>>> in sake of consistent with other comments in pgstat.h. I fixed them.
>>
>> Thanks.
>>
>>> Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug.
>>> Even
>>> when pgstat_send_wal() returned without sending any message,
>>> pgstat_report_wal() saved current pgWalUsage and that counter was
>>> used for
>>> the subsequent calculation of WAL usage. This caused some counters
>>> not to
>>> be sent to the collector. This is a bug that I added. I fixed this
>>> bug.
>>
>> Thanks.
>>
>>
>>> +    walStats.wal_write += msg->m_wal_write;
>>> +    walStats.wal_write_time += msg->m_wal_write_time;
>>> +    walStats.wal_sync += msg->m_wal_sync;
>>> +    walStats.wal_sync_time += msg->m_wal_sync_time;
>>>
>>> I changed the order of the above in pgstat.c so that wal_write_time
>>> and
>>> wal_sync_time are placed in next to each other.
>>
>> I forgot to fix them, thanks.
>>
>>
>>> The followings are the comments for the docs part. I've not updated
>>> this
>>> in the patch yet because I'm not sure how to change them for now.
>>> +       Number of times WAL buffers were written out to disk via
>>> +       <function>XLogWrite</function>, which is invoked during an
>>> +       <function>XLogFlush</function> request (see <xref
>>> linkend="wal-configuration"/>)
>>> +      </para></entry>
>>>
>>> XLogWrite() can be invoked during the functions other than
>>> XLogFlush().
>>> For example, XLogBackgroundFlush(). So the above description might be
>>> confusing?
>>>
>>> +       Number of times WAL files were synced to disk via
>>> +       <function>issue_xlog_fsync</function>, which is invoked
>>> during an
>>> +       <function>XLogFlush</function> request (see <xref
>>> linkend="wal-configuration"/>)
>>>
>>> Same as above.
>>
>> Yes, why don't you remove "XLogFlush" in the above comments
>> because XLogWrite() description is covered in wal.sgml?
>>
>> But, now it's mentioned only for backend,
>> I added the comments for the wal writer in the attached patch.
>>
>>
>>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of
>>> the
>>> +       "sync at commit" options (i.e., <literal>fdatasync</literal>,
>>> +       <literal>fsync</literal>, or
>>> <literal>fsync_writethrough</literal>).
>>>
>>> Even open_sync and open_datasync do the sync at commit. No? I'm not
>>> sure
>>> if "sync at commit" is right term to indicate fdatasync, fsync and
>>> fsync_writethrough.
>>
>> Yes, why don't you change to the following comments?
>>
>> ```
>>        while <xref linkend="guc-wal-sync-method"/> was set to one of
>> the
>>        options which specific fsync method is called (i.e.,
>> <literal>fdatasync</literal>,
>>        <literal>fsync</literal>, or
>> <literal>fsync_writethrough</literal>)
>> ```
>>
>>> +       <literal>open_sync</literal>. Units are in milliseconds with
>>> microsecond resolution.
>>>
>>> "with microsecond resolution" part is really necessary?
>>
>> I removed it because blk_read_time in pg_stat_database is the same
>> above,
>> but it doesn't mention it.
>>
>>
>>> +   transaction records are flushed to permanent storage.
>>> +   <function>XLogFlush</function> calls
>>> <function>XLogWrite</function> to write
>>> +   and <function>issue_xlog_fsync</function> to flush them, which
>>> are
>>> counted as
>>> +   <literal>wal_write</literal> and <literal>wal_sync</literal> in
>>> +   <xref linkend="pg-stat-wal-view"/>. On systems with high log
>>> output,
>>>
>>> This description might cause users to misread that XLogFlush() calls
>>> issue_xlog_fsync(). Since issue_xlog_fsync() is called by
>>> XLogWrite(),
>>> ISTM that this description needs to be updated.
>>
>> I understood. I fixed to mention that XLogWrite()
>> calls issue_xlog_fsync().
>>
>>
>>> Each line in the above seems to end with a space character.
>>> This space character should be removed.
>>
>> Sorry for that. I removed it.
>
> Thanks for updating the patch! I think it's getting good shape!
> - pid | wait_event_type | wait_event
> + pid | wait_event_type | wait_event
>
> This change is not necessary?

No, sorry.
I removed it by mistake when I remove trailing space characters.

> - every <xref linkend="guc-wal-writer-delay"/> milliseconds.
> + every <xref linkend="guc-wal-writer-delay"/> milliseconds, which
> calls
> + <function>XLogWrite</function> to write and
> <function>XLogWrite</function>
> + <function>issue_xlog_fsync</function> to flush them. They are
> counted as
> + <literal>wal_write</literal> and <literal>wal_sync</literal> in
> + <xref linkend="pg-stat-wal-view"/>.
>
> Isn't it better to avoid using the terms like XLogWrite or
> issue_xlog_fsync
> before explaining what they are? They are explained later. At least for
> me
> I'm ok without this change.

OK. I removed them and add a new paragraph.

> - to write (move to kernel cache) a few filled <acronym>WAL</acronym>
> - buffers. This is undesirable because
> <function>XLogInsertRecord</function>
> + to call <function>XLogWrite</function> to write (move to kernel
> cache) a
> + few filled <acronym>WAL</acronym> buffers (the tally of this event
> is reported in
> + <literal>wal_buffers_full</literal> in <xref
> linkend="pg-stat-wal-view"/>).
> + This is undesirable because <function>XLogInsertRecord</function>
>
> This paragraph explains the relationshp between WAL writes and WAL
> buffers. I don't think it's good to add different context to this
> paragraph. Instead, what about adding new paragraph like the follwing?
>
> ----------------------------------
> When track_wal_io_timing is enabled, the total amounts of time
> XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are
> counted as wal_write_time and wal_sync_time in pg_stat_wal view,
> respectively. XLogWrite is normally called by XLogInsertRecord (when
> there is no space for the new record in WAL buffers), XLogFlush and
> the WAL writer, to write WAL buffers to disk and call
> issue_xlog_fsync. If wal_sync_method is either open_datasync or
> open_sync, a write operation in XLogWrite guarantees to sync written
> WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method
> is either fdatasync, fsync, or fsync_writethrough, the write operation
> moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files
> to disk. Regardless of the setting of track_wal_io_timing, the numbers
> of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk
> are also counted as wal_write and wal_sync in pg_stat_wal,
> respectively.
> ----------------------------------

Thanks, I agree it's better.

> + <function>issue_xlog_fsync</function> (see <xref
> linkend="wal-configuration"/>)
>
> "request" should be place just before "(see"?

Yes, thanks.

> + Number of times WAL files were synced to disk via
> + <function>issue_xlog_fsync</function> (see <xref
> linkend="wal-configuration"/>)
> + while <xref linkend="guc-wal-sync-method"/> was set to one of
> the
> + options which specific fsync method is called (i.e.,
> <literal>fdatasync</literal>,
> + <literal>fsync</literal>, or
> <literal>fsync_writethrough</literal>).
>
> Isn't it better to mention the case of fsync=off? What about the
> following?
>
> ----------------------------------
> Number of times WAL files were synced to disk via issue_xlog_fsync
> (see ...). This is zero when fsync is off or wal_sync_method is either
> open_datasync or open_sync.
> ----------------------------------

Yes.

> + Total amount of time spent writing WAL buffers were written
> out to disk via
>
> "were written out" is not necessary?

Yes, removed it.

> + Total amount of time spent syncing WAL files to disk via
> + <function>issue_xlog_fsync</function> request (see <xref
> linkend="wal-configuration"/>)
> + while <xref linkend="guc-wal-sync-method"/> was set to one of
> the
> + options which specific fsync method is called (i.e.,
> <literal>fdatasync</literal>,
> + <literal>fsync</literal>, or
> <literal>fsync_writethrough</literal>).
> + Units are in milliseconds.
> + This is zero when <xref linkend="guc-track-wal-io-timing"/> is
> disabled.
>
> Isn't it better to explain the case where this counter is zero a bit
> more clearly as follows?
>
> ---------------------
> This is zero when track_wal_io_timing is disabled, fsync is off, or
> wal_sync_method is either open_datasync or open_sync.
> ---------------------

Yes, thanks.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment Content-Type Size
v16-0001-Add-statistics-related-to-write-sync-wal-records.patch text/x-diff 20.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2021-03-08 10:49:49 Re: [HACKERS] logical decoding of two-phase transactions
Previous Message Fabien COELHO 2021-03-08 10:32:42 Re: Using COPY FREEZE in pgbench