Re: About to add WAL write/fsync statistics to pg_stat_wal view

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject: Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date: 2021-03-05 10:54:23
Message-ID: 718082d08f9971608bc0c63d2f8a79bb@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021-03-05 12:47, Fujii Masao wrote:
> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>> On 2021-03-05 01:02, Fujii Masao wrote:
>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>
>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>> <function>XLogFlush</function> request (see ...).  This is
>>>>>>>>>> also
>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>
>>>>>>>>>> ("which normally called" should be "which is normally called"
>>>>>>>>>> or
>>>>>>>>>> "which normally is called" if you want to keep true to the
>>>>>>>>>> original)
>>>>>>>>>> You missed the adding the space before an opening parenthesis
>>>>>>>>>> here and
>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>
>>>>>>>>>> is ether -> is either
>>>>>>>>>> "This parameter is off by default as it will repeatedly query
>>>>>>>>>> the
>>>>>>>>>> operating system..."
>>>>>>>>>> ", because" -> "as"
>>>>>>>>>
>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>
>>>>>>>>>> wal_write_time and the sync items also need the note: "This is
>>>>>>>>>> also
>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>
>>>>>>>>> I skipped changing it since I separated the stats for the WAL
>>>>>>>>> receiver
>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>
>>>>>>>>>> "The number of times it happened..." -> " (the tally of this
>>>>>>>>>> event is
>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable
>>>>>>>>>> because ..."
>>>>>>>>>
>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>
>>>>>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>>>>>> explicitly
>>>>>>>>>> computing the sync statistics but does require computing the
>>>>>>>>>> write
>>>>>>>>>> statistics.  This is because of the presence of
>>>>>>>>>> issue_xlog_fsync but
>>>>>>>>>> absence of an equivalent pg_xlog_pwrite.  Additionally, I
>>>>>>>>>> observe that
>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the
>>>>>>>>>> WAL
>>>>>>>>>> receiver path does not.  It seems technically straight-forward
>>>>>>>>>> to
>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two
>>>>>>>>>> places,
>>>>>>>>>> though I suspect there may be a trade-off for not adding
>>>>>>>>>> another
>>>>>>>>>> function call to the stack given the importance of WAL
>>>>>>>>>> processing
>>>>>>>>>> (though that seems marginalized compared to the cost of
>>>>>>>>>> actually
>>>>>>>>>> writing the WAL).  Or, as Fujii noted, go the other way and
>>>>>>>>>> don't have
>>>>>>>>>> any shared code between the two but instead implement the WAL
>>>>>>>>>> receiver
>>>>>>>>>> one to use pg_stat_wal_receiver instead.  In either case, this
>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>
>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>
>>>>>>>> Thanks for updating the patches!
>>>>>>>>
>>>>>>>>
>>>>>>>>> I added the infrastructure code to communicate the WAL receiver
>>>>>>>>> stats messages between the WAL receiver and the stats
>>>>>>>>> collector, and
>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> On second thought, this idea seems not good. Because those stats
>>>>>>>> are
>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process
>>>>>>>> running
>>>>>>>> at that moment. IOW, it seems strange that some values show
>>>>>>>> dynamic
>>>>>>>> stats and the others show collected stats, even though they are
>>>>>>>> in
>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>
>>>>>>> OK, I fixed it.
>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal
>>>>>>> view in v11 patch.
>>>>>>
>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>
>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>> +    if (enableFsync &&
>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>> +    {
>>>>>>
>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>> to simplify the code more?
>>>>>
>>>>> Thanks for the comments.
>>>>> I added the above code in v12 patch.
>>>>>
>>>>>>
>>>>>> +        /*
>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed
>>>>>> to minimize
>>>>>> +         * the overhead in WAL-writing.
>>>>>> +         */
>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>> +            pgstat_send_wal();
>>>>>>
>>>>>> On second thought, this change means that it always takes
>>>>>> wal_writer_delay
>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush()
>>>>>> is called.
>>>>>> For example, if wal_writer_delay is set to several seconds, some
>>>>>> values in
>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those
>>>>>> seconds.
>>>>>> So I'm thinking to withdraw my previous comment and it's ok to
>>>>>> send
>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>
>>>>> Thanks, I didn't notice that.
>>>>>
>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>> default value is 200msec and it may be set shorter time.
>>>
>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>
>>>>>
>>>>> Why don't to make another way to check the timestamp?
>>>>>
>>>>> +               /*
>>>>> +                * Don't send a message unless it's been at least
>>>>> PGSTAT_STAT_INTERVAL
>>>>> +                * msec since we last sent one
>>>>> +                */
>>>>> +               now = GetCurrentTimestamp();
>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>> PGSTAT_STAT_INTERVAL))
>>>>> +               {
>>>>> +                       pgstat_send_wal();
>>>>> +                       last_report = now;
>>>>> +               }
>>>>> +
>>>>>
>>>>> Although I worried that it's better to add the check code in
>>>>> pgstat_send_wal(),
>>>
>>> Agreed.
>>>
>>>>> I didn't do so because to avoid to double check
>>>>> PGSTAT_STAT_INTERVAL.
>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks
>>>>> the
>>>>> PGSTAT_STAT_INTERVAL.
>>>
>>> I think that we can do that. What about the attached patch?
>>
>> Thanks, I thought it's better.
>>
>>
>>>> I forgot to remove an unused variable.
>>>> The attached v13 patch is fixed.
>>>
>>> Thanks for updating the patch!
>>>
>>> +        w.wal_write,
>>> +        w.wal_write_time,
>>> +        w.wal_sync,
>>> +        w.wal_sync_time,
>>>
>>> It's more natural to put wal_write_time and wal_sync_time next to
>>> each other? That is, what about the following order of columns?
>>>
>>> wal_write
>>> wal_sync
>>> wal_write_time
>>> wal_sync_time
>>
>> Yes, I fixed it.
>>
>>> -        case SYNC_METHOD_OPEN:
>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>> -            /* write synced it already */
>>> -            break;
>>>
>>> IMO it's better to add Assert(false) here to ensure that we never
>>> reach
>>> here, as follows. Thought?
>>>
>>> +        case SYNC_METHOD_OPEN:
>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>> +            /* not reachable */
>>> +            Assert(false);
>>
>> I agree.
>>
>>
>>> Even when a backend exits, it sends the stats via
>>> pgstat_beshutdown_hook().
>>> On the other hand, walwriter doesn't do that. Walwriter also should
>>> send
>>> the stats even at its exit? Otherwise some stats can fail to be
>>> collected.
>>> But ISTM that this issue existed from before, for example
>>> checkpointer
>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to
>>> fix
>>> this issue in this patch?
>>
>> Thanks, I thought it's better to do so.
>> I added the shutdown hook for the walwriter and the checkpointer in
>> v14-0003 patch.
>
> Thanks!
>
> Seems you forgot to include the changes of expected/rules.out in 0001
> patch,
> and which caused the regression test to fail. Attached is the updated
> version
> of the patch. I included expected/rules.out in it.

Sorry.

> + PgStat_Counter m_wal_write_time; /* time spend writing wal records in
> + * micro seconds */
> + PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in
> micro
> + * seconds */
>
> IMO "spend" should be "spent". Also "micro seconds" should be
> "microseconds"
> in sake of consistent with other comments in pgstat.h. I fixed them.

Thanks.

> Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug.
> Even
> when pgstat_send_wal() returned without sending any message,
> pgstat_report_wal() saved current pgWalUsage and that counter was used
> for
> the subsequent calculation of WAL usage. This caused some counters not
> to
> be sent to the collector. This is a bug that I added. I fixed this bug.

Thanks.

> + walStats.wal_write += msg->m_wal_write;
> + walStats.wal_write_time += msg->m_wal_write_time;
> + walStats.wal_sync += msg->m_wal_sync;
> + walStats.wal_sync_time += msg->m_wal_sync_time;
>
> I changed the order of the above in pgstat.c so that wal_write_time and
> wal_sync_time are placed in next to each other.

I forgot to fix them, thanks.

> The followings are the comments for the docs part. I've not updated
> this
> in the patch yet because I'm not sure how to change them for now.
> + Number of times WAL buffers were written out to disk via
> + <function>XLogWrite</function>, which is invoked during an
> + <function>XLogFlush</function> request (see <xref
> linkend="wal-configuration"/>)
> + </para></entry>
>
> XLogWrite() can be invoked during the functions other than XLogFlush().
> For example, XLogBackgroundFlush(). So the above description might be
> confusing?
>
> + Number of times WAL files were synced to disk via
> + <function>issue_xlog_fsync</function>, which is invoked during
> an
> + <function>XLogFlush</function> request (see <xref
> linkend="wal-configuration"/>)
>
> Same as above.

Yes, why don't you remove "XLogFlush" in the above comments
because XLogWrite() description is covered in wal.sgml?

But, now it's mentioned only for backend,
I added the comments for the wal writer in the attached patch.

> + while <xref linkend="guc-wal-sync-method"/> was set to one of
> the
> + "sync at commit" options (i.e., <literal>fdatasync</literal>,
> + <literal>fsync</literal>, or
> <literal>fsync_writethrough</literal>).
>
> Even open_sync and open_datasync do the sync at commit. No? I'm not
> sure
> if "sync at commit" is right term to indicate fdatasync, fsync and
> fsync_writethrough.

Yes, why don't you change to the following comments?

```
while <xref linkend="guc-wal-sync-method"/> was set to one of the
options which specific fsync method is called (i.e.,
<literal>fdatasync</literal>,
<literal>fsync</literal>, or
<literal>fsync_writethrough</literal>)
```

> + <literal>open_sync</literal>. Units are in milliseconds with
> microsecond resolution.
>
> "with microsecond resolution" part is really necessary?

I removed it because blk_read_time in pg_stat_database is the same
above,
but it doesn't mention it.

> + transaction records are flushed to permanent storage.
> + <function>XLogFlush</function> calls <function>XLogWrite</function>
> to write
> + and <function>issue_xlog_fsync</function> to flush them, which are
> counted as
> + <literal>wal_write</literal> and <literal>wal_sync</literal> in
> + <xref linkend="pg-stat-wal-view"/>. On systems with high log
> output,
>
> This description might cause users to misread that XLogFlush() calls
> issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
> ISTM that this description needs to be updated.

I understood. I fixed to mention that XLogWrite()
calls issue_xlog_fsync().

> Each line in the above seems to end with a space character.
> This space character should be removed.

Sorry for that. I removed it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment Content-Type Size
v15-0001-Add-statistics-related-to-write-sync-wal-records.patch text/x-diff 22.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ibrar Ahmed 2021-03-05 11:37:30 Re: POC: GROUP BY optimization
Previous Message Joel Jacobson 2021-03-05 10:37:35 Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]