Re: Inconsistency in determining the timestamp of the db statfile.

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Subject: Re: Inconsistency in determining the timestamp of the db statfile.
Date: 2020-09-08 13:11:01
Message-ID: 06a452e2-18f3-0210-d587-e4b159c361d1@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/09/08 19:28, Magnus Hagander wrote:
>
>
> On Tue, Sep 8, 2020 at 8:10 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com <mailto:amit(dot)kapila16(at)gmail(dot)com>> wrote:
>
> We use the timestamp of the global statfile if we are not able to
> determine it for a particular database either because the entry for
> that database doesn't exist or there is an error while reading the
> specific database entry. This was not taken care of while reading
> other entries like ArchiverStats or SLRUStats. See
> pgstat_read_db_statsfile_timestamp. I have observed this while
> reviewing Sawada-San's patch related to logical replication stats [1].
>
> I think this can only happen if due to some reason the statfile got
> corrupt or we
> have some bug in stats writing code, the chances of both are rare and even
> if that happens we will use stale statistics.
>
> The attached patch by Sawada-San will fix this issue. As the chances of this
> the problem is rare and nobody has reported any issue related to this,
> so it might be okay not to backpatch this.
>
> Thoughts?
>
>
> Why are we reading the archiver statis and and slru stats in "pgstat_read_db_statsfile_timestamp" in the first place?

Maybe because they are written before database stats entries? That is,
to read the target database stats entry and get the timestamp from it,
we need to read (or lseek) all the global stats entries written before them.

> That seems well out of scope for that function.
>
> If nothing else the comment at the top of that function is out of touch with reality. We do seem to read it into local buffers and then ignore the contents. I guess we're doing it just to verify that it's not corrupt -- so perhaps the function should actually have a different name than it does now, since it certainly seems to do more than actually read the statsfile timestamp.
>
> Specifically in this patch it looks wrong -- in the case of say the SLRU stats being corrupt, we will now return the timestamp of the global stats file even if there is one existing for the database requested, which definitely breaks the contract of the function.

Yes.
We should return false when fread() for database entry fails, instead? That is,

1. If corrupted stats file is found, the function always returns false.
2. If the file is not currupted and the target database entry is found, its timestamp is returned.
3. If the file is not corrupted and the target is NOT found, the timestamp of global entry is returned.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2020-09-08 13:20:08 Re: Transactions involving multiple postgres foreign servers, take 2
Previous Message Bharath Rupireddy 2020-09-08 12:36:36 Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks