Review for GetWALAvailability()

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Review for GetWALAvailability()
Date: 2020-06-12 16:38:49
Message-ID: 8d55969f-ba37-b89a-6494-e9322ccdb35d@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

The document explains that "lost" value that
pg_replication_slots.wal_status reports means

some WAL files are definitely lost and this slot cannot be used to resume replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.

wal_status is calculated in GetWALAvailability(), and probably I found
some issues in it.

keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
wal_segment_size) + 1;

max_wal_size_mb is the number of megabytes. wal_keep_segments is
the number of WAL segment files. So it's strange to calculate max of them.
The above should be the following?

Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size), wal_keep_segments) + 1

if ((max_slot_wal_keep_size_mb <= 0 ||
max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
oldestSegMaxWalSize <= targetSeg)
return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains "normal
means that the claimed files are within max_wal_size". So whatever
max_slot_wal_keep_size value is, IMO that "normal" should be
reported if the WAL files claimed by the slot are within max_wal_size.
Thought?

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.

If the WAL files claimed by the slot exceeds max_slot_wal_keep_size
but any those WAL files have not been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
any claimed files are removed, I think. Thought?

Or this behavior is expected and the document is incorrect?

BTW, if we want to implement GetWALAvailability() as the document
advertises, we can simplify it like the attached POC patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment Content-Type Size
simple_getwalavailability_wip.patch text/plain 2.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kenichiro Tanaka 2020-06-12 16:40:42 Re: Wrong width of UNION statement
Previous Message Emre Hasegeli 2020-06-12 15:34:15 Re: exp() versus the POSIX standard