Re: Lowering the ever-growing heap->pd_lower

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Lowering the ever-growing heap->pd_lower
Date: 2021-03-10 14:01:05
Message-ID: CAEze2WhGJcfE3a3uLJaPb8+=eNKx825iVAAsQD6LuF6ibOrkGg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 9 Mar 2021 at 22:35, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> writes:
> > The only two existing mechanisms that I could find (in the access/heap
> > directory) that possibly could fail on shrunken line pointer arrays;
> > being xlog recovery (I do not have enough knowledge on recovery to
> > determine if that may touch pages that have shrunken line pointer
> > arrays, or if those situations won't exist due to never using dirtied
> > pages in recovery) and backwards table scans on non-MVCC snapshots
> > (which would be fixed in the attached patch).
>
> I think you're not visualizing the problem properly.
>
> The case I was concerned about back when is that there are various bits of
> code that may visit a page with a predetermined TID in mind to look at.
> An index lookup is an obvious example, and another one is chasing an
> update chain's t_ctid link. You might argue that if the tuple was dead
> enough to be removed, there should be no such in-flight references to
> worry about, but I think such an assumption is unsafe. There is not, for
> example, any interlock that ensures that a process that has obtained a TID
> from an index will have completed its heap visit before a VACUUM that
> subsequently removed that index entry also removes the heap tuple.

I am aware of this problem. I will admit that I cannot guarantee I have
detected all potentially problematic accesses, so I'll show you my work.

> So, to accept a patch that shortens the line pointer array, what we need
> to do is verify that every such code path checks for an out-of-range
> offset before trying to fetch the target line pointer. I believed
> back in 2007 that there were, or once had been, code paths that omitted
> such a range check, assuming that they could trust the TID they had
> gotten from $wherever to point at an extant line pointer array entry.
> Maybe things are all good now, but I think you should run around and
> examine every place that checks for tuple deadness to see if the offset
> it used is known to be within the current page bounds.

In my search for problematic accesses, I made the following assumptions:
* PageRepairFragmentation, as found in bufpage.c, is only applied to
heap pages; this function is not used for other page types in core
PostgreSQL. So, any problems that occur are due to access with an
OffsetNumber > PageGetMaxOffsetNumber (the guard pattern I looked for
is sketched below).
* Items [on heap pages] are only extracted after using PageGetItemId
for that item on the page, whilst holding a lock.
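
For reference, the kind of guard I was checking for at each call site
is roughly the following (a minimal sketch for illustration only; the
helper name is mine and the real call sites inline this logic):

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Illustration only: fetch an ItemId for an externally supplied offset
 * (e.g. one taken from an index TID) only after verifying that it is
 * still covered by the page's possibly-truncated line pointer array.
 */
static ItemId
fetch_itemid_checked(Page page, OffsetNumber offnum)
{
	if (offnum < FirstOffsetNumber ||
		offnum > PageGetMaxOffsetNumber(page))
		return NULL;			/* offset is beyond the current array */

	return PageGetItemId(page, offnum);
}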

Under those assumptions, I ran "grep PageGetItemId" over the src
directory. For all 227 results (as of 68b34b23) I checked whether the
page being accessed (or the item accessed through it) was a heap page
or heap tuple. After analyzing the relevant references, I arrived at
the following results (full report attached; for each call it contains
a line with the file, line number, and code of the call site, followed
by a line with the usage type):

Count of usage type - usage type
4 - Not a call (comment)
7 - Call site guarantees bounds
8 - Has ItemIdIsNormal assertion (asserts the item is not removed;
i.e. a concurrent vacuum cannot have removed this item)
39 - Has bounds checks
6 - Not a heap page (brin)
1 - Not a heap page (generic index)
24 - Not a heap page (gin)
21 - Not a heap page (gist)
14 - Not a heap page (hash)
60 - Not a heap page (nbtree)
1 - Not a heap page (sequence)
36 - Not a heap page (spgist)
2 - OffsetNumber is generated by PageAddItem
2 - Problem case 1 (table scan)
1 - Problem case 2 (xlog)
1 - Problem case 3 (invalid redirect pointers)

The 3 problem cases were classified based on the origin of the
potentially invalid pointer.

Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup

The table scan maintains state containing a page-bound OffsetNumber,
which it uses as a cursor whilst working through the pages of the
relation. In forward scans, the bounds of the page are re-validated at
the start of the 'while (linesleft > 0)' loop at line 681, but for
backward scans this check is insufficient, because it incorrectly
assumes that the last returned OffsetNumber is guaranteed to still
exist.

For MVCC snapshots this holds (the previously returned tuple must have
been visible in the snapshot, and therefore cannot have been vacuumed
while the snapshot is still alive), but a non-MVCC snapshot may have
returned a dead tuple, which can since have been vacuumed and truncated
away.

The first occurrence of this issue is easily fixed with the changes as
submitted in patch v2.
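
For illustration, the idea behind that fix is roughly the following
(a sketch only, not the actual patch hunk; the helper name is mine):

/*
 * Before stepping backwards from the previously returned tuple, clamp
 * the remembered offset to the page's current line pointer array
 * bounds, since a non-MVCC scan may have returned a tuple that VACUUM
 * has since removed, truncating the array.
 */
static OffsetNumber
clamp_backward_scan_offset(Page page, OffsetNumber prev_returned)
{
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	OffsetNumber lineoff = OffsetNumberPrev(prev_returned);

	return (lineoff > maxoff) ? maxoff : lineoff;
}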

The second problem case (on line 811) is for forward scans, where the
line pointer array could have been truncated to zero length. As the
code uses a hardcoded offset of FirstOffsetNumber (= 1), that offset
might point into arbitrary data. Reading that arbitrary data is
prevented by the 'while (linesleft > 0)' check, because at that point
linesleft equals PageGetMaxOffsetNumber, which would be 0.
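
In simplified form (paraphrasing the structure of heapgettup, with the
variable names slightly abbreviated), the relevant code looks like
this, which is why the stale pointer is never dereferenced:

lineoff = FirstOffsetNumber;				/* hardcoded start of page */
linesleft = PageGetMaxOffsetNumber(page);	/* 0 if the array is empty */
lpp = PageGetItemId(page, lineoff);			/* address only, no read */

while (linesleft > 0)
{
	/* ... inspect *lpp, advance lineoff/lpp, decrement linesleft ... */
}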

Problem case 2: xlog; heapam.c line 8796, in heap_xlog_freeze_page

This is in the replay of transaction logs, in heap_xlog_freeze_page.
As I am unsure whether pages to which these WAL records are applied can
differ from the state on the xlog-generating instance, I flagged this
as a potential problem. Applying the records is probably safe (it
assumes the existence of a HeapTupleHeader for that ItemId), but my
limited knowledge of recovery put this on the 'potential problem' list.

Please advise on this; I could not find the right documentation.
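
Purely for illustration, a defensive variant of the lookup could verify
the offset before fetching the tuple (the offsets[] array name is made
up here, and I am not proposing this as the actual fix):

OffsetNumber offnum = offsets[i];	/* offset taken from the WAL record */
ItemId		lp;

if (offnum > PageGetMaxOffsetNumber(page))
	elog(ERROR, "freeze offset %u is beyond the page bounds", offnum);

lp = PageGetItemId(page, offnum);

if (!ItemIdIsNormal(lp))
	elog(ERROR, "freeze offset %u does not point to a normal tuple", offnum);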

Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
heap_get_root_tuples

The heap pruning mechanism currently assumes that all redirects are
valid. If a HOT root points to an item that is not ItemIdIsNormal, it
breaks out of the loop, but doesn't actually fail. This code is already
potentially problematic because it has no bounds or sanity checking for
nextoffnum, but with a shrinking line pointer array (pd_linp) it would
also gain the failure mode of HOT redirects pointing into what is now
arbitrary data.

This failure mode is now accompanied by an Assert, which fails when the
redirect points to an invalid OffsetNumber. That does not prevent
arbitrary behaviour when accessing corrupted data in non-assert builds,
but I think that is acceptable; we already have no guaranteed behaviour
for this failure mode.
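
Paraphrased (this is the shape of the check, not the literal patch
hunk), the assertion amounts to:

if (ItemIdIsRedirected(lp))
{
	OffsetNumber nextoffnum = ItemIdGetRedirect(lp);

	/* a redirect must stay within the (possibly truncated) array */
	Assert(nextoffnum >= FirstOffsetNumber &&
		   nextoffnum <= PageGetMaxOffsetNumber(page));

	lp = PageGetItemId(page, nextoffnum);
}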

I have also searched the contrib modules using the same method, and all
18 usages appear to be validated correctly.

With regards,

Matthias van de Meent.

Attachment Content-Type Size
heap_page_itemid_analysis.txt text/plain 26.4 KB
v3-0001-Truncate-a-pages-line-pointer-array-when-it-has-t.patch application/x-patch 3.7 KB
