Re: Reviewing freeze map code

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reviewing freeze map code
Date: 2016-06-10 17:59:11
Message-ID: 20160610175911.rbbjywcuygzohn56@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2016-06-09 23:39:24 -0700, Andres Freund wrote:
> On 2016-06-10 11:58:26 +0530, Amit Kapila wrote:
> > I have tried in multiple ways by running pgbench with read-write tests, but
> > could not see any such behaviour.
>
> It took over an hour of pgbench on a fast laptop till I saw it.
>
>
> > I have tried by even crashing and
> > restarting the server and then again running pgbench. Do you see these
> > records on master or slave?
>
> Master, but with an existing standby. So it could be related to
> hot_standby_feedback or such.

I just managed to trigger it again.

#1 0x00007fa1a73778da in __GI_abort () at abort.c:89
#2 0x00007f9f1395e59c in record_corrupt_item (items=items(at)entry=0x2137be0, tid=0x7f9fb8681c0c)
at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:612
#3 0x00007f9f1395ead5 in collect_corrupt_items (relid=relid(at)entry=29449, all_visible=all_visible(at)entry=0 '\000', all_frozen=all_frozen(at)entry=1 '\001')
at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:572
#4 0x00007f9f1395f476 in pg_check_frozen (fcinfo=0x7ffe5343a200) at /home/andres/src/postgresql/contrib/pg_visibility/pg_visibility.c:292
#5 0x00000000005fdbec in ExecMakeTableFunctionResult (funcexpr=0x2168630, econtext=0x2168320, argContext=<optimized out>, expectedDesc=0x2168ef0,
randomAccess=0 '\000') at /home/andres/src/postgresql/src/backend/executor/execQual.c:2211
#6 0x0000000000616992 in FunctionNext (node=node(at)entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:94
#7 0x00000000005ffdcb in ExecScanFetch (recheckMtd=0x6166f0 <FunctionRecheck>, accessMtd=0x616700 <FunctionNext>, node=0x2168210)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:95
#8 ExecScan (node=node(at)entry=0x2168210, accessMtd=accessMtd(at)entry=0x616700 <FunctionNext>, recheckMtd=recheckMtd(at)entry=0x6166f0 <FunctionRecheck>)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:145
#9 0x00000000006169e4 in ExecFunctionScan (node=node(at)entry=0x2168210) at /home/andres/src/postgresql/src/backend/executor/nodeFunctionscan.c:268

the error happened just after I restarted a standby, so it's not
unlikely to be related to hot_standby_feedback.

(gdb) p *tuple.t_data
$5 = {t_choice = {t_heap = {t_xmin = 9105470, t_xmax = 26049273, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 9105470,
datum_typmod = 26049273, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 1, bi_lo = 19765}, ip_posid = 3}, t_infomask2 = 4, t_infomask = 770,
t_hoff = 24 '\030', t_bits = 0x7f9fb8681c17 ""}

Infomask is:
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */

This indeed looks borked. Such a tuple should never survive
if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
check_frozen = false;
especially not when
(gdb) p PageIsAllVisible(page)
$3 = 4

(fwiw, checking PD_ALL_VISIBLE in those functions sounds like a good plan)

I've got another earlier case (that I somehow missed seeing), below
check_visible:

(gdb) p *tuple->t_data
$2 = {t_choice = {t_heap = {t_xmin = 13616549, t_xmax = 25210801, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 13616549,
datum_typmod = 25210801, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 0, bi_lo = 52320}, ip_posid = 67}, t_infomask2 = 32772, t_infomask = 8962,
t_hoff = 24 '\030', t_bits = 0x7f9fda2f8717 ""}

infomask is:
#define HEAP_UPDATED 0x2000 /* this is UPDATEd version of row */
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
infomask2 is:
#define HEAP_ONLY_TUPLE 0x8000 /* this is heap-only tuple */

I'll run again, with a debugger attached, maybe I can get some more
information.

Regards,

Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2016-06-10 18:08:37 Re: parallel workers and client encoding
Previous Message Peter Eisentraut 2016-06-10 17:49:35 Re: parallel.c is not marked as test covered