Re:Re: could not access status of transaction

From: chenhj <chjischj(at)163(dot)com>
To: "Tomas Vondra" <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re:Re: could not access status of transaction
Date: 2019-09-29 07:20:54
Message-ID: 39fcc88d.785d.16d7be4c99e.Coremail.chjischj@163.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, all

Our other system had encountered the same failure, but this time it is PostgreSQL 10.7(rhel 6.3).

Details are as follows:

Phenomenon:

app_db=# select count(*) from loba_sp_cost_xcd_104561;
ERROR: could not access status of transaction 35153545
DETAIL: Could not open file "pg_xact/0021": No such file or directory.

Related Information:
The clog file in the pg_xact directory is 00CE~0179, and pg_xact/0021 has been cleaned up.

Through gdb analysis, the tuple that caused the error was in block 163363.

(gdb) bt
#0 SlruReportIOError (ctl=0xca6b60, pageno=1072, xid=35153545) at slru.c:894
#1 0x00000000004e2060 in SimpleLruReadPage (ctl=0xca6b60, pageno=1072, write_ok=1 '\001', xid=35153545) at slru.c:445
#2 0x00000000004da431 in TransactionIdGetStatus (xid=35153545, lsn=0x7fffd07b1e88) at clog.c:411
#3 0x00000000004e385a in TransactionLogFetch (transactionId=35153545) at transam.c:79
#4 0x00000000004e39a5 in TransactionIdDidCommit (transactionId=<value optimized out>) at transam.c:129
#5 0x000000000085e959 in HeapTupleSatisfiesMVCC (htup=<value optimized out>, snapshot=0x2c21740, buffer=320) at tqual.c:1057
#6 0x00000000004b49e7 in heapgetpage (scan=0x2c8a5d8, page=163363) at heapam.c:439
#7 0x00000000004b5091 in heapgettup_pagemode (scan=0x2c8a5d8, direction=<value optimized out>) at heapam.c:1034
#8 heap_getnext (scan=0x2c8a5d8, direction=<value optimized out>) at heapam.c:1801
#9 0x0000000000618b21 in SeqNext (node=0x2c63708) at nodeSeqscan.c:81
#10 0x00000000005fe132 in ExecScanFetch (node=0x2c63708, accessMtd=0x618ae0 <SeqNext>, recheckMtd=0x618840 <SeqRecheck>) at execScan.c:97
#11 ExecScan (node=0x2c63708, accessMtd=0x618ae0 <SeqNext>, recheckMtd=0x618840 <SeqRecheck>) at execScan.c:164
#12 0x00000000006061ee in ExecProcNode (aggstate=0x2c630c0) at ../../../src/include/executor/executor.h:250
#13 fetch_input_tuple (aggstate=0x2c630c0) at nodeAgg.c:695
#14 0x0000000000607f89 in agg_retrieve_direct (pstate=0x2c630c0) at nodeAgg.c:2448
#15 ExecAgg (pstate=0x2c630c0) at nodeAgg.c:2158
#16 0x000000000060b240 in ExecProcNode (pstate=0x2c62da0) at ../../../src/include/executor/executor.h:250
#17 gather_getnext (pstate=0x2c62da0) at nodeGather.c:286
#18 ExecGather (pstate=0x2c62da0) at nodeGather.c:215
#19 0x00000000006061ee in ExecProcNode (aggstate=0x2c626f8) at ../../../src/include/executor/executor.h:250
#20 fetch_input_tuple (aggstate=0x2c626f8) at nodeAgg.c:695
#21 0x00000000006080ec in agg_retrieve_direct (pstate=0x2c626f8) at nodeAgg.c:2347
#22 ExecAgg (pstate=0x2c626f8) at nodeAgg.c:2158
#23 0x00000000005fb6bf in ExecProcNode (queryDesc=0x2c81658, direction=<value optimized out>, count=0, execute_once=-8 '\370') at ../../../src/include/executor/executor.h:250
#24 ExecutePlan (queryDesc=0x2c81658, direction=<value optimized out>, count=0, execute_once=-8 '\370') at execMain.c:1723
#25 standard_ExecutorRun (queryDesc=0x2c81658, direction=<value optimized out>, count=0, execute_once=-8 '\370') at execMain.c:364
#26 0x00007fd6f161c6f8 in pgss_ExecutorRun (queryDesc=0x2c81658, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at pg_stat_statements.c:889
#27 0x00007fd6f141674d in explain_ExecutorRun (queryDesc=0x2c81658, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at auto_explain.c:267
#28 0x000000000072986b in PortalRunSelect (portal=0x2c85f98, forward=<value optimized out>, count=0, dest=<value optimized out>) at pquery.c:932
#29 0x000000000072aa01 in PortalRun (portal=0x2c85f98, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x2c5fb58, altdest=0x2c5fb58, completionTag=0x7fffd07b26d0 "")
at pquery.c:773
#30 0x0000000000727051 in exec_simple_query (query_string=0x2c26358 "select count(*) from loba_sp_cost_xcd_104561;") at postgres.c:1122
#31 0x0000000000728039 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x2bcc2f0 "app_db", username=<value optimized out>) at postgres.c:4117
#32 0x00000000006bb43a in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4405
#33 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4077
---Type <return> to continue, or q <return> to quit---
#34 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1755
#35 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1363
#36 0x000000000063b4d0 in main (argc=1, argv=0x2b8e9f0) at main.c:228
(gdb) f 5
#5 0x000000000085e959 in HeapTupleSatisfiesMVCC (htup=<value optimized out>, snapshot=0x2c21740, buffer=320) at tqual.c:1057
1057tqual.c: No such file or directory.
in tqual.c
(gdb) p *tuple
$1 = {t_choice = {t_heap = {t_xmin = 35153545, t_xmax = 0, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 35153545, datum_typmod = 0, datum_typeid = 0}}, t_ctid = {ip_blkid = {
bi_hi = 2, bi_lo = 32291}, ip_posid = 9}, t_infomask2 = 32788, t_infomask = 10243, t_hoff = 32 ' ', t_bits = 0x7fd6d97f0440 "\211f\030\002"}


The problematic tuple data parsed using `pg_filedump -i` is as follows:

Block 32291 ********************************************************
<Header> -----
Block Offset: 0x0fc46000 Offsets: Lower 236 (0x00ec)
Block: Size 8192 Version 4 Upper 2112 (0x0840)
LSN: logid 254 recoff 0xa3a598c8 Special 8192 (0x2000)
Items: 53 Free Space: 1876
Checksum: 0x8355 Prune XID: 0x02186566 Flags: 0x0001 (HAS_FREE_LINES)
Length (including item array): 236
...
Item 8 -- Length: 151 Offset: 4696 (0x1258) Flags: NORMAL
XMIN: 35153480 XMAX: 35153545 CID|XVAC: 0
Block Id: 163363 linp Index: 9 Attributes: 20 Size: 32
infomask: 0x2503 (HASNULL|HASVARWIDTH|XMIN_COMMITTED|XMAX_COMMITTED|UPDATED|HOT_UPDATED|HEAP_ONLY)
t_bits: [0]: 0xff [1]: 0xff [2]: 0x07

Item 9 -- Length: 151 Offset: 4544 (0x11c0) Flags: NORMAL
XMIN: 35153545 XMAX: 0 CID|XVAC: 0
Block Id: 163363 linp Index: 9 Attributes: 20 Size: 32
infomask: 0x2803 (HASNULL|HASVARWIDTH|XMAX_INVALID|UPDATED|HEAP_ONLY)
t_bits: [0]: 0xff [1]: 0xff [2]: 0x07

The XMIN of the tuple (163363, 9) above does not have XMIN_COMMITTED, and not be frozen.

Check the data file of this table and find that in addition to some tuples with an infomask of 0x2803, there are many tuples with an infomask of 0x0803, and XMIN_COMMITTED is not set, which also triggers this fault.

Block 32438 ********************************************************
<Header> -----
Block Offset: 0x0fd6c000 Offsets: Lower 204 (0x00cc)
Block: Size 8192 Version 4 Upper 216 (0x00d8)
LSN: logid 122 recoff 0x41303260 Special 8192 (0x2000)
Items: 45 Free Space: 12
Checksum: 0xd960 Prune XID: 0x00000000 Flags: 0x0000 ()
Length (including item array): 204
...
Item 23 -- Length: 174 Offset: 4072 (0x0fe8) Flags: NORMAL
XMIN: 35292368 XMAX: 0 CID|XVAC: 0
Block Id: 163510 linp Index: 23 Attributes: 20 Size: 32
infomask: 0x0803 (HASNULL|HASVARWIDTH|XMAX_INVALID)
t_bits: [0]: 0xff [1]: 0xff [2]: 0x00

Regards,
Chen Huajun

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message chenhj 2019-09-29 08:16:28 Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)
Previous Message Amit Kapila 2019-09-29 05:54:10 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions