Re: reporting TID/table with corruption error

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: reporting TID/table with corruption error
Date: 2022-01-10 14:52:35
Message-ID: 202201101452.csmmhfm63l5f@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2022-Jan-10, Andrey Borodin wrote:

> 3. The tuple seems to be updated in a high-contention concurrency
> trigger function, autovacuum keeks in ~20-30 seconds after the message
> in logs

Hmm, I bet this is related.

> [ 2022-01-10 09:07:17.671 MSK [unknown],????,????_????s,310759,XX001 ]:ERROR: t_xmin 696079792 is uncommitted in tuple (1419011,109) to be updated in table "????s_statistics"
> [ 2022-01-10 09:07:17.671 MSK [unknown],????,????_????s,310759,XX001 ]:CONTEXT: SQL statement "UPDATE ????_????s.????s_statistics os
> SET ????_????_found_ts = COALESCE(os.????_????_found_ts, NOW()),
> last_????_found_ts = NOW(),
> num_????s = os.num_????s + 1
> WHERE ????_id = NEW.????_id"
> PL/pgSQL function statistics__update_from_new_????() line 3 at SQL statement
> [ 2022-01-10 09:07:17.671 MSK [unknown],????,????_????s,310759,XX001 ]:STATEMENT:
> INSERT INTO ????_????s.????s_????s AS ????s
>
> 4. t_xmin is relatevely new, not ancient
> 5. pageinspect shows dead tuple after some time
> 6. no suspicious activity in logs nearby
> 7. vacuum (disable_page_skipping) and repack of indexes did not change anything
>
>
> I suspect this can be relatively new concurrency stuff. At least I never saw this before on clusters with clean amcheck and heapcheck results.

Ah. I've been thinking that it'd be some very old tuple that is in
trouble, but that seems to be proven false. I think we should examine
the affected tuples more closely while they're in the state that causes
the problem. Can you set up things so that pageinspect's
heap_page_items() is run on the broken page, before the problem
disappears? Maybe extract the page number from the log line, have
get_raw_page() store the page in a separate table, so that we can run
heap_page_items() at leisure later. I would also grep through
pg_waldump output to see what changes have been done to that tuple.

Maybe the reason the problem fixes itself is that something else deletes
the tuple.

> Alvaro, did you observe this on binaries from August 13 minor release or older?

Well, the only reports I have of this problem are with the original
error message that didn't give any clues at to what the problem was or
where to look for it, so I don't know if the xmin was recent or not.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-01-10 15:05:22 Re: small bug in ecpg unicode identifier error handling
Previous Message Daniel Verite 2022-01-10 14:45:47 Re: ICU for global collation