Re: Pg stuck at 100% cpu, for multiple days

From: Joe Conway <mail(at)joeconway(dot)com>
To: depesz(at)depesz(dot)com, Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
Cc: pgsql-general mailing list <pgsql-general(at)postgresql(dot)org>
Subject: Re: Pg stuck at 100% cpu, for multiple days
Date: 2021-08-30 15:04:33
Message-ID: cb86c11d-9c9d-d7ac-8261-c06fba3a6612@joeconway.com
Lists: pgsql-general

On 8/30/21 10:36 AM, hubert depesz lubaczewski wrote:
> Anyway - it's 12.6 on aarch64. A couple of days ago a replication
> slot was started, and now it seems to be stuck.

> #0 hash_seq_search (status=status@entry=0xffffdd90f380) at ./build/../src/backend/utils/hash/dynahash.c:1448
> #1 0x0000aaaac3042060 in RelfilenodeMapInvalidateCallback (arg=<optimized out>, relid=105496194) at ./build/../src/backend/utils/cache/relfilenodemap.c:64
> #2 0x0000aaaac3033aa4 in LocalExecuteInvalidationMessage (msg=0xffff9b66eec8) at ./build/../src/backend/utils/cache/inval.c:595
> #3 0x0000aaaac2ec8274 in ReorderBufferExecuteInvalidations (rb=0xaaaac326bb00 <errordata>, txn=0xaaaac326b998 <formatted_start_time>, txn=0xaaaac326b998 <formatted_start_time>) at ./build/../src/backend/replication/logical/reorderbuffer.c:2149
> #4 ReorderBufferCommit (rb=0xaaaac326bb00 <errordata>, xid=xid@entry=2668396569, commit_lsn=187650393290540, end_lsn=<optimized out>, commit_time=commit_time@entry=683222349268077, origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1770
> #5 0x0000aaaac2ebd314 in DecodeCommit (xid=2668396569, parsed=0xffffdd90f7e0, buf=0xffffdd90f960, ctx=0xaaaaf5d396a0) at ./build/../src/backend/replication/logical/decode.c:640
> #6 DecodeXactOp (ctx=ctx@entry=0xaaaaf5d396a0, buf=0xffffdd90f960, buf@entry=0xffffdd90f9c0) at ./build/../src/backend/replication/logical/decode.c:248
> #7 0x0000aaaac2ebd42c in LogicalDecodingProcessRecord (ctx=0xaaaaf5d396a0, record=0xaaaaf5d39938) at ./build/../src/backend/replication/logical/decode.c:117
> #8 0x0000aaaac2ecfdfc in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2840
> #9 0x0000aaaac2ed2228 in WalSndLoop (send_data=send_data@entry=0xaaaac2ecfd98 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2189
> #10 0x0000aaaac2ed2efc in StartLogicalReplication (cmd=0xaaaaf5d175a8) at ./build/../src/backend/replication/walsender.c:1133
> #11 exec_replication_command (cmd_string=cmd_string@entry=0xaaaaf5c0eb00 "START_REPLICATION SLOT cdc LOGICAL 1A2D/4B3640 (\"proto_version\" '1', \"publication_names\" 'cdc')") at ./build/../src/backend/replication/walsender.c:1549
> #12 0x0000aaaac2f258a4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5c78cd8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4257
> #13 0x0000aaaac2eac338 in BackendRun (port=0xaaaaf5c68070, port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4484
> #14 BackendStartup (port=0xaaaaf5c68070) at ./build/../src/backend/postmaster/postmaster.c:4167
> #15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
> #16 0x0000aaaac2ead364 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398
> #17 0x0000aaaac2c3ca5c in main (argc=5, argv=0xaaaaf5c07720) at ./build/../src/backend/main/main.c:228
>
> The thing is - I can't close it with pg_terminate_backend(), and I'd
> rather not kill -9, as it will, I think, close all other connections,
> and this is prod server.

> still makes me ask: why does Pg end up in such place, where it
> doesn't do any syscalls, doesn't accept pg_terminate_backend(), and
> is using 100% of cpu?

src/backend/utils/hash/dynahash.c:1448 is in the middle of a while loop,
which is apparently not exiting.

There is no check for interrupts in there, and it is a fairly tight loop,
which would explain both symptoms: the backend ignores
pg_terminate_backend() and spins at 100% CPU without making syscalls.
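
To illustrate the general mechanism (a minimal sketch, not PostgreSQL
source): a termination request only sets a flag, and the running code has
to poll that flag to act on it. The names Node, interrupt_pending,
handle_sigterm and walk_chain_interruptible below are hypothetical;
PostgreSQL's own polling is done via the CHECK_FOR_INTERRUPTS() macro.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry in a hash bucket chain. */
typedef struct Node
{
    struct Node *next;
} Node;

/* Set asynchronously by the signal handler, polled by the loop. */
volatile sig_atomic_t interrupt_pending = 0;

/* The handler only records the request; it does not stop anything itself. */
void
handle_sigterm(int signo)
{
    (void) signo;
    interrupt_pending = 1;
}

/*
 * Walk a chain and return the number of nodes visited, or -1 if a pending
 * interrupt was noticed. If the chain were corrupted into a cycle and the
 * flag were never checked, this loop would spin forever in user space.
 */
long
walk_chain_interruptible(const Node *head)
{
    long visited = 0;

    for (const Node *n = head; n != NULL; n = n->next)
    {
        if (interrupt_pending)  /* the check a tight loop may be missing */
            return -1;
        visited++;
    }
    return visited;
}
```

Without the flag test inside the loop, a chain that loops back on itself
never terminates, makes no syscalls, and never notices the signal, which
matches what is being observed here.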

As to how it got that way, I have to assume data corruption or a bug of
some sort. I would repost the details to pgsql-hackers for better
visibility.

Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
