Re: WIP: WAL prefetch (another approach)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-05-04 19:47:41
Message-ID: 3967044.1620157661@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> I suppose that if we're unable to reproduce it on at least one other box,
> we have to write it off as hardware flakiness.

BTW, that conclusion shouldn't distract us from the very real bug
that Andres identified. I was just scraping the buildfarm logs
concerning recent failures, and I found several recent cases
that match the symptom he reported:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2021-04-23%2022%3A27%3A41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-04-21%2005%3A15%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-20%2002%3A03%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-05-04%2004%3A07%3A41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-20%2021%3A08%3A59

They all show the standby in recovery/019_replslot_limit.pl failing
with symptoms like

2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC
2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode
2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8
2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D00000
2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045
2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections
2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1

(BTW, the behavior seen here where the failure occurs *immediately*
after reporting "consistent recovery state reached" is seen in the
other reports as well, including Andres' version. I wonder if that
means anything.)

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2021-05-04 19:57:39 Re: PG in container w/ pid namespace is init, process exits cause restart
Previous Message Robert Haas 2021-05-04 19:18:36 Re: [bug?] Missed parallel safety checks, and wrong parallel safety