Re: "pgstat wait timeout" just got a lot more common on Windows

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-10 15:27:19
Message-ID: CABUevEwoivuFOkyWPBP=rWKywd1OxC7aROG1xfyULv5hqQnXkg@mail.gmail.com
Lists: pgsql-hackers

On May 10, 2012 4:59 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
> > Last night I changed the stats collector process to use
> > WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> > the postmaster has died. This morning I observe that several Windows
> > buildfarm members are showing regression test failures caused by
> > unexpected "pgstat wait timeout" warnings. Everybody else is fine.
>
> > This suggests that there is something broken in the Windows
> > implementation of WaitLatchOrSocket. I wonder whether it also
> > tells us something we did not know about the underlying cause of
> > those messages. Not sure what though. Ideas? Can anyone who
> > knows Windows take another look at WaitLatchOrSocket?
>
> Anybody have any clues about that? If not, I think I'll have to revert
> the pgstat changes for beta1, which isn't really forward progress.

Haven't had time to look at the code itself, and won't before wrap time.
Sorry.

> I spent some time staring at the Windows WaitLatchOrSocket code myself.
> The only thing I could find that seemed wrong is that in the event
> array, we list the latch's event before pgwin32_signal_event. The
> Microsoft documentation I looked at says that if more than one event
> is ready, WaitForMultipleObjects reports the first such array member.
> This means that if the latch is already set when control gets here,
> signal handlers will not be serviced.

Yeah, that does seem wrong.
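
To make the ordering problem concrete, here is a minimal sketch of the "first ready array member wins" rule described above. It is a portable model, not the Win32 call itself: first_ready() and the boolean array are hypothetical names that stand in for the wait call and the event handles.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the reporting rule quoted above: when several
 * entries are ready at once, only the lowest array index is reported.
 * Returns that index, or -1 if nothing is ready.
 */
int
first_ready(const bool *ready, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (ready[i])
            return (int) i;
    }
    return -1;
}
```

If index 0 models the latch's event and index 1 models pgwin32_signal_event, then with both ready the function reports index 0 every time and the signal event is never seen. Swapping the two entries makes the signal event the one reported.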

> That doesn't match what would
> happen on a Unix machine, so it seems like at least a violation of the
> POLA. Hence I think we oughta swap the order of those two array
> elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.) I do not however

Maybe we need a loop that checks for all events?
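
Roughly what I have in mind, as a sketch only: instead of acting on the single index the wait reports, walk the whole array and collect every entry that is ready. ready_mask() and the boolean array are hypothetical. On Windows I assume each check would be a zero-timeout WaitForSingleObject() on the corresponding handle, and I have not checked how that interacts with auto-reset events.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: return a bitmask with bit i set for every ready
 * entry, so no event is hidden behind an earlier one in the array.
 * Entries beyond the width of the mask are ignored.
 */
unsigned
ready_mask(const bool *ready, size_t n)
{
    unsigned    mask = 0;

    for (size_t i = 0; i < n && i < sizeof(mask) * 8; i++)
    {
        if (ready[i])
            mask |= 1u << i;
    }
    return mask;
}
```

With both the latch and the signal event ready, the caller sees both bits set and can service signals regardless of array order.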

> see a way that that would explain the
> pgstat failures, because the stats collector's latch really shouldn't
> ever get set during normal regression test runs.

So could there be something wrong in the other end, meaning the latch
*does* get set?

/Magnus
