| From: | Magnus Hagander <magnus(at)hagander(dot)net> | 
|---|---|
| To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> | 
| Cc: | pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: "pgstat wait timeout" just got a lot more common on Windows | 
| Date: | 2012-05-10 15:27:19 | 
| Message-ID: | CABUevEwoivuFOkyWPBP=rWKywd1OxC7aROG1xfyULv5hqQnXkg@mail.gmail.com | 
| Lists: | pgsql-hackers | 
On May 10, 2012 4:59 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
> > Last night I changed the stats collector process to use
> > WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> > the postmaster has died.  This morning I observe that several Windows
> > buildfarm members are showing regression test failures caused by
> > unexpected "pgstat wait timeout" warnings.  Everybody else is fine.
>
> > This suggests that there is something broken in the Windows
> > implementation of WaitLatchOrSocket.  I wonder whether it also
> > tells us something we did not know about the underlying cause of
> > those messages.  Not sure what though.  Ideas?  Can anyone who
> > knows Windows take another look at WaitLatchOrSocket?
>
> Anybody have any clues about that?  If not, I think I'll have to revert
> the pgstat changes for beta1, which isn't really forward progress.

Haven't had time to look at the code itself, and won't before wrap time.
Sorry.

> I spent some time staring at the Windows WaitLatchOrSocket code myself.
> The only thing I could find that seemed wrong is that in the event
> array, we list the latch's event before pgwin32_signal_event.  The
> Microsoft documentation I looked at says that if more than one event
> is ready, WaitForMultipleObjects reports the first such array member.
> This means that if the latch is already set when control gets here,
> signal handlers will not be serviced.

Yeah, that does seem wrong.

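
To make the ordering problem concrete, here is a minimal sketch in portable C. It only models the "first ready array member wins" rule described above; it does not call the real Win32 API, and `wait_first_ready`, `reported_event` and the `EV_*` names are hypothetical stand-ins, not the code in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { EV_NONE, EV_LATCH, EV_SIGNAL } event_kind;

/* Hypothetical model of a wait-any call that reports the lowest-index
 * ready event. Returns -1 when nothing is ready. */
static int wait_first_ready(const bool *ready, size_t n)
{
    assert(ready != NULL);
    for (size_t i = 0; i < n; i++)
        if (ready[i])
            return (int) i;
    return -1;
}

/* Build a two-element event array in the chosen order and return the
 * kind of event the wait reports. signal_first selects whether the
 * signal event is listed before the latch event. */
static event_kind reported_event(bool latch_set, bool signal_pending,
                                 bool signal_first)
{
    event_kind kinds[2];
    bool ready[2];

    kinds[0] = signal_first ? EV_SIGNAL : EV_LATCH;
    kinds[1] = signal_first ? EV_LATCH : EV_SIGNAL;

    for (size_t i = 0; i < 2; i++)
        ready[i] = (kinds[i] == EV_LATCH) ? latch_set : signal_pending;

    int idx = wait_first_ready(ready, 2);
    return idx < 0 ? EV_NONE : kinds[idx];
}
```

With the latch listed first, `reported_event(true, true, false)` yields `EV_LATCH`, so a pending signal goes unreported for as long as the latch stays set. With the order swapped, the same state yields `EV_SIGNAL`.
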
>  That doesn't match what would
> happen on a Unix machine, so it seems like at least a violation of the
> POLA.  Hence I think we oughta swap the order of those two array
> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.)  I do not however

Maybe we need a loop that checks for all events?

> see a way that that would explain the
> pgstat failures, because the stats collector's latch really shouldn't
> ever get set during normal regression test runs.

So could there be something wrong in the other end, meaning the latch
*does* get set?

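
For the "loop that checks for all events" idea, a rough sketch of what I mean, again in portable C rather than real Win32 code. `poll_all_events` is a hypothetical helper: after a wakeup it inspects every event instead of trusting only the index the wait reported. On Windows the per-event check would presumably be a zero-timeout `WaitForSingleObject` on each handle, but that mapping is an assumption and is not shown here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return a bitmask with bit i set for every event
 * that is ready, so the caller can service all of them regardless of
 * their position in the array. */
static unsigned poll_all_events(const bool *ready, size_t n)
{
    assert(ready != NULL);
    assert(n <= sizeof(unsigned) * CHAR_BIT);

    unsigned mask = 0;
    for (size_t i = 0; i < n; i++)
        if (ready[i])
            mask |= 1u << i;
    return mask;
}
```

With both the latch (index 0) and the signal event (index 1) ready, the mask has both bits set, so neither can starve the other whatever the array order is.
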
/Magnus