Re: [pgpool-general: 1315] Re: Database connections seemingly hanging

From: Fredrik(dot)HuitfeldtMadsen(at)schneider-electric(dot)com
To: ishii(at)postgresql(dot)org
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, nagata(at)sraoss(dot)co(dot)jp, pgsql-general(at)postgresql(dot)org, pgpool-general(at)pgpool(dot)net
Subject: Re: [pgpool-general: 1315] Re: Database connections seemingly hanging
Date: 2013-01-28 09:03:39
Message-ID: OFD871653C.EFC11E63-ONC1257B01.002B7ADE-C1257B01.0031C5D5@apcc.com

Hello Mr. Ishii,

we have attempted to create a self-contained test case, but have been
unsuccessful so far. We understand how pgpool acquires locks in theory,
but what we are seeing seems to be different. We have summarized our
findings here: http://pastebin.com/9f6gjxLA

It seems that pgpool child 7606 communicates with process 26453 on the
.202 node and process 17789 on the .204 node. We can see from the output
from postgres that process 26453 is waiting for the lock on the .202
server, while 17789 has the lock on the .204 server. This means that
pgpool child 7606 will wait until process 26453 can obtain the lock on the
.202 server.

At the same time, we can see that pgpool child 7681 communicates with
process 23451 on the .202 server and process 12464 on the .204 server. We
can see from the output from postgres that process 23451 holds the lock
on the .202 server, while process 12464 is waiting for it on the .204
server. This means that pgpool child 7681 will wait until process 12464
can obtain the lock on the .204 server.

Since pgpool child 7606 via process 17789 has the lock on the .204 server,
it blocks pgpool child 7681 from completing. Since pgpool child 7681 via
process 23451 has the lock on the .202 server, it blocks pgpool child 7606
from completing. This seems to be a classic deadlock.
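
A minimal sketch of how such a per-node lock check can be done follows.
The host names, database name and credentials are placeholders, and the
query is a generic pg_locks inspection rather than the exact commands
behind the pastebin:

    # Sketch: list which backend PIDs hold or wait for relation locks on
    # each node. Connection details below are placeholders only.
    import psycopg2

    NODES = {
        ".202 node": "host=192.0.2.202 dbname=appdb user=postgres",
        ".204 node": "host=192.0.2.204 dbname=appdb user=postgres",
    }

    LOCK_QUERY = """
        SELECT pid, locktype, relation::regclass AS rel, mode, granted
        FROM pg_locks
        WHERE relation IS NOT NULL
        ORDER BY rel, granted DESC
    """

    for label, dsn in NODES.items():
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(LOCK_QUERY)
            print("--- %s ---" % label)
            for pid, locktype, rel, mode, granted in cur.fetchall():
                state = "holds" if granted else "waits for"
                print("backend %s %s %s (%s) on %s"
                      % (pid, state, mode, locktype, rel))
        finally:
            conn.close()

Running something like this against both nodes shows the hold/wait
pattern described above.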

We have spent a fair amount of time debugging this situation, and we would
really appreciate feedback on our situation.

Is there any further information we could provide that would help you
give us this feedback?

Kind regards,
Fredrik & Friends

Tatsuo Ishii <ishii(at)postgresql(dot)org>
2013/01/16 07:54

To
Fredrik(dot)HuitfeldtMadsen(at)schneider-electric(dot)com
cc
tgl(at)sss(dot)pgh(dot)pa(dot)us, nagata(at)sraoss(dot)co(dot)jp, pgsql-general(at)postgresql(dot)org,
pgpool-general(at)pgpool(dot)net
Subject
Re: [pgpool-general: 1315] Re: [GENERAL] Database connections seemingly
hanging

> It seems that the root cause was that pgpool acquired the locks in the
> wrong order. If the resource is called A it seems that pgpool allows
> child X to acquire A on node1 and at the same time, child Y acquires A
> on node2. This leaves X wanting A on node2 and Y wanting A on node1.
> This leaves both children hanging indefinitely. It also leaves both
> postgres'es blissfully unaware of the deadlock, whereby it escapes
> postgres'es deadlock detection.

That's hard for me to believe. For any query, pgpool sends it to the
master node (node1 in your case) first and waits, using select(2) on the
socket to the PostgreSQL backend, until that node returns a response.
Only after something comes back on the socket does pgpool issue the
query to node2. So pgpool never sends a query to the master node (node1)
and node2 concurrently. This is a classic technique to avoid a
cross-node deadlock.
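
(Illustrative sketch only of the ordering just described, not pgpool's
actual implementation; send_query and the socket arguments are
hypothetical stand-ins for the real backend connection handling.)

    import select

    def dispatch_to_both_nodes(query, master_sock, node2_sock, send_query):
        # Send the query to the master backend first.
        send_query(master_sock, query)
        # Wait, as with select(2), until the master's socket becomes
        # readable, i.e. the master has started to respond.
        select.select([master_sock], [], [])
        # Only then is the same query issued to node2, so the two
        # backends are never asked for the lock concurrently.
        send_query(node2_sock, query)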

If your explanation were correct, pgpool would easily run into this
deadlock even with a simple pgbench query.

Could you please show me a self-contained test case?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
