BUG #9557: Row not visible after receiving notification

From: marko(at)joh(dot)to
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #9557: Row not visible after receiving notification
Date: 2014-03-13 09:39:22
Message-ID: 20140313093922.335.80556@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 9557
Logged by: Marko Tiikkaja
Email address: marko(at)joh(dot)to
PostgreSQL version: 9.1.12
Operating system: Linux
Description:

Hi,

We went live with an application a few days ago, and it's been having
problems with seeing rows after receiving a notification from the
transaction that committed it. I can't say for sure how often it happens,
but I've seen evidence suggesting at least three separate occasions.

The general design is that we have two independent "RPC Workers" LISTENing
for notifications. Upon receiving a notification, they call a function
Claim_Request() which does this:

BEGIN
SELECT
Requests.RequestID INTO _RequestID
FROM rpcworker.Requests
WHERE
state = 'UNCLAIMED' AND
-- sanity check
Requests.Claimed IS NULL
ORDER BY Requests.RequestID
LIMIT 1
FOR UPDATE ;
IF NOT FOUND THEN
RETURN;
END IF;

-- otherwise UPDATE to set claimed=now()

I suspected it might be something wonky with FOR UPDATE, so I added an extra
query sent right before Claim_Request():

SELECT Requests.RequestID
FROM rpcworker.Requests
WHERE state = 'UNCLAIMED' AND
Requests.Claimed IS NULL
ORDER BY Requests.RequestID
LIMIT 1

Below I have attached the relevant portions of the log when this happened.
In the example, the SELECT without the FOR UPDATE did not see the row, but
the subsequent Claim_Request() did (supposedly due to the small latency
added by the extra round-trip by the SELECT before it).

RPCWorker B appeared to be working "normally" in this case (though it was a
bit slower to react than A), but I've attached its logs for posterity as
well. In previous cases, where both of them called Claim_Request()
directly, we saw a couple of instances where *neither* of the RPC workers
saw the new row, which suggests something's terribly wrong.

## RPCWORKER B (UTC): ##

received notification with payload 8839
2014-03-12 21:11:27.250 +0000 LOG: received notification on channel "RPC
Worker"
2014-03-12 21:11:27.253 +0000 LOG: saw requestid 8839 without FOR UPDATE

## RPCWORKER A (UTC): ##

received notification with payload 8839
2014-03-12 21:11:27.240 +0000 LOG: received notification on channel "RPC
Worker"
2014-03-12 21:11:27.245 +0000 LOG: plain SELECT: sql: no rows in result
set
2014-03-12 21:11:27.251 +0000 WARNING: SNAKE: saw 0, claimed 8839
2014-03-12 21:11:27.251 +0000 LOG: received requestid 8839
2014-03-12 21:11:27.253 +0000 LOG: plain SELECT: sql: no rows in result
set
2014-03-12 21:11:27.531 +0000 LOG: request 8839 (attempt 1) succeeded
2014-03-12 21:11:27.562 +0000 LOG: finished request 8839 (attempt 1)

## DBWORKER (UTC + 1): ##

2014-03-12 22:11:27.237
CET,"dbworker","db",34682,"10.1.1.71:42307",531f39d8.877a,211904,"SELECT",2014-03-11
17:29:12 CET,583/727697,1268656677,WARNING,01000,"
# the above line is the last piece of information we see from the
transaction sending the notification

# committed somewhere around here

2014-03-12 22:11:27.246
CET,"dbworker","db",34682,"10.1.1.71:42307",531f39d8.877a,211905,"SELECT",2014-03-11
17:29:12 CET,583/0,0,LOG,00000,"duration: 42.826
ms",,,,,,,,"exec_simple_query, postgres.c:1090",""

# known committed at this point (but a query has already been executed!)

## RPCWORKER A IN DB (UTC + 1): ##

2014-03-12 22:11:21.052
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12480,"SELECT",2014-03-12
16:28:07 CET,34/0,0,LOG,00000,"duration: 0.741
ms",,,,,,,,"exec_simple_query, postgres.c:1090","rpcworker

# idle for a good while here

2014-03-12 22:11:27.245
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12481,"idle",2014-03-12
16:28:07 CET,34/2000571,0,LOG,00000,"statement: -- NO FOR UPDATE
2014-03-12 22:11:27.245
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12482,"SELECT",2014-03-12
16:28:07 CET,34/0,0,LOG,00000,"duration: 0.983
ms",,,,,,,,"exec_simple_query, postgres.c:1090","rpcworker"

# didn't see the row

2014-03-12 22:11:27.252
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12483,"idle",2014-03-12
16:28:07 CET,34/2000572,0,LOG,00000,"statement: -- Claim_Request()

# sees the row!

## RPCWORKER B IN DB (UTC + 1): ##

2014-03-12 22:11:27.253
CET,"rpcworker","db",32865,"10.2.1.72:33236",53207cee.8061,14392,"idle",2014-03-12
16:27:42 CET,95/1580775,0,LOG,00000,"statement: -- NO FOR UPDATE

# is a bit sluggish, so sees the row!

2014-03-12 22:11:27.254
CET,"rpcworker","db",32865,"10.2.1.72:33236",53207cee.8061,14394,"idle",2014-03-12
16:27:42 CET,95/1580776,0,LOG,00000,"statement: -- Claim_Request()

# would likely see the row, but already claimed by worker A

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Fujii Masao 2014-03-13 10:52:28 Re: BUG #9118: WAL Sender does not disconnect replication clients during shutdown
Previous Message Kyotaro HORIGUCHI 2014-03-13 08:14:28 Re: BUG #9553: why bitmap index scan startup_cost=0? it's a bug?