Storage Manager crash at mdwrite()

From: Tareq Aljabban <dee(dot)jay23(dot)me(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Storage Manager crash at mdwrite()
Date: 2012-03-15 17:49:40
Message-ID: CAGOe0a+rspSdXXrFTj5pjHMeQgPSydDbC7-wTR=6bzrvYvnmnw@mail.gmail.com
Lists: pgsql-hackers

I'm implementing an extension to mdwrite() in backend/storage/smgr/md.c.
When a block is written to local storage using mdwrite(), I also send
this block to HDFS storage.
So far I don't need to read back the values I'm writing to HDFS. This
approach works fine in the initdb phase.
However, when I run postgres (bin/pg_ctl start), the first few write
operations succeed, and then suddenly (after writing exactly 3
files to HDFS) I get exit code 130 with the following message, which
includes the JVM thread dump of HDFS:

LOG: background writer process (PID 29347) exited with exit code 130
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
2012-03-15 13:27:52
Full thread dump OpenJDK Server VM (16.0-b13 mixed mode):

"IPC Client (47) connection to localhost/127.0.0.1:8020 from taljab1"
daemon prio=10 tid=0x8994d400 nid=0x72e4 in Object.wait() [0x898ad000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x8a3c3050> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
- locked <0x8a3c3050> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)

"IPC Client (47) connection to localhost/127.0.0.1:8020 from taljab1"
daemon prio=10 tid=0x89b87c00 nid=0x72e3 in Object.wait() [0x898fe000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x8a2ff268> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
- locked <0x8a2ff268> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)

"Low Memory Detector" daemon prio=10 tid=0x09daa400 nid=0x72cd runnable
[0x00000000]
java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x09da8400 nid=0x72cc waiting on
condition [0x00000000]
java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x09da6000 nid=0x72cb waiting on
condition [0x00000000]
java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x09da4800 nid=0x72ca waiting on
condition [0x00000000]
java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x09d94800 nid=0x72c9 in Object.wait()
[0x89db4000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x8a7202b0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:133)
- locked <0x8a7202b0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:149)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x09d8fc00 nid=0x72c8 in
Object.wait() [0x89e05000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x8a720338> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x8a720338> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x09d3bc00 nid=0x72c6 runnable [0x00000000]
java.lang.Thread.State: RUNNABLE

"VM Thread" prio=10 tid=0x09d8d000 nid=0x72c7 runnable

"VM Periodic Task Thread" prio=10 tid=0x09dac800 nid=0x72ce waiting on
condition

JNI global references: 763

Heap
def new generation total 4800K, used 1844K [0x8a270000, 0x8a7a0000,
0x94870000)
eden space 4288K, 34% used [0x8a270000, 0x8a3e0418, 0x8a6a0000)
from space 512K, 72% used [0x8a720000, 0x8a77cdd8, 0x8a7a0000)
to space 512K, 0% used [0x8a6a0000, 0x8a6a0000, 0x8a720000)
tenured generation total 10624K, used 0K [0x94870000, 0x952d0000,
0xa9470000)
the space 10624K, 0% used [0x94870000, 0x94870000, 0x94870200,
0x952d0000)
compacting perm gen total 16384K, used 5765K [0xa9470000, 0xaa470000,
0xb1470000)
the space 16384K, 35% used [0xa9470000, 0xa9a11480, 0xa9a11600,
0xaa470000)
No shared spaces configured.

This looks like an HDFS issue, but the same code worked properly during
initdb. I replaced the HDFS write code with code that always writes the
same (empty) block to HDFS, regardless of the value received by
mdwrite(). I kept getting the same failure after writing 3 files.
I also copied this exact code into a separate C application and ran it
there, and it executed without any problems (I wrote/deleted 100 files).
That's why I suspect it's something related to PostgreSQL.
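
To make the setup concrete, the write path described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual md.c change: write_block_mirrored, mirror_fn and counting_mirror are hypothetical names, the local file is a plain stdio stream rather than PostgreSQL's file layer, and the mirror callback stands in for whatever HDFS client call is really used. The block size of 8192 assumes PostgreSQL's default BLCKSZ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8192			/* assumed: PostgreSQL's default BLCKSZ */

/* Hypothetical callback: forwards one block to a secondary store. */
typedef int (*mirror_fn) (const char *buf, size_t len, void *ctx);

/*
 * Write one block to the local file at the given block number, then hand
 * the same buffer to the mirror callback (if any).
 * Returns 0 on success, -1 if the local write fails, -2 if the mirror fails.
 */
static int
write_block_mirrored(FILE *local, long blocknum, const char *buf,
					 mirror_fn mirror, void *ctx)
{
	assert(local != NULL && buf != NULL);

	if (fseek(local, blocknum * (long) BLOCK_SIZE, SEEK_SET) != 0)
		return -1;
	if (fwrite(buf, 1, BLOCK_SIZE, local) != BLOCK_SIZE)
		return -1;
	if (fflush(local) != 0)
		return -1;

	/* The local write succeeded; now forward the block. */
	if (mirror != NULL && mirror(buf, BLOCK_SIZE, ctx) != 0)
		return -2;
	return 0;
}

/*
 * Stand-in mirror that only counts the blocks it receives. A real one
 * would call into the HDFS client library here instead.
 */
static int
counting_mirror(const char *buf, size_t len, void *ctx)
{
	(void) buf;
	if (len != BLOCK_SIZE)
		return -1;
	++*(int *) ctx;
	return 0;
}
```

Keeping the mirror behind a callback is what made it easy to swap the real HDFS write for one that always sends the same empty block, and to run the same code outside the server in a standalone program.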

Any ideas on what's going wrong?
