Can slock_t ever be unaligned?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Can slock_t ever be unaligned?
Date: 1998-09-24 18:26:07
Message-ID: 422.906661567@sss.pgh.pa.us
Lists: pgsql-hackers

I'm about halfway convinced that the database corruption problem I
reported yesterday is a result of interlock failure among multiple
backends. I have a trace of the frontend/backend interactions that
were happening at the time the table got corrupted, and let me tell
you they are peculiar. Four clients were simultaneously trying to
access two tables, using BEGIN TRANSACTION / LOCK / END TRANSACTION
to ensure consistency. Works fine 99% of the time. This particular
time, not only was the table corrupted but the clients got logically
inconsistent results: one transaction saw some but not all of the
updates committed by a previous transaction. Moreover, the timestamps
show that one client successfully executed several begin/lock/update/
end transaction cycles on one of the tables *while another client
believed it was holding a lock on that table*. The timestamps also
indicate that the bogus transactions took about ten times longer to
execute than they normally would've.

Given this evidence, I am strongly inclined to think that spinlocking
(S_LOCK and friends) is not working right on my platform ... which is
HPUX 9. I've eyeballed the HP-PA assembly implementation of tas(),
and the only thing potentially wrong with it that I can see is that
the 16-byte slock_t object had better be aligned at least on a 4-byte
boundary. If it happened to be placed at an address that is not a
multiple of 4, the tas() code would overwrite one to three bytes beyond
the end of the slock_t object.

Can anyone say whether that's possible? Is slock_t ever part of a
tuple that might be packed to strange boundaries?

Another thing that would kill this implementation is if someone tried
to copy an slock_t around while it is in the locked state --- the
assembly code is actually using whichever word of the 16-byte object
is aligned on a 16-byte boundary, because that's what HP-PA's semaphore
lock instruction requires. Move the slock_t to a different address,
and the active word within it probably changes. So is there any place
in the system where structures containing slock_t's might be shifted
around?
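
To make the two hazards above concrete, here is a rough sketch of the address arithmetic. It assumes (this is a guess at what the assembly does, not a transcription of tas.s) that the code rounds the passed address up to the next 16-byte boundary and operates on the 4-byte word found there. The names active_word_addr, active_word_index, and overrun_bytes are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SLOCK_SIZE 16   /* size of the HP-PA slock_t, per the text above */
#define WORD_SIZE   4   /* assumed: the lock instruction touches one 32-bit word */

/* Assumed behavior: round the lock address up to a 16-byte boundary. */
uintptr_t active_word_addr(uintptr_t lock_addr)
{
    return (lock_addr + 15) & ~(uintptr_t) 15;
}

/* Which of the four words is the active one (word-aligned addresses only). */
int active_word_index(uintptr_t lock_addr)
{
    assert(lock_addr % WORD_SIZE == 0);
    return (int) ((active_word_addr(lock_addr) - lock_addr) / WORD_SIZE);
}

/* Bytes of the active word that land past the end of the slock_t. */
unsigned overrun_bytes(uintptr_t lock_addr)
{
    uintptr_t word_end = active_word_addr(lock_addr) + WORD_SIZE;
    uintptr_t lock_end = lock_addr + SLOCK_SIZE;

    return word_end > lock_end ? (unsigned) (word_end - lock_end) : 0;
}
```

Under that assumption, an slock_t at 0x1004 uses word 3, while a copy of it at 0x1008 would use word 2, so a locked state copied along with the structure would no longer sit in the word the code inspects. An slock_t at 0x1001 would have its active word run 3 bytes past the end of the object.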

I think I will try modding tas.s to arrange a coredump if the passed
address isn't adequately aligned, and then start testing things...
but if anyone can tell me exactly where slock_t's usually live, it
might save me some time.
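
For what it's worth, the same check could be sketched on the C side rather than in tas.s. This is only an illustration of the idea; slock_is_aligned and check_slock_alignment are hypothetical names, not anything in the PostgreSQL sources, and the 4-byte requirement is the one stated above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* True if the lock address sits on a 4-byte boundary. */
int slock_is_aligned(const volatile void *lock)
{
    return ((uintptr_t) lock & 3) == 0;
}

/* Hypothetical guard to call before handing the address to tas(). */
void check_slock_alignment(const volatile void *lock)
{
    assert(lock != NULL);
    if (!slock_is_aligned(lock))
    {
        fprintf(stderr, "misaligned slock_t at %p\n", (void *) lock);
        abort();    /* raises SIGABRT, which normally leaves a core dump */
    }
}
```

Calling check_slock_alignment() at the top of the lock-acquire path would flag the first misaligned slock_t, and the core file would show who passed it in.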

The next possibility is that it's not spin-locking but a higher level
of lock code that is broken. If anyone can give me an idea where to
look, I'd appreciate it.

BTW, I have only seen these failures with a 6.3.2 server, not with
current sources ... but I haven't stressed my development server very
much with multiple clients. The bug could still be in 6.4beta.

regards, tom lane
