Quick Links

postgres bug report

From:	jfaith <jfaith(at)cemsys(dot)com>
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	postgres bug report
Date:	2000-07-21 10:54:27
Message-ID:	39782BE3.3FC29D2F@cemsys.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

============================================================================
POSTGRESQL BUG REPORT TEMPLATE
============================================================================
Your name :Jeremy Faith
Your email address :jfaith(at)cemsys(dot)com

System Configuration
---------------------
Architecture (example: Intel Pentium) :Intel Pentium

Operating System (example: Linux 2.0.26 ELF): SCO OpenServer 5.0.5

PostgreSQL version (example: PostgreSQL-7.0): PostgreSQL-7.0

Compiler used (example: gcc 2.8.0) : egcs-2.91.66(egcs-1.1.2
release)

Please enter a FULL description of your problem:
------------------------------------------------
A daily 'vacuum verbose' cron failed, the end of the log follows:-
.
.
.
NOTICE: --Relation allcards--
NOTICE: Pages 1517: Changed 5, reaped 231, Empty 0, New 0; Tup 192248:
Vac 396, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 60, MaxLen 60;
Re-using: Free/Avail. Space 37656/35140; EndEmpty/Avail. Pages 0/230.
CPU 1.39s/1.31u sec.
NOTICE: Index alc1: Pages 1063; Tuples 192248: Deleted 396. CPU
0.40s/1.78u sec.
NOTICE: Rel allcards: Pages: 1517 --> 1514; Tuple(s) moved: 280. CPU
0.45s/0.59u sec.
NOTICE: Index alc1: Pages 1063; Tuples 192248: Deleted 280. CPU
0.31s/1.42u sec.
NOTICE: --Relation cards--
NOTICE: Pages 1307: Changed 188, reaped 1245, Empty 0, New 0; Tup
104774: Vac 16484, Keep/VTL 0/0, Crash 0, UnUsed 85, MinLen 72, MaxLen
88; Re-using: Free/Avail. Space 1496136/1491468; EndEmpty/Avail. Pages
0/1244. CPU 0.69s/1.73u sec.
NOTICE: Index crd4: Pages 518; Tuples 104774: Deleted 16484. CPU
0.23s/1.57u sec.
NOTICE: Index crd3: Pages 381; Tuples 104774: Deleted 16484. CPU
0.28s/1.72u sec.
NOTICE: Index crd2: Pages 785; Tuples 104774: Deleted 16484. CPU
0.57s/1.85u sec.
NOTICE: Index crd1: Pages 533; Tuples 104774: Deleted 16484. CPU
0.47s/1.72u sec.
NOTICE: Rel cards: Pages: 1307 --> 1121; Tuple(s) moved: 6159. CPU
26.80s/9.18u sec.
NOTICE: Index crd4: Pages 518; Tuples 104774: Deleted 6159. CPU
0.34s/0.95u sec.
NOTICE: Index crd3: Pages 381; Tuples 104774: Deleted 6159. CPU
0.17s/0.79u sec.
NOTICE: Index crd2: Pages 785; Tuples 104774: Deleted 6159. CPU
0.50s/0.98u sec.
NOTICE: Index crd1: Pages 533; Tuples 104774: Deleted 6159. CPU
0.43s/0.81u sec.
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
connection to server was lost

There are two programs(libpq) that access this DB.
prtcwri:-which receives audits from a master DB and produces SQL
statements(inserts/updates/deletes) to mirror the effects in the
Postgres DB. prtn_ncn:-that spawns a set of child processes(in this case
8) to service requests from serial ports. These are all selects except
for one update on
the cards table. On this machine about 18000 of these updates occur per
day.

The prtc_ncn process died when the cron ran. On restart it seemed to be
fine.

At the time the cron runs prtcwri would not have any activity and did
not die,
when audits came though later its log file indicated they were performed
correctly. However when I ran psql selects on cards did not show the new
data!

This machine is in production use so I am sorry but I was unable to
perform any more analysis. I reinitialised the DB(dropdb,createdb etc)
from the master(seperate machine running Informix-SE) and restarted my
processes. I disabled the cron and the machine has been fine for the
last few days.
2 other machines,identical software, at the same site are operating
normally but do not perform as many updates.

If it will be helpfull I can simulate most of these cicumstances on a
test machine here in an attempt to recreate the failure. What debugging
options should I use?

On a related theme but not a bug.

I've just looked at the code for vacuum. If I understand correctly it
locks each table in turn exclusively. This can take some time(in my case
about 5 minutes for cards) for each table. For my purposes I need the
cards table to be accessable 24x7 an exclusive lock for 1 or 2 seconds
would probably be ok
but certainly not minutes. As a work around(kludge) instead of doing a
vacuum
I am currently experimenting with the following:-

create table cards_new...
insert into cards_new select * from cards;
create indexes on new cards...(using an incrementing extension number)
alter table cards rename to cards_old;
alter table cards_new rename to cards;

So the cards table is only unavailable for a brief period while the 2
rename
statements run. I then replicate missed updates from cards_old into
cards
(there is a last_time field which makes this easy enough). If this
script
is run from prtcwri no other insert/delete/update statements run on the
DB.
It has the desired effect of keeping performance up and takes less time
than
a vacuum cards!.

I don't like this solution but I need something to keep the performance
of
the DB up. Most queries run very rapidly 10-100ms for a sequence at
least 4 selects but after a few days without a vacuum it gets more like
500ms average. I need this to be at worst 2s.
Q)Is this solution ok in principle or is there a fundamental flaw?
Both programs connect to the DB at start up(parent prtc_ncn diconnects
and forks each child process which then opens its own connection) and
only close
the connection when killed.

What I would most like to see is a non exclusive locking version of
vacuum.
Maybe 'vacuum lite' which might not be able to reclaim space but could
keep
performance up. I realise this is a complicated procedure at the very
heart
of the system. But I would be happy to work on this if it seems feasible
and would be acceptable to others. I also know that a non overwriting
storage manager may be available in the future along with WAL but I'm
looking for something soon.

A simple addition that might be handy would be the inclusion of actual
time
taken to perform each vacuum operation(I know other processes etc will
cause
this to vary heavily) as well as the already included rusage values.
This
would at least give an indication of how long a table is locked for.

Please describe a way to repeat the problem. Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------

If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

Responses

Re: postgres bug report at 2000-07-21 14:44:22 from Tom Lane

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Jan Wieck	2000-07-21 11:02:44	Re: [BUGS] problem with view and case - please help
Previous Message	Sezai Yilmaz	2000-07-21 10:35:53	Problem with Turkish localization