A long-running transaction

From: John Summerfield <postgres(at)herakles(dot)homelinux(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Subject: A long-running transaction
Date: 2007-04-03 14:16:13
Message-ID: 461261AD.6060108@herakles.homelinux.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I have a Java (java 1.1) program that I wrote some years ago, to read
records from a text file and insert it into a ostgresql database.

One of the assumptions I made was that one file contained one day's
data, maybe as many as 1500 records, and I coded it to do the whole lot
as one transaction so either a single file was loaded in its entirity,
or none of its data was.

I lost the Java code, but revived the idea and I've collected about two
years' data using (Linux) shell scripts, and loading the data using psql.

Then, I found the Java code on a disused hard disk:-)

I made the necessary changes for it to build in java 1.5, and used psql
to extract data from my new database in the correct format for the old
program. This time, I have a little more data than I ever loaded at once
before:
summer(at)Bandicoot:~$ wc -l testdata
6242217 testdata
summer(at)Bandicoot:~$ \ls -hl testdata
-rw-r--r-- 1 summer summer 285M 2007-03-28 22:32 testdata
summer(at)Bandicoot:~$

Now, it wouldn't surprise me if postgresql used lots of memory - but how
much could it possibly need? My laptop, where I first tried this, has
1.25 Gbytes, so I could allow it some.

It wouldn't surprise me a lot if it used lots of memory and caused all
sorts of problems, but no, it's chugging away, still using no more RAM
than it could have had on my old Pentium 133 all those years ago.

In the process of checking it out, I've set it running on a machine with
a AMD Sempron(tm) 2400+ running Kubuntu 6.10 (kernel is
2.6.17-6-server-xen0) and 512 Mbytes of RAM.

This is the java program:-)
summer pts/6 :0.0 Thu20 5days 1:07 1:07
/usr/bin/gij-4.1 -cp /usr/s
It's been running five days so far, and I can see where it's up to by
attaching strace. It's reading 2k of the input file every few seconds.

Okay, clearly something's wrong, and I don't think it's all my crddu code.
No probs swapping:
summer(at)Bandicoot:~$ free
total used free shared buffers cached
Mem: 460800 456472 4328 0 860 262164
-/+ buffers/cache: 193448 267352
Swap: 1461872 284 1461588
summer(at)Bandicoot:~$

It is hitting the disk pretty hard now on this machine, but the laptop's
still going too, and the disk seems to run about half the time, part of
a second running, part idle (but the intervals are getting shorter).

It struck me as fairly curious that neither postgresql nor the
application was hogging the CPU.

Perhaps the laptop is more interesting: look at the size of the buffer pool:
summer(at)Echidna:~> free
total used free shared buffers cached
Mem: 1295528 1268548 26980 0 3976 392388
-/+ buffers/cache: 872184 423344
Swap: 1941496 32656 1908840
summer(at)Echidna:~>
Again, no problem with over-use of RAM, and I'm logged on using KDE too
and that's running fine.

It's been running a little longer here:
summer pts/23 28Mar07 5days 25:12 25:11 java -cp
/home/summer/Classes/:/usr/share/p

This is Sun's Java 1.5 on OpenSUSE 10.2.

This is what suggested I should write:
summer(at)Echidna:~> procinfo
Linux 2.6.18.8-0.1-default (geeko(at)buildhost) (gcc 4.1.2 20061115) #1
1CPU [Echidna.]

Memory: Total Used Free Shared Buffers
Mem: 1295528 1271720 23808 0 3716
Swap: 1941496 32656 1908840

Bootup: Tue Mar 27 18:50:19 2007 Load average: 2.21 2.65 2.69 2/243 19305

user : 1d 3:17:04.03 16.0% page in : 131097310 disk 1:
3079516r20087664w
nice : 0:05:39.64 0.1% page out: 197016649
system: 2d 20:38:37.13 40.1% page act: 87906251
IOwait: 2d 0:46:37.33 28.5% page dea: 16218135
hw irq: 0:44:46.71 0.4% page flt: 306255213
sw irq: 0:50:04.69 0.5% swap in : 4026
idle : 1d 0:36:29.73 14.4% swap out: 9552
uptime: 7d 2:59:20.97 context : 702502442

irq 0: 153880209 timer irq 7: 0 parport0
irq 1: 69402 i8042 irq 8: 2 rtc
irq 2: 0 cascade [4] irq 9: 1696942 acpi
irq 3: 4 irq 10: 1
irq 4: 4 irq 11: 71842329 ehci_hcd:usb1,
uhci_
irq 5: 28545863 Intel 82801DB-ICH4 irq 12: 467432 i8042
irq 6: 1 irq 14: 25021586 ide0

summer(at)Echidna:~>

Look at that line beginning "System:" two days 20 hours in the Linux
kernel. It's my guess that the Linux kernel is spending a great deal of
time manipulating that buffer pool.

This shows postgresql taking 60% CPU:
summer(at)Echidna:~> ps xaru
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
summer 20237 0.2 1.2 263716 15988 pts/23 Rl+ Mar28 25:11 java
-cp /home/summer/Class
postgres 19321 0.1 0.3 19844 3984 ? D 21:50 0:00
postgres: summer stocksshar
summer 19344 0.0 0.0 2484 852 pts/25 R+ 21:52 0:00 ps xaru
summer(at)Echidna:~> ps xaru
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 20248 60.0 0.8 20760 11536 ? D Mar28 5167:27
postgres: summer sharetrad
postgres 19321 0.1 0.3 19844 4008 ? D 21:50 0:00
postgres: summer stocksshar
summer 19348 0.0 0.0 2480 848 pts/25 R+ 21:52 0:00 ps xaru
summer(at)Echidna:~>

Note that pid 20248 is the one, the other has nothing to do with the
problem.

This is the SQL I used to create the table:
String createTableStatement = "create table TJS0102_trades ("
+ " trade date"
+ ", ttime int2"
+ ", ASX varchar(7)"
+ ", thigh int4"
+ ", tlow int4"
+ ", tclose int4"
+ ", topen int4"
+ ", tvolume int4"
+ ", tvalue int4"
// + ")"
+ ", unique(ASX,trade,ttime)"
+ ");";

For each record, I update a non-key field in another table; the source
data for that other table is less than a megabyte.

I have changed the program so as to load each day's data as a single
transaction; it ran on a Pentium IV 2.8 Ghz with HT, 512 Mbytes in about
three and an half hours. This is more-or-less what I expected.

If you agree with me that postgresql should do better, now is a good
time to ask for this as a test case.

Note, you could also get suitable test data from float.com.au - it's not
where mine came from, but it should be substantially the same.

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Scott Marlowe 2007-04-03 14:17:19 Re: Serial
Previous Message Hilary Forbes 2007-04-03 14:14:22 Re: Using a variable as a view name in a select