Re: [pgsql-sql] Daily digest v1.2492 (19 messages)

From: Steve Midgley <public(at)misuse(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Cc: John Summerfield <postgres(at)herakles(dot)homelinux(dot)org>
Subject: Re: [pgsql-sql] Daily digest v1.2492 (19 messages)
Date: 2007-04-03 18:35:37
Message-ID: 20070403183618.60EB89FB486@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Hi John,

It sounds like a disk-bound operation, so cpu is not maxed out. I'm not
clear on all the details of your operation but it sounds like you're
using Java to do row-by-row based inserts, selects and updates within a
transaction, from a file. This can be a very slow process if you have
many rows. The OS stats you describe fits that theory (but not
conclusively).

If you are using (psuedo-)code such as:

Open file {
Read line {
select from Pg: "select from [other_table] where val =
[line[colN]]"
exec to Pg: "insert into [table] (col1, col2, ...) values
(line[col1], line[col2]..."
}
}

You can radically speed up such a system by using the "copy"
(http://www.postgresql.org/docs/8.2/interactive/sql-copy.html) command
to load all the data at once from the file into Pg and then do
post-processing with Java/SQL to get all the fields looking right.
Doing a bulk update with a join across several tables is so much faster
than looping through them with a wrapper in Java (or other lang) you
won't believe it.

I hope this helps and is on-topic for you.

Steve

At 09:38 AM 4/3/2007, pgsql-sql-owner(at)postgresql(dot)org wrote:
>Date: Tue, 03 Apr 2007 22:16:13 +0800
>From: John Summerfield <postgres(at)herakles(dot)homelinux(dot)org>
>To: pgsql-sql(at)postgresql(dot)org
>Subject: A long-running transaction
>Message-ID: <461261AD(dot)6060108(at)herakles(dot)homelinux(dot)org>
>
>I have a Java (java 1.1) program that I wrote some years ago, to read
>records from a text file and insert it into a ostgresql database.
>
>One of the assumptions I made was that one file contained one day's
>data, maybe as many as 1500 records, and I coded it to do the whole
>lot
>as one transaction so either a single file was loaded in its entirity,
>
>or none of its data was.
>
>I lost the Java code, but revived the idea and I've collected about
>two
>years' data using (Linux) shell scripts, and loading the data using
>psql.
>
>Then, I found the Java code on a disused hard disk:-)
>
>I made the necessary changes for it to build in java 1.5, and used
>psql
>to extract data from my new database in the correct format for the old
>
>program. This time, I have a little more data than I ever loaded at
>once
>before:
>summer(at)Bandicoot:~$ wc -l testdata
>6242217 testdata
>summer(at)Bandicoot:~$ \ls -hl testdata
>-rw-r--r-- 1 summer summer 285M 2007-03-28 22:32 testdata
>summer(at)Bandicoot:~$
>
>Now, it wouldn't surprise me if postgresql used lots of memory - but
>how
>much could it possibly need? My laptop, where I first tried this, has
>1.25 Gbytes, so I could allow it some.
>
>It wouldn't surprise me a lot if it used lots of memory and caused all
>
>sorts of problems, but no, it's chugging away, still using no more RAM
>
>than it could have had on my old Pentium 133 all those years ago.
>
>In the process of checking it out, I've set it running on a machine
>with
>a AMD Sempron(tm) 2400+ running Kubuntu 6.10 (kernel is
>2.6.17-6-server-xen0) and 512 Mbytes of RAM.
>
>This is the java program:-)
>summer pts/6 :0.0 Thu20 5days 1:07 1:07
>/usr/bin/gij-4.1 -cp /usr/s
>It's been running five days so far, and I can see where it's up to by
>attaching strace. It's reading 2k of the input file every few seconds.
>
>Okay, clearly something's wrong, and I don't think it's all my crddu
>code.
>No probs swapping:
>summer(at)Bandicoot:~$ free
> total used free shared buffers
> cached
>Mem: 460800 456472 4328 0 860
>262164
>-/+ buffers/cache: 193448 267352
>Swap: 1461872 284 1461588
>summer(at)Bandicoot:~$
>
>It is hitting the disk pretty hard now on this machine, but the
>laptop's
>still going too, and the disk seems to run about half the time, part
>of
>a second running, part idle (but the intervals are getting shorter).
>
>It struck me as fairly curious that neither postgresql nor the
>application was hogging the CPU.
>
>Perhaps the laptop is more interesting: look at the size of the buffer
>pool:
>summer(at)Echidna:~> free
> total used free shared buffers
> cached
>Mem: 1295528 1268548 26980 0 3976
>392388
>-/+ buffers/cache: 872184 423344
>Swap: 1941496 32656 1908840
>summer(at)Echidna:~>
>Again, no problem with over-use of RAM, and I'm logged on using KDE
>too
>and that's running fine.
>
>It's been running a little longer here:
>summer pts/23 28Mar07 5days 25:12 25:11 java -cp
>/home/summer/Classes/:/usr/share/p
>
>This is Sun's Java 1.5 on OpenSUSE 10.2.
>
>
>This is what suggested I should write:
>summer(at)Echidna:~> procinfo
>Linux 2.6.18.8-0.1-default (geeko(at)buildhost) (gcc 4.1.2 20061115) #1
>1CPU [Echidna.]
>
>Memory: Total Used Free Shared Buffers
>Mem: 1295528 1271720 23808 0 3716
>Swap: 1941496 32656 1908840
>
>Bootup: Tue Mar 27 18:50:19 2007 Load average: 2.21 2.65 2.69 2/243
>19305
>
>user : 1d 3:17:04.03 16.0% page in : 131097310 disk 1:
>3079516r20087664w
>nice : 0:05:39.64 0.1% page out: 197016649
>system: 2d 20:38:37.13 40.1% page act: 87906251
>IOwait: 2d 0:46:37.33 28.5% page dea: 16218135
>hw irq: 0:44:46.71 0.4% page flt: 306255213
>sw irq: 0:50:04.69 0.5% swap in : 4026
>idle : 1d 0:36:29.73 14.4% swap out: 9552
>uptime: 7d 2:59:20.97 context : 702502442
>
>irq 0: 153880209 timer irq 7: 0 parport0
>irq 1: 69402 i8042 irq 8: 2 rtc
>irq 2: 0 cascade [4] irq 9: 1696942 acpi
>irq 3: 4 irq 10: 1
>irq 4: 4 irq 11: 71842329
>ehci_hcd:usb1,
>uhci_
>irq 5: 28545863 Intel 82801DB-ICH4 irq 12: 467432 i8042
>irq 6: 1 irq 14: 25021586 ide0
>
>summer(at)Echidna:~>
>
>
>Look at that line beginning "System:" two days 20 hours in the Linux
>kernel. It's my guess that the Linux kernel is spending a great deal
>of
>time manipulating that buffer pool.
>
>This shows postgresql taking 60% CPU:
>summer(at)Echidna:~> ps xaru
>USER PID %CPU %MEM VSZ RSS TTY STAT START TIME
>COMMAND
>summer 20237 0.2 1.2 263716 15988 pts/23 Rl+ Mar28 25:11 java
>-cp /home/summer/Class
>postgres 19321 0.1 0.3 19844 3984 ? D 21:50 0:00
>postgres: summer stocksshar
>summer 19344 0.0 0.0 2484 852 pts/25 R+ 21:52 0:00 ps
>xaru
>summer(at)Echidna:~> ps xaru
>USER PID %CPU %MEM VSZ RSS TTY STAT START TIME
>COMMAND
>postgres 20248 60.0 0.8 20760 11536 ? D Mar28 5167:27
>postgres: summer sharetrad
>postgres 19321 0.1 0.3 19844 4008 ? D 21:50 0:00
>postgres: summer stocksshar
>summer 19348 0.0 0.0 2480 848 pts/25 R+ 21:52 0:00 ps
>xaru
>summer(at)Echidna:~>
>
>Note that pid 20248 is the one, the other has nothing to do with the
>problem.
>
>This is the SQL I used to create the table:
> String createTableStatement = "create table TJS0102_trades ("
> + " trade date"
> + ", ttime int2"
> + ", ASX varchar(7)"
> + ", thigh int4"
> + ", tlow int4"
> + ", tclose int4"
> + ", topen int4"
> + ", tvolume int4"
> + ", tvalue int4"
>// + ")"
> + ", unique(ASX,trade,ttime)"
> + ");";
>
>
>For each record, I update a non-key field in another table; the source
>
>data for that other table is less than a megabyte.
>
>
>I have changed the program so as to load each day's data as a single
>transaction; it ran on a Pentium IV 2.8 Ghz with HT, 512 Mbytes in
>about
>three and an half hours. This is more-or-less what I expected.
>
>If you agree with me that postgresql should do better, now is a good
>time to ask for this as a test case.
>
>Note, you could also get suitable test data from float.com.au - it's
>not
>where mine came from, but it should be substantially the same.

Browse pgsql-sql by date

  From Date Subject
Next Message A. Kretschmer 2007-04-03 19:24:19 Re: plpgsql function question
Previous Message Karthikeyan Sundaram 2007-04-03 18:33:39 Re: plpgsql function question