psql large RSS (1.6GB)

From: TTK Ciar <ttk2(at)hardpoint(dot)ciar(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: psql large RSS (1.6GB)
Date: 2004-10-27 07:57:48
Message-ID: 20041027075748.GA8355@hardpoint.ciar.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello!

My name is TTK, and I'm a software engineer at the Internet Archive's
Data Repository department. We have recently started using postgresql
for a couple of projects (we have historically been a MySQL outfit),
and today my co-worker noticed psql eating memory like mad when invoked
with a simple select statement incorporating a join of two tables.

The platform is a heavily modified RedHat 7.3 Linux. We are using
version 7.4.5 of postgresql.

The invocation was via sh script:

#!/bin/bash

outfile=$1
if [ -z "$outfile" ]; then
outfile=/0/brad/all_arcs.txt
fi

/usr/lib/postgresql/bin/psql -c 'select ServerDisks.servername,ServerDisks.diskserial,ServerDisks.diskmountpoint,DiskFiles.name,DiskFiles.md5 from DiskFiles,ServerDisks where DiskFiles.diskserial=ServerDisks.diskserial;' -F ' ' -A -t -o $outfile

.. and the tables in question are somewhat large (hundreds of GB's
of data), though we didn't expect that to be an issue as far as the
psql process was concerned.

We monitored server load via 'top -i -d 0.5' and watched the output
file for data. Over the course of about 200 seconds, psql's RSS
climbed to about 1.6 GB, and stayed there, while no data was written
to the output file. Eventually 10133194 lines were written to the
output file, all at once, about 1.2GB's worth of data.

I re-ran the select query using psql in interactive mode, and saw
the same results.

I re-ran it again, using "explain analyse", and this time psql's
RSS did *not* increase significantly. The result is here, if it
helps:

brad=# explain analyse select ServerDisks.servername,ServerDisks.diskserial,ServerDisks.diskmountpoint,DiskFiles.name,DiskFiles.md5 from DiskFiles,ServerDisks where DiskFiles.diskserial=ServerDisks.diskserial;
QUERY PLAN
------------------------------------------------------------------
Hash Join (cost=22.50..65.00 rows=1000 width=274) (actual time=118.584..124653.729 rows=10133349 loops=1)
Hash Cond: (("outer".diskserial)::text = ("inner".diskserial)::text)
-> Seq Scan on diskfiles (cost=0.00..20.00 rows=1000 width=198) (actual time=7.201..31336.063 rows=10133349 loops=1)
-> Hash (cost=20.00..20.00 rows=1000 width=158) (actual time=90.821..90.821 rows=0 loops=1)
-> Seq Scan on serverdisks (cost=0.00..20.00 rows=1000 width=158) (actual time=9.985..87.364 rows=2280 loops=1)
Total runtime: 130944.586 ms

At a guess, it looks like the data set is being buffered in its
entirety by psql, before any data is written to the output file,
which is surprising. I would have expected it to grab data as it
appeared on the socket from postmaster and write it to disk. Is
there something we can do to stop psql from buffering results?
Does anyone know what's going on here?

If the solution is to just write a little client that uses perl
DBI to fetch rows one at a time and write them out, that's doable,
but it would be nice if psql could be made to "just work" without
the monster RSS.

I'd appreciate any feedback. If you need any additional info,
please let me know and I will provide it.

-- TTK
ttk2(at)ciar(dot)org
ttk(at)archive(dot)org

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message D'Arcy J.M. Cain 2004-10-27 11:23:46 Re: Free PostgreSQL Training, Philadelphia, Oct 30
Previous Message Matt Clark 2004-10-27 06:09:21 Re: can't handle large number of INSERT/UPDATEs