Very big insert/join performance problem (bacula)

From: Marc Cousin <mcousin(at)sigma(dot)fr>
To: pgsql-performance(at)postgresql(dot)org
Subject: Very big insert/join performance problem (bacula)
Date: 2009-07-13 13:40:18
Message-ID: 200907131540.18679.mcousin@sigma.fr
Lists: pgsql-performance

Hi,

I'm trying to solve big performance issues with PostgreSQL + bacula while
inserting very big sets of records.

I'm sorry, this email will be a long one: I've already spent quite a lot of
time on the issue, I don't want to waste your time by having you speculate on
things I may already have done, and the problem is (or seems to me) a bit
complex. The other problem is that I don't have the explain plans to provide
with this email right now. I'll try to use this issue as an argument to push
8.4 into this setup, so that I can dump all these plans with auto_explain
(the queries run on temporary tables, so they are a bit tricky to capture by
hand).
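
(For when I get to 8.4: the auto_explain setup I have in mind is roughly the
one below, written from the contrib documentation as I understand it, so treat
the parameter values as guesses rather than a tested configuration.)

# in postgresql.conf, load contrib/auto_explain for every backend
shared_preload_libraries = 'auto_explain'
# only log the plans of statements that run for more than a minute
auto_explain.log_min_duration = '60s'
# skip EXPLAIN ANALYZE instrumentation to avoid overhead on the huge inserts
auto_explain.log_analyze = off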

Let me first explain (or recall) how this works. Bacula is a backup solution,
and at the end of each backup it inserts the backup's metadata (file name,
directory name, size, etc.) into the database.
For what we are interested in here, there are 3 tables:
- file
- filename
- path

file is the one containing the most records; it holds the real metadata.
filename and path just contain an id and the actual file or directory name
(to save some space, since names are highly redundant).

Before explaining the issue, here is some sizing information:

file is 1.1 billion records for 280 GB (with indexes).

   Column   |  Type   |                       Modifiers
------------+---------+--------------------------------------------------------
 fileid     | bigint  | not null default nextval('file_fileid_seq'::regclass)
 fileindex  | integer | not null default 0
 jobid      | integer | not null
 pathid     | integer | not null
 filenameid | integer | not null
 markid     | integer | not null default 0
 lstat      | text    | not null
 md5        | text    | not null
Indexes:
    "file_pkey" UNIQUE, btree (fileid)
    "file_fp_idx" btree (filenameid, pathid)
    "file_jpfid_idx" btree (jobid, pathid, filenameid)

path is 17 million records for 6 GB

 Column |  Type   |                       Modifiers
--------+---------+--------------------------------------------------------
 pathid | integer | not null default nextval('path_pathid_seq'::regclass)
 path   | text    | not null
Indexes:
    "path_pkey" PRIMARY KEY, btree (pathid)
    "path_name_idx" UNIQUE, btree (path)

filename is 80 million records for 13 GB

   Column   |  Type   |                           Modifiers
------------+---------+----------------------------------------------------------------
 filenameid | integer | not null default nextval('filename_filenameid_seq'::regclass)
 name       | text    | not null
Indexes:
    "filename_pkey" PRIMARY KEY, btree (filenameid)
    "filename_name_idx" UNIQUE, btree (name)

Several queries are run when each job despools its metadata:

First we fill a temp table (batch) with the raw data (file name, path name,
metadata) using COPY (no problem here).
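
(For reference, the batch temp table looks roughly like this; I'm writing it
from memory, so Bacula's exact definition and COPY invocation may differ in
detail.)

-- rough sketch of the spool table filled at the end of each job
CREATE TEMPORARY TABLE batch (
    fileindex integer,
    jobid     integer,
    path      text,
    name      text,
    lstat     text,
    md5       text
);
-- the daemon then streams the spooled attributes into it
COPY batch FROM STDIN;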

Then we insert the missing path names into path and the missing file names
into filename, with this kind of query (shown here for path; the filename one
is right below it):

INSERT INTO Path (Path)
SELECT a.Path FROM (SELECT DISTINCT Path FROM batch) AS a
WHERE NOT EXISTS (SELECT Path FROM Path WHERE Path = a.Path)
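
(The filename one should be the symmetrical query; I'm reproducing it from
memory, so it may differ in some detail:)

INSERT INTO Filename (Name)
SELECT a.Name FROM (SELECT DISTINCT Name FROM batch) AS a
WHERE NOT EXISTS (SELECT Name FROM Filename WHERE Name = a.Name)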

These do nested loops and work very well (after a sort on batch to get rid of
the duplicates). They are reasonably fast (what one would expect when looping
over millions of records): they do their job in a few minutes.

The problem occurs with the final query, which inserts the data into file,
joining the temp table to both path and filename:

INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5)
SELECT batch.FileIndex,
       batch.JobId,
       Path.PathId,
       Filename.FilenameId,
       batch.LStat,
       batch.MD5
FROM batch
JOIN Path ON (batch.Path = Path.Path)
JOIN Filename ON (batch.Name = Filename.Name)

This query has a split personality, depending on how many records are in
batch.
For small batch tables, it does nested loops.
For big batch tables (more than around one million records, initially), it
decides to hash join path to batch (still OK, path is reasonably small) and
then filename to batch before starting. And that's when the problems begin.
The behaviour seems logical to me: it should switch to a hash join when batch
gets bigger, but the switch seems to happen much too early here, considering
the size of filename.

First of all, performance remains much better with nested loops, except for
extremely big batches (I'd say over 30 million records, extrapolating from the
times I'm seeing with 10 million), so if I disable hash/merge joins I get my
performance back on these queries (they execute in around the same time as the
searches in path and filename above). So I found a way to make most of my
queries do nested loops (I'll come back to this later).
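
Concretely, by "disabling hash/merge joins" I mean nothing more exotic than the
usual planner switches, e.g. set in the session that runs the insert (a sketch,
not necessarily exactly how it will end up being done for Bacula):

SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- ... the big INSERT INTO File ... SELECT ... runs here ...
RESET enable_hashjoin;
RESET enable_mergejoin;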

Second, if there is more than one of these big sorts running at the same time,
performance degrades drastically (we had 2 of them this weekend, and they both
took 24 hours to complete). This is probably due to our rather poor disk setup
(we didn't have a big budget for it). There was no swapping at the Linux level.

So all of this makes me think there is a cost evaluation problem in this
setup: with the default values, PostgreSQL seems to underestimate the cost of
sorting here (the row estimates were good, no problem with that).
PostgreSQL seems to think that at around 1 million records to insert into file
it should go with a hash join on filename and path, so we go on hashing the
17 million records of path and the 80 million of filename, then joining and
inserting into file (we're talking about sorting around 15 GB for each of
these despools, in parallel).

Temporarily, I pushed the problem back to somewhat bigger batch sizes by
changing random_page_cost to 0.02 and seq_page_cost to 0.01, but I feel like
the sorcerer's apprentice with this, as I've told PostgreSQL that fetching rows
from disk is much cheaper than it really is. These values are, I think,
completely abnormal. Doing this, I got the change of plan at around 8 million
records. And we had 2 batches of around 9 million at the same time this
weekend; both of them took 24 hours, while the nested loops before the join
(for the inserts into path and filename) did their work in minutes...
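
(For the record, that means I am currently running with something like this in
postgresql.conf, which I consider abnormal:)

seq_page_cost = 0.01      # default is 1.0
random_page_cost = 0.02   # default is 4.0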

So, finally, to my questions:
- Is it normal that PostgreSQL is this far off base on these queries? (Sorry,
I don't have the plans; if they are required I'll do my best to get some, but
they really are the two obvious plans for this kind of query.) What could make
it choose the hash join for batch tables that are still too small for it?
- Is changing the 2 costs the way to go?
- Is there a way to tell PostgreSQL that it's more costly to sort than it
thinks (instead of telling it that fetching data from disk costs next to
nothing)? See the illustrative settings just below for what I mean.
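
(To illustrate that last question: the only generic knobs I can think of that
push in that direction are the CPU cost parameters, with purely illustrative
values below, but raising them penalises every CPU-bound node, not just the
sorts and hashes, so I have no idea whether this is the right lever, which is
why I'm asking.)

SET cpu_tuple_cost = 0.03;      -- default 0.01
SET cpu_operator_cost = 0.01;   -- default 0.0025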

Here are the other non-default values from my configuration:

shared_buffers = 2GB
work_mem = 64MB
maintenance_work_mem = 256MB
max_fsm_pages = 15000000 # There are quite big deletes with bacula ...
effective_cache_size = 800MB
default_statistics_target = 1000

PostgreSQL is 8.3.5 on Debian Lenny

I'm sorry for this very long email. I tried to be as precise as I could, but
don't hesitate to ask me for more details.

Thanks for helping.

Marc Cousin
