Some observations on very slow pg_restore operations

From: Ogden Brash <info(at)litika(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Some observations on very slow pg_restore operations
Date: 2019-10-03 20:30:30
Message-ID: CAFCR_K35nhd3FxEuDsFsu9rQDHEEYxeRm0J-KFDF+jVMODSJgQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I recently performed a pg_dump (data-only) of a relatively large database
where we store intermediate results of calculations. It is approximately 3
TB on disk and has about 20 billion rows.

We do the dump/restore about once a month and as the dataset has grown, the
restores have gotten very slow. So, this time I decided to do it a
different way and have some observations that puzzle me.

Background:

The data is extremely simple. The rows consist only of numbers and are all
fixed length. There are no foreign keys, constraints, null values, or
default values. There are no strings or arrays. There are 66 tables and the
number of rows in each table forms a gaussian distribution; so there are 3
tables which have about 3 billion rows each and the rest of the tables have
significantly fewer rows.

I used the directory format when doing the pg_dump. The compressed data of
the dump is 550 GB.

I am using: (PostgreSQL) 11.5 (Ubuntu 11.5-1.pgdg18.04+1)

The machine that I attempted to do a pg_restore to is a dedicated server
just for one instance of posgresql. It has 32 GB of memory and is running
Ubuntu 18.04 (headless). It physical hardware, not virtualized. Nothing
else runs on the machine and the postgresql.conf settings have been tuned
(to the best of my postgresql abilities which are suspect). While the
operating system is installed on an SSD, there is one extra large, fast HDD
that is dedicated to the posgresql server. It has been in use for this
particular purpose for a while and has not had performance issues. (Just
with pg_restore)

Autovacuum is off and all indexes have been deleted before the restore is
started. There is nothing in the db except for the empty data tables.

Restoring over the net:

In the past we have always restored in a way where the dumped data is read
over a gigabit connection while being restored to the local drive. But, the
last time we did it it took 2 days and I was looking for something faster.
So, I decided to copy the dumped directory to the local drive and restore
from the dump locally. I knew that because the machine only had one drive
that would fit the data, there would be some I/O contention, but I hoped
that it might not be as bad as reading over the network.

The pg_restore went unbearably slowly... after many hours it had written
less than 20GB to the database, so I started tracking it with iostat to see
what was going on. The following is iostat output every 60 seconds. I
tracked it for several hours and this is representative of what was
happening consistently.

avg-cpu: %user %nice %system %iowait %steal %idle
0.39 0.00 0.40 43.10 0.00 56.11

Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 0.00 0.00 0.00 0 0
loop1 0.00 0.00 0.00 0 0
loop2 0.00 0.00 0.00 0 0
sda 263.33 132.87 2990.93 7972 179456
sdb 0.17 0.00 0.73 0 44

avg-cpu: %user %nice %system %iowait %steal %idle
0.34 0.00 0.41 44.43 0.00 54.82

Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 0.00 0.00 0.00 0 0
loop1 0.00 0.00 0.00 0 0
loop2 0.00 0.00 0.00 0 0
sda 262.95 140.47 2983.00 8428 178980
sdb 0.08 0.00 0.40 0 24

While I was tracking this I started experimenting with the IO scheduler to
see if it had a noticable impact. I had been using cfq (ubuntu 18.04
default). Changing to deadline did not have a noticable difference.
Changing to noop made things much slower. I went back to cfq. I also
experimented with turning fsync off; that did speed things up a bit but not
enough for me to leave it off.

What puzzled me is that the OS was spending such a large percentage of time
in iowait, yet there was so little IO going on.

So, I decided to go back to restoring over the net. While the slow
pg_restore was still going on, and while I was still tracking iostat, I
copied the 550 GB dumps to an nfs drive. The copy happened pretty much at
full speed (limit being the gigabit ethernet) and interestingly, it did not
slow down kb_wrtn and kb_wrtn/s numbers in iostat (which was the postgresql
server continuing with the restore). To me that seemed to indicate that it
was not really a disk I/O limitation.

Restoring over the net:

After copying the dump files to an NFS drive, I stopped the restore,
truncated the tables and started exactly the same command, but this time
taking its input from the nfs drive. I did not reboot the machine or
restart the postgresql server. I tracked iostate every 60 seconds and this
is what it looks like:

avg-cpu: %user %nice %system %iowait %steal %idle
8.87 0.00 1.62 39.89 0.00 49.61

Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 0.00 0.00 0.00 0 0
loop1 0.00 0.00 0.00 0 0
loop2 0.00 0.00 0.00 0 0
sda 252.77 527.87 37837.47 31672 2270248
sdb 0.22 0.00 1.00 0 60

avg-cpu: %user %nice %system %iowait %steal %idle
8.57 0.00 2.21 35.26 0.00 53.97

Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 0.00 0.00 0.00 0 0
loop1 0.00 0.00 0.00 0 0
loop2 0.00 0.00 0.00 0 0
sda 236.10 465.27 54312.00 27916 3258720
sdb 0.08 0.00 0.40 0 24

Notice that the database is writing approximately 15 times as fast (and I
have verified that by tracking the size of the posgresql data directory
over time) while the number of i/o transactions per second has actually
dropped a little bit. It has now been running about 24 hours and has
maintained that speed.

My interpretation

At first sight this seems to me as being symptomatic of the pg_restore
process doing a huge number of very small input operations when reading
from the dump. If the proportion of input to output operations is the same
now as it was when trying to restore from the local drive, that implies
that the vast majority of i/o operations were inputs and not outputs.

However, I am not sure that even that would cause such a slowdown because
the compressed data files in the directory format dump correspond to the
tables and so there are 3 very large files that it starts with. So all of
these stats were gathered in the first 24 hours of the restore when it was
just restoring the first 3 tables (I have verbose on, so I know). Because
those files are gzipped, we know that they are being read sequentially and
because the machine has lots of memory we know that the OS has allocated a
lot of space to disk buffers and so even if postgresql was doing lots of
small reads, bouncing around between the 3 files, it would not hit the disk
that often.

Now that restore is happening 15 times faster when reading from an nfs
drive, I looked at the nfsiostat output for a while and it does not show
any indication of any untoward behavior:

192.168.17.146:/volume1/Backups mounted on /nas/Backups:

ops/s rpc bklog
27.000 0.000

read: ops/s kB/s kB/op retrans
avg RTT (ms) avg exe (ms)
27.000 3464.332 128.309 0 (0.0%)
13.500 13.511
write: ops/s kB/s kB/op retrans
avg RTT (ms) avg exe (ms)
0.000 0.000 0.000 0 (0.0%)
0.000 0.000

192.168.17.146:/volume1/Backups mounted on /nas/Backups:

ops/s rpc bklog
24.000 0.000

read: ops/s kB/s kB/op retrans
avg RTT (ms) avg exe (ms)
24.000 3079.406 128.309 0 (0.0%)
28.492 28.504
write: ops/s kB/s kB/op retrans
avg RTT (ms) avg exe (ms)
0.000 0.000 0.000 0 (0.0%)
0.000 0.000

The nubmer of operations per second (if those correspond to reads from
postgresql, which I do not know for a fact) does not seem high at all.

I actually do not have a great theory for what is going on but it might be
more obvious to someone who knows the postgresql implementation well. I
would love to hear any thoughts that would be helpful on how to get my
restores even faster.

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message MichaelDBA 2019-10-03 21:48:18 Re: Some observations on very slow pg_restore operations
Previous Message Michael Lewis 2019-10-03 17:55:22 Re: pg12 - partition by column that might have null values