Re: design for parallel backup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-04-20 21:10:18
Message-ID: 20200420211018.w2qphw4yybcbxksl@alap3.anarazel.de

Hi,

On 2020-04-20 16:36:16 -0400, Robert Haas wrote:
> My suspicion is that it has mostly to do with adequately utilizing the
> hardware resources on the server side. If you are network-constrained,
> adding more connections won't help, unless there's something shaping
> the traffic which can be gamed by having multiple connections.
> However, as things stand today, at any given point in time the base
> backup code on the server will EITHER be attempting a single
> filesystem I/O or a single network I/O, and likewise for the client.

Well, kinda, but not really. Both file reads (server) / writes (client)
and network send (server) / recv (client) are buffered by the OS, and
the file IO is entirely sequential.

That's not true to the same degree for checksum computation /
compression. They're largely bottlenecked in userland, without the
kernel doing much of the work asynchronously.

> If a backup client - either current or hypothetical - is compressing
> and encrypting, then it doesn't have either a filesystem I/O or a
> network I/O in progress while it's doing so. You take not only the hit
> of the time required for compression and/or encryption, but also use
> that much less of the available network and/or I/O capacity.

I don't think it's really the time for network/file I/O that's the
issue. Sure, memcpy()'ing from the kernel takes time, but compared to
encryption/compression it's not that much. Especially for compression,
it's not a lack of cycles for networking that prevents higher
throughput: after buffering a few MB there's just no point buffering
more, given that compression will plod along at 20-100MB/s. At that
rate a single compressor can't even saturate a 1GbE link (~120MB/s).

> While I agree that some of these problems could likely be addressed in
> other ways, parallelism seems to offer an approach that could solve
> multiple issues at the same time. If you want to address it without
> that, you need asynchronous filesystem I/O and asynchronous network
> I/O and both of those on both the client and server side, plus
> multithreaded compression and multithreaded encryption and maybe some
> other things. That sounds pretty hairy and hard to get right.

I'm not really convinced. You're complicating the wire protocol by
having multiple tar files with overlapping contents, with the
consequence that clients need additional logic to deal with that. We
won't get one manifest, but multiple ones, etc.

We already do network IO non-blocking, and since we leave the copying
to the kernel, the kernel performs the actual network work
asynchronously. Except at file boundaries, the kernel also does
asynchronous read IO for us (though we should probably hint it to start
readahead right at the beginning of each new file - see the sketch
below).
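Roughly along these lines - just a sketch, not the actual basebackup
code, and the 128kB prefetch size is an arbitrary placeholder:

#include <fcntl.h>

static int
open_for_streaming(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd >= 0)
	{
		/* widen the readahead window for strictly sequential access */
		(void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
		/* and ask for the first chunk to be fetched right away */
		(void) posix_fadvise(fd, 0, 128 * 1024, POSIX_FADV_WILLNEED);
	}
	return fd;
}

That way the kernel can start filling the page cache while we're still
busy sending the tail of the previous file.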

I think we're quite a bit away from where we need to worry about making
encryption multi-threaded:
andres(at)awork3:~/src/postgresql$ openssl speed -evp aes-256-ctr
Doing aes-256-ctr for 3s on 16 size blocks: 81878709 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 64 size blocks: 71062203 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 256 size blocks: 31738391 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 1024 size blocks: 10043519 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 8192 size blocks: 1346933 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 16384 size blocks: 674680 aes-256-ctr's in 3.00s
OpenSSL 1.1.1f 31 Mar 2020
built on: Tue Mar 31 21:59:59 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-hsg853/openssl-1.1.1f=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-ctr 436686.45k 1515993.66k 2708342.70k 3428187.82k 3678025.05k 3684652.37k

So that really just leaves compression (and perhaps cryptographic
checksumming) - the AES numbers above work out to ~3.6GB/s on a single
core at realistic block sizes, well above what we can push over most
networks. Given that we can provide nearly all of the benefits of
multi-stream parallelism in a compatible way by using
parallelism/threads at the compression level, I just have a hard time
believing that the complexity of doing those tasks in parallel is
greater than that of multi-stream parallelism. And I'd be fairly
unsurprised if you ended up with a lot more "bubbles" in the pipeline
when using multi-stream parallelism.

Greetings,

Andres Freund
