| From: | solai v <solai(dot)cdac(at)gmail(dot)com> |
|---|---|
| To: | Nitin Motiani <nitinmotiani(at)google(dot)com> |
| Cc: | Hannu Krosing <hannuk(at)google(dot)com>, Mahendra Singh Thalor <mahi6run(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Adding pg_dump flag for parallel export to pipes |
| Date: | 2026-05-22 10:34:23 |
| Message-ID: | CAF0whuc-zU-H4zPxERH7onw00xGUYaj6ZmUbOfvZWNusb9EtNg@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi all,
Thank you for the updated patch.
On Fri, May 22, 2026 at 1:03 PM Nitin Motiani <nitinmotiani(at)google(dot)com> wrote:
>
> Changed how pipe commands are quoted in the Windows test. The latest
> versions are attached.
I worked on reproducing the current limitation around parallel dumps
and then tested the latest v16 patch adding --pipe support for
pg_dump. To begin with, I verified the existing behavior.
For example:
pg_dump postgres | gzip > dump.sql.gz
works, but does not support parallelism, whereas:
pg_dump -Fd -j 4 -f dumpdir postgres
du -sh dumpdir
21M dumpdir
requires intermediate disk storage. This demonstrates the current
limitation where users must choose between parallelism and streaming
pipelines.
I then tested the patch introducing --pipe support. The feature is
quite useful for modern workflows where users want to stream dump
output directly to compression or upload pipelines without relying on
intermediate storage. Basic functionality worked as expected.
For example:
pg_dump -p 55432 -Fd -j 4 --pipe="cat > dump.out" postgres
produced a ~38MB output file, and:
pg_dump -p 55432 -Fd -j 4 --pipe="gzip > dump.gz" postgres
produced a compressed file (~11MB).
The initial contents appeared valid:
gunzip -c dump.gz | head
1
2
3
...
Also, no intermediate directory was created, confirming that the patch
enables streaming without filesystem-backed staging. Error handling
also behaved correctly.
For example:
--pipe="invalid_cmd"
resulted in:
pg_dump: error: pipe command failed: command not found
and:
--pipe="gzip | false"
resulted in:
pg_dump: error: pipe command failed: child process exited with exit code 1
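The exit code 1 in the second case is consistent with ordinary shell semantics, where a pipeline's status is that of its last command. A minimal sketch in plain shell, with no pg_dump involved; I am assuming here that the pipe command is handed to a POSIX shell, which I have not confirmed in the patch, and the printf payload is only a stand-in for dump data:

```shell
#!/bin/sh
# Illustration only: feed some bytes into the same command string used above.
# Assumption: the command is run through a POSIX shell such as sh -c.
status=0
printf 'stand-in dump data\n' | sh -c 'gzip | false' > /dev/null || status=$?

# The status of "gzip | false" is the status of false, the last command
# in the pipeline, regardless of whether gzip itself succeeded.
echo "exit status: $status"
```

This prints an exit status of 1, matching the code in the error message above.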
However, I observed an important issue when using the feature with
multiple parallel workers. Since the pipe command is executed once per
output file, --pipe="gzip > dump.gz" causes multiple workers to start
independent gzip processes that all write to the same output file.
This leads to corrupted or truncated output.
In my testing:
gunzip -c dump.gz > dump.sql
failed with:
gzip: dump.gz: unexpected end of file
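For contrast, here is a minimal sketch in plain shell (no pg_dump involved) of the same fan-out where each writer has its own target. The four background jobs stand in for parallel workers, seq stands in for table data, and the part_$i.gz names are hypothetical; I am not assuming the patch offers any per-file naming mechanism.

```shell
#!/bin/sh
# Illustration only: four concurrent writers, each compressing its own
# stream into its own file, so no two gzip processes share a target.
workdir=$(mktemp -d)
for i in 1 2 3 4; do
    # seq is a stand-in for one worker's output (hypothetical payload)
    seq 1 100000 | gzip > "$workdir/part_$i.gz" &
done
wait  # block until every background writer has finished

# gzip -t checks the integrity of each archive without extracting it
ok=0
for f in "$workdir"/part_*.gz; do
    gzip -t "$f" && ok=$((ok + 1))
done
echo "valid archives: $ok"
```

With distinct targets, all four archives pass gzip -t, unlike the shared dump.gz case above.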
This suggests that concurrent writes to a shared output target are not
coordinated and can result in invalid dumps. It would be helpful to
clarify the expected usage pattern here: are users expected to
generate distinct outputs per worker, or should safeguards be
implemented to prevent multiple workers from writing to the same
destination?
Additionally, during failure scenarios I observed backend logs such as:
FATAL: connection to client lost
Broken pipe
While this is expected when the pipe terminates prematurely, it may be
worth considering whether error messaging or cleanup behavior can be
made clearer from the user perspective.
Overall, the feature is valuable and aligns well with modern backup
workflows. However, behavior in multi-worker scenarios with shared
pipe targets may need further clarification or safeguards to avoid
data corruption. Looking forward to more feedback.
Regards.
Solai