pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout

From: "r(dot)takahashi_2(at)fujitsu(dot)com" <r(dot)takahashi_2(at)fujitsu(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout
Date: 2019-09-02 04:42:55
Message-ID: OSBPR01MB4550DAE2F8C9502894A45AAB82BE0@OSBPR01MB4550.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Hi

pg_basebackup -F t fails when fsync() takes longer than tcp_user_timeout in the following environment.

[Environment]
Postgres 13dev (master branch)
Red Hat Enterprise Linux 7.4

[Error]
$ pg_basebackup -F t --progress --verbose -h <hostname> -D <directory>
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/5A000060 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_15647"
pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

[Analysis]
- pg_basebackup -F t creates a tar file for each tablespace and calls fsync() on each one.
(By contrast, -F p calls fsync() only once, at the end.)
- While pg_basebackup is in fsync() for one tablespace's tar file, the wal sender is already sending the contents of the next tablespace.
When fsync() takes a long time, the TCP socket of pg_basebackup returns "zero window" packets to the wal sender.
This means the TCP receive buffer of pg_basebackup is full, because pg_basebackup cannot read from the socket during fsync().
- The wal sender's socket keeps retrying the send, but resets the connection once tcp_user_timeout has elapsed.
After the wal sender resets the connection, pg_basebackup can no longer receive data and fails with the error above.
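
The stall described above can be reproduced in miniature. The following is only a sketch, not PostgreSQL code: it uses a loopback TCP connection, the hypothetical function name fill_until_stalled, and deliberately small buffer sizes chosen for illustration. The receiver never calls recv(), standing in for pg_basebackup blocked in fsync(), and the sender is non-blocking so the stall shows up immediately as an error instead of waiting for a timeout.

```python
import socket


def fill_until_stalled(chunk_size=4096, buf_size=8192):
    """Send to a peer that never reads; return the bytes accepted before the stall."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Small receive buffer, set before listen() so the accepted socket inherits it.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_size)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_size)
    sender.connect(listener.getsockname())
    # The receiver never reads: it plays the role of a client stuck in fsync().
    receiver, _ = listener.accept()

    sender.setblocking(False)
    queued = 0
    try:
        while True:
            queued += sender.send(b"x" * chunk_size)
    except BlockingIOError:
        # The kernel accepts no more data because the peer is not draining it.
        pass
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return queued
```

A blocking sender would simply wait at this point. If a TCP user timeout is configured on the sending side, that wait is what ends with the connection being reset, as in the analysis above.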

[Solution]
I don't think fsync() for each tablespace is necessary.
As with pg_basebackup -F p, I think a single fsync() pass at the end is enough.
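
To illustrate the idea, here is a minimal sketch of the "write everything, sync once at the end" ordering. It is not the pg_basebackup implementation: the function name write_archives_then_sync and the in-memory archives mapping are hypothetical, and it assumes a POSIX system where fsync() on a read-only descriptor and on a directory is allowed (as on Linux).

```python
import os


def write_archives_then_sync(directory, archives):
    """Write every archive first, then fsync them all in one final pass."""
    paths = []
    for name, payload in archives.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            # No fsync() here, so receiving the next archive is not delayed.
            f.write(payload)
        paths.append(path)

    # Final pass: flush each file to stable storage.
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    # Flush the directory entry so the new files survive a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return paths
```

With this ordering the durability guarantee at the end is the same, but no long fsync() happens while the server is still streaming data.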

Do you have any comments?

Regards,
Ryohei Takahashi
