Re: Inefficiency in parallel pg_restore with many tables

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Inefficiency in parallel pg_restore with many tables
Date: 2023-07-16 12:17:49
Message-ID: cf80eacd-8446-2a25-370b-40b2eb1cf890@dunslane.net
Lists: pgsql-hackers


On 2023-07-15 Sa 13:47, Tom Lane wrote:
> I looked into the performance gripe at [1] about pg_restore not making
> effective use of parallel workers when there are a lot of tables.
> I was able to reproduce that by dumping and parallel restoring 100K
> tables made according to this script:
>
> do $$
> begin
> for i in 1..100000 loop
> execute format('create table t%s (f1 int unique, f2 int unique);', i);
> execute format('insert into t%s select x, x from generate_series(1,1000) x', i);
> if i % 100 = 0 then commit; end if;
> end loop;
> end
> $$;
>
> Once pg_restore reaches the parallelizable part of the restore, what
> I see is that the parent pg_restore process goes to 100% CPU while its
> children (and the server) mostly sit idle; that is, the task dispatch
> logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
> to keep the children busy. A quick perf check showed most of the time
> being eaten by pg_qsort and TocEntrySizeCompare.
>
> What I believe is happening is that we start the parallel restore phase
> with 100K TableData items that are ready to go (they are in the
> ready_list) and 200K AddConstraint items that are pending, because
> we make those have dependencies on the corresponding TableData so we
> don't build an index until after its table is populated. Each time
> one of the TableData items is completed by some worker, the two
> AddConstraint items for its table are moved from the pending_list
> to the ready_list --- and that means ready_list_insert marks the
> ready_list as no longer sorted. When we go to pop the next task
> from the ready_list, we re-sort that entire list first. So
> we spend something like O(N^2 * log(N)) time just sorting, if
> there are N tables. Clearly, this code is much less bright
> than it thinks it is (and that's all my fault, if memory serves).
>
> I'm not sure how big a deal this is in practice: in most situations
> the individual jobs are larger than they are in this toy example,
> plus the initial non-parallelizable part of the restore is a bigger
> bottleneck anyway with this many tables. Still, we do have one
> real-world complaint, so maybe we should look into improving it.
>
> I wonder if we could replace the sorted ready-list with a priority heap,
> although that might be complicated by the fact that pop_next_work_item
> has to be capable of popping something that's not necessarily the
> largest remaining job. Another idea could be to be a little less eager
> to sort the list every time; I think in practice scheduling wouldn't
> get much worse if we only re-sorted every so often.
>

Yeah, I think that last idea is reasonable. Something like: if the number
of items added since the last sort is more than min(50, list_length/4),
then re-sort. That shouldn't be too invasive.
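
To make that concrete, here is a rough standalone sketch of the counting
heuristic. It is deliberately not a patch against pg_backup_archiver.c: the
ToyReadyList struct and function names below are invented for illustration
(the real ready_list holds TocEntry pointers and sorts with
TocEntrySizeCompare), but it shows the shape of the idea: count insertions
in the insert path and re-sort in the pop path only once the count exceeds
min(50, list_length/4).

#include <stdlib.h>

typedef struct
{
	int		   *items;			/* pending job sizes */
	int			nitems;
	int			capacity;
	int			added_since_sort;	/* insertions since the last qsort */
} ToyReadyList;

/* sort ascending so the largest known job sits at the end */
static int
size_compare(const void *a, const void *b)
{
	return *(const int *) a - *(const int *) b;
}

static void
toy_ready_list_insert(ToyReadyList *rl, int size)
{
	if (rl->nitems == rl->capacity)
	{
		rl->capacity = rl->capacity ? rl->capacity * 2 : 64;
		rl->items = realloc(rl->items, rl->capacity * sizeof(int));
	}
	rl->items[rl->nitems++] = size;
	rl->added_since_sort++;		/* don't sort here, just count */
}

static int
toy_pop_next_work_item(ToyReadyList *rl)
{
	int			threshold;

	if (rl->nitems == 0)
		return -1;

	/* re-sort only once enough unsorted entries have piled up */
	threshold = rl->nitems / 4 < 50 ? rl->nitems / 4 : 50;
	if (rl->added_since_sort > threshold)
	{
		qsort(rl->items, rl->nitems, sizeof(int), size_compare);
		rl->added_since_sort = 0;
	}

	/* largest known job is at the end (may be slightly stale) */
	return rl->items[--rl->nitems];
}

int
main(void)
{
	ToyReadyList rl = {0};
	int			i,
				job;

	for (i = 1; i <= 1000; i++)
		toy_ready_list_insert(&rl, rand() % 100000);

	while ((job = toy_pop_next_work_item(&rl)) >= 0)
		;						/* dispatch job ... */

	free(rl.items);
	return 0;
}

The scheduling can get a little worse, since a just-inserted item may be
popped before the list has been re-sorted, but as you note that shouldn't
matter much in practice.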

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
