Re: problem with large maintenance_work_mem settings and

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: problem with large maintenance_work_mem settings and
Date: 2006-03-10 20:10:07
Message-ID: 8193.1142021407@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> samples % symbol name
> 350318533 98.8618 mergepreread
> 971822 0.2743 tuplesort_gettuple_common
> 413674 0.1167 tuplesort_heap_siftup

I don't have enough memory to really reproduce this, but I've come close
enough that I believe I see what's happening. It's an artifact of the
code I added recently to prevent the SortTuple array from growing during
the merge phase, specifically the "mergeslotsfree" logic. You can get
into a state where mergeslotsfree is at the lower limit of what the code
will allow, and then if there's a run of tuples that should come from a
single tape, mergepreread ends up sucking just one tuple per call from
that tape --- and with the outer loop over 28000 tapes that aren't doing
anything, each call is pretty expensive. I had mistakenly assumed that
the mergeslotsfree limit would be a seldom-hit corner case, but it seems
it's not so hard to get into that mode after all. The code really needs
to do a better job of sharing the available array slots among the tapes.
Probably the right answer is to allocate so many free array slots to each
tape, similar to the per-tape limit on memory usage --- I had thought
that the memory limit would cover matters but it doesn't.

Another thing I am wondering about is the code's habit of prereading
from all tapes when one goes empty. This is clearly pretty pointless in
the final-merge-pass case: we might as well just reload from the one
that went empty, and not bother scanning the rest. However, in the
scenario where we are rewriting the data to tape, I think we still need
the preread-from-all behavior in order to keep things efficient in
logtape.c. logtape likes it if you alternate a lot of reads with a lot
of writes, so once you've started reading you really want to refill
memory completely.

It might also be worth remembering the index of the last active tape so
that we don't iterate over thousands of uninteresting tapes in
mergepreread.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Steve Atkins 2006-03-10 20:23:04 Re: random observations while testing with a 1,8B row table
Previous Message Luke Lonergan 2006-03-10 20:02:43 Re: random observations while testing with a 1,8B row