From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: problem with large maintenance_work_mem settings and
Date: 2006-03-09 17:25:32
Message-ID: 26014.1141925132@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> samples   %        symbol name
> 103520432 47.9018  inlineApplySortFunction
>  33382738 15.4471  comparetup_index
>  25296438 11.7054  tuplesort_heap_siftup
>  10089122  4.6685  btint4cmp
>   8395676  3.8849  LogicalTapeRead
>   2873556  1.3297  tuplesort_heap_insert
>   2796545  1.2940  tuplesort_gettuple_common
>   2752345  1.2736  AllocSetFree
>   2233889  1.0337  IndexBuildHeapScan

Interesting. What's the platform, and what compiler exactly? For me,
gcc seems to inline inlineApplySortFunction successfully, but your
compiler evidently is not doing that.
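(An illustration of the inlining point, not the actual tuplesort.c
source: "static inline" is only a request to the compiler.  gcc
generally honors it once optimization is on, but emits an out-of-line
copy at -O0, and some compilers ignore the keyword altogether -- which
leaves the function visible to oprofile under its own name, as in your
profile.  gcc's always_inline attribute makes the request binding.  The
comparator name and body below are stand-ins.)

#include <stdint.h>

/* Declare first, so the attribute is accepted by older gcc versions. */
static inline int32_t apply_sort_func(int32_t a, int32_t b)
    __attribute__((always_inline));

/* Stand-in for the comparator dispatch we expect to be inlined. */
static inline int32_t
apply_sort_func(int32_t a, int32_t b)
{
    return (a > b) - (a < b);
}

/* If inlining succeeded, "nm foo.o | grep apply_sort_func" prints
 * nothing, since no out-of-line copy remains in the object file. */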

I get a significantly different oprofile result from CVS HEAD:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
 194439  37.8399  tuplesort_heap_siftup
  70803  13.7790  comparetup_index
  35030   6.8172  btint4cmp
  25139   4.8923  AllocSetAlloc
  23965   4.6638  writetup_index
  18057   3.5141  GetMemoryChunkSpace
  17384   3.3831  tuplesort_gettuple_common
  14356   2.7938  mergepreread
  12740   2.4793  LWLockAcquire
  11175   2.1748  LWLockRelease
   6986   1.3596  tuplesort_heap_insert
   6659   1.2959  PageAddItem
   6387   1.2430  index_form_tuple
   5225   1.0168  LogicalTapeRead

LOG: begin index sort: unique = f, workMem = 65536, randomAccess = f
LOG: switching to external sort with 234 tapes: CPU 0.23s/0.81u sec elapsed 1.82 sec
LOG: finished writing run 1 to tape 0: CPU 0.68s/6.80u sec elapsed 9.59 sec
LOG: finished writing run 2 to tape 1: CPU 1.23s/13.54u sec elapsed 17.88 sec
LOG: finished writing run 3 to tape 2: CPU 1.62s/20.43u sec elapsed 25.56 sec
LOG: finished writing run 4 to tape 3: CPU 2.08s/27.48u sec elapsed 34.64 sec
LOG: finished writing run 5 to tape 4: CPU 2.56s/34.50u sec elapsed 43.06 sec
LOG: performsort starting: CPU 2.66s/35.81u sec elapsed 45.00 sec
LOG: finished writing run 6 to tape 5: CPU 2.74s/38.11u sec elapsed 47.38 sec
LOG: finished writing final run 7 to tape 6: CPU 2.74s/38.15u sec elapsed 47.43 sec
LOG: performsort done (except 7-way final merge): CPU 2.74s/38.56u sec elapsed 47.84 sec
LOG: external sort ended, 24434 disk blocks used: CPU 4.10s/51.98u sec elapsed 65.33 sec

This is creating an index on a single integer column, 10 million rows,
random data, with maintenance_work_mem set to 65536 (64MB). Platform
is 64-bit Xeon, Fedora Core 4. I tried a larger test case (100M rows,
640MB maintenance_work_mem) and got similar results, with, if anything,
a few more percent in tuplesort_heap_siftup. What's even more
interesting is that an instruction-level dump of the oprofile counts
shows that almost all the tuplesort_heap_siftup time is spent at
the "((tup1)->tupindex != (tup2)->tupindex)" part of the first
HEAPCOMPARE macro. I interpret this as memory fetch delay: that's the
first touch of the SortTuple array elements, and if that part of the
array wasn't in L2 cache this is where the CPU would stall waiting
for it.
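
For reference, a condensed sketch of the structure in question -- these
are simplified stand-ins, not the real tuplesort.c code; SortTuple is
reduced to a plain int key -- with the first-touch loads marked:

/* Simplified stand-in for tuplesort's SortTuple (illustration only). */
typedef struct
{
    int     key;        /* stand-in for the Datum and its comparator */
    int     tupindex;   /* run/tape number during the merge phase */
} SortTuple;

static int
compare_keys(const SortTuple *a, const SortTuple *b)
{
    return (a->key > b->key) - (a->key < b->key);
}

/* Order by run index first, then by the datatype comparator -- the
 * shape of the HEAPCOMPARE macro quoted above.  The tupindex loads
 * are the first touch of each SortTuple. */
#define HEAPCOMPARE(t1, t2) \
    (((t1)->tupindex != (t2)->tupindex) ? \
     (t1)->tupindex - (t2)->tupindex : compare_keys((t1), (t2)))

/* Remove the heap root: move the last element into the hole and sift
 * it down.  Roughly the shape of tuplesort_heap_siftup. */
static void
heap_siftup(SortTuple *memtuples, int *memtupcount)
{
    SortTuple   tuple;
    int         i = 0;
    int         n;

    if (--(*memtupcount) <= 0)
        return;
    n = *memtupcount;
    tuple = memtuples[n];       /* last element fills the hole */
    while (i * 2 + 1 < n)
    {
        int     j = i * 2 + 1;

        /* First touch of memtuples[j] and [j+1]: if they are not in
         * L2 cache, the tupindex test inside HEAPCOMPARE is where
         * the CPU stalls. */
        if (j + 1 < n && HEAPCOMPARE(&memtuples[j], &memtuples[j + 1]) > 0)
            j++;
        if (HEAPCOMPARE(&tuple, &memtuples[j]) <= 0)
            break;
        memtuples[i] = memtuples[j];
        i = j;
    }
    memtuples[i] = tuple;
}

The point of the sketch is the access pattern: each level of the
sift-down touches two array elements that nothing has prefetched, so
with hundreds of megabytes of SortTuples you pay roughly one cache miss
per heap level.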

Your machine seems not to be subject to nearly the same amount of memory
delay, which is interesting because most of the argument for changing
sort algorithms seems to hinge on the assumption that main-memory delay
is the main thing we need to worry about. That assumption looks valid on
the Xeon I'm testing, but not on your machine ...

regards, tom lane
