On Fri, Jan 27, 2012 at 2:56 PM, Pierre C <lists(at)peufeu(dot)com> wrote:
>> Right. I think it makes more sense to try to get parallelism working
>> first with the infrastructure we have. Converting to use threading,
>> if we ever do it at all, should be something we view as a later
>> performance optimization. But I suspect we won't want to do it
>> anyway; I think there will be easier ways to get where we want to be.
> Multithreading got fashionable with the arrival of the Dual-core CPU a few
> years ago. However, multithreading as it is used currently has a huge
> problem : usually, threads share all of their memory. This opens the door to
> an infinite number of hard to find bugs, and more importantly, defeats the
> "Re-entrant palloc()" is nonsense. Suppose you can make a reentrant palloc()
> which scales OK at 2 threads thanks to a cleverly placed atomic instruction.
> How is it going to scale on 64 cores ? On HP's new 1000-core ARM server with
> non-uniform memory access ? Probably it would suck very very badly... not to
> mention the horror of multithreaded exception-safe deallocation when 1
> thread among many blows up on an error...
There are academic papers out there on how to build a thread-safe,
highly concurrent memory allocator. You seem to be assuming that
everyone doing allocations would need to compete for access to a
single freelist, or something like that, which is simply not true. A
lot of effort and study has been put into figuring out how to get past
bottlenecks in this area, because there is a lot of multi-threaded
code out there that needs to surmount these problems. I don't believe
that the problem is that it can't be done, but rather that we haven't
> For the ultimate in parallelism, ask a FPGA guy. Is he using shared memory
> to wire together his 12000 DSP blocks ? Nope, he's using isolated Processes
> which share nothing and communicate through FIFOs and hardware message
> passing. Like shell pipes, basically. Or Erlang.
I'm not sure we can conclude much from this example. The programming
style of people using FPGAs is probably governed by the nature of the
interface and the type of computation they are doing rather than
> Good parallelism = reduce shared state and communicate through data/message
> Shared-everything multithreading is going to be in a lot of trouble on
> future many-core machines. Incidentally, Postgres, with its Processes,
> sharing only what is needed, has a good head start...
> With more and more cores coming, you guys are going to have to fight to
> reduce the quantity of shared state between processes, not augment it by
> using shared memory threads !...
I do agree that it's important to reduce shared state. We've seen
some optimizations this release cycle that work precisely because they
cut down on the rate at which cache lines must be passed between
cores, and it's pretty clear that we need to go farther in that
direction. On the other hand, I think it's a mistake to confuse the
programming model with the amount of shared state. In a
multi-threaded programming model there is likely to be a lot more
memory that is technically "shared" in the sense that any thread could
technically access it. But if the application is coded in such a way
that actual sharing is minimal, then it's not necessarily any worse
than a process model as far as concurrency is concerned. Threading
provides a couple of key advantages which, with our process model, we
can't get: it avoids the cost of a copy-on-write operation every time
a child is forked, and it allows arbitrary amounts of memory rather
than being limited to a single shared memory segment that must be
sized in advance. The major disadvantage is really with robustness,
not performance, I think: in a threaded environment, with a shared
address space, the consequences of a random memory stomp will be less
> Say you want to parallelize sorting.
> Sorting is a black-box with one input data pipe and one output data pipe.
> Data pipes are good for parallelism, just like FIFOs. FPGA guys love black
> boxes with FIFOs between them.
> Say you manage to send tuples through a FIFO like zeromq. Now you can even
> run the sort on another machine and allow it to use all the RAM if you like.
> Now split the black box in two black boxes (qsort and merge), instanciate as
> many qsort boxes as necessary, and connect that together with pipes. Run
> some boxes on some of this machine's cores, some other boxes on another
> machine, etc. That would be very flexible (and scalable).
> Of course the black box has a small backdoor : some comparison functions can
> access shared state, which is basically *the* issue (not reentrant stuff,
> which you do not need).
Well, you do need reentrant stuff, if the comparator does anything
non-trivial. It's easy to imagine that comparing strings or dates or
whatever is a trivial operation that's done without allocating any
memory or throwing any errors, but it's not really true. I think the
challenge of using GPU acceleration or JIT or threading or other
things that are used in really high-performance computing is going to
be that a lot of our apparently-trivial operations are actually, well,
not so trivial, because we have error checks, overflow checks,
nontrivial encoding/decoding from the on-disk format, etc. There's a
tendency to wave that stuff away as peripheral, but I think that's a
mistake. Someone who knows how to do it can probably write a
muti-threaded, just-in-time-compiled, and/or GPU-accelerated program
in an afternoon that solves pretty complex problems much more quickly
than PostgreSQL, but doing it without throwing all the error checks
and on numeric as well as int4 and in a way that's portable to every
architecture we support - ah, well, there's the hard part.
The Enterprise PostgreSQL Company
In response to
pgsql-hackers by date
|Next:||From: Robert Haas||Date: 2012-01-30 15:07:16|
|Subject: Re: Configuring Postgres to Add A New Source File|
|Previous:||From: Heikki Linnakangas||Date: 2012-01-30 14:57:27|
|Subject: Re: Group commit, revised|