Re: CUDA Sorting

From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Gaetano Mendola <mendola(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CUDA Sorting
Date: 2012-02-13 10:39:23
Message-ID: CADyhKSUZpOV4Tj1D4Y3gamK-nH8wCpripdUDbuwRz7=0U3n_uw@mail.gmail.com
Lists: pgsql-hackers

2012/2/13 Greg Smith <greg(at)2ndquadrant(dot)com>:
> On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
>>
>> The trend is to have servers capable of running CUDA provide GPUs via
>> external hardware (a PCI Express interface with PCI Express switches); look,
>> for example, at the PowerEdge C410x PCIe Expansion Chassis from DELL.
>
>
> The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
> enclosure.  That's a completely sensible purchase if your goal is to build a
> computing cluster, where a lot of work is handed off to a set of GPUs.  I
> think that's even less likely to be a cost-effective option for a database
> server.  A single dedicated GPU installed in a server to accelerate
> sorting is something that might be justifiable, based on your benchmarks.
>  This is a much more expensive option than that though.  Details at
> http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants
> to see just how big this external box is.
>
>
>> I did some experiments timing the sort done with CUDA and the sort done
>> with pg_qsort:
>>                       CUDA      pg_qsort
>> 33 million integers: ~ 900 ms,  ~ 6000 ms
>> 1 million integers:  ~  21 ms,  ~  162 ms
>> 100k integers:       ~   2 ms,  ~   13 ms
>> CUDA time already includes the copy operations (host->device, device->host).
>> As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R)
>> Xeon(R) CPU X5650 @ 2.67GHz
>
>
> That's really interesting, and the X5650 is by no means a slow CPU.  So this
> benchmark is providing a lot of CPU power yet still seeing over a 6X speedup
> in sort times.  It sounds like the PCI Express bus has gotten fast enough
> that the time to hand data over and get it back again can easily be
> justified for medium to large sized sorts.
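
For what it is worth, the quoted numbers match the shape of the sketch below:
sort an array of random integers on the GPU with Thrust (which ships with the
CUDA toolkit), timing the host->device copy, the sort, and the device->host
copy as one block, then compare against a plain CPU sort. This is only an
illustration under my own assumptions (std::sort standing in for pg_qsort,
random input, 33M elements); it is not Gaetano's actual test code.

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
    const size_t n = 33u * 1000 * 1000;
    std::vector<int> data(n);
    for (size_t i = 0; i < n; ++i)
        data[i] = std::rand();

    /* GPU path: the timed region includes both PCIe transfers, as in the
     * numbers quoted above. */
    auto t0 = std::chrono::steady_clock::now();
    thrust::device_vector<int> dvec(data.begin(), data.end()); /* host -> device */
    thrust::sort(dvec.begin(), dvec.end());                    /* sort on the GPU */
    thrust::host_vector<int> sorted = dvec;                    /* device -> host */
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    /* CPU path for comparison (std::sort here; pg_qsort in the original test). */
    auto t2 = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());
    auto t3 = std::chrono::steady_clock::now();

    long long gpu_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    long long cpu_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count();
    printf("GPU incl. copies: %lld ms, CPU: %lld ms\n", gpu_ms, cpu_ms);
    return 0;
}
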
>
> It would be helpful to take this patch and confirm whether it scales when
> used in parallel.  The easiest way to do that would be to use the pgbench "-f"
> feature, which allows running an arbitrary query with any number of clients
> at once.  Seeing whether this acceleration continues to hold as the number
> of clients increases is a useful data point.
>
> Is it possible for you to break down where the time is being spent?  For
> example, how much of this time is consumed in the GPU itself, compared to
> time spent transferring data between CPU and GPU?  I'm also curious where
> the bottleneck is with this approach.  If it's the speed of the PCI-E bus
> for smaller data sets, adding more GPUs may never be practical.  If the bus
> can handle quite a few of these at once before it saturates, it might be
> possible to overload a single GPU.  That seems like it would be really hard
> to reach for database sorting, though; I can't really justify my gut feel
> for that being true.
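
Breaking that time down is straightforward with CUDA events; below is a
minimal sketch of the idea (again using Thrust for the sort, which is my own
assumption and not anything taken from the patch under discussion):

#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
    const size_t n = 33u * 1000 * 1000;
    std::vector<int> host(n);
    for (size_t i = 0; i < n; ++i)
        host[i] = std::rand();

    int *dev = NULL;
    cudaMalloc((void **) &dev, n * sizeof(int));

    /* One event per phase boundary: copy in, sort, copy out. */
    cudaEvent_t ev[4];
    for (int i = 0; i < 4; ++i)
        cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0]);
    cudaMemcpy(dev, &host[0], n * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1]);
    thrust::sort(thrust::device_ptr<int>(dev),
                 thrust::device_ptr<int>(dev + n));
    cudaEventRecord(ev[2]);
    cudaMemcpy(&host[0], dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3]);
    cudaEventSynchronize(ev[3]);

    float h2d, sort_ms, d2h;
    cudaEventElapsedTime(&h2d, ev[0], ev[1]);
    cudaEventElapsedTime(&sort_ms, ev[1], ev[2]);
    cudaEventElapsedTime(&d2h, ev[2], ev[3]);
    printf("host->device %.1f ms, sort %.1f ms, device->host %.1f ms\n",
           h2d, sort_ms, d2h);

    cudaFree(dev);
    return 0;
}
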
>
>
>> > I've never seen a PostgreSQL server capable of running CUDA, and I
>> > don't expect that to change.
>>
>> That sounds like:
>>
>> "I think there is a world market for maybe five computers."
>> - IBM Chairman Thomas Watson, 1943
>
>
> Yes, and "640K will be enough for everyone", ha ha.  (Gates flatly denies
> having said the 640K thing, BTW, and no one has come up with proof
> otherwise.)
>
> I think you've made an interesting case for this sort of acceleration now
> being useful for systems doing what's typically considered a data warehouse
> task.  I regularly see servers waiting for far more than 13M integers to
> sort.  And I am seeing a clear trend toward providing more PCI-E slots in
> servers now.  Dell's R810 is the most popular single server model my
> customers have deployed in the last year, and it has 5 X8 slots in it.  It's
> rare all 5 of those are filled.  As long as a dedicated GPU works fine when
> dropped to X8 speeds, I know a fair number of systems where one of those
> could be added now.
>
> There's another data point in your favor I didn't notice before your last
> e-mail.  Amazon has a "Cluster GPU Quadruple Extra Large" node type that
> runs with NVIDIA Tesla hardware.  That means the installed base of people
> who could consider CUDA is higher than I expected.  To demonstrate how much
> that costs, to provision a GPU enabled reserved instance from Amazon for one
> year costs $2410 at "Light Utilization", giving a system with 22GB of RAM
> and 1690GB of storage.  (I find the reserved prices easier to compare with
> dedicated hardware than the hourly ones.)  That's halfway between the
> High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and
> the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at
> $2200.  If someone could prove sorting was a bottleneck on their server,
> that isn't an unreasonable option to consider on a cloud-based database
> deployment.
>
> I still think that an approach based on OpenCL is more likely to be suitable
> for PostgreSQL, which was part of why I gave CUDA low odds here.  The points
> in favor of OpenCL are:
>
> -Since you last posted, OpenCL compilation has switched to using LLVM as its
> standard compiler.  Good PostgreSQL support for LLVM isn't far away.  It
> looks to me like the compiler situation for CUDA requires their PathScale
> based compiler.  I don't know enough about this area to say which compiling
> tool chain will end up being easier to deal with.
>
> -Intel is making GPU support standard for OpenCL, as I mentioned before.
>  NVIDIA will be hard pressed to compete with Intel for GPU acceleration once
> more systems supporting that enter the market.
>
> -Easy availability of OpenCL on Mac OS X for development's sake.  Lots of
> Postgres hackers with OS X systems, even though there aren't too many OS X
> database servers.
>
> The fact that Amazon provides a way to crack the chicken/egg hardware
> problem immediately helps a lot, though; I don't even need a physical card
> here to test CUDA GPU acceleration on Linux now.  With that data point, your
> benchmarks are good enough to say I'd be willing to help review a patch in
> this area here as part of the 9.3 development cycle.  That may validate that
> GPU acceleration is useful, and then the next step would be considering how
> portable that will be to other GPU interfaces.  I still expect CUDA will be
> looked back on as a dead end for GPU accelerated computing one day.
> Computing history is not filled with many single-vendor standards that
> competed successfully against Intel providing the same thing.  AMD's x86-64
> is the only example I can think of where Intel didn't win that sort of race,
> which happened (IMHO) only because Intel's Itanium failed to prioritize
> backwards compatibility highly enough.
>
As a side note, my module (PG-Strom) also uses CUDA, although I tried to
implement it with OpenCL at the beginning of the project. It did not work
well when multiple sessions used a GPU device concurrently: the second
background process got an out-of-resources error while another process had
the GPU device open.

I'm not sure whether that is a limitation of OpenCL, of NVIDIA's driver, or
a bug in my own code. In any case, I switched to CUDA rather than
investigating the binary drivers further. :-(
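
(One guess I never ruled out, and nothing above establishes it: if the device
compute mode had been set to one of the exclusive modes, the second process
asking for a context would be refused in much the same way even with a healthy
driver. A trivial check with the CUDA runtime looks like the sketch below;
device index 0 is assumed, and nvidia-smi can show and change the mode as
well.)

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaError_t rc = cudaGetDeviceProperties(&prop, 0);

    if (rc != cudaSuccess)
    {
        printf("cudaGetDeviceProperties: %s\n", cudaGetErrorString(rc));
        return 1;
    }
    /* cudaComputeModeDefault lets several host processes share the device;
     * the exclusive modes allow only one at a time, and the prohibited mode
     * allows none. */
    printf("compute mode = %d (%s)\n", prop.computeMode,
           prop.computeMode == cudaComputeModeDefault ?
               "default, shareable" : "restricted");
    return 0;
}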

Thanks,
--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>
