Re: Hadoop backend?

From:	pi song <pi(dot)songs(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Hadoop backend?
Date:	2009-02-23 04:56:59
Message-ID:	1b29507a0902222056h7c576a65k50ab572c4da601ff@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Feb 23, 2009 at 3:56 PM, pi song <pi(dot)songs(at)gmail(dot)com> wrote:

> I think the point that you can access more system cache is right, but that
> doesn't mean it will be more efficient than reading from your local disk.
> Take Hadoop for example: your request for file content first has to go to
> the NameNode (the file chunk indexing service), and then you ask the
> DataNode, which provides the data. Assuming that you're working on a large
> dataset, the probability that the data chunk you need is still in the system
> cache is very low, so most of the time you end up reading from a remote
> disk.
>
> I've got a better idea. How about we make the buffer pool multilevel? The
> first level is the current one. The second level represents memory from
> remote machines. Things that are used less often should stay on the second
> level. Has anyone ever thought about something like this before?
>
> Pi Song
>
> On Mon, Feb 23, 2009 at 1:09 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> On Sun, Feb 22, 2009 at 5:18 PM, pi song <pi(dot)songs(at)gmail(dot)com> wrote:
>> > One more problem is that data placement on HDFS is decided by the system,
>> > meaning you have no explicit control. Thus, you cannot place two sets of
>> > data that are likely to be joined together on the same node, which leads
>> > to uncontrollable latency during query processing.
>> > Pi Song
>>
>> It would only be possible to have the actual PostgreSQL backends
>> running on a single node anyway, because they use shared memory to
>> hold lock tables and things. The advantage of a distributed file
>> system would be that you could access more storage (and more system
>> buffer cache) than would be possible on a single system (or perhaps
>> the same amount but at less cost). Assuming some sort of
>> per-tablespace control over the storage manager, you could put your
>> most frequently accessed data locally and the less frequently accessed
>> data into the DFS.
>>
>> But you'd still have to pull all the data back to the master node to
>> do anything with it. Being able to actually distribute the
>> computation would be a much harder problem. Currently, we don't even
>> have the ability to bring multiple CPUs to bear on (for example) a
>> large sequential scan (even though all the data is on a single node).
>>
>> ...Robert
>>
>
>
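
The multilevel buffer pool proposed in the message above can be illustrated with a toy sketch. It is not PostgreSQL code: the class name `TwoLevelBufferPool`, the `read_from_disk` callback, and the LRU policy at both levels are assumptions made for the example, and the second level is an ordinary in-process dictionary standing in for memory on remote machines.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class TwoLevelBufferPool:
    """Toy two-level LRU cache (hypothetical, for illustration only).

    Level 1 models the local buffer pool. Level 2 stands in for memory on
    remote machines; here it is just another in-process OrderedDict.
    """

    def __init__(self, l1_capacity: int, l2_capacity: int,
                 read_from_disk: Callable[[Hashable], bytes]) -> None:
        self.l1: OrderedDict = OrderedDict()
        self.l2: OrderedDict = OrderedDict()
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity
        self.read_from_disk = read_from_disk
        self.stats = {"l1_hit": 0, "l2_hit": 0, "miss": 0}

    def get(self, page_id: Hashable) -> bytes:
        # Level 1 hit: mark the page as most recently used.
        if page_id in self.l1:
            self.l1.move_to_end(page_id)
            self.stats["l1_hit"] += 1
            return self.l1[page_id]

        if page_id in self.l2:
            # Level 2 hit: take the page out of level 2 so it can be promoted.
            page = self.l2.pop(page_id)
            self.stats["l2_hit"] += 1
        else:
            # Miss in both levels: fall back to the (assumed) disk reader.
            page = self.read_from_disk(page_id)
            self.stats["miss"] += 1

        self._put_l1(page_id, page)
        return page

    def _put_l1(self, page_id: Hashable, page: bytes) -> None:
        self.l1[page_id] = page
        if len(self.l1) > self.l1_capacity:
            # Demote the least recently used level 1 page to level 2.
            victim_id, victim = self.l1.popitem(last=False)
            self.l2[victim_id] = victim
            if len(self.l2) > self.l2_capacity:
                # Level 2 is full as well: drop its oldest page entirely.
                self.l2.popitem(last=False)


# Usage: a pool with two local slots and two "remote" slots.
pool = TwoLevelBufferPool(2, 2, lambda pid: f"page-{pid}".encode())
for pid in [1, 2, 1, 3, 2, 4]:
    pool.get(pid)
print(pool.stats)
```

Less frequently used pages drift to the second level instead of being dropped, which is the behavior the proposal describes. A real design would also have to account for the network round trip on every second-level hit and for dirty pages, both of which this sketch ignores.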

In response to

Re: Hadoop backend? at 2009-02-23 02:09:16 from Robert Haas
