Re: Hadoop backend?

From: pi song <pi(dot)songs(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hadoop backend?
Date: 2009-02-23 04:56:59
Message-ID: 1b29507a0902222056h7c576a65k50ab572c4da601ff@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 23, 2009 at 3:56 PM, pi song <pi(dot)songs(at)gmail(dot)com> wrote:

> I think the point that you can access more system cache is right, but that
> doesn't mean it will be more efficient than reading from your local disk.
> Take Hadoop, for example: a request for file content first has to go to the
> NameNode (the file chunk indexing service), and then you ask the DataNode,
> which provides the data. Assuming you're working on a large dataset, the
> probability that the chunk you need is still in the system cache is very
> low, so most of the time you end up reading from a remote disk.
>
> I've got a better idea. How about we make the buffer pool multilevel? The
> first level is the current one. The second level represents memory from
> remote machines. Things that are used less often should stay on the second
> level. Has anyone ever thought about something like this before?
>
> Pi Song
>
> On Mon, Feb 23, 2009 at 1:09 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> On Sun, Feb 22, 2009 at 5:18 PM, pi song <pi(dot)songs(at)gmail(dot)com> wrote:
>> > One more problem is that data placement on HDFS is implicit, meaning you
>> > have no explicit control over it. Thus, you cannot place two sets of data
>> > that are likely to be joined together on the same node, which leads to
>> > uncontrollable latency during query processing.
>> > Pi Song
>>
>> It would only be possible to have the actual PostgreSQL backends
>> running on a single node anyway, because they use shared memory to
>> hold lock tables and things. The advantage of a distributed file
>> system would be that you could access more storage (and more system
>> buffer cache) than would be possible on a single system (or perhaps
>> the same amount but at less cost). Assuming some sort of
>> per-tablespace control over the storage manager, you could put your
>> most frequently accessed data locally and the less frequently accessed
>> data into the DFS.
>>
>> But you'd still have to pull all the data back to the master node to
>> do anything with it. Being able to actually distribute the
>> computation would be a much harder problem. Currently, we don't even
>> have the ability to bring multiple CPUs to bear on (for example) a
>> large sequential scan (even though all the data is on a single node).
>>
>> ...Robert
>>
>
>
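
To make the multilevel buffer pool idea above a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not anything PostgreSQL implements: the class name TwoLevelBufferPool, the LRU policy at both levels, and the read_from_disk callback are all assumptions, and the "remote memory" level is simulated with an in-process dictionary rather than memory on another machine. Dirty pages and write-back are ignored.

```python
from collections import OrderedDict


class TwoLevelBufferPool:
    """Hypothetical two-level page cache (illustration only).

    Level 1 stands in for the local buffer pool; level 2 stands in for
    memory on remote machines. Both use LRU eviction, which is an
    assumption made for this sketch.
    """

    def __init__(self, l1_capacity, l2_capacity, read_from_disk):
        self.l1 = OrderedDict()  # local pages, least recently used first
        self.l2 = OrderedDict()  # simulated remote pages, same ordering
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity
        self.read_from_disk = read_from_disk  # callable: page_id -> page
        self.stats = {"l1_hits": 0, "l2_hits": 0, "disk_reads": 0}

    def _insert_l1(self, page_id, page):
        # Put the page in level 1 as the most recently used entry.
        self.l1[page_id] = page
        self.l1.move_to_end(page_id)
        if len(self.l1) > self.l1_capacity:
            # Demote the least recently used local page to level 2.
            victim_id, victim = self.l1.popitem(last=False)
            self.l2[victim_id] = victim
            self.l2.move_to_end(victim_id)
            if len(self.l2) > self.l2_capacity:
                # Level 2 is full as well: drop its oldest page.
                self.l2.popitem(last=False)

    def get(self, page_id):
        if page_id in self.l1:
            self.l1.move_to_end(page_id)
            self.stats["l1_hits"] += 1
            return self.l1[page_id]
        if page_id in self.l2:
            # Level 2 hit: promote the page back to level 1.
            page = self.l2.pop(page_id)
            self.stats["l2_hits"] += 1
        else:
            page = self.read_from_disk(page_id)
            self.stats["disk_reads"] += 1
        self._insert_l1(page_id, page)
        return page
```

Whether this would pay off in practice depends on the second level being noticeably faster to reach than local disk; the sketch only shows the bookkeeping, not the network cost.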
