Re: Re: [COMMITTERS] Can not create more than 32766 databases in ufs file system.

From: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org, fulan Peng <fulanpeng(at)gmail(dot)com>
Subject: Re: Re: [COMMITTERS] Can not create more than 32766 databases in ufs file system.
Date: 2009-09-12 23:12:05
Message-ID: 4AAC2AC5.4010502@mark.mielke.cc
Lists: pgsql-committers pgsql-hackers

On 09/12/2009 04:17 PM, Stephen Frost wrote:
> * Mark Mielke (mark(at)mark(dot)mielke(dot)cc) wrote:
>
>> There is no technical requirement for PostgreSQL to separate data in
>> databases or tables on subdirectory or file boundaries. Nothing wrong
>> with having one or more large files that contain everything.
>>
> Uhh, except where you run into system limitations on file size (eg- a 2G
> max file size..). You'll note PG creates files up to 1G and then splits
> them into separate files. It's not done just because it's fun.
>

This becomes a bit of a side thread - but note that I carefully didn't
say "exactly one file". I said "one or more large files that contain
everything". That is, if we have 3 databases each of size 50 Mbytes,
there is no technical reason why this cannot be stored within a single
150 Mbyte data file. Sure, if it goes beyond 2G, we can break it into a
set of files, and treat each file as a 2G "block" in a virtual larger
storage pool. VMWare has this for storing virtual drives.
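
To make the idea concrete - and this is only a sketch in plain C with
made-up names, not how PostgreSQL actually lays anything out - the
mapping from one big logical store onto 2G segment files is trivial:

/*
 * Hypothetical sketch: addressing one large logical store as a set of
 * fixed-size 2 GB segment files ("blocks" in a virtual pool).  Names
 * are illustrative only; this is not PostgreSQL's storage layout.
 */
#include <stdint.h>
#include <stdio.h>

#define SEGMENT_BYTES   (2ULL * 1024 * 1024 * 1024)     /* 2 GB per file */

typedef struct
{
    uint64_t    segno;      /* which physical file: pool.0, pool.1, ... */
    uint64_t    offset;     /* byte offset within that file */
} SegmentAddr;

static SegmentAddr
map_logical_offset(uint64_t logical_offset)
{
    SegmentAddr addr;

    addr.segno = logical_offset / SEGMENT_BYTES;
    addr.offset = logical_offset % SEGMENT_BYTES;
    return addr;
}

int
main(void)
{
    /* 150 MB worth of small databases stays inside segment 0 ... */
    SegmentAddr a = map_logical_offset(150ULL * 1024 * 1024);
    /* ... while an offset past 2 GB spills into segment 1. */
    SegmentAddr b = map_logical_offset(SEGMENT_BYTES + 4096);

    printf("a: seg %llu off %llu\n", (unsigned long long) a.segno,
           (unsigned long long) a.offset);
    printf("b: seg %llu off %llu\n", (unsigned long long) b.segno,
           (unsigned long long) b.offset);
    return 0;
}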

If we assume that 32k *databases* is reasonable for a single instance,
the real problem is that 32k databases currently *require* 32k immediate
sub-directories. This can be solved either by: 1) Adding additional
depth to the directory hierarchy to work around this limit (what the OP
and you are proposing), or 2) Storing multiple databases within the same
files or sub-directories. If you really must have this amount of
scalability, I am suggesting that you consider all of the resources
required to access 32k worth of sub-directories in the file system,
specifically including file descriptors, inodes, the backing bitmaps or
extent mappings that allocate from the file system free space, the
rather inefficient directory layouts of many file systems (many still do
LINEAR searches for filenames, making file lookups linearly slower as
the directory grows), and the kernel memory caches that track all of
these little details. The POSIX guarantees required are certainly
heavier weight than what PostgreSQL actually needs, and I am certain it
is possible to create a targeted solution to this problem that is
simpler and faster. For only a few databases and a few files, the effort
isn't worth it. But if supporting 32k+ *databases*, or even 32k+ tables
and indexes, is a major requirement and a major design target, then
PostgreSQL should do this stuff itself.
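
For option 1), the sort of fan-out I mean looks something like this -
again a throwaway sketch with invented names (today's real layout is a
flat base/<oid>/), not a concrete proposal:

/*
 * Illustrative only: hashing the database OID into a small first level
 * keeps every directory well under the 32k-entry limit.
 */
#include <stdio.h>

static void
database_path(char *buf, size_t buflen, unsigned int db_oid)
{
    /* 256 first-level buckets, then the database directory itself. */
    snprintf(buf, buflen, "base/%02x/%u", db_oid % 256, db_oid);
}

int
main(void)
{
    char        path[64];

    database_path(path, sizeof(path), 16384);
    printf("%s\n", path);       /* prints base/00/16384 */
    return 0;
}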

Modern file systems don't have the 2G problem. ext2/ext3 with 4 Kbyte
blocks (the standard) supports files up to 2 Tbytes. This also matches
the practical limit on addressing a single physical disk, at least on
the platforms I am familiar with. The requirement to stay under 2G for a
single file is a bit outdated.
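
For what it's worth, the old 2G ceiling was mostly a 32-bit off_t
problem rather than a file system one. A throwaway test (assuming a
reasonably modern POSIX system) shows a write landing past 2G once
64-bit offsets are enabled:

/*
 * Minimal check of the "2G problem": with 64-bit file offsets
 * (_FILE_OFFSET_BITS=64 on 32-bit platforms, the default on 64-bit
 * ones), seeking and writing past 2 GB works; the remaining limits come
 * from the file system itself.
 */
#define _FILE_OFFSET_BITS 64

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int         fd = open("bigfile.tmp", O_CREAT | O_WRONLY, 0600);
    off_t       target = (off_t) 3 * 1024 * 1024 * 1024;   /* 3 GB */

    if (fd < 0 || lseek(fd, target, SEEK_SET) != target ||
        write(fd, "x", 1) != 1)
    {
        perror("write past 2 GB failed");
        return 1;
    }
    printf("wrote one byte at 3 GB (file is sparse, not 3 GB on disk)\n");
    close(fd);
    unlink("bigfile.tmp");
    return 0;
}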

>> I guess I'm not seeing how using 32k tables is a sensible model.
>>
> For one thing, there's partitioning. For another, there's a large user
> base. 32K tables is, to be honest, not all that many, especially for
> some of these databases which reach into the multi-TB range..
>

Talking philosophically - the need to use table-based partitioning to
achieve acceptable performance or storage requirements is somewhat of a
hacky workaround. It's effectively moving the database query logic back
into the application space, where the application must know which tables
contain which data. The inherited tables and automatic constraint-based
query planning helps out, but it's still an elaborate hack. It's
exposing data that the application should not need to care about, and
then making it possible to hide some of it again. Table partitioning
should be far more automatic. I don't want to break my theoretical table
containing every call made on my network into per-hour tables, each with
a constraint for the time range it includes data for. I want to create a
table, with a timestamp column, fill it with billions of records,
provide a few hints, and the database engine should be smart enough to
partition the table such that my queries "just work".
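
What I'm picturing is the engine doing something like the following
under the covers - a toy sketch with invented names, nothing like what
the inheritance/constraint machinery actually does today:

/*
 * Sketch of automatic time-based routing: the engine, not the
 * application, maps a row's timestamp to a per-hour partition.  The
 * "hint" is just the bucket width.
 */
#include <stdio.h>
#include <time.h>

#define PARTITION_WIDTH_SECS    3600        /* one partition per hour */

static long
partition_for(time_t call_time)
{
    /* Hour bucket since the epoch; the engine maps bucket -> storage. */
    return (long) (call_time / PARTITION_WIDTH_SECS);
}

int
main(void)
{
    time_t      now = time(NULL);

    printf("row stamped %ld routes to partition %ld\n",
           (long) now, partition_for(now));
    return 0;
}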

Back to reality - maybe things have not reached this level of maturity
yet, and people with practical requirements today have found that they
need to use very complex manual partitioning schemes that chew up
thousands of tables.

>> So yes,
>> things can be done to reduce the cost - but it seems like something is
>> wrong if this is truly a requirement.
>>
> I have no idea what you've been working with, but I hardly think it
> makes sense for PG to consider over 32k tables as not worth supporting.
>

I don't advocate any limits. However, I also don't advocate designing
PostgreSQL specifically for the case of 32k tables. If you want to use
32k tables, then you had better have a file system that supports 32k+
files in a single directory, and a kernel that is able to work efficiently
when postgres has thousands of file descriptors (or more) open and in use at
the same time. The system *supports* 32k tables, but if you look at the
design, you'll see that it is not optimal for 32k tables. Even with the
changes to reduce the use of flat files and such, it's still not a
design that makes 32k tables optimal. If you want to create 32k tables -
I don't want to stop you.
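
To put a number on the descriptor point: the per-process limit is
normally far below 32k, which is exactly why postgres already juggles a
bounded pool of real descriptors (fd.c) and re-opens files on demand. A
quick check, plain POSIX:

/*
 * Illustration of the descriptor pressure (not from the PostgreSQL
 * source): report the per-process open file limit and whether 32k
 * simultaneously open relation files would even fit under it.
 */
#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
    {
        perror("getrlimit");
        return 1;
    }
    printf("open file limit: soft %lu, hard %lu\n",
           (unsigned long) rl.rlim_cur, (unsigned long) rl.rlim_max);
    printf("32768 open relation files would%s fit under the soft limit\n",
           rl.rlim_cur >= 32768 ? "" : " not");
    return 0;
}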

But, I'm not going to agree that this is optimal or that PostgreSQL
should put excessive effort into making it optimal. :-)

Specifically with regard to my comments about creating a truly scalable
system that would support 32k databases - I don't think it's worth it. I
know it is possible, but I don't want to see effort put into it.

>> There are alternative models of
>> storage that would not require 32k tables, that likely perform better.
>>
> Eh? You would advocate combining tables for no reason other than you
> think it's bad to have alot?
>

"likely perform better" is not "no reason".

>> Do you agree with me that having 32k open file descriptors (or worse,
>> open on demand file descriptors that need to be re-opened many times) is
>> a problem?
>>
> Nope.
>

If you don't think it represents a scalability problem, then I think it
puts your ability to judge the situation in question. :-)

>> Looking at PostgreSQL today - I don't think it's designed to scale to
>> this. Looking at SQL today, I think I would find it difficult to justify
>> creating a solution that requires this capability.
>>
> Actually, I find that PG handles it pretty well. And we used to be an
> Oracle shop.
>

Everything is relative. Throw enough RAM and CPU at the problem, and
things can appear acceptable and scalable. Hit a limit, and ... oops.

You do remember we're in a thread talking about how the # of databases
is limited on some systems, right? :-)

"Pretty well" means "good enough for you". It means nothing more or less.

Cheers,
mark

--
Mark Mielke <mark(at)mielke(dot)cc>
