Re: Netflix Prize data

From:	"Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Greg Sabino Mullane" <greg(at)turnstep(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Netflix Prize data
Date:	2006-10-04 23:53:20
Message-ID:	21728.24.91.171.78.1160006000.squirrel@mail.mohawksoft.com
Lists:	pgsql-hackers

>
> "Greg Sabino Mullane" <greg(at)turnstep(dot)com> writes:
>
>> CREATE TABLE rating (
>> movie SMALLINT NOT NULL,
>> person INTEGER NOT NULL,
>> rating SMALLINT NOT NULL,
>> viewed DATE NOT NULL
>> );
>
> You would probably be better off putting the two smallints first followed by
> the integer and date. Otherwise both the integer and the date field will have
> an extra two bytes of padding wasting 4 bytes of space.
>
> If you reorder the fields that way you'll be down to 28 bytes of tuple header
> overhead and 12 bytes of data. There's actually another 4 bytes in the form of
> the line pointer so a total of 44 bytes per record. Ie, almost 73% of the disk
> i/o you're seeing is actually per-record overhead.
>
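
The arithmetic in the quoted message can be checked with a rough sketch like the one below. The column sizes and alignments (2 bytes for a smallint, 4 for an integer or date) and the 28-byte header and 4-byte line pointer are taken from the quote; the helper name padded_size is made up for illustration, and the sketch assumes the data area starts on an aligned boundary. It is not a measurement against a real server.

```python
# Sizes and alignments as described in the quoted message (assumed, not measured).
SMALLINT = (2, 2)   # (size in bytes, alignment in bytes)
INTEGER = (4, 4)
DATE = (4, 4)

TUPLE_HEADER = 28   # per-tuple header overhead quoted above
LINE_POINTER = 4    # per-tuple line pointer quoted above


def padded_size(columns):
    """Return the bytes used by the columns, including alignment padding.

    Hypothetical helper: assumes the first column starts at an aligned offset.
    """
    offset = 0
    for size, align in columns:
        offset += (-offset) % align  # pad up to the column's alignment
        offset += size
    return offset


# Column order from the CREATE TABLE in the thread: movie, person, rating, viewed.
original = padded_size([SMALLINT, INTEGER, SMALLINT, DATE])
# Suggested order: both smallints first, then the integer and the date.
reordered = padded_size([SMALLINT, SMALLINT, INTEGER, DATE])

total = TUPLE_HEADER + LINE_POINTER + reordered
overhead_pct = round(100 * (TUPLE_HEADER + LINE_POINTER) / total, 1)

print(original, reordered, total, overhead_pct)
```

Under these assumptions the original order needs 16 bytes of data and the reordered one 12, which matches the 4 bytes of padding and the 44-byte total mentioned in the quote.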

That's good advice; however, it is said that Netflix has more than 64K
movies, so while the test data may work with a smallint, I doubt the
overall system would.

The rating, however, is one char 1~9. Would making it a char(1) buy anything?

I wonder...

What if I started screwing around with movie ID and rating and packed them
into one int: one byte for rating, three bytes for movie ID? That could
reduce the data size by at least half a gig.
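
As a rough illustration of the packing idea, here is a minimal sketch that puts the rating in the low byte and the movie ID in the upper three bytes. The function names pack and unpack and the bit layout are assumptions made for this example; nothing in the thread specifies them.

```python
RATING_BITS = 8    # one byte for the rating
MOVIE_BITS = 24    # three bytes for the movie ID


def pack(movie_id, rating):
    """Combine movie ID and rating into one integer (hypothetical layout)."""
    if not 0 <= movie_id < (1 << MOVIE_BITS):
        raise ValueError("movie_id does not fit in three bytes")
    if not 0 <= rating < (1 << RATING_BITS):
        raise ValueError("rating does not fit in one byte")
    return (movie_id << RATING_BITS) | rating


def unpack(packed):
    """Split a packed value back into (movie_id, rating)."""
    return packed >> RATING_BITS, packed & ((1 << RATING_BITS) - 1)


packed = pack(70000, 5)
print(packed, unpack(packed))
```

One caveat with this layout: a movie ID of 2^23 or more produces a value above 2^31 - 1, which would not fit a signed 32-bit integer column without extra handling, so the usable range is smaller than the full three bytes unless the sign bit is dealt with.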

In response to

Re: Netflix Prize data at 2006-10-04 23:36:09 from Gregory Stark

Responses

Re: Netflix Prize data at 2006-10-05 00:18:11 from Tom Lane
