Re: JOIN column maximum

From: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: JOIN column maximum
Date: 2012-01-06 21:56:17
Message-ID: CANnCtnJ+-N-hKVBEKo6THZff7gwyNx-EDALQu5BgH+zSM+R9VA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Jan 5, 2012 at 10:19 PM, Darren Duncan <darren(at)darrenduncan(dot)net>
wrote:

>
> This all being said, 23K values per row just sounds wrong, and I can't
> imagine any census forms having that many details.
>
> Do you, by chance, have multiple values of the same type that are in
> different fields, eg telephone_1, telephone_2 or child_1, child_2 etc? You
> should take any of those and collect them into array-typed fields, or
> separate tables with just telephone or child columns. Or do you say have a
> set of coordinates in separate fields? Or you may have other kinds of
> redundancy within single rows that are best normalized into separate rows.
>
> With 23K values, these probably have many mutual associations, and you
> could split that table into a bunch of other ones where columns that relate
> more closely together are collected.
>
> My question is already answered, so this is mostly for anyone curious
about Census data, and if anyone wants to add to this, feel free.

You're right that no census form has that many details. The American
Community Survey replaced the old Census long form, so it does have more
details than the form sent out for the 2010 Census, but not 23,000
questions. It might ask, e.g. income, sex, and how you travel to work. But
the forms are private, so that data (the so-called microdata) is not
released in its entirety. What I am working with is called the summary
file, which presents the data in aggregate. That means you have an income
table with 16 income classes, plus a total column. Then you have 9 more
tables which show the same income classes by 9 racial and ethnic
categories, for a total of 153 more columns. Then you also have a table
which crosses 9 income classes by 5 mobility statuses (living in same
house, moved within county, moved within state, etc.) for a total of 55
columns. Then you have income classes crossed with sex, income classes
crossed with mode of transportation to work, sex crossed with mode of
transportation to work, etc. When all is said and done, you have 23,000
columns.

Believe me, I would all love to be working directly with the microdata. All
of these different ways of slicing and dicing the categories are basically
how the Census Bureau provides as much detail as possible without providing
so much detail that privacy would be compromised (i.e., enough information
is available that specific individuals could be identified). That
inevitably leads to a great deal of redundancy in the data, since the same
individuals are just being counted in different groups in different tables.

Given all that, one could still take the data that came from Census and try
to normalize it and organize it, but my main goal given the size of the
dataset is to keep it as similar as possible to the source, so that a
researcher familiar with the Census data can work with our database using
the Census documentation without our having to produce a new set of
documentation. The way I had done that was to store the sequences (which
are merely a data dissemination convenience, and have no relationship to
the logic of the data) in Postgres tables, and make the subject tables
(which do represent logical groupings of data) into views. I'm thinking
about making the sequences into array columns. The subject tables would
still be views.

--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Marko Kreen 2012-01-06 22:27:14 Re: URGENT: temporary table not recognized?
Previous Message Filip Rembiałkowski 2012-01-06 21:45:35 Re: please help understand freeing shared buffers