
Re: More of a SQL question, I guess.

From: Frank Bax <fbax(at)sympatico(dot)ca>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: More of a SQL question, I guess.
Date: 2007-08-14 14:36:26
Message-ID: 46C1BDEA.6050209@sympatico.ca
Lists: pgsql-novice
Howard Eglowstein wrote:
> I have a database with 150 fields per row. For historical reasons, it's 
> broken into three databases, data_a, data_b and data_c. One of the 
> fields is a timestamp and one of them is a serial number that came in 
> from a remote machine that reports its status every day.
> 
> If I use a command like "SELECT MAX(Timestamp) FROM data_a WHERE field1 
> = '0001'", I can get back the time of the latest report. I can then issue 
> the command "SELECT * FROM data_a, data_b, data_c WHERE 
> data_a.id=data_b.id AND data_a.id=data_c.id AND field1='0001'" to get the 
> 150 fields for that report. It works fine, but it takes a while to 
> respond when the database is the better part of a million records.
> 
> If it were just one serial number, the two queries would be okay because 
> It really only takes a minute. The problem I have is that we're hoping 
> to have thousands of machines in the field (this is just a test 
> database) and clearly this approach won't work for thousands of serial 
> numbers. The second method I tried was to simply use "SELECT * FROM 
> data_a....." to get all million records, and have my C code look for 
> serial numbers in each line and keep the latest by timestamp. That takes 
> about as long as doing the first procedure 3 times, but it gives me the 
> latest data for all of the serial numbers in the system. That's 
> perfectly cool, except that it won't scale nicely. If the web code that 
> does the search isn't on the same machine that holds the data, we'll 
> have to ship gigabytes of data over the network for each search.
> 
> What I'd *like* is something that uses groups and MAX() to do this in 
> one SQL command. It would group the data by the serial number (field1), 
> find the record in each group with the maximum timestamp value and 
> return all 150 fields. Is this possible?  I thought of using unions, but 
> I think I have to issue a pretty long command for each group and the PG 
> buffers probably will max out after some relatively small number, no?
> 
> Any thoughts would be appreciated.


Start with

SELECT field1,MAX(timestamp) as ts FROM data_a GROUP BY field1

to get the latest timestamp for each value of field1; then use a
self-join to get the id (and the other columns) from those rows.  At
this point you have a problem if a serial number has duplicate
timestamp values.

SELECT id, a.field1, ts FROM data_a JOIN (SELECT field1, MAX(timestamp)
AS ts FROM data_a GROUP BY field1) AS a
ON a.field1 = data_a.field1 AND a.ts = data_a.timestamp;
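As an aside, PostgreSQL's DISTINCT ON extension can sidestep the
self-join (and the duplicate-timestamp ambiguity) entirely; a minimal
sketch, assuming the same data_a table and columns as above:

```sql
-- Keep only the newest row per serial number (PostgreSQL-specific).
-- Within each field1 group, the first row of the ORDER BY wins;
-- ties on timestamp are broken arbitrarily unless you extend ORDER BY.
SELECT DISTINCT ON (field1) *
FROM data_a
ORDER BY field1, timestamp DESC;
```

The trade-off is portability: DISTINCT ON is not standard SQL, so the
GROUP BY/join form travels better to other databases.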

Call this select SQL1 and join it to your longer SQL statement:

SELECT * FROM data_a
JOIN (SQL1) AS aa ON aa.id = data_a.id
JOIN data_b ON data_a.id = data_b.id
JOIN data_c ON data_a.id = data_c.id;
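Spelled out with SQL1 inlined (a sketch assuming the id, field1 and
timestamp columns used above), the whole thing becomes something like:

```sql
-- Latest report per serial number, with all 150 fields, in one query.
SELECT data_a.*, data_b.*, data_c.*
FROM data_a
JOIN (SELECT field1, MAX(timestamp) AS ts
      FROM data_a
      GROUP BY field1) AS aa
  ON aa.field1 = data_a.field1
 AND aa.ts     = data_a.timestamp
JOIN data_b ON data_a.id = data_b.id
JOIN data_c ON data_a.id = data_c.id;
```

This is one round trip, and only the latest row for each serial number
ever crosses the network, which is what you want for the web-server case.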



It seems to me the underlying question here comes up often.  Might I
suggest that we add a new question to section 4 of FAQ:
Q: How can I retrieve the other fields in the same row as the result
returned by an aggregate function such as max()?
A:  Is my "SQL1" the best/only answer to this question?


