Re: Join query help

From: Michael Glaesemann <grzm(at)seespotcode(dot)net>
To: novice <user(dot)postgresql(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Join query help
Date: 2007-08-21 01:04:10
Message-ID: 4416C45B-CCC2-44AC-AA92-92500CF21BF5@seespotcode.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-sql


On Aug 18, 2007, at 0:35 , novice wrote:

> What query do I write to generate the following?
>
> week_no | count(record_id | count(observation_id) | sum(score_id)
> where = '1'
> 2007, 30 | 2 | 8 | 6
> 2007, 29 | 1 | 2 | 1

Okay: let's take a look at what you're trying to get:

first column is year and week of year (which might be better labeled
"week" rather than "week_no")
Checking the available date/time functions, I see extract can get at
both of these attributes of a date value. That could be helpful.
However, this splits the week and year into two separate values,
which are two things to keep track of. As well, those values are now
integers, which have little to do with dates. date_trunc allows you
to truncate timestamps to the week, returning a timestamp value. This
mean's we can look at timestamps with week-precision: and they're
still timestamps, which means we can rely on them to act as dates in
terms of ordering (Some would go ahead and format the year, week
column as text right a way using TO_CHAR and rely on string
comparisons for grouping and ordering, but there's no reason to do
this, and you can no longer handle the value easily as a datetime
value. If you need a specific output format, do it at the end or in
your middleware.)

If you're going to group records by week (rather than timestamp)
often, I'd go ahead and create a VIEW:

CREATE VIEW record_with_week AS
SELECT record_id
, date_trunc('week', record_date) AS record_week
FROM record;
SELECT *, extract(week from record_week) as week_number
test-# FROM record_with_week;
record_id | record_week | week_number
-----------+------------------------+-------------
1 | 2007-07-16 00:00:00-05 | 29
2 | 2007-07-23 00:00:00-05 | 30
3 | 2007-07-16 00:00:00-05 | 29

You're also looking for the count of distinct records and
observations per week, so let's do that:

SELECT record_week
, count(DISTINCT record_id) AS record_count
, count(DISTINCT observation_id) AS observation_count
FROM record_with_week
NATURAL JOIN observation
GROUP BY record_week;
record_week | record_count | observation_count
------------------------+--------------+-------------------
2007-07-16 00:00:00-05 | 2 | 6
2007-07-23 00:00:00-05 | 1 | 4
(2 rows)

Now here's where I started having trouble. I can't figure out how to
get 2 observations for week 29 (record_id 1 & 3) and 8 for week 30
(record_id 2). Assuming the data is wrong (which is admittedly a poor
assumption), I moved ahead.

I'm interpreting "sum(score_id) where = '1'" as the sum of scores for
observations of that week where score_description is 'SAFE' (note
that '1' is text, and it appears that your score_id column is an
integer). So, let's get the observations that were safe:

SELECT record_week, sum(score_id) as safe_score_sum
FROM record_with_week
NATURAL JOIN observation
NATURAL JOIN score
WHERE score_description = 'SAFE'
GROUP BY record_week;
record_week | safe_score_sum
------------------------+----------------
2007-07-23 00:00:00-05 | 3
2007-07-16 00:00:00-05 | 4
(2 rows)

Note I'm using score_description = 'SAFE' rather than score_id = 1,
as this is much more descriptive of what you're actually doing. It
makes the query easier to read as well.

Again, I can't get my numbers to agree with yours, so I'm probably
misinterpreting something, but I can't figure out another way to
interpret what you've described.

So, let's join this to the query:

SELECT record_week
, count(DISTINCT record_id) AS record_count
, count(DISTINCT observation_id) AS observation_count
, safe_score_sum
FROM record_with_week
NATURAL JOIN observation
NATURAL JOIN (
SELECT record_week, sum(score_id) as safe_score_sum
FROM record_with_week
NATURAL JOIN observation
NATURAL JOIN score
WHERE score_description = 'SAFE'
GROUP BY record_week
) safe_observation
GROUP BY record_week, safe_score_sum;
record_week | record_count | observation_count |
safe_score_sum
------------------------+--------------+-------------------
+----------------
2007-07-16 00:00:00-05 | 2 | 6
| 4
2007-07-23 00:00:00-05 | 1 | 4
| 3
(2 rows)

One thing that struck me as odd is that you're summing an ID column.
I'm guessing you're doing this as an attempt to count the total safe
observations, taking advantage of the coincidence that the score_id
is 1, so sum of observations with score_id = 1 is the same as the
count of observations where score_id is 1. If this is indeed what's
happening, I think I've got a better way to do it:

SELECT record_week
, count(DISTINCT record_id) AS record_count
, count(DISTINCT observation_id) AS observation_count
, count(DISTINCT safe_observation_id) as safe_observation_count
FROM record_with_week
NATURAL JOIN observation
NATURAL JOIN (
SELECT record_week
, observation_id as safe_observation_id
FROM record_with_week
NATURAL JOIN observation
NATURAL JOIN score
WHERE score_description = 'SAFE'
) safe_observation
GROUP BY record_week;
record_week | record_count | observation_count |
safe_observation_count
------------------------+--------------+-------------------
+------------------------
2007-07-16 00:00:00-05 | 2 | 6
| 4
2007-07-23 00:00:00-05 | 1 | 4
| 3
(2 rows)

And finally, formatting for output:

SELECT TO_CHAR(record_week, 'YYYY, IW') AS formatted_record_week
, count(DISTINCT record_id) AS record_count
, count(DISTINCT observation_id) AS observation_count
, count(DISTINCT safe_observation_id) as safe_observation_count
FROM record_with_week
NATURAL JOIN observation
NATURAL JOIN (
SELECT record_week
, observation_id as safe_observation_id
FROM record_with_week
NATURAL JOIN observation
NATURAL JOIN score
WHERE score_description = 'SAFE'
) safe_observation
GROUP BY record_week;
formatted_record_week | record_count | observation_count |
safe_observation_count
-----------------------+--------------+-------------------
+------------------------
2007, 29 | 2 | 6
| 4
2007, 30 | 1 | 4
| 3
(2 rows)

I've got data below so others may play along at home.

Does this help?

Michael Glaesemann
grzm seespotcode net

CREATE TABLE record
(
record_id INTEGER PRIMARY KEY
, record_date TIMESTAMP(0) WITH TIME ZONE
);

INSERT INTO record (record_id, record_date) VALUES
(1, '2007-07-23 11:30:37+10')
, (2, '2007-07-27 11:30:14+10')
, (3, '2007-07-17 13:15:03+10');

CREATE TABLE score
(
score_id INTEGER PRIMARY KEY
, score_description TEXT NOT NULL UNIQUE
);
INSERT INTO score (score_id, score_description) VALUES
(0, 'NA')
, (1, 'SAFE')
, (2, 'AT RISK');

CREATE TABLE observation
(
observation_id INTEGER PRIMARY KEY
, record_id INTEGER NOT NULL
, score_id INTEGER NOT NULL
REFERENCES score
);
INSERT INTO observation (observation_id, record_id, score_id) VALUES
(3240,1,1)
, (3239,1,1)
, (3238,1,2)
, (3237,1,1)
, (2872,2,1)
, (2869,2,2)
, (2870,2,1)
, (2871,2,1)
, (3218,3,2)
, (3217,3,1);

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message novice 2007-08-21 01:21:40 Re: Join query help
Previous Message novice 2007-08-21 00:52:46 Re: Join query help

Browse pgsql-sql by date

  From Date Subject
Next Message novice 2007-08-21 01:21:40 Re: Join query help
Previous Message novice 2007-08-21 00:52:46 Re: Join query help