Quick Links

Re: Update with last known location?

From:	James David Smith <james(dot)david(dot)smith(at)gmail(dot)com>
To:	Erik Darling <edarling80(at)gmail(dot)com>
Cc:	PGSQL-Novice <pgsql-novice(at)postgresql(dot)org>
Subject:	Re: Update with last known location?
Date:	2014-01-30 13:58:36
Message-ID:	CAMu32ACWq4hJcfANx_+91SRQ6hxs-nmcOdHgzUji-cnQVquaMg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-novice

All,

Here's a SQL fiddle of my problem:

http://sqlfiddle.com/#!15/77157

Thanks

James

On 30 January 2014 11:19, James David Smith <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> Hi Erik / all,
>
> I don't think that will work, as what happens if one of the people has
> two missing periods of time within their day?
>
> I've made a self-contained example of my problem below. Would you mind
> trying to produce in the code below what you think I should do? Or if
> anyone else fancies having a go then please do.
>
> I very much appreciate your help by the way. Thank you. I'm really at
> a loss with this. :-(
>
> James
> --------------------------------------
>
> DROP TABLE test_data;
>
> CREATE TABLE test_data(
> ppid integer,
> point_time timestamp without time zone,
> the_geom integer);
>
> INSERT INTO test_data VALUES
> ('1', '2012-01-01 07:00', '1'),
> ('1', '2012-01-01 07:01', '1'),
> ('1', '2012-01-01 07:02', '1'),
> ('1', '2012-01-01 07:03', NULL), -- null should be replaced with 1
> ('1', '2012-01-01 07:04', NULL), -- null should be replaced with 1
> ('1', '2012-01-01 07:05', '5'),
> ('1', '2012-01-01 07:06', '5'),
> ('1', '2012-01-01 07:07', '5'),
> ('1', '2012-01-01 07:08', NULL), -- null should be replaced with 5
> ('1', '2012-01-01 07:09', NULL), -- null should be replaced with 5
> ('1', '2012-01-01 07:10', NULL), -- null should be replaced with 5
> ('1', '2012-01-01 07:11', NULL), -- null should be replaced with 5
> ('2', '2013-05-02 07:12', '24'),
> ('2', '2013-05-02 07:13', '24'),
> ('2', '2013-05-02 07:14', '24'),
> ('2', '2013-05-02 07:15', NULL), -- null should be replaced with 24
> ('2', '2013-05-02 07:16', NULL), -- null should be replaced with 24
> ('2', '2013-05-02 07:17', '44'),
> ('2', '2013-05-02 07:18', '44'),
> ('2', '2013-05-02 07:19', NULL), -- null should be replaced with 44
> ('2', '2013-05-02 07:20', '4'),
> ('2', '2013-05-02 07:21', '4'),
> ('2', '2013-05-02 07:22', '4');
>
> WITH missing_geoms AS (
> SELECT ppid,
> point_time,
> the_geom
> FROM test_data
> WHERE the_geom IS NULL)
> ---
> ,filled_geoms AS (
> SELECT ppid,
> point_time,
> the_geom
> FROM test_data
> WHERE the_geom IS NOT NULL)
> ---
> ,partitioned AS (
> SELECT missing_geoms.ppid as missing_geoms_ppid,
> missing_geoms.point_time as missing_geoms_point_time,
> missing_geoms.the_geom as missing_geoms_the_geom,
> filled_geoms.ppid as filled_geoms_ppid,
> filled_geoms.point_time as filled_geoms_point_time,
> filled_geoms.the_geom as filled_geoms_the_geom
> FROM missing_geoms
> LEFT JOIN filled_geoms
> ON filled_geoms.point_time < missing_geoms.point_time
> AND filled_geoms.ppid = missing_geoms.ppid
> ORDER BY missing_geoms_ppid,
> missing_geoms_point_time)
> ---
> SELECT *
> FROM partitioned;
>
>
>
>
>
> On 29 January 2014 22:10, Erik Darling <edarling80(at)gmail(dot)com> wrote:
>> Hi James,
>>
>> Yeah, that looks like the right place to me.
>>
>>
>> On Wed, Jan 29, 2014 at 1:14 PM, James David Smith
>> <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
>>>
>>> Hi Erik,
>>>
>>> Do you mean in this section of the SQL?
>>> .....
>>> filled_geoms AS (
>>> SELECT
>>> ppid,
>>> point_time,
>>> the_geom
>>> FROM
>>> hybrid_location
>>> WHERE
>>> the_geom IS NOT NULL)
>>> ...
>>>
>>> Thanks
>>>
>>> James
>>>
>>> On 29 January 2014 17:57, Erik Darling <edarling80(at)gmail(dot)com> wrote:
>>> > Hi James,
>>> >
>>> > I think you're still stuck with sort of unnecessary ('too much' ) data
>>> > coming from the right side of your left join. If so, one option I would
>>> > consider is using DENSE_RANK() the way you use ROW_NUMBER(), in the
>>> > filled_geoms table. If you partition by id and order by date descending,
>>> > you
>>> > can do an additional d_rank = 1 filter to only get the most recent
>>> > activity.
>>> > I believe this is what you want to set your NULL values to, no?
>>> >
>>> >
>>> >
>>> >
>>> > On Wed, Jan 29, 2014 at 12:41 PM, James David Smith
>>> > <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
>>> >>
>>> >> On 29 January 2014 16:02, Erik Darling <edarling80(at)gmail(dot)com> wrote:
>>> >> > I would re-suggest using a CTE to contain each dataset to ensure your
>>> >> > selects are distilling them correctly, and then using a final query
>>> >> > to
>>> >> > join
>>> >> > them. You can then either update your data directly through the
>>> >> > CTE(s),
>>> >> > or
>>> >> > insert the results to another table to do some further testing. I
>>> >> > think
>>> >> > you'll find this method presents the data a bit more ergonomically
>>> >> > for
>>> >> > analysis.
>>> >> >
>>> >> > http://www.postgresql.org/docs/9.3/static/queries-with.html
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
>>> >> > <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
>>> >> >>
>>> >> >> Hi Erik/all,
>>> >> >>
>>> >> >> I just tried that, but it's tricky. The 'extra' data is indeed
>>> >> >> coming
>>> >> >> from the right side of the join, but it's hard to select only the
>>> >> >> max
>>> >> >> from it. Maybe it's possible but I've not managed to do it. Here is
>>> >> >> where I am, which is so very close.
>>> >> >>
>>> >> >> SELECT
>>> >> >> DISTINCT(a.ppid, a.point_time, a.the_geom) as
>>> >> >> row_that_needs_geom_updating,
>>> >> >> max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time) as
>>> >> >> last_known_position_time
>>> >> >> FROM
>>> >> >> test a
>>> >> >> INNER JOIN
>>> >> >> (SELECT ppid,
>>> >> >> point_time,
>>> >> >> the_geom
>>> >> >> FROM test
>>> >> >> WHERE the_geom IS NOT NULL) b
>>> >> >> ON b.point_time < a.point_time
>>> >> >> AND a.ppid = b.ppid
>>> >> >> WHERE a.the_geom IS NULL;
>>> >> >>
>>> >> >> If you see attached screen-print, the output is the rows that I
>>> >> >> want.
>>> >> >> However I've had to use DISTINCT to stop the duplication. Also I've
>>> >> >> not managed to pull through 'the_geom' from the JOIN. I'm not sure
>>> >> >> how. Anyone?
>>> >> >>
>>> >> >> But it's kind of working. :-)
>>> >> >>
>>> >> >> Worst case if I can't figure out how to solve this in one query I'll
>>> >> >> have to store the result of the above, and then use it as a basis
>>> >> >> for
>>> >> >> another query I think.
>>> >> >>
>>> >> >> Thanks
>>> >> >>
>>> >> >> James
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On 29 January 2014 12:56, Erik Darling <edarling80(at)gmail(dot)com> wrote:
>>> >> >> > I would try partitioning the second time you call row_number,
>>> >> >> > perhaps
>>> >> >> > by
>>> >> >> > ID,
>>> >> >> > and then selecting the MAX() from that, since I think the too much
>>> >> >> > data
>>> >> >> > you're referring to is coming from the right side of your join.
>>> >> >> >
>>> >> >> > On Jan 29, 2014 7:23 AM, "James David Smith"
>>> >> >> > <james(dot)david(dot)smith(at)gmail(dot)com>
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> On 28 January 2014 23:15, Gavin Flower
>>> >> >> >> <GavinFlower(at)archidevsys(dot)co(dot)nz>
>>> >> >> >> wrote:
>>> >> >> >> > On 29/01/14 11:00, Kevin Grittner wrote:
>>> >> >> >> >>
>>> >> >> >> >> James David Smith <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
>>> >> >> >> >>
>>> >> >> >> >>> Given the data is so large I don't want to be taking the data
>>> >> >> >> >>> out
>>> >> >> >> >>> to a CSV or whatever and then loading it back in. I'd like to
>>> >> >> >> >>> do
>>> >> >> >> >>> this within the database using SQL. I thought I would be able
>>> >> >> >> >>> to
>>> >> >> >> >>> do this using a LOOP to be honest.
>>> >> >> >> >>
>>> >> >> >> >> I would be amazed if you couldn't do this with a single
>>> >> >> >> >> UPDATE
>>> >> >> >> >> statement. I've generally found declarative forms of such
>>> >> >> >> >> work
>>> >> >> >> >> to
>>> >> >> >> >> be at least one order of magnitude faster than going to either
>>> >> >> >> >> a
>>> >> >> >> >> PL
>>> >> >> >> >> or a script approach. I would start by putting together a
>>> >> >> >> >> SELECT
>>> >> >> >> >> query using window functions and maybe a CTE or two to list
>>> >> >> >> >> all
>>> >> >> >> >> the
>>> >> >> >> >> primary keys which need updating and the new values they
>>> >> >> >> >> should
>>> >> >> >> >> have. Once that SELECT was looking good, I would put it in
>>> >> >> >> >> the
>>> >> >> >> >> FROM clause of an UPDATE statement.
>>> >> >> >> >>
>>> >> >> >> >> That should work, but if you are updating a large percentage
>>> >> >> >> >> of
>>> >> >> >> >> the
>>> >> >> >> >> table, I would go one step further before running this against
>>> >> >> >> >> the
>>> >> >> >> >> production tables. I would put a LIMIT on the above-mentioned
>>> >> >> >> >> SELECT of something like 10000 rows, and script a loop that
>>> >> >> >> >> alternates between the UPDATE and a VACUUM ANALYZE on the
>>> >> >> >> >> table.
>>> >> >> >> >>
>>> >> >> >> >> --
>>> >> >> >> >> Kevin Grittner
>>> >> >> >> >> EDB: http://www.enterprisedb.com
>>> >> >> >> >> The Enterprise PostgreSQL Company
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> > James, you might consider dropping as many indexes on the table
>>> >> >> >> > as
>>> >> >> >> > you
>>> >> >> >> > safely can, and rebuilding them after the mass update. If you
>>> >> >> >> > have
>>> >> >> >> > lots
>>> >> >> >> > of
>>> >> >> >> > such indexes, you will find this apprtoach to be a lot faster.
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > Cheers,
>>> >> >> >> > Gavin
>>> >> >> >>
>>> >> >> >> Hi all,
>>> >> >> >>
>>> >> >> >> Thanks for your help and assistance. I think that window
>>> >> >> >> functions,
>>> >> >> >> and inparticular the PARTITION function, is 100% the way to go.
>>> >> >> >> I've
>>> >> >> >> been concentrating on a SELECT statement for now and am close but
>>> >> >> >> not
>>> >> >> >> quite close enough. The below query gets all the data I want, but
>>> >> >> >> *too* much. What I've essentially done is:
>>> >> >> >>
>>> >> >> >> - Select all the rows that don't have any geom information
>>> >> >> >> - Join them with all rows before this point that *do* have geom
>>> >> >> >> information.
>>> >> >> >> - Before doing this join, use partition to generate row numbers.
>>> >> >> >>
>>> >> >> >> The attached screen grab shows the result of my query below.
>>> >> >> >> Unfortunately this is generating alot of joins that I don't want.
>>> >> >> >> This
>>> >> >> >> won't be practical when doing it with 75,000 people.
>>> >> >> >>
>>> >> >> >> Thoughts and code suggestions very much appreciated... if needed
>>> >> >> >> I
>>> >> >> >> could put together some SQL to create an example table?
>>> >> >> >>
>>> >> >> >> Thanks
>>> >> >> >>
>>> >> >> >> SELECT row_number() OVER (PARTITION BY test.point_time ORDER
>>> >> >> >> BY
>>> >> >> >> test.point_time) as test_row,
>>> >> >> >> test.ppid as test_ppid,
>>> >> >> >> test.point_time as test_point_time,
>>> >> >> >> test.the_geom as test_the_geom,
>>> >> >> >> a.ppid as a_ppid,
>>> >> >> >> a.point_time as a_point_time,
>>> >> >> >> a.the_geom as a_the_geom,
>>> >> >> >> a.a_row
>>> >> >> >> FROM test
>>> >> >> >> LEFT JOIN (
>>> >> >> >> SELECT the_geom,
>>> >> >> >> ppid,
>>> >> >> >> point_time,
>>> >> >> >> row_number() OVER (ORDER BY ppid, point_time) as
>>> >> >> >> a_row
>>> >> >> >> FROM test
>>> >> >> >> WHERE the_geom IS NOT NULL) a
>>> >> >> >> ON a.point_time < test.point_time
>>> >> >> >> AND a.ppid = test.ppid
>>> >> >> >> WHERE test.the_geom IS NULL
>>> >> >> >> ORDER BY test.point_time)
>>> >> >> >>
>>> >>
>>> >>
>>> >> Hi Erik / all,
>>> >>
>>> >> So I think I've managed to re-write my queries using CTEs. The below
>>> >> code now does get me the data that I want from this. But to do so it
>>> >> is going to create a frankly huge table in the bit of the SQL where it
>>> >> makes the table called 'partitioned'. My rough guess is that it'll
>>> >> have to make a table of about 100 billion rows in order to get data I
>>> >> need ( about 108 million rows).
>>> >>
>>> >> Could someone please glance through it for me and suggest how to write
>>> >> it more efficiently?
>>> >>
>>> >> Thanks
>>> >>
>>> >> James
>>> >>
>>> >> WITH missing_geoms AS (
>>> >> SELECT ppid,
>>> >> point_time,
>>> >> the_geom
>>> >> FROM hybrid_location
>>> >> WHERE the_geom IS NULL)
>>> >> -----------------
>>> >> ,filled_geoms AS (
>>> >> SELECT ppid,
>>> >> point_time,
>>> >> the_geom
>>> >> FROM hybrid_location
>>> >> WHERE the_geom IS NOT NULL)
>>> >> ----------------
>>> >> ,partitioned AS (
>>> >> SELECT missing_geoms.ppid,
>>> >> missing_geoms.point_time,
>>> >> missing_geoms.the_geom,
>>> >> filled_geoms.ppid,
>>> >> filled_geoms.point_time,
>>> >> filled_geoms.the_geom,
>>> >> row_number() OVER ( PARTITION BY missing_geoms.ppid,
>>> >> missing_geoms.point_time
>>> >> ORDER BY missing_geoms.ppid,
>>> >> missing_geoms.point_time,
>>> >> filled_geoms.ppid,
>>> >> filled_geoms.point_time DESC)
>>> >> FROM missing_geoms
>>> >> LEFT JOIN filled_geoms
>>> >> ON filled_geoms.point_time < missing_geoms.point_time
>>> >> AND filled_geoms.ppid = missing_geoms.ppid)
>>> >> --------------
>>> >> SELECT *
>>> >> FROM partitioned
>>> >> WHERE row_number = 1;
>>> >>
>>> >> James
>>> >
>>> >
>>
>>

In response to

Re: Update with last known location? at 2014-01-30 11:19:30 from James David Smith

Responses

Re: Update with last known location? at 2014-01-30 15:24:24 from Erik Darling

Browse pgsql-novice by date

	From	Date	Subject
Next Message	Erik Darling	2014-01-30 15:24:24	Re: Update with last known location?
Previous Message	James David Smith	2014-01-30 11:19:30	Re: Update with last known location?