Re: Update with last known location?

From: Erik Darling <edarling80(at)gmail(dot)com>
To: James David Smith <james(dot)david(dot)smith(at)gmail(dot)com>
Cc: PGSQL-Novice <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Update with last known location?
Date: 2014-01-29 22:10:09
Message-ID: CAO+EYw+Mw0CvmBLZ1D8ypK=30Ns0Yp6=LEH4tmeUtA2S7x+6KQ@mail.gmail.com
Lists: pgsql-novice

Hi James,

Yeah, that looks like the right place to me.
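
[For anyone following the thread: a minimal sketch of the change Erik suggested, using the `hybrid_location` table and column names from James's CTEs below. This is untested against James's data; `d_rank` is an illustrative alias, not from the original mails.]

```sql
-- Sketch: rank filled rows per ppid, newest first, then keep only the
-- top-ranked row for each person with a d_rank = 1 filter later on.
filled_geoms AS (
    SELECT ppid,
           point_time,
           the_geom,
           DENSE_RANK() OVER (PARTITION BY ppid
                              ORDER BY point_time DESC) AS d_rank
    FROM hybrid_location
    WHERE the_geom IS NOT NULL)
-- ... later: WHERE d_rank = 1 keeps only the most recent known position
```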

On Wed, Jan 29, 2014 at 1:14 PM, James David Smith <
james(dot)david(dot)smith(at)gmail(dot)com> wrote:

> Hi Erik,
>
> Do you mean in this section of the SQL?
> .....
> filled_geoms AS (
> SELECT
> ppid,
> point_time,
> the_geom
> FROM
> hybrid_location
> WHERE
> the_geom IS NOT NULL)
> ...
>
> Thanks
>
> James
>
> On 29 January 2014 17:57, Erik Darling <edarling80(at)gmail(dot)com> wrote:
> > Hi James,
> >
> > I think you're still stuck with some unnecessary ('too much') data
> > coming from the right side of your left join. If so, one option I would
> > consider is using DENSE_RANK() the way you use ROW_NUMBER(), in the
> > filled_geoms table. If you partition by id and order by date descending,
> > you can do an additional d_rank = 1 filter to only get the most recent
> > activity.
> > I believe this is what you want to set your NULL values to, no?
> >
> >
> >
> >
> > On Wed, Jan 29, 2014 at 12:41 PM, James David Smith
> > <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >>
> >> On 29 January 2014 16:02, Erik Darling <edarling80(at)gmail(dot)com> wrote:
> >> > I would re-suggest using a CTE to contain each dataset to ensure your
> >> > selects are distilling them correctly, and then using a final query to
> >> > join them. You can then either update your data directly through the
> >> > CTE(s), or insert the results to another table to do some further
> >> > testing. I think you'll find this method presents the data a bit more
> >> > ergonomically for analysis.
> >> >
> >> > http://www.postgresql.org/docs/9.3/static/queries-with.html
> >> >
> >> >
> >> >
> >> > On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
> >> > <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >> >>
> >> >> Hi Erik/all,
> >> >>
> >> >> I just tried that, but it's tricky. The 'extra' data is indeed coming
> >> >> from the right side of the join, but it's hard to select only the max
> >> >> from it. Maybe it's possible but I've not managed to do it. Here is
> >> >> where I am, which is so very close.
> >> >>
> >> >> SELECT
> >> >> DISTINCT(a.ppid, a.point_time, a.the_geom) as
> >> >> row_that_needs_geom_updating,
> >> >> max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time) as
> >> >> last_known_position_time
> >> >> FROM
> >> >> test a
> >> >> INNER JOIN
> >> >> (SELECT ppid,
> >> >> point_time,
> >> >> the_geom
> >> >> FROM test
> >> >> WHERE the_geom IS NOT NULL) b
> >> >> ON b.point_time < a.point_time
> >> >> AND a.ppid = b.ppid
> >> >> WHERE a.the_geom IS NULL;
> >> >>
> >> >> If you see attached screen-print, the output is the rows that I want.
> >> >> However I've had to use DISTINCT to stop the duplication. Also I've
> >> >> not managed to pull through 'the_geom' from the JOIN. I'm not sure
> >> >> how. Anyone?
> >> >>
> >> >> But it's kind of working. :-)
> >> >>
> >> >> Worst case if I can't figure out how to solve this in one query I'll
> >> >> have to store the result of the above, and then use it as a basis for
> >> >> another query I think.
> >> >>
> >> >> Thanks
> >> >>
> >> >> James
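
[A hedged sketch of one way to pull `the_geom` through the join in James's query above — untested; `first_value()` over the same partition as the `max()` picks the geometry belonging to that latest earlier fix:]

```sql
-- Because first_value() is constant within each (ppid, point_time)
-- partition, DISTINCT then collapses the duplicate join rows.
SELECT DISTINCT
       a.ppid,
       a.point_time,
       first_value(b.the_geom) OVER (PARTITION BY a.ppid, a.point_time
                                     ORDER BY b.point_time DESC) AS last_known_geom
FROM test a
INNER JOIN test b
        ON b.ppid = a.ppid
       AND b.point_time < a.point_time
       AND b.the_geom IS NOT NULL
WHERE a.the_geom IS NULL;
```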
> >> >>
> >> >>
> >> >>
> >> >> On 29 January 2014 12:56, Erik Darling <edarling80(at)gmail(dot)com> wrote:
> >> >> > I would try partitioning the second time you call row_number,
> >> >> > perhaps by ID, and then selecting the MAX() from that, since I
> >> >> > think the "too much data" you're referring to is coming from the
> >> >> > right side of your join.
> >> >> >
> >> >> > On Jan 29, 2014 7:23 AM, "James David Smith"
> >> >> > <james(dot)david(dot)smith(at)gmail(dot)com>
> >> >> > wrote:
> >> >> >>
> >> >> >> On 28 January 2014 23:15, Gavin Flower
> >> >> >> <GavinFlower(at)archidevsys(dot)co(dot)nz>
> >> >> >> wrote:
> >> >> >> > On 29/01/14 11:00, Kevin Grittner wrote:
> >> >> >> >>
> >> >> >> >> James David Smith <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >> >> >> >>
> >> >> >> >>> Given the data is so large I don't want to be taking the data
> >> >> >> >>> out to a CSV or whatever and then loading it back in. I'd like
> >> >> >> >>> to do this within the database using SQL. I thought I would be
> >> >> >> >>> able to do this using a LOOP to be honest.
> >> >> >> >>
> >> >> >> >> I would be amazed if you couldn't do this with a single UPDATE
> >> >> >> >> statement. I've generally found declarative forms of such work
> >> >> >> >> to be at least one order of magnitude faster than going to
> >> >> >> >> either a PL or a script approach. I would start by putting
> >> >> >> >> together a SELECT query using window functions and maybe a CTE
> >> >> >> >> or two to list all the primary keys which need updating and the
> >> >> >> >> new values they should have. Once that SELECT was looking good,
> >> >> >> >> I would put it in the FROM clause of an UPDATE statement.
> >> >> >> >>
> >> >> >> >> That should work, but if you are updating a large percentage of
> >> >> >> >> the table, I would go one step further before running this
> >> >> >> >> against the production tables. I would put a LIMIT on the
> >> >> >> >> above-mentioned SELECT of something like 10000 rows, and script
> >> >> >> >> a loop that alternates between the UPDATE and a VACUUM ANALYZE
> >> >> >> >> on the table.
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Kevin Grittner
> >> >> >> >> EDB: http://www.enterprisedb.com
> >> >> >> >> The Enterprise PostgreSQL Company
> >> >> >> >>
> >> >> >> >>
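
[A sketch of the UPDATE-from-SELECT shape Kevin describes, assembled from the table and column names used elsewhere in this thread (`hybrid_location`, `ppid`, `point_time`, `the_geom`) — an illustration, not code from the original mails:]

```sql
-- CTE lists, for each row missing a geometry, the latest earlier fix;
-- the UPDATE then joins back on the primary key columns.
WITH last_known AS (
    SELECT m.ppid,
           m.point_time,
           f.the_geom,
           row_number() OVER (PARTITION BY m.ppid, m.point_time
                              ORDER BY f.point_time DESC) AS rn
    FROM hybrid_location m
    JOIN hybrid_location f
      ON f.ppid = m.ppid
     AND f.point_time < m.point_time
     AND f.the_geom IS NOT NULL
    WHERE m.the_geom IS NULL
)
UPDATE hybrid_location h
SET    the_geom = lk.the_geom
FROM   last_known lk
WHERE  lk.rn = 1
  AND  h.ppid = lk.ppid
  AND  h.point_time = lk.point_time;
```

For Kevin's batching variant, one would add a LIMIT to the CTE's select and loop the UPDATE and a VACUUM ANALYZE from a script.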
> >> >> >> > James, you might consider dropping as many indexes on the table
> >> >> >> > as you safely can, and rebuilding them after the mass update. If
> >> >> >> > you have lots of such indexes, you will find this approach to be
> >> >> >> > a lot faster.
> >> >> >> >
> >> >> >> >
> >> >> >> > Cheers,
> >> >> >> > Gavin
> >> >> >>
> >> >> >> Hi all,
> >> >> >>
> >> >> >> Thanks for your help and assistance. I think that window
> >> >> >> functions, and in particular PARTITION BY, are 100% the way to go.
> >> >> >> I've been concentrating on a SELECT statement for now and am close
> >> >> >> but not quite close enough. The below query gets all the data I
> >> >> >> want, but *too* much. What I've essentially done is:
> >> >> >>
> >> >> >> - Select all the rows that don't have any geom information
> >> >> >> - Join them with all rows before this point that *do* have geom
> >> >> >> information.
> >> >> >> - Before doing this join, use partition to generate row numbers.
> >> >> >>
> >> >> >> The attached screen grab shows the result of my query below.
> >> >> >> Unfortunately this is generating a lot of joins that I don't want.
> >> >> >> This won't be practical when doing it with 75,000 people.
> >> >> >>
> >> >> >> Thoughts and code suggestions very much appreciated... if needed I
> >> >> >> could put together some SQL to create an example table?
> >> >> >>
> >> >> >> Thanks
> >> >> >>
> >> >> >> SELECT row_number() OVER (PARTITION BY test.point_time ORDER BY
> >> >> >> test.point_time) as test_row,
> >> >> >> test.ppid as test_ppid,
> >> >> >> test.point_time as test_point_time,
> >> >> >> test.the_geom as test_the_geom,
> >> >> >> a.ppid as a_ppid,
> >> >> >> a.point_time as a_point_time,
> >> >> >> a.the_geom as a_the_geom,
> >> >> >> a.a_row
> >> >> >> FROM test
> >> >> >> LEFT JOIN (
> >> >> >> SELECT the_geom,
> >> >> >> ppid,
> >> >> >> point_time,
> >> >> >> row_number() OVER (ORDER BY ppid, point_time) as a_row
> >> >> >> FROM test
> >> >> >> WHERE the_geom IS NOT NULL) a
> >> >> >> ON a.point_time < test.point_time
> >> >> >> AND a.ppid = test.ppid
> >> >> >> WHERE test.the_geom IS NULL
> >> >> >> ORDER BY test.point_time;
> >> >> >>
> >>
> >>
> >> Hi Erik / all,
> >>
> >> So I think I've managed to re-write my queries using CTEs. The code
> >> below now gets me the data that I want. But to do so it is going to
> >> create a frankly huge table in the part of the SQL that makes the
> >> table called 'partitioned'. My rough guess is that it'll have to make
> >> a table of about 100 billion rows in order to get the data I need
> >> (about 108 million rows).
> >>
> >> Could someone please glance through it for me and suggest how to write
> >> it more efficiently?
> >>
> >> Thanks
> >>
> >> James
> >>
> >> WITH missing_geoms AS (
> >> SELECT ppid,
> >> point_time,
> >> the_geom
> >> FROM hybrid_location
> >> WHERE the_geom IS NULL)
> >> -----------------
> >> ,filled_geoms AS (
> >> SELECT ppid,
> >> point_time,
> >> the_geom
> >> FROM hybrid_location
> >> WHERE the_geom IS NOT NULL)
> >> ----------------
> >> ,partitioned AS (
> >> SELECT missing_geoms.ppid,
> >> missing_geoms.point_time,
> >> missing_geoms.the_geom,
> >> filled_geoms.ppid AS filled_ppid,
> >> filled_geoms.point_time AS filled_point_time,
> >> filled_geoms.the_geom AS filled_the_geom,
> >> row_number() OVER ( PARTITION BY missing_geoms.ppid,
> >> missing_geoms.point_time
> >> ORDER BY filled_geoms.point_time DESC)
> >> FROM missing_geoms
> >> LEFT JOIN filled_geoms
> >> ON filled_geoms.point_time < missing_geoms.point_time
> >> AND filled_geoms.ppid = missing_geoms.ppid)
> >> --------------
> >> SELECT *
> >> FROM partitioned
> >> WHERE row_number = 1;
> >>
> >> James
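
[One possible leaner formulation of the query above, offered as an untested sketch against the same `hybrid_location` table: PostgreSQL's DISTINCT ON keeps exactly one row per (ppid, point_time) without the row_number() machinery.]

```sql
-- For each row missing a geometry, keep only its latest earlier fix:
-- ORDER BY ... f.point_time DESC makes DISTINCT ON pick that row.
SELECT DISTINCT ON (m.ppid, m.point_time)
       m.ppid,
       m.point_time,
       f.the_geom AS last_known_geom
FROM   hybrid_location m
JOIN   hybrid_location f
  ON   f.ppid = m.ppid
 AND   f.point_time < m.point_time
 AND   f.the_geom IS NOT NULL
WHERE  m.the_geom IS NULL
ORDER BY m.ppid, m.point_time, f.point_time DESC;
```

On 9.3 a LEFT JOIN LATERAL subquery with ORDER BY f.point_time DESC LIMIT 1 is another option; with an index on (ppid, point_time) it can avoid generating every missing/filled pair.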
> >
> >
>
