MySQL greatest-n-per-group trouble

2019-02-27 01:37发布

问题:

Hey everyone. I believe this is a 'greatest-n-per-group' question but even after looking at several questions on StackOverflow, I'm unsure how to apply this to my situation...

I'm using a MySQL database and have a basic blog-type system set up about Computer Applications... The tables look like this:

POSTS
post_id
post_created
post_type      -- could be article, review, feature, whatever
post_status    -- 'a' approved or 'd' for draft

APPS
app_id 
app_name
app_platform   -- Windows, linux, unix, etc..

APP_TO_POST    -- links my posts to its relevant application
atp_id
atp_app_id
atp_post_id

I'm using the following basic query to pull all articles for the application with the name 'Photoshop' where the post type is an 'Article' and the status of the article is 'a' for approved:

SELECT apps.app_name, apps.app_platform, posts.post_created, posts.post_id
FROM apps
JOIN app_to_post ON app_to_post.atp_app_id = apps.app_id
JOIN posts ON app_to_post.atp_post_id = posts.post_id
WHERE apps.app_name = 'Photoshop'
AND 
posts.post_type = 'Article'
AND
posts.post_status = 'a'

Which gets me these expected results:

app_name    app_platform   post_created      post_id
Photoshop   Windows        Oct. 20th, 2009   1
Photoshop   Windows        Dec. 1, 2009      3
Photoshop   Macintosh      Nov. 10th, 2009   2

Would anyone be able to lend a hand on how I could alter that query to only pull the most recent article per application platform? So for example, I'd like my results to look like this:

app_name    app_platform   post_created      post_id
Photoshop   Windows        Dec. 1, 2009      3
Photoshop   Macintosh      Nov. 10th, 2009   2

And omit one of the 'Photoshop Windows' articles because it isn't the most recent one.

If I simply tack on a MAX(post_created) and a GROUP BY app_platform my results don't always group correctly. From how I understand it I need to perform some kind of inner join of a sub query?

回答1:

Since you have plenty of JOINs, I suggest creating a VIEW first:

CREATE VIEW articles AS
    SELECT    a.app_name, a.app_platform, p.post_created, p.post_id
    FROM      apps a
    JOIN      app_to_post ap ON ap.atp_app_id = a.app_id
    JOIN      posts p ON ap.atp_post_id = p.post_id
    WHERE     p.post_type = 'Article' AND p.post_status = 'a';

Then you can use a NULL-self-join:

SELECT     a1.app_name, a1.app_platform, a1.post_created, a1.post_id
FROM       articles a1
LEFT JOIN  articles a2 ON 
           a2.app_platform = a1.app_platform AND a2.post_created > a1.post_created
WHERE      a2.post_id IS NULL;

Test case:

CREATE TABLE posts (
   post_id          int,
   post_created     datetime,
   post_type        varchar(30),
   post_status      char(1)
);

CREATE TABLE apps (
   app_id           int,
   app_name         varchar(40),
   app_platform     varchar(40)
);

CREATE TABLE app_to_post (
   atp_id           int,
   atp_app_id       int,
   atp_post_id      int
);

INSERT INTO posts VALUES (1, '2010-10-06 05:00:00', 'Article', 'a');
INSERT INTO posts VALUES (2, '2010-10-06 06:00:00', 'Article', 'a');
INSERT INTO posts VALUES (3, '2010-10-06 07:00:00', 'Article', 'a');
INSERT INTO posts VALUES (4, '2010-10-06 08:00:00', 'Article', 'a');
INSERT INTO posts VALUES (5, '2010-10-06 09:00:00', 'Article', 'a');

INSERT INTO apps VALUES (1, 'Photoshop', 'Windows');
INSERT INTO apps VALUES (2, 'Photoshop', 'Macintosh');

INSERT INTO app_to_post VALUES (1, 1, 1);
INSERT INTO app_to_post VALUES (1, 1, 2);
INSERT INTO app_to_post VALUES (1, 2, 3);
INSERT INTO app_to_post VALUES (1, 2, 4);
INSERT INTO app_to_post VALUES (1, 1, 5);

Result:

+-----------+--------------+---------------------+---------+
| app_name  | app_platform | post_created        | post_id |
+-----------+--------------+---------------------+---------+
| Photoshop | Macintosh    | 2010-10-06 08:00:00 |       4 |
| Photoshop | Windows      | 2010-10-06 09:00:00 |       5 |
+-----------+--------------+---------------------+---------+
2 rows in set (0.00 sec)

As a side-note, in general you don't need a surrogate key for your junction table. You might as well set up a composite primary key (and ideally foreign keys to the referenced tables):

CREATE TABLE app_to_post (
   atp_app_id       int,
   atp_post_id      int,
   PRIMARY KEY (atp_app_id, atp_post_id),
   FOREIGN KEY (atp_app_id) REFERENCES apps (app_id),
   FOREIGN KEY (atp_post_id) REFERENCES posts (post_id)
) ENGINE=INNODB;


回答2:

Let's first consider how to obtain rows with maximum value from your query result and your desirable result:

Your result: (Let's call it table T)

app_name    app_platform   post_created      post_id
Photoshop   Windows        Oct. 20th, 2009   1
Photoshop   Windows        Dec. 1, 2009      3
Photoshop   Macintosh      Nov. 10th, 2009   2

The result you want:

app_name    app_platform   post_created      post_id
Photoshop   Windows        Dec. 1, 2009      3
Photoshop   Macintosh      Nov. 10th, 2009   2

In order to get the result, you should:

  1. Calculate the maximum post_id for each platform for table T.
  2. Join maximum result with the origin table T to obtain values in other columns of the row.

The query is below:

SELECT
  t1.app_name,t1.app_platform,t1.post_created,t1.post_id
FROM
  (SELECT app_platform, MAX(post_created) As MaxPostCreated
   FROM T
   GROUP BY app_platform) AS t2 JOIN 
  T AS t1
WHERE
  t1.app_platform = t2.app_platform1
   AND t2.MaxPostCreated = t1.post_created

In this query, the subquery performed the first step, and join performs the second step.

The final result combining with your partial answer is show below (with a view):

CREATE VIEW T 
    SELECT    a.app_name, a.app_platform, p.post_created, p.post_id
    FROM      apps a
    JOIN      app_to_post ap ON ap.atp_app_id = a.app_id
    JOIN      posts p ON ap.atp_post_id = p.post_id
    WHERE     p.post_type = 'Article' AND p.post_status = 'a';

SELECT
  t1.app_name,t1.app_platform,t1.post_created,t1.post_id
FROM
  (SELECT app_platform, MAX(post_created) As MaxPostCreated
   FROM T
   GROUP BY app_platform) AS t2 JOIN 
  T AS t1
WHERE
  t1.app_platform = t2.app_platform1
   AND t2.MaxPostCreated= t1.post_created

By the way, our team is actually now developing a tool trying to automatically help users to write queries, and users can provide input-output examples to the tool, and the tool will produce a query. (The first part of the query is actually generated by the tool! The link to our prototype is https://github.com/Mestway/Scythe)

Hope this can help you. :)



回答3:

You're on the right track.

Try adding

group by app_name,app_platform
having post_created=max(post_created)

Or if your post_id is sequential where a higher value will always reflect a later post, use this having clause: having post_id=max(post_id)