Database - I am working on in Postgres 9.6.5
I am analyzing the data from US Airport Authority (RITA) about the flights arrival and departures. This link (http://stat-computing.org/dataexpo/2009/the-data.html) lists all the columns in the table.
The table has following 29 columns
No Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
There are about a million rows for each year.
I am trying to find out a count the most busy airports when delay is more than 15minutes. column DepDelay - has the delay time. origin - is the origin code for the airport.
All the data has been loaded into a table called 'ontime'
I am forming the query as follows in stages.
select airports where delay is more than 15 minutes
select origin,year,count(*) as depdelay_count from ontime where depdelay > 15
group by year,origin order by depdelay_count desc )Now I wish to pull out only the top 10 airports per year - which I am doing as follows
select x.origin,x.year from (with subquery as ( select origin,year,count(*) as depdelay_count from ontime where depdelay > 15 group by year,origin order by depdelay_count desc ) select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 10;
Now that I have the top 10 airports by depdelay - I wish to get a count of the total flights out of these airports.
select origin,count() from ontime where origin in (select x.origin from (with subquery as ( select origin,year,count() as depdelay_count from ontime where depdelay > 15 group by year,origin order by depdelay_count desc ) select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 2) group by origin order by origin;
If I modify the Step 3 query by adding the year in the year clause
---- will be any value from (1987 to 2008)
select origin,count(*) from ontime where year = (<YEAR>) origin in
(select x.origin from (with subquery as (
select origin,year,count(*) as depdelay_count from ontime
where
depdelay > 15
group by year,origin
order by depdelay_count desc
)
select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 2)
group by origin
order by origin;
But I have to do this manually for all years from 1987 to 2008 which I want to avoid.
Please can you help refine the query so that I can select the data for all the years without having to select each year manually.
I find CTEs int he middle of queries to e confusing. You can basically do this with one CTE/subquery:
Note the use of conditional aggregation and using window functions with aggregation functions.