Parsing NYC Transit/MTA historical GTFS data (not

Posted 2019-06-07 21:29

Question:

I've been puzzling on this on and off for months and can't find a solution.

The MTA claims to provide historical data in form of daily dumps in GTFS format here: [http://web.mta.info/developers/MTA-Subway-Time-historical-data.html][1]

See for yourself by downloading the example they provide, in this case for September 17, 2014: [https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31][1]

My problem? The file is gobbledygook. It does not follow GTFS specifications, has no extension, and when I open it using a text editor it looks like 7800 lines of this:

n ^C1.0^X �枪�^Eʞ>` ^C1.0^R^K ^A1^R^F^P����^E^R^K ^A2^R^F^P����^E^R^K ^A3^R^F^P����^E^R^K ^A4^R^F^P����^E^R^K ^A5^R^F^P����^E^R^K ^A6^R^F^P����^E^R^K ^AS^R^F^P����^E^R[ ^F000001^ZQ 6 ^N050400_1..S02R^Z^H20140917*^A1�>^V ^P01 0824 242/SFY^P^A^X^C^R^W^R^F^Pɚ��^E"^D140Sʚ>^F ^AA^R^AA^RR ^F000002"H 6

Per the MTA site (this appears to be untrue):

All data is formatted in GTFS-realtime

Any idea on the steps necessary to transform this mystery file into usable GTFS data? Is there some encoding I am missing? I have looked for 10+ and been unable to come up with a solution.

Also, not to be a stickler, but I am NOT referring to the MTA's realtime data feed, which is correctly formatted and usable. I am specifically referring to the historical data dumps referenced above (I have received many "solutions" that refer only to the realtime data feed).

Answer 1:

The file you link to is in GTFS-realtime format, not GTFS, and the page you linked to does a very bad job of explaining which format their data is actually in (though it is mentioned in your quote).

GTFS is used to store schedule data, like routes and scheduled arrival times.

GTFS-realtime is generally used to transfer actual transit performance data in real time, like vehicle locations and expected or actual arrival times. It is serialized with Protocol Buffers (protobuf), a compact binary serialization format developed by Google, which means you can't usefully read it in a text editor; instead you have to load it programmatically using the Google protobuf tools. It can also be used as a historical data format in the way the MTA does here, by making daily dumps of the GTFS-rt feed publicly available. It's called GTFS-realtime because various data fields in the realtime feed, like route_id, trip_id, and stop_id, are designed to link to the published GTFS schedules.
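To illustrate why a text editor shows only stray fragments such as "1.0", here is a minimal sketch of how protobuf lays fields out on the wire. It is for illustration only: it handles just the two wire types needed for a feed header, the helper names are hypothetical, and the sample bytes are hand-built (an assumption, not copied from the MTA file) to encode version "1.0" and timestamp 1410960621. For real work, use the protobuf library.

```python
# Illustration only: a tiny reader for two protobuf wire types.
# SAMPLE_HEADER is hand-built for this sketch, not taken from the MTA dump.
SAMPLE_HEADER = b"\x0a\x031.0\x18\xed\x99\xe6\xa0\x05"


def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:               # high bit clear marks the last byte
            return result, pos
        shift += 7


def walk_fields(buf):
    """Return (field_number, wire_type, value) for each top-level field."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint, e.g. an integer timestamp
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited, e.g. a string or sub-message
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        fields.append((field_number, wire_type, value))
    return fields


print(walk_fields(SAMPLE_HEADER))
# [(1, 2, b'1.0'), (3, 0, 1410960621)]
```

Only the string payload is readable as text; the field tags, lengths, and integers are raw bytes, which is what produces the control characters in an editor.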

I confirmed the validity of the data you linked to by decoding it using the gtfs-realtime.proto specification and the Google protobuf tools for Python. It begins:

header {
  gtfs_realtime_version: "1.0"
  timestamp: 1410960621
}
entity {
  id: "000001"
  trip_update {
    trip {
      trip_id: "050400_1..S02R"
      start_date: "20140917"
      route_id: "1"
    }
    stop_time_update {
      arrival {
        time: 1410960713
      }
      stop_id: "140S"
    }
  }
}
...

and continues in that vein for a total of 55833 lines (in the default string output format).
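The timestamp and time values in the output above are POSIX times (seconds since 1970-01-01 UTC). As a small sketch, they can be converted to readable datetimes like this; the function name posix_to_utc is just an illustrative choice:

```python
from datetime import datetime, timezone


def posix_to_utc(seconds):
    """Convert POSIX seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Header timestamp and first arrival time from the decoded output.
print(posix_to_utc(1410960621).isoformat())  # 2014-09-17T13:30:21+00:00
print(posix_to_utc(1410960713).isoformat())  # 2014-09-17T13:31:53+00:00
```

13:30:21 UTC corresponds to 09:30:21 in New York on that date (UTC-4), which is consistent with the file name gtfs-2014-09-17-09-31.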

EDIT: the Python script used to convert the protobuf into string representation is very simple:

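# Module generated by protoc from gtfs-realtime.proto
import gtfs_realtime_pb2 as gtfs_rt

# Read the raw protobuf bytes downloaded from the MTA.
with open('gtfs-rt.pb', 'rb') as f:
    raw_bytes = f.read()

# Decode the bytes into a FeedMessage.
msg = gtfs_rt.FeedMessage()
msg.ParseFromString(raw_bytes)

# Printing a message emits its text-format representation.
print(msg)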

This requires gtfs-realtime.proto to have been compiled into gtfs_realtime_pb2.py using protoc (for example, protoc --python_out=. gtfs-realtime.proto, following the instructions in the Python protobuf documentation under "Compiling Your Protocol Buffers") and placed in the same directory as the Python script. The binary protobuf downloaded from the MTA also needs to be renamed to gtfs-rt.pb and placed in that directory.
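If a table is more useful than the text dump, the parsed message can be flattened into rows. The following is only a sketch: the function names and CSV column names are chosen here for illustration, and it assumes the feed object is a FeedMessage parsed as in the script above. The result is a flat table of stop time updates, not a set of GTFS schedule files.

```python
import csv
import sys

COLUMNS = ["trip_id", "route_id", "start_date", "stop_id",
           "arrival_time", "departure_time"]


def iter_stop_time_rows(feed):
    """Yield one dict per stop_time_update in a parsed FeedMessage."""
    for entity in feed.entity:
        if not entity.HasField("trip_update"):
            continue  # skip entities that carry vehicle positions or alerts
        trip = entity.trip_update.trip
        for stu in entity.trip_update.stop_time_update:
            yield {
                "trip_id": trip.trip_id,
                "route_id": trip.route_id,
                "start_date": trip.start_date,
                "stop_id": stu.stop_id,
                # arrival and departure are optional sub-messages
                "arrival_time": stu.arrival.time if stu.HasField("arrival") else "",
                "departure_time": stu.departure.time if stu.HasField("departure") else "",
            }


def write_rows_csv(rows, out=sys.stdout):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Usage, assuming msg is the FeedMessage parsed by the script above:
# write_rows_csv(iter_stop_time_rows(msg))
```

For the first entity in the sample file this would produce a row with trip_id 050400_1..S02R, route_id 1, start_date 20140917, stop_id 140S, and arrival_time 1410960713.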