Scraping Real Time Visitors from Google Analytics

2019-01-13 16:21发布

问题:

I have a lot of sites and want to build a dashboard showing the number of real time visitors on each of them on a single page. (would anyone else want this?) Right now the only way to view this information is to open a new tab for each site.

Google doesn't have a real-time API, so I'm wondering if it is possible to scrape this data. Eduardo Cereto found out that Google transfers the real-time data over the realtime/bind network request. Anyone more savvy have an idea of how I should start? Here's what I'm thinking:

  1. Figure out how to authenticate programmatically
  2. Inspect all of the realtime/bind requests to see how they change. Does each request have a unique key? Where does that come from? Below is my breakdown of the request:

    https://www.google.com/analytics/realtime/bind?VER=8

    &key=[What is this? Where does it come from? 21 character lowercase alphanumeric, stays the same each request]

    &ds=[What is this? Where does it come from? 21 character lowercase alphanumeric, stays the same each request]

    &pageId=rt-standard%2Frt-overview

    &q=t%3A0%7C%3A1%3A0%3A%2Ct%3A11%7C%3A1%3A5%3A%2Cot%3A0%3A0%3A4%2Cot%3A0%3A0%3A3%2Ct%3A7%7C%3A1%3A10%3A6%3D%3DREFERRAL%3B%2Ct%3A10%7C%3A1%3A10%3A%2Ct%3A18%7C%3A1%3A10%3A%2Ct%3A4%7C5%7C2%7C%3A1%3A10%3A2!%3Dzz%3B%2C&f

    The q variable URI decodes to this (what the?): t:0|:1:0:,t:11|:1:5:,ot:0:0:4,ot:0:0:3,t:7|:1:10:6==REFERRAL;,t:10|:1:10:,t:18|:1:10:,t:4|5|2|:1:10:2!=zz;,&f

    &RID=rpc

    &SID=[What is this? Where does it come from? 16 character uppercase alphanumeric, stays the same each request]

    &CI=0

    &AID=[What is this? Where does it come from? integer, starts at 1, increments weirdly to 150 and then 298]

    &TYPE=xmlhttp

    &zx=[What is this? Where does it come from? 12 character lowercase alphanumeric, changes each request]

    &t=1

  3. Inspect all of the realtime/bind responses to see how they change. How does the data come in? It looks like some altered JSON. How many times do I need to connect to get the data? Where is the active visitors on site number in there? Here is a dump of sample data:

    19 [[151,["noop"] ] ] 388 [[152,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[49,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[0,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3,2,0],"name":"Total"}]}}]]] ] 388 [[153,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[52,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[2,1,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3,2],"name":"Total"}]}}]]] ] 388 [[154,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[53,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[0,3,1,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3],"name":"Total"}]}}]]] ]

Let me know if you can help with any of the items above!

回答1:

To get the same, Google has launched new Real Time API. With this API you can easily retrieve real time online visitors as well as several Google Analytics with following dimensions and metrics. https://developers.google.com/analytics/devguides/reporting/realtime/dimsmets/

This is quite similar to Google Analytics API. To start development on this, https://developers.google.com/analytics/devguides/reporting/realtime/v3/devguide



回答2:

With Google Chrome I can see the data on the Network Panel.

The request endpoint is https://www.google.com/analytics/realtime/bind

Seems like the connection stays open for 2.5 minutes, and during this time it just keeps getting more and more data.

After about 2.5 minutes the connection is closed and a new one is open.

On the Network panel you can only see the data for the connections that are terminated. So leave it open for 5 minutes or so and you can start to see the data.

I hope that can give you a place to start.



回答3:

Having google in the loop seems pretty redundant. Suggest you use a common element delivered on demand from the dashboard server and include this item by absolute URL on all pages to be monitored for a given site. The script outputting the item can read the IP of the browser asking and these can all be logged into a database and filtered for uniqueness giving a real time head count.

<?php
$user_ip = $_SERVER["REMOTE_ADDR"];
/// Some MySQL to insert $user_ip to the database table for website XXX  goes here


$file = 'tracking_image.gif';
$type = 'image/gif';
header('Content-Type:'.$type);
header('Content-Length: ' . filesize($file));
readfile($file);
?>

Ammendum: A database can also add a timestamp to every row of data it stores. This can be used to further filter results and provide the number of visitors in the last hour or minute.

Client side Javascript with AJAX for fine tuning or overkill The onblur and onfocus javascript commands can be used to tell if the the page is visible, pass the data back to the dashboard server via Ajax. http://www.thefutureoftheweb.com/demo/2007-05-16-detect-browser-window-focus/

When a visitor closes a page this can also be detected by the javascript onunload function in the body tag and Ajax can be used to send data back to the server one last time before the browser finally closes the page.

As you may also wish to collect some information about the visitor like Google analytics does this page https://panopticlick.eff.org/ has a lot of javascript that can be examined and adapted.



回答4:

I needed/wanted realtime data for personal use so I reverse-engineered their system a little bit.

Instead of binding to /bind I get data from /getData (no pun intended).

At /getData the minimum request is apparently: https://www.google.com/analytics/realtime/realtime/getData?pageId&key={{propertyID}}&q=t:0|:1

Here's a short explanation of the possible query parameters and syntax, please remember that these are all guesses and I don't know all of them:

Query Syntax: pageId&key=propertyID&q=dataType:dimensions|:page|:limit:filters

Values:

pageID: Required but seems to only be used for internal analytics.

propertyID: a{{accountID}}w{{webPropertyID}}p{{profileID}}, as specified at the Documentation link below. You can also find this in the URL of all analytics pages in the UI.


dataType:
    t: Current data
    ot: Overtime/Past
    c: Unknown, returns only a "count" value


dimensions (| separated or alone), most values are only applicable for t:
    1:  Country
    2:  City
    3:  Location code?
    4:  Latitude
    5:  Longitude
    6:  Traffic source type (Social, Referral, etc.)
    7:  Source
    8:  ?? Returns (not set)
    9:  Another location code? longer.
    10: Page URL
    11: Visitor Type (new/returning)
    12: ?? Returns (not set)
    13: ?? Returns (not set)
    14: Medium
    15: ?? Returns "1"

page:
    At first this seems to work for pagination but after further analysis it looks like it's also used to specify which of the 6 pages (Overview, Locations, Traffic Sources, Content, Events and Conversions) to return data for.

    For some reason 0 returns an impossibly high metrictotal

limit: Result limit per page, maximum of 50

filters:
    Syntax is as specified at the Documentation 2 link below except the OR is specified using | instead of a comma.6==CUSTOM;1==United%20States


You can also combine multiple queries in one request by comma separating them (i.e. q=t:1|2|:1|:10,t:6|:1|:10).

Following the above "documentation", if you wanted to build a query that requests the page URL and city of the top 10 active visitors with a traffic source type of CUSTOM located in the US you would use this URL: https://www.google.com/analytics/realtime/realtime/getData?key={{propertyID}}&pageId&q=t:10|2|:1|:10:6==CUSTOM;1==United%20States


Documentation

Documentation 2


I hope that my answer is readable and (although it's a little late) sufficiently answers your question and helps others in the future.