I'm trying to gather information from Google Analytics to build a recommendation engine for my site. The site consists of many pages, so I'm tracking the number of times a user clicks, for example, from page A to page B. Currently I can measure the A -> B transitions on Google Analytics with previousPagePath = '/A' and nextPagePath = '/B', but the question I really want to answer is, "Of all the visits to the site that included viewing page A, how many times were pages B, C, ... viewed in the same visit?"
For example, if the flow was A -> homepage -> B, then that would not be captured by my current methodology, but would be captured by the broader measure. It looks like the "Visitors Flow" report on the Google Analytics web interface has the data I'm looking for, but I can't figure out how to access it programmatically via the API.
What is the best way to get this data?
This is a really great idea. I'm a little late to this, but you should be able to accomplish it by downloading all of the data with the Google Analytics Reporting API, storing it in a local database/file/whatever, and then building your recommendation engine by aggregating the statistics yourself.
To get the data from the Reporting API, try playing with the query explorer and extracting the number of visits to pages between all pairs of paths using a method similar to @carlsoja:
dimensions=ga:previousPagePath,ga:pagePath&metrics=ga:visits
To get all of the data, you will have to use one of the Core Reporting Client Libraries to paginate through the results (you can experiment with the pagination in the query explorer first).
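As a rough sketch of the pagination loop: the helper below keeps requesting pages until every row has been collected. `fetch_page` is a hypothetical callable you would write around whichever client library you use; it is assumed to take a 1-based start index and a page size and to return a dict with `rows` and `totalResults` keys, which is how I understand Core Reporting responses to be shaped.

```python
from typing import Callable, Dict, List


def fetch_all_rows(fetch_page: Callable[[int, int], Dict],
                   page_size: int = 1000) -> List[list]:
    """Collect every row of a paginated report.

    fetch_page(start_index, max_results) is a hypothetical wrapper that
    performs one API request and returns the parsed response as a dict.
    """
    rows: List[list] = []
    start_index = 1  # assumed 1-based, like the API's start-index parameter
    while True:
        page = fetch_page(start_index, page_size)
        page_rows = page.get("rows", [])
        if not page_rows:
            break  # an empty page means there is nothing left to read
        rows.extend(page_rows)
        start_index += len(page_rows)
        total = page.get("totalResults")
        if total is not None and start_index > total:
            break  # we have read everything the report says it contains
    return rows
```

Each returned row would then be something like `["/A", "/B", "3"]` for the query above (previous path, path, visits as a string).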
Once you have all of the data, you can pretty easily calculate the Markov chain transition probabilities that a person visits page /A immediately after they have visited page /B, or p(/A | /B). Then it would be pretty straightforward to estimate the probability that someone visits page /A if they visited page /B at some point in the past. If you wanted to get really fancy, you could use their complete history {H} to make recommendations for pages by estimating p(/A | {H}), but I'll leave that as an exercise for the reader ;)
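A minimal sketch of the transition-probability step, assuming the downloaded rows have the shape (previous path, path, visits) from the query above. The function name and row layout are my own choices for illustration, not anything the API prescribes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def transition_probabilities(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, float]]:
    """Estimate p(page | previous page) from pairwise transition counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev_page, page, visits in rows:
        # Metric values are assumed to arrive as strings, so convert them.
        counts[prev_page][page] += int(visits)

    probs: Dict[str, Dict[str, float]] = {}
    for prev_page, next_counts in counts.items():
        total = sum(next_counts.values())
        if total == 0:
            continue  # no observed transitions out of this page
        # Normalise each row of counts so it sums to 1.
        probs[prev_page] = {p: n / total for p, n in next_counts.items()}
    return probs
```

With this, `transition_probabilities(rows)["/B"]["/A"]` is the estimate of p(/A | /B), and sorting the inner dict by value gives a simple "people who viewed /B went next to..." ranking.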
Hope this helps!
Is there any reason why you couldn't simply segment on visits that included page A, use pagePath / pageTitle as the dimension, and return the number of visits as the metric?
dimensions=ga:pagePath&metrics=ga:visits&segment=dynamic::ga:pagePath=~A
In theory this should list out all of the pagePaths that were viewed in the same visit as pagePath=~A and the number of visits where both were viewed, which is what you're looking for, yes?
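To sanity-check what that query returns, here is a small local sketch of the same "viewed in the same visit" count. It assumes you already have each visit as a list of page paths (the `sessions` input is hypothetical, not something the API hands back in this form).

```python
from collections import Counter
from typing import Dict, Iterable, List


def co_viewed_counts(sessions: Iterable[List[str]], anchor: str) -> Dict[str, int]:
    """For visits that include `anchor`, count visits containing each other page."""
    counts: Counter = Counter()
    for pages in sessions:
        unique_pages = set(pages)  # count each page once per visit
        if anchor in unique_pages:
            counts.update(unique_pages - {anchor})
    return dict(counts)
```

For example, a visit A -> homepage -> B adds one to both the homepage and B, which is the broader measure the question asks for. Note that this sketch matches the anchor path exactly, whereas `=~A` is a regular-expression match, so it would also match any other path containing "A".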