Difference in statistics from Google Analytics Rep

Posted 2020-07-23 08:29

Question:

I have a Google Analytics premium account set up to monitor the user activity of a website and mobile application.

Raw data from GA is being stored in BigQuery tables.

However, I noticed that the statistics I see in a GA report are quite different from the statistics I get when querying the BigQuery tables.

I understand that GA reports show aggregated, and possibly sampled, data, and that the raw data in the BigQuery tables is session/hit-level data.

But I am still not sure if I understand the reason why the statistics could be different.

Would really appreciate it if someone clarified this for me.

Thanks in advance.

UPDATE 1:

I exported the raw data from BigQuery into my Hadoop cluster. The data is stored in a Hive table. I flattened all the nested and repeated fields before exporting.

Here is the Hive query that I ran on the raw data in the Hive table:

SELECT
   date as VisitDate,
   count(distinct fullvisitorid) as CountVisitors,
   SUM(totals_visits) as SumVisits,
   SUM(totals_pageviews) AS PVs
FROM
   bigquerydata 
WHERE 
   fullvisitorid IS NOT NULL
GROUP BY 
   date
ORDER BY
   VisitDate DESC

A) Taking February 9th as the VisitDate, I get the following results from this query:

i) CountVisitors= 1,074,323
ii) SumVisits= 48,990,198
iii) PVs= 1,122,841,424 

Vs

B) Taking the same VisitDate and obtaining the same statistics from the GA report:

i) Users count = 1,549,757
ii) Number of pageviews = 11,604,449 (Huge difference when compared to A(iii))  

In the Hive query above, am I using the wrong fields, or processing the fields incorrectly? I'm just trying to figure out why the numbers differ.

UPDATE 2 (following @Felipe Hoffa's suggestion):

This is how I am flattening the tables in my Python code before exporting the result to GCS and then to the Hadoop cluster:

queryString = 'SELECT * FROM flatten(flatten(flatten(flatten(flatten(flatten([' + TABLE_NAME + '],hits),hits.product),hits.promotion),hits.customVariables), hits.customDimensions), hits.customMetrics)'

I understand what you are saying about flattening causing repeated pageviews and each repetition getting into the final wrong addition.

I tried the same query (from Update 1) on the BigQuery table instead of my Hive table. The numbers matched those on the Google Analytics dashboard.

However, assuming that the Hive table is all I have and that it contains those repeated rows due to flattening, is there still any way I can fix my Hive query to match the stats from the Google Analytics dashboard?

Logically speaking, if the repeated rows came from flattening, can't I reverse that in my Hive table? If you think I can, do you have any suggestions on how to proceed?

Thank you so much in advance!

Answer 1:

Can you run the same query in BigQuery, instead of on the data exported to Hive?

My guess, going by "The data is stored in a Hive table. I flattened all the nested and repeated fields before exporting.": when flattening, are you repeating pageviews several times, with each repetition getting added into the final, wrong sum?

Note how data can get duplicated when flattening rows:

SELECT col, x FROM (
  SELECT "wrong" col, SUM(totals.pageviews) x
  FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
), (
  SELECT "correct" col, SUM(totals.pageviews) x
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)


col      x
wrong    2262
correct  249
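
To make the mechanism concrete, here is a small Python sketch of what flattening does to a session-level total. The sessions, field names, and the `flatten_hits` helper are made up for illustration (this is not the real GA export schema or the BigQuery FLATTEN implementation); the only point is that a per-session value copied onto every hit row gets summed once per hit.

```python
# Toy sessions: one row per session, with a session-level pageview total
# and a repeated list of hits (hypothetical data, not real GA export rows).
sessions = [
    {"visitId": 1, "pageviews": 3, "hits": ["page", "page", "event", "page"]},
    {"visitId": 2, "pageviews": 1, "hits": ["page"]},
    {"visitId": 3, "pageviews": 2, "hits": ["page", "event", "page"]},
]


def flatten_hits(rows):
    """Emit one row per hit, copying the session-level fields onto each row."""
    return [
        {"visitId": s["visitId"], "pageviews": s["pageviews"], "hit": h}
        for s in rows
        for h in s["hits"]
    ]


flat = flatten_hits(sessions)

# Summing over sessions counts each session total once.
correct_total = sum(s["pageviews"] for s in sessions)

# Summing over flattened rows counts each session total once per hit.
wrong_total = sum(r["pageviews"] for r in flat)

print(len(flat), correct_total, wrong_total)
```

With six nested FLATTEN calls, as in the question, each session row is presumably repeated many more times than in this toy case, which would be consistent with the very large pageview sum.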

Update, in response to "Update 2" in the question:

Since BigQuery is working correctly, and this is a Hive problem, you should add that tag to get relevant answers.

Nevertheless, this is how I would correctly de-duplicate previously duplicated rows with BigQuery:

SELECT SUM(pv)
FROM (
  SELECT visitId, MAX(totals.pageviews) pv
  FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
  GROUP EACH BY 1
)
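
The same idea can probably be carried over to the flattened Hive table: collapse the rows back to one per session first, then aggregate. Below is a rough sketch that runs the SQL against an in-memory SQLite table from Python, purely as a stand-in for Hive so that it can be tried anywhere. The toy rows are invented, and I am assuming the flattened table kept a `visitid` column next to `fullvisitorid`. I group on both columns rather than on `visitid` alone because, as far as I know, `visitId` is only unique per visitor. I have not run this on Hive, so treat the query as a starting point rather than a verified fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE bigquerydata (
           date TEXT, fullvisitorid TEXT, visitid INTEGER,
           totals_visits INTEGER, totals_pageviews INTEGER)"""
)

# Toy flattened rows: every session-level total is repeated once per row.
rows = (
    [("feb09", "A", 100, 1, 3)] * 4    # visitor A, session 100, 3 pageviews
    + [("feb09", "A", 200, 1, 1)]      # visitor A, session 200, 1 pageview
    + [("feb09", "B", 100, 1, 2)] * 3  # visitor B reuses visitid 100
)
conn.executemany("INSERT INTO bigquerydata VALUES (?, ?, ?, ?, ?)", rows)

# Naive sums over the flattened rows are inflated.
naive = conn.execute(
    "SELECT SUM(totals_visits), SUM(totals_pageviews) FROM bigquerydata"
).fetchone()

# Collapse to one row per (visitor, session) first, then aggregate per date.
DEDUP_SQL = """
SELECT date AS VisitDate,
       COUNT(DISTINCT fullvisitorid) AS CountVisitors,
       SUM(visits) AS SumVisits,
       SUM(pageviews) AS PVs
FROM (
    SELECT date, fullvisitorid, visitid,
           MAX(totals_visits) AS visits,
           MAX(totals_pageviews) AS pageviews
    FROM bigquerydata
    WHERE fullvisitorid IS NOT NULL
    GROUP BY date, fullvisitorid, visitid
) s
GROUP BY date
ORDER BY VisitDate DESC
"""
dedup_result = conn.execute(DEDUP_SQL).fetchall()

print(naive, dedup_result)
```

MAX works here only because the session-level totals are identical on every repeated row of a session; any field that actually varies per hit would need different handling. On Hive, `date` may need backticks depending on the version and its reserved-keyword settings.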