Are you getting different results when analyzing your Google Analytics 4 data in BigQuery compared to the standard reporting interface? Understanding the differences between the two methods of data analysis is essential for accurate and reliable insights. In his article, Minhaz Kazi explores the most significant discrepancies between BigQuery and the GA4 user interface and provides tips for accurate metric calculation. Here is my brief summary of the key differences and how to avoid them:
Sampling
Standard reports and the Data API are served from pre-aggregated GA4 tables, which makes sampling unlikely. Explorations query raw data and are sampled once they exceed the quota of 10M to 1B events, depending on the property type. The BigQuery export is never sampled, so before comparing numbers, make sure your exploration is running unsampled. The GA4 documentation offers more guidance on addressing sampling in explorations.
Different User Definitions
The Total Users metric in GA4 counts every user who has logged at least one event, but Active Users is the primary reporting metric and the figure shown as "Users" in most reports. When calculating user counts from BigQuery, filter for active users using the criteria defined for each stream type rather than counting every user_pseudo_id. Query implementations may vary, and an is_active_user field is planned for the BigQuery export to simplify this filtering.
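As an illustration, a minimal sketch of an active-user count for a web stream could look like the query below. The project, dataset, and table names are hypothetical, and the engagement-based filter is a simplification of the official active-user criteria:

```sql
-- Minimal sketch: count users with at least one engaged session or positive
-- engagement time (a simplified web-stream proxy for "active").
-- Project, dataset, and table names are hypothetical.
SELECT
  COUNT(DISTINCT user_pseudo_id) AS active_users
FROM
  `my-project.analytics_123456789.events_20240101`
WHERE
  (SELECT value.string_value FROM UNNEST(event_params)
   WHERE key = 'session_engaged') = '1'
  OR (SELECT value.int_value FROM UNNEST(event_params)
      WHERE key = 'engagement_time_msec') > 0;
```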
HyperLogLog++
GA4 uses the HyperLogLog++ (HLL++) algorithm to estimate cardinality for metrics such as Active Users and Sessions. Unique counts in the UI and the Data API are therefore approximations, with precision levels and confidence intervals that vary by metric. BigQuery lets you calculate exact cardinality, so expect small variations in these metrics even when everything else matches. See the documentation on HLL++ sketches for more information.
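To see the effect, you can compare an exact distinct count with BigQuery's own HLL++-based approximations on the same column. A rough sketch, with a hypothetical table name and an illustrative precision value rather than the one GA4 necessarily uses:

```sql
-- Sketch: exact vs. HLL++-approximated distinct users over one daily table.
-- The precision value 14 is illustrative, not necessarily GA4's setting.
SELECT
  COUNT(DISTINCT user_pseudo_id)        AS exact_users,
  APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_users,  -- HLL++ under the hood
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_pseudo_id, 14)) AS hll_users
FROM `my-project.analytics_123456789.events_20240101`;
```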
Time Delay
BigQuery creates the daily export tables after collecting all events for the day, but they can still be updated with late-arriving, time-stamped events for up to 72 hours beyond the table's date. This mainly affects Firebase SDK and Measurement Protocol implementations that send delayed events. To avoid discrepancies between the standard reporting surfaces and BigQuery, compare only data older than 72 hours.
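One simple way to respect this window is to restrict comparisons to daily tables that are at least three days old; a sketch with a hypothetical dataset name:

```sql
-- Sketch: only query daily tables at least 3 days old, past the window in
-- which late-arriving events can still update them. Dataset name hypothetical.
SELECT
  event_date,
  COUNT(*) AS events
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX <= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
GROUP BY event_date
ORDER BY event_date;
```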
Google Signals
Activating Google Signals lets GA4 deduplicate users across platforms and devices, so one person viewing your website in several browsers is not counted as several users. The BigQuery export, however, will still contain multiple user_pseudo_ids for that person. Implementing User-ID alongside Google Signals helps with deduplication, but note that some reports in the standard reporting surfaces may have thresholding applied when Google Signals is active, whereas thresholding does not apply to BigQuery.
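Since Google Signals data itself is not exported, the closest you can get in BigQuery is deduplicating on your own User-ID where it is set; a rough sketch with a hypothetical table name:

```sql
-- Rough sketch: prefer user_id where it is set and fall back to
-- user_pseudo_id. This only approximates the UI's cross-device
-- deduplication; a person whose early events lack user_id can still be
-- counted more than once. Table name is hypothetical.
SELECT
  COUNT(DISTINCT COALESCE(user_id, user_pseudo_id)) AS deduplicated_users
FROM `my-project.analytics_123456789.events_20240101`;
```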
Consent mode and Modeled data
When users deny cookie consent, GA4 uses behavioral modeling to fill the resulting data gaps in its reports. Consent mode communicates each user's consent status to Google, and for unconsented users only the cookieless pings are exported to BigQuery, with a user_pseudo_id that varies from session to session. Because modeled data is not exported, standard reporting surfaces and BigQuery will differ. Implementing User-ID in your GA4 property reduces this effect.
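To gauge how much of your export comes from cookieless pings, you can split event volume by consent status. A sketch that assumes the privacy_info record is populated in your export, with a hypothetical table name:

```sql
-- Sketch: event volume and pseudo-ID counts by analytics-storage consent.
-- Assumes privacy_info is populated; table name is hypothetical.
SELECT
  privacy_info.analytics_storage AS analytics_consent,
  COUNT(*) AS events,
  COUNT(DISTINCT user_pseudo_id) AS distinct_pseudo_ids
FROM `my-project.analytics_123456789.events_20240101`
GROUP BY analytics_consent;
```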
Traffic attribution
In the BigQuery export, traffic attribution data is available at the user and event level, but not at the session level, and GA4's own attribution model is not reproduced there. To attribute sessions you have to build a custom model, for example by joining the dataset with first-party data. More traffic attribution data may become available through the BigQuery event export in the future.
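One common workaround is to derive a session-scoped source and medium yourself, for instance by taking the first non-null event-scoped values within each session. The sketch below assumes your export includes the collected_traffic_source record and uses a hypothetical table name; it is not GA4's own attribution model:

```sql
-- Sketch: derive session-level source/medium from the first non-null
-- event-scoped values in each session. Assumes collected_traffic_source is
-- populated in the export; table name is hypothetical.
WITH events AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS ga_session_id,
    event_timestamp,
    collected_traffic_source.manual_source AS event_source,
    collected_traffic_source.manual_medium AS event_medium
  FROM `my-project.analytics_123456789.events_20240101`
)
SELECT
  user_pseudo_id,
  ga_session_id,
  ARRAY_AGG(event_source IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS session_source,
  ARRAY_AGG(event_medium IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS session_medium
FROM events
GROUP BY user_pseudo_id, ga_session_id;
```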
Calculation errors
To ensure accurate metrics in BigQuery, use the correct calculation methods: for example, count sessions as unique combinations of user_pseudo_id (or user_id) and ga_session_id, since ga_session_id alone is not unique across users. The property's Reporting Identity setting also affects how user counts are calculated. Additionally, consider dimension and metric scope, time zone differences, and data filtering and export limits, all of which can cause discrepancies between the BigQuery event export data and the standard reporting surfaces.
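For instance, a sketch of the session count described above, again with a hypothetical table name:

```sql
-- Sketch: sessions as unique user_pseudo_id / ga_session_id combinations,
-- since ga_session_id on its own is not unique across users.
-- Table name is hypothetical.
SELECT
  COUNT(DISTINCT CONCAT(
    user_pseudo_id, '.',
    CAST((SELECT value.int_value FROM UNNEST(event_params)
          WHERE key = 'ga_session_id') AS STRING)
  )) AS sessions
FROM `my-project.analytics_123456789.events_20240101`;
```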
Keep these points in mind when working with GA4 data in BigQuery to ensure that you're getting accurate and reliable results.