At first glance, calculating WAU (Weekly Active Users) and MAU (Monthly Active Users) in BigQuery might seem straightforward.
However, calculating cumulative unique active users by day in BigQuery can be challenging due to several factors:
1. Data Volume. When dealing with large datasets, operations like aggregation and counting unique values can require significant computational resources and time. While BigQuery is designed to handle big data, complex queries involving operations like DISTINCT can take longer to execute, especially if these queries need to run daily.
2. Query Optimization. To run queries efficiently in BigQuery, it’s essential to optimize their structure. Calculating cumulative unique users by day often requires the use of subqueries, window functions, and aggregate operations, which can be complex to write and fine-tune. Poorly optimized queries can lead to increased costs and longer execution times.
3. Limitations of Window Functions. BigQuery does not allow COUNT(DISTINCT) inside a window (analytic) function that also specifies ORDER BY. This restriction exists because DISTINCT requires seeing the complete set of values before the count can be produced, while an ordered window function is designed to compute results incrementally as it moves through the rows in order.
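To make the limitation concrete, here is the query most people try first. It is a sketch against a hypothetical `events` table with `user_id` and `event_date` columns, and BigQuery rejects it:

```sql
-- Fails in BigQuery: an analytic COUNT(DISTINCT ...) cannot be
-- combined with ORDER BY in the same OVER clause.
SELECT
  event_date,
  COUNT(DISTINCT user_id) OVER (ORDER BY event_date) AS cumulative_users
FROM events;
```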
There are many solutions to this dilemma. One of the most elegant, proposed by Shane Neubauer, is based on the HLL_COUNT family of functions. HyperLogLog is an efficient way to estimate the number of unique values in large datasets, and it is especially useful when high execution speed matters and a small margin of error is acceptable.
Here’s the syntax:
HLL_COUNT.INIT() — Aggregates input values into a HyperLogLog++ sketch, which can be stored and combined later.
HLL_COUNT.MERGE() — Merges multiple HLL sketches and returns the estimated number of unique values. (Its sibling, HLL_COUNT.MERGE_PARTIAL, merges sketches into a single sketch without extracting the count.)
HLL_COUNT.EXTRACT() — A scalar function that extracts the estimated count of unique values from a single sketch.
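Putting the pieces together, a cumulative-unique-users query can be sketched as follows. The table name and columns (`my_project.my_dataset.events`, `user_id`, `event_date`) are placeholders; adapt them to your own schema:

```sql
-- Cumulative approximate unique users by day.
WITH daily_sketches AS (
  -- Build one HLL++ sketch of user_ids per day
  SELECT
    event_date,
    HLL_COUNT.INIT(user_id) AS sketch
  FROM `my_project.my_dataset.events`
  GROUP BY event_date
)
SELECT
  d1.event_date,
  -- Merge every sketch up to and including this day and
  -- return the estimated cumulative unique-user count
  HLL_COUNT.MERGE(d2.sketch) AS cumulative_unique_users
FROM daily_sketches AS d1
JOIN daily_sketches AS d2
  ON d2.event_date <= d1.event_date
GROUP BY d1.event_date
ORDER BY d1.event_date;
```

The same shape also gives you rolling WAU or MAU: replace the join condition with a date window, e.g. `d2.event_date BETWEEN DATE_SUB(d1.event_date, INTERVAL 6 DAY) AND d1.event_date` for a 7-day window. Because each day is reduced to one small sketch before the join, this is far cheaper than re-scanning raw events for every window.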
If you work with GA4 to BigQuery exports, be sure to check out my SQL cheat sheet.