In July 2023, Universal Analytics will de facto cease to exist: standard properties will stop being supported, and users will have about six months to work with their historical data. So if you are not a GA360 subscriber, the next version of Google Analytics will soon be your only option for collecting data with Google's services.
My name is Alexander Ignatenko, and I specialize in marketing analytics. I consult on analytics setup and data-driven decision-making, and I help with training and with developing and implementing custom attribution models and A/B experiments.
There are only a few days left until Universal Analytics is switched off. Despite significant changes to the interface and to the data collection and processing logic, GA4 still fails to address many important issues, or addresses them poorly. This material will help you prepare for the changes and shed light on some of the problem areas.
This article aims to answer the question of how all this hurts end users. GA4 is incredibly flawed. While we could previously attribute the service's shortcomings to its newness, now, almost three years after launch, they look like inexplicable omissions. I have identified seven such issues, and they form the basis of the article's title. So, let's get started.
1st sin - HyperLogLog++ or forget about accuracy
HyperLogLog++ is the beast the GA4 interface and API use to estimate the number of unique values. In simple terms, it's a probabilistic cardinality-estimation algorithm: instead of counting unique values (primarily users and sessions) exactly, it approximates them and fills your reports with those approximations.
That's convenient for computational efficiency, but for users it's a drawback. The metrics in your reports are not precise numbers but approximations with a certain margin of error. So even if you're looking at an unsampled exploration report, you're still dealing with approximate figures rather than exact values.
Don't believe me? Let's consult the documentation together:
...when you view the unique count of these metrics in the user interface or through the API, they represent an approximation with a certain level of accuracy. In BigQuery, since you have access to detailed data, you can calculate the exact count for these metrics. Thus, the metrics may differ by a small percentage. With a 95% confidence interval, the accuracy of session counts can be ±1.63%...
±1.63%. Once again, what you see is not the metric itself but an approximation. And this is not someone's imagination but the level of error described in Google's documentation. So, if accuracy and data reconciliation with CRM and other sources are important to you, unfortunately, GA4 is not the assistant you're looking for.
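If you have the raw event export, you can see the difference for yourself. Below is a minimal sketch, assuming the standard GA4 BigQuery export schema; the project and dataset names are placeholders, and precision 14 follows the session figure quoted above from the documentation.

```python
# A minimal sketch: exact session count from the GA4 BigQuery export versus
# an HLL++ approximation at precision 14. Project/dataset names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  -- exact count: every distinct user_pseudo_id + ga_session_id pair
  COUNT(DISTINCT CONCAT(user_pseudo_id, CAST(
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS STRING))) AS exact_sessions,
  -- approximate count via HyperLogLog++ with precision 14
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(CONCAT(user_pseudo_id, CAST(
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS STRING)), 14)) AS approx_sessions
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20230601' AND '20230607'
"""

for row in client.query(query).result():
    diff = 100 * (row.approx_sessions - row.exact_sessions) / row.exact_sessions
    print(f"exact: {row.exact_sessions}, approx: {row.approx_sessions}, "
          f"difference: {diff:.2f}%")
```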
2nd sin - threshold values
Data thresholds are a mechanism in GA4 that withholds data from reports when the data volume is too small. However, Google conveniently declines to disclose what those thresholds actually are:
Threshold values are determined by the system, and they cannot be adjusted.
One thing is known for sure: thresholds kick in when Google Signals is enabled and in reports containing demographic data. Diligent analysts have found that with Google Signals enabled, reports drop data for, say, pages visited by fewer than 50 users. This applies to API calls, but not to the BigQuery export.
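You can at least detect when a report has been thresholded. Here's a minimal sketch using the GA4 Data API Python client, which exposes a thresholding flag in the response metadata; the property ID is a placeholder.

```python
# A minimal sketch: pull a pageview report via the GA4 Data API and check
# whether thresholding was applied. The property ID is a placeholder.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="totalUsers")],
    date_ranges=[DateRange(start_date="7daysAgo", end_date="yesterday")],
)
response = client.run_report(request)

# If subject_to_thresholding is True, low-volume rows have been withheld.
if response.metadata.subject_to_thresholding:
    print("Warning: this report is thresholded - low-volume rows are missing.")

for row in response.rows:
    print(row.dimension_values[0].value, row.metric_values[0].value)
```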
Therefore, if you want to reduce the impact of thresholds (avoiding them completely is unfortunately impossible), enable device-based user identification in the admin settings. But user identification raises another question.
3rd sin - user identification
Unlike the previous version of Google Analytics, GA4 relies on four identity spaces for user identification:
user identifiers;
Google Signals;
device identifiers;
modelling.
Together, they result in three available methods of identification:
blended identification;
based on observed parameters;
by device identifiers.
Let's start with the fact that blended identification is used by default. However, to use it effectively you need to enable Google Signals (which raises its own questions).
What's even more important is that the first two methods are essentially black boxes: they rely either on modelling (for users who haven't consented to data collection on the site) or on Google Signals. For properties with low to medium traffic this is a disaster, because it adds noise to already scarce data.
And the most important thing: device-based identification, the method closest to Universal Analytics, has been cowardly tucked away behind the least obvious control in the reporting identity settings. To enable it, you'll have to read the documentation carefully and find the inconspicuous "Show all" button.
Well, it was a nice try.
4th sin - Google Signals
Google Signals is the company's attempt to get around the restrictions on third-party cookies. When you enable this setting, GA4 starts taking into account whether users were signed in to their Google accounts at the time of the visit, in order to identify them better.
It sounds reasonable, but good intentions aren't the only thing this road to relevant data is paved with. What are the consequences of enabling Google Signals?
Firstly, your data will be filled with more unnecessary noise (as mentioned earlier).
Secondly, GA requests from your website will be sent to a different domain: instead of the innocuous google-analytics.com, they will go to analytics.google.com, a subdomain of google.com. Quite interesting, isn't it?
Thirdly, before activating Google Signals, you will come across an interesting (and very inconspicuous) message:
You've allowed data to be used to improve Google products and services. These settings will also apply to data about visits collected using Google Signals that are associated with Google user accounts. You confirm that you have obtained the necessary end-user authorizations, including the right to disclose this data to Google as described in your privacy policy.
I'm not a professional poker player, but the deeper I dug into the story of Google Signals, the less I understood who the proverbial "sucker" at this "table" is. And that's usually a very bad signal. A Google Signal.
Not only does this technique provide little (or no) benefit, but it essentially operates on a business logic that is entirely alien to me. Not very pleasant, to say the least.
5th sin - AMP Pages
It's hard to believe, but GA4 has no solution for one of Google's own flagship products - AMP (Accelerated Mobile Pages). If you think this is an outdated and/or unimportant feature, stay away from SEO specialists, or they'll laugh at you. AMP pages are lightweight versions of your website's content built for fast access on mobile devices. And it's a crucial factor for ranking in Google.
To be fair, a solution does exist: David Valejo took on the role of handyman and documented and published, for free, a JSON configuration of the amp-analytics component for GA4. Some people even build CMS plugins on top of it. But it's a homemade solution, and it has several weak spots (for example, session stitching in a GTM Server Side setup).
So if you're receiving a lot of traffic to your AMP pages, my condolences. Use and modify David Valejo's makeshift solution and submit a support request - maybe someone will deign to respond.
6th sin - A/B Testing
Along with Universal Analytics, another popular service is being thrown into the abyss: Google Optimize. It was attractive for its pricing (free) and its seamless integration with Google Analytics. You can forget about both.
We can accept that experiments now have to be paid for, but everything else raises questions. Google has chosen the simplest way out of one of the most painful problems: shifting the burden of solving it onto third-party developers.
An API was announced for third-party services, through which their clients can send data to GA4, create audiences, and analyze experiment results. Set aside the fact that it's a stopgap; there's something more important.
Google suggests analyzing the results of A/B tests in the GA4 interface. And that's laughable, primarily because of HyperLogLog++. A/B tests demand accuracy and granular data, and HLL++ makes that essentially impossible, because it produces reports with approximate (inaccurate) numbers.
The only viable mechanism is interpreting experiment results in BigQuery. But how is this better than working directly in the interface of an A/B testing service that calculates significance and everything else for you? It's unclear.
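For the curious, here is a rough sketch of what that BigQuery route looks like. It assumes the variant label is sent as a hypothetical exp_variant event parameter and that a purchase event counts as a conversion; the project, dataset, and variant names are placeholders, not anything GA4 or the announced API defines.

```python
# A sketch of analysing an A/B test on top of the GA4 BigQuery export.
# 'exp_variant' is a hypothetical event parameter; all names are placeholders.
import math
from google.cloud import bigquery

client = bigquery.Client()

query = """
WITH users AS (
  SELECT
    user_pseudo_id,
    MAX((SELECT value.string_value FROM UNNEST(event_params)
         WHERE key = 'exp_variant')) AS variant,
    MAX(IF(event_name = 'purchase', 1, 0)) AS converted
  FROM `my-project.analytics_123456789.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20230601' AND '20230614'
  GROUP BY user_pseudo_id
)
SELECT variant, COUNT(*) AS n, SUM(converted) AS conversions
FROM users
WHERE variant IS NOT NULL
GROUP BY variant
"""

stats = {row.variant: (row.n, row.conversions) for row in client.query(query).result()}
# assuming the two variants are labelled 'A' and 'B'
(n_a, c_a), (n_b, c_b) = stats["A"], stats["B"]

# Two-proportion z-test on exact, unsampled counts - something the GA4
# interface cannot give you because of HLL++.
p_a, p_b = c_a / n_a, c_b / n_b
p_pool = (c_a + c_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"A: {p_a:.3%}, B: {p_b:.3%}, z = {z:.2f}, p-value = {p_value:.4f}")
```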
7th sin - views and properties management
It's a pain. In Universal Analytics you could slice a property into views for different roles, tasks, traffic types, or locations. Now all of that is gone. Yes, subproperties and roll-up properties (child and aggregated properties) have been introduced, but they're only available to GA360 subscribers. Regular users are left to invent workarounds for such an obvious and important task as creating and managing views.
The extras
Attribution - the set of available models has changed, and cross-channel data-driven attribution is now the default. Essentially, it's another black box.
Empty parameter values in reports - they can show up (for example, in the Landing page report), and the documentation doesn't explain why.
Lack of custom parameters with item/product scope - they are still missing, despite having been announced.
Lack of a clear description of the migration - it's unclear what will happen to existing properties once access to historical data is switched off, or when exactly that will happen (the documentation offers only the vague wording "at least 6 months").
Conclusion
Unfortunately, just a few days before the deprecation of the previous version, the new one remains unfinished.
If you have any grievances about GA4, please share them in the comments.
If you enjoyed this post, please like and share it with your friends. It means a lot to me.