If your BigQuery queries are drowning in duplicates, it’s time to bring out the big guns: QUALIFY + ROW_NUMBER(). This trick is a lifesaver when cleaning up messy datasets, and shoutout to Constantin Lungu for highlighting it! I with great pleasure borrowed his screenshot to make it even clearer!
Here’s the deal:
You need to keep only the first (or last) occurrence of something — say, the first event per user. Easy, right? Just use ROW_NUMBER() over your partition by user and order by event time. Then, use QUALIFY [window function result] = 1, and boom — duplicates gone!
BUT… there’s a catch.
Sometimes, two events happen at the exact same time. Uh-oh. What now?
- ROW_NUMBER() will randomly pick one and move on — clean, but non-deterministic.
- DENSE_RANK() will keep ties, letting you decide what to do next.
Moral of the story? Choose your ranking function wisely!
If you work with GA4 to BigQuery exports, be sure to check out my SQL cheat sheet.