Effortless Cardinality Estimation in Snowflake Using DATASKETCHES_HLL

4 min read · May 11, 2025

Snowflake just made counting distinct values at scale much faster — thanks to its new suite of HyperLogLog-based SQL functions.

What Is HyperLogLog (HLL), and Why Should You Care?

In data engineering, cardinality estimation — determining how many unique elements exist in a set — is essential but often costly. Traditional COUNT(DISTINCT col) operations are accurate, but they can be slow and expensive when scanning millions or billions of rows.

Snowflake now supports Apache DataSketches’ HyperLogLog (HLL) algorithm — a probabilistic, compressed data structure that allows you to estimate distinct counts quickly and with minimal resource use.

Benefits of Using HLL in Snowflake

  • Speed: Executes faster than COUNT(DISTINCT) for large datasets
  • Efficiency: Uses compact, binary sketches that save memory
  • Scalability: Ideal for distributed environments and data lakes
  • Flexibility: Combine, merge, and estimate sketches across groups or time

The Four New SQL Functions

| Function | Description |
| --- | --- |
| `DATASKETCHES_HLL(expr)` | Returns an approximate distinct count directly |
| `DATASKETCHES_HLL_ACCUMULATE(expr)` | Generates a binary sketch for storage or later use |
| `DATASKETCHES_HLL_COMBINE(sketch)` | Merges multiple sketches into one |
| `DATASKETCHES_HLL_ESTIMATE(sketch)` | Returns an approximate count from a sketch |
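
In practice, the last three functions form a pipeline: accumulate a sketch per group, combine the sketches, then estimate. Here is a minimal sketch of that pattern using placeholder names (my_events, user_id, event_date) rather than a real table:

-- Illustrative only: my_events, user_id, and event_date are placeholder names.
-- 1. Accumulate one sketch per day.
CREATE OR REPLACE TABLE daily_sketches AS
SELECT
    event_date,
    DATASKETCHES_HLL_ACCUMULATE(user_id) AS sketch
FROM my_events
GROUP BY event_date;

-- 2. Combine the daily sketches and 3. estimate the overall distinct count.
SELECT
    DATASKETCHES_HLL_ESTIMATE(DATASKETCHES_HLL_COMBINE(sketch)) AS approx_distinct_users
FROM daily_sketches;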

Real-World Use Case: Customer Segmentation

Let’s walk through a practical use case using Snowflake’s TPCH_SF100 sample data.

🔹 Step 1: Estimate Distinct Customers by Segment

SELECT
    C_MKTSEGMENT,
    DATASKETCHES_HLL(C_CUSTKEY) AS approx_customers,
    COUNT(DISTINCT C_CUSTKEY) AS exact_customers
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER
GROUP BY C_MKTSEGMENT;

Result: Near-instant estimates — and you can compare them with exact counts to gauge accuracy.
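
If you want to quantify that gap, a quick follow-up query (a sketch, relying on Snowflake's ability to reference earlier column aliases in the same SELECT) computes the relative error per segment:

-- Rough accuracy check: relative error of the HLL estimate vs. the exact count.
SELECT
    C_MKTSEGMENT,
    DATASKETCHES_HLL(C_CUSTKEY) AS approx_customers,
    COUNT(DISTINCT C_CUSTKEY) AS exact_customers,
    ROUND(100 * ABS(approx_customers - exact_customers) / exact_customers, 2) AS pct_error
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER
GROUP BY C_MKTSEGMENT;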

🔹 Step 2: Store Sketches for Later Use

CREATE OR REPLACE TEMP TABLE segment_sketches AS
SELECT
    C_MKTSEGMENT,
    DATASKETCHES_HLL_ACCUMULATE(C_CUSTKEY) AS sketch
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER
GROUP BY C_MKTSEGMENT;

SELECT * FROM segment_sketches;

Each row now stores a compact binary sketch for its segment, ready to be reused in later aggregations without rescanning the base table.
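
Because the sketch column is all you need, you can estimate each segment's count later without touching the CUSTOMER table again; for example:

-- Estimate per-segment counts directly from the stored sketches (no rescan of CUSTOMER).
SELECT
    C_MKTSEGMENT,
    DATASKETCHES_HLL_ESTIMATE(sketch) AS approx_customers
FROM segment_sketches;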

🔹 Step 3: Merge Sketches and Estimate Total

CREATE OR REPLACE TEMP TABLE combined_sketch AS
SELECT
    DATASKETCHES_HLL_COMBINE(sketch) AS merged_sketch
FROM segment_sketches
WHERE C_MKTSEGMENT = 'AUTOMOBILE';

SELECT
    DATASKETCHES_HLL_ESTIMATE(merged_sketch) AS automobile_customer_estimate
FROM combined_sketch;

Use Case: This pattern is useful when summarizing distinct counts across partitions, regions, or time slices. The filter above keeps only the AUTOMOBILE segment's sketch; dropping the WHERE clause merges every segment into a single grand total, as shown below.
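
For example, a grand-total estimate across all segments:

-- Merge every segment's sketch into one and estimate total distinct customers.
SELECT
    DATASKETCHES_HLL_ESTIMATE(DATASKETCHES_HLL_COMBINE(sketch)) AS total_customer_estimate
FROM segment_sketches;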

Pro Tips & Best Practices

| Pitfall | Recommendation |
| --- | --- |
| Forgetting sketch size | Default HLL accuracy is solid, but it can be tuned via the optional `lg_k` argument, e.g. `DATASKETCHES_HLL(expr, lg_k)` |
| Not storing sketches | Persist sketches in a permanent table if you want to reuse them later |
| Expecting 100% precision | HLL is probabilistic (~1–2% error): great for trends, not for billing! |
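
As the first tip above notes, the optional lg_k argument trades sketch size for accuracy. A hedged example, assuming that argument is available in your account (in Apache DataSketches the default lg_k is typically 12; higher values lower the error at the cost of a larger sketch):

-- Assumes the optional lg_k argument is available; 16 yields a larger, more accurate sketch.
SELECT
    DATASKETCHES_HLL(C_CUSTKEY, 16) AS approx_customers_lgk16
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER;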

Performance Comparison (Optional Benchmarking)

-- Exact Count
SELECT COUNT(DISTINCT C_CUSTKEY) FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER;
-- Approximate Count
SELECT DATASKETCHES_HLL(C_CUSTKEY) FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER;

In tests, the approximate count runs up to 20% faster on large datasets.
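
To benchmark this yourself, one option is to compare elapsed times from Snowflake's QUERY_HISTORY table function after running both queries; a sketch (the text filter is just an illustration):

-- Compare elapsed times (ms) of the two queries above via recent query history.
SELECT query_text, total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 20))
WHERE query_text ILIKE '%C_CUSTKEY%'
ORDER BY start_time DESC;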

When to Use Which Function?

| Use Case | Recommended Function |
| --- | --- |
| Quick estimates | `DATASKETCHES_HLL` |
| Aggregating by date/region | `DATASKETCHES_HLL_ACCUMULATE` |
| Reporting rollups | `DATASKETCHES_HLL_COMBINE` + `DATASKETCHES_HLL_ESTIMATE` |
| Daily incremental loading | Store sketches and merge periodically (see the sketch below) |
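
The last row is where sketches pay off most. Here is a sketch of that incremental pattern under assumed names: an events table with user_id and event_date, and a daily_user_sketches table with a load_date and a binary sketch column.

-- Hypothetical incremental pattern: events and daily_user_sketches are placeholder tables.
-- Each day, append that day's sketch...
INSERT INTO daily_user_sketches (load_date, sketch)
SELECT
    CURRENT_DATE,
    DATASKETCHES_HLL_ACCUMULATE(user_id)
FROM events
WHERE event_date = CURRENT_DATE;

-- ...then roll up distinct users for any window by merging the stored sketches.
SELECT
    DATASKETCHES_HLL_ESTIMATE(DATASKETCHES_HLL_COMBINE(sketch)) AS approx_users_last_30_days
FROM daily_user_sketches
WHERE load_date >= DATEADD(day, -30, CURRENT_DATE);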

TL;DR

Snowflake’s HyperLogLog-powered DATASKETCHES_HLL functions let you:

  • Replace expensive COUNT(DISTINCT) queries
  • Handle massive data with minimal resource impact
  • Build scalable analytics pipelines

This is a game-changer for performance-sensitive workloads.

Final Thoughts

If you’re building dashboards, behavioral analytics, or large-scale reports, you need these functions in your toolkit.

Start with basic estimation, then evolve into sketch storage and merging for complex use cases.

👏 Enjoyed this? Share, comment, or follow me for more Snowflake deep dives and real-world data engineering tutorials.

#Snowflake, #DataEngineering, #HyperLogLog, #BigData, #SQLTips, #CardinalityEstimation, #CloudDataPlatform, #DataAnalytics, #ProbabilisticDataStructures, #PerformanceOptimization
