BigQuery is the single most tested service on the GCP Professional Data Engineer exam. You need to know not just what it is, but how to optimize it for performance, cost, and security. This guide covers every BigQuery optimization pattern that shows up on the exam.
Partitioning Partitioning divides a table into segments based on a column value. When queries filter by the partition column, BigQuery scans only the relevant partitions instead of the entire table. Partition types: | Type | Column Type | Example | |
|
|
| | Time-unit | TIMESTAMP, DATE, DATETIME | Daily, hourly, monthly, yearly | | Integer range | INT64 | Range of customer IDs | | Ingestion time | (automatic) | Data loaded on a given day | Exam pattern: "A table has 3 years of data. Most queries filter by date. Queries are slow and expensive." The answer is partition by date. Not "add more clustering" or "create a materialized view." Key rules:: You can only have one partitioning column per table: Partition filtering is mandatory for partition pruning to work (the query must filter on the partition column): Maximum 4000 partitions per table
Clustering Clustering sorts data within each partition by up to 4 columns. This improves queries that filter or aggregate by those columns because BigQuery can skip blocks of data that don't match. Clustering vs partitioning: | Technique | What It Does | Best For | |
|
|
| | Partitioning | Divides table into segments | Reducing scanned data by date or range | | Clustering | Sorts within partitions | Improving filter performance on non-partition columns | | Both | Partitions + sorts within | Maximum optimization | Exam pattern: "A table is partitioned by date and clustered by customer_id. A query filters by date and customer_id scans less data than filtering by date alone." This is correct: clustering adds pruning on top of partitioning. Clustering columns should be:: High cardinality (many distinct values): Frequently used in WHERE, GROUP BY, JOIN clauses: NOT low cardinality booleans or enums (use those as filters, not cluster keys) The exam trap: "Change the clustering key to fix a query that doesn't filter by the current cluster key." The fix is to cluster by the column the query actually filters by. But you can only cluster by up to 4 columns: choose wisely.
Materialized Views Materialized views precompute and store query results. BigQuery maintains them automatically as base tables change. Queries that match the materialized view's logic are automatically rewritten to use the cached results. When to use:: Repeated expensive aggregations (daily dashboards, weekly reports): Queries that join multiple large tables repeatedly: BI workloads with predictable query patterns When NOT to use:: Ad hoc queries (no repeated pattern): Highly volatile data (base table changes constantly, maintenance cost is high): Queries that need real-time data (materialized views have slight lag) Exam pattern: "The same aggregation query runs 200 times per day across dashboards. It's slow and expensive." Materialized view. Not "add partitioning" or "use BI Engine." Limitations tested:: Materialized views can only query a single base table (no joins in most cases): Aggregations must be simple (SUM, COUNT, AVG, MIN, MAX): The view definition must remain valid (no breaking schema changes to base table)
BI Engine BI Engine is BigQuery's in-memory acceleration layer. It caches frequently accessed data in memory for sub-second query response. Use for: Dashboard tools (Looker, Tableau, Data Studio) that need fast response times on repeated queries. Exam pattern: "A Looker dashboard queries BigQuery and takes 5 seconds per widget. Users need sub-second response." Enable BI Engine. Not "create a materialized view" (that helps but BI Engine is designed for this specific use case).
Cost Optimization BigQuery pricing has two components:: Storage (per GB per month): Query processing (bytes scanned) Cost reduction strategies: | Strategy | How It Saves | |
|
| | Partitioning | Reduces bytes scanned per query | | Clustering | Reduces bytes scanned within partitions | | Materialized views | Precomputes expensive aggregations | | BI Engine | Reduces repeated query processing | | Reservations/flat-rate | Converts on-demand to committed pricing | | Storage lifecycle | Moves old data to cheaper storage | | SELECT * | Avoid it: scan only needed columns | The exam trap: "A team wants to reduce BigQuery costs. They currently pay on-demand pricing and run predictable workloads." The answer is flat-rate pricing (reservations), not "delete old data" or "use smaller tables." Exam pattern: "A query uses SELECT * and scans 1 GB. It only needs 3 columns." Fix: Select only the needed columns. This reduces bytes scanned and cost.
Authorized Views and Row-Level Security Authorized views let you share a subset of data without exposing the underlying table. Analysts query the view, not the table. The view can filter rows and columns. Row-level security (RLS) filters rows at the table level based on user identity or group membership. Different from authorized views: RLS is a policy on the table itself. | Technique | Scope | Use When | |
|
|
| | Authorized view | View-level | Share a subset of data with specific users | | Row-level security | Table-level | Filter rows by user context automatically | | Policy tags | Column-level | Restrict access to sensitive columns | Exam pattern: "Regional managers should see only their region's sales data in BigQuery. There are 50 regions." Row-level security policy with a region mapping table. Not "create 50 separate views": that's unmaintainable.