Rely On Referential Integrity | Data Model Power BI

Introduction

Throughout my journey with Power BI, I was lucky enough to work on a project deploying Power BI with Databricks using DirectQuery. It was during this time that I stumbled upon a concept that was completely new to me: Referential Integrity settings in the data model.

To be honest, not many people talk about this setting. Why? Because it generally applies specifically to DirectQuery scenarios (according to Microsoft documentation), and let's face it, most of us love Import Mode, and DirectQuery is quite expensive. But when you are dealing with massive datasets where Import is not an option, this little switch can be a game-changer.

In a nutshell: When connecting to a data source using DirectQuery, enabling "Assume Referential Integrity" tells Power BI to generate INNER JOIN statements instead of the default LEFT OUTER JOIN when querying the backend database. This might sound like a small technicality, but it can significantly improve query efficiency.

Let's dive in and see how it works!

Keywords

Data Model, Referential Integrity, Optimization, DirectQuery, Tabular Editor 2, Cost reduction

Terminology: Referential Integrity

You might be thinking, "Referential Integrity? That sounds like a database thing." And you would be absolutely right! This concept didn't start with Power BI. It stems directly from traditional Relational Database Management Systems (RDBMS) design. In the database world, it is a constraint used to maintain data consistency, ensuring that every Foreign Key (e.g., in your transaction table) points to a valid, existing Primary Key (e.g., in your customer master table). Power BI essentially borrows this classic concept to decide how much "trust" to place in your data source.

Requirements

This is an advanced setting, so we can't just turn it on blindly. It works its magic only when you are connecting via DirectQuery. To make sure "Assume Referential Integrity" works properly and doesn't break your report, your data needs to strictly follow these rules:

No Nulls allowed: The data in the "From" column (usually the Foreign Key in your Fact table) must never be Null or blank.

Perfect Matching: For every value in the "From" column, there must be a corresponding value in the "To" column (the Primary Key in your Dimension table).

Note: In this context, the "From" column is typically the "Many" side in a One-to-Many relationship (your Fact table), and the "To" column is the "One" side (your Dimension table).
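Before flipping the switch, it's worth verifying that these rules actually hold in your source. A simple orphan check, sketched against the sample tables used later in this article (the column name customer_id is my assumption; substitute your actual key columns), could look like this:

```sql
-- Find fact rows whose foreign key is NULL or has no match in the
-- dimension table. If the data satisfies referential integrity,
-- this query should return zero rows.
SELECT t.customer_id, COUNT(*) AS orphan_rows
FROM sales_transactions AS t
LEFT JOIN sales_customers AS c
    ON t.customer_id = c.customer_id
WHERE t.customer_id IS NULL
   OR c.customer_id IS NULL
GROUP BY t.customer_id;
```

Running a check like this as part of your ETL validation gives you evidence, rather than hope, that the setting is safe to enable.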

Methodology & Dataset

The Methodology

To demonstrate this, I’m not just going to talk theory. I’m going to run a test using Tabular Editor 2 and Databricks.

I will use Tabular Editor 2 (my go-to tool for modeling) to toggle the "Assume Referential Integrity" property between True and False for the relationships. Then, we will capture the actual SQL query sent to the Databricks database to see the difference under the hood.

The Dataset

I'm using a sample dataset from the Databricks Free Edition with a simple Star Schema structure:

  • Dimension tables: 

    sales_customers 

    sales_franchises

  • Fact table: 

    sales_transactions


Let's imagine we drag a visual onto the canvas that shows Total Transactions by Customer Name.

Performance Showdown (Tuning)

First, open Tabular Editor 2 and connect to your data model. Navigate to Model > Relationships and select the relationship between the sales_transactions table and the sales_customers table. In the Properties panel on the right, locate the Options section. Here, you can set Assume Referential Integrity to True or False for our testing.

1. Set False (The Default)

By default, Power BI plays it safe. It assumes your data might be "dirty" (i.e., you might have sales transactions for customers that don't exist in your customer list).

When Assume Referential Integrity is set to False, Power BI generates a LEFT OUTER JOIN.

👉 Generated SQL sent to Databricks:
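The captured query isn't reproduced verbatim here, but a simplified sketch of what Power BI sends for our "Total Transactions by Customer Name" visual looks like this (the column names customer_name, customer_id, and total_price are my assumptions based on the sample schema; the real generated SQL also includes aliases and a row limit):

```sql
-- With Assume Referential Integrity = False, Power BI plays it safe
-- and keeps every fact row, matched or not.
SELECT
    c.customer_name,
    SUM(t.total_price) AS total_transactions
FROM sales_transactions AS t
LEFT OUTER JOIN sales_customers AS c
    ON t.customer_id = c.customer_id
GROUP BY c.customer_name;
```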

The impact: The database engine has to do extra work to keep all rows from the left table, even if there are no matches on the right. In the world of Big Data, this overhead adds up fast.

2. Set True (To Optimize)

Now, let's go into Tabular Editor, find the relationship, and set Assume Referential Integrity to True. We are essentially telling Power BI: "Trust me, my data is clean. Every sale belongs to a valid customer."

Power BI now switches to an INNER JOIN.

👉 Generated SQL sent to Databricks:
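Again, a simplified sketch of the generated query (same assumed column names as before); the only change is the join type:

```sql
-- With Assume Referential Integrity = True, Power BI trusts the data
-- and asks only for matching rows.
SELECT
    c.customer_name,
    SUM(t.total_price) AS total_transactions
FROM sales_transactions AS t
INNER JOIN sales_customers AS c
    ON t.customer_id = c.customer_id
GROUP BY c.customer_name;
```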

The impact: Database engines love Inner Joins. They are generally faster and much more efficient to execute because the database only cares about matching rows.

Discussion

So, should you always turn this on? No, no, no: think about Data Quality Assurance first, then make the decision. This setting is a classic example of the trade-off between Performance and Data Quality Assurance.

-> The Benefit: If you are querying billions of rows on Databricks, Snowflake, or Synapse, switching to INNER JOIN can shave significant time off your report loading speed and reduce cost. It creates a much leaner query plan for the data warehouse.

-> The Risk: This is the important part. If you enable this setting but your data isn't actually perfect (e.g., you have a "Sale" with a CustomerID that doesn't exist in the Customer table), those rows will simply disappear from your report. Power BI won't show an error; it just won't show the data.
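To quantify this risk before enabling the setting, you can compare row counts directly (again assuming customer_id is the key column):

```sql
-- The difference between these two counts is the number of
-- transactions that would silently vanish under an INNER JOIN.
SELECT COUNT(*) AS all_transactions
FROM sales_transactions;

SELECT COUNT(*) AS matched_transactions
FROM sales_transactions AS t
INNER JOIN sales_customers AS c
    ON t.customer_id = c.customer_id;
```

If the two counts differ, fix the orphaned rows in your pipeline before trusting the INNER JOIN.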

A Reality Check on Databricks Optimizer

It is worth noting that modern engines like Databricks have incredibly powerful query optimizers. In simple cases (like my example above), the engine is often smart enough to realize that a LEFT JOIN acts effectively like an INNER JOIN (if it detects the constraints) and will automatically optimize the execution plan. So, in basic tests, you might see zero performance difference because Databricks "fixed" the query for you behind the scenes.

But here is the catch: Real-world Power BI reports are rarely simple. We are often dealing with complex DAX calculations (VAR and RETURN) that translate into SQL behemoths spanning 500 to 1,000 lines of code with multiple nested joins. In these heavy scenarios, we shouldn't rely on the database optimizer to "guess" right every time. By explicitly enabling Referential Integrity, you guarantee the optimal path (INNER JOIN) regardless of query complexity, removing the guesswork from performance tuning.

Final thought

Use this feature! It’s fantastic for optimization. But before you do:

❗️Ensure your ETL pipeline is robust. 

❗️Ensure your ETL pipeline is robust. 

❗️Ensure your ETL pipeline is robust. 

"Assume Referential Integrity" is powerful, but it relies on you knowing your data inside and out.

References

Assume Referential Integrity Setting in Power BI Desktop - Power BI | Microsoft Learn