Practice Free Databricks Certified Professional Data Engineer Exam Online Questions
The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes.
A junior engineer has written the following code to add CHECK constraints to the Delta Lake table:

A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed.
Which statement explains the cause of this failure?
- A . Because another team uses this table to support a frequently running application, two-phase locking is preventing the operation from committing.
- B . The activity details table already exists; CHECK constraints can only be added during initial table creation.
- C . The activity details table already contains records that violate the constraints; all existing data must pass CHECK constraints in order to add them to an existing table.
- D . The activity details table already contains records; CHECK constraints can only be added prior to inserting values into a table.
- E . The current table schema does not contain the field valid coordinates; schema evolution will need to be enabled before altering the table to add a constraint.
C
Explanation:
The failure is that the code to add CHECK constraints to the Delta Lake table fails when executed. The code uses ALTER TABLE ADD CONSTRAINT commands to add two CHECK constraints to a table named activity_details. The first constraint checks if the latitude value is between -90 and 90, and the second constraint checks if the longitude value is between -180 and 180. The cause of this failure is that the activity_details table already contains records that violate these constraints, meaning that they have invalid latitude or longitude values outside of these ranges. When adding CHECK constraints to an existing table, Delta Lake verifies that all existing data satisfies the constraints before adding them to the table. If any record violates the constraints, Delta Lake throws an exception and aborts the operation.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “Add a CHECK constraint to an existing table” section.
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-alter-table.html#add-constraint
A data engineer needs to install the PyYAML Python package for YAML file processing within their Databricks environment. However, the Databricks workspace is air-gapped and does not have direct internet access to download packages from PyPI. The engineer has already downloaded the required PyYAML wheel (.whl) file onto their laptop. The data engineer wants to install the PyYAML package from the local wheel file so that it is automatically available whenever any new cluster is provisioned in their Databricks workspace.
Which approach should the data engineer use?
- A . Upload the PyYAML.whl file to a Unity Catalog volume. Add the path to the Unity Catalog allowlist if required. Then create a cluster-scoped init script that executes pip install /path/to/PyYAML.whl.
- B . Set up a private PyPI repository, register the wheel there, and create a cluster-scoped init script that executes /databricks/python/bin/pip install –index-url=https://{repo-url} PyYAML on the cluster.
- C . Upload the PyYAML.whl file under the data engineer’s user home directory in the Workspace, and create a cluster-scoped init script that executes %pip install /path/to/PyYAML.whl on the shared cluster.
- D . Add the PyYAML.whl file directly to Databricks Git Repos and assume that any cluster linked to the
Repo will automatically have PyYAML installed from that file.
A
Explanation:
Databricks documents that libraries and init scripts can be sourced from Unity Catalog volumes, and for standard access mode, relevant paths can require Unity Catalog allowlist configuration. Databricks also documents that init scripts run on every cluster startup, which is the mechanism that makes a package automatically available whenever a new cluster is provisioned. (Databricks Documentation)
Option A fits the air-gapped requirement because it does not depend on internet access and uses a startup-time installation mechanism.
Option B still depends on reachable repository infrastructure.
Option C is incorrect because %pip is notebook magic, not the correct form inside an init script.
Option D is unsupported because storing a wheel in Git Repos does not automatically install it on cluster creation. (Databricks Documentation)
A data engineering team is setting up deployment automation. To deploy workspace assets remotely using the Databricks CLI command, they must configure it with proper authentication.
Which authentication approach will provide the highest level of security?
- A . Use a service principal with OAuth token federation.
- B . Use a service principal ID and its OAuth client secret.
- C . Use a service principal and its Personal Access Token.
- D . Use a shared user account and its OAuth client secret.
A
Explanation:
The most secure and enterprise-recommended authentication method for Databricks automation is OAuth token federation with service principals.
This configuration allows service principals (non-human identities) to authenticate using temporary
OAuth access tokens from a trusted identity provider (such as Azure AD or AWS IAM federation).
These tokens are short-lived and scoped, significantly reducing credential exposure risks.
By contrast, static client secrets (B) or PATs (C) are long-lived and require periodic manual rotation, increasing security vulnerability. Shared user accounts (D) violate least-privilege and auditability principles. Therefore, A provides the strongest, most compliant authentication model for automated CLI and CI/CD workflows.
A data engineer is attempting to execute the following PySpark code:
df = spark.read.table("sales")
result = df.groupBy("region").agg(sum("revenue"))
However, upon inspecting the execution plan and profiling the Spark job, they observe excessive data shuffling during the aggregation phase.
Which technique should be applied to reduce shuffling during the groupBy aggregation operation?
- A . Caching the DataFrame df.
- B . Repartition by region before aggregation.
- C . Use coalesce() after the aggregation.
- D . Use broadcast join.
B
Explanation:
Databricks documents that shuffle occurs when Spark redistributes data across partitions for grouping or joining. To optimize aggregation performance, repartitioning by the grouping key (region) ensures rows with the same key are co-located in the same partition, thus minimizing shuffle movement. Caching improves reuse of DataFrames but does not reduce shuffle volume. coalesce() reduces the number of partitions after computation and cannot prevent shuffle. Broadcast joins are unrelated to single-table aggregations. The recommended practice for reducing shuffle in aggregation is explicit repartitioning by the grouping column.
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.
The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?
- A . The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
- B . Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
- C . Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.
- D . Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
- E . Schema inference and evolution on .Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
D
Explanation:
This is the correct answer because it accurately presents information about Delta Lake and Databricks that may impact the decision-making process of a junior data engineer who is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields. Delta Lake and Databricks support schema inference and evolution, which means that they can automatically infer the schema of a table from the source data and allow adding new columns or changing column types without affecting existing queries or pipelines. However, schema inference and evolution may not always be desirable or reliable, especially when dealing with complex or nested data structures or when enforcing data quality and consistency across different systems. Therefore, setting types manually can provide greater assurance of data quality enforcement and avoid potential errors or conflicts due to incompatible or unexpected data types.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “Schema inference and partition of streaming DataFrames/Datasets” section.
A security analytics pipeline must enrich billions of raw connection logs with geolocation data. The join hinges on finding which IPv4 range each event’s address falls into.
Table 1: network_events (≈ 5 billion rows)
event_id ip_int
42 3232235777
Table 2: ip_ranges (≈ 2 million rows)
start_ip_int end_ip_int country
3232235520 3232236031 US
The query is currently very slow:
SELECT n.event_id, n.ip_int, r.country
FROM network_events n
JOIN ip_ranges r
ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int;
Which change will most dramatically accelerate the query while preserving its logic?
- A . Increase spark.sql.shuffle.partitions from 200 to 10000.
- B . Add a range-join hint /*+ RANGE_JOIN(r, 65536) */.
- C . Force a sort-merge join with /*+ MERGE(r) */.
- D . Add a broadcast hint: /*+ BROADCAST(r) */ for ip_ranges.
B
Explanation:
The query joins billions of rows (network_events) with millions of rows (ip_ranges) using a range predicate (BETWEEN). Unlike equality joins (=), range joins are not efficiently handled by broadcast or sort-merge joins because:
Broadcast Join (D): Effective for small tables but only for equality joins. Since this query uses a range condition, broadcast will not reduce the complexity of scanning billions of records across non-equality conditions.
Sort-Merge Join (C): Works for ordered joins but is inefficient on range conditions. Sorting billions of records adds excessive overhead and will not resolve the bottleneck.
Increasing Shuffle Partitions (A): Only spreads out shuffle work but does not address the fundamental inefficiency of range-based lookups at scale.
Range Joins in Spark (RANGE_JOIN hint):
Databricks provides range join optimizations specifically for conditions such as BETWEEN. By applying a RANGE_JOIN hint, Spark can build optimized data structures (such as interval indexes or partition pruning strategies) that map billions of input rows to ranges much faster. This avoids brute-force scans and unnecessary shuffle costs.
Thus, Option B is the correct solution because:
It leverages range-join optimization, which is purpose-built for queries joining massive event logs to smaller lookup tables with IP ranges.
This ensures Spark can evaluate billions of rows against millions of ranges with optimized matching logic, drastically improving query performance while preserving correctness.
Reference: Databricks SQL Performance Tuning Guide C Range Joins and Join Hints (RANGE_JOIN, BROADCAST, MERGE).
A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team access to view the driver logs if needed.
What are the minimal cluster permissions that allow the development team to accomplish this?
- A . CAN ATTACH TO
- B . CAN MANAGE
- C . CAN VIEW
- D . CAN RESTART
C
Explanation:
Databricks provides different permission levels to control access to clusters. The correct minimal permission required for viewing driver logs is CAN VIEW.
Databricks Cluster Permission Levels:
CAN ATTACH TO:
Allows users to attach notebooks to a cluster but does not allow them to view logs.
Not sufficient for viewing driver logs.
CAN MANAGE:
Grants full control over the cluster, including starting, stopping, and editing configurations.
Too broad for this requirement.
CAN VIEW (Correct Answer):
Allows users to view cluster details, logs, and status but not modify any configurations. Minimal required permission for viewing logs. CAN RESTART:
Grants permission to restart the cluster, but does not include log access.
Not sufficient for viewing logs.
Conclusion:
The minimal permission needed to allow the development team to view driver logs is CAN VIEW.
Reference: Databricks Cluster Permissions Documentation
While reviewing a query’s execution in the Databricks Query Profiler, a data engineer observes that the Top Operators panel shows a Sort operator with high Time Spent and Memory Peak metrics. The Spark UI also reports frequent data spilling.
How should the data engineer address this issue?
- A . Switch to a broadcast join to reduce memory usage.
- B . Repartition the DataFrame to a single partition before sorting.
- C . Convert the sort operation to a filter operation.
- D . Increase the number of shuffle partitions to better distribute data.
D
Explanation:
When Spark performs wide transformations such as sortBy or orderBy, large data volumes can exceed memory limits, causing disk spilling. The official Databricks performance tuning guide recommends increasing the shuffle partition count to distribute the data more evenly across executors. By default, Spark uses a fixed number of shuffle partitions (e.g., 200), which can lead to memory imbalance and spill if some partitions are too large. Increasing this number (via spark.sql.shuffle.partitions) results in smaller partitions, reduced in-memory pressure, and improved sort performance. Other options like broadcast joins or single partition sorts do not apply to single-table sorts, and converting to filters changes query logic. Thus, option D is the correct remedy.
A data engineer is masking a column containing email addresses. The goal is to produce output strings of identical length for all rows, while generating different outputs for different email values.
Which SQL function should be used to achieve this?
- A . mask(email, ‘?’)
- B . hash(email)
- C . sha1(email)
- D . sha2(email, 0)
B
Explanation:
The hash() function in Databricks SQL returns a deterministic fixed-length integer (or hexadecimal string) derived from the input. When applied to sensitive identifiers like email addresses, it produces a unique value for each distinct input while ensuring uniform output size, making it suitable for anonymization where referential consistency is required.
Functions like mask() perform pattern-based substitutions that change string lengths, and sha1() or sha2() produce long hexadecimal strings of varying lengths (depending on hash size), which may not match requirements for fixed-length masking.
Therefore, the correct choice for fixed-length, deterministic pseudonymization of email addresses is hash(email), as it maintains analytical usability while anonymizing sensitive data.
A query is taking too long to run. After investigating the Spark UI, the data engineer discovered a significant amount of disk spill. The compute instance being used has a core-to-memory ratio of 1:2.
What are the two steps the data engineer should take to minimize spillage? (Choose 2 answers)
- A . Choose a compute instance with a higher core-to-memory ratio.
- B . Choose a compute instance with more disk space.
- C . Increase spark.sql.files.maxPartitionBytes.
- D . Reduce spark.sql.files.maxPartitionBytes.
- E . Choose a compute instance with more network bandwidth.
A,D
Explanation:
Databricks recommends addressing disk spilling―which occurs when Spark tasks run out of memory―by increasing memory per core and controlling partition size. Selecting an instance type with a higher memory-to-core ratio (A) provides each task with more available RAM, directly reducing the chance of spilling to disk. Additionally, reducing spark.sql.files.maxPartitionBytes (D) creates smaller partitions, preventing any single task from holding too much data in memory. Increasing partition size (C) or disk capacity (B) does not solve memory bottlenecks, and bandwidth (E) affects network I/O, not spill behavior. Therefore, the correct actions are A and D.
