Practice Free Databricks Certified Professional Data Engineer Exam Online Questions
Which statement describes the default execution mode for Databricks Auto Loader?
- A . New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.
- B . Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
- C . A webhook triggers a Databricks job to run anytime new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.
- D . New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
A
Explanation:
Databricks Auto Loader simplifies and automates the process of loading data into Delta Lake. The default execution mode of the Auto Loader identifies new files by listing the input directory. It incrementally and idempotently loads these new files into the target Delta Lake table. This approach ensures that files are not missed and are processed exactly once, avoiding data duplication. The other options describe different mechanisms or integrations that are not part of the default behavior of the Auto Loader.
Reference:
Databricks Auto Loader Documentation: Auto Loader Guide
Delta Lake and Auto Loader: Delta Lake Integration
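For illustration, a minimal PySpark sketch of Auto Loader in its default directory-listing mode; the paths, file format, and table name are hypothetical placeholders, and file notification mode would instead require cloudFiles.useNotifications plus cloud-specific setup:

```python
# Minimal Auto Loader sketch in the default directory-listing mode.
# Paths and the target table name below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream
    .format("cloudFiles")                     # Auto Loader source
    .option("cloudFiles.format", "json")      # format of the arriving files
    # Directory listing is the default; file notification mode would require
    # .option("cloudFiles.useNotifications", "true") and cloud queue setup.
    .load("s3://example-bucket/landing/")     # hypothetical input directory
)

(
    stream.writeStream
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/landing")  # tracks which files were already loaded
    .trigger(availableNow=True)               # process all new files, then stop
    .toTable("bronze.landing_events")         # hypothetical target Delta table
)
```

The checkpoint is what makes the load incremental and idempotent: files recorded there are never reprocessed, so each file lands in the target table exactly once.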
A query is taking too long to run. After investigating the Spark UI, the data engineer discovered a significant amount of disk spill. The compute instance being used has a core-to-memory ratio of 1:2.
What are the two steps the data engineer should take to minimize spillage? (Choose 2 answers)
- A . Choose a compute instance with a higher core-to-memory ratio.
- B . Choose a compute instance with more disk space.
- C . Increase spark.sql.files.maxPartitionBytes.
- D . Reduce spark.sql.files.maxPartitionBytes.
- E . Choose a compute instance with more network bandwidth.
A, D
Explanation:
Databricks recommends addressing disk spilling, which occurs when Spark tasks run out of memory, by increasing memory per core and controlling partition size. Selecting an instance type with a higher memory-to-core ratio (A) provides each task with more available RAM, directly reducing the chance of spilling to disk. Additionally, reducing spark.sql.files.maxPartitionBytes (D) creates smaller partitions, preventing any single task from holding too much data in memory. Increasing partition size (C) or disk capacity (B) does not solve the memory bottleneck, and network bandwidth (E) affects network I/O, not spill behavior. Therefore, the correct actions are A and D.
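Cluster sizing itself happens in the compute configuration rather than in code, but the partition-size half of the fix can be sketched as below; the 64 MB value is an illustrative assumption, not an official recommendation:

```python
# Sketch: shrink input partitions so each task holds less data in memory.
# 64 MB is an illustrative value; the Spark default is 128 MB (134217728 bytes).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)

# Re-read the source after changing the setting so the smaller partitions apply.
df = spark.read.format("delta").load("/path/to/source")  # hypothetical path
print(df.rdd.getNumPartitions())  # more, smaller partitions than before
```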
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.
Which approach will ensure that this requirement is met?
- A . Whenever a database is being created, make sure that the location keyword is used
- B . When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
- C . Whenever a table is being created, make sure that the location keyword is used.
- D . When tables are created, make sure that the external keyword is used in the create table statement.
- E . When the workspace is being configured, make sure that external cloud object storage has been mounted.
C
Explanation:
This is the correct answer because it ensures that the requirement is met: all tables in the Lakehouse should be configured as external Delta Lake tables. An external table is a table whose data is stored in a user-specified location outside the default warehouse directory; Databricks manages the table's metadata but not its underlying data files. An external table can be created by using the LOCATION keyword to specify the path to a directory in a cloud storage system, such as DBFS or S3. By creating external tables, the data engineering team can avoid losing data if they drop or overwrite the table, as well as leverage existing data without moving or copying it.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “Create an external table” section.
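A hedged sketch of the LOCATION-based approach; the catalog, schema, table, columns, and storage path are hypothetical placeholders:

```python
# Sketch: creating an external Delta table by supplying a LOCATION clause.
# All names and the storage path are hypothetical placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id BIGINT,
        order_ts TIMESTAMP,
        amount   DOUBLE
    )
    USING DELTA
    LOCATION 's3://example-bucket/lakehouse/orders'
""")

# Dropping this table later removes only the metastore entry;
# the data files at the LOCATION are left in place.
```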
Which statement describes Delta Lake Auto Compaction?
- A . An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.
- B . Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.
- C . Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
- D . Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
- E . An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.
E
Explanation:
This is the correct answer because it describes the behavior of Delta Lake Auto Compaction, which automatically optimizes the layout of Delta Lake tables by coalescing small files into larger ones. Auto Compaction runs after a write to a table has succeeded and checks whether files within a partition can be further compacted. If so, it runs an optimize job with a default target file size of 128 MB (rather than the 1 GB default used by the standalone OPTIMIZE command). Auto Compaction only compacts files that have not been compacted previously.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “Auto Compaction for Delta Lake on Databricks” section.
"Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven’t been compacted previously."
https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
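Auto compaction can be switched on per table through a Delta table property, or for a session through a Spark configuration; a hedged sketch with a hypothetical table name:

```python
# Sketch: enabling auto compaction on an existing Delta table.
# The table name is a hypothetical placeholder.
spark.sql("""
    ALTER TABLE main.sales.orders
    SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
""")

# Alternatively, enable it for all Delta writes in the current session:
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```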
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field named run_id.
Which statement describes what the number alongside this field represents?
- A . The job_id is returned in this field.
- B . The job_id and the number of times the job has been run are concatenated and returned.
- C . The number of times the job definition has been run in the workspace.
- D . The globally unique ID of the newly triggered run.
D
Explanation:
When triggering a job run using the Databricks CLI, the run_id field in the response represents a globally unique identifier for that particular run of the job. This run_id is distinct from the job_id. While the job_id identifies the job definition and is constant across all runs of that job, the run_id is unique to each execution and is used to track and query the status of that specific job run within the Databricks environment. This distinction allows users to manage and reference individual executions of a job directly.
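For illustration, a hedged sketch that calls the Jobs REST API the CLI wraps and reads back the run_id; the workspace URL, token, and job ID are hypothetical placeholders:

```python
# Sketch: trigger a run of an existing job and capture the run_id of that run.
# Workspace URL, token, and job_id are hypothetical placeholders.
import requests

host = "https://example.cloud.databricks.com"
headers = {"Authorization": "Bearer dapiXXXXXXXXXXXXXXXX"}

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers=headers,
    json={"job_id": 123},
)
resp.raise_for_status()

run_id = resp.json()["run_id"]  # globally unique ID of this specific run
print(run_id)                   # can be used later to poll the status of this run
```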
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.
The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?
- A . The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
- B . Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
- C . Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.
- D . Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
- E . Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
D
Explanation:
This is the correct answer because it accurately presents information about Delta Lake and Databricks that may impact the decision-making process of a junior data engineer who is trying to determine the best approach for dealing with schema declaration given the highly nested structure of the data and the numerous fields. Delta Lake and Databricks support schema inference and evolution, which means they can automatically infer the schema of a table from the source data and allow adding new columns or changing column types without affecting existing queries or pipelines. However, schema inference and evolution may not always be desirable or reliable, especially when dealing with complex or nested data structures or when enforcing data quality and consistency across different systems. Therefore, setting types manually provides greater assurance of data quality enforcement and avoids potential errors or conflicts caused by incompatible or unexpected data types.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “Schema inference and partition of streaming DataFrames/Datasets” section.
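A hedged sketch contrasting inference with an explicit declaration for a few nested fields; the field names and source path are hypothetical placeholders:

```python
# Sketch: declaring types explicitly for nested JSON instead of relying on inference.
# Field names and the source path are hypothetical placeholders.
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, DoubleType
)

explicit_schema = StructType([
    StructField("device_id", StringType(), nullable=False),
    StructField("recorded_at", TimestampType(), nullable=True),
    StructField("reading", StructType([
        StructField("metric", StringType(), True),
        StructField("value", DoubleType(), True),  # enforced as DOUBLE rather than whatever inference picks
    ]), nullable=True),
])

# With an explicit schema, non-conforming records surface immediately instead of
# silently widening the inferred type for the whole column.
df = spark.read.schema(explicit_schema).json("s3://example-bucket/raw/device_recordings/")
```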
A data engineer is designing a system to process batch patient encounter data stored in an S3 bucket, creating a Delta table (patient_encounters) with columns encounter_id, patient_id, encounter_date, diagnosis_code, and treatment_cost. The table is queried frequently by patient_id and encounter_date, requiring fast performance. Fine-grained access controls must be enforced. The engineer wants to minimize maintenance and boost performance.
How should the data engineer create the patient_encounters table?
- A . Create an external table in Unity Catalog, specifying an S3 location for the data files. Enable predictive optimization through table properties, and configure Unity Catalog permissions for access controls.
- B . Create a managed table in Unity Catalog. Configure Unity Catalog permissions for access controls, and rely on predictive optimization to enhance query performance and simplify maintenance.
- C . Create a managed table in Unity Catalog. Configure Unity Catalog permissions for access controls, schedule jobs to run OPTIMIZE and VACUUM commands daily to achieve best performance.
- D . Create a managed table in Hive Metastore. Configure Hive Metastore permissions for access controls, and rely on predictive optimization to enhance query performance and simplify maintenance.
B
Explanation:
Databricks documentation specifies that Unity Catalog managed tables are the preferred choice for secure, low-maintenance Delta Lake architectures. Managed tables provide full lifecycle management, including metadata, file storage, and access control integration with Unity Catalog. Fine-grained permissions can be enforced at the column and row level through built-in Unity Catalog governance.
Additionally, predictive optimization automatically runs maintenance operations such as OPTIMIZE and VACUUM on managed tables, handling file sizes, metadata pruning, and data layout without the need for manually scheduled maintenance jobs.
External tables (A) require manual path management and are not covered by predictive optimization, and Hive Metastore tables (D) do not support Unity Catalog access policies. Therefore, creating a managed Unity Catalog table with predictive optimization provides both the security and performance benefits needed, making B the correct solution.
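A hedged sketch of the managed-table approach; all names are hypothetical, the column types are assumed, and predictive optimization is enabled at the schema or catalog level (syntax may vary by release) rather than per table:

```python
# Sketch: a managed Unity Catalog table with Unity Catalog grants.
# Catalog, schema, table, group names, and column types are assumptions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS healthcare.clinical.patient_encounters (
        encounter_id   STRING,
        patient_id     STRING,
        encounter_date DATE,
        diagnosis_code STRING,
        treatment_cost DECIMAL(10, 2)
    )
""")  # no LOCATION clause, so Unity Catalog manages the data files

# Fine-grained access control through Unity Catalog grants.
spark.sql("GRANT SELECT ON TABLE healthcare.clinical.patient_encounters TO `clinical_analysts`")

# Predictive optimization is enabled at a higher level, not per table;
# a hedged example of the documented SQL:
spark.sql("ALTER SCHEMA healthcare.clinical ENABLE PREDICTIVE OPTIMIZATION")
```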
A Delta table of weather records is partitioned by date and has the below schema:
date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter:
latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
- A . All records are cached to an operational database and then the filter is applied
- B . The Parquet file footers are scanned for min and max statistics for the latitude column
- C . All records are cached to attached storage and then the filter is applied
- D . The Delta log is scanned for min and max statistics for the latitude column
- E . The Hive metastore is scanned for min and max statistics for the latitude column
D
Explanation:
This is the correct answer because Delta Lake uses a transaction log to store metadata about each table, including min and max statistics for each column in each data file. The Delta engine can use this information to quickly identify which files to load based on a filter condition, without scanning the entire table or the file footers. This is called data skipping and it can improve query performance significantly.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; [Databricks Documentation], under “Optimizations – Data Skipping” section.
In the Transaction log, Delta Lake captures statistics for each data file of the table. These statistics indicate per file:
– Total number of records
– Minimum value in each column of the first 32 columns of the table
– Maximum value in each column of the first 32 columns of the table
– Null value counts in each column of the first 32 columns of the table
When a query with a selective filter is executed against the table, the query optimizer uses these statistics to generate the query result. It leverages them to identify data files that may contain records matching the filter condition.
For the SELECT query in the question, the transaction log is scanned for min and max statistics for the latitude column.
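For illustration, a hedged sketch that peeks at the per-file statistics stored in a Delta transaction log commit; the log path is a hypothetical placeholder, and the layout shown is the standard JSON commit format:

```python
# Sketch: inspect the per-file min/max statistics Delta records for data skipping.
# The log path is a hypothetical placeholder.
import json

log_path = "/dbfs/mnt/weather/_delta_log/00000000000000000000.json"

with open(log_path) as f:
    for line in f:
        action = json.loads(line)
        if "add" in action:                             # one "add" action per data file
            stats = json.loads(action["add"]["stats"])  # stats are stored as a JSON string
            print(action["add"]["path"],
                  stats["minValues"].get("latitude"),
                  stats["maxValues"].get("latitude"))

# Files whose [min, max] latitude range cannot satisfy `latitude > 66.3`
# are skipped without ever being opened.
```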
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector for Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?
- A . “Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.
- B . No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
- C . “Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.
- D . “Manage” permission should be set on a secret scope containing only those credentials that will be used by a given team.
C
Explanation:
In Databricks, the Secrets module allows for secure management of sensitive information such as database credentials. Secret access control is applied at the secret scope level; there are no per-key permissions. Granting a team “Read” permission on a scope that contains only that team's credentials ensures that only members of that team can access those credentials. This approach aligns with the principle of least privilege, granting users the minimum level of access required to perform their jobs, thus enhancing security.
Reference: Databricks Documentation on Secret Management: Secrets
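A hedged sketch of how one per-team scope might be wired up with the Secrets REST API; the workspace URL, token, scope, key, and group names are hypothetical placeholders:

```python
# Sketch: one secret scope per team, with READ granted only to that team's group.
# Workspace URL, token, scope/key names, and the group name are placeholders.
import requests

host = "https://example.cloud.databricks.com"
headers = {"Authorization": "Bearer dapiXXXXXXXXXXXXXXXX"}

# Scope holding only the finance team's external-database credential.
requests.post(f"{host}/api/2.0/secrets/scopes/create",
              headers=headers, json={"scope": "finance-db"}).raise_for_status()
requests.post(f"{host}/api/2.0/secrets/put",
              headers=headers,
              json={"scope": "finance-db", "key": "jdbc-password",
                    "string_value": "example-password"}).raise_for_status()

# READ on the scope, granted only to the matching Databricks group.
requests.post(f"{host}/api/2.0/secrets/acls/put",
              headers=headers,
              json={"scope": "finance-db", "principal": "finance-team",
                    "permission": "READ"}).raise_for_status()
```

Inside a notebook, members of finance-team could then read the credential with dbutils.secrets.get("finance-db", "jdbc-password"), while users outside the group would receive a permission error.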
A data engineer is configuring a Databricks Asset Bundle to deploy a job with granular permissions.
The requirements are:
• Grant the data-engineers group CAN_MANAGE access to the job.
• Ensure the auditors group can view the job but not modify or run it.
• Avoid granting unintended permissions to other users/groups.
How should the data engineer deploy the job while meeting the requirements?
- A . resources:
        jobs:
          my-job:
            name: data-pipeline
            tasks: […]
            job_clusters: […]
            permissions:
              - group_name: data-engineers
                level: CAN_MANAGE
              - group_name: auditors
                level: CAN_VIEW
              - group_name: admin-team
                level: IS_OWNER
- B . resources:
        jobs:
          my-job:
            name: data-pipeline
            tasks: […]
            job: […]
            permissions:
              - group_name: data-engineers
                level: CAN_MANAGE
            permissions:
              - group_name: auditors
                level: CAN_VIEW
- C . permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
        - group_name: auditors
          level: CAN_VIEW
      resources:
        jobs:
          my-job:
            name: data-pipeline
            tasks: […]
            job_clusters: […]
- D . resources:
        jobs:
          my-job:
            name: data-pipeline
            tasks: […]
            job_clusters: […]
            permissions:
              - group_name: data-engineers
                level: CAN_MANAGE
              - group_name: auditors
                level: CAN_VIEW
D
Explanation:
Databricks Asset Bundles (DABs) allow jobs, clusters, and permissions to be defined as code in YAML configuration files. According to the Databricks documentation on job permissions and bundle deployment, when defining permissions within a job resource, they must be scoped directly under that specific job’s definition. This ensures that permissions are applied only to the intended job resource and not inadvertently propagated to other jobs or resources.
In this scenario, the data engineer must grant the data-engineers group CAN_MANAGE access, allowing them to configure, edit, and manage the job, while the auditors group should only have CAN_VIEW, giving them read-only access to see configurations and results without the ability to modify or execute. Importantly, no additional groups should be granted permissions, in order to follow the principle of least privilege.
Options A and B are incorrect: A grants an additional, unintended group (admin-team) ownership of the job, and B splits the grants across duplicate permissions keys, which is not valid YAML. Option C places the permissions block at the top level of the bundle rather than under the job resource, so the grants are not scoped to this specific job, which is not aligned with Databricks bundle best practices.
Option D is the correct approach because it defines the job resource my-job with its name, tasks, clusters, and the exact intended permissions (CAN_MANAGE for data-engineers and CAN_VIEW for auditors). This aligns with Databricks’ principle of least privilege and ensures compliance with governance standards in Unity Catalog-enabled workspaces.
Reference: Databricks Asset Bundles documentation ― Managing Jobs and Permissions
