Practice Free Databricks Certified Data Analyst Associate Exam Online Questions
An analyst writes a query that contains a query parameter. They then add an area chart visualization to the query. While adding the area chart visualization to a dashboard, the analyst chooses "Dashboard Parameter" for the query parameter associated with the area chart.
Which of the following statements is true?
- A . The area chart will use whatever is selected in the Dashboard Parameter while all of the other visualizations will remain unchanged regardless of their parameter use.
- B . The area chart will use whatever is selected in the Dashboard Parameter along with all of the other visualizations in the dashboard that use the same parameter.
- C . The area chart will use whatever value is chosen on the dashboard at the time the area chart is added to the dashboard.
- D . The area chart will use whatever value is input by the analyst when the visualization is added to the dashboard. The parameter cannot be changed by the user afterwards.
- E . The area chart will convert to a Dashboard Parameter.
B
Explanation:
A Dashboard Parameter is a parameter that is configured for one or more visualizations within a dashboard and appears at the top of the dashboard. The parameter values specified for a Dashboard Parameter apply to all visualizations reusing that particular Dashboard Parameter. Therefore, if the analyst chooses “Dashboard Parameter” for the query parameter associated with the area chart, the area chart will use whatever is selected in the Dashboard Parameter along with all of the other visualizations in the dashboard that use the same parameter. This allows the user to filter the data across multiple visualizations using a single parameter widget.
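For context, a query parameter is referenced in Databricks SQL with double curly braces; a minimal sketch, with illustrative table, column, and parameter names:
SELECT order_date, SUM(amount) AS total_sales
FROM sales
WHERE region = {{ region }}  -- query parameter; driven by the dashboard-level widget when mapped to a Dashboard Parameter
GROUP BY order_date
When the query's visualization is placed on a dashboard and this parameter is mapped to a Dashboard Parameter, a single widget at the top of the dashboard drives every visualization that shares the parameter.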
Reference: Databricks SQL dashboards, Query parameters
Which of the following approaches can be used to ingest data directly from cloud-based object storage?
- A . Create an external table while specifying the DBFS storage path to FROM
- B . Create an external table while specifying the DBFS storage path to PATH
- C . It is not possible to directly ingest data from cloud-based object storage
- D . Create an external table while specifying the object storage path to FROM
- E . Create an external table while specifying the object storage path to LOCATION
E
Explanation:
External tables are tables that are defined in the Databricks metastore but whose data is stored in a cloud object storage location. External tables do not manage the data; they provide a schema and a table name for querying it. To create an external table, you can use the CREATE EXTERNAL TABLE statement and specify the object storage path in the LOCATION clause. For example, to create an external table named ext_table on a Parquet file stored in S3, you can use the following statement:
CREATE EXTERNAL TABLE ext_table (
col1 INT,
col2 STRING
)
STORED AS PARQUET
LOCATION 's3://bucket/path/file.parquet'
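Databricks SQL also supports the equivalent USING syntax for external tables; a minimal sketch, with placeholder column names and bucket path:
CREATE TABLE ext_table (
col1 INT,
col2 STRING
)
USING PARQUET
LOCATION 's3://bucket/path/';
Because a LOCATION is specified, the table is registered as external: dropping it removes only the metadata, not the underlying files.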
Reference: External tables
The stakeholders.customers table has 15 columns and 3,000 rows of data.
The following command is run:
After running SELECT * FROM stakeholders.eur_customers, 15 rows are returned. After the command executes completely, the user logs out of Databricks.
After logging back in two days later, what is the status of the stakeholders.eur_customers view?
- A . The view remains available and SELECT * FROM stakeholders.eur_customers will execute correctly.
- B . The view has been dropped.
- C . The view is not available in the metastore, but the underlying data can be accessed with SELECT * FROM delta.`stakeholders.eur_customers`.
- D . The view remains available but attempting to SELECT from it results in an empty result set because data in views are automatically deleted after logging out.
- E . The view has been converted into a table.
A
Explanation:
In Databricks, a view is a saved SQL query definition that references existing tables or other views.
Once created, a view remains persisted in the metastore (such as Unity Catalog or Hive Metastore)
until it is explicitly dropped.
Key points:
- Views do not store data themselves but reference data from underlying tables.
- Logging out or being inactive does not delete or alter views.
- Unless a user or admin explicitly drops the view or the underlying data/table is deleted, the view continues to function as expected.
Therefore, after logging back in, even days later, a user can still run SELECT * FROM stakeholders.eur_customers, and it will return the same data (provided the underlying table hasn’t changed).
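The command shown in the question is an image and is not reproduced above; a plausible sketch of the pattern being tested, with an assumed filter column and value:
CREATE VIEW stakeholders.eur_customers AS
SELECT *
FROM stakeholders.customers
WHERE region = 'EUR';  -- filter is an assumption; the actual command is in the screenshot
-- Two days later, in a new session, the view definition still resolves from the metastore:
SELECT * FROM stakeholders.eur_customers;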
Reference: Views – Databricks Documentation
A data analyst is processing a complex aggregation on a table with zero null values and their query returns the following result:
Which of the following queries did the analyst run to obtain the above result?
A)
B)
C)
D)
E)
- A . Option A
- B . Option B
- C . Option C
- D . Option D
- E . Option E
B
Explanation:
The result set provided shows a combination of grouping by two columns (group_1 and group_2) with subtotals for each level of grouping and a grand total. This pattern is typical of a GROUP BY … WITH ROLLUP operation in SQL, which provides subtotal rows and a grand total row in the result set.
Considering the query options:
A) Option A: GROUP BY group_1, group_2 INCLUDING NULL – This is not a standard SQL clause and would not result in subtotals and a grand total.
B) Option B: GROUP BY group_1, group_2 WITH ROLLUP – This would create subtotals for each unique group_1, each combination of group_1 and group_2, and a grand total, which matches the result set provided.
C) Option C: GROUP BY group_1, group_2 – This is a simple GROUP BY and would not include subtotals or a grand total.
D) Option D: GROUP BY group_1, group_2, (group_1, group_2) – This syntax is not standard and would likely result in an error or be interpreted as a simple GROUP BY, not providing the subtotals and grand total.
E) Option E: GROUP BY group_1, group_2 WITH CUBE – The WITH CUBE operation produces subtotals for all combinations of the selected columns and a grand total, which is more than what is shown in
the result set.
The correct answer is Option B, which uses WITH ROLLUP to generate the subtotals for each level of grouping as well as a grand total. This matches the result set where we have subtotals for each group_1, each combination of group_1 and group_2, and the grand total where both group_1 and group_2 are NULL.
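A minimal sketch of the ROLLUP pattern described above, with placeholder table and measure names:
SELECT group_1, group_2, SUM(sales) AS total_sales
FROM example_table
GROUP BY group_1, group_2 WITH ROLLUP;
Databricks SQL also accepts the equivalent GROUP BY ROLLUP (group_1, group_2) form; both add a subtotal row for each group_1 value and a grand-total row in which both grouping columns are NULL.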
A stakeholder has provided a data analyst with a lookup dataset in the form of a 50-row CSV file. The data analyst needs to upload this dataset for use as a table in Databricks SQL.
Which approach should the data analyst use to quickly upload the file into a table for use in Databricks SQL?
- A . Create a table by uploading the file using the Create page within Databricks SQL
- B . Create a table via a connection between Databricks and the desktop facilitated by Partner Connect.
- C . Create a table by uploading the file to cloud storage and then importing the data to Databricks.
- D . Create a table by manually copying and pasting the data values into cloud storage and then importing the data to Databricks.
A
Explanation:
Databricks provides a user-friendly interface that allows data analysts to quickly upload small datasets, such as a 50-row CSV file, and create tables within Databricks SQL. The steps are as follows:
Access the Data Upload Interface:
In the Databricks workspace, navigate to the sidebar and click on New > Add or upload data.
Select Create or modify a table.
Upload the CSV File:
Click the browse button or drag and drop the CSV file directly onto the designated area. The interface supports uploading up to 10 files simultaneously, with a total size limit of 2 GB.
Configure Table Settings:
After uploading, a preview of the data is displayed. Specify the table name, select the appropriate schema, and configure any additional settings as needed.
Create the Table:
Once all configurations are set, click the Create Table button to finalize the process.
This method is efficient for quickly importing small datasets without the need for additional tools or complex configurations. Options B, C, and D involve more complex or manual processes that are unnecessary for this task.
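Once created, the uploaded lookup table can be queried like any other table in Databricks SQL; a quick sanity check, with a placeholder catalog, schema, and table name:
SELECT COUNT(*) AS row_count
FROM main.default.lookup_table;  -- should return 50 for the 50-row CSV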
Reference: Create or modify a table using file upload
Which of the following layers of the medallion architecture is most commonly used by data analysts?
- A . None of these layers are used by data analysts
- B . Gold
- C . All of these layers are used equally by data analysts
- D . Silver
- E . Bronze
B
Explanation:
The gold layer of the medallion architecture contains data that is highly refined and aggregated, and powers analytics, machine learning, and production applications. Data analysts typically use the gold layer to access data that has been transformed into knowledge, rather than just information. The gold layer represents the final stage of data quality and optimization in the lakehouse.
Reference: What is the medallion lakehouse architecture?
A data analyst has been asked to use the below table sales_table to get the percentage rank of products within region by the sales:
The result of the query should look like this:
Which of the following queries will accomplish this task?
A)
B)
C)
D)
- A . Option A
- B . Option B
- C . Option C
- D . Option D
B
Explanation:
The correct query to get the percentage rank of products within region by the sales is option B. This query uses the PERCENT_RANK() window function to calculate the relative rank of each product within each region based on the sales amount. The window function is partitioned by region and ordered by sales in descending order. The result is aliased as rank and displayed along with the region and product columns.
The other options are incorrect because:
A) Option A uses the RANK() window function instead of the PERCENT_RANK() function. The RANK() function returns the rank of each row within the partition, but not the percentage rank. Also, the query does not have a GROUP BY clause, which is required for aggregate functions like SUM().
C) Option C uses the DENSE_RANK() window function instead of the PERCENT_RANK() function. The DENSE_RANK() function returns the rank of each row within the partition, but not the percentage rank. Also, the query does not have a GROUP BY clause, which is required for aggregate functions like SUM().
D) Option D uses the ROW_NUMBER() window function instead of the PERCENT_RANK() function. The ROW_NUMBER() function returns the sequential number of each row within the partition, but not the percentage rank. Also, the query does not have a GROUP BY clause, which is required for aggregate functions like SUM().
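A sketch of the Option B pattern using the column names from the question (the exact option text appears only in the screenshot, so treat this as illustrative):
SELECT
region,
product,
sales,
PERCENT_RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS rank
FROM sales_table;
PERCENT_RANK() returns a value between 0 and 1 that expresses each product's relative position within its region, which matches the expected result.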
Reference:
1: PERCENT_RANK (Transact-SQL)
2: Window functions in Databricks SQL
3: Databricks Certified Data Analyst Associate Exam Guide
A data analyst has created a user-defined function using the following line of code:
CREATE FUNCTION price(spend DOUBLE, units DOUBLE)
RETURNS DOUBLE
RETURN spend / units;
Which of the following code blocks can be used to apply this function to the customer_spend and customer_units columns of the table customer_summary to create column customer_price?
- A . SELECT PRICE customer_spend, customer_units AS customer_price FROM customer_summary
- B . SELECT price FROM customer_summary
- C . SELECT function(price(customer_spend, customer_units)) AS customer_price FROM customer_summary
- D . SELECT double(price(customer_spend, customer_units)) AS customer_price FROM customer_summary
- E . SELECT price(customer_spend, customer_units) AS customer_price FROM customer_summary
E
Explanation:
A user-defined function (UDF) is a function defined by a user, allowing custom logic to be reused in the user environment. To apply a UDF to a table, the syntax is SELECT udf_name(column_name) AS alias FROM table_name. Therefore, option E is the correct way to use the UDF price to create a new column customer_price based on the existing columns customer_spend and customer_units from the table customer_summary.
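Putting the function definition from the question together with the Option E call, an end-to-end sketch:
-- Define the scalar UDF:
CREATE FUNCTION price(spend DOUBLE, units DOUBLE)
RETURNS DOUBLE
RETURN spend / units;
-- Apply it column-wise to the table:
SELECT price(customer_spend, customer_units) AS customer_price
FROM customer_summary;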
Reference: What are user-defined functions (UDFs)?
User-defined scalar functions – SQL