Practice Free NCP-AII Exam Online Questions
A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard).
What is the correct sequence of steps to perform a safe and successful firmware upgrade?
- A . Perform a cold reset, stop all GPU activity, update and reboot the BMC, update motherboard and tray components, and verify completion.
- B . Update the BMC and skip the GPU tray and motherboard tray updates if the system appears healthy.
- C . Update the GPU tray first, then the motherboard tray, and reboot the BMC after all updates are complete.
- D . Stop all GPU activity, update and reboot the BMC, update motherboard and tray components, perform a cold reset, and verify completion.
After running a 24-hour stress test on a DGX node, the administrator should verify which two key metrics to ensure system stability?
- A . SSD write endurance and RAM capacity.
- B . Total energy consumption and NVLink bandwidth.
- C . Average CPU usage >80% and Docker container uptime.
- D . No thermal throttling events and consistent GPU utilization ≥95% throughout the test.
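The stability criteria named in option D (no thermal throttling, sustained utilization of at least 95%) can be checked mechanically once samples have been logged. The following is a minimal sketch, assuming the samples were already exported to CSV text; the column names `utilization_pct` and `thermal_throttle`, the sample rows, and the helper `stress_test_stable` are hypothetical and do not come from any NVIDIA tool.

```python
import csv
import io


def stress_test_stable(csv_text, min_util=95.0):
    """Return True only if every sample is unthrottled and at or above min_util."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = list(reader)
    if not rows:
        # No samples means stability cannot be confirmed.
        return False
    for row in rows:
        util = float(row["utilization_pct"])
        throttled = row["thermal_throttle"].strip().lower() == "active"
        if throttled or util < min_util:
            return False
    return True


# Hypothetical samples: one clean run and one run with a throttling event.
SAMPLE_OK = """utilization_pct, thermal_throttle
99, Not Active
97, Not Active
"""

SAMPLE_BAD = """utilization_pct, thermal_throttle
99, Not Active
82, Active
"""

if __name__ == "__main__":
    print(stress_test_stable(SAMPLE_OK))
    print(stress_test_stable(SAMPLE_BAD))
```

Treating an empty log as a failure is a deliberate choice here: a stress test with no recorded samples gives no evidence of stability.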
What information does the "ibnodes" command display?
- A . All host & server names
- B . All hosts & switches
- C . All server names
- D . All channel adapters
You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers.
Which strategy should you prioritize to develop more accurate energy-efficiency metrics?
- A . Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.
- B . Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.
- C . Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.
- D . Use watts-used as the primary measure of efficiency, as it accurately reflects the power input at any given time.
You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager (BCM) cluster.
Which two of the following actions are essential for a successful OS installation on the cluster's head node? (Pick the two correct responses.)
- A . Set the desired time zone and configure NTP synchronization during the OS installation wizard.
- B . Configure network switches for PXE boot to all compute nodes before installing the OS on the head node.
- C . Start the head node OS installation process with the system BIOS set to legacy boot mode instead of UEFI.
- D . Download the latest BCM ISO and verify its integrity using the provided checksum, then start the installation.
After configuring NGC CLI with ngc config set, a user receives "Authentication failed" errors when pulling containers.
What step was most likely omitted?
- A . Running sudo systemctl restart docker after configuration.
- B . Entering the API key during ngc config set or storing it in ~/.ngc/config.
- C . Installing the CLI with apt-get instead of manual extraction.
- D . Setting --format_type=json to enable API interactions.
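A missing API key is easy to detect before attempting a pull. The sketch below assumes the NGC CLI config is an INI-style file with a `[CURRENT]` section holding an `apikey` entry; that layout, the sample contents, and the helper `has_api_key` are assumptions for illustration and should be checked against the actual `~/.ngc/config` on the system.

```python
import configparser


def has_api_key(config_text, section="CURRENT"):
    """Return True if the given INI text has a non-empty apikey in the section."""
    # Interpolation is disabled so that '%' characters in a key are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section(section):
        return False
    return bool(parser.get(section, "apikey", fallback="").strip())


# Hypothetical config contents; the key value is a placeholder, not a real key.
WITH_KEY = """[CURRENT]
apikey = example-key-not-real
format_type = ascii
"""

WITHOUT_KEY = """[CURRENT]
format_type = json
"""

if __name__ == "__main__":
    print(has_api_key(WITH_KEY))
    print(has_api_key(WITHOUT_KEY))
```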
A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler.
What command achieves this?
- A . esxcli system module parameters set -m nvidia -p
- B . nvidia-smi -i 0 -mig 1
- C . mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2
- D . esxcli -i 0 -mig 18
You are tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules.
What is the most important step to successfully upgrade the drivers without causing conflicts?
- A . Uninstall all existing GPU and DOCA-related drivers and associated kernel modules before the new install.
- B . Validate the driver version post-install since the fresh install will overwrite the legacy drivers.
- C . Keep the older driver running alongside the new version in case you need to roll back the upgrade.
- D . Update the GPU driver leaving the DOCA and OFED drivers unchanged as long as they are detecting the hardware properly.
During a multi-day NeMo burn-in, intermittent "GPU fell off bus" errors occur.
Which diagnostic approach isolates hardware faults?
- A . Run DCGM diagnostics alongside burn-in to monitor GPU health metrics
- B . Enable HPL_USE_NVSHMEM for alternative memory sharing
- C . Reduce blocksize to 500MB to lower memory pressure
- D . Switch from BERT to GPT models for simpler computations
After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?
- A . Increase NCCL_IB_TIMEOUT to tolerate longer latencies
- B . Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables
- C . Reduce message size to decrease network utilization
- D . Switch from Ring to Tree algorithms via NCCL_ALGO=tree
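Link quality inspection of the kind described in option B usually starts from port error counters. The following is a minimal sketch that scans counter text for nonzero error counts; the `Name:....value` layout and the sample values are assumptions modeled on typical InfiniBand port counter dumps, and the helper `suspect_counters` is hypothetical.

```python
import re

# Counters commonly associated with a marginal cable or connector.
ERROR_COUNTERS = (
    "SymbolErrorCounter",
    "LinkErrorRecoveryCounter",
    "LinkDownedCounter",
    "PortRcvErrors",
)


def suspect_counters(counter_text, threshold=0):
    """Return {counter_name: value} for error counters above the threshold."""
    found = {}
    for line in counter_text.splitlines():
        # Assumed line layout: a name, a colon, filler dots, then an integer.
        match = re.match(r"^(\w+):\.*(\d+)$", line.strip())
        if match and match.group(1) in ERROR_COUNTERS:
            value = int(match.group(2))
            if value > threshold:
                found[match.group(1)] = value
    return found


# Hypothetical counter dump for one port.
SAMPLE = """PortSelect:......................1
SymbolErrorCounter:..............1532
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................7
"""

if __name__ == "__main__":
    print(suspect_counters(SAMPLE))
```

A port that keeps accumulating symbol errors between two readings is a stronger signal than a single nonzero value, so in practice the counters would be sampled twice and compared.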
