Practice Free NCP-AII Exam Online Questions
Refer to the output:
~ $ sudo nvsm show healthinfo
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
BIOS Revision [5.11]..................................
DGX Serial Number [YSY72800016].......................
Checking output of 'lspci' for expected GPUs
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers...............Healthy
Verify PCIe switches..................................Healthy
[output truncated]
What insights can a system administrator gain regarding the DGX system’s health?
- A . A GPU tray upgrade failed.
- B . A GPU is missing on the DGX system.
- C . A GPU driver upgrade has failed.
- D . The system has passed the hardware health check successfully.
B
Explanation:
The output provided is a result of the NVIDIA System Management (NVSM) tool, specifically the nvsm show healthinfo command. NVSM is an essential diagnostic framework for NVIDIA DGX systems that monitors hardware health, identifies faults, and helps ensure the system remains within its validated operational state.
In this specific diagnostic trace, the system reports that the GPU enumeration check has returned an unhealthy result. To provide a root cause, NVSM cross-references the live hardware enumeration from the lspci command against the system's known "Golden Configuration" (the hardware manifest defined in the firmware). The explicit error message, "Missing GPU at PCI address '07:00.0'", indicates that the system expects a GPU module to be present at that specific PCIe bus address, but the hardware is not responding or visible on the bus.
This insight allows a system administrator to conclude that a GPU is missing from the logical perspective of the system. This is a critical hardware fault rather than a software or driver issue. In a DGX H100 or A100 system, this could be caused by a physical module failure, a power delivery issue to that specific segment of the GPU baseboard, or a failure in the PCIe switch fabric. Because the DGX relies on a full set of 8 GPUs for high-speed collective communications (NCCL), a single missing GPU will prevent the node from participating in large-scale training jobs, requiring physical inspection or a GPU tray replacement (RMA).
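The cross-referencing step NVSM performs can be illustrated with a minimal sketch: compare the NVIDIA devices that actually appear in `lspci` output against an expected address list. The sample output and the expected-address list below are illustrative, not taken from a real DGX manifest.

```python
import re

def find_missing_gpus(lspci_output: str, expected_addresses: list[str]) -> list[str]:
    """Return the expected PCI addresses with no NVIDIA device in the lspci output."""
    present = set()
    for line in lspci_output.splitlines():
        # lspci lines start with the bus:device.function address, e.g. "0f:00.0 ..."
        m = re.match(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s", line)
        if m and "NVIDIA" in line:
            present.add(m.group(1))
    return [addr for addr in expected_addresses if addr not in present]

# Synthetic enumeration: the GPU expected at 07:00.0 never shows up on the bus.
sample = "0f:00.0 3D controller: NVIDIA Corporation Device 20b2"
print(find_missing_gpus(sample, ["07:00.0", "0f:00.0"]))  # ['07:00.0']
```

A real check would run `lspci` via subprocess and load the expected addresses from the platform's hardware manifest; the comparison logic is the same.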
Consider this scenario. You have a containerized AI application that requires specific CUDA libraries. You want to manage the deployment and scaling of this application across multiple Kubernetes clusters, some of which might have different versions of NVIDIA drivers installed.
How would you handle the CUDA dependency management in this multi-cluster environment to ensure compatibility and reproducibility?
- A . Package the CUDA libraries directly into the application container image, ensuring that each cluster uses the same version.
- B . Use a multi-stage Docker build to dynamically download and install the appropriate CUDA libraries based on the driver version detected on each cluster.
- C . Leverage an init container to install the correct CUDA libraries before the application container starts, based on the detected driver version.
- D . Build a different container image for each Kubernetes cluster, specifically tailored to the driver version installed on that cluster.
- E . Use the ‘latest’ tag when pulling your container image to ensure the most up-to-date version is used.
A
Explanation:
Packaging the CUDA libraries into the container image ensures consistent behavior across all clusters, regardless of the driver version installed on each cluster’s nodes. It is the most portable solution.
Options B and C introduce runtime dependencies and increase complexity.
Option D becomes unwieldy to manage.
Option E is highly discouraged as the image can be updated without your knowledge.
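Even with the CUDA libraries baked into the image, the bundled CUDA runtime still has a minimum host driver requirement, so a pre-flight check on each cluster is useful. A minimal sketch, assuming the minimum-driver table below (re-check the exact values against NVIDIA's CUDA release notes):

```python
import subprocess

# Minimum Linux driver version for each CUDA toolkit line; values taken from
# NVIDIA's CUDA compatibility tables but treat them as assumptions to verify.
MIN_DRIVER_FOR_CUDA = {
    "11.8": (450, 80),   # CUDA 11.x: >= 450.80.02
    "12.2": (525, 60),   # CUDA 12.2: >= 525.60.13
}

def driver_supports_cuda(driver_version: str, cuda_version: str) -> bool:
    """True if the host driver meets the minimum for the bundled CUDA runtime."""
    parts = tuple(int(x) for x in driver_version.split(".")[:2])
    return parts >= MIN_DRIVER_FOR_CUDA[cuda_version]

def host_driver_version() -> str:
    """Query the node's driver; requires nvidia-smi on the host."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()[0]
```

For example, `driver_supports_cuda("535.104", "12.2")` returns True, while a 470-series driver would fail the check and flag that cluster before the workload is scheduled there.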
You have a dataset with many small files (e.g., images). Directly reading these files can result in high metadata access overhead.
What are the MOST effective strategies to mitigate this problem?
- A . Combine small files into larger archive files (e.g., using TAR or HDF5)
- B . Store the files on a file system with very small block sizes
- C . Use a key-value store (e.g., LevelDB, RocksDB) to store the file data and metadata
- D . Increase the number of metadata servers in the parallel file system
- E . Replicate the dataset to increase read availability
A, D
Explanation:
Combining small files into larger archives (A) reduces the number of per-file metadata operations, and adding metadata servers (D) distributes the remaining metadata load across the parallel file system. A key-value store (C) can serve small items efficiently, but it requires restructuring the data pipeline rather than directly reducing file-system metadata overhead. Smaller block sizes (B) exacerbate the problem, and replicating the dataset (E) improves read availability without addressing metadata overhead at all.
A systems engineer is updating firmware across a large DGX cluster using automation.
What is the best practice for minimizing risk and ensuring cluster health during and after the process?
- A . Drain nodes from the scheduler, run pre-update diagnostics, update firmware in batches, and verify health post-update before scaling to the next batch.
- B . To save time, simultaneously update all nodes in the cluster without draining or diagnostics.
- C . Update nodes that have reported faults, leaving others on older firmware.
- D . Drain nodes from the scheduler, update firmware in batches without pre-update diagnostics, and verify health post-update before scaling to the next batch.
A
Explanation:
Updating firmware on an NVIDIA DGX cluster is a critical operation that involves multiple sensitive components, including the GPU baseboard, the BMC, the motherboard tray (SBC), and the InfiniBand HCAs. In a production environment, "batching" is the industry standard to prevent a single corrupted firmware image or update failure from taking down the entire AI factory. The process must begin with "draining" the nodes in the workload scheduler (such as Slurm or Kubernetes) to ensure no active training jobs are interrupted. Running pre-update diagnostics (using tools like nvsm show health or dcgmi diag) is vital to establish a baseline and ensure the hardware is stable before applying changes. Once the firmware is applied in a controlled batch, post-update verification is required to confirm the system returns to a "Healthy" state and that all versions match the target manifest. This "rolling update" strategy allows the engineer to pause the automation if a specific node fails to return to service, protecting the overall availability of the cluster. Skipping diagnostics (Option D) or leaving nodes on mismatched versions (Option C) creates "configuration drift," which leads to unpredictable performance in collective communication libraries.
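The drain / diagnose / update / verify loop described above can be sketched as a batched driver. The four operations are passed in as callables because they are site-specific (wrapping commands such as scontrol, nvsm, or dcgmi); the node names and batch size here are illustrative.

```python
def batches(nodes: list[str], size: int) -> list[list[str]]:
    """Split the node list into update batches of at most `size` nodes."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def rolling_update(nodes, size, drain, diagnose, update, verify):
    """Apply firmware per batch, stopping before the next batch on any failure."""
    for batch in batches(nodes, size):
        for node in batch:
            drain(node)                  # remove from scheduler first
            if not diagnose(node):       # pre-update baseline must be healthy
                raise RuntimeError(f"{node} failed pre-update diagnostics")
            update(node)
        for node in batch:
            if not verify(node):         # gate before touching the next batch
                raise RuntimeError(f"{node} unhealthy after update")

print(batches([f"dgx{i:02d}" for i in range(5)], 2))
# [['dgx00', 'dgx01'], ['dgx02', 'dgx03'], ['dgx04']]
```

Raising on the first unhealthy node is the "pause the automation" behavior: the remaining batches stay on known-good firmware until an engineer intervenes.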
An AI server exhibits frequent kernel panics under heavy GPU load. ‘dmesg’ reveals the following error: ‘NVRM: Xid (PCI:0000:3B:00): 79, pid=…, name=…, GPU has fallen off the bus.’
Which of the following is the least likely cause of this issue?
- A . Insufficient power supply to the GPU, causing it to become unstable under load.
- B . A loose or damaged PCIe riser cable connecting the GPU to the motherboard.
- C . A driver bug in the NVIDIA drivers, leading to GPU instability.
- D . Overclocking the GPU beyond its stable limits.
- E . A faulty CPU.
E
Explanation:
The error message GPU has fallen off the bus strongly suggests a hardware-related issue with the GPU’s connection to the motherboard or its power supply. Insufficient power, a loose riser cable, driver bugs, and overclocking can all lead to this. A faulty CPU, while capable of causing system instability, is less directly related to the GPU falling off the bus and is therefore the least likely cause in this specific scenario.
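When triaging such faults across a fleet, it helps to extract the Xid code and PCI address from dmesg programmatically. A minimal parser sketch (the sample line below is synthetic, modeled on the message in the question):

```python
import re

# NVRM Xid lines look like: "NVRM: Xid (PCI:0000:3B:00): 79, pid=..., ...".
# Xid 79 is the "GPU has fallen off the bus" event class.
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")

def parse_xid(line: str):
    """Return (pci_address, xid_code) from a dmesg line, or None if absent."""
    m = XID_RE.search(line)
    return (m.group(1), int(m.group(2))) if m else None

line = "NVRM: Xid (PCI:0000:3B:00): 79, pid=1234, name=python, GPU has fallen off the bus."
print(parse_xid(line))  # ('0000:3B:00', 79)
```

Mapping codes to fault classes (79 = bus drop, hardware-side; other codes point at software) lets automation route the node to RMA inspection versus a driver reload.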
You are building a cloud-native application that uses both CPU and GPU resources. You want to optimize resource utilization and cost by scheduling CPU-intensive tasks on nodes without GPUs and GPU-intensive tasks on nodes with GPUs.
How would you achieve this node selection and workload placement in Kubernetes?
- A . Use node affinity rules to schedule CPU-intensive tasks on nodes with GPUs and GPU-intensive tasks on nodes without GPUs.
- B . Use node affinity rules to schedule CPU-intensive tasks on nodes without GPUs and GPU-intensive tasks on nodes with GPUs.
- C . Use taints and tolerations to dedicate nodes without GPUs to CPU-intensive tasks and nodes with GPUs to GPU-intensive tasks.
- D . Use resource quotas to limit the CPU resources available on nodes with GPUs and the GPU resources available on nodes without GPUs.
- E . Use labels to identify the CPU and GPU-intensive nodes.
B, C
Explanation:
Node affinity rules allow you to specify constraints on which nodes a pod can be scheduled on. By using node affinity rules, you can ensure that CPU-intensive tasks are scheduled on nodes without GPUs and GPU-intensive tasks are scheduled on nodes with GPUs. This optimizes resource utilization and cost. Taints and tolerations can be used, but affinity is more flexible. Resource quotas limit resource usage but do not control placement.
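A node-affinity stanza for a CPU-only workload can be sketched as a plain dict mirroring the pod spec. The label key `nvidia.com/gpu.present` is the one applied by NVIDIA's GPU Feature Discovery, but treat it as an assumption to verify against the labels actually present in your clusters.

```python
def cpu_only_affinity() -> dict:
    """Affinity stanza keeping a CPU-intensive pod off GPU-labeled nodes."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        # Assumed label set by NVIDIA GPU Feature Discovery.
                        "key": "nvidia.com/gpu.present",
                        "operator": "DoesNotExist",
                    }]
                }]
            }
        }
    }
```

A GPU-intensive pod, by contrast, typically just requests `nvidia.com/gpu` in its resource limits, which the scheduler can only satisfy on nodes exposing that device-plugin resource; combining that with a taint such as `nvidia.com/gpu=present:NoSchedule` on GPU nodes keeps CPU pods from landing there by default.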
You need to verify the integrity of the BlueField OS image before flashing it to the SmartNIC.
Which method would provide the strongest guarantee that the image has not been tampered with?
- A . Comparing the image size with the expected size from the download source.
- B . Checking the MD5 checksum of the image against the published MD5 checksum.
- C . Verifying the SHA256 checksum of the image against the published SHA256 checksum.
- D . Downloading the image from a trusted source.
- E . Utilizing gpg to verify the image.
C, E
Explanation:
SHA256 provides a stronger collision resistance than MD5, making it a more reliable method for verifying image integrity. Comparing the image size is not a reliable method, as a small change can be made without significantly altering the size. Downloading from a trusted source is important, but doesn’t guarantee the image hasn’t been tampered with after it was published. GPG provides cryptographic assurance that the file originated from and was signed by an identified party.
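The checksum side of this verification is a few lines with the standard library's hashlib; the image bytes below are a stand-in for the real .bfb file.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the image bytes."""
    return hashlib.sha256(data).hexdigest()

def verify_image(data: bytes, published_sha256: str) -> bool:
    """Compare against the vendor-published digest (the digest itself is public)."""
    return sha256_hex(data) == published_sha256.lower()

image = b"bluefield-os-image-bytes"          # synthetic stand-in for the image
digest = sha256_hex(image)
print(verify_image(image, digest))           # True for an intact image
print(verify_image(image + b"x", digest))    # False once a single byte changes
```

For option E, a detached-signature check along the lines of `gpg --verify image.sig image` additionally proves who published the file, which a bare checksum cannot (an attacker who can tamper with the image may also be able to tamper with a checksum hosted alongside it).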
You’re deploying BlueField OS to multiple SmartNICs with varying hardware revisions.
How can you ensure that the correct device tree is loaded for each specific SmartNIC during the boot process?
- A . Create a single device tree that is compatible with all SmartNIC revisions. The kernel will automatically handle compatibility.
- B . Use a bootloader (e.g., U-Boot) that can detect the hardware revision and load the appropriate device tree based on a predefined mapping.
- C . Manually specify the device tree file in the bootloader configuration for each SmartNIC.
- D . Embed the device tree within the kernel image. The kernel will automatically select the correct one during boot.
- E . Relocate the device tree after the OS is running, but before services are started using a custom script.
B
Explanation:
A bootloader like U-Boot is designed to handle hardware detection and conditional loading of resources like device trees. It can identify the SmartNIC revision and load the corresponding DTB file. Creating a single compatible DTB is difficult and may not fully utilize hardware capabilities. Manually specifying the DTB for each NIC is not scalable. Embedding the DTB in the kernel is uncommon. Attempting to modify the device tree at runtime could lead to instability.
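The bootloader's predefined mapping amounts to a revision-to-DTB lookup. A sketch in Python of that selection logic (the revision strings and .dtb file names are hypothetical; a real U-Boot script would read the revision from board straps or an EEPROM):

```python
# Hypothetical mapping from detected board revision to device tree blob.
DTB_BY_REVISION = {
    "A0": "bluefield-rev-a0.dtb",
    "B1": "bluefield-rev-b1.dtb",
}

def select_dtb(board_revision: str) -> str:
    """Pick the DTB for a detected revision; fail loudly rather than guess."""
    try:
        return DTB_BY_REVISION[board_revision]
    except KeyError:
        raise ValueError(f"no device tree mapped for revision {board_revision!r}")

print(select_dtb("B1"))  # bluefield-rev-b1.dtb
```

Failing on an unknown revision is deliberate: booting with a DTB for the wrong hardware revision can leave peripherals misconfigured, which is worse than halting in the bootloader.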
You are evaluating different parallel file systems for an AI training cluster. You need a file system that supports POSIX compliance and offers high bandwidth and low latency.
Which of the following options are viable candidates?
- A . BeeGFS
- B . GlusterFS
- C . Ceph
- D . Lustre
- E . NFS
A, D
Explanation:
BeeGFS and Lustre are designed for high-performance computing and AI workloads, offering high bandwidth, low latency, and POSIX compliance. GlusterFS and Ceph are more general-purpose distributed file systems. NFS is generally not suitable for demanding AI workloads due to its performance limitations.
You have installed an NVIDIA ConnectX-7 network adapter in an AI server and configured RDMA over Converged Ethernet (RoCE). During validation, you observe very high latency between two servers communicating over RoCE.
Which of the following are potential causes? (Choose two)
- A . Incorrect MTU size configuration on the network interfaces.
- B . The network switch does not support RoCE.
- C . The network cables are damaged.
- D . The GPU driver is outdated.
- E . Insufficient memory on the network adapter.
A, B
Explanation:
RoCE requires specific switch support and a properly configured MTU. Damaged cables could cause packet loss, but usually not consistently high latency. GPU drivers are irrelevant. Network adapter memory is unlikely to cause high latency unless extremely undersized, a less likely scenario than incorrect configuration or lack of RoCE support.
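A quick way to catch the MTU half of this during validation is to collect the MTU from every hop (e.g. by reading `/sys/class/net/<iface>/mtu` on the hosts and querying the switch) and flag the outliers. The hop names and values below are synthetic.

```python
from collections import Counter

def mtu_mismatches(mtus: dict[str, int]) -> list[str]:
    """Return the hops whose MTU differs from the most common value on the path."""
    common, _ = Counter(mtus.values()).most_common(1)[0]
    return [hop for hop, mtu in mtus.items() if mtu != common]

# Synthetic path survey: one host was left at the default Ethernet MTU.
print(mtu_mismatches({"server-a": 4200, "switch": 4200, "server-b": 1500}))
# ['server-b']
```

A mismatched MTU along a RoCE path causes drops or fragmentation and hence retransmission stalls, which shows up as exactly the kind of consistently high latency described in the question.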
