Practice Free NCP-AII Exam Online Questions
You are setting up a multi-node AI cluster with NVIDIA GPUs and InfiniBand for inter-node communication. You need to ensure the InfiniBand network is functioning optimally for GPU-accelerated workloads.
What steps would you take to validate the InfiniBand installation and performance?
- A . Run ‘ibstat’ to check InfiniBand interface status, use ‘ping’ to test connectivity, and rely on NCCL’s internal checks during training.
- B . Run ‘ibstat’ to check InfiniBand interface status, use ‘ibping’ and ‘ibperf’ to test latency and bandwidth, and verify correct NCCL configuration (e.g., during a distributed training run).
- C . Configure a static IP address on the InfiniBand interfaces, and rely on the operating system’s network diagnostics.
- D . Use ‘nvidia-smi’ to monitor InfiniBand traffic, and rely on CUDA-aware MPI for communication validation.
- E . Verify the InfiniBand drivers are installed and then run a standard TCP benchmark between the nodes.
B
Explanation:
‘ibstat’ verifies interface status. ‘ibping’ and ‘ibperf’ are InfiniBand-specific tools for latency and bandwidth testing. Verifying NCCL configuration during a distributed training run confirms the fabric works end-to-end for GPU collectives, since NCCL (the NVIDIA Collective Communications Library) is critical for distributed training and provides valuable diagnostic information. The other options are either incomplete or rely on tools that are not InfiniBand-specific.
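As an illustrative sketch (not part of the exam answer), the first validation step can be scripted: check that the diagnostics are installed, then confirm the HCA port state from an ‘ibstat’ dump. The `ib_write_bw`/`ib_read_lat` names are an assumption that the question's ‘ibperf’ refers to the perftest suite binaries.

```python
import shutil

# InfiniBand diagnostics named in the answer; ib_write_bw / ib_read_lat
# (from the perftest suite) are assumed to be what "ibperf" refers to.
REQUIRED_TOOLS = ["ibstat", "ibping", "ib_write_bw", "ib_read_lat"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the diagnostics not found on PATH (install MLNX_OFED/rdma-core)."""
    return [t for t in tools if shutil.which(t) is None]

def port_is_healthy(ibstat_output: str) -> bool:
    """Check a captured 'ibstat' dump for an Active port with the link up."""
    return ("State: Active" in ibstat_output
            and "Physical state: LinkUp" in ibstat_output)
```

Latency and bandwidth still require running ‘ibping’ and the perftest binaries between two nodes; this sketch only covers the single-host checks.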
An AI server is exhibiting unusually high CPU utilization during a GPU-accelerated workload.
How can you determine if the CPU is becoming a bottleneck, preventing the GPUs from achieving their full potential?
- A . Monitor CPU utilization using ‘top’ or ‘htop’ while the workload is running; sustained high utilization suggests a bottleneck.
- B . Use ‘nvidia-smi’ to check the GPU utilization; low GPU utilization combined with high CPU utilization indicates a potential CPU bottleneck.
- C . Profile the application with a CPU profiler (e.g., ‘perf’) to identify CPU-bound functions.
- D . Run a CPU-intensive benchmark in parallel with the GPU workload to observe performance degradation.
- E . All of the above
E
Explanation:
All the options provide valid methods for identifying a CPU bottleneck. Monitoring CPU and GPU utilization, profiling the application, and running parallel benchmarks all help to determine if the CPU is limiting GPU performance.
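The high-CPU/low-GPU heuristic from option B can be sketched as a small helper. This is an illustrative example, not an official tool: the parser assumes the output format of `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits` (one integer per GPU per line), and the 90%/50% thresholds are arbitrary assumptions.

```python
def parse_gpu_utils(nvidia_smi_output: str):
    """Parse 'nvidia-smi --query-gpu=utilization.gpu
    --format=csv,noheader,nounits' output: one integer per GPU, per line."""
    return [int(tok) for tok in nvidia_smi_output.split() if tok]

def looks_cpu_bound(cpu_util_pct: float, gpu_utils_pct) -> bool:
    """Heuristic from option B: sustained high CPU utilization combined with
    low average GPU utilization suggests the CPU is starving the GPUs.
    Thresholds (90/50) are illustrative, not prescriptive."""
    avg_gpu = sum(gpu_utils_pct) / len(gpu_utils_pct)
    return cpu_util_pct >= 90 and avg_gpu <= 50
```

In practice this heuristic is a first pass; confirming the bottleneck still requires profiling (option C).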
You are tasked with configuring an NVIDIA A100 GPU for a multi-tenant environment using MIG.
The goal is to provide the following MIG configuration:
- A . 0/1 /2/3/4/5/6
- B . 1/1/1/1/1/1/1
- C . 7/0/0/0/0/0/0
- D . 14/0/0/0/0/0/0
- E . 6/0/0/0/0/0/0
A
Explanation:
The question asks for the configuration that provides the required mix of MIG instances (7x 1g.10gb plus a 7g.80gb profile). The answer notation lists instance counts per profile slot (num_1g/num_2g/num_3g/num_4g/num_7g/…), and option A's 0/1/2/3/4/5/6 encodes the requested combination.
You are tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules.
What is the most important step to successfully upgrade the drivers without causing conflicts?
- A . Update the GPU driver leaving the DOCA and OFED drivers unchanged as long as they are detecting the hardware properly.
- B . Validate the driver version post-install since the fresh install will overwrite the legacy drivers.
- C . Keep the older driver running alongside the new version in case you need to roll back the upgrade.
- D . Uninstall all existing GPU and DOCA-related drivers and associated kernel modules before the new install.
D
Explanation:
NVIDIA AI infrastructure relies on a tightly coupled stack involving the GPU driver and the DOCA (Data Center Infrastructure-on-a-Chip Architecture) drivers, which include the OFED stack for InfiniBand. When upgrading from an older version, residue from previous installations or custom kernel modules can cause version mismatches or symbol errors, preventing the new drivers from loading. The best practice is to perform a clean installation: completely uninstall the existing drivers (using apt-get purge or the .run file's --uninstall flag) and ensure the old kernel modules are unloaded from memory. This prevents conflicts between the legacy nvidia.ko and the new version. Once the system is clean, the new DOCA and GPU drivers can be installed as a matched set, ensuring that features like GPUDirect RDMA function correctly without being hindered by legacy configuration files.
Consider an AI server equipped with two NVIDIA A100 GPUs interconnected with NVLink. You want to maximize the memory bandwidth available to a CUDA application. You observe that the application's performance doesn't scale linearly with the number of GPUs.
Which of the following coding techniques or configurations could potentially improve inter-GPU memory access performance?
- A . Ensure all memory allocations are performed on GPU 0 to minimize data transfer.
- B . Use CUDA-aware MPI for inter-GPU communication to leverage NVLink.
- C . Employ Unified Memory (UM) with prefetching to automatically migrate data between GPUs as needed.
- D . Manually manage data transfers between GPUs using ‘cudaMemcpyPeer’ to exploit NVLink bandwidth. Choose the GPU with more free memory for allocations.
- E . Disable NVLink to force the application to use PCIe, which might provide more consistent performance.
D
Explanation:
‘cudaMemcpyPeer’ allows explicit, optimized data transfers between GPUs over NVLink. Unified Memory with prefetching can simplify development but might not always deliver the best performance. CUDA-aware MPI is typically used for inter-node communication, not intra-node GPU-to-GPU transfers. Allocating all memory on one GPU defeats the purpose of multi-GPU acceleration. PCIe is slower than NVLink. Manually managing data transfers, while complex, gives the programmer the most control over exploiting NVLink bandwidth.
You’ve replaced a faulty NVIDIA Quadro RTX 8000 GPU with an identical model in a workstation. The system boots, and ‘nvidia-smi’ recognizes the new GPU. However, when rendering complex 3D scenes in Maya, you observe significantly lower performance compared to before the replacement. Profiling with the NVIDIA Nsight Graphics debugger shows that the GPU is only utilizing a small fraction of its available memory bandwidth.
What are the TWO most likely contributing factors?
- A . The new GPU’s PCIe link speed is operating at a lower generation (e.g., Gen3 instead of Gen4).
- B . The NVIDIA OptiX denoiser is not properly configured or enabled.
- C . The workstation’s power plan is set to ‘Power Saver,’ limiting GPU performance.
- D . The Maya scene file contains corrupted or inefficient geometry.
- E . The newly installed GPU’s VBIOS has not been properly flashed, causing an incompatibility issue.
A,C
Explanation:
Low memory bandwidth utilization after replacing a GPU suggests a bottleneck in data transfer or power delivery. A lower PCIe link speed (A) would severely limit the GPU's ability to receive data from the CPU and system memory, resulting in underutilization. Setting the power plan to ‘Power Saver’ (C) restricts the GPU's power budget, preventing it from reaching its maximum clock speed and memory bandwidth. While OptiX denoiser configuration (B) and scene file issues (D) can impact rendering performance, they are less likely to directly cause low memory bandwidth utilization. VBIOS issues (E) could cause problems, but an incorrect or missing VBIOS typically produces driver errors or complete system unresponsiveness rather than a subtle bandwidth reduction.
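On Linux, the PCIe link speed from factor (A) can be read from sysfs (`/sys/bus/pci/devices/<BDF>/current_link_speed` and `max_link_speed`). The helper below is an illustrative sketch mapping the rate string to a generation; it assumes the newer-kernel format like "8.0 GT/s PCIe" (older kernels report e.g. "8 GT/s" and would need extra handling).

```python
# Rate -> PCIe generation, per the GT/s values of each PCIe revision.
PCIE_GENS = {"2.5": 1, "5.0": 2, "8.0": 3, "16.0": 4, "32.0": 5}

def pcie_generation(link_speed: str) -> int:
    """Map a sysfs link-speed string such as '8.0 GT/s PCIe' to a generation.
    Raises KeyError on an unrecognized rate."""
    return PCIE_GENS[link_speed.split()[0]]

def link_downtrained(current: str, maximum: str) -> bool:
    """True if the link trained below its maximum, e.g. Gen3 on a Gen4 slot,
    which matches the symptom described in factor (A)."""
    return pcie_generation(current) < pcie_generation(maximum)
```

A downtrained link after a card swap often points to a poorly seated card or a BIOS slot setting rather than a faulty GPU.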
During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s.
What does this indicate about the fabric performance?
- A . Inconclusive; rerun with point-to-point tests.
- B . Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.
- C . Critical failure; bus bandwidth exceeds hardware capabilities.
- D . Suboptimal performance; algorithm bandwidth should match bus bandwidth.
B
Explanation:
When evaluating NVIDIA Collective Communications Library (NCCL) performance, it is vital to distinguish between Algorithm Bandwidth and Bus Bandwidth. For an all_reduce operation, the Bus Bandwidth represents the effective data transfer rate across the hardware links, which includes the overhead of the ring or tree collective algorithm. In an NDR (400G) InfiniBand fabric, the theoretical peak per link is 50 GB/s (unidirectional). In a 64-GPU cluster (8 nodes of 8 GPUs), achieving a bus bandwidth of 656 GB/s indicates that the fabric is efficiently utilizing the multiple 400G rails available on the DGX H100. This result is considered optimal as it reflects near-line-rate performance when accounting for network headers and synchronization overhead. Algorithm bandwidth is naturally lower because it represents the "useful" data moved from the application’s perspective. If the bus bandwidth were significantly lower, it would suggest congestion, cable faults, or sub-optimal routing.
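The nccl-tests suite derives bus bandwidth from algorithm bandwidth with the all_reduce correction factor 2(n-1)/n, which a short calculation illustrates. Note the exact conversion of the question's figures gives about 689 GB/s, slightly above the observed 656 GB/s; the observed value reflects real-world overhead, consistent with the "near peak" reading.

```python
def allreduce_bus_bw(alg_bw_gbs: float, n_ranks: int) -> float:
    """nccl-tests conversion for all_reduce: busBw = algBw * 2*(n-1)/n.
    The factor accounts for each element crossing the links ~2(n-1)/n times
    in a ring-style reduction."""
    return alg_bw_gbs * 2 * (n_ranks - 1) / n_ranks

# Question's figures: 350 GB/s algorithm bandwidth on 64 ranks
# -> 350 * 2 * 63 / 64 = 689.06 GB/s ideal bus bandwidth.
```

Comparing measured bus bandwidth against this conversion (and against the per-link line rate times the number of rails) is how a reading like 656 GB/s gets classified as healthy rather than congested.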
You are configuring a server with multiple GPUs for CUDA-aware MPI.
Which environment variable is critical for ensuring proper GPU affinity, so that each MPI process uses the correct GPU?
- A . CUDA_VISIBLE_DEVICES
- B . CUDA_DEVICE_ORDER
- C . LD_LIBRARY_PATH
- D . MPI_GPU_SUPPORT
- E . CUDA_LAUNCH_BLOCKING=1
A
Explanation:
‘CUDA_VISIBLE_DEVICES’ is essential for GPU affinity: it specifies which GPUs are visible to a particular process. Without it, all processes might try to use the same GPU, leading to performance bottlenecks. ‘CUDA_DEVICE_ORDER’ only controls the order in which GPUs are enumerated. ‘LD_LIBRARY_PATH’ specifies the search path for shared libraries. ‘MPI_GPU_SUPPORT’ is not a real variable. ‘CUDA_LAUNCH_BLOCKING=1’ forces synchronous CUDA calls.
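A common pattern is a small wrapper that pins one GPU per MPI rank via CUDA_VISIBLE_DEVICES. This is a sketch under the assumption of an Open MPI launcher (which exports OMPI_COMM_WORLD_LOCAL_RANK); other launchers use different variables (e.g. SLURM_LOCALID), and 8 GPUs per node is an illustrative default.

```python
import os

def visible_device_for(local_rank: int, gpus_per_node: int = 8) -> str:
    """One-GPU-per-process mapping: local rank 0 -> "0", rank 1 -> "1", ...
    The modulo guards against oversubscribed launches."""
    return str(local_rank % gpus_per_node)

# Typical wrapper-script usage: read the node-local rank set by the launcher
# and restrict this process to a single GPU before any CUDA initialization.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
os.environ["CUDA_VISIBLE_DEVICES"] = visible_device_for(local_rank)
```

Because each process then sees exactly one device, application code can simply use CUDA device 0 regardless of which physical GPU it was assigned.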
Which of the following techniques can be used to optimize storage performance for deep learning training?
- A . Data prefetching
- B . Data compression using lossless algorithms (e.g., gzip)
- C . Using a larger block size for the file system
- D . Data sharding
- E . Data deduplication
A,C,D
Explanation:
Data prefetching anticipates future data needs and loads data into the cache before it is requested. A larger block size can improve I/O throughput for large files. Data sharding distributes data across multiple storage devices to increase parallelism. Data compression, while saving space, adds decompression overhead during training. Data deduplication is not normally useful for training data sets.
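Two of the correct techniques, sharding and prefetching, can be sketched in a few lines. This is an illustrative, framework-free version (real pipelines would use a data-loader library); the interleaved-slice sharding scheme and buffer size are assumptions.

```python
import queue
import threading

def shard(files, rank: int, world_size: int):
    """Static sharding: each worker gets a disjoint, interleaved slice of the
    dataset, so reads spread across files (and storage devices) in parallel."""
    return files[rank::world_size]

def prefetch(iterable, buffer_size: int = 4):
    """Produce items on a background thread so that I/O for the next items
    overlaps with consumption of the current one."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking end of stream

    def producer():
        for item in iterable:
            q.put(item)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not done:
        yield item
```

The same ideas appear as `prefetch()`/`shard()` operations in mainstream input pipelines; the point is that loading runs ahead of the training loop instead of blocking it.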
You are tasked with installing NVIDIA GPUs into a server that supports both single and double-width cards. You want to maximize GPU density.
What is the MOST important factor to consider when choosing between single and double-width cards?
- A . The clock speed of the GPUs.
- B . The amount of VRAM on the GPUs.
- C . The available PCIe slots and their spacing within the server chassis, and the server’s cooling capacity.
- D . The price of the GPUs.
- E . The brand of the GPUs.
C
Explanation:
While clock speed, VRAM, price, and brand are relevant, the physical constraints of the server (PCIe slot availability/spacing and cooling capacity) are paramount when deciding between single and double-width cards. Double-width cards offer more performance but require more space and cooling. If the slots are too close together or the cooling is inadequate, the cards will throttle and performance will suffer regardless of their specifications.
