Practice Free NCP-AII Exam Online Questions
You are troubleshooting a performance issue with NVMe-oF traffic being accelerated by a BlueField-2 DPU. You suspect a problem with the RDMA configuration.
Which of the following ‘perfquery’ commands would provide the MOST relevant information to diagnose potential RDMA issues such as packet loss or congestion?
- A . ‘perfquery -x’ (general link statistics)
- B . ‘perfquery -s’ (switch statistics)
- C . ‘perfquery’ (QOS statistics)
- D . ‘perfquery -P’ (port counters including packet loss and congestion)
- E . ‘perfquery -G’ (global counters)
D
Explanation:
‘perfquery -P’ provides port counters, including critical information about packet loss, congestion, and other RDMA-related metrics at the port level. This is the MOST relevant command for diagnosing performance problems related to RDMA within an NVMe-oF setup. The other options provide less specific or less relevant information.
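As a rough illustration, a small script can scan ‘perfquery -P’-style output for nonzero error counters. The counter names and the dotted output format below are assumptions based on typical perfquery output; verify against your fabric.

```python
import re

# Counters whose nonzero values typically indicate loss or congestion.
# (Names as they appear in typical perfquery output; treat as an assumption.)
SUSPECT = {"SymbolErrorCounter", "PortRcvErrors", "PortXmitDiscards", "PortXmitWait"}

def flag_port_issues(perfquery_output: str) -> dict:
    """Return suspect counters with nonzero values from `perfquery -P`-style text."""
    issues = {}
    for line in perfquery_output.splitlines():
        m = re.match(r"(\w+):\.*(\d+)", line.strip())
        if m and m.group(1) in SUSPECT and int(m.group(2)) > 0:
            issues[m.group(1)] = int(m.group(2))
    return issues

sample = """\
PortSelect:......................1
SymbolErrorCounter:..............0
PortXmitDiscards:................12
PortXmitWait:....................3450
"""
# Discards point to drops; XmitWait indicates time spent congested.
print(flag_port_issues(sample))
```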
In a large-scale InfiniBand fabric, you need to implement a mechanism to prioritize traffic for a specific application that requires low latency and high bandwidth. You want to leverage Quality of Service (QOS) to achieve this.
Which of the following steps are essential to properly configure QOS in this scenario? (Select THREE)
- A . Configure VLAN tagging on the application’s traffic to isolate it from other traffic.
- B . Map the application’s traffic to a specific traffic class with appropriate priority settings within the InfiniBand switches.
- C . Configure Weighted Fair Queueing (WFQ) or Strict Priority Queueing on the egress ports of the InfiniBand switches to prioritize the application’s traffic class.
- D . Disable Adaptive Routing (AR) to ensure that the application’s traffic always takes the shortest path.
- E . Mark the application’s traffic with appropriate DiffServ Code Point (DSCP) values.
B,C,E
Explanation:
Effective QOS requires traffic classification (DSCP marking), mapping to appropriate traffic classes with priority settings, and configuring queueing mechanisms (WFQ/Strict Priority Queueing) on egress ports to enforce the priority. VLAN tagging is useful for network segmentation but not directly for QOS. Disabling AR might reduce path diversity, but could also lead to congestion if the shortest path is already heavily utilized.
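A minimal sketch of how the three selected steps fit together: traffic is marked with a DSCP value (E), mapped to a traffic class (B), and assigned an egress queue priority (C). The mapping table below is purely illustrative, not a recommended policy.

```python
# Hypothetical DSCP -> (class name, egress queue priority) table, the kind
# of mapping a switch QoS config expresses. Values are illustrative only.
DSCP_TO_TC = {
    46: ("EF", 3),    # Expedited Forwarding: the low-latency application
    26: ("AF31", 2),  # Assured Forwarding: medium priority
    0:  ("BE", 0),    # Best Effort: everything unmarked
}

def classify(dscp: int) -> tuple:
    """Map a packet's DSCP marking to its traffic class and queue priority."""
    return DSCP_TO_TC.get(dscp, ("BE", 0))

print(classify(46))  # EF traffic lands in the high-priority queue
print(classify(99))  # unknown markings fall back to best effort
```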
What command is needed to measure BER (Bit Error Rate)?
- A . mlxconfig -d <device> q
- B . ethtool -S <device>
- C . mlxlink -d <device> -c -e
- D . mstflint -d <device> q full
C
Explanation:
In NVIDIA networking environments, specifically those utilizing InfiniBand or high-speed Ethernet via ConnectX adapters, monitoring the physical link quality is critical for preventing packet loss and RDMA retransmissions. The mlxlink tool is part of the NVIDIA Firmware Tools (MFT) package and is the primary utility for checking the status and health of the physical link. Using the -d flag specifies the device (e.g., /dev/mst/mt4123_pciconf0), while the -c (counters) and -e (error counters/BER) flags provide a detailed readout of the link’s performance. Bit Error Rate (BER) is a fundamental metric for signal integrity. NVIDIA systems typically distinguish between "Raw BER" (errors before Forward Error Correction) and "Effective BER" (errors remaining after FEC). A high BER often points to a failing transceiver, a dirty fiber connector, or a marginal DAC cable. While ethtool can show general statistics in Ethernet mode, mlxlink is the verified method for granular BER measurement across InfiniBand and high-speed fabrics, allowing engineers to determine if a link meets the "Error-Free" operation standards required for large-scale AI collective communications like NCCL.
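To make the raw-vs-effective distinction concrete, here is a hedged sketch that classifies a link from the two BER readings mlxlink reports. The threshold values are illustrative assumptions, not NVIDIA-specified limits:

```python
def link_health(raw_ber: float, effective_ber: float,
                raw_limit: float = 1e-5, eff_limit: float = 1e-15) -> str:
    """Classify a link from mlxlink-style BER readings.

    Thresholds are illustrative: with FEC enabled, a raw (pre-FEC) BER up
    to roughly 1e-5 is often correctable, while the effective (post-FEC)
    BER should be vanishingly small for error-free operation.
    """
    if effective_ber > eff_limit:
        return "FAIL: post-FEC errors are reaching the fabric"
    if raw_ber > raw_limit:
        return "WARN: FEC is masking a marginal cable/transceiver"
    return "OK"

print(link_health(raw_ber=3e-8, effective_ber=0.0))  # OK
print(link_health(raw_ber=2e-4, effective_ber=0.0))  # WARN: marginal link
```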
You are implementing a security policy on a BlueField-2 DPU to filter traffic based on specific application signatures.
Which technology, supported by BlueField, allows you to achieve deep packet inspection (DPI) and apply security rules based on the detected application?
- A . TC (Traffic Control) with ‘iptables’ rules.
- B . eBPF (extended Berkeley Packet Filter) with XDP (eXpress Data Path).
- C . OVS (Open vSwitch) with OpenFlow rules.
- D . IPsec (Internet Protocol Security) tunnels.
- E . Netfilter with connection tracking.
B
Explanation:
eBPF with XDP is the most suitable technology for deep packet inspection (DPI) on BlueField. It allows you to run custom code at near-line speed to inspect packets and apply security rules based on application signatures. TC and Netfilter are less efficient for DPI, OVS/OpenFlow are more for switching policies, and IPsec focuses on encryption.
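To illustrate the idea (in user-space Python rather than actual eBPF bytecode), signature-based classification boils down to matching known byte patterns in packet payloads. The signatures below are simplified assumptions, not a real DPI ruleset:

```python
# Toy illustration of signature-based classification -- the kind of byte
# matching an XDP/eBPF program performs on payloads at near-line rate.
SIGNATURES = {
    b"GET ":     "http",  # HTTP request method
    b"\x16\x03": "tls",   # TLS record header (handshake, TLS 1.x)
    b"SSH-":     "ssh",   # SSH version banner
}

def classify_payload(payload: bytes) -> str:
    """Return the detected application, or 'unknown' if no signature matches."""
    for sig, app in SIGNATURES.items():
        if payload.startswith(sig):
            return app
    return "unknown"

print(classify_payload(b"GET /index.html HTTP/1.1"))  # http
```

Once classified, a security rule (drop, redirect, rate-limit) can be applied per application, which is what the eBPF program would do in the XDP hook.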
You’re optimizing an AMD EPYC server with 4 NVIDIA A100 GPUs for a large language model training workload. You observe that the GPUs are consistently underutilized (50-60% utilization) while the CPUs are nearly maxed out.
Which of the following is the MOST likely bottleneck?
- A . Insufficient CPU cores to prepare and feed data to the GPUs.
- B . The PCIe interconnect between the CPUs and GPUs is saturated.
- C . The system RAM is too small, causing excessive swapping.
- D . The storage system (SSD/NVMe) is too slow, leading to data starvation.
- E . The NCCL (NVIDIA Collective Communications Library) is not properly configured for inter-GPU communication.
A
Explanation:
When CPUs are maxed out and GPUs are underutilized, it suggests that the CPUs are unable to keep up with the data preparation and feeding requirements of the GPUs. Insufficient CPU cores become the bottleneck. While other options can contribute, the CPU being the primary bottleneck is the most likely cause in this scenario.
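A back-of-the-envelope pipeline model makes the diagnosis concrete: GPU utilization is capped by the ratio of the CPU's batch-production rate to the GPU's consumption rate. The rates below are hypothetical:

```python
def gpu_utilization(cpu_batches_per_s: float, gpu_batches_per_s: float) -> float:
    """Rough pipeline model: the GPU can only consume batches as fast as
    the CPU data-loading workers produce them."""
    return min(1.0, cpu_batches_per_s / gpu_batches_per_s)

# 4 workers at a hypothetical 15 batches/s each, feeding a GPU that could
# train 100 batches/s:
print(gpu_utilization(4 * 15, 100))  # 0.6 -> the 50-60% pattern observed
# Doubling workers (assuming near-linear scaling) lifts utilization:
print(gpu_utilization(8 * 15, 100))  # 1.0
```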
Your AI training pipeline involves a pre-processing step that reads data from a large HDF5 file. You notice significant delays during this step. You suspect the HDF5 file structure might be contributing to the slow read times.
What optimization technique is MOST likely to improve read performance from this HDF5 file?
- A . Converting the HDF5 file to a CSV file.
- B . Storing the HDF5 file on a network file system like NFS.
- C . Reorganizing the HDF5 file to improve data contiguity and chunking.
- D . Compressing the HDF5 file using gzip.
- E . Encrypting the HDF5 file for enhanced security.
C
Explanation:
Reorganizing the HDF5 file (option C) to improve data contiguity and chunking is the most effective optimization. HDF5 performance is highly dependent on how the data is laid out within the file. Contiguous data and optimal chunk sizes allow for more efficient I/O operations. Converting to CSV (A) loses the hierarchical structure of HDF5. Storing on NFS (B) adds network overhead. Compression (D) can reduce storage space but increases decompression overhead. Encryption (E) adds overhead without improving read performance.
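A quick way to see why chunking matters: each chunk touched by a read is a separate seek (and, if the dataset is compressed, a separate decompression). This sketch counts the chunks crossed by a 1-D read; in practice a tool such as h5repack can rewrite the file with chunk shapes aligned to the access pattern.

```python
def chunks_touched(read_start: int, read_len: int, chunk_len: int) -> int:
    """Number of chunks a 1-D hyperslab read crosses."""
    first = read_start // chunk_len
    last = (read_start + read_len - 1) // chunk_len
    return last - first + 1

# Reading 1000 rows from a dataset chunked in rows of 10 touches 100
# chunks (100 seeks); chunked in rows of 1000, the same read touches 1.
print(chunks_touched(0, 1000, 10))    # 100
print(chunks_touched(0, 1000, 1000))  # 1
```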
Consider a scenario where you want to reset your NVIDIA A100 GPU back to a non-MIG mode state after having previously configured MIG.
Which of the following steps are required?
- A . Run ‘nvidia-smi --set-mig-mode=disable -i 0’ and then reboot the system.
- B . Run ‘nvidia-smi --destroy-mig-config -i 0’, then run ‘nvidia-smi --set-mig-mode=disable -i 0’, and finally reboot.
- C . Run ‘nvidia-smi --set-mig-mode=disable -i 0’ followed by ‘nvidia-smi --reset-default-mig-mode -i 0’.
- D . Run ‘nvidia-smi --set-mig-mode=disable -i 0’, then power off the system and physically remove and re-install the GPU.
- E . Run ‘nvidia-smi --set-mig-mode=disable -i 0’, then run ‘nvidia-smi -i 0 -migrr 0’, and finally reboot.
A
Explanation:
To reset the GPU to non-MIG mode, the command ‘nvidia-smi --set-mig-mode=disable -i 0’ must be executed, followed by a system reboot. This ensures that the change is applied during the next boot process. Destroying the MIG configuration is not required just to disable MIG mode, nor is physically reinstalling the GPU.
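For reference, recent drivers usually spell this procedure with the short ‘-mig’ flag. The commands below are a sketch requiring root and a MIG-capable GPU; verify the exact syntax against ‘nvidia-smi -h’ on your system.

```shell
# Disable MIG mode on GPU 0 (requires root)
sudo nvidia-smi -i 0 -mig 0

# The mode change takes effect after a GPU reset or a reboot
sudo reboot
```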
For a 48-hour NCCL burn-in test, which parameters ensure sustained fabric stress while detecting silent data corruption?
- A . broadcast_perf -b 4G -e 16G -w 160
- B . all_reduce_perf -b 8G -e 32G -c 1000 -z 1 -G 1000
- C . all_reduce_perf -b 8G -e 32G -z 1 -G 1000
- D . reduce_scatter_perf -f 2 -g 8
B
Explanation:
The NVIDIA Collective Communications Library (NCCL) tests are the gold standard for validating the interconnect performance of a GPU cluster. For a long-duration burn-in (48 hours), the goal is not just to measure peak bandwidth, but to stress the fabric under load to catch intermittent hardware failures or "Silent Data Corruption" (SDC). The all_reduce_perf test is the most intensive as it involves bidirectional data flow across all GPUs. The specific parameters in Option B are critical: -b 8G -e 32G sets the message size range to large buffers that saturate the 400G InfiniBand links; -c 1000 ensures a high number of iterations for statistical significance; -z 1 (check) is the most vital flag, as it enables verification of the mathematical result. If a bit flips during transmission due to a faulty transceiver, the -z 1 flag will catch the mismatch and report a failure. Finally, -G 1000 ensures the test runs long enough to reach thermal equilibrium across the switches and HCAs.
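The relationship between measured algorithm bandwidth and the bus bandwidth that nccl-tests reports can be sketched directly from the all_reduce formula busbw = algbw × 2(n−1)/n:

```python
def allreduce_bus_bw(alg_bw_gbs: float, n_ranks: int) -> float:
    """Convert algorithm bandwidth (data size / time) to bus bandwidth for
    all_reduce, using the nccl-tests factor 2*(n-1)/n: each rank must both
    send and receive (n-1)/n of the buffer over the fabric."""
    return alg_bw_gbs * 2 * (n_ranks - 1) / n_ranks

# Hypothetical run: 8 GPUs measuring 200 GB/s algorithm bandwidth
print(allreduce_bus_bw(200, 8))  # 350.0 GB/s on the bus
```

This is why a burn-in that sustains large all_reduce buffers stresses the links harder than one-directional collectives of the same nominal rate.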
You are developing a distributed deep learning application that uses multiple GPUs across several Docker containers running on different physical servers.
How do you ensure that each container can access and utilize the GPUs on its respective host?
- A . Install the same version of NVIDIA drivers on all host machines and configure a network file system (NFS) to share the CUDA libraries between the containers.
- B . Ensure the NVIDIA Container Toolkit is installed and configured on each host machine, and use a container orchestration platform like Kubernetes to manage the deployment and GPU assignment.
- C . Manually configure each container to use the ‘CUDA_VISIBLE_DEVICES’ environment variable to specify the GPUs it should use on its respective host.
- D . Use Docker Swarm and specify GPU resource constraints in the ‘docker-compose.yml’ file to allocate GPUs to each service.
- E . Create a custom Docker network and configure each container to use the network’s gateway as the default GPU device.
B
Explanation:
The most robust solution for distributed GPU utilization is to leverage a container orchestration platform like Kubernetes (B) along with the NVIDIA Container Toolkit. Kubernetes handles scheduling, resource allocation (including GPUs), and networking across multiple nodes.
The NVIDIA Container Toolkit ensures that each container can access the GPUs on its host. While (C) is useful, it’s not sufficient for multi-server deployments. Docker Swarm (D) can work but lacks the sophisticated GPU scheduling capabilities of Kubernetes. NFS sharing (A) is unnecessary and can introduce performance bottlenecks. A custom Docker network (E) doesn’t directly address GPU access.
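As a sketch of how Kubernetes expresses GPU assignment (assuming the NVIDIA device plugin DaemonSet is already deployed cluster-wide; the pod name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer            # hypothetical training pod
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3   # illustrative image tag
    resources:
      limits:
        nvidia.com/gpu: 2  # device plugin schedules 2 GPUs to this pod
```

The scheduler places the pod only on nodes advertising free ‘nvidia.com/gpu’ resources, and the NVIDIA Container Toolkit on the host exposes those GPUs inside the container.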
Which of the following are key benefits of using NVIDIA Spectrum-X switches in an AI infrastructure compared to traditional Ethernet switches? (Select THREE)
- A . Lower cost per port.
- B . Support for RoCE (RDMA over Converged Ethernet) and InfiniBand protocols, enabling high-bandwidth, low-latency communication.
- C . Advanced telemetry and monitoring capabilities for network performance optimization.
- D . Hardware-based acceleration for collective communication operations used in distributed AI training.
- E . Native support for IPv6.
B,C,D
Explanation:
Spectrum-X switches are designed for high-performance computing and AI workloads. They support RoCE and InfiniBand for low-latency communication, offer advanced telemetry for network optimization, and include hardware-based acceleration for collective communication operations, improving the efficiency of distributed AI training. While Spectrum-X supports IPv6, this is also a common feature in modern Ethernet switches. Spectrum-X switches typically have a higher cost per port than basic Ethernet switches due to their advanced features and performance.
