Practice Free NCP-AII Exam Online Questions
You are building a cloud-native application that uses both CPU and GPU resources. You want to optimize resource utilization and cost by scheduling CPU-intensive tasks on nodes without GPUs and GPU-intensive tasks on nodes with GPUs.
How would you achieve this node selection and workload placement in Kubernetes?
- A . Use node affinity rules to schedule CPU-intensive tasks on nodes with GPUs and GPU-intensive tasks on nodes without GPUs.
- B . Use node affinity rules to schedule CPU-intensive tasks on nodes without GPUs and GPU-intensive tasks on nodes with GPUs.
- C . Use taints and tolerations to dedicate nodes without GPUs to CPU-intensive tasks and nodes with GPUs to GPU-intensive tasks.
- D . Use resource quotas to limit the CPU resources available on nodes with GPUs and the GPU resources available on nodes without GPUs.
- E . Use labels to identify the CPU and GPU-intensive nodes.
B
Explanation:
Node affinity rules allow you to specify constraints on which nodes a pod can be scheduled on. By using node affinity rules, you can ensure that CPU-intensive tasks are scheduled on nodes without GPUs and GPU-intensive tasks are scheduled on nodes with GPUs. This optimizes resource utilization and cost. Taints and tolerations can be used, but affinity is more flexible. Resource quotas limit resource usage but do not control placement.
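For illustration, below is a minimal sketch of such an affinity rule, assuming GPU nodes carry a label such as ‘nvidia.com/gpu.present=true’ (set by NVIDIA GPU Feature Discovery or applied manually); the node name and container image are placeholders:

```bash
# Label a GPU node manually if no feature-discovery label is present
# (node name and label key are illustrative assumptions).
kubectl label node gpu-node-01 nvidia.com/gpu.present=true

# GPU-intensive pod: require a node carrying the GPU label.
# A CPU-only pod would use the same key with 'operator: DoesNotExist'.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: In
            values: ["true"]
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.04-py3   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```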
You are tasked with ensuring optimal power efficiency for a GPU server running machine learning workloads. You want to dynamically adjust the GPU’s power consumption based on its utilization.
Which of the following methods is the MOST suitable for achieving this, assuming the server’s BIOS and the NVIDIA drivers support it?
- A . Manually set the GPU’s power limit using ‘nvidia-smi -pl’ and create a script to monitor utilization and adjust the power limit periodically.
- B . Configure the server’s BIOS/UEFI to use a power-saving profile, which will automatically reduce the GPU’s power consumption when idle.
- C . Enable Dynamic Boost in the NVIDIA Control Panel (if available), which will automatically allocate power between the CPU and GPU based on their current needs.
- D . Use NVIDIA’s Data Center GPU Manager (DCGM) to monitor GPU utilization and dynamically adjust the power limit based on a predefined policy.
- E . Disable ECC (Error Correcting Code) on the GPU to reduce power consumption.
D
Explanation:
DCGM provides the most comprehensive and automated solution for dynamic power management. It can monitor GPU utilization in real time and adjust the power limit based on predefined policies, ensuring optimal power efficiency without manual intervention. Manually adjusting the power limit is possible but requires scripting and continuous monitoring. Dynamic Boost is typically for laptops, and BIOS power profiles may not be fine-grained enough. Disabling ECC reduces power but compromises data integrity.
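As a point of comparison, the scripted approach from option A might look like the sketch below (thresholds, interval, and wattage values are assumptions; DCGM can drive the same adjustment automatically through its policy engine):

```bash
#!/usr/bin/env bash
# Minimal sketch: periodically adjust one GPU's power limit based on utilization.
GPU=0
LOW_LIMIT=150   # watts when mostly idle (assumed value)
HIGH_LIMIT=300  # watts under load (assumed value)

while true; do
  util=$(nvidia-smi -i "$GPU" --query-gpu=utilization.gpu --format=csv,noheader,nounits)
  if [ "$util" -lt 20 ]; then
    sudo nvidia-smi -i "$GPU" -pl "$LOW_LIMIT"
  else
    sudo nvidia-smi -i "$GPU" -pl "$HIGH_LIMIT"
  fi
  sleep 30
done
```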
While running a large AI training job, you observe the following output from ‘nvidia-smi’:
GPU 0: P2
GPU 1: P2
GPU 2: P2
GPU 3: P2
What does the ‘P2’ state indicate, and what steps should you take to investigate this further in the context of validating optimal hardware operation?
- A . P2 indicates the GPUs are in a high-performance state; no further investigation is needed.
- B . P2 indicates the GPUs are in a low-power idle state; investigate if the driver is correctly configured and the workload is properly utilizing the GPUs.
- C . P2 indicates a critical error state; immediately halt the training job and check the system logs for hardware failures.
- D . P2 indicates the GPU is running at maximum clock speed. Check for thermal throttling.
B
Explanation:
P2 typically indicates a power-saving state where the GPU is operating at a reduced clock speed. It’s crucial to investigate whether the workload is demanding sufficient resources from the GPUs, and whether power limits or other configuration settings are preventing the GPUs from reaching their maximum performance state. Review ‘nvidia-smi -q’ output for power usage and clock speeds to verify proper operation. Other power states (P0, P1) represent higher performance levels.
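A quick way to see what is holding the GPUs in P2 is to query the relevant counters directly, for example:

```bash
# Current performance state, clocks vs. maximum clocks, and power headroom
nvidia-smi --query-gpu=index,pstate,clocks.sm,clocks.max.sm,power.draw,power.limit \
           --format=csv
# Detailed per-GPU view, including clock throttle reasons
nvidia-smi -q -d PERFORMANCE,POWER,CLOCK
```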
You are configuring a server with multiple GPUs for CUDA-aware MPI.
Which environment variable is critical for ensuring proper GPU affinity, so that each MPI process uses the correct GPU?
- A . CUDA_VISIBLE_DEVICES
- B . CUDA_DEVICE_ORDER
- C . LD_LIBRARY_PATH
- D . MPI_GPU_SUPPORT
- E . CUDA_LAUNCH_BLOCKING
A
Explanation:
‘CUDA_VISIBLE_DEVICES’ is essential for GPU affinity. It allows you to specify which GPUs are visible to a particular process. Without it, all processes might try to use the same GPU, leading to performance bottlenecks. ‘CUDA_DEVICE_ORDER’ controls the order in which GPUs are enumerated. ‘LD_LIBRARY_PATH’ specifies the path to shared libraries. ‘MPI_GPU_SUPPORT’ is hypothetical. ‘CUDA_LAUNCH_BLOCKING’ forces synchronous CUDA calls.
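In practice this is often done with a small per-rank wrapper script. The sketch below assumes Open MPI, which exposes ‘OMPI_COMM_WORLD_LOCAL_RANK’; under Slurm you would use ‘SLURM_LOCALID’ instead:

```bash
#!/usr/bin/env bash
# bind_gpu.sh -- map each local MPI rank to one GPU (Open MPI assumption)
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"
```

Launched, for example, as ‘mpirun -np 8 ./bind_gpu.sh python train.py’, so that local rank N only sees GPU N.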
You are planning the network infrastructure for a DGX SuperPOD. You need to ensure that the network fabric can handle the high bandwidth and low latency requirements of AI training workloads.
Which network technology is the RECOMMENDED choice for interconnecting the DGX nodes within the SuperPOD, and why?
- A . Gigabit Ethernet, because it’s widely available and inexpensive.
- B . 10 Gigabit Ethernet, for a balance between cost and performance.
- C . InfiniBand, due to its high bandwidth, low latency, and RDMA support.
- D . Wi-Fi 6, for wireless connectivity and flexibility.
- E . Token Ring, because it’s a reliable and deterministic networking protocol.
C
Explanation:
InfiniBand is the recommended network technology for DGX SuperPODs due to its high bandwidth, low latency, and support for RDMA (Remote Direct Memory Access). RDMA allows GPUs to directly access each other’s memory without involving the CPU, significantly reducing latency and improving performance for distributed AI training workloads. Ethernet, even at higher speeds, generally doesn’t offer the same level of performance and RDMA capabilities as InfiniBand.
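On a deployed fabric, the link state and rate of the InfiniBand adapters can be sanity-checked with standard tools (package names vary by distribution):

```bash
# Port state, link rate, and LID for each HCA (infiniband-diags package)
ibstat
# Verbs-level view of adapters, active MTU, and port state (libibverbs utilities)
ibv_devinfo | grep -E 'hca_id|active_mtu|state'
```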
You’re troubleshooting a DGX-1 server exhibiting performance degradation during a large-scale distributed training job. ‘nvidia-smi’ shows all GPUs are detected, but one GPU consistently reports significantly lower utilization than the others. Attempts to reschedule workloads to that GPU frequently result in CUDA errors.
Which of the following is the MOST likely cause and the BEST initial troubleshooting step?
- A . A driver issue affecting only one GPU; reinstall NVIDIA drivers completely.
- B . A software bug in the training script utilizing that specific GPU’s resources inefficiently; debug the training script.
- C . A hardware fault with the GPU, potentially thermal throttling or memory issues; run ‘nvidia-smi -i <gpu_id> -q’ to check temperatures, power limits, and error counts.
- D . Insufficient cooling in the server rack; verify adequate airflow and cooling capacity for the rack.
- E . Power supply unit (PSU) overload, causing reduced power delivery to that GPU; monitor PSU load and check PSU specifications.
C
Explanation:
While all options are possibilities, the consistently lower utilization and CUDA errors point strongly to a hardware fault. Running ‘nvidia-smi -i <gpu_id> -q’ provides detailed telemetry data, including temperature, power limits, and ECC error counts, which are crucial for diagnosing GPU hardware issues.
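For example, assuming the under-utilized GPU has index 2, the following checks surface most hardware-level problems:

```bash
# Temperature, power, throttle reasons, and ECC error counts for the suspect GPU
nvidia-smi -i 2 -q -d TEMPERATURE,POWER,PERFORMANCE,ECC
# XID errors in the kernel log are a strong indicator of a GPU hardware fault
sudo dmesg | grep -i xid
```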
Which of the following techniques are effective for improving inter-GPU communication performance in a multi-GPU Intel Xeon server used for distributed deep learning training with NCCL?
- A . Enabling PCIe peer-to-peer transfers between GPUs.
- B . Utilizing InfiniBand or RoCE interconnects if available.
- C . Increasing the system RAM size to minimize data transfer to disk.
- D . Configuring NCCL to use the correct network interface and transport protocol (e.g., IB, Socket).
- E . Disabling CPU frequency scaling to maintain consistent performance.
A,B,D
Explanation:
Improving inter-GPU communication involves optimizing the interconnects used for transferring data between GPUs. PCIe peer-to-peer, InfiniBand/RoCE, and proper NCCL configuration all contribute to faster communication. Increasing RAM size helps with data caching but doesn’t directly affect inter-GPU communication speed. Disabling CPU frequency scaling is about CPU performance stability, not inter-GPU communication directly.
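A typical NCCL configuration for an InfiniBand-equipped node might look like the sketch below (the HCA and interface names are assumptions; confirm them with ‘ibstat’ and ‘ip link’):

```bash
export NCCL_IB_HCA=mlx5_0          # InfiniBand adapter to use (assumed name)
export NCCL_SOCKET_IFNAME=ens1f0   # interface for bootstrap/socket traffic (assumed name)
export NCCL_P2P_LEVEL=NVL          # use peer-to-peer when GPUs share an NVLink connection
export NCCL_DEBUG=INFO             # log which transports NCCL actually selects
```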
A distributed training job using multiple nodes, each with eight NVIDIA GPUs, experiences significant performance degradation. You notice that the network bandwidth between nodes is consistently near its maximum capacity. However, ‘nvidia-smi’ shows low GPU utilization on some nodes.
What is the MOST likely cause?
- A . The GPUs are overheating, causing thermal throttling.
- B . Data is not being distributed evenly across the nodes; some nodes are waiting for data from others.
- C . The NVIDIA drivers are outdated, causing communication bottlenecks.
- D . The network interface cards (NICs) are faulty, causing packet loss and retransmissions.
- E . The CPU is heavily loaded, causing contention for network resources.
B
Explanation:
High network bandwidth utilization combined with low GPU utilization on some nodes strongly suggests a data imbalance. Some nodes are likely waiting for data from other nodes, causing them to be idle while the network is saturated. This is a common problem in distributed training and requires addressing the data distribution strategy. While other factors (overheating, outdated drivers, faulty NICs, CPU load) could contribute, they are less likely to be the primary cause given the observed symptoms.
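A quick way to confirm the imbalance is to sample GPU and network activity on each node while the job runs, for example:

```bash
# Per-GPU utilization samples (10 iterations)
nvidia-smi dmon -s u -c 10
# Per-interface network throughput (requires the sysstat package)
sar -n DEV 1 10
```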
You are configuring a network bridge on a Linux host that will connect multiple physical network interfaces to a virtual machine. You need to ensure that the virtual machine receives an IP address via DHCP.
Which of the following is the correct command sequence to create the bridge interface ‘br0’, add physical interfaces ‘eth0’ and ‘eth1’ to it, and bring up the bridge interface? Assume the required packages are installed and use the ‘ip’ command.
- A . Option A
- B . Option B
- C . Option C
- D . Option D
- E . Option E
D
Explanation:
Option D is the correct sequence using the ‘ip’ command. First, create the bridge ‘br0’. Then, add the physical interfaces ‘eth0’ and ‘eth1’ as slaves to the bridge. Next, bring up the physical interfaces. After that, bring up the bridge interface. Finally, run ‘dhclient br0’ to obtain an IP address for the bridge via DHCP.
Option C is the old way, using ‘brctl’ and ‘ifconfig’, which are deprecated. The others lack the crucial step of bringing up the bridge after attaching the physical interfaces and before running ‘dhclient’.
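For reference, the sequence described above looks like the following sketch (interface names assumed):

```bash
ip link add name br0 type bridge   # create the bridge
ip link set eth0 master br0        # enslave the physical interfaces
ip link set eth1 master br0
ip link set eth0 up                # bring up the physical interfaces
ip link set eth1 up
ip link set br0 up                 # bring up the bridge itself
dhclient br0                       # obtain an IP address for the bridge via DHCP
```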
A large AI model is training using a dataset stored on a network-attached storage (NAS) device. The data transfer speeds are significantly lower than expected. After initial troubleshooting, you discover that the MTU (Maximum Transmission Unit) sizes on the network interfaces of the training server and the NAS device are mismatched. The server is configured with an MTU of 1500, while the NAS device is configured with an MTU of 9000 (Jumbo Frames).
What is the MOST likely consequence of this MTU mismatch, and what action should you take?
- A . Data packets will be fragmented, leading to increased overhead and reduced performance. Configure both the server and the NAS device to use the same MTU size (either 1500 or 9000).
- B . The connection between the server and the NAS device will be unreliable, resulting in data corruption. Increase the MTU size on both devices to the maximum supported value.
- C . The server will be unable to communicate with the NAS device. Reduce the MTU size on the server to match the MTU size of the NAS device.
- D . The data transfer will be limited to the lowest common MTU size, but there will be no significant performance impact. No action is required.
- E . Data packets will be retransmitted, increasing the latency but still getting the full throughput. Configure the server to use Path MTU Discovery (PMTUD).
A
Explanation:
An MTU mismatch (option A) will cause fragmentation, where larger packets are broken down into smaller packets before being transmitted, adding overhead and reducing performance. The solution is to configure both devices to use the same MTU size. Choosing 1500 ensures compatibility, while 9000 requires the entire network path to support jumbo frames.
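For example, to align the MTUs and then verify that jumbo frames actually traverse the path (the interface name and NAS address are placeholders):

```bash
# Inspect the server interface's current MTU
ip link show dev eth0 | grep mtu

# Either drop the NAS to standard frames, or raise the server (and every switch
# in between) to jumbo frames:
sudo ip link set dev eth0 mtu 9000

# Confirm a 9000-byte path end to end: 8972 bytes payload + 28 bytes IP/ICMP headers
ping -M do -s 8972 <nas_ip>
```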