Practice Free NCP-AII Exam Online Questions
You are using NVIDIA Spectrum-X switches in your AI infrastructure. You observe high latency between two GPU servers during a large distributed training job. After analyzing the switch telemetry, you suspect a suboptimal routing path is contributing to the problem.
Which of the following methods offers the MOST granular control for influencing traffic flow within the Spectrum-X fabric to mitigate this?
- A . Adjust the Equal-Cost Multi-Path (ECMP) hashing algorithm globally on all switches.
- B . Configure QoS (Quality of Service) policies to prioritize traffic from the high-latency GPU servers.
- C . Implement Adaptive Routing (AR) or Dynamic Load Balancing (DLB) features available on the Spectrum-X switches to dynamically adjust paths based on network conditions.
- D . Manually configure static routes on the Spectrum-X switches to force traffic between the GPU servers along a specific path.
- E . Disable IPv6 to simplify routing decisions.
C
Explanation:
Adaptive Routing (AR) and Dynamic Load Balancing (DLB) are features specifically designed to dynamically adjust paths based on real-time network conditions in Spectrum-X. This provides the most granular and automated way to respond to congestion and optimize traffic flow compared to static routing or global ECMP adjustments. QoS prioritizes traffic but doesn’t change the chosen path.
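As a hedged illustration of option C: on Spectrum switches running Cumulus Linux 5.x, adaptive routing is typically enabled through NVUE. The exact configuration path varies by release, so treat the ‘adaptive-routing’ syntax below as an assumption to verify against your switch documentation.

```bash
# Illustrative NVUE commands (Cumulus Linux 5.x). The 'adaptive-routing'
# path is an assumption -- confirm the exact syntax for your release.
nv set router adaptive-routing enable on   # enable AR fabric-wide (assumed path)
nv config apply                            # commit the pending configuration
nv show router adaptive-routing            # inspect the applied state (assumed path)
```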
What is the role of GPUDirect RDMA in an NVLink Switch-based system, and how does it improve performance?
- A . It allows GPUs to directly access each other’s memory without involving the CPU, reducing latency and CPU overhead.
- B . It provides a mechanism for GPUs to offload compute-intensive tasks to the CPU, improving overall system throughput.
- C . It enables direct communication between GPUs and storage devices, bypassing the network interface.
- D . It facilitates the virtualization of GPUs, allowing multiple virtual machines to share a single physical GPU.
- E . It encrypts data transmitted between GPUs, enhancing security.
A
Explanation:
GPUDirect RDMA enables direct memory access between GPUs, bypassing the CPU and reducing latency. This significantly improves performance for applications that require frequent data transfers between GPUs. Other options describe functionalities that are not associated with RDMA in this context.
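To see whether the GPUs in a system can actually reach each other directly over NVLink/NVSwitch rather than through the CPU, the driver’s topology view is a quick first check:

```bash
# Print the GPU-to-GPU connectivity matrix; entries such as NV1/NV2 indicate
# NVLink connections, while PHB/SYS indicate paths that traverse the CPU.
nvidia-smi topo -m

# Show the state and per-link status of each GPU's NVLink connections.
nvidia-smi nvlink --status
```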
Which protocol is commonly used in Spine-Leaf architectures for dynamic routing and load balancing across multiple paths?
- A . STP (Spanning Tree Protocol)
- B . OSPF (Open Shortest Path First)
- C . VRRP (Virtual Router Redundancy Protocol)
- D . ECMP (Equal-Cost Multi-Path)
- E . BGP (Border Gateway Protocol)
D
Explanation:
ECMP (Equal-Cost Multi-Path) is crucial for efficiently utilizing the multiple paths available in a Spine-Leaf architecture. It allows traffic to be distributed across these paths, improving throughput and reducing congestion. OSPF and BGP can be used for routing but do not inherently provide per-packet load balancing. STP is used to prevent loops, and VRRP provides router redundancy, neither of which directly address load balancing across multiple equal-cost paths.
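As a minimal sketch of the mechanism (using generic Linux iproute2 syntax rather than a specific switch OS, with hypothetical addresses and interface names), an equal-cost route with two next hops lets the kernel hash flows across both uplinks:

```bash
# One route, two equal-cost next hops (hypothetical addresses/interfaces);
# flows are hashed across swp1 and swp2 instead of being pinned to one uplink.
ip route add 10.0.0.0/24 \
    nexthop via 192.168.1.1 dev swp1 weight 1 \
    nexthop via 192.168.2.1 dev swp2 weight 1

# Inspect the resulting multipath route.
ip route show 10.0.0.0/24
```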
You are observing high latency in your GPU-accelerated inference service deployed on Kubernetes. You suspect that GPU resource contention might be the cause.
What steps can you take to diagnose and mitigate this issue within the Kubernetes environment? (Multiple Answers)
- A . Monitor GPU utilization metrics (e.g., GPU utilization, memory usage) using tools like ‘nvidia-smi’ or Prometheus and Grafana.
- B . Implement resource quotas to limit the GPU resources that each namespace can consume.
- C . Utilize MIG (Multi-Instance GPU) to partition GPUs and isolate workloads.
- D . Increase the number of replicas in the deployment to distribute the load across more GPUs.
- E . Restart the cluster periodically.
A,B,C
Explanation:
Monitoring GPU utilization helps identify resource contention. Resource quotas prevent one namespace from consuming excessive GPU resources. MIG allows partitioning GPUs for workload isolation. Increasing replicas (option D) might help, but it doesn’t address the underlying contention issue. Restarting the cluster is not a solution for ongoing contention.
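A hedged sketch of options A and B follows. The monitoring command uses standard ‘nvidia-smi’ query fields; the ResourceQuota manifest assumes the NVIDIA device plugin exposes GPUs as ‘nvidia.com/gpu’, and the namespace name is hypothetical.

```bash
# Option A: sample GPU utilization and memory once per second.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
    --format=csv -l 1

# Option B: cap the GPUs a namespace may request ('team-a' is hypothetical).
kubectl create -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "4"
EOF
```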
In an InfiniBand fabric, what is the primary role of the Subnet Manager (SM) with respect to routing?
- A . To forward packets based on destination IP addresses, similar to a traditional IP router.
- B . To discover the network topology, calculate routing paths, and program the forwarding tables (LID tables) in the switches.
- C . To monitor the network for congestion and dynamically adjust packet priorities using Quality of Service (QoS) mechanisms.
- D . To provide a command-line interface for users to manually configure routing tables on each InfiniBand switch.
- E . To act as a firewall, blocking unauthorized traffic based on pre-defined rules.
B
Explanation:
The Subnet Manager (SM) is responsible for discovering the InfiniBand topology, calculating routes, and programming the forwarding tables (LID tables) within the switches. This is crucial for establishing connectivity and ensuring efficient data transfer within the fabric. InfiniBand uses LID (Local Identifier) based routing, not IP addresses directly.
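Standard tools from the infiniband-diags package can confirm which SM is active and what it has programmed; a brief sketch:

```bash
# Query the active Subnet Manager (its LID, GUID, state, and priority).
sminfo

# Show the local HCA's port state and the LID the SM assigned to it.
ibstat

# Walk the fabric and list every link the SM has discovered.
iblinkinfo
```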
You encounter an error during MIG instance creation using ‘nvidia-smi’ stating ‘Insufficient GPU resources’.
Which of the following could be the cause? (Select all that apply)
- A . The requested MIG configuration exceeds the GPU’s available resources (e.g., compute or memory).
- B . The NVIDIA driver version is outdated and does not support the requested MIG configuration.
- C . The GPU is already fully utilized by other MIG instances or processes.
- D . The GPU is in a bad state and needs to be reset.
- E . There is no error; MIG always creates instances regardless of resources.
A,B,C
Explanation:
The ‘Insufficient GPU resources’ error indicates that the requested MIG instance creation cannot be fulfilled due to limitations in available resources (A), such as compute or memory. Outdated drivers (B) may not support the requested MIG configuration and can therefore fail to allocate the requested profiles. When other instances or processes already consume all available resources (C), the operation cannot continue. A GPU in a bad state might cause issues, but the specific error message points to resource exhaustion more directly. MIG does not bypass resource checks (E).
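A hedged example of working through this on an A100-class GPU: check which profiles still fit before requesting instances. Profile IDs vary by GPU model, so the IDs below are illustrative.

```bash
# Confirm MIG mode is enabled on GPU 0 (takes effect after a GPU reset).
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles and how many of each are still available.
sudo nvidia-smi mig -lgip

# Try to create two instances of profile 9 (illustrative ID) plus their
# default compute instances; this is the call that fails when the remaining
# slices cannot satisfy the request.
sudo nvidia-smi mig -cgi 9,9 -C
```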
You are tasked with upgrading the NVIDIA driver on a Kubernetes node hosting GPU-accelerated AI workloads.
To minimize downtime and ensure a smooth transition, which sequence of steps should you follow?
- A . Drain the node, upgrade the driver, reboot the node, and uncordon it.
- B . Upgrade the driver directly on the node, reboot the node, and let Kubernetes automatically reschedule the workloads.
- C . Cordon the node, upgrade the driver, reboot the node, and uncordon it.
- D . Delete all pods running on the node, upgrade the driver, reboot the node, and recreate the pods.
- E . Upgrade the NVIDIA container toolkit, then upgrade the driver, reboot the node, and uncordon it.
C
Explanation:
Cordoning the node prevents new pods from being scheduled on it. After upgrading the driver and rebooting, uncordoning the node allows Kubernetes to resume scheduling workloads. Draining the node before the upgrade (option A) can cause unnecessary disruption if pods are evicted before the upgrade even begins. The NVIDIA container toolkit must be compatible with the NVIDIA driver, but the correct upgrade sequence is still the one described in option C.
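A minimal sketch of the option C sequence, assuming a Debian-based node with the packaged driver; the node name and driver version are hypothetical:

```bash
# Step 1: cordon the node so no new pods are scheduled onto it.
kubectl cordon node-1

# Step 2: upgrade the driver (package name/version is illustrative).
sudo apt update && sudo apt install --only-upgrade nvidia-driver-535

# Step 3: reboot so the new kernel modules are loaded.
sudo reboot

# Step 4 (once the node is back up): allow scheduling again.
kubectl uncordon node-1
```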
Run GPU diagnostics.
Explanation:
Checking temperature is crucial first to avoid damaging the GPU if it’s overheating. Reseating addresses potential connectivity issues. Running diagnostics identifies hardware faults. Updating the driver should be done after hardware checks to ensure the card isn’t faulty.
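For the diagnostics step, a quick temperature query followed by a DCGM health check is a common approach; ‘dcgmi’ requires the datacenter-gpu-manager package to be installed.

```bash
# Check current temperature and power readings first.
nvidia-smi -q -d TEMPERATURE,POWER

# Run the medium-length DCGM diagnostic suite (-r 3 runs the extended one).
dcgmi diag -r 2
```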
A critical AI model training job consistently fails on a specific GPU server in your cluster after running for approximately 24 hours.
Monitoring data shows a sudden drop in GPU power consumption followed by a system reboot. All other GPUs on the server appear normal. The server has redundant PSUs.
What is the MOST likely cause?
- A . A software bug in the AI model causing a kernel panic specifically triggered after 24 hours of execution.
- B . Thermal runaway on the GPU due to a failing thermal interface material (TIM) between the GPU die and the heatsink.
- C . A transient power supply issue affecting only one of the redundant PSUs, triggering a system-wide protection mechanism.
- D . ECC memory errors accumulating over time, eventually leading to a non-recoverable system fault.
- E . A driver crash, causing the GPU to reset and the system to reboot.
B
Explanation:
Thermal runaway (B) is the most probable cause. The 24-hour delay suggests a gradual heat buildup. A failing TIM would cause the GPU to overheat until it triggers a thermal shutdown, resulting in the power drop and reboot. A PSU issue (C) is possible, but redundant PSUs should prevent a complete failure unless one PSU has died outright and the remaining PSU is momentarily overloaded by the full system load. The other options are less likely to cause this specific failure pattern.
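To catch this failure pattern before it recurs, temperature and power can be logged over the run with standard ‘nvidia-smi’ query fields; a hedged example:

```bash
# Log temperature, power, and SM clocks every 30 s; a steady temperature
# climb followed by clock throttling points at a TIM/heatsink problem.
nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw,clocks.sm \
    --format=csv -l 30 >> gpu_thermal_log.csv

# After the reboot, check the kernel log for Xid errors around the failure.
sudo dmesg | grep -i xid
```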
You need to uninstall all NVIDIA drivers and associated packages from a Linux system cleanly.
Which command sequence is the most reliable for achieving this after stopping the display manager (e.g., ‘sudo systemctl stop gdm3’)?
- A . ‘sudo apt purge nvidia-*’ (on Debian/Ubuntu-based systems)
- B . Running the ‘.run’ installer with the ‘--uninstall’ option (if the driver was installed this way)
- C . ‘sudo yum remove nvidia-*’ (on RHEL/CentOS-based systems)
- D . Deleting the /usr/lib/nvidia and /usr/share/nvidia directories.
- E . ‘sudo dnf remove nvidia-*’ (on Fedora-based systems)
A,B,C,E
Explanation:
The appropriate package manager command (‘apt purge’, ‘yum remove’, or ‘dnf remove’ with the ‘nvidia-*’ wildcard) is the most reliable way to remove driver packages. If the driver was installed using the ‘.run’ installer, running it with ‘--uninstall’ is also effective. Directly deleting directories is not recommended, as it may leave behind configuration files and dependencies. Always stop the display manager before uninstalling to avoid conflicts.
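A hedged end-to-end example for a Debian/Ubuntu system, run from a text console after stopping the display manager; the ‘.run’ filename is illustrative:

```bash
# Stop the display manager so the driver is no longer in use.
sudo systemctl stop gdm3

# Purge all NVIDIA driver packages and their configuration files.
sudo apt purge 'nvidia-*'
sudo apt autoremove

# If the driver came from the .run installer instead, uninstall it that way
# (the filename is illustrative).
sudo sh NVIDIA-Linux-x86_64-535.129.03.run --uninstall
```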