Practice Free NCP-AII Exam Online Questions
An InfiniBand server stops working, and a system administrator runs the "ibstat" command that provides the following output:
CA ‘mlx5_1’
CA type: MT4115
Number of ports: 2
Firmware version: 10.20.1010
Hardware version: 0
Node GUID: 0x0002c90300002f78
System image GUID: 0x0002c90300002f7b
Port 1:
State: Initializing
Physical state: Linkup
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x0251086a
Port GUID: 0x0002c90300002f79
Link layer: InfiniBand
What is the cause of the issue?
- A . The HCA port is faulty.
- B . There is no running SM in the fabric.
- C . The neighboring switch port is faulty.
- D . The cable is disconnected.
B
Explanation:
The ibstat command is a fundamental diagnostic tool in the NVIDIA InfiniBand stack used to query the status of local Host Channel Adapters (HCAs). In the provided output, the most critical data points are the Physical state, the State, and the SM lid.
The Physical state: Linkup confirms that the electrical or optical connection between the server’s HCA and the neighboring switch port is established and healthy at the physical layer. This immediately rules out a disconnected cable (Option D) or a completely dead hardware port (Options A and C). However, the State: Initializing indicates that while the "wires" are connected, the logical InfiniBand protocol has not finished its handshake.
In an InfiniBand fabric, the Subnet Manager (SM) is the centralized "brain" responsible for discovering nodes, assigning Local Identifiers (LIDs), and configuring routing tables. The output shows Base lid: 0 and SM lid: 0, which signifies that the port has not been assigned a LID and cannot find an active Subnet Manager to talk to. Without a running SM to transition the port from "Initializing" to "Active," no RDMA traffic can pass through the fabric. This scenario typically occurs if the SM service has crashed on the management node, or if the SM is disabled on the managed switches. Therefore, the root cause is the absence of an operational Subnet Manager in the fabric to complete the logical link initialization.
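The diagnostic logic described above can be sketched as a small parser. This is an illustrative helper, not part of any NVIDIA tooling: it classifies an `ibstat` port dump by the same three fields the explanation highlights (Physical state, State, SM lid). In practice, once "no SM" is confirmed, the fix is typically to start or restart the subnet manager (for example, the `opensm` service on a management node, or the SM on a managed switch).

```python
# Illustrative sketch (hypothetical helper, not an NVIDIA tool):
# classify an ibstat port dump to distinguish "no SM" from a
# physical-layer fault, mirroring the reasoning above.
def diagnose_ib_port(ibstat_text: str) -> str:
    fields = {}
    for line in ibstat_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()

    phys = fields.get("Physical state", "")
    state = fields.get("State", "")
    sm_lid = fields.get("SM lid", "")

    if phys not in ("LinkUp", "Linkup"):
        return "physical-layer problem (check cable and ports)"
    if state == "Initializing" and sm_lid == "0":
        return "no subnet manager reachable (start/verify the SM)"
    if state == "Active":
        return "port healthy"
    return "indeterminate"

sample = """\
Port 1:
State: Initializing
Physical state: Linkup
Rate: 100
Base lid: 0
SM lid: 0
"""
print(diagnose_ib_port(sample))
```

Run against the output in the question, this flags the missing subnet manager rather than a cabling or hardware fault.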
During the physical installation of an NVIDIA GPU, you accidentally touch the gold connector pins on the card.
What is the recommended course of action BEFORE inserting the GPU into the PCIe slot?
- A . Blow on the pins to remove any dust or debris.
- B . Wipe the pins with a dry cloth.
- C . Clean the pins with isopropyl alcohol and a lint-free swab, ensuring they are completely dry before installation.
- D . Use compressed air to clean the pins.
- E . It is okay to insert it directly as is.
C
Explanation:
Touching the pins can transfer oils or static electricity that can interfere with the electrical connection. Cleaning with isopropyl alcohol and a lint-free swab removes these contaminants and ensures a proper connection. Blowing, wiping with a dry cloth, or using compressed air may not be sufficient and could even introduce more contaminants.
After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL].
What is the most likely cause?
- A . The BCM license expired after HA configuration.
- B . Network connectivity issues between the primary and secondary head nodes.
- C . The secondary head node lacks NVIDIA GPU drivers.
- D . The cluster nodes are powered on during the HA configuration.
B
Explanation:
In a Bright Cluster Manager HA setup, the database (MySQL/MariaDB) must remain perfectly synchronized between the active and standby head nodes to allow for a seamless transition. This synchronization typically occurs over a dedicated management or heartbeat network. If cmsh status shows the database service as [FAIL] on the secondary node, it almost always points to a communication breakdown. Without a stable network path, the secondary node cannot receive the binary logs from the primary node to keep its local copy up to date. While licensing (Option A) is important, a license failure usually disables management capabilities entirely rather than just the MySQL sync. Furthermore, head nodes are management servers and do not require GPU drivers (Option C) for their primary function. Ensuring low-latency, reliable connectivity between the two head nodes is the primary troubleshooting step for resolving "MySQL FAIL" states in BCM.
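The first troubleshooting step above, confirming that the secondary head node can actually reach the primary over the management network, can be sketched with a minimal TCP reachability probe. The hostname and MySQL port below are illustrative placeholders; the real port and interface depend on the BCM configuration.

```python
# Minimal sketch: before digging into database state, verify the
# primary head node's MySQL port is reachable from the secondary.
# Host and port here are illustrative, not a BCM-prescribed check.
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example probe of a (hypothetical) primary head node:
# print(port_reachable("master1", 3306))
```

If the probe fails, the MySQL [FAIL] state is almost certainly a symptom of the network break rather than a database problem in itself.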
A user reports that their GPU-accelerated application is crashing with a CUDA error related to ‘out of memory’. You have confirmed that the GPU has sufficient physical memory.
What are the likely causes and troubleshooting steps?
- A . The application is leaking GPU memory. Use a memory profiling tool like ‘cuda-memcheck’ to identify the source of the leak.
- B . The application is requesting a larger block of memory than is available in a single allocation. Try breaking the allocation into smaller chunks or using managed memory.
- C . The CUDA driver version is incompatible with the CUDA runtime version used by the application. Update the CUDA driver to match the runtime version.
- D . The process has exceeded the maximum number of GPU contexts allowed. Reduce the number of concurrent CUDA applications running on the GPU.
- E . The system’s virtual memory is exhausted. Increase the swap space.
A,B
Explanation:
Memory leaks and single-allocation limits are common causes of ‘out of memory’ errors, even when sufficient physical memory exists. ‘cuda-memcheck’ (superseded by ‘compute-sanitizer’ in recent CUDA toolkits) is specifically designed to find memory errors in CUDA applications. While driver incompatibility is possible, leaks and allocation size limits are more frequent causes.
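The chunking strategy from option B can be illustrated with a toy size-splitting helper. This is only a sketch of the idea; real code would issue the individual allocations through the CUDA runtime (e.g., `cudaMalloc`) or a pool allocator, and the sizes below are arbitrary examples.

```python
# Toy illustration of option B: split one oversized request into
# chunks no larger than max_chunk. Real code would allocate each
# chunk via the CUDA runtime; this only computes the split.
def chunked_sizes(total_bytes: int, max_chunk: int) -> list[int]:
    """Split a request into chunk sizes no larger than max_chunk."""
    sizes = []
    remaining = total_bytes
    while remaining > 0:
        take = min(remaining, max_chunk)
        sizes.append(take)
        remaining -= take
    return sizes

# A 10 GiB request split into 2 GiB chunks -> five allocations
print(chunked_sizes(10 << 30, 2 << 30))
```

Several smaller allocations can succeed where one large one fails, because device memory fragments over the lifetime of a process even when the total free amount looks sufficient.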
You are configuring a Mellanox InfiniBand network for a DGX A100 cluster.
What is the RECOMMENDED subnet manager for a large, high-performance AI training environment, and why?
- A . OpenSM, because it’s the default and easiest to configure.
- B . UFM (Unified Fabric Manager), because it provides advanced management, monitoring, and optimization capabilities.
- C . IBA management tools that ship with the OS (e.g., ‘ibnetdiscover’).
- D . Any subnet manager; the performance difference is negligible.
- E . A custom-built subnet manager using the InfiniBand verbs API.
B
Explanation:
UFM is the recommended subnet manager for large AI training environments using DGX systems. It offers advanced features like real-time monitoring, congestion control, adaptive routing, and telemetry, which are crucial for maximizing performance and stability in demanding workloads. OpenSM lacks these advanced capabilities and is not suitable for large, performance-critical clusters.
While running a large AI training job, you observe the following output from ‘nvidia-smi’:
GPU 0: P2
GPU 1: P2
GPU 2: P2
GPU 3: P2
What does the ‘P2’ state indicate, and what steps should you take to investigate this further in the context of validating optimal hardware operation?
- A . P2 indicates the GPUs are in a high-performance state; no further investigation is needed.
- B . P2 indicates the GPUs are in a low-power idle state; investigate if the driver is correctly configured and the workload is properly utilizing the GPUs.
- C . P2 indicates a critical error state; immediately halt the training job and check the system logs for hardware failures.
- D . P2 indicates the GPU is running at maximum clock speed. Check for thermal throttling.
B
Explanation:
P2 typically indicates a power-saving state where the GPU is operating at reduced clock speeds. It is crucial to investigate whether the workload is demanding sufficient work from the GPUs, and whether power limits or other configuration settings are preventing the GPUs from reaching their maximum performance state. Review the ‘nvidia-smi -q’ output for power usage and clock speeds to verify proper operation. Other performance states (P0, P1) represent higher levels of performance, with P0 being the maximum.
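As a first-pass validation step, the P-state check described above can be automated with a small parser over the kind of simplified per-GPU listing shown in the question. This is an illustrative sketch, not NVIDIA tooling; real output from `nvidia-smi` is richer, and a production check would query the power and clock fields as well.

```python
# Illustrative check (not an NVIDIA tool): flag GPUs reporting a
# reduced performance state while a training job should have them
# busy. The sample mirrors the simplified listing in the question.
def gpus_below_p0(smi_lines: str) -> list[str]:
    flagged = []
    for line in smi_lines.splitlines():
        gpu, _, pstate = line.partition(":")
        pstate = pstate.strip()
        if pstate and pstate != "P0":
            flagged.append(gpu.strip())
    return flagged

sample = """GPU 0: P2
GPU 1: P2
GPU 2: P2
GPU 3: P2"""
print(gpus_below_p0(sample))
```

All four GPUs are flagged here, which in the scenario above should prompt a look at driver configuration, power limits, and whether the workload is actually feeding the GPUs.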
You are observing that the memory bandwidth achieved by your CUDA application on an NVIDIA A100 GPU is significantly lower than the theoretical peak bandwidth.
Which of the following could be potential causes for this, and what actions can you take to validate or mitigate them? (Select all that apply)
- A . The application is using uncoalesced memory access patterns. Refactor the code to ensure contiguous memory access by threads within a warp.
- B . The application is using a small transfer size per kernel launch. Increase the amount of data processed per kernel launch to amortize the overhead of kernel launch and data transfer.
- C . The GPU is being limited by power capping. Increase the power limit using ‘nvidia-smi -pl’ (if permitted) to allow the GPU to operate at higher clock speeds.
- D . The application is using single precision floating-point operations. Switch to double precision to increase memory bandwidth utilization.
- E . The system memory is fully occupied. Deallocate some memory.
A,B,C
Explanation:
Uncoalesced memory access, small transfer sizes, and power capping can all limit achieved memory bandwidth. Refactoring for coalesced access (A) and batching more work per kernel launch (B) address the software side, while raising the power limit (C) can remove a hardware throttle. Switching to double precision (D) doubles the bytes moved per element but does not improve bandwidth utilization. Host memory occupancy (E) is generally unrelated to on-device memory bandwidth. Therefore, the answer is A, B, and C.
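The cost of uncoalesced access (option A) can be shown with a back-of-the-envelope model: count how many memory transactions a 32-thread warp needs when its threads read 4-byte elements contiguously versus with a large stride. The 128-byte transaction size below is representative of NVIDIA GPUs, but the numbers are a simplification, not a hardware-accurate simulation.

```python
# Back-of-the-envelope model of option A: count the 128-byte memory
# transactions a 32-thread warp needs for coalesced vs. strided
# access to 4-byte elements. Constants are representative, not exact.
WARP, ELEM, LINE = 32, 4, 128

def transactions(stride_elems: int) -> int:
    """Distinct 128-byte segments touched by one warp's loads."""
    addrs = {t * stride_elems * ELEM for t in range(WARP)}
    return len({a // LINE for a in addrs})

print(transactions(1))   # coalesced: one transaction serves the warp
print(transactions(32))  # strided: every thread needs its own transaction
```

With stride 1 the warp's 128 bytes fit in a single transaction; with stride 32 the same 128 bytes of useful data cost 32 transactions, which is exactly the kind of bandwidth loss the refactoring in option A eliminates.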
You are setting up a BlueField-2 SmartNIC and want to offload network functions.
Which of the following are valid methods for enabling hardware offload capabilities?
- A . Using the ‘ethtool’ command to enable specific offload features like checksum offload, TCP segmentation offload (TSO), and UDP fragmentation offload (UFO).
- B . Modifying the device tree to enable specific hardware features.
- C . Installing and configuring the appropriate Mellanox OFED drivers, which automatically enable many hardware offload features.
- D . Running a custom script that programs the hardware offload engines directly.
- E . Recompiling the Linux Kernel with the correct compilation flags.
A,C
Explanation:
The ‘ethtool’ command is used to configure various network interface settings, including enabling or disabling hardware offload features. Installing the correct Mellanox OFED drivers is crucial, as they provide the necessary modules and tools to utilize the hardware offload capabilities. While device tree modification can influence hardware behavior, it is less common and typically handled by driver configuration. A custom script directly programming the offload engines is unrealistic, and recompiling the kernel is rarely necessary with default settings.
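To verify which offloads are actually active after driver installation, one can parse `ethtool -k <iface>`-style output. The helper below is an illustrative sketch over a hypothetical excerpt of that output, not a tool shipped with OFED or ethtool.

```python
# Illustrative parser (hypothetical helper) for `ethtool -k`-style
# output: list the offload features currently reported as enabled.
def enabled_offloads(ethtool_output: str) -> list[str]:
    enabled = []
    for line in ethtool_output.splitlines():
        name, _, value = line.partition(":")
        # ethtool prints "on", "off", or "on [fixed]" etc.
        if value.strip().startswith("on"):
            enabled.append(name.strip())
    return enabled

sample = """rx-checksumming: on
tx-checksumming: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off"""
print(enabled_offloads(sample))
```

A quick check like this confirms that the offloads named in option A (checksum, TSO, UFO) are in the expected state before blaming the hardware path.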
You are using NVIDIA Spectrum-X switches in your AI infrastructure. You observe high latency between two GPU servers during a large distributed training job. After analyzing the switch telemetry, you suspect a suboptimal routing path is contributing to the problem.
Which of the following methods offers the MOST granular control for influencing traffic flow within the Spectrum-X fabric to mitigate this?
- A . Adjust the Equal-Cost Multi-Path (ECMP) hashing algorithm globally on all switches.
- B . Configure QoS (Quality of Service) policies to prioritize traffic from the high-latency GPU servers.
- C . Implement Adaptive Routing (AR) or Dynamic Load Balancing (DLB) features available on the Spectrum-X switches to dynamically adjust paths based on network conditions.
- D . Manually configure static routes on the Spectrum-X switches to force traffic between the GPU servers along a specific path.
- E . Disable IPv6 to simplify routing decisions.
C
Explanation:
Adaptive Routing (AR) and Dynamic Load Balancing (DLB) are features specifically designed to adjust paths dynamically based on real-time network conditions in Spectrum-X. This provides the most granular and automated way to respond to congestion and optimize traffic flow compared to static routing or global ECMP adjustments. QoS prioritizes traffic but does not change the chosen path.
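Why a global hash tweak (option A) is coarse can be seen from a toy ECMP model: each flow's path is fixed by a hash of its 5-tuple, so two heavy flows that happen to hash onto the same equal-cost path stay pinned there until the hash or topology changes. The hash function and field layout below are illustrative, not how any particular switch ASIC computes it (UDP port 4791 is the standard RoCEv2 port).

```python
# Toy ECMP model: a flow's path is a static function of its 5-tuple,
# so congestion on one path persists until the hash or topology
# changes. Adaptive routing instead reselects paths dynamically.
import zlib

def ecmp_path(src: str, dst: str, sport: int, dport: int,
              proto: int, num_paths: int) -> int:
    """Pick one of num_paths equal-cost paths from the flow 5-tuple."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return zlib.crc32(key) % num_paths

# Two heavy flows across 4 equal-cost paths; if they collide on one
# path, a static hash keeps them there for the life of the flows.
p1 = ecmp_path("10.0.0.1", "10.0.1.1", 50000, 4791, 17, 4)
p2 = ecmp_path("10.0.0.2", "10.0.1.2", 50001, 4791, 17, 4)
print(p1, p2)
```

Changing the hash algorithm globally just reshuffles which flows collide; AR/DLB avoid the problem by moving traffic off congested paths as conditions change.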
