Practice Free NCP-AII Exam Online Questions
You are installing the NGC CLI using ‘pip’ behind a corporate proxy. The installation fails due to connection errors.
How do you configure ‘pip’ to use the proxy during the NGC CLI installation?
- A . Set the ‘http_proxy’ and ‘https_proxy’ environment variables with the proxy server address and port.
- B . Use the ‘--proxy’ option with the ‘pip install’ command, specifying the proxy server address and port.
- C . Create a ‘pip.conf’ file in the appropriate directory (e.g., ‘~/.pip/pip.conf’) and configure the proxy settings within the file.
- D . Download the NGC CLI package manually and install it offline.
- E . NGC CLI does not support the use of proxies
A
Explanation:
You can configure ‘pip’ to use a proxy by setting environment variables (A), using the ‘--proxy’ option with the ‘pip install’ command (B), or creating a ‘pip.conf’ file (C). Downloading the package manually (D) is a workaround for the initial installation but doesn’t address ongoing communication with NGC.
Option E is incorrect since NGC CLI does support proxy configuration.
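All three working methods can be expressed concretely. This is an illustrative sketch: the proxy address ‘proxy.example.com:8080’ is a placeholder, and substitute the actual NGC CLI package name where indicated.

```shell
# Option A: environment variables, picked up by pip automatically
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080

# Option B: per-invocation flag (replace <ngc-cli-package> with the real name)
pip install --proxy http://proxy.example.com:8080 <ngc-cli-package>

# Option C: persistent pip configuration
mkdir -p ~/.pip
cat <<'EOF' > ~/.pip/pip.conf
[global]
proxy = http://proxy.example.com:8080
EOF
```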
Consider the following ‘lspci’ output snippet after installing an NVIDIA GPU:
03:00.0 VGA compatible controller:
NVIDIA Corporation Device 2236 (rev a1)
Subsystem: Dell Device 1234
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
What does this output indicate, and what is the POTENTIAL issue?
- A . The GPU is correctly installed, and the proprietary NVIDIA driver is loaded.
- B . The GPU is installed, but the open-source ‘nouveau’ driver is loaded instead of the proprietary NVIDIA driver, which could lead to suboptimal performance.
- C . Solution: Blacklist ‘nouveau’ and ensure the NVIDIA driver is correctly configured.
- D . The GPU is not properly installed, as indicated by ‘Device 2236’. Solution: Re-seat the GPU.
- E . The GPU is installed, but the ‘nvidiafb’ driver is causing conflicts. Solution: Remove the ‘nvidiafb’ module.
- F . The output indicates a hardware failure and requires a replacement of the GPU.
AC
Explanation:
The line ‘Kernel driver in use: nvidia’ shows that the proprietary NVIDIA driver is currently bound to the GPU, so the card is correctly installed (A). The ‘Kernel modules:’ line, by contrast, only lists the modules available for this device, and it includes ‘nouveau’, the open-source driver. If ‘nouveau’ loads first on a subsequent boot it can prevent the proprietary NVIDIA driver from binding, leading to suboptimal performance. Blacklisting ‘nouveau’ (C) ensures that only the NVIDIA driver is ever used.
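On a Debian/Ubuntu-style system, blacklisting ‘nouveau’ is typically done as sketched below; the file path and the initramfs rebuild command may differ on other distributions:

```shell
# Tell modprobe never to load the nouveau module
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Rebuild the initramfs so the blacklist applies early in boot, then reboot
sudo update-initramfs -u
sudo reboot
```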
You’re configuring a RoCEv2 network for your AI infrastructure.
Which UDP port number range is commonly used for RoCEv2 traffic, and why is it important to be aware of this?
- A . 0-1023, because these are well-known ports.
- B . 4791, the UDP destination port assigned by IANA for RoCEv2.
- C . 49152-65535, the dynamic/private port range, to avoid conflicts with other services.
- D . 1024-49151, the registered port range, for general application use.
- E . Any UDP port number can be used without issue.
B
Explanation:
RoCEv2 encapsulates the InfiniBand transport in UDP and uses the IANA-assigned destination port 4791; the UDP source port is varied per flow to provide entropy for ECMP load balancing. Knowing this port is crucial for configuring firewalls, ACLs, and QoS policies so that RoCEv2 traffic can flow freely.
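As a sketch, a host firewall rule permitting RoCEv2 traffic on its assigned destination port might look like this; interface bindings and broader policy are deployment-specific:

```shell
# Allow inbound RoCEv2 traffic (UDP destination port 4791)
sudo iptables -A INPUT -p udp --dport 4791 -j ACCEPT
```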
After replacing a faulty NVIDIA GPU, the system boots, and ‘nvidia-smi’ detects the new card. However, when you run a CUDA program, it fails with the error "‘no CUDA-capable device is detected’". You’ve confirmed the correct drivers are installed and the GPU is properly seated.
What’s the most probable cause of this issue?
- A . The new GPU is incompatible with the existing system BIOS.
- B . The CUDA toolkit is not properly configured to use the new GPU.
- C . The ‘LD_LIBRARY_PATH’ environment variable is not set correctly.
- D . The user running the CUDA program does not have the necessary permissions to access the GPU.
- E . The GPU is not properly initialized by the system due to a missing or incorrect ACPI configuration.
E
Explanation:
The error "no CUDA-capable device is detected", even when ‘nvidia-smi’ sees the GPIJ, points to a lower-level system issue that prevents CUDA from accessing the card. In such scenarios, ACPI (Advanced Configuration and Power Interface) misconfiguration is frequently the culprit. ACPI handles device initialization and power management. If ACPI doesn’t properly configure the new GPU, CUDA programs won’t be able to access it. Checking and correcting ACPI configuration would be the first line of action, which includes ensuring proper settings in the system BIOS/IJEFI related to PCI devices, especially those related to GPU/accelerators. LD LIBRARY PATH would affect runtime linking of CUDA libraries, but not the base device detection. User permissions are less likely to be the cause since ‘nvidia-smr works.
You are tasked with creating a custom Docker image for a deep learning application that requires a specific version of cuDNN. You want to minimize the image size while ensuring that the cuDNN libraries are correctly installed and configured.
What is the most efficient way to achieve this?
- A . Download the cuDNN archive from NVIDIA, extract the libraries, and manually copy them to the appropriate locations within the Dockerfile.
- B . Use a multi-stage Docker build, using a base image with the desired CUDA version for building and then copying only the necessary cuDNN libraries to a smaller runtime image.
- C . Install the entire CUDA toolkit within the Docker image, even if only cuDNN is needed.
- D . Use the NVIDIA Container Toolkit to dynamically inject the cuDNN libraries into the container at runtime.
- E . Use a pre-built CUDA base image and install cuDNN during the container run.
B
Explanation:
A multi-stage Docker build (B) is the most efficient approach. It allows you to use a larger image with the CUDA toolkit for building and then copy only the necessary cuDNN libraries to a smaller runtime image, minimizing the final image size. Manually copying libraries (A) is tedious and error-prone. Installing the entire CUDA toolkit (C) unnecessarily increases the image size. The NVIDIA Container Toolkit (D) focuses on enabling GPU access, not dynamically injecting specific libraries. Running an install of cuDNN during the container run is problematic since the image should be self-contained.
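A minimal sketch of the multi-stage pattern; the image tags, library paths, and application paths are illustrative and should be matched to the CUDA/cuDNN versions actually required:

```dockerfile
# Build stage: full CUDA + cuDNN development image (tag is illustrative)
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04 AS build
WORKDIR /app
# ... compile the application here ...

# Runtime stage: slim base image without the full toolkit
FROM nvidia/cuda:12.2.0-base-ubuntu22.04
# Copy only the cuDNN runtime libraries the application needs
COPY --from=build /usr/lib/x86_64-linux-gnu/libcudnn*.so* /usr/lib/x86_64-linux-gnu/
COPY --from=build /app /app
CMD ["/app/run"]
```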
You are configuring a server with NVIDIA GPUs for optimal power efficiency. You want to leverage NVIDIA’s power management features to minimize energy consumption during idle periods.
Which of the following actions would be the MOST effective in achieving this goal, without significantly impacting performance during active workloads?
- A . Reduce the GPU’s clock speeds to the lowest possible setting, regardless of workload.
- B . Enable NVIDIA’s Adaptive Clocking and Power Limiting features, allowing the GPU to dynamically adjust its clock speeds and power consumption based on the workload.
- C . Disable all GPU power management features to ensure maximum performance at all times.
- D . Remove one or more GPUs from the server to reduce overall power consumption.
- E . Set a very low static power limit for the GPUs, significantly restricting their performance even during active workloads.
B
Explanation:
Enabling NVIDIA’s Adaptive Clocking and Power Limiting features is the MOST effective approach. These features allow the GPU to dynamically adjust its clock speeds and power consumption based on the workload, minimizing energy consumption during idle periods while maximizing performance during active workloads. Setting a fixed low clock speed (A) or power limit (E) would severely impact performance. Disabling power management (C) wastes energy. Removing GPUs (D) reduces performance capacity.
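Operationally, the related controls are exposed through ‘nvidia-smi’. A sketch follows; the 250 W value is illustrative, so always query the supported range for your GPUs first:

```shell
# Enable persistence mode so the driver stays loaded and idle GPUs
# can settle into low-power states quickly
sudo nvidia-smi -pm 1

# Query the supported power limit range for the installed GPUs
nvidia-smi -q -d POWER

# Optionally cap power draw within the supported range (250 W is illustrative)
sudo nvidia-smi -pl 250
```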
Which of the following is the MOST important reason for using a dedicated storage network (e.g., InfiniBand or RoCE) for AI/ML workloads compared to using the existing Ethernet network?
- A . Improved security due to network isolation.
- B . Lower latency and higher bandwidth for data transfer.
- C . Simplified network management and configuration.
- D . Reduced cost compared to upgrading the existing Ethernet infrastructure.
- E . Automatic Quality of Service (QoS) prioritization for AI/ML traffic.
B
Explanation:
The primary benefit of a dedicated storage network like InfiniBand or RoCE is the significant improvement in latency and bandwidth (option B) compared to Ethernet. These technologies are designed for high-performance computing and can handle the intense I/O demands of AI/ML workloads. While security (A) can be improved through isolation, and QoS (E) is possible, the performance advantage is the most crucial factor. Cost (D) is generally higher, and management (C) can be more complex.
A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks.
Which feature should be implemented on the Ethernet switches?
- A . Implement redundant switches with spanning tree protocol.
- B . MLAG for bonded interfaces across redundant switches.
- C . Use only one switch for all management and storage traffic.
- D . Disable VLANs and use unmanaged switches.
B
Explanation:
For the "North-South" and "Management/Storage" Ethernet fabrics in an NVIDIA AI Factory, high availability is paramount. Unlike the InfiniBand compute fabric, which uses its own routing logic, the Ethernet side relies on standard data center protocols. To provide true hardware redundancy and double the available bandwidth (Load Balancing), NVIDIA recommends MLAG (Multi-Chassis Link Aggregation). MLAG allows two physical switches to appear as a single logical unit to the DGX nodes. The DGX can then bond its two Ethernet NICs (e.g., in an 802.3ad LACP bond) and connect one cable to each switch. This configuration provides several benefits: if one switch fails, the traffic seamlessly stays on the other link without the slow convergence times associated with Spanning Tree Protocol (Option A). Furthermore, it allows the cluster to utilize the combined bandwidth of both links for heavy storage traffic (like NFS or S3 ingestion). Using a single switch (Option C) or unmanaged hardware (Option D) creates single points of failure and lacks the traffic isolation (VLANs) required for secure AI infrastructure.
Consider a scenario where you need to isolate GPU workloads in a multi-tenant Kubernetes cluster.
Which of the following Kubernetes constructs would be MOST suitable for achieving strong isolation at both the resource and network level?
- A . Using namespaces with resource quotas and network policies.
- B . Using labels and selectors to schedule workloads on specific GPU nodes.
- C . Using taints and tolerations to dedicate GPU nodes to specific workloads.
- D . Using pod affinity and anti-affinity rules to control pod placement.
- E . Using node affinity only.
A
Explanation:
Namespaces provide logical isolation within a Kubernetes cluster. Resource quotas limit the resources (including GPUs) that a namespace can consume, while network policies control network traffic between namespaces, ensuring strong isolation. Options B, C, D, and E provide some level of control over pod placement but do not offer the same level of resource and network isolation as namespaces with resource quotas and network policies.
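A minimal sketch of the combination; the namespace name, GPU limit, and policy scope are illustrative:

```yaml
# Cap the GPUs a tenant namespace may request
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: tenant-a
spec:
  hard:
    requests.nvidia.com/gpu: "4"
---
# Default-deny all traffic in and out of the namespace's pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
```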
