Pre-Summer Sale - 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: dm70dm

NCP-AII NVIDIA AI Infrastructure Questions and Answers

Questions 4

During BCM cluster setup, an engineer must configure bonded network interfaces on DGX nodes for high availability. Which cmsh command sequence properly configures a bond0 interface with two physical NICs?

Options:

A.

device use dgx001 ; interfaces add vlan vlan100 ; set parent bond0 ; set mode 1 ; set network internalnet

B.

device use dgx001 ; interfaces add bond bond0 ; append interfaces enp225s0f1np1 enp97s0f1np1 ; set mode 1 ; set network internalnet

C.

device use dgx001 ; interfaces set enp225s0f1np1 network internalnet ; interfaces set enp97s0f1np1 network internalnet

D.

device use dgx001 ; interfaces delete enp225s0f1np1 ; interfaces delete enp97s0f1np1

Buy Now
Questions 5

An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?

Options:

A.

nvsm show power

B.

nvsm show powermode

C.

nvsm show health

D.

nvsm show alerts

Buy Now
Questions 6

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

Options:

A.

Run a deep learning workload to stress test the GPUs and check whether the issue persists.

B.

Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.

C.

Power drain then restart the DGX and check if the performance degradation resolves.

D.

Increase the fan speed to maximum and check whether the performance improves.

Buy Now
Questions 7

After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?

Options:

A.

The BCM license expired after HA configuration.

B.

Network connectivity issues between the primary and secondary head nodes.

C.

The secondary head node lacks NVIDIA GPU drivers.

D.

The cluster nodes are powered on during the HA configuration.

Buy Now
Questions 8

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

Options:

A.

Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

B.

Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

C.

Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

D.

A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.

Buy Now
Questions 9

A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?

Options:

A.

Update the BMC and skip the GPU tray and motherboard tray updates if the system appears healthy.

B.

Perform a cold reset, stop all GPU activity, update and reboot the BMC, update motherboard and tray components, and verify completion.

C.

Update the GPU tray first, then the motherboard tray, and reboot the BMC after all updates are complete.

D.

Stop all GPU activity, update and reboot the BMC, update motherboard and tray components, perform a cold reset, and verify completion.

Buy Now
Questions 10

An administrator needs to manually deploy the BlueField image on a target DPU. The administrator downloads the new image file and needs to flash it to the hardware. Which command should the administrator use?

Options:

A.

/opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl

B.

apt install doca-runtime

C.

dd if=/root/bf.image of=/dev/bf bs=4096k

D.

bfb-install --rshim

Buy Now
Questions 11

A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?

Options:

A.

ngc config get

B.

ngc init

C.

ngc config set

D.

ngc config update

Buy Now
Questions 12

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

Options:

A.

Reduction of problem size (N) to accelerate computation.

B.

MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.

C.

Doubling of GPU clock speeds through firmware updates and relevant configuration.

D.

Automatic NVLink bandwidth doubling via driver updates.

Buy Now
Questions 13

You are evaluating the integration of NVIDIA BlueField DPUs into your data center ' s storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?

Options:

A.

Unlimited scalability by adding more DPUs without architectural changes.

B.

Elimination of latency issues in data processing tasks.

C.

Reduced CPU load by offloading data processing tasks to DPUs.

D.

Enhanced I/O performance with NVMe storage access speeds.

Buy Now
Questions 14

When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?

Options:

A.

export HPL_OOC_SAFE_SIZE=4.0

B.

export HPL_OOC_MODE=0

C.

export HPL_OOC_NUM_STREAMS=8

D.

export HPL_OOC_MAX_GPU_MEM=90

Buy Now
Questions 15

A system engineer needs to set the vGPU scheduling behavior for all GPUs to share the scheduling equally with the default time slice length. What command should be used?

Options:

A.

esxcli system module parameters set -m nvidia -p " NVreg_RegistryDwords=RmPVMRL=0x01 "

B.

esxcli graphics module parameters set -m nvidia -p " NVreg_RegistryDwords=RmPVMRL=0x01 "

C.

esxcli system module parameters set -m nvidia -p " NVreg_RegistryDwords=FRL=0x01 "

D.

esxcli system module parameters set -m nvidia -p " NVreg_RegistryDwords=RmPVMRL=0x00 "

Buy Now
Questions 16

An administrator needs to add additional GPUs to an existing server. What are the server requirements to check before installing new GPUs?

Options:

A.

Sufficient networking, water-cooled racks, adequate rack power, sufficient storage, and rack space.

B.

Sufficient storage, sufficient networking, adequate rack power, and compatible hardware.

C.

Sufficient CPU capacity, PCIe slot allocation, sufficient cooling in the data center, and rack space.

D.

Sufficient cooling in the data center, adequate rack power, compatible hardware, and PCIe slot allocation.

Buy Now
Questions 17

Which function is used to collect the cluster counters information?

Options:

A.

SM

B.

PM

C.

GM

D.

FM

Buy Now
Questions 18

During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after 48 hours. What are the two most likely causes and diagnostic steps?

Pick the 2 correct responses below.

Options:

A.

MPI configuration error; rerun with --cpu-affinity adjustments.

B.

Network packet loss; analyze ibdiagnet reports.

C.

Thermal throttling due to cooling issues; check nvidia-smi dmon.

D.

Memory corruption; reboot the node and reduce problem size N.

Buy Now
Questions 19

After configuring NGC CLI with ngc config set, a user receives ”Authentication failed” errors when pulling containers. What step was most likely omitted?

Options:

A.

Installing the CLI with apt-get instead of manual extraction.

B.

Entering the API key during ngc config set or storing it in ~/.ngc/config.

C.

Setting --format_type=json to enable API interactions.

D.

Running sudo systemctl restart docker after configuration.

Buy Now
Questions 20

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

Options:

A.

The command output is ignored if the system powers on without errors.

B.

At least half of the GPUs report Status_Health = OK.

C.

All GPUs report Status_Health = OK and Health = OK for each device.

D.

Only the head node ' s GPUs need to be healthy.

Buy Now
Questions 21

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

Options:

A.

Inconclusive; rerun with point-to-point tests.

B.

Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.

C.

Critical failure; bus bandwidth exceeds hardware capabilities.

D.

Suboptimal performance; algorithm bandwidth should match bus bandwidth.

Buy Now
Questions 22

An engineer is tasked with configuring Out-of-Band management for a DGX BasePOD deployment. Which network design will best ensure secure and reliable Out-of-Band management operations?

Options:

A.

Use a single VLAN for both Out-of-Band management and compute fabric to simplify network design.

B.

Configure Out-of-Band management interfaces to be accessible from any subnet within the data center for maximum flexibility.

C.

Connect Out-of-Band management ports to the same switch as user traffic for easier troubleshooting.

D.

Place all BMC and management interfaces on an isolated Out-of-Band network with access restricted by firewall rules.

Buy Now
Questions 23

An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?

Options:

A.

Lane power variance < 3dB across all transceivers.

B.

Transceiver model matching QSFP-DD specifications.

C.

Temperature fluctuations > 5°C during validation.

D.

Effective BER > 1.5E-254 during a < 6-hour monitoring window.

Buy Now
Questions 24

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize to develop more accurate energy-efficiency metrics?

Options:

A.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

B.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Use watts-used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

Buy Now
Questions 25

An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?

Options:

A.

nvidia-smi -q | grep " GPU Stress Test "

B.

sudo nvsm stress-test --force

C.

stress --cpu $(nproc) --io $(nproc) --timeout 600

D.

./gpu_burn 60

Buy Now
Questions 26

A user encounters " permission denied " errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?

Options:

A.

Enroll the MOK and sign NVIDIA kernel modules.

B.

Reinstall Docker without the NVIDIA runtime.

C.

Disable SELinux to relax unnecessary security policies.

D.

Run Docker with sudo for elevated privileges.

Buy Now
Questions 27

A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?

Options:

A.

esxcli system module parameters set -m nvidia -p

B.

esxcli -i 0 -mig 18

C.

nvidia-smi -i 0 -mig 1

D.

mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1 =2

Buy Now
Questions 28

After initial setup and health checks, the DGX H100 system administrator wants to verify that containers can access GPUs before running production workloads. Which method is recommended for this validation?

Options:

A.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 systemctl

B.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 ls -la

C.

sudo docker run --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

D.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

Buy Now
Questions 29

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?

Options:

A.

Disable Secure Boot permanently in BIOS/UEFI settings.

B.

Delete /etc/X11/xorg.conf to force driver reconfiguration.

C.

Enroll the Machine Owner Key (MOK) during system reboot and enter the recorded password.

D.

Reinstall drivers using apt-get install nvidia-driver-550 without rebooting.

Buy Now
Questions 30

An engineer needs to completely remove NVIDIA GPU drivers from an Ubuntu 22.04 system to troubleshoot conflicts. Which command sequence ensures all driver components are purged?

Options:

A.

sudo ubuntu-drivers uninstall

B.

sudo rm -rf /usr/lib/nvidia

C.

sudo apt-get remove nvidia-driver-550

D.

sudo apt-get purge nvidia-* & & sudo apt-get autoremove

Buy Now
Questions 31

A system administrator needs to validate a GPU-based server and ensure that no errors occur under load. What command should be used?

Options:

A.

nvsm dump health

B.

stress-test --usage

C.

nvsm show health

D.

nvsm stress-test

Buy Now
Questions 32

A healthcare organization is deploying an AI system to analyze patient data for predictive diagnostics. The system must comply with strict data protection regulations such as HIPAA, ensuring that sensitive information remains confidential and secure. Considering the need for robust security measures, which combination of strategies should the organization prioritize to protect against data breaches and ensure regulatory compliance?

Options:

A.

Deploy data masking to obscure sensitive data during processing and use role-based access control (RBAC) to limit data access based on user roles.

B.

Use tokenization to replace sensitive data with non-sensitive tokens and employ multi-factor authentication (MFA) for system access.

C.

Implement symmetric encryption for all data at rest and rely solely on password-based access controls.

D.

Rely on asymmetric encryption for all communications and use data deduplication to minimize storage costs without additional security measures.

Buy Now
Questions 33

Your company is planning to expand its AI capabilities significantly over the next five years. To future-proof your storage infrastructure, you need a solution that can scale in both capacity and performance. Which of the following strategies best ensures that your storage infrastructure remains adaptable to future AI demands?

Options:

A.

Deploy an all-flash array and remove data tiering to reduce latency.

B.

Implement single-tier cloud storage solution to leverage cloud scalability.

C.

Use a hybrid cloud model combining scalable cloud resources with on-premises infrastructure.

D.

Implement on-premises block storage system with periodic hardware upgrades.

Buy Now
Questions 34

You must validate all physical cabling as part of the network bring-up phase in a new NVIDIA GPU cluster deployment. The design requires you to confirm that each cable matches the intended topology, all links are functional, and future troubleshooting and scalability are supported. Which two steps are essential to an effective recommended cabling validation process during cluster deployment?

Pick the 2 correct responses below.

Options:

A.

Focus on validating the highest bandwidth links first, deferring non-critical cable mislabeling until after initial workloads are deployed and tested.

B.

Run link tests only after the entire network is built and powered on to avoid redundant troubleshooting during bring-up.

C.

Run the cable validation process incrementally during deployment, section by section, to catch and resolve errors as early as possible.

D.

Compare every cable’s physical connection to the planned topology diagram and validate correct ports and link paths.

Buy Now
Questions 35

A system administrator needs to install a GPU/DPU in a server. The server has a free PCI-e slot, there are enough free PCI-e lanes, and there is enough room for the card. Which procedure should be followed?

Options:

A.

Ensure the server has enough power. Verify compatibility of cables with server ' s platform. Make sure the server is down to remove cables safely. Do not wear an ESD bracelet.

B.

Ensure the server has enough power. Make sure the server is down to remove cables safely. Wear an ESD bracelet.

C.

Ensure the server has enough power. Make sure the server is up and running with attached cables. Wear an ESD bracelet.

D.

Ensure the server has enough power. Verify compatibility of cables with server ' s platform. Make sure the server is down to remove cables safely. Wear an ESD bracelet.

Buy Now
Questions 36

What command is needed to measure BER (Bit Error Rate)?

Options:

A.

mlxconfig -d < device > q

B.

ethtool -S < device >

C.

mlxlink -d < device > -c -e

D.

mstflint -d < device > q full

Buy Now
Exam Code: NCP-AII
Exam Name: NVIDIA AI Infrastructure
Last Update: May 30, 2026
Questions: 123

PDF + Testing Engine

$49.5  $164.99

Testing Engine

$37.5  $124.99
buy now NCP-AII testing engine

PDF (Q&A)

$31.5  $104.99
buy now NCP-AII pdf
dumpsmate guaranteed to pass

24/7 Customer Support

DumpsMate's team of experts is always available to respond your queries on exam preparation. Get professional answers on any topic of the certification syllabus. Our experts will thoroughly satisfy you.

Site Secure

mcafee secure

TESTED 30 May 2026