NVIDIA Nsight Systems provides system-wide performance analysis for GPU applications, showing CPU activity, GPU kernels, memory transfers, and API calls on a unified timeline. Understanding GPU performance requires visibility into both the CPU-side code that launches kernels and the GPU-side execution that processes data—Nsight Systems captures both simultaneously.
The timeline view reveals how CPU and GPU activities overlap and interact, exposing idle time, serialization bottlenecks, and opportunities for better concurrency. You can see exactly when kernels launch, how long memory transfers take, and where the CPU waits for GPU completion. This holistic view is essential for optimizing the overall application, not just individual kernels.
Nsight Systems works with any CUDA application without requiring source code modifications, though adding NVTX annotations improves the clarity of timeline visualizations. It supports both command-line profiling for automated workflows and GUI visualization for interactive analysis.
Command-line profiling captures a trace that can be analyzed later:
$ nsys profile ./myprogram
$ nsys profile -o report ./myprogram # custom output name
$ nsys profile --stats=true ./myprogram # print summary stats
# Generate multiple output formats
$ nsys profile -o report --export=sqlite,text ./myprogram
Common options:
-o, --output=NAME # output file name (without extension)
-t, --trace=TRACE # what to trace: cuda,nvtx,osrt,cublas,cudnn
--stats=true # print summary statistics
--force-overwrite=true # overwrite existing report
-w, --show-output=true # show application output
--sample=cpu # CPU sampling
--cudabacktrace=true # CUDA API backtraces
Trace specific APIs:
$ nsys profile -t cuda,nvtx ./myprogram # CUDA + NVTX markers
$ nsys profile -t cuda,osrt ./myprogram # CUDA + OS runtime
$ nsys profile -t cuda,cublas ./myprogram # CUDA + cuBLAS
Open the report in the GUI for interactive timeline exploration:
$ nsys-ui report.nsys-rep # open in GUI
$ nsys stats report.nsys-rep # command-line statistics
Export to different formats for custom analysis:
$ nsys export -t sqlite report.nsys-rep # SQLite database
$ nsys export -t text report.nsys-rep # text summary
The GUI timeline shows:
- CPU thread activity and call stacks
- CUDA API calls (cudaMalloc, cudaMemcpy, kernel launches)
- GPU kernel executions with duration
- Memory transfers between host and device
- NVTX ranges and markers
NVTX (NVIDIA Tools Extension) lets you add custom markers and ranges to your code, making the timeline easier to understand. Ranges show up as colored bars in the timeline, helping you correlate application phases with GPU activity.
#include <nvtx3/nvToolsExt.h>
void myFunction() {
nvtxRangePush("myFunction");
// ... work ...
nvtxRangePop();
}
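The push/pop pairing above is easy to unbalance when a function returns early or throws. One way to guard against that is a small RAII wrapper, sketched below. `ScopedRange`, `loadData`, `processFrame`, and `mayThrow` are hypothetical names chosen for illustration, and the nesting counter exists only so the behavior can be checked. The sketch also assumes you want the code to build on machines without the NVTX headers, so it falls back to no-op stubs there.

```cpp
#include <cassert>
#include <stdexcept>

// Use the real NVTX 3 header when it is installed; otherwise fall back to
// no-op stubs (an assumption of this sketch, so it builds without CUDA).
#if defined(__has_include)
#  if __has_include(<nvtx3/nvToolsExt.h>)
#    include <nvtx3/nvToolsExt.h>
#    define HAVE_NVTX 1
#  endif
#endif
#ifndef HAVE_NVTX
static inline int nvtxRangePushA(const char*) { return 0; }
static inline int nvtxRangePop(void) { return 0; }
#endif

// Hypothetical helper: opens a range on construction, closes it on destruction.
class ScopedRange {
public:
    explicit ScopedRange(const char* name) {
        nvtxRangePushA(name);
        ++depth_;
    }
    ~ScopedRange() {
        nvtxRangePop();
        --depth_;
    }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;

    // Number of ranges currently open on this thread (for illustration only).
    static int depth() { return depth_; }

private:
    static thread_local int depth_;
};
thread_local int ScopedRange::depth_ = 0;

// Returns the nesting depth observed while its own range is open.
int loadData() {
    ScopedRange r("loadData");
    return ScopedRange::depth();
}

// The nested range appears as a child bar under "processFrame" in the timeline.
int processFrame() {
    ScopedRange r("processFrame");
    return loadData();
}

// The range is still closed when the function exits through an exception.
void mayThrow(bool fail) {
    ScopedRange r("mayThrow");
    if (fail) throw std::runtime_error("boom");
}
```

Calling `processFrame()` opens two nested ranges and closes both before it returns. NVTX 3 also ships a C++ header, `nvtx3/nvtx3.hpp`, with its own scoped-range helpers, which may be preferable to a hand-written wrapper if it is available in your toolkit.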
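// Or with colors
nvtxEventAttributes_t attr = {0};
attr.version = NVTX_VERSION;
attr.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE; // must be set along with version
attr.colorType = NVTX_COLOR_ARGB;
attr.color = 0xFF00FF00; // green
attr.messageType = NVTX_MESSAGE_TYPE_ASCII;
attr.message.ascii = "Important Section";
nvtxRangePushEx(&attr);
// ... work ...
nvtxRangePop(); // every push needs a matching pop
Compile with: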
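$ nvcc -o myprogram main.cu -ldl # NVTX 3 (nvtx3/ headers) is header-only; -lnvToolsExt is only needed with the legacy <nvToolsExt.h> header
Profile and get summary statistics: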
$ nsys profile --stats=true -o system ./myprogram
# Shows kernel times, memory transfer times, API call counts
Identify CPU-GPU synchronization issues:
$ nsys profile -t cuda -o system --force-overwrite=true ./myprogram
$ nsys-ui system.nsys-rep
# Look for gaps between kernel launches (CPU idle or sync points)
Profile a running application:
$ nsys profile --attach-pid=1234 -o attached # confirm your nsys version has this flag: nsys profile --help
# Or launch with delayed start
$ nsys profile --delay=5 ./myprogram # start profiling after 5 seconds
Profile for a specific duration:
$ nsys profile --duration=10 ./myprogram # profile for 10 seconds
Profile applications running on remote machines or clusters where you can't run the GUI directly:
# On remote machine
$ nsys profile -o /tmp/report ./myprogram
# Copy report to local machine
$ scp remote:/tmp/report.nsys-rep .
$ nsys-ui report.nsys-rep
The Nsight Systems GUI also supports connecting to remote machines via SSH for interactive profiling sessions.
Common performance issues visible in the timeline:
Issue                        Timeline Pattern
─────────────────────────────────────────────────────────────────
CPU-GPU serialization        Gaps between kernel launches
Excessive synchronization    Many cudaDeviceSynchronize calls
Memory transfer overhead     Large cudaMemcpy blocks
Kernel launch overhead       Many small kernels with gaps
Underutilized GPU            Long CPU sections, short GPU bursts
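
Several of the patterns above come down to gaps between kernel executions. Those gaps can also be measured outside the GUI once kernel start and end timestamps have been extracted from an exported report. The sketch below is one possible way to do that, not part of Nsight Systems: `Interval`, `GapReport`, and `analyzeGaps` are hypothetical names, and the input is assumed to be a list of kernel intervals in nanoseconds that you have already loaded from your export.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One GPU kernel execution, in nanoseconds since the start of the trace.
struct Interval {
    std::int64_t start;
    std::int64_t end;
};

// Summary of GPU activity between the first kernel start and the last kernel end.
struct GapReport {
    std::int64_t busy_ns;         // time with at least one kernel running
    std::int64_t idle_ns;         // time with no kernel running
    std::int64_t longest_gap_ns;  // largest single idle gap
    std::size_t gap_count;        // gaps of at least min_gap_ns
};

// Merges overlapping kernels (concurrent streams) and measures the idle time
// between the merged busy periods.
GapReport analyzeGaps(std::vector<Interval> kernels, std::int64_t min_gap_ns) {
    GapReport r{0, 0, 0, 0};
    if (kernels.empty()) return r;

    std::sort(kernels.begin(), kernels.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });

    std::int64_t cur_start = kernels[0].start;
    std::int64_t cur_end = kernels[0].end;
    for (std::size_t i = 1; i < kernels.size(); ++i) {
        const Interval& k = kernels[i];
        if (k.start > cur_end) {
            // The GPU was idle between the previous busy period and this kernel.
            const std::int64_t gap = k.start - cur_end;
            r.busy_ns += cur_end - cur_start;
            r.idle_ns += gap;
            r.longest_gap_ns = std::max(r.longest_gap_ns, gap);
            if (gap >= min_gap_ns) ++r.gap_count;
            cur_start = k.start;
            cur_end = k.end;
        } else {
            // Overlapping or back-to-back kernel: extend the current busy period.
            cur_end = std::max(cur_end, k.end);
        }
    }
    r.busy_ns += cur_end - cur_start;
    return r;
}
```

A high `idle_ns` relative to `busy_ns` points toward the "Underutilized GPU" row, while many short gaps between short kernels suggest launch overhead. The threshold `min_gap_ns` is a tuning choice; a value around the typical launch latency on your system is a reasonable starting point.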