Usage: nvprof [options] [application] [application-arguments]
Options:
  --aggregate-mode <on|off>
        Turn on/off aggregate mode for events and metrics specified by subsequent
        "--events" and "--metrics" options. Those event/metric values will be
        collected for each domain instance, instead of the whole device.
        Allowed values:
            on - turn on aggregate mode (default)
            off - turn off aggregate mode
  --analysis-metrics
        Collect profiling data that can be imported to Visual Profiler's "analysis"
        mode.
        Note: Use "--export-profile" to specify an export file.
  --concurrent-kernels <on|off>
        Turn on/off concurrent kernel execution. If concurrent kernel execution is
        off, all kernels running on one device will be serialized.
        Allowed values:
            on - turn on concurrent kernel execution (default)
            off - turn off concurrent kernel execution
  --continuous-sampling-interval <interval>
        Set the continuous mode sampling interval in milliseconds. Minimum is 1 ms.
        Default is 2 ms.
  --dependency-analysis
        Generate event dependency graph for host and device activities and run
        dependency analysis.
  --device-buffer-size <size in MBs>
        Set the device memory size (in MBs) reserved for storing profiling data for
        non-CDP operations, especially for concurrent kernel tracing, for each
        buffer on a context. The default value is 8MB. The size should be a
        positive integer.
  --device-cdp-buffer-size <size in MBs>
        Set the device memory size (in MBs) reserved for storing profiling data for
        CDP operations for each buffer on a context. The default value is 8MB. The
        size should be a positive integer.
  --devices <device ids>
        Change the scope of subsequent "--events", "--metrics", "--query-events"
        and "--query-metrics" options.
        Allowed values:
            all - change scope to all valid devices
            comma-separated device IDs - change scope to specified devices
  --event-collection-mode <mode>
        Choose event collection mode for all events/metrics.
        Allowed values:
            kernel - events/metrics are collected only for durations of kernel
                executions (default)
            continuous - events/metrics are collected for duration of application.
                This is not applicable for non-Tesla devices. This mode is
                compatible only with NVLink events/metrics. This mode is
                incompatible with "--profile-all-processes" or
                "--profile-child-processes" or "--replay-mode kernel" or
                "--replay-mode application".
  -e, --events <event names>
        Specify the events to be profiled on certain device(s). Multiple event names
        separated by comma can be specified. Which device(s) are profiled is
        controlled by the "--devices" option. Otherwise events will be collected on
        all devices.
        For a list of available events, use "--query-events".
        Use "--events all" to profile all events available for each device.
        Use "--devices" and "--kernels" to select a specific kernel invocation.
  --kernel-latency-timestamps <on|off>
        Turn on/off collection of kernel latency timestamps, namely queued and
        submitted. The queued timestamp is captured when a kernel launch command
        was queued into the CPU command buffer. The submitted timestamp denotes
        when the CPU command buffer containing this kernel launch was submitted to
        the GPU. Turning this option on may incur an overhead during profiling.
        Allowed values:
            on - turn on collection of kernel latency timestamps
            off - turn off collection of kernel latency timestamps (default)
  --kernels <kernel path syntax>
        Change the scope of subsequent "--events", "--metrics" options.
        The syntax is as follows:
            <kernel name>
                Limit scope to given kernel name.
        or
            <context id/name>:<stream id/name>:<kernel name>:<invocation>
                The context/stream IDs, names, kernel name and invocation can be
                regular expressions. Empty string matches any number of characters.
                If <context id/name> or <stream id/name> is a positive number, it's
                strictly matched against the CUDA context/stream ID. Otherwise it's
                treated as a regular expression and matched against the
                context/stream name specified by the NVTX library. If the
                invocation count is a positive number, it's strictly matched
                against the invocation of the kernel. Otherwise it's treated as a
                regular expression.
        Example: --kernels "1:foo:bar:2" will profile any kernel whose name contains
        "bar" and is the 2nd instance on context 1 and on stream named "foo".
  -m, --metrics <metric names>
        Specify the metrics to be profiled on certain device(s). Multiple metric
        names separated by comma can be specified. Which device(s) are profiled is
        controlled by the "--devices" option. Otherwise metrics will be collected
        on all devices.
        For a list of available metrics, use "--query-metrics".
        Use "--metrics all" to profile all metrics available for each device.
        Use "--devices" and "--kernels" to select a specific kernel invocation.
        Note: "--metrics all" does not include some metrics which are needed for
        Visual Profiler's source level analysis. For that, use "--analysis-metrics".
  --pc-sampling-period <period>
        Specify PC Sampling period in cycles, at which the sampling records will be
        dumped. Allowed values for the period are integers from 5 to 31, inclusive.
        This will set the sampling period to (2^period) cycles.
        Default value is a number between 5 and 12 based on the setup.
        Note: Only available for GM20X+.
  --profile-all-processes
        Profile all processes launched by the same user who launched this nvprof
        instance.
        Note: Only one instance of nvprof can run with this option at the same
        time. Under this mode, there's no need to specify an application to run.
  --profile-api-trace <none|runtime|driver|all>
        Turn on/off CUDA runtime/driver API tracing.
        Allowed values:
            none - turn off API tracing
            runtime - only turn on CUDA runtime API tracing
            driver - only turn on CUDA driver API tracing
            all - turn on all API tracing (default)
  --profile-child-processes
        Profile the application and all child processes launched by it.
  --profile-from-start <on|off>
        Enable/disable profiling from the start of the application. If it's
        disabled, the application can use {cu,cuda}Profiler{Start,Stop} to turn
        on/off profiling.
        Allowed values:
            on - enable profiling from start (default)
            off - disable profiling from start
  --profiling-semaphore-pool-size <count>
        Set the profiling semaphore pool size reserved for storing profiling data
        for serialized kernels and memory operations for each context. The default
        value is 65536. The size should be a positive integer.
  --query-events
        List all the events available on the device(s). Device(s) queried can be
        controlled by the "--devices" option.
  --query-metrics
        List all the metrics available on the device(s). Device(s) queried can be
        controlled by the "--devices" option.
  --replay-mode <mode>
        Choose replay mode used when not all events/metrics can be collected in a
        single run.
        Allowed values:
            disabled - replay is disabled, events/metrics that couldn't be profiled
                will be dropped
            kernel - each kernel invocation is replayed (default)
            application - the entire application is replayed. This mode is
                incompatible with "--profile-all-processes" or
                "--profile-child-processes".
  -a, --source-level-analysis <source level analysis names>
        Specify the source level metrics to be profiled on a certain kernel
        invocation. Use "--devices" and "--kernels" to select a specific kernel
        invocation.
        Allowed values: one or more of the following, separated by commas
            global_access: global access
            shared_access: shared access
            branch: divergent branch
            instruction_execution: instruction execution
            pc_sampling: pc sampling, available only for GM20X+
        Note: Use "--export-profile" to specify an export file.
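
The "--kernels" scope syntax and the "--pc-sampling-period" rule described above can be modeled in a few lines of Python. This is only a sketch of the documented behavior, not nvprof's own parser. The names KernelScope, parse_kernel_scope, scope_matches and sampling_period_cycles are hypothetical. The spec is split naively on ":", so patterns or kernel names that themselves contain ":" are not handled, and matching a non-numeric invocation pattern against the decimal invocation number is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class KernelScope:
    """The four fields of a kernel path; an empty field matches anything."""
    context: str = ""
    stream: str = ""
    kernel: str = ""
    invocation: str = ""


def parse_kernel_scope(spec: str) -> KernelScope:
    """Accept '<kernel name>' or '<context>:<stream>:<kernel name>:<invocation>'."""
    parts = spec.split(":")  # assumption: no ':' inside any field
    if len(parts) == 1:
        return KernelScope(kernel=parts[0])
    if len(parts) == 4:
        return KernelScope(*parts)
    raise ValueError(f"expected 1 or 4 ':'-separated fields, got {len(parts)}")


def _field_matches(pattern: str, ident: int, name: str) -> bool:
    """Positive number: strict ID match. Otherwise: regular expression on the name."""
    if pattern == "":
        return True
    if re.fullmatch(r"[0-9]+", pattern) and int(pattern) > 0:
        return int(pattern) == ident
    return re.search(pattern, name) is not None


def scope_matches(scope: KernelScope, context_id: int, context_name: str,
                  stream_id: int, stream_name: str,
                  kernel_name: str, invocation: int) -> bool:
    """Return True if one kernel launch falls inside the given scope."""
    return (
        _field_matches(scope.context, context_id, context_name)
        and _field_matches(scope.stream, stream_id, stream_name)
        # The kernel field is always a regular expression ("name contains bar").
        and (scope.kernel == "" or re.search(scope.kernel, kernel_name) is not None)
        # Assumption: a regex invocation pattern is matched against str(invocation).
        and _field_matches(scope.invocation, invocation, str(invocation))
    )


def sampling_period_cycles(period: int) -> int:
    """Map the --pc-sampling-period argument (5..31) to 2^period cycles."""
    if not 5 <= period <= 31:
        raise ValueError("period must be an integer from 5 to 31, inclusive")
    return 2 ** period
```

With the example from the help text, parse_kernel_scope("1:foo:bar:2") yields a scope that scope_matches accepts for the 2nd invocation of a kernel whose name contains "bar" on context 1 and a stream named "foo", and rejects for any other invocation number.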
  --system-profiling <on|off>
        Turn on/off power, clock, and thermal profiling.
        Allowed values:
            on - turn on system profiling
            off - turn off system profiling (default)
  -t, --timeout <seconds>
        Set an execution timeout (in seconds) for the CUDA application.
        Note: Timeout starts counting from the moment the CUDA driver is
        initialized. If the application doesn't call any CUDA APIs, timeout won't
        be triggered.
  --track-memory-allocations <on|off>
        Turn on/off tracking of memory operations, which involves recording
        timestamps, memory size, memory type and program counters of the memory
        allocations and frees. Turning this option on may incur an overhead during
        profiling.
        Allowed values:
            on - turn on tracking of memory allocations and free
            off - turn off tracking of memory allocations and free (default)
  --unified-memory-profiling <per-process-device|off>
        Configure unified memory profiling.
        Allowed values:
            per-process-device - collect counts for each process and each device
                (default)
            off - turn off unified memory profiling
  --cpu-profiling <on|off>
        Turn on CPU profiling.
        Note: CPU profiling is not supported in multi-process mode.
  --cpu-profiling-explain-ccff <filename>
        Path to a PGI pgexplain.xml file that should be used to interpret Common
        Compiler Feedback Format (CCFF) messages.
  --cpu-profiling-frequency <frequency>
        Set the CPU profiling frequency in samples per second. Default is 25Hz.
        Maximum is 500Hz.
  --cpu-profiling-max-depth <depth>
        Set the maximum depth of each call stack. Zero means no limit. Default is
        zero.
  --cpu-profiling-mode <flat|top-down|bottom-up>
        Set the output mode of CPU profiling.
        Allowed values:
            flat - Show flat profile
            top-down - Show parent functions at the top
            bottom-up - Show parent functions at the bottom (default)
  --cpu-profiling-percentage-threshold <threshold>
        Filter out the entries that are below the set percentage threshold. The
        limit should be an integer between 0 and 100, inclusive. Zero means no
        limit. Default is zero.
  --cpu-profiling-scope <function|instruction>
        Choose the profiling scope.
        Allowed values:
            function - Each level in the stack trace represents a distinct function
                (default)
            instruction - Each level in the stack trace represents a distinct
                instruction address
  --cpu-profiling-show-ccff <on|off>
        Choose whether to print Common Compiler Feedback Format (CCFF) messages
        embedded in the binary.
        Note: this option implies "--cpu-profiling-scope instruction".
        Default is off.
  --cpu-profiling-show-library <on|off>
        Choose whether to print the library name for each sample.
  --cpu-profiling-thread-mode <separated|aggregated>
        Set the thread mode of CPU profiling.
        Allowed values:
            separated - Show separate profile for each thread
            aggregated - Aggregate data from all threads (default)
  --cpu-profiling-unwind-stack <on|off>
        Choose whether to unwind the CPU call-stack at each sample point. Default is
        on.
  --openacc-profiling <on|off>
        Enable/disable recording information from the OpenACC profiling interface.
        Note: whether the OpenACC profiling interface is available depends on the
        OpenACC runtime.
        Default is on.
  --context-name <name>
        Name of the CUDA context.
        "%i" in the context name string is replaced with the ID of the context.
        "%p" in the context name string is replaced with the process ID of the
        application being profiled.
        "%q{<ENV>}" in the context name string is replaced with the value of the
        environment variable "<ENV>". If the environment variable is not set it's
        an error.
        "%h" in the context name string is replaced with the hostname of the system.
        "%%" in the context name string is replaced with "%".
        Any other character following "%" is illegal.
  --csv
        Use comma-separated values in the output.
  --demangling <on|off>
        Turn on/off C++ name demangling of function names.
        Allowed values:
            on - turn on demangling (default)
            off - turn off demangling
  -u, --normalized-time-unit <s|ms|us|ns|col|auto>
        Specify the unit of time that will be used in the output.
        Allowed values:
            s - second, ms - millisecond, us - microsecond, ns - nanosecond
            col - a fixed unit for each column
            auto (default) - the scale is chosen for each value based on its length.
  --openacc-summary-mode <mode>
        Set how durations are computed in the OpenACC summary.
        Allowed values:
            exclusive: show exclusive times (default)
            inclusive: show inclusive times
  --print-api-summary
        Print a summary of CUDA runtime/driver API calls.
  --print-api-trace
        Print CUDA runtime/driver API trace.
  --print-dependency-analysis-trace
        Print dependency analysis trace.
  --print-gpu-summary
        Print a summary of the activities on the GPU (including CUDA kernels and
        memcpy's/memset's).
  --print-gpu-trace
        Print individual kernel invocations (including CUDA memcpy's/memset's) and
        sort them in chronological order. In event/metric profiling mode, show
        events/metrics for each kernel invocation.
  --print-openacc-constructs
        Include parent construct names in OpenACC profile.
  --print-openacc-summary
        Print a summary of the OpenACC profile.
  --print-openacc-trace
        Print a trace of the OpenACC profile.
  -s, --print-summary
        Print a summary of the profiling result on screen.
        Note: This is the default unless "--export-profile" or other print options
        are used.
  --print-summary-per-gpu
        Print a summary of the profiling result for each GPU.
  --process-name <name>
        Name of the process.
        "%p" in the process name string is replaced with the process ID of the
        application being profiled.
        "%q{<ENV>}" in the process name string is replaced with the value of the
        environment variable "<ENV>". If the environment variable is not set it's
        an error.
        "%h" in the process name string is replaced with the hostname of the system.
        "%%" in the process name string is replaced with "%".
        Any other character following "%" is illegal.
  --quiet
        Suppress all nvprof output.
  --stream-name <name>
        Name of the CUDA stream.
        "%i" in the stream name string is replaced with the ID of the stream.
        "%p" in the stream name string is replaced with the process ID of the
        application being profiled.
        "%q{<ENV>}" in the stream name string is replaced with the value of the
        environment variable "<ENV>". If the environment variable is not set it's
        an error.
        "%h" in the stream name string is replaced with the hostname of the system.
        "%%" in the stream name string is replaced with "%".
        Any other character following "%" is illegal.
  -o, --export-profile <filename>
        Export the result file which can be imported later or opened by the NVIDIA
        Visual Profiler.
        "%p" in the file name string is replaced with the process ID of the
        application being profiled.
        "%q{<ENV>}" in the file name string is replaced with the value of the
        environment variable "<ENV>". If the environment variable is not set it's
        an error.
        "%h" in the file name string is replaced with the hostname of the system.
        "%%" in the file name string is replaced with "%".
        Any other character following "%" is illegal.
        By default, this option disables the summary output.
        Note: If the application being profiled creates child processes, or if
        "--profile-all-processes" is used, the "%p" format is needed to get correct
        export files for each process.
  -f, --force-overwrite
        Force overwriting all output files (any existing files will be overwritten).
  -i, --import-profile <filename>
        Import a result profile from a previous run.
  --log-file <filename>
        Make nvprof send all its output to the specified file, or one of the
        standard channels. The file will be overwritten. If the file doesn't exist,
        a new one will be created.
        "%1" as the whole file name indicates standard output channel (stdout).
        "%2" as the whole file name indicates standard error channel (stderr).
        Note: This is the default.
        "%p" in the file name string is replaced with the process ID of the
        application being profiled.
        "%q{<ENV>}" in the file name string is replaced with the value of the
        environment variable "<ENV>". If the environment variable is not set it's
        an error.
        "%h" in the file name string is replaced with the hostname of the system.
        "%%" in the file name is replaced with "%".
        Any other character following "%" is illegal.
  --print-nvlink-topology
        Print NVLink topology.
  -h, --help
        Print this help information.
  -V, --version
        Print version information of this tool.