Software Development - Raymond P. Burkholder

Friday, May 9. 2025

Installing LibTorch with Cuda on NVIDIA GeForce RTX 4070

To try out some LSTM machine learning algorithms with my TradeFrame Algorithmic Trading Library, I wanted to install LibTorch with NVIDIA/CUDA support for hardware-accelerated learning.

Do not install the nvidia-driver yet. It is part of the CUDA deployment package. Only install the kernel headers, which are necessary for building kernel modules.

$ sudo apt install linux-headers-$(uname -r)

I used Installing C++ Distributions of PyTorch as a starting point. However, their example is CPU based, and I wanted a CUDA-based installation. This meant going to the CUDA Zone and starting the download process. My configuration options were: Linux, x86_64, Debian, 12, deb (local).

Using the "deb (local)" option, which downloads one complete file, seemed to be the only way to ensure all components were available.

The steps, as of this writing, were:

wget https://developer.download.nvidia.com/compute/cuda/12.9.0/local_installers/cuda-repo-debian12-12-9-local_12.9.0-575.51.03-1_amd64.deb
sudo dpkg -i cuda-repo-debian12-12-9-local_12.9.0-575.51.03-1_amd64.deb
sudo cp /var/cuda-repo-debian12-12-9-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9

Install the open version of the nvidia drivers:

sudo apt-get install -y nvidia-open

See if the nouveau driver is loaded:

$ lsmod |grep nouveau

If so, run these commands to enable the nvidia driver and blacklist the nouveau driver, then reboot:

sudo mv /etc/modprobe.d/nvidia.conf.dpkg-new  /etc/modprobe.d/nvidia.conf
sudo update-initramfs -u
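
After the reboot it is worth checking which driver actually loaded. The following is a minimal sketch and not part of the original steps: the helper name module_listed is my own invention, and it reads lsmod-style text from standard input so that it can be tried without touching the system. It matches on the first column only, so that nvidia_drm is not mistaken for nvidia.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named module appears in lsmod-style
# text on stdin (the module name is the first column).
module_listed() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Report which driver is currently loaded, if lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
  if lsmod | module_listed nouveau; then
    echo "nouveau is loaded: blacklist it, update the initramfs, and reboot"
  elif lsmod | module_listed nvidia; then
    echo "nvidia is loaded"
  else
    echo "neither nouveau nor nvidia is loaded"
  fi
fi
```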

There is also the NVIDIA CUDA Installation Guide for Linux for further information.

The following changes are required for a successful compile of the example application below:

$ diff math_functions.h /etc/alternatives/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h
2556c2556
< extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 sinpi(double x);
---
> extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 sinpi(double x) noexcept (true);
2579c2579
< extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  sinpif(float x);
---
> extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  sinpif(float x) noexcept (true);
2601c2601
< extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 cospi(double x);
---
> extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 cospi(double x) noexcept (true);
2623c2623
< extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  cospif(float x);
---
> extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  cospif(float x) noexcept (true);

Use this script to apply the changes above:

$ cat cuda_fix.sh
#!/bin/sh
header_file=/etc/alternatives/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h
sudo sed -i 's/sinpi(double x);/sinpi(double x) noexcept (true);/' $header_file
sudo sed -i 's/sinpif(float x);/sinpif(float x) noexcept (true);/' $header_file
sudo sed -i 's/cospi(double x);/cospi(double x) noexcept (true);/' $header_file
sudo sed -i 's/cospif(float x);/cospif(float x) noexcept (true);/' $header_file
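
Before editing the system header in place, the substitutions can be tried on a scratch file. This is a minimal sketch under my own assumptions: the function name patch_header and the two-line sample file are illustrative only, while the sed expressions are the ones from cuda_fix.sh. It relies on GNU sed's -i option, as the script above does.

```shell
#!/bin/sh
# Hypothetical wrapper around the four substitutions from cuda_fix.sh.
# It edits the file given as $1; a scratch copy needs no sudo.
patch_header() {
  sed -i \
    -e 's/sinpi(double x);/sinpi(double x) noexcept (true);/' \
    -e 's/sinpif(float x);/sinpif(float x) noexcept (true);/' \
    -e 's/cospi(double x);/cospi(double x) noexcept (true);/' \
    -e 's/cospif(float x);/cospif(float x) noexcept (true);/' \
    "$1"
}

# Illustrative sample standing in for two of the declarations.
scratch=$(mktemp)
printf '%s\n' 'double sinpi(double x);' 'float cospif(float x);' > "$scratch"

patch_header "$scratch"
first=$(cat "$scratch")

# A second run finds nothing left to match, so the file stays the same.
patch_header "$scratch"
second=$(cat "$scratch")

echo "$first"
if [ "$first" = "$second" ]; then
  echo "second run made no further changes"
fi
rm -f "$scratch"
```

Because the patched declarations no longer end in "x);", running the fix twice should be harmless.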

The PyTorch LibTorch library can be downloaded from PyTorch Start Locally. Choose the C++/Java option with CUDA 12.8 (as of this writing). An appropriate link is presented. Download and expand the file into a development directory. At the moment LibTorch has no build for CUDA 12.9, so the build referenced as 12.8 is used.

The most recent nightly build can be found at https://download.pytorch.org/libtorch/nightly/cu128/libtorch-shared-with-deps-latest.zip.

It is probably advisable NOT to use the Debian package pytorch-cuda, as it may be out of date.

Expand the libtorch package and deploy to /usr/local/share/libtorch.

To test out the installation, I then created a subdirectory containing a couple of files. The first is the test code example-app.cpp:

#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor tensor = torch::rand({2, 3});
  std::cout << tensor << std::endl;
}

The second file is the CMakeLists.txt file. This is my version:

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)

cmake_policy(SET CMP0104 NEW)
cmake_policy(SET CMP0105 NEW)
project(example-app)

find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
set(CMAKE_CUDA_STANDARD 17)

add_executable(example-app example-app.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
set_property(TARGET example-app PROPERTY CXX_STANDARD 17)

Then to build the example:

mkdir build
cd build
cmake \
  -DCMAKE_PREFIX_PATH=/usr/local/share/libtorch \
  -DCMAKE_CUDA_ARCHITECTURES=native \
  -DCMAKE_BUILD_TYPE=DEBUG \
  -DCMAKE_CUDA_COMPILER=/etc/alternatives/cuda/bin/nvcc \
  -Dnvtx3_dir=/usr/local/cuda/targets/x86_64-linux/include/nvtx3  \
  ..
make
./example-app

Notes:

CMAKE_PREFIX_PATH points to the directory of your expanded libtorch download.
CMAKE_CUDA_ARCHITECTURES set to 'native' lets the build process determine the specific GPU for which to build.
CMAKE_BUILD_TYPE can be DEBUG or RELEASE.
CMAKE_CUDA_COMPILER needs to be set. The paths under /etc/alternatives are softlinks, created by the CUDA installation, to the version you desire.
nvtx3_dir is required, as the current libtorch library seems to still refer to nvtx and not nvtx3.
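
Since most of these options are paths, a quick existence check before running cmake can save a confusing configure failure. This is a sketch of my own rather than part of the build: the function name missing_paths is hypothetical, and the paths listed are simply the ones used in the cmake command above, so adjust them to your installation.

```shell
#!/bin/sh
# Hypothetical helper: print each argument that does not exist on disk.
missing_paths() {
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "$p"
    fi
  done
}

# Paths taken from the cmake invocation; adjust to your installation.
missing=$(missing_paths \
  /usr/local/share/libtorch \
  /etc/alternatives/cuda/bin/nvcc \
  /usr/local/cuda/targets/x86_64-linux/include/nvtx3)

if [ -n "$missing" ]; then
  echo "missing before running cmake:"
  echo "$missing"
else
  echo "all paths present"
fi
```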

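If automatic GPU detection fails, cmake reports output along the lines of the following and builds for a list of common architectures instead: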

-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 5.0;8.0;8.6;8.9;9.0;9.0a;10.0;10.0a;10.1a;12.0;12.0a

My system has two RTX 4070 cards, which can be verified with lshw (an extract of the important parts is shown; note that the nvidia driver is listed for both cards):

$ sudo lshw -c video
  *-display
    product: Arrow Lake-S [Intel Graphics]
    configuration: depth=32 driver=i915 latency=0 mode=3840x2160 resolution=3840,2160 visual=truecolor xres=3840 yres=2160
  *-display
    product: AD103 [GeForce RTX 4070]
    configuration: driver=nvidia latency=0
  *-display
    product: AD103 [GeForce RTX 4070]
    configuration: driver=nvidia latency=0

Therefore, the output of my cmake process includes GPU-specific selections:

-- Autodetected CUDA architecture(s):  8.9 8.9
-- Added CUDA NVCC flags for: -gencode;arch=compute_89,code=sm_89
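
The relation between the detected architecture and the nvcc flag is mechanical: compute capability 8.9 becomes compute_89 and sm_89. The sketch below illustrates that mapping; the function name cap_to_gencode is my own, and the nvidia-smi query assumes a driver recent enough to offer the compute_cap field.

```shell
#!/bin/sh
# Hypothetical helper: turn a compute capability such as "8.9" into the
# matching nvcc -gencode value by dropping the dot.
cap_to_gencode() {
  digits=$(printf '%s' "$1" | tr -d '.')
  echo "arch=compute_${digits},code=sm_${digits}"
}

# The RTX 4070 case from the cmake output above.
cap_to_gencode 8.9

# If nvidia-smi is present, list each GPU's capability and its flag.
# Assumption: the compute_cap query field is supported by the driver.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null |
    while read -r cap; do
      if [ -n "$cap" ]; then
        echo "$cap -> $(cap_to_gencode "$cap")"
      fi
    done
fi
```

For 8.9 the helper prints arch=compute_89,code=sm_89, which matches the flag cmake added.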

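And running the generated binary results in valid output. Note that the tensor is created on the CPU (hence CPUFloatType), so this confirms that the build and link work rather than that the GPU is being used: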

$ ./example-app
0.7141  0.9744  0.3179
0.7794  0.9281  0.7529
[ CPUFloatType{2,3} ]
