Raymond P. Burkholder - Things I Do - Entries from May 2025

Saturday, May 31. 2025

Trends in Trading

Deep Learning Enhanced Multi-Day Turnover Quantitative Trading Algorithm for Chinese A-Share Market

This paper presents a sophisticated multi-day turnover quantitative trading algorithm that integrates advanced deep learning techniques with comprehensive cross-sectional stock prediction for the Chinese A-share market. Our framework combines five interconnected modules: initial stock selection through deep cross-sectional prediction networks, opening signal distribution analysis using mixture models for arbitrage identification, market capitalization and liquidity-based dynamic position sizing, grid-search optimized profit-taking and stop-loss mechanisms, and multi-granularity volatility-based market timing models. The algorithm employs a novel approach to balance capital efficiency with risk management through adaptive holding periods and sophisticated entry/exit timing. Trained on comprehensive A-share data from 2010-2020 and rigorously backtested on 2021-2024 data, our method achieves remarkable performance with 15.2\% annualized returns, maximum drawdown constrained below 5\%, and a Sharpe ratio of 1.87. The strategy demonstrates exceptional scalability by maintaining 50-100 daily positions with a 9-day maximum holding period, incorporating dynamic profit-taking and stop-loss mechanisms that enhance capital turnover efficiency while preserving risk-adjusted returns. Our approach exhibits robust performance across various market regimes while maintaining high capital capacity suitable for institutional deployment.

SealOS+: A Sealos-based Approach for Adaptive Resource Optimization Under Dynamic Workloads for Securities Trading System

As securities trading systems transition to a microservices architecture, optimizing system performance presents challenges such as inefficient resource scheduling and high service response delays. Existing container orchestration platforms lack tailored performance optimization mechanisms for trading scenarios, making it difficult to meet the stringent 50ms response time requirement imposed by exchanges. This paper introduces SealOS+, a Sealos-based performance optimization approach for securities trading, incorporating an adaptive resource scheduling algorithm leveraging deep reinforcement learning, a three-level caching mechanism for trading operations, and a Long Short-Term Memory (LSTM) based load prediction model. Real-world deployment at a securities exchange demonstrates that the optimized system achieves an average CPU utilization of 78\%, reduces transaction response time to 105ms, and reaches a peak processing capacity of 15,000 transactions per second, effectively meeting the rigorous performance and reliability demands of securities trading.

Posted by

Raymond P. Burkholder

in Machine Learning at 13:32

DNS

Domainator: Detecting and Identifying DNS-Tunneling Malware Using Metadata Sequences

In recent years, malware with tunneling (or: covert channel) capabilities is on the rise. While malware research led to several methods and innovations, the detection and differentiation of malware solely based on its DNS tunneling features is still in its infancy. Moreover, no work so far has used the DNS tunneling traffic to gain knowledge over the current actions taken by the malware. In this paper, we present Domainator, an approach to detect and differentiate state-of-the-art malware and DNS tunneling tools without relying on trivial (but quickly altered) features such as "magic bytes" that are embedded into subdomains. Instead, we apply an analysis of sequential patterns to identify specific types of malware. We evaluate our approach with 7 different malware samples and tunneling tools and can identify the particular malware based on its DNS traffic. We further infer the rough behavior of the particular malware through its DNS tunneling artifacts. Finally, we compare our Domainator with related methods.

Posted by

Raymond P. Burkholder

in Machine Learning at 13:26

Agriculture

Farm-LightSeek: An Edge-centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs

Amid the challenges posed by global population growth and climate change, traditional agricultural Internet of Things (IoT) systems is currently undergoing a significant digital transformation to facilitate efficient big data processing. While smart agriculture utilizes artificial intelligence (AI) technologies to enable precise control, it still encounters significant challenges, including excessive reliance on agricultural expert knowledge, difficulties in fusing multimodal data, poor adaptability to dynamic environments, and bottlenecks in real-time decision-making at the edge. Large language models (LLMs), with their exceptional capabilities in knowledge acquisition and semantic understanding, provide a promising solution to address these challenges. To this end, we propose Farm-LightSeek, an edge-centric multimodal agricultural IoT data analytics framework that integrates LLMs with edge computing. This framework collects real-time farmland multi-source data (images, weather, geographic information) via sensors, performs cross-modal reasoning and disease detection at edge nodes, conducts low-latency management decisions, and enables cloud collaboration for model updates. The main innovations of Farm-LightSeek include: (1) an agricultural "perception-decision-action" closed-loop architecture; (2) cross-modal adaptive monitoring; and (3)a lightweight LLM deployment strategy balancing performance and efficiency. Experiments conducted on two real-world datasets demonstrate that Farm-LightSeek consistently achieves reliable performance in mission-critical tasks, even under the limitations of edge computing resources. This work advances intelligent real-time agricultural solutions and highlights the potential for deeper integration of agricultural IoT with LLMs.

Olfactory Inertial Odometry: Methodology for Effective Robot Navigation by Scent

Olfactory navigation is one of the most primitive mechanisms of exploration used by organisms. Navigation by machine olfaction (artificial smell) is a very difficult task to both simulate and solve. With this work, we define olfactory inertial odometry (OIO), a framework for using inertial kinematics, and fast-sampling olfaction sensors to enable navigation by scent analogous to visual inertial odometry (VIO). We establish how principles from SLAM and VIO can be extrapolated to olfaction to enable real-world robotic tasks. We demonstrate OIO with three different odour localization algorithms on a real 5-DoF robot arm over an odour-tracking scenario that resembles real applications in agriculture and food quality control. Our results indicate success in establishing a baseline framework for OIO from which other research in olfactory navigation can build, and we note performance enhancements that can be made to address more complex tasks in the future.

Learning to See More: UAS-Guided Super-Resolution of Satellite Imagery for Precision Agriculture

Unmanned Aircraft Systems (UAS) and satellites are key data sources for precision agriculture, yet each presents trade-offs. Satellite data offer broad spatial, temporal, and spectral coverage but lack the resolution needed for many precision farming applications, while UAS provide high spatial detail but are limited by coverage and cost, especially for hyperspectral data. This study presents a novel framework that fuses satellite and UAS imagery using super-resolution methods. By integrating data across spatial, spectral, and temporal domains, we leverage the strengths of both platforms cost-effectively. We use estimation of cover crop biomass and nitrogen (N) as a case study to evaluate our approach. By spectrally extending UAS RGB data to the vegetation red edge and near-infrared regions, we generate high-resolution Sentinel-2 imagery and improve biomass and N estimation accuracy by 18% and 31%, respectively. Our results show that UAS data need only be collected from a subset of fields and time points. Farmers can then 1) enhance the spectral detail of UAS RGB imagery; 2) increase the spatial resolution by using satellite data; and 3) extend these enhancements spatially and across the growing season at the frequency of the satellite flights. Our SRCNN-based spectral extension model shows considerable promise for model transferability over other cropping systems in the Upper and Lower Chesapeake Bay regions. Additionally, it remains effective even when cloud-free satellite data are unavailable, relying solely on the UAS RGB input. The spatial extension model produces better biomass and N predictions than models built on raw UAS RGB images. Once trained with targeted UAS RGB data, the spatial extension model allows farmers to stop repeated UAS flights. While we introduce super-resolution advances, the core contribution is a lightweight and scalable system for affordable on-farm use.

Posted by

Raymond P. Burkholder

in Machine Learning at 13:21

Leaps in Thought

Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Hopfield networks are associative memory (AM) systems, designed for storing and retrieving patterns as local minima of an energy landscape. In the classical Hopfield model, an interesting phenomenon occurs when the amount of training data reaches its critical memory load $- spurious\,\,states$, or unintended stable points, emerge at the end of the retrieval dynamics, leading to incorrect recall. In this work, we examine diffusion models, commonly used in generative modeling, from the perspective of AMs. The training phase of diffusion model is conceptualized as memory encoding (training data is stored in the memory). The generation phase is viewed as an attempt of memory retrieval. In the small data regime the diffusion model exhibits a strong memorization phase, where the network creates distinct basins of attraction around each sample in the training set, akin to the Hopfield model below the critical memory load. In the large data regime, a different phase appears where an increase in the size of the training set fosters the creation of new attractor states that correspond to manifolds of the generated samples. Spurious states appear at the boundary of this transition and correspond to emergent attractor states, which are absent in the training set, but, at the same time, have distinct basins of attraction around them. Our findings provide: a novel perspective on the memorization-generalization phenomenon in diffusion models via the lens of AMs, theoretical prediction of existence of spurious states, empirical validation of this prediction in commonly-used diffusion models.

Posted by

Raymond P. Burkholder

in Machine Learning at 13:05

Simulating What Didn't Happen

Simulating the Unseen: Crash Prediction Must Learn from What Did Not Happen

Traffic safety science has long been hindered by a fundamental data paradox: the crashes we most wish to prevent are precisely those events we rarely observe. Existing crash-frequency models and surrogate safety metrics rely heavily on sparse, noisy, and under-reported records, while even sophisticated, high-fidelity simulations undersample the long-tailed situations that trigger catastrophic outcomes such as fatalities. We argue that the path to achieving Vision Zero, i.e., the complete elimination of traffic fatalities and severe injuries, requires a paradigm shift from traditional crash-only learning to a new form of counterfactual safety learning: reasoning not only about what happened, but also about the vast set of plausible yet perilous scenarios that could have happened under slightly different circumstances. To operationalize this shift, our proposed agenda bridges macro to micro. Guided by crash-rate priors, generative scene engines, diverse driver models, and causal learning, near-miss events are synthesized and explained. A crash-focused digital twin testbed links micro scenes to macro patterns, while a multi-objective validator ensures that simulations maintain statistical realism. This pipeline transforms sparse crash data into rich signals for crash prediction, enabling the stress-testing of vehicles, roads, and policies before deployment. By learning from crashes that almost happened, we can shift traffic safety from reactive forensics to proactive prevention, advancing Vision Zero.

Posted by

Raymond P. Burkholder

in Machine Learning at 13:03

Wednesday, May 28. 2025

Continual / Continuous / Continuity

A Continual Offline Reinforcement Learning Benchmark for Navigation Tasks

Autonomous agents operating in domains such as robotics or video game simulations must adapt to changing tasks without forgetting about the previous ones. This process called Continual Reinforcement Learning poses non-trivial difficulties, from preventing catastrophic forgetting to ensuring the scalability of the approaches considered. Building on recent advances, we introduce a benchmark providing a suite of video-game navigation scenarios, thus filling a gap in the literature and capturing key challenges : catastrophic forgetting, task adaptation, and memory efficiency. We define a set of various tasks and datasets, evaluation protocols, and metrics to assess the performance of algorithms, including state-of-the-art baselines. Our benchmark is designed not only to foster reproducible research and to accelerate progress in continual reinforcement learning for gaming, but also to provide a reproducible framework for production pipelines -- helping practitioners to identify and to apply effective approaches.

Continual Reinforcement Learning

Continual reinforcement learning (RL) concerns agents that are expected to learn continually, rather than converge to a policy that is then fixed for evaluation. Such an approach is well suited to environments the agent perceives as changing, which renders any static policy ineffective over time. The few simulators explicitly designed for empirical research in continual RL are often limited in scope or complexity, and it is now common for researchers to modify episodic RL environments by artificially incorporating abrupt task changes during interaction. In this paper, we introduce AgarCL, a research platform for continual RL that allows for a progression of increasingly sophisticated behaviour. AgarCL is based on the game Agar.io, a non-episodic, high-dimensional problem featuring stochastic, ever-evolving dynamics, continuous actions, and partial observability. Additionally, we provide benchmark results reporting the performance of DQN, PPO, and SAC in both the primary, challenging continual RL problem, and across a suite of smaller tasks within AgarCL, each of which isolates aspects of the full environment and allow us to characterize the challenges posed by different aspects of the game.

Continuous Thought Machines

Biological brains demonstrate complex neural activity, where the timing and interplay between neurons is critical to how brains process information. Most deep learning architectures simplify neural activity by abstracting away temporal dynamics. In this paper we challenge that paradigm. By incorporating neuron-level processing and synchronization, we can effectively reintroduce neural timing as a foundational element. We present the Continuous Thought Machine (CTM), a model designed to leverage neural dynamics as its core representation. The CTM has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique weight parameters to process a history of incoming signals; and (2) neural synchronization employed as a latent representation. The CTM aims to strike a balance between oversimplified neuron abstractions that improve computational efficiency, and biological realism. It operates at a level of abstraction that effectively captures essential temporal dynamics while remaining computationally tractable for deep learning. We demonstrate the CTM's strong performance and versatility across a range of challenging tasks, including ImageNet-1K classification, solving 2D mazes, sorting, parity computation, question-answering, and RL tasks. Beyond displaying rich internal representations and offering a natural avenue for interpretation owing to its internal process, the CTM is able to perform tasks that require complex sequential reasoning. The CTM can also leverage adaptive compute, where it can stop earlier for simpler tasks, or keep computing when faced with more challenging instances. The goal of this work is to share the CTM and its associated innovations, rather than pushing for new state-of-the-art results. To that end, we believe the CTM represents a significant step toward developing more biologically plausible and powerful artificial intelligence systems.

Posted by

Raymond P. Burkholder

in Machine Learning at 01:03

Saturday, May 24. 2025

Data Sources

Interpretable Machine Learning for Macro Alpha: A News Sentiment Case Study

This study introduces an interpretable machine learning (ML) framework to extract macroeconomic alpha from global news sentiment. We process the Global Database of Events, Language, and Tone (GDELT) Project's worldwide news feed using FinBERT -- a Bidirectional Encoder Representations from Transformers (BERT) based model pretrained on finance-specific language -- to construct daily sentiment indices incorporating mean tone, dispersion, and event impact. These indices drive an XGBoost classifier, benchmarked against logistic regression, to predict next-day returns for EUR/USD, USD/JPY, and 10-year U.S. Treasury futures (ZN). Rigorous out-of-sample (OOS) backtesting (5-fold expanding-window cross-validation, OOS period: c. 2017-April 2025) demonstrates exceptional, cost-adjusted performance for the XGBoost strategy: Sharpe ratios achieve 5.87 (EUR/USD), 4.65 (USD/JPY), and 4.65 (Treasuries), with respective compound annual growth rates (CAGRs) exceeding 50% in Foreign Exchange (FX) and 22% in bonds. Shapley Additive Explanations (SHAP) affirm that sentiment dispersion and article impact are key predictive features. Our findings establish that integrating domain-specific Natural Language Processing (NLP) with interpretable ML offers a potent and explainable source of macro alpha.

Posted by

Raymond P. Burkholder

in Machine Learning at 00:31

Thursday, May 22. 2025

FPGA Convolutional Neural Networks

FPGA-based Acceleration for Convolutional Neural Networks: A Comprehensive Review

Convolutional Neural Networks (CNNs) are fundamental to deep learning, driving applications across various domains. However, their growing complexity has significantly increased computational demands, necessitating efficient hardware accelerators. Field-Programmable Gate Arrays (FPGAs) have emerged as a leading solution, offering reconfigurability, parallelism, and energy efficiency. This paper provides a comprehensive review of FPGA-based hardware accelerators specifically designed for CNNs. It presents and summarizes the performance evaluation framework grounded in existing studies and explores key optimization strategies, such as parallel computing, dataflow optimization, and hardware-software co-design. It also compares various FPGA architectures in terms of latency, throughput, compute efficiency, power consumption, and resource utilization. Finally, the paper highlights future challenges and opportunities, emphasizing the potential for continued innovation in this field.

An Overview of Arithmetic Adaptations for Inference of Convolutional Neural Networks on Re-configurable Hardware

Convolutional Neural Networks (CNNs) have gained high popularity as a tool for computer vision tasks and for that reason are used in various applications. There are many different concepts, like single shot detectors, that have been published for detecting objects in images or video streams. However, CNNs suffer from disadvantages regarding the deployment on embedded platforms such as re-configurable hardware like Field Programmable Gate Arrays (FPGAs). Due to the high computational intensity, memory requirements and arithmetic conditions, a variety of strategies for running CNNs on FPGAs have been developed. The following methods showcase our best practice approaches for a TinyYOLOv3 detector network on a XILINX Artix-7 FPGA using techniques like fusion of batch normalization, filter pruning and post training network quantization.

Posted by

Raymond P. Burkholder

in Machine Learning at 03:33

AI Agents

Pel, A Programming Language for Orchestrating AI Agents

The proliferation of Large Language Models (LLMs) has opened new frontiers in computing, yet controlling and orchestrating their capabilities beyond simple text generation remains a challenge. Current methods, such as function/tool calling and direct code generation, suffer from limitations in expressiveness, scalability, cost, security, and the ability to enforce fine-grained control. This paper introduces Pel, a novel programming language specifically designed to bridge this gap. Inspired by the strengths of Lisp, Elixir, Gleam, and Haskell, Pel provides a syntactically simple, homoiconic, and semantically rich platform for LLMs to express complex actions, control flow, and inter-agent communication safely and efficiently. Pel's design emphasizes a minimal, easily modifiable grammar suitable for constrained LLM generation, eliminating the need for complex sandboxing by enabling capability control at the syntax level. Key features include a powerful piping mechanism for linear composition, first-class closures enabling easy partial application and functional patterns, built-in support for natural language conditions evaluated by LLMs, and an advanced Read-Eval-Print-Loop (REPeL) with Common Lisp-style restarts and LLM-powered helper agents for automated error correction. Furthermore, Pel incorporates automatic parallelization of independent operations via static dependency analysis, crucial for performant agentic systems. We argue that Pel offers a more robust, secure, and expressive paradigm for LLM orchestration, paving the way for more sophisticated and reliable AI agentic frameworks.

Posted by

Raymond P. Burkholder

in Machine Learning at 02:40

Saturday, May 17. 2025

Surveys

Introduction to Predictive Coding Networks for Machine Learning

Predictive coding networks (PCNs) constitute a biologically inspired framework for understanding hierarchical computation in the brain, and offer an alternative to traditional feedforward neural networks in ML. This note serves as a quick, onboarding introduction to PCNs for machine learning practitioners. We cover the foundational network architecture, inference and learning update rules, and algorithmic implementation. A concrete image-classification task (CIFAR-10) is provided as a benchmark-smashing application, together with an accompanying Python notebook containing the PyTorch implementation.

Navigating the Edge-Cloud Continuum: A State-of-Practice Survey

The edge-cloud continuum has emerged as a transformative paradigm that meets the growing demand for low-latency, scalable, end-to-end service delivery by integrating decentralized edge resources with centralized cloud infrastructures. Driven by the exponential growth of IoT-generated data and the need for real-time responsiveness, this continuum features multi-layered architectures. However, its adoption is hindered by infrastructural challenges, fragmented standards, and limited guidance for developers and researchers. Existing surveys rarely tackle practical implementation or recent industrial advances. This survey closes those gaps from a developer-oriented perspective, introducing a conceptual framework for navigating the edge-cloud continuum. We systematically examine architectural models, performance metrics, and paradigms for computation, communication, and deployment, together with enabling technologies and widely used edge-to-cloud platforms. We also discuss real-world applications in smart cities, healthcare, and Industry 4.0, as well as tools for testing and experimentation. Drawing on academic research and practices of leading cloud providers, this survey serves as a practical guide for developers and a structured reference for researchers, while identifying open challenges and emerging trends that will shape the future of the continuum.

Survey: Graph Databases

Graph databases have become essential tools for managing complex and interconnected data, which is common in areas like social networks, bioinformatics, and recommendation systems. Unlike traditional relational databases, graph databases offer a more natural way to model and query intricate relationships, making them particularly effective for applications that demand flexibility and efficiency in handling interconnected data.
Despite their increasing use, graph databases face notable challenges. One significant issue is the irregular nature of graph data, often marked by structural sparsity, such as in its adjacency matrix representation, which can lead to inefficiencies in data read and write operations. Other obstacles include the high computational demands of traversal-based queries, especially within large-scale networks, and complexities in managing transactions in distributed graph environments. Additionally, the reliance on traditional centralized architectures limits the scalability of Online Transaction Processing (OLTP), creating bottlenecks due to contention, CPU overhead, and network bandwidth constraints.
This paper presents a thorough survey of graph databases. It begins by examining property models, query languages, and storage architectures, outlining the foundational aspects that users and developers typically engage with. Following this, it provides a detailed analysis of recent advancements in graph database technologies, evaluating these in the context of key aspects such as architecture, deployment, usage, and development, which collectively define the capabilities of graph database solutions.

A Survey of In-Network Systems for Intelligent, High-Efficiency AI and Topology Optimization

In-network computation represents a transformative approach to addressing the escalating demands of Artificial Intelligence (AI) workloads on network infrastructure. By leveraging the processing capabilities of network devices such as switches, routers, and Network Interface Cards (NICs), this paradigm enables AI computations to be performed directly within the network fabric, significantly reducing latency, enhancing throughput, and optimizing resource utilization. This paper provides a comprehensive analysis of optimizing in-network computation for AI, exploring the evolution of programmable network architectures, such as Software-Defined Networking (SDN) and Programmable Data Planes (PDPs), and their convergence with AI. It examines methodologies for mapping AI models onto resource-constrained network devices, addressing challenges like limited memory and computational capabilities through efficient algorithm design and model compression techniques. The paper also highlights advancements in distributed learning, particularly in-network aggregation, and the potential of federated learning to enhance privacy and scalability. Frameworks like Planter and Quark are discussed for simplifying development, alongside key applications such as intelligent network monitoring, intrusion detection, traffic management, and Edge AI. Future research directions, including runtime programmability, standardized benchmarks, and new applications paradigms, are proposed to advance this rapidly evolving field. This survey underscores the potential of in-network AI to create intelligent, efficient, and responsive networks capable of meeting the demands of next-generation AI applications.

Software Architecture Meets LLMs: A Systematic Literature Review

Large Language Models (LLMs) are used for many different software engineering tasks. In software architecture, they have been applied to tasks such as classification of design decisions, detection of design patterns, and generation of software architecture design from requirements. However, there is little overview on how well they work, what challenges exist, and what open problems remain. In this paper, we present a systematic literature review on the use of LLMs in software architecture. We analyze 18 research articles to answer five research questions, such as which software architecture tasks LLMs are used for, how much automation they provide, which models and techniques are used, and how these approaches are evaluated. Our findings show that while LLMs are increasingly applied to a variety of software architecture tasks and often outperform baselines, some areas, such as generating source code from architectural design, cloud-native computing and architecture, and checking conformance remain underexplored. Although current approaches mostly use simple prompting techniques, we identify a growing research interest in refining LLM-based approaches by integrating advanced techniques.

A Survey on Open-Source Edge Computing Simulators and Emulators: The Computing and Networking Convergence Perspective

Edge computing, with its low latency, dynamic scalability, and location awareness, along with the convergence of computing and communication paradigms, has been successfully applied in critical domains such as industrial IoT, smart healthcare, smart homes, and public safety. This paper provides a comprehensive survey of open-source edge computing simulators and emulators, presented in our GitHub repository (https://github.com/qijianpeng/awesome-edge-computing), emphasizing the convergence of computing and networking paradigms. By examining more than 40 tools, including CloudSim, NS-3, and others, we identify the strengths and limitations in simulating and emulating edge environments. This survey classifies these tools into three categories: packet-level, application-level, and emulators. Furthermore, we evaluate them across five dimensions, ranging from resource representation to resource utilization. The survey highlights the integration of different computing paradigms, packet processing capabilities, support for edge environments, user-defined metric interfaces, and scenario visualization. The findings aim to guide researchers in selecting appropriate tools for developing and validating advanced computing and networking technologies.

Posted by

Raymond P. Burkholder

in Machine Learning at 15:10

Introductions

The Matrix Calculus You Need For Deep Learning

This paper is an attempt to explain all the matrix calculus you need in order to understand the training of deep neural networks. We assume no math knowledge beyond what you learned in calculus 1, and provide links to help you refresh the necessary math where needed. Note that you do not need to understand this material before you start learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks, and wish to deepen their understanding of the underlying math. Don't worry if you get stuck at some point along the way---just go back and reread the previous section, and try writing down and working through some examples. And if you're still stuck, we're happy to answer your questions in the Theory category at theory category at fast ai forums. Note: There is a reference section at the end of the paper summarizing all the key matrix calculus rules and terminology discussed here. See related articles at explained.ai - Deep explanations of machine learning and related topics

Frugal Machine Learning for Energy-efficient, and Resource-aware Artificial Intelligence

Frugal Machine Learning (FML) refers to the practice of designing Machine Learning (ML) models that are efficient, cost-effective, and mindful of resource constraints. This field aims to achieve acceptable performance while minimizing the use of computational resources, time, energy, and data for both training and inference. FML strategies can be broadly categorized into input frugality, learning process frugality, and model frugality, each focusing on reducing resource consumption at different stages of the ML pipeline. This chapter explores recent advancements, applications, and open challenges in FML, emphasizing its importance for smart environments that incorporate edge computing and IoT devices, which often face strict limitations in bandwidth, energy, or latency. Technological enablers such as model compression, energy-efficient hardware, and data-efficient learning techniques are discussed, along with adaptive methods including parameter regularization, knowledge distillation, and dynamic architecture design that enable incremental model updates without full retraining. Furthermore, it provides a comprehensive taxonomy of frugal methods, discusses case studies across diverse domains, and identifies future research directions to drive innovation in this evolving field.

An Introduction to Discrete Variational Autoencoders

Variational Autoencoders (VAEs) are well-established as a principled approach to probabilistic unsupervised learning with neural networks. Typically, an encoder network defines the parameters of a Gaussian distributed latent space from which we can sample and pass realizations to a decoder network. This model is trained to reconstruct its inputs and is optimized through the evidence lower bound. In recent years, discrete latent spaces have grown in popularity, suggesting that they may be a natural choice for many data modalities (e.g. text). In this tutorial, we provide a rigorous, yet practical, introduction to discrete variational autoencoders -- specifically, VAEs in which the latent space is made up of latent variables that follow a categorical distribution. We assume only a basic mathematical background with which we carefully derive each step from first principles. From there, we develop a concrete training recipe and provide an example implementation, hosted at https://github.com/alanjeffares/discreteVAE.

A Practical Introduction to Deep Reinforcement Learning

Deep reinforcement learning (DRL) has emerged as a powerful framework for solving sequential decision-making problems, achieving remarkable success in a wide range of applications, including game AI, autonomous driving, biomedicine, and large language models. However, the diversity of algorithms and the complexity of theoretical foundations often pose significant challenges for beginners seeking to enter the field. This tutorial aims to provide a concise, intuitive, and practical introduction to DRL, with a particular focus on the Proximal Policy Optimization (PPO) algorithm, which is one of the most widely used and effective DRL methods. To facilitate learning, we organize all algorithms under the Generalized Policy Iteration (GPI) framework, offering readers a unified and systematic perspective. Instead of lengthy theoretical proofs, we emphasize intuitive explanations, illustrative examples, and practical engineering techniques. This work serves as an efficient and accessible guide, helping readers rapidly progress from basic concepts to the implementation of advanced DRL algorithms.

Posted by

Raymond P. Burkholder

in Machine Learning at 15:06

Friday, May 16. 2025

Network Security

A Review of Various Datasets for Machine Learning Algorithm-Based Intrusion Detection System: Advances and Challenges

IDS aims to protect computer networks from security threats by detecting, notifying, and taking appropriate action to prevent illegal access and protect confidential information. As the globe becomes increasingly dependent on technology and automated processes, ensuring secured systems, applications, and networks has become one of the most significant problems of this era. The global web and digital technology have significantly accelerated the evolution of the modern world, necessitating the use of telecommunications and data transfer platforms. Researchers are enhancing the effectiveness of IDS by incorporating popular datasets into machine learning algorithms. IDS, equipped with machine learning classifiers, enhances security attack detection accuracy by identifying normal or abnormal network traffic. This paper explores the methods of capturing and reviewing intrusion detection systems (IDS) and evaluates the challenges existing datasets face. A deluge of research on machine learning (ML) and deep learning (DL) architecture-based intrusion detection techniques has been conducted in the past ten years on various cybersecurity datasets, including KDDCUP'99, NSL-KDD, UNSW-NB15, CICIDS-2017, and CSE-CIC-IDS2018. We conducted a literature review and presented an in-depth analysis of various intrusion detection methods that use SVM, KNN, DT, LR, NB, RF, XGBOOST, Adaboost, and ANN. We provide an overview of each technique, explaining the role of the classifiers and algorithms used. A detailed tabular analysis highlights the datasets used, classifiers employed, attacks detected, evaluation metrics, and conclusions drawn. This article offers a thorough review for future IDS research.

Fog Intelligence for Network Anomaly Detection

Anomalies are common in network system monitoring. When manifested as network threats to be mitigated, service outages to be prevented, and security risks to be ameliorated, detecting such anomalous network behaviors becomes of great importance. However, the growing scale and complexity of the mobile communication networks, as well as the ever-increasing amount and dimensionality of the network surveillance data, make it extremely difficult to monitor a mobile network and discover abnormal network behaviors. Recent advances in machine learning allow for obtaining near-optimal solutions to complicated decision-making problems with many sources of uncertainty that cannot be accurately characterized by traditional mathematical models. However, most machine learning algorithms are centralized, which renders them inapplicable to a large-scale distributed wireless networks with tens of millions of mobile devices. In this article, we present fog intelligence, a distributed machine learning architecture that enables intelligent wireless network management. It preserves the advantage of both edge processing and centralized cloud computing. In addition, the proposed architecture is scalable, privacy-preserving, and well suited for intelligent management of a distributed wireless network.

GPML: Graph Processing for Machine Learning

The dramatic increase of complex, multi-step, and rapidly evolving attacks in dynamic networks involves advanced cyber-threat detectors. The GPML (Graph Processing for Machine Learning) library addresses this need by transforming raw network traffic traces into graph representations, enabling advanced insights into network behaviors. The library provides tools to detect anomalies in interaction and community shifts in dynamic networks. GPML supports community and spectral metrics extraction, enhancing both real-time detection and historical forensics analysis. This library supports modern cybersecurity challenges with a robust, graph-based approach.

Self-Supervised Transformer-based Contrastive Learning for Intrusion Detection Systems

As the digital landscape becomes more interconnected, the frequency and severity of zero-day attacks, have significantly increased, leading to an urgent need for innovative Intrusion Detection Systems (IDS). Machine Learning-based IDS that learn from the network traffic characteristics and can discern attack patterns from benign traffic offer an advanced solution to traditional signature-based IDS. However, they heavily rely on labeled datasets, and their ability to generalize when encountering unseen traffic patterns remains a challenge. This paper proposes a novel self-supervised contrastive learning approach based on transformer encoders, specifically tailored for generalizable intrusion detection on raw packet sequences. Our proposed learning scheme employs a packet-level data augmentation strategy combined with a transformer-based architecture to extract and generate meaningful representations of traffic flows. Unlike traditional methods reliant on handcrafted statistical features (NetFlow), our approach automatically learns comprehensive packet sequence representations, significantly enhancing performance in anomaly identification tasks and supervised learning for intrusion detection. Our transformer-based framework exhibits better performance in comparison to existing NetFlow self-supervised methods. Specifically, we achieve up to a 3% higher AUC in anomaly detection for intra-dataset evaluation and up to 20% higher AUC scores in inter-dataset evaluation. Moreover, our model provides a strong baseline for supervised intrusion detection with limited labeled data, exhibiting an improvement over self-supervised NetFlow models of up to 1.5% AUC when pretrained and evaluated on the same dataset. Additionally, we show the adaptability of our pretrained model when fine-tuned across different datasets, demonstrating strong performance even when lacking benign data from the target domain.

A Representation Learning Approach to Feature Drift Detection in Wireless Networks

AI is foreseen to be a centerpiece in next generation wireless networks enabling enabling ubiquitous communication as well as new services. However, in real deployment, feature distribution changes may degrade the performance of AI models and lead to undesired behaviors. To counter for undetected model degradation, we propose ALERT; a method that can detect feature distribution changes and trigger model re-training that works well on two wireless network use cases: wireless fingerprinting and link anomaly detection. ALERT includes three components: representation learning, statistical testing and utility assessment. We rely on MLP for designing the representation learning component, on Kolmogorov-Smirnov and Population Stability Index tests for designing the statistical testing and a new function for utility assessment. We show the superiority of the proposed method against ten standard drift detection methods available in the literature on two wireless network use cases.

Posted by

Raymond P. Burkholder

in Machine Learning at 00:57

Wednesday, May 14. 2025

Machine Learning

Why Gradients Rapidly Increase Near the End of Training

During long-duration Large Language Model (LLM) training runs the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.

A Multi-Step Comparative Framework for Anomaly Detection in IoT Data Streams

The rapid expansion of Internet of Things (IoT) devices has introduced critical security challenges, underscoring the need for accurate anomaly detection. Although numerous studies have proposed machine learning (ML) methods for this purpose, limited research systematically examines how different preprocessing steps--normalization, transformation, and feature selection--interact with distinct model architectures. To address this gap, this paper presents a multi-step evaluation framework assessing the combined impact of preprocessing choices on three ML algorithms: RNN-LSTM, autoencoder neural networks (ANN), and Gradient Boosting (GBoosting). Experiments on the IoTID20 dataset shows that GBoosting consistently delivers superior accuracy across preprocessing configurations, while RNN-LSTM shows notable gains with z-score normalization and autoencoders excel in recall, making them well-suited for unsupervised scenarios. By offering a structured analysis of preprocessing decisions and their interplay with various ML techniques, the proposed framework provides actionable guidance to enhance anomaly detection performance in IoT environments.

Intelligent Load Balancing Systems using Reinforcement Learning System

Load Balancing is a fundamental technology for scaling cloud infrastructure. It enables systems to distribute incoming traffic across backend servers using predefined algorithms such as round robin, weighted round robin, least connections, weighted least connections, resource based, weighted response time, source IP hash, and URL hash. However, traditional traffic balancing techniques are increasingly becoming inadequate in optimizing distribution times. Existing algorithms are struggling to meet the rising demands of internet traffic, often resulting in degraded user experiences.
To proactively address these issues particularly in areas like response time, distribution latency, and system uptime, we need to rethink how load balancing is implemented. Key challenges include traffic management, congestion control, intelligent scheduling, and the ability to determine when and when not to apply load balancing.

Efficient Telecom Specific LLM: TSLAM-Mini with QLoRA and Digital Twin Data

General-purpose large language models (LLMs), despite their broad capabilities accrued from open-world data, frequently exhibit suboptimal performance when confronted with the nuanced and specialized demands inherent in real-time telecommunications applications. This investigation addresses this critical limitation through the meticulous fine-tuning of TSLAM-Mini developed by NetoAI, a compact (3.8-billion parameter) causal language model architecturally derived from Phi-4 Mini Instruct 4B. The fine-tuning regimen leverages a bespoke dataset comprising 100,000 samples, strategically engineered to address 20 pivotal telecommunications use-cases, encompassing domains such as Network Fundamentals, IP Routing, MPLS, Network Security, Automation, OSS/BSS, RAN, Mobile Core, Satellite Communications, and Ethical AI. This dataset was curated utilizing NetoAI's DigiTwin platform, enriched with granular insights from venerated network Subject Matter Experts (SMEs) and authoritative RFC documents, thereby capturing high-fidelity representations of real-world network dynamics through simulations inspired by digital twin paradigms. Employing Quantized Low-Rank Adaptation (QLoRA), a state-of-the-art Parameter Efficient Fine-Tuning (PEFT) technique, we achieved substantial training efficiency and enabled prospective deployment on resource-constrained hardware. A novel evaluation framework, predicated on a high-capacity LLM (Qwen3-235B-A22B) functioning as an automated adjudicator, was instituted to rigorously assess instruction-following fidelity and response quality across the specified telecom use-cases. Empirical results unequivocally demonstrate TSLAM-Mini's superior aptitude in telecom-centric applications, underscoring the profound efficacy of domain-specific datasets and PEFT methodologies for advancing intelligent network management.

Graph-Based Floor Separation Using Node Embeddings and Clustering of WiFi Trajectories

Indoor positioning systems (IPSs) are increasingly vital for location-based services in complex multi-storey environments. This study proposes a novel graph-based approach for floor separation using Wi-Fi fingerprint trajectories, addressing the challenge of vertical localization in indoor settings. We construct a graph where nodes represent Wi-Fi fingerprints, and edges are weighted by signal similarity and contextual transitions. Node2Vec is employed to generate low-dimensional embeddings, which are subsequently clustered using K-means to identify distinct floors. Evaluated on the Huawei University Challenge 2021 dataset, our method outperforms traditional community detection algorithms, achieving an accuracy of 68.97%, an F1- score of 61.99%, and an Adjusted Rand Index of 57.19%. By publicly releasing the preprocessed dataset and implementation code, this work contributes to advancing research in indoor positioning. The proposed approach demonstrates robustness to signal noise and architectural complexities, offering a scalable solution for floor-level localization.

Posted by

Raymond P. Burkholder

in Machine Learning at 11:24

Tuesday, May 13. 2025

Machine Learning Interesting Concepts

LLM Cannot Discover Causality, and Should Be Restricted to Non-Decisional Support in Causal Discovery

This paper critically re-evaluates LLMs' role in causal discovery and argues against their direct involvement in determining causal relationships. We demonstrate that LLMs' autoregressive, correlation-driven modeling inherently lacks the theoretical grounding for causal reasoning and introduces unreliability when used as priors in causal discovery algorithms. Through empirical studies, we expose the limitations of existing LLM-based methods and reveal that deliberate prompt engineering (e.g., injecting ground-truth knowledge) could overstate their performance, helping to explain the consistently favorable results reported in much of the current literature. Based on these findings, we strictly confined LLMs' role to a non-decisional auxiliary capacity: LLMs should not participate in determining the existence or directionality of causal relationships, but can assist the search process for causal graphs (e.g., LLM-based heuristic search). Experiments across various settings confirm that, by strictly isolating LLMs from causal decision-making, LLM-guided heuristic search can accelerate the convergence and outperform both traditional and LLM-based methods in causal structure learning. We conclude with a call for the community to shift focus from naively applying LLMs to developing specialized models and training method that respect the core principles of causal discovery.

ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader learns how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: GitHub ToMAP.

Trading Under Uncertainty: A Distribution-Based Strategy for Futures Markets Using FutureQuant Transformer

In the complex landscape of traditional futures trading, where vast data and variables like real-time Limit Order Books (LOB) complicate price predictions, we introduce the FutureQuant Transformer model, leveraging attention mechanisms to navigate these challenges. Unlike conventional models focused on point predictions, the FutureQuant model excels in forecasting the range and volatility of future prices, thus offering richer insights for trading strategies. Its ability to parse and learn from intricate market patterns allows for enhanced decision-making, significantly improving risk management and achieving a notable average gain of 0.1193% per 30-minute trade over state-of-the-art models with a simple algorithm using factors such as RSI, ATR, and Bollinger Bands. This innovation marks a substantial leap forward in predictive analytics within the volatile domain of futures trading.

A critical assessment of reinforcement learning methods for microswimmer navigation in complex flows

Navigating in a fluid flow while being carried by it, using only information accessible from on-board sensors, is a problem commonly faced by small planktonic organisms. It is also directly relevant to autonomous robots deployed in the oceans. In the last ten years, the fluid mechanics community has widely adopted reinforcement learning, often in the form of its simplest implementations, to address this challenge. But it is unclear how good are the strategies learned by these algorithms. In this paper, we perform a quantitative assessment of reinforcement learning methods applied to navigation in partially observable flows. We first introduce a well-posed problem of directional navigation for which a quasi-optimal policy is known analytically. We then report on the poor performance and robustness of commonly used algorithms (Q-Learning, Advantage Actor Critic) in flows regularly encountered in the literature: Taylor-Green vortices, Arnold-Beltrami-Childress flow, and two-dimensional turbulence. We show that they are vastly surpassed by PPO (Proximal Policy Optimization), a more advanced algorithm that has established dominance across a wide range of benchmarks in the reinforcement learning community. In particular, our custom implementation of PPO matches the theoretical quasi-optimal performance in turbulent flow and does so in a robust manner. Reaching this result required the use of several additional techniques, such as vectorized environments and generalized advantage estimation, as well as hyperparameter optimization. This study demonstrates the importance of algorithm selection, implementation details, and fine-tuning for discovering truly smart autonomous navigation strategies in complex flows.

Large Wireless Localization Model (LWLM): A Foundation Model for Positioning in 6G Networks

Accurate and robust localization is a critical enabler for emerging 5G and 6G applications, including autonomous driving, extended reality (XR), and smart manufacturing. While data-driven approaches have shown promise, most existing models require large amounts of labeled data and struggle to generalize across deployment scenarios and wireless configurations. To address these limitations, we propose a foundation-model-based solution tailored for wireless localization. We first analyze how different self-supervised learning (SSL) tasks acquire general-purpose and task-specific semantic features based on information bottleneck (IB) theory. Building on this foundation, we design a pretraining methodology for the proposed Large Wireless Localization Model (LWLM). Specifically, we propose an SSL framework that jointly optimizes three complementary objectives: (i) spatial-frequency masked channel modeling (SF-MCM), (ii) domain-transformation invariance (DTI), and (iii) position-invariant contrastive learning (PICL). These objectives jointly capture the underlying semantics of wireless channel from multiple perspectives. We further design lightweight decoders for key downstream tasks, including time-of-arrival (ToA) estimation, angle-of-arrival (AoA) estimation, single base station (BS) localization, and multiple BS localization. Comprehensive experimental results confirm that LWLM consistently surpasses both model-based and supervised learning baselines across all localization tasks. In particular, LWLM achieves 26.0%--87.5% improvement over transformer models without pretraining, and exhibits strong generalization under label-limited fine-tuning and unseen BS configurations, confirming its potential as a foundation model for wireless localization.

Transforming Decoder-Only Transformers for Accurate WiFi-Telemetry Based Indoor Localization

Wireless Fidelity (WiFi) based indoor positioning is a widely researched area for determining the position of devices within a wireless network. Accurate indoor location has numerous applications, such as asset tracking and indoor navigation. Despite advances in WiFi localization techniques -- in particular approaches that leverage WiFi telemetry -- their adoption in practice remains limited due to several factors including environmental changes that cause signal fading, multipath effects, interference, which, in turn, impact positioning accuracy. In addition, telemetry data differs depending on the WiFi device vendor, offering distinct features and formats; use case requirements can also vary widely. Currently, there is no unified model to handle all these variations effectively. In this paper, we present WiFiGPT, a Generative Pretrained Transformer (GPT) based system that is able to handle these variations while achieving high localization accuracy. Our experiments with WiFiGPT demonstrate that GPTs, in particular Large Language Models (LLMs), can effectively capture subtle spatial patterns in noisy wireless telemetry, making them reliable regressors. Compared to existing state-of-the-art methods, our method matches and often surpasses conventional approaches for multiple types of telemetry. Achieving sub-meter accuracy for RSSI and FTM and centimeter-level precision for CSI demonstrates the potential of LLM-based localisation to outperform specialized techniques, all without handcrafted signal processing or calibration.

Posted by

Raymond P. Burkholder

in Machine Learning at 01:45

Friday, May 9. 2025

Installing LibTorch with Cuda on NVIDIA GeForce RTX 4070

For trying out some LSTM Machine Learning algorithms with my TradeFrame Algorithmic Trading Library, I wanted to install LibTorch with NVidia/Cuda support for hardware accelerating learning.

Do not install the nvidia-driver yet. It is part of the cuda deployment package. Only install headers, which are necessary for building kernal modules.

$ sudo apt install linux-headers-$(uname -r)

I used Installing C++ Distributions of PyTorch as a starting point. However, their example is CPU based. My desire is for a Cuda based installation. This meant going to the CUDA Zone and start the Download process. My configuration options were: Linux, x86_64, Debian, 12, deb (local).

Using the "deb (local)" with a complete file seemed to be the only way to ensure all components were available.

The steps, as of this writing, were:

wget https://developer.download.nvidia.com/compute/cuda/12.9.0/local_installers/cuda-repo-debian12-12-9-local_12.9.0-575.51.03-1_amd64.deb
sudo dpkg -i cuda-repo-debian12-12-9-local_12.9.0-575.51.03-1_amd64.deb
sudo cp /var/cuda-repo-debian12-12-9-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9

Install the open version of the nvidia drivers:

sudo apt-get install -y nvidia-open

See if the nouveau driver is installed.

$ lsmod |grep nouveau

If so, then run these commands to enable the nvidia driver and to blacklist the nouveau driver and reboot:

sudo mv /etc/modprobe.d/nvidia.conf.dpkg-new  /etc/modprobe.d/nvidia.conf
sudo update-initramfs -u

There is also the NVIDIA CUDA Installation Guide for Linux for further information.

The following changes are required for a successful compile of the example application below:

$ diff math_functions.h /etc/alternatives/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h
2556c2556
< extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 sinpi(double x);
---
> extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 sinpi(double x) noexcept (true);
2579c2579
< extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  sinpif(float x);
---
> extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  sinpif(float x) noexcept (true);
2601c2601
< extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 cospi(double x);
---
> extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 cospi(double x) noexcept (true);
2623c2623
< extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  cospif(float x);
---
> extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  cospif(float x) noexcept (true);

use this file to fix the above:

$ cat cuda_fix.sh
#!/bin/sh
header_file=/etc/alternatives/cuda/bin/../targets/x86_64-linux/include/crt/math_functions.h
sudo sed -i 's/sinpi(double x);/sinpi(double x) noexcept (true);/' $header_file
sudo sed -i 's/sinpif(float x);/sinpif(float x) noexcept (true);/' $header_file
sudo sed -i 's/cospi(double x);/cospi(double x) noexcept (true);/' $header_file
sudo sed -i 's/cospif(float x);/cospif(float x) noexcept (true);/' $header_file

The PyTorch LibTorch library can be downloaded from PyTorch Start Locally. Choose the C++/Java option with Cuda 12.8 (as of this writing). An appropriate link is presented. Download and expand the file into a development directory. LibTorch doesn't have at this moment a build for Cuda 12.9, but is referenced as 12.8.

The most recent can be found at https://download.pytorch.org/libtorch/nightly/cu128/libtorch-shared-with-deps-latest.zip.

It is probably advised to NOT use the Debian package, as it may be out of date: pytorch-cuda.

Expand the libtorch package and deploy to /usr/local/share/libtorch.

To test out the installation, I then created a subdirectory containing a couple of files. The first is the test code example-app.cpp:

#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor tensor = torch::rand({2, 3});
  std::cout << tensor << std::endl;
}

The second file is the CMakeLists.txt file. This is my version:

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)

cmake_policy(SET CMP0104 NEW)
cmake_policy(SET CMP0105 NEW)
project(example-app)

find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
set(CMAKE_CUDA_STANDARD 17)

add_executable(example-app example-app.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
set_property(TARGET example-app PROPERTY CXX_STANDARD 17)

Then to build the example:

mkdir build
cmake \
  -DCMAKE_PREFIX_PATH=/usr/local/share/libtorch \
  -DCMAKE_CUDA_ARCHITECTURES=native \
  -DCMAKE_BUILD_TYPE=DEBUG \
  -DCMAKE_CUDA_COMPILER=/etc/alternatives/cuda/bin/nvcc \
  -Dnvtx3_dir=/usr/local/cuda/targets/x86_64-linux/include/nvtx3  \
  ..
make
./example-app

Notes:

PREFIX_PATH points to the directory of your expanded libtorch download
CMAKE_CUDA_ARCHITECTURES provides a 'native' cuda solution, the build process will determine the specific gpu for which to build
CMAKE_BUILD_TYPE can be DEBUG or RELEASE
CMAKE_CUDA_COMPILER needs to be set, by using /etc/alternatives, these are softlinks to the version you desire (as were installed by the cuda installation)
nvtx3_dir is required, as the current libtorch library seems to still refer to nvtx and not nvtx3

If you get output along the lines of:

-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 5.0;8.0;8.6;8.9;9.0;9.0a;10.0;10.0a;10.1a;12.0;12.0a

My system has two RTX 4070 cards, and can be verified with (an extract is shown with important parts, noticing that the nvidia driver is properly shown):

$ sudo lshw -c video
  *-display
    product: Arrow Lake-S [Intel Graphics]
    configuration: depth=32 driver=i915 latency=0 mode=3840x2160 resolution=3840,2160 visual=truecolor xres=3840 yres=2160
  *-display
    product: AD103 [GeForce RTX 4070]
    configuration: driver=nvidia latency=0
  *-display
    product: AD103 [GeForce RTX 4070]
    configuration: driver=nvidia latency=0

Therefore, the output of my cmake process will include gpu specific selections:

-- Autodetected CUDA architecture(s):  8.9 8.9
-- Added CUDA NVCC flags for: -gencode;arch=compute_89,code=sm_89

And running the generated binary results in valid output:

$ ./example-app
0.7141  0.9744  0.3179
0.7794  0.9281  0.7529
[ CPUFloatType{2,3} ]

Continue reading "Installing LibTorch with Cuda on NVIDIA GeForce..." »

Posted by

Raymond P. Burkholder

in C++ at 03:09 | Comments (0)

(Page 1 of 2, totaling 18 entries) next page »

Raymond P. Burkholder - Things I Do

Entries from May 2025

Google Analytics

Google Custom Search

Categories

Syndicate This Blog

Saturday, May 31. 2025

Trends in Trading

DNS

Agriculture

Leaps in Thought

Simulating What Didn't Happen

Wednesday, May 28. 2025

Continual / Continuous / Continuity

Saturday, May 24. 2025

Data Sources

Thursday, May 22. 2025

FPGA Convolutional Neural Networks

AI Agents

Saturday, May 17. 2025

Surveys

Introductions

Friday, May 16. 2025

Network Security

Wednesday, May 14. 2025

Machine Learning

Tuesday, May 13. 2025

Machine Learning Interesting Concepts

Friday, May 9. 2025

Installing LibTorch with Cuda on NVIDIA GeForce RTX 4070

Calendar

Archives

	May '25
Mo	Tu	We	Th	Fr	Sa	Su
Friday, July 18. 2025
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31