Progressing from Anomaly Detection to Automated Log Labeling and Pioneering Root Cause Analysis - accepted at AIOPS workshop @ICDM 2023
The realm of AIOps is transforming IT landscapes with the power of AI and ML. Despite the challenge of limited labeled data, supervised models show promise, emphasizing the importance of leveraging labels for training, especially in deep learning contexts. This study enhances the field by introducing a taxonomy for log anomalies and exploring automated data labeling to mitigate labeling challenges. It goes further by investigating the potential of diverse anomaly detection techniques and their alignment with specific anomaly types. However, the exploration doesn't stop at anomaly detection. The study envisions a future where root cause analysis follows anomaly detection, unraveling the underlying triggers of anomalies. This uncharted territory holds immense potential for revolutionizing IT systems management. In essence, this paper enriches our understanding of anomaly detection, and automated labeling, and sets the stage for transformative root cause analysis. Together, these advances promise more resilient IT systems, elevating operational efficiency and user satisfaction in an ever-evolving technological landscape.
A Roadmap towards Intelligent Operations for Reliable Cloud Computing Systems
The increasing complexity and usage of cloud systems have made it challenging for service providers to ensure reliability. This paper highlights two main challenges, namely internal and external factors, that affect the reliability of cloud microservices. Afterward, we discuss the data-driven approach that can resolve these challenges from four key aspects: ticket management, log management, multimodal analysis, and the microservice resilience testing approach. The experiments conducted show that the proposed data-driven AIOps solution significantly enhances system reliability from multiple angles.
A Systematic Mapping Study in AIOps
IT systems of today are becoming larger and more complex, rendering their human supervision more difficult. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to AI and Big Data. However, past AIOps contributions are scattered, unorganized and missing a common terminology convention, which renders their discovery and comparison impractical. In this work, we conduct an in-depth mapping study to collect and organize the numerous scattered contributions to AIOps in a unique reference index. We create an AIOps taxonomy to build a foundation for future contributions and allow an efficient comparison of AIOps papers treating similar problems. We investigate temporal trends and classify AIOps contributions based on the choice of algorithms, data sources and the target components. Our results show a recent and growing interest towards AIOps, specifically to those contributions treating failure-related tasks (62%), such as anomaly detection and root cause analysis.
AIOps with Data, Analytics, and Intelligent Automation: A Foundational Capability for Modern IT Operations - sponsored by BMC but has useful general background information