This paper offers a solution to overcome the complexities of production system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40,960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use.

With Beacon's deployment on TaihuLight for more than three years, we demonstrate Beacon's effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, and others are currently being addressed. Encouraged by Beacon's success in I/O monitoring, we extend it to monitor interconnection networks, which are another contention point on supercomputers. In addition, we demonstrate Beacon's generality by extending it to other supercomputers. Both Beacon's code and part of the collected monitoring data have been released.

1 INTRODUCTION

Modern supercomputers are networked systems with increasingly deep storage hierarchies, serving applications with growing scale and complexity. The long I/O path from storage media to application, combined with complex software stacks and hardware configurations, makes I/O optimizations increasingly challenging for application developers and supercomputer administrators. In addition, because I/O utilizes heavily shared system components (unlike computation or memory accesses), it usually suffers from substantial inter-workload interference, causing high performance variance. Online tools that can capture/analyze I/O activities and guide optimization are urgently needed. They also need to provide I/O usage information and performance records to guide future systems' design, configuration, and deployment.

To this end, several profiling/tracing tools and frameworks have been developed, including application-side tools (e.g., Darshan, ScalableIOTrace, and IOPin), back-end-side tools (e.g., LustreDU, IOSI, and LIOProf), and multi-layer tools (e.g., EZIOTracer, GUIDE, and Logaider). These proposed tools, however, have one or more of the following limitations. Application-oriented tools often require developers to instrument their source code or link extra libraries. They also do not offer intuitive ways to analyze inter-application I/O performance behaviors such as interference issues. Back-end-oriented tools can collect system-level performance data and monitor cross-application interactions but have difficulty in identifying performance issues for specific applications and in finding their root causes. Finally, problematic applications issuing inefficient I/O requests escape the radar of back-end-side analytical methods that rely on high-bandwidth applications.

This paper reports the design, implementation, and deployment of a lightweight, end-to-end I/O resource monitoring and diagnosis system, Beacon, for TaihuLight, currently the fourth-ranked supercomputer in the world. It works with TaihuLight's 40,960 compute nodes (over ten million cores in total), 288 forwarding nodes, 288 storage nodes, and two metadata nodes. Beacon integrates front-end tracing and back-end profiling into a seamless framework, enabling tasks such as automatic per-application I/O behavior profiling, I/O bottleneck/interference analysis, and system anomaly detection. To the best of our knowledge, this is the first system-level, multi-layer monitoring and real-time diagnosis framework deployed on ultra-scale supercomputers. Beacon collects performance data simultaneously from different types of nodes (including the compute, I/O forwarding, storage, and metadata nodes) and analyzes them collaboratively, without requiring any involvement of application developers. Its elaborate collection scheme and aggressive compression minimize the system cost: only 85 part-time servers are needed to monitor the entire 40,960-node system, with less than 1% performance overhead on user applications. We have deployed Beacon for production use since April 2017.