SCHEDULE: NOV 15-20, 2015
Analysis of Node Failures in High Performance Computers Based on System Logs
SESSION: Regular & ACM Student Research Competition Poster Reception
EVENT TYPE: Posters, Receptions, ACM Student Research Competition
EVENT TAG(S): HPC Beginner Friendly, Regular Poster
TIME: 5:15PM - 7:00PM
SESSION CHAIR(S): Michela Becchi, Manish Parashar, Dorian C. Arnold
AUTHOR(S): Siavash Ghiasvand, Florina M. Ciorba, Ronny Tschueter, Wolfgang E. Nagel
ROOM: Level 4 - Lobby
ABSTRACT:
The growth in size and complexity of HPC systems leads to a rapid increase in their failure rates. In the near future, the mean time between failures of HPC systems is expected to become too short, and current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is thus essential to prevent the destructive effects of failures. Based on measurements of a production system at TU Dresden over an 8-month period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study lies in achieving a clearer understanding of correlations between observed node failures and in enabling failure detection as early as possible. The results aim to help system administrators minimize (or prevent) the destructive effects of failures.
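As a rough illustration of what "correlation of node failures in time" can mean in practice, the following minimal Python sketch counts pairs of nodes whose failures occur close together. It is not the authors' method: the event format `(timestamp_seconds, node_id)`, the function name `temporally_correlated_pairs`, and the 60-second window are all assumptions made for this example, and spatial correlation (for instance, by rack or chassis) is not covered.

```python
from collections import Counter


def temporally_correlated_pairs(events, window_s=60):
    """Count node pairs whose failures fall within `window_s` seconds.

    events: iterable of (timestamp_seconds, node_id) tuples (hypothetical
    format; real system logs would need parsing first).
    """
    ordered = sorted(events)
    pair_counts = Counter()
    start = 0  # index of the oldest event still inside the window
    for i, (t_i, node_i) in enumerate(ordered):
        # Drop events that are more than window_s older than the current one.
        while ordered[start][0] < t_i - window_s:
            start += 1
        # Pair the current failure with every earlier failure in the window.
        for _, node_j in ordered[start:i]:
            if node_j != node_i:
                pair_counts[tuple(sorted((node_i, node_j)))] += 1
    return pair_counts


# Made-up sample data: n1 and n2 fail together twice, n3 fails alone.
sample = [(0, "n1"), (10, "n2"), (500, "n1"), (520, "n2"), (5000, "n3")]
pairs = temporally_correlated_pairs(sample)
```

Pairs with high counts are candidates for closer inspection; a count alone does not establish that one failure caused the other.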
Chair/Author Details:
Michela Becchi (Chair) - University of Missouri
Manish Parashar (Chair) - Rutgers University
Dorian C. Arnold (Chair) - University of New Mexico
Siavash Ghiasvand - Technical University of Dresden
Florina M. Ciorba - University of Basel
Ronny Tschueter - Technical University of Dresden
Wolfgang E. Nagel - Technical University of Dresden
