Workshop on Fault Tolerance for HPC at eXtreme Scale
Abstract deadline:
Full paper deadline: 2018-08-30
Conference date: 2018-11-16
Conference difficulty:
CCF classification: None
Location: Dallas, TX, USA
Overview
Topics include, but are not limited to:
-Failure data analysis and field studies
-Power, performance, resilience (PPR) assessments / tradeoffs
-Novel fault-tolerance techniques and implementations
-Emerging hardware and software technology for resilience
-Silent data corruption (SDC) detection / correction techniques
-Advances in reliability monitoring, analysis, and control of highly complex systems
-Failure prediction, error preemption, and recovery techniques
-Fault-tolerant programming models
-Models for software and hardware reliability
-Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
-Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
-Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
-Near-threshold-voltage implications and evaluations for reliability
-Benchmarks and experimental environments including fault injection
-Frameworks and APIs for fault-tolerance and fault management