www.design-reuse-embedded.com
Uncovering Silent Data Errors with AI
Silent data errors (SDEs) are a growing problem in data centers, and today's testing and identification solutions are falling short. New machine-learning (ML) methodologies can offer a way to proactively identify and deal with silent data errors.
www.eetimes.com, Oct. 03, 2024 –
What are SDEs?
An SDE is a hardware error that occurs when a CPU, GPU, AI engine or other computational element makes errors in computation or instruction execution with the data it processes. The data is corrupted, but there are no hardware or software alarms, so the data still appears valid. Because SDEs are undetected, they can cause unpredictable behavior within local hardware and potentially spread corrupt data into the larger system, manifesting as application software issues or problems in networking hardware.
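To make the definition concrete, the following is a minimal Python sketch of why such an error is "silent" and how a known-answer test might expose it. Everything in it is an illustrative assumption rather than anything described in the article: `faulty_multiply` simulates a hypothetical defect that flips one bit of the result for certain operands, and `KNOWN_ANSWERS` is a made-up table of precomputed results. The faulty function raises no exception and returns a plausible-looking integer, so only a comparison against an independently known answer reveals the corruption.

```python
def healthy_multiply(a: int, b: int) -> int:
    """Reference behaviour: an ordinary integer multiply."""
    return a * b


def faulty_multiply(a: int, b: int) -> int:
    """Simulated defective unit (hypothetical): flips bit 3 of the
    product whenever the first operand is a multiple of 7.
    No error is raised, so the caller sees a normal-looking value."""
    product = a * b
    if a % 7 == 0:
        product ^= 1 << 3
    return product


# Hypothetical known-answer table: (inputs, precomputed correct result).
KNOWN_ANSWERS = [
    ((3, 5), 15),
    ((14, 2), 28),
    ((21, 4), 84),
]


def known_answer_test(multiply) -> list:
    """Return the inputs whose result differs from the stored answer."""
    return [
        args
        for args, expected in KNOWN_ANSWERS
        if multiply(*args) != expected
    ]


if __name__ == "__main__":
    print("healthy unit mismatches:", known_answer_test(healthy_multiply))
    print("faulty unit mismatches: ", known_answer_test(faulty_multiply))
```

In this toy example the defect only shows up for some operand patterns, which is one reason a small set of test vectors can miss a real fault entirely.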
In a worst-case scenario within a data center, silent data errors can result in permanent data loss and crash a server or portion of the cloud by adversely affecting the network interface. Even outside such a scenario, issues caused by SDEs can take a long time to debug and resolve, with significant losses in terms of workflow, engineering and operational expenses.
SDEs are not a new problem; companies with data centers have been battling them for years. A 2021 paper from Google titled "Cores that don't count" described the effects seen by cloud operators. The researchers described "mercurial" cores that were difficult to locate and would create silent data errors (also called corrupt execution errors). They said that the fundamental reason we are just learning about mercurial cores is "ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design."
In a 2022 paper from engineers at Meta titled "Detecting silent data corruptions in the wild," the authors noted, "With an occurrence rate of one fault within a thousand devices, silent data corruptions have the ability to impact numerous applications."
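A rough back-of-the-envelope calculation shows why a rate of this size matters at fleet scale. The sketch below assumes, purely for illustration, that "one fault within a thousand devices" can be read as an independent per-device probability of 1/1000; the fleet sizes are hypothetical and not taken from the paper.

```python
def expected_faulty(devices: int, rate: float = 1 / 1000) -> float:
    """Expected number of affected devices in a fleet, assuming each
    device is faulty independently with probability `rate`."""
    return devices * rate


def prob_at_least_one(devices: int, rate: float = 1 / 1000) -> float:
    """Probability that a fleet contains at least one affected device,
    under the same independence assumption."""
    return 1 - (1 - rate) ** devices


if __name__ == "__main__":
    # Hypothetical fleet sizes, chosen only to show the trend.
    for fleet in (100, 1_000, 100_000):
        print(
            f"{fleet:>7} devices: "
            f"expected faulty = {expected_faulty(fleet):.1f}, "
            f"P(at least one) = {prob_at_least_one(fleet):.4f}"
        )
```

Under these assumptions, a fleet of 1,000 devices already has a roughly 63% chance of containing at least one affected part, and the expected count grows linearly with fleet size.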
Over time, cloud operators have presented additional papers explaining in more detail what they are seeing in their data centers and outlining methodologies for isolating the failed chips.
Semiconductor failures
Semiconductor failures are a well-understood problem in the industry, and failure modes that cause silent data errors are no different from other failures discovered during normal chip product life. The lifecycle of the chip and the failure rate of the chip during each section of the lifecycle are shown in Figure 1.