
Uncovering Silent Data Errors with AI

Silent data errors (SDEs) are a growing problem in data centers, and today's testing and identification solutions are falling short. New machine-learning (ML) methodologies can offer a way to proactively identify and deal with silent data errors.

What are SDEs?

www.eetimes.com, Oct. 03, 2024

An SDE is a hardware error that occurs when a CPU, GPU, AI engine or other computational element makes errors in computation or instruction execution with the data it processes. The data is corrupted, but there are no hardware or software alarms, so the data still appears valid. Because SDEs are undetected, they can cause unpredictable behavior within local hardware and potentially spread corrupt data into the larger system, manifesting as application software issues or problems in networking hardware.
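To make the "no alarm, data still appears valid" point concrete, here is a minimal sketch in Python. It simulates a fault by flipping one bit of an operand in software; the helper names (flip_bit, faulty_sum), the input values and the choice of bit are all hypothetical, and a real SDE originates in hardware rather than in code like this. The corrupted run raises no exception and returns an ordinary float, so the error only becomes visible when the result is compared with a known-good reference.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its 64-bit IEEE 754 encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def faulty_sum(values, bad_index, bit):
    """Sum values while silently corrupting one operand (simulated fault)."""
    total = 0.0
    for i, v in enumerate(values):
        # No exception is raised here: the corrupted operand is a valid float.
        total += flip_bit(v, bit) if i == bad_index else v
    return total


values = [1.0, 2.0, 3.0, 4.0]
reference = sum(values)                              # known-good result: 10.0
suspect = faulty_sum(values, bad_index=2, bit=51)    # 3.0 becomes 2.0, so 9.0

# The only sign of trouble is a mismatch against the reference.
mismatch = suspect != reference
print(reference, suspect, mismatch)
```

In this sketch the comparison against a reference stands in for whatever redundancy a real system has available; without some such cross-check the wrong value would simply flow downstream.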

In a worst-case scenario within a data center, silent data errors can result in permanent data loss and can crash a server or a portion of the cloud by adversely affecting the network interface. Even outside such a scenario, issues caused by SDEs can take a long time to debug and resolve, with significant costs in disrupted workflows, engineering time and operational expenses.

SDEs are not a new problem; companies with data centers have been battling them for years. A 2021 paper from Google titled "Cores that don't count" described the effects seen by cloud operators. The researchers described "mercurial" cores that were difficult to locate and would create silent data errors (also called corrupt execution errors). They said that the fundamental reason we are just learning about mercurial cores is "ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design."

In a 2022 paper from engineers at Meta titled "Detecting silent data corruptions in the wild," the authors noted, "With an occurrence rate of one fault within a thousand devices, silent data corruptions have the ability to impact numerous applications."
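A rough back-of-the-envelope calculation shows why a rate of one in a thousand matters at data center scale. The sketch below assumes faults are independent across devices and uses hypothetical fleet sizes; neither assumption comes from the paper, and the function names are illustrative only.

```python
def expected_faulty(num_devices: int, fault_rate: float = 1e-3) -> float:
    """Expected number of faulty devices at the quoted 1-in-1,000 rate."""
    return num_devices * fault_rate


def prob_at_least_one_faulty(num_devices: int, fault_rate: float = 1e-3) -> float:
    """Probability that a group of devices contains at least one faulty device.

    Assumes each device is faulty independently with probability fault_rate.
    """
    return 1.0 - (1.0 - fault_rate) ** num_devices


# Hypothetical fleet sizes, chosen only for illustration.
for n in (100, 1_000, 100_000):
    print(n, round(expected_faulty(n), 1), round(prob_at_least_one_faulty(n), 3))
```

Under these assumptions, a job spread across 1,000 devices is more likely than not to touch at least one faulty device, and a fleet of 100,000 devices would be expected to contain about 100 of them.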

Over time, cloud operators have presented additional papers explaining in more detail what they are seeing in their data centers and outlining methodologies for isolating the failed chips.

Semiconductor failures

Semiconductor failures are a well-understood problem in the industry, and the failure modes that cause silent data errors are no different from other failures discovered during a chip's normal product life. Figure 1 shows the chip lifecycle and the failure rate during each phase of it.

[Figure 1: Chip failure rate during each phase of the chip lifecycle]
