www.design-reuse-embedded.com

Crafting a Silicon Lifecycle Management Strategy for HPC and Data Centers

As data center computing and HPC advance, the stakes for ensuring reliability are high. Learn how to develop a silicon lifecycle management (SLM) strategy that ensures a successful future for your designs.

www.allaboutcircuits.com, Jun. 27, 2024 – 

From advances in mathematical modeling to climate projections, supercomputers play a crucial role in answering today's largest problems, while the cloud data centers powering them process and move extreme volumes of data.

With all that in mind, demand for high-performance computing (HPC) and enormous amounts of data storage is greater now than ever.

As the electronic systems that power HPC and data centers become more advanced, device aging, thermal challenges, power constraints, and other issues challenge semiconductor designers (Figure 1). A lesser-known issue is Silent Data Corruption (SDC): undetected errors that occur within data centers for unknown and unexpected reasons.

Silent Data Corruption a Growing Problem

Because SDCs appear random and are difficult to detect, they are becoming a widespread issue across the semiconductor industry and beyond. In a 2021 report on SDC, Meta described running a silent error test scenario across hundreds of thousands of machines in its fleet and finding hundreds of CPUs affected by these silent errors.

SDC can cause widespread problems within infrastructure systems, so consistent testing during manufacturing and in the field is imperative. In today's digital era, millions of operations or more happen within and across devices, so even a few system errors can propagate widely. If an error isn't detected and mitigated quickly, it can lead to data loss and affect business operations and user experiences at hyperscale.
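One common way to surface silent errors is to run a deterministic workload and compare its result with a known-good ("golden") value. The sketch below illustrates only that idea; it is not Meta's test methodology, and the names `workload` and `check_for_silent_errors` are hypothetical.

```python
import hashlib


def workload(seed: int, rounds: int = 1000) -> str:
    """Deterministic integer workload (hypothetical); returns a digest of its results."""
    state = seed
    digest = hashlib.sha256()
    for _ in range(rounds):
        # 64-bit linear congruential step: same input must always give same output.
        state = (state * 6364136223846793005 + 1442695040888963407) % 2**64
        digest.update(state.to_bytes(8, "little"))
    return digest.hexdigest()


def check_for_silent_errors(seeds, golden, run=workload):
    """Return the seeds whose computed result differs from the golden value."""
    return [seed for seed in seeds if run(seed) != golden[seed]]
```

In practice the golden values would be produced on hardware known to be good, and the check would be repeated periodically on each machine in the field; any mismatch marks a candidate for deeper diagnosis.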

To address SDC, designers must know what is happening beneath the surface of a chip to ensure the reliability, availability, and serviceability (RAS) of devices. That calls for a silicon lifecycle management (SLM) strategy, because awareness of long-term RAS implications is key to successful product lifecycle management.

What is a Silicon Lifecycle Management Strategy?

SLM is an emerging concept that consists of the monitoring, analysis, and optimization of devices throughout design and development to ensure that silicon "health" remains robust, meaning that the chip performs as intended.

Ensuring that a chip works when it is produced and shipped is not enough; it also needs continuous monitoring and testing throughout its life. Data center providers and their silicon partners must be able to monitor and analyze the components inside each chip, from the transistor to the data being transmitted. Doing so helps them not only identify and track expected degradation and potential issues, but also troubleshoot and fix problems.

To guarantee RAS throughout a chip's life, an SLM strategy provides the following actionable insights:

In-Design–Pinpoint the best candidate design components in the device for monitoring. Embed monitor IP directly into the infrastructure of the design.

In-Ramp–Focus on the highest yield limiter candidates, conduct accurate failure analysis, and adjust the design and fab process to satisfy high yield requirements.

In-Production–Detect yield and quality outliers through automated insights, perform root-cause analysis across various stages of high-volume manufacturing, and course-correct in the semiconductor supply chain as necessary.

In-Field–Calculate silicon health through predictive maintenance and advance performance metrics such as power and throughput, especially as the device ages.
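As a rough illustration of the outlier detection mentioned for the In-Production phase, the sketch below flags dies whose monitor reading sits far from the population median, using the median absolute deviation (a standard robust statistic). The reading names, the choice of statistic, and the 3.5 threshold are assumptions for illustration, not part of any specific SLM product.

```python
from statistics import median


def flag_outliers(readings: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return the IDs whose reading is an outlier by the modified z-score."""
    values = list(readings.values())
    med = median(values)
    # Median absolute deviation: robust to the very outliers we are looking for.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: most readings are identical, so flag any that differ.
        return [die for die, v in readings.items() if v != med]
    return [
        die
        for die, v in readings.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical on-chip monitor readings (for example, ring-oscillator frequency in MHz).
readings = {"die_a": 100.0, "die_b": 101.0, "die_c": 99.5, "die_d": 100.5, "die_e": 80.0}
print(flag_outliers(readings))
```

A median-based score is used here rather than mean and standard deviation because a single badly degraded die would otherwise inflate the spread and hide itself.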
