Senior Staff Engineer, Memory Fault Management Architect

4 Months ago • 10-15 Years • Research & Development • $180,950 PA - $289,050 PA

Job Summary

Job Description

As a Senior Staff Engineer, Memory Fault Management Architect, you will be part of an incubation team focused on transforming the customer quality experience for Samsung memory products. This role involves analyzing massive datasets from memory fleet telemetry to identify failure modes, project failure rates, and develop proactive solutions to minimize system downtime. You will collaborate with customers, contribute to industry standardization efforts (OCP), and design/develop RAS algorithms (page offlining, hPPR). Responsibilities include recommending solutions to mitigate DRAM failure rates, communicating ECC schemes, and establishing the value of in-field fault management architecture. The position requires deep knowledge of SOC controllers, memory operations, RAS features, ECC design, and Linux kernel experience.
Must have:
  • 10+ years experience in hardware fault management
  • Knowledge of platform memory subsystem and RAS
  • ECC design, verification, and reverse engineering
  • Understanding of DRAM and HBM failure modes
  • Excellent communication and collaboration skills
Good to have:
  • Linux kernel commit experience
  • Memory controller register modification
Perks:
  • 4+ weeks paid time off
  • Medical/Dental/Vision/401k
  • Fertility care or adoption stipend
  • Medical travel support
  • On-site gym and cafe
  • Virtual classes
  • Flexible work environment

Job Details

Please Note:

To provide the best candidate experience amidst our high application volumes, each candidate is limited to 10 applications across all open jobs within a 6-month period. 

Advancing the World’s Technology Together
Our technology solutions power the tools you use every day--including smartphones, electric vehicles, hyperscale data centers, IoT devices, and so much more. Here, you’ll have an opportunity to be part of a global leader whose innovative designs are pushing the boundaries of what’s possible and powering the future. 

We believe innovation and growth are driven by an inclusive culture and a diverse workforce. We’re dedicated to empowering people to be their true selves. Together, we’re building a better tomorrow for our employees, customers, partners, and communities.

Conventional DRAM failure analysis relied on electrical and physical failure analysis (FA). In the data center era, however, field failure information is much easier to track. Using this data set, the Fault Management team identifies DRAM failure modes and abnormalities and projects failure rates.

You will be part of an incubation team working on in-field telemetry intended to transform the Customer Quality Experience for Samsung memory products. Fault Management is the future of quality, minimizing system downtime within AI/ML hardware deployments and the workloads of the future. We analyze trends and patterns from enormous memory fleet telemetry to bucketize failures and perform virtual root-cause analysis. Telemetry analysis helps us design solutions to proactively avoid system downtime. We conduct research and development both in-house and in collaboration with the industry, with the opportunity to publish our findings through whitepapers and conferences. We are looking for innovative and passionate thinkers who can work in a start-up environment and are excited to shape the future of data centers around the world. Join us in our mission!

What You'll Do

  • Based on knowledge of SOC controllers and memory operation, including RAS features, find and recommend better solutions to mitigate the field DRAM failure rate.
  • Communicate better ECC schemes to customers based on Samsung DRAM failure modes (DQ and burst).
  • Interface with customers to establish the value of enabling an in-field fault management architecture.
  • Contribute to the standardization of DRAM/HBM failure logging in the OCP.
  • Propose and develop platform RAS (Reliability, Availability, Serviceability) algorithms for memory fault management, such as page offlining and hPPR, and conduct POCs with known-failing DIMMs in real servers and applications.

Location: Hybrid, with at least 3 days in the San Jose, CA office and the remainder of the time working remotely

Job ID: 42448

What You Bring

  • Bachelor's with 15+ years of relevant industry experience, Master's with 13+ years, or PhD with 10+ years; experience in hardware fault management, reliability, data center fleet management, or a related technical field preferred.
  • Knowledge of the platform memory subsystem and platform RAS (Reliability, Availability, Serviceability) features such as ECC, page offlining, hPPR, and hardware sparing.
  • ECC design, verification, and reverse engineering experience.
  • Understanding of the address mapping between CPU and memory.
  • Memory controller register modification.
  • Linux kernel commit experience.
  • Understanding of DRAM and HBM failure modes.
  • Excellent communication and interpersonal skills.
  • Ability to work independently and as part of a team.
  • You’re inclusive, adapting your style to the situation and diverse global norms of our people.
  • An avid learner, you approach challenges with curiosity and resilience, seeking data to help build understanding.
  • You’re collaborative, building relationships, humbly offering support and openly welcoming approaches.
  • Innovative and creative, you proactively explore new ideas and adapt quickly to change.

What We Offer
The pay range below is for all roles at this level across all US locations and functions. Individual pay rates depend on a number of factors—including the role’s function and location, as well as the individual’s knowledge, skills, experience, education, and training. We also offer incentive opportunities that reward employees based on individual and company performance. 

This is in addition to our diverse package of benefits centered around the wellbeing of our employees and their loved ones. In addition to the usual Medical/Dental/Vision/401k, our inclusive rewards plan empowers our people to care for their whole selves. An investment in your future is an investment in ours.

Give Back: With a charitable giving match and frequent opportunities to get involved, we take an active role in supporting the community.
Enjoy Time Away: You’ll start with 4+ weeks of paid time off a year, plus holidays and sick leave, to rest and recharge.
Care for Family: Whatever family means to you, we want to support you along the way, including a stipend for fertility care or adoption, medical travel support, and an errand service.
Prioritize Emotional Wellness: With on-demand apps and paid therapy sessions, you’ll have support no matter where you are.
Stay Fit: Eating well and being active are important parts of a healthy life. Our onsite Café and gym, plus virtual classes, make it easier.
Embrace Flexibility: Benefits are best when you have the space to use them. That’s why we facilitate a flexible environment so you can find the right balance for you.

Base Pay Range

$180,950 - $289,050 USD

Equal Opportunity Employment Policy 

Samsung Semiconductor takes pride in being an equal opportunity workplace dedicated to fostering an environment where all individuals feel valued and empowered to excel, regardless of race, religion, color, age, disability, sex, gender identity, sexual orientation, ancestry, genetic information, marital status, national origin, political affiliation, or veteran status.

When selecting team members, we prioritize talent and qualities such as humility, kindness, and dedication. We extend comprehensive accommodations throughout our recruiting processes for candidates with disabilities, long-term conditions, neurodivergent individuals, or those requiring pregnancy-related support. All candidates scheduled for an interview will receive guidance on requesting accommodations.

Recruiting Agency Policy

We do not accept unsolicited resumes. Only authorized recruitment agencies that have a current and valid agreement with Samsung Semiconductor, Inc. are permitted to submit resumes for any job openings.

Covid-19 Policy
To help keep our employees, customers, and communities safe, we’ve developed guidelines for our teams. Currently, we encourage vaccination for all employees and may require it depending on job functions (e.g., traveling for business, meeting with customers). While visiting our offices or attending team events, we ask employees to complete a daily health questionnaire and a weekly COVID test. Our COVID policies are subject to change depending on public health, regulatory, and business circumstances.

Applicant Privacy Policy
https://semiconductor.samsung.com/us/careers/privacy

 

Similar Jobs

Google - Software Engineer III, Security, Privacy, Sandboxing

Google

Munich, Bavaria, Germany (On-Site)
1 Month ago
QuinStreet - Sales Executive

QuinStreet

United States (Remote)
4 Weeks ago
OLIVER Agency - Senior SEO Content Writer

OLIVER Agency

Manila, Metro Manila, Philippines (On-Site)
2 Weeks ago
Google - Staff Software Engineer, Google Cloud

Google

Pune, Maharashtra, India (On-Site)
6 Months ago
Axon - Senior Firmware Engineer I

Axon

London, England, United Kingdom (Hybrid)
2 Weeks ago
ByteDance - Research Scientist, Data Management and Security - Infrastructure System Lab

ByteDance

San Jose, California, United States (On-Site)
1 Month ago
Google - Engineering Manager, Gemini Code Assist

Google

Warsaw, Masovian Voivodeship, Poland (On-Site)
1 Month ago
NVIDIA - Senior Signal and Power Integrity Engineer - Hardware

NVIDIA

Austin, Texas, United States (On-Site)
3 Months ago
Rivos - Silicon Logic Formal Verification - Full Time

Rivos

Austin, Texas, United States (Hybrid)
7 Months ago
KPIT - C++ Expert

KPIT

Bengaluru, Karnataka, India (Hybrid)
8 Months ago

Similar Skill Jobs

Hedra - Machine Learning Engineer (CUDA)

Hedra

New York, New York, United States (On-Site)
2 Months ago
ByteDance - Linux Kernel Software Engineer

ByteDance

San Jose, California, United States (On-Site)
2 Months ago
Aptive - Software Engineer Ground Truth Lab

Aptive

Kraków, Lesser Poland Voivodeship, Poland (On-Site)
3 Weeks ago
Argus Labs - Senior Software Engineer (Infrastructure/Backend)

Argus Labs

(Remote)
2 Months ago
Ansys - Lead R&D Engineer (Cloud Platform Developer)

Ansys

Waterloo, Ontario, Canada (Remote)
2 Weeks ago
Playrix - Senior Data Analyst (Attribution)

Playrix

Montenegro (Remote)
7 Months ago
Electronic Arts - Technical Director - Dynamic Experiences

Electronic Arts

Redwood City, California, United States (On-Site)
1 Month ago
CData - Software Development Engineer III

CData

Bengaluru, Karnataka, India (On-Site)
1 Month ago
NVIDIA - AI Network System Architect

NVIDIA

Yokne'am Illit, North District, Israel (On-Site)
1 Month ago
Inkittt - Senior Software Engineer, Backend

Inkittt

Krakow Am See, Mecklenburg-Vorpommern, Germany (Hybrid)
7 Months ago

Jobs in San Jose, California, United States

Google - Senior Software Engineer, Engineering Productivity, Google Cloud Platforms

Google

New York, New York, United States (On-Site)
1 Month ago
Haleon - Automation Technician Apprentice

Haleon

Lincoln, Nebraska, United States (On-Site)
3 Weeks ago
Optiv - Federal Client Director

Optiv

Tampa, Florida, United States (Remote)
2 Weeks ago
Patreon - Acquisitions Coordinator

Patreon

New York, New York, United States (On-Site)
2 Months ago
hh exchange - Senior FP&A Analyst

hh exchange

Philadelphia, Pennsylvania, United States (Remote)
1 Month ago
ByteDance - Site Reliability Engineer Intern

ByteDance

Seattle, Washington, United States (On-Site)
1 Month ago
ByteDance - Research Engineer Graduate (Vision AI Platform)

ByteDance

Seattle, Washington, United States (On-Site)
3 Months ago
Netflix - Manager/Counsel, Business and Legal Affairs - Original Series // Drama

Netflix

Los Angeles, California, United States (On-Site)
1 Month ago
Mattel  Inc  - American Girl New York-  Salon Stylist (Licensed Cosmetologist/ part time under/seasonal)

Mattel Inc

New York, New York, United States (On-Site)
6 Months ago
lifechruh - Head of Creative

lifechruh

Edmond, Oklahoma, United States (On-Site)
1 Month ago

Research & Development Jobs

Ubisoft - Research Student - Ubisoft La Forge

Ubisoft

Shanghai, Shanghai, China (On-Site)
6 Months ago
ByteDance - Site Reliability Engineer, ML System

ByteDance

Seattle, Washington, United States (On-Site)
7 Months ago
Niantic - Senior Computer Vision Software Engineer

Niantic

London, England, United Kingdom (Hybrid)
2 Months ago
NVIDIA - Senior ASIC Power and Thermal Engineer

NVIDIA

Bengaluru, Karnataka, India (On-Site)
2 Months ago
Google - Software Engineer, Speed

Google

Tel Aviv-Yafo, Tel Aviv District, Israel (On-Site)
1 Month ago
Tesla - Algorithms Engineer, Autobidder (Electricity Markets/Energy Trading)

Tesla

North Holland, Netherlands (On-Site)
3 Months ago
Backbone - Technical Program Manager, Mechanical

Backbone

Atherton, California, United States (Hybrid)
9 Months ago
Google - Staff Software Engineer, ML Compilers

Google

New Taipei, New Taipei City, Taiwan (On-Site)
1 Month ago
NVIDIA - Physical Design Backend Engineer

NVIDIA

Tel Aviv-Yafo, Tel Aviv District, Israel (On-Site)
3 Months ago
Riot Games - Staff Software Engineer, Gameplay & Simulation

Riot Games

Los Angeles, California, United States (On-Site)
1 Month ago
