Data Parallel Accelerator Post-Silicon Performance Lead
Rivos
Job Summary
Join a well-funded, innovative hardware startup in Silicon Valley as the Post-Silicon and Emulation Performance Lead Engineer. As a key technical leader, you will drive silicon performance analysis and optimization across software, firmware, architecture, power, and system design. Your work will ensure our silicon consistently achieves industry-leading efficiency and performance. This role offers a rare opportunity to shape future architectural directions by executing and analyzing end-to-end workloads in advanced post-silicon environments. You will champion best-in-class performance for both single-socket and scale-up/scale-out systems. We are reimagining silicon to build accelerated computing platforms that will transform the industry, and you will collaborate with talented engineers to push boundaries in performance, energy efficiency, programmability, and scalability.
Must Have
- Lead cross-functional performance validation
- System-level performance optimization
- Collaborate across teams
- Power and performance correlation
- Performance infrastructure automation
- Debug and tuning
- Drive innovation
- Deep expertise in GP-GPU architecture
- Strong C/C++ and Python programming
- Solid understanding of ML/DL workloads
- Familiarity with SIMT processing
- Experience with performance counters
- Knowledge of performance improvement concepts
- Excellent teamwork and communication
Good to Have
- Experience optimizing LLMs at the system level
- Experience with embedded systems (bare-metal testing/debugging)
Perks & Benefits
- Shape the future of silicon
- Work alongside world-class engineers
- Explore research across hardware/software
- Flexible, creative, and collaborative environment
- Opportunity to drive architectural innovation
Job Description
Key Responsibilities
- Lead cross-functional performance validation: Analyze workloads and microbenchmarks in emulation and post-silicon environments, ensuring strong correlation with cycle-accurate models and RTL
- System-level performance optimization: Measure and tune workloads such as generative AI and data analytics for optimal performance per watt
- Collaborate across teams: Work closely with design, architecture, systems, and software groups to enable enterprise use-case performance measurements
- Power and performance correlation: Integrate silicon power measurements with simulation and full-chip projections to drive hardware/software tuning
- Performance infrastructure automation: Develop and automate tools for performance measurement, debug, and reporting
- Debug and tuning: Conduct system-level power and performance debugging, including silicon register tuning to meet aggressive performance targets
- Drive innovation: Influence architectural decisions and validation methodologies to ensure our platforms remain at the forefront of the industry
Required Qualifications
- Deep expertise in GP-GPU architecture and microarchitecture
- Strong programming skills in C/C++ and Python
- Solid understanding of ML/DL workloads and benchmarks; experience optimizing LLMs at the system level is a significant plus
- Familiarity with SIMT processing, cache, and memory hierarchies
- Hands-on experience with performance counters and profiling techniques
- Knowledge of performance improvement concepts: bottleneck analysis, latency hiding, speculative execution, resource scheduling, buffer sizing, replacement policies
- Experience with embedded systems (bare-metal testing/debugging is a plus)
- Excellent teamwork, ownership, and communication skills; ability to thrive under aggressive schedules and adapt quickly
Education and Experience
- Bachelor’s degree with 12+ years of experience in a relevant field
- Master’s degree with 10+ years of experience in a relevant field
- PhD with 5+ years of experience in a relevant field