Senior Technical Incident Manager

10 Minutes ago • 6-9 Years • Technical Art

Job Summary

Job Description

Provide strategic coordination of end-to-end service delivery across critical platforms, proactively identifying service trends and systemic failures to drive permanent resolutions. Lead root cause analysis and post-incident reviews, mentoring junior team members. Act as the primary escalation point for complex incidents, owning resolution and customer communication. Drive continuous improvement in monitoring, automation, and system reliability, championing ITIL best practices. Contribute to tooling strategy and manage key relationships with cross-functional partners to ensure operational readiness and service alignment.
Must have:
  • Provide oversight and strategic coordination of end-to-end service delivery across critical platforms and systems.
  • Proactively identify service trends, recurring issues, and systemic failures, and lead efforts to drive permanent resolutions.
  • Lead root cause analysis (RCA) and post-incident reviews with stakeholders.
  • Act as the primary escalation point for complex incidents, owning resolution and customer communication at the senior level.
  • Drive continuous improvement across monitoring, automation, and system reliability.
  • Lead incident bridges and engage with engineering teams and senior stakeholders.
  • Champion best practices in service management including SLAs/OLAs, change management, and problem management processes.
  • Bachelor’s degree or equivalent experience with 6 to 9 years in IT operations, site reliability, or service delivery.
  • Deep understanding of Cloud architectures (Microsoft Azure, AWS, or GCP), infrastructure monitoring, and incident response.
  • Demonstrated experience managing incidents in high-availability, high-throughput, mission-critical environments.
  • Strong technical background with ability to lead troubleshooting across infrastructure, networking, application, and platform services.
  • Advanced knowledge of monitoring, alerting, and observability tools (e.g., Grafana, Opsgenie, Datadog, Prometheus, etc.).
  • Expert-level understanding of ITIL processes, particularly Incident, Problem, and Change Management.
  • Experience conducting technical postmortems, producing RCA reports, and implementing service improvement plans.
  • Proven ability to influence and collaborate with cross-functional technical teams and senior management.
  • Strong leadership presence during high-impact events.
  • Excellent verbal and written communication skills.
  • Willingness to participate in on-call rotation and provide senior-level support during critical incidents.

Job Details

Principal Duties and Responsibilities

  • Provide oversight and strategic coordination of end-to-end service delivery across critical platforms and systems.
  • Proactively identify service trends, recurring issues, and systemic failures, and lead efforts to drive permanent resolutions.
  • Lead root cause analysis (RCA) and post-incident reviews with stakeholders, identifying patterns and continuous improvement opportunities.
  • Mentor and guide junior team members in incident and problem resolution techniques, ensuring knowledge transfer and skills development.
  • Act as the primary escalation point for complex incidents, owning resolution and customer communication at the senior level.
  • Drive continuous improvement across monitoring, automation, and system reliability to reduce operational noise and increase system resiliency.
  • Lead incident bridges and engage with engineering teams and senior stakeholders to ensure timely resolution and high-quality communications.
  • Champion best practices in service management including SLAs/OLAs, change management, and problem management processes.
  • Contribute to tooling strategy and capability enhancements for observability, incident management, and analytics.
  • Own key relationships with cross-functional partners including DevOps, Cloud Engineering, and Product teams to ensure operational readiness and service alignment.
  • Represent the team in technical leadership forums and contribute to operational strategy and planning.
  • Ensure consistent shift readiness by reviewing and refining runbooks, escalation paths, and shift documentation.
  • Promote a culture of quality by embedding service excellence in operational procedures, ensuring processes are optimized for consistency, performance, and reliability.
  • Measure and track key quality indicators and ensure feedback loops are in place for ongoing improvement.

Required Knowledge, Skills, and Qualities

  • Bachelor’s degree or equivalent experience with 6 to 9 years in IT operations, site reliability, or service delivery within enterprise or SaaS environments.
  • Deep understanding of Cloud architectures (Microsoft Azure, AWS, or GCP), infrastructure monitoring, and incident response.
  • Demonstrated experience managing incidents in high-availability, high-throughput, mission-critical environments.
  • Strong technical background with ability to lead troubleshooting across infrastructure, networking, application, and platform services.
  • Advanced knowledge of monitoring, alerting, and observability tools (e.g., Grafana, Opsgenie, Datadog, Prometheus, etc.).
  • Expert-level understanding of ITIL processes, particularly Incident, Problem, and Change Management.
  • Experience conducting technical postmortems, producing RCA reports, and implementing service improvement plans.
  • Proven ability to influence and collaborate with cross-functional technical teams and senior management.
  • Strong leadership presence during high-impact events; comfortable leading conversations with engineering leaders and executive stakeholders.
  • Demonstrated mentoring and coaching experience; ability to develop junior engineers and promote operational excellence culture.
  • Strong focus on quality assurance within service delivery, with a commitment to maintaining high standards in documentation, execution, and outcomes.
  • Excellent verbal and written communication skills with the ability to tailor messages to technical and non-technical audiences.
  • Adaptability to evolving technologies and a strong drive to automate and improve existing processes.
  • Willingness to participate in on-call rotation and provide senior-level support during critical incidents.

About The Company

We’re creating moving experiences for vehicles around the world. We’re Cerence. We utilize sophisticated A.I. and sensor data to entertain, inform and delight drivers and passengers. Whether it’s voice, gesture, gaze or touch technologies, the experience is the sum of the parts. Raise windows with a quick glance, hear a restaurant review with the point of a finger, display an augmented reality cityscape on a windshield, drive with just the sound of your voice. The future is connected cars, autonomous driving, ride sharing and e-vehicles.
