Technical Architect - Monitoring, IaaS, PaaS, Observability, Azure

19 Minutes ago • 8-10 Years

Job Summary

Job Description

As a Monitoring SME & Architect, you will design and implement comprehensive monitoring solutions to ensure system uptime, health, performance, and reliability. This involves reducing alert volume, implementing intelligible alerting, alert correlation, and setting up early warning systems. You will collaborate across teams to create centralized dashboards and leverage automation for scalable and secure monitoring configurations, with a future scope for AI-integrated operations.

Job Details

Scope:

• As a Monitoring SME & Architect, you will be responsible for designing and implementing a comprehensive monitoring solution and process to ensure uptime, system health, performance, and reliability. You will be responsible for reducing alert volume, implementing intelligible alerting, alert correlation, and alert compression, measuring the signal-to-noise ratio, and setting up an early warning system across Operations. You will collaborate across teams to create centralized dashboards and visibility that remove silos. You will also architect monitoring configurations in a scalable and secure model, leveraging automation, with a future scope of AI-integrated monitoring operations.

Our current technical environment:

• Technical skills: monitoring tool administration, log indexing and pipelines, Azure, VMware, Ansible, Python, Selenium, Terraform, Shell, Windows, Linux, Grok parsing

Key Accountability

  • Monitoring effectiveness – Ensure the monitoring framework and its enhancements are set up to increase proactive identification and resolution of issues before customers are impacted.
  • Set up and maintain centralized monitoring configuration as code.
  • Consistently drive alert volume down and eliminate false alerts.
  • Set up advanced monitoring alerts for the golden signals, i.e. latency, errors, throughput, and saturation.
  • Move from traditional symptomatic CPU and memory monitors to advanced alert correlation that pinpoints issues directly and supports predictive monitoring.
  • Create and implement synthetic or end-user monitoring using Python and Selenium for customer experience monitoring.
  • Set up API endpoint monitoring and measure uptime and availability across customer, product, and infrastructure endpoints.
  • Implement SLO, SLI, and error budget concepts to measure and establish a maturity model.
  • Maintain and manage a code repository built for scale and security.
  • Leverage automation to push changes to monitoring tools.
  • Set up an orchestration mechanism for onboarding and decommissioning to ensure operational readiness.
  • Set up dashboards and create visibility across all cross-functional teams.
  • Establish telemetry for automated collection of data across metrics, logs, and traces.
  • Continuously analyze data to identify gaps and implement improvements.

Minimum Requirements

  • Associate’s degree (or equivalent) in Computer Science, Information Technology, or a related field preferred.
  • 8-10 years of IT experience, including 7 years of monitoring experience.
  • Experience administering monitoring tools such as AppDynamics, SolarWinds, Grafana, Zabbix, Datadog, and the ELK Stack.
  • Hands-on experience with logs, metrics, traces, parsing, regex, and tagging.
  • Hands-on experience implementing APM, EUM, synthetics, API endpoint monitoring, etc.
  • Hands-on experience integrating with ITSM tools such as ServiceNow and Jira.
  • Hands-on experience with Ansible, Python, Selenium, and Shell.
  • Hands-on experience with enterprise-scale Azure, VMware, and AWS environments.
  • Hands-on experience creating dashboards and performing analysis.
  • Excellent interpersonal and influencing skills: interacting appropriately with colleagues of many technical skill levels, and remaining calm and courteous while resolving problems in high-stress situations.

Skills:

  • Technical skills: monitoring tool administration, log indexing and pipelines, Azure, VMware, Ansible, Python, Selenium, Terraform, Shell, Windows, Linux, Grok parsing
  • Problem-solving skills – Able to devise technical and creative solutions, and to use analytics to understand patterns and proactively identify gaps.
  • Communication skills – Effective communication is key in this role to gather data about problems, prepare detailed notes and reports, and update users on next steps.
  • Time management – Maintain excellent time management and set priorities when handling multiple cases.
  • Team collaboration – Routinely work with other functions to resolve user issues, collaborating successfully with team members and coworkers.
  • Highly motivated, hands-on personality.
  • Ability to learn quickly in a challenging environment.

About The Company

We are a proven, passionate bunch of disruptors. Our work is all about tapping into your potential so we can deliver the best solutions and customer experiences on the planet. Collaboration, respect, and a great work-life balance earned us the title of "Best Place to Work - Employees' Choice" by Glassdoor. Our people are smart, creative rock stars with over 400 patents and 10,000 people-years of domain expertise. Blue Yonder is the world leader in digital supply chain and omni-channel commerce fulfillment. Our intelligent, end-to-end platform enables retailers, manufacturers, and logistics providers to seamlessly predict, pivot, and fulfill customer demand. With Blue Yonder, you can make more automated, profitable business decisions that deliver greater growth and re-imagined customer experiences. Blue Yonder - Fulfill your Potential.™