Site Reliability Engineer

Aeries Technology

Job Summary

As a Site Reliability Engineer at Constant Contact, you will be responsible for maintaining the reliability and uptime of critical services, focusing on CentOS servers, Java application support, incident management, change management, and Kubernetes administration. You will monitor production systems, applications, and overall performance while conducting security checks and routine system and application maintenance. The role involves responding to operational alerts, collaborating with developers and operations personnel to resolve issues, and participating in post-mortem meetings to prevent future incidents. You will also be expected to write and maintain policy and procedure documents, write scripts or code to develop tools and/or services, and manage service-level objectives.

Must Have

  • Administer Kubernetes clusters with ArgoCD.
  • Monitor and manage applications on CentOS servers.
  • Manage incidents and perform root cause analysis.
  • Use basic Linux scripting for automation.
  • Knowledge of Project Management Tools like JIRA/Confluence.
  • Experience with database systems like MySQL and DB2.
  • Drive incidents using Incident Management processes.
  • Execute change management procedures.
  • Experience as a Linux (CentOS / RHEL) administrator.
  • Experience with managing deployments using Jenkins.
  • Experience working with monitoring tools like New Relic, Splunk and Nagios.
  • Experience with log aggregation tools like Splunk, Loki or Grafana.

Job Description

About Us

Aeries Technology is a Nasdaq-listed global professional services and consulting partner, headquartered in Mumbai, India, with centers in the USA, Mexico, Singapore, and Dubai. We provide mid-size technology companies with the right mix of deep vertical specialty, functional expertise, and the right systems & solutions to scale, optimize and transform their business operations with unique customized engagement models. Aeries is Great Place to Work certified by GPTW India, reflecting our commitment to fostering a positive and inclusive workplace culture for our employees. Read about us at https://aeriestechnology.com

About Business Unit

Constant Contact is a technology product company, headquartered in Waltham, Massachusetts, United States. We are one of the top 2 providers of email marketing, social media marketing, event marketing, and online survey tools. We support 0.5 million SMBs to grow their businesses by building stronger relationships with their customers, with a wide range of intuitive marketing applications designed to help small businesses and nonprofits expand their customer bases and nurture relationships. Read about us at https://www.constantcontact.com/about

In 2021, Constant Contact partnered with Aeries to set up its GTC with the aim of consolidating the former’s global operations in Bengaluru (Bangalore), India, with teams set up in the areas of IT, Engineering, Customer Support, and other General and Administrative functions. The GTC is a dedicated center, focused on providing best practices, research, support, and training for specific business functions.

Big Reasons to Support Small - https://constantcontact.wistia.com/medias/pmlrsyb6hu

Roles and Responsibility

At Constant Contact, we’re looking for individuals well rounded in several aspects of Technical Operations. You will take on the role of a responder to operational alerts and monitoring within Constant Contact. This role requires you to work with both developers and operations personnel to address and resolve issues and requests.

We are looking for a highly skilled and motivated Site Reliability Engineer to join our team. The successful candidate will be responsible for maintaining the reliability and uptime of critical services, with a focus on CentOS servers, Java application support, incident management, change management and Kubernetes administration. The ideal candidate will possess strong skills in ArgoCD for Kubernetes management and in Linux, basic scripting knowledge, and familiarity with modern monitoring, alerting and automation tools. We are looking for someone who is self-motivated, possesses excellent communication skills (both oral and written) and is able to work both independently and collaboratively.

What you’ll do:

  • Conduct regular routine tasks for system and application maintenance.
  • Follow SOPs to correct or prevent issues.
  • Monitor production systems, applications and overall performance. Observability is a process that prepares the software team for uncertainties when the software goes live for end users. Site reliability engineering uses tools to detect abnormal behaviors in the software and, more importantly, collect information that helps developers understand what causes the problem.
  • Conduct security checks.
  • Run meetings with our business partners, following established processes and procedures.
  • Write, update and maintain policy and procedure documents.
  • Write scripts or code as necessary to develop tools and/or services in order to support the product.
  • Learn from post-mortems and prevent new incidents from occurring.
  • Perform admin work on various tools and applications such as JIRA and New Relic.
  • Maintain service-level objectives: specific and quantifiable goals related to maintaining the parameters set for our “Golden Metrics”.

Who you are:

  • 3-5+ years of experience working in a SaaS and Cloud environment.
  • Administer Kubernetes clusters, including management of applications using ArgoCD.
  • Monitor, maintain, and manage applications on CentOS servers, ensuring high availability and performance.
  • Respond to and manage running incidents, including running post-mortem meetings, performing root cause analysis and ensuring timely resolution.
  • Use basic Linux scripting to automate routine tasks and improve operational efficiency.
  • Knowledge of project management tools like JIRA/Confluence.
  • Knowledge of database systems like MySQL and DB2.
  • Understand and drive incidents using Incident Management processes and procedures.
  • Execute change management procedures, run change management meetings and enforce safe and compliant changes to production environments.
  • Experience as a Linux (CentOS / RHEL) administrator.
  • Deep knowledge of on-call responsibilities and awareness of time management, including maintaining on-call management tools such as xMatters.
  • Experience with managing deployments using Jenkins.
  • Experience working with a suite of monitoring tools including New Relic, Splunk and Nagios.
  • Experience with log aggregation tools like Splunk, Loki or Grafana.
  • You must be comfortable troubleshooting and debugging web applications across the entire stack (i.e. the application layer, the database layer, the OS).
  • Production MySQL experience: replication, performance tuning, query optimization.
  • You should have familiarity with Ansible or other configuration management tools like Puppet.

17 Skills Required For This Role

  • SaaS Business Models
  • Timeline Management
  • Communication
  • Problem Solving
  • MySQL
  • Linux
  • Ansible
  • New Relic
  • Grafana
  • Nagios
  • Puppet
  • Kubernetes
  • Confluence
  • Splunk
  • Jira
  • Jenkins
  • Java