Shield AI is looking for a Cloud Engineer to support its leadership in applied artificial intelligence development. In this role, you will be responsible for engineering, deploying, provisioning, and managing critical cloud systems that drive innovation across Shield AI’s public and private cloud environments, both domestically and internationally. As part of the Cloud and Infrastructure team within Enterprise Operations, you will play a key role in ensuring the performance, scalability, and reliability of these systems to support various business units. This position may involve occasional travel to Shield AI locations.
What you'll do:
- Engineering:
- Oversee the day-to-day management and optimization of cloud-based infrastructure (e.g., Azure, AWS).
- Support and optimize cloud and virtual machine environments, assisting with capacity planning, performance monitoring, security compliance, and vulnerability remediation.
- Assist in implementing and maintaining infrastructure systems, including servers, storage, backup solutions, and disaster recovery processes, for both public and private clouds.
- Demonstrate a willingness to learn and work with both familiar and unfamiliar operating systems and workloads, and a desire to automate repeatable tasks.
- Author documentation for engineered and maintained systems and their associated processes that supporting teams can rely on.
- Assist in researching, recommending, and developing innovative solutions for complex requirements and issue resolution.
- Participate in Agile practices and apply sound engineering principles.
- Operations and Support:
- Perform daily system monitoring: verify the integrity and availability of all server resources, systems, and key processes, and review system and application logs.
- Support system maintenance and upgrades, including OS patching, software configuration, hardware updates, and performance tuning to ensure optimal cloud infrastructure performance.
- Provide escalated support for operational issues affecting systems, workloads, and Kubernetes AI infrastructure, potentially both during and after normal business hours.
- Analyze, troubleshoot, and resolve system infrastructure and software issues.
- Be available to participate in on-call, emergency, or maintenance roles.