Senior Cloud Consultant/Architect (SRE / Observability / AIOps)
- Seasoned, client-facing senior consultant who has advised IT executives on cloud operating models (e.g., IT processes and tools)
- Experience reducing the day-to-day noise and toil of IT support and improving the availability of the client’s application suite through new support methods, scripting automation, and advanced tooling.
- Experience working closely with production operations, application developers, system, network, middleware, and database administrators to streamline development, operations, and support processes
- Adept at analysis and problem-solving, preferably with a blend of platform, middleware, network, and software development skills
- Experience with consulting methodologies, knowledge management, and service offering development (to assist in building cloud practice offerings from sales through delivery)
- Apply consulting and engineering skills to solve operations problems by:
- Defining and driving initiatives to increase the client’s overall application availability
- Building tooling needed to improve observability of performance and operations efficiency
- Enhancing monitoring and management tooling to better detect, diagnose, and correct problems
- Resolving problems in code during an incident, when applicable
- Documenting defects to communicate back to the Service Owner(s)
- Partnering with application developers to develop new features and automation that solve operational challenges
- Driving the transformation of delivery methods into operational teams such as network, database, system administration, and incident management
- Enabling an AIOps strategy and roadmap to drive more predictive and automated response
- Performing root cause analysis (RCA) to find and correct the source of issues and outages.
- Ideally, a former developer who knows how to troubleshoot application transactions end to end and identify critical points of failure or bottlenecks.
- DevOps/GitOps mindset with a vision for AIOps (how AI can automate analysis, assignments, decisions, and actions to support and operate a platform and application)
- Cloud-native dashboarding and alerting (at minimum, familiarity with AWS, GCP, and Azure, with depth in at least one)
- Experience with scalable architectures and performance tuning.
- Enjoys solving difficult engineering problems, approaches troubleshooting systematically, and is comfortable getting hands-on to guide engineers and operators
- Strong communication and planning skills, ideally with a background at a large consultancy
- Ability to own all or part of an assessment to develop recommendations and a roadmap
- Solid understanding of ITSM and ITIL principles with a focus on Event, Incident, Problem, Change, and Configuration Management, and the ability to lead maturity assessments
- Nice to have: software engineering skills, ideally with experience in Python, Go, and/or Java
- Understanding of large-scale complex systems from a reliability perspective
- Passion for resolving reliability issues and identifying strategies to mitigate them going forward
- Experience implementing high availability and disaster recovery infrastructure in the cloud.
- Experience with self-healing infrastructure.
- Experience aligning infrastructure with business SLAs and SLOs and managing error budgets.
- MUST HAVE (hands-on with at least one): Dynatrace, BigPanda, Datadog, or New Relic
- Highly desired hands-on: Grafana, ELK Stack, Prometheus, Splunk, and cloud-native tools for alerting and logging
- Required knowledge, preferably with some hands-on experience: Kubernetes, Terraform, Python, GCP/AWS/Azure
- Participation in challenging projects and opportunities for professional development and growth
- Flexible work hours and a dynamic environment
- Friendly, cooperative team and atmosphere
- Medical insurance and other benefits
Grid Dynamics is the engineering services company known for transformative, mission-critical cloud solutions for the retail, finance, and technology sectors. We have architected some of the busiest e-commerce services on the Internet and have never had an outage during the peak season. Founded in 2006 and headquartered in San Ramon, California with offices throughout the US and Eastern Europe, we focus on big data analytics, scalable omnichannel services, DevOps, and cloud enablement.
Get in touch
We'd love to hear from you. Please provide us with your preferred contact method so we can be sure to reach you.
Please sign up for email alerts if you would like to receive information related to press releases, investor relations, and regulatory filings.