Senior Cloud Architect
Senior Cloud Architect (SRE / Observability / AIOps)
We are looking for a professional who specializes in cloud operations with a focus on observability (logging, tracing, alerting), a vision for AIOps, and a strong understanding of site reliability practice. The ideal candidate is a consultant with a mix of knowledge and skills in software development and cloud platforms, with experience advising clients on how to analyze their challenges and design, build, test, and deploy changes while maintaining a cloud operating model (e.g. DevOps and ITSM processes and tools).
Responsibilities:
- Seasoned client-facing consultant/architect who has advised IT executives on cloud operating models e.g. IT processes and tools - and can craft proposals after initial presales meetings
- Experience working closely with production operations, application developers, system, network, middleware and database administrators to streamline development, operations and support processes
- Experience in leading DevOps teams, establishing pipelines for cloud and application development, and managing the velocity, quality and performance of the cloud and the applications
- Adept at analysis and problem-solving, preferably with a blend of platform, middleware, network and software development skills
- Very nice to have: experience with consulting methodologies, knowledge management and service offering development (to assist in building cloud practice offerings from sales through delivery)
- Apply consulting and engineering skills to solve operations problems by:
- Defining and driving initiatives to increase the client's overall application development velocity, quality and availability
- Building tooling needed to improve DevOps and observability of development and operations performance/efficiency
- Enhancing monitoring and management tooling to better detect, diagnose, and correct problems
- Identifying and resolving defects/problems in the cloud or application code during an incident, when applicable
- Teaming with application developers to support pipelines for new features and incident response automation
- Driving the transformation of delivery methods into operational teams such as network, database, system administration and incident management
- Enabling an AIOps strategy and roadmap to drive more predictive and automated response
- Leading root cause analysis (RCA) to find and correct the source of issues and outages
Requirements:
- Ideally, a former developer who knows how to support development with DevOps and SRE automation, including troubleshooting application transactions end to end and identifying critical points of failure or bottlenecks
- DevOps/GitOps understanding with a vision for how to automate analysis, assignments, decisions and actions to support and operate a platform and application
- Cloud-native dashboarding and alerting (at minimum, familiarity with AWS, GCP and Azure, with depth in at least one)
- Experience with scalable cloud native architectures and performance tuning.
- Enjoys solving difficult engineering problems, approaches troubleshooting systematically, and is comfortable getting hands-on to guide engineers and operators
- Strong communication and planning experience, ideally with a large-consultancy background
- Ability to own all or part of an assessment to develop recommendations and a roadmap
- Solid understanding of ITSM and ITIL principles with focus on Event, Incident, Problem, Change and Configuration Management - and ability to lead assessments of maturity
- Nice to have: software engineering skills, ideally with experience in Python, Go and/or Java
- Understanding of large-scale complex systems from a reliability perspective
- Passion for resolving reliability issues and identifying strategies to mitigate them going forward
- Experience implementing high availability and disaster recovery infrastructure in the cloud
- Experience with self-healing infrastructure
- Experience aligning infrastructure with business SLAs and SLOs and managing error budgets
- MUST HAVE experience:
- Designing and implementing DevOps pipelines (e.g. Jenkins, Tekton) for Infrastructure as Code (e.g. Terraform)
- Leading a team to develop DevOps/IaC automation: Kubernetes, Terraform, Python, GCP/AWS/Azure
- Strong experience with at least one cloud: GCP, AWS or Azure (ideally many)
- Highly desired hands-on experience:
- Grafana, ELK Stack, Prometheus, Splunk, and cloud-native tools for alerting and logging
- Observability/APM tools (at least one): Dynatrace, BigPanda, Datadog or New Relic
What we offer:
Opportunity to work on bleeding-edge projects
Work with a highly motivated and dedicated team
Competitive salary
Flexible schedule
Benefits package - medical insurance, sports
Corporate social events
Professional development opportunities
Well-equipped office
About Us:
Grid Dynamics (Nasdaq:GDYN) is a digital-native technology services provider that accelerates growth and bolsters competitive advantage for Fortune 1000 companies. Grid Dynamics provides digital transformation consulting and implementation services in omnichannel customer experience, big data analytics, search, artificial intelligence, cloud migration, and application modernization. Grid Dynamics achieves high speed-to-market, quality, and efficiency by using technology accelerators, an agile delivery culture, and its pool of global engineering talent. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the US, UK, Netherlands, Mexico, and Central and Eastern Europe.
To learn more about Grid Dynamics, please visit www.griddynamics.com. Follow us on Facebook, Twitter, and LinkedIn.
Don’t see the right opportunity?
Contact us anyway and let’s talk! To apply, send your resume and cover letter to jobs@griddynamics.com