Site Reliability Engineer - Spark

San Ramon, CA, US

We are seeking a skilled Site Reliability Engineer to set up and maintain a distributed, high-load infrastructure operating at scale. Your expertise will be instrumental in improving overall reliability, working in close partnership with our development teams to design and build new services together. You know how to troubleshoot Spark jobs, set up alerts and monitoring, coordinate fixes across teams, and handle incident management for the services you manage.

Responsibilities:

  • Participate in and provide feedback on the design, development, and implementation of integration processes for the Enterprise Data Lake, Data Warehouse, and BI applications
  • Collaborate with Architects, Engineers, Business Intelligence Developers, and other teammates to achieve common goals
  • Own end-to-end service availability and performance of the data platforms
  • Meet defined organizational SLAs, and continuously track and optimize service availability using established Key Performance Indicators (KPIs)
  • Optimize platform utilization and contain billing costs
  • Lead incident resolution, root cause analyses (RCAs), blameless post-mortems, and problem management
  • Participate actively in disaster recovery and business continuity planning and drills
  • Review incident trends, identify recurring issues, and build automation to eliminate toil
  • Communicate progress and resolution to appropriate stakeholders and leadership
  • Lead by example, mentor the team, and establish credibility through quality technical execution
  • Recommend application changes to improve application performance, reliability, and cost to operate
  • Review existing processes and recommend changes or institute new processes as necessary, covering observability, alerting, operations, engineering, and system tuning
  • Produce high-quality documentation covering the data platform, common patterns, runbooks, SOPs, and the knowledge base
  • Work in shifts in a globally distributed team, with a follow-the-sun approach

Requirements:

  • Degree in computer science/engineering or equivalent experience
  • 2+ years’ experience (5+ preferred) in SRE or similar roles
  • Hands-on experience with the Big Data stack, including managing Spark jobs
  • 2+ years’ experience (5+ preferred) in Kubernetes
  • Experienced in Python, shell, SQL, and PL/SQL scripting
  • Experienced in change and release processes, GitHub, and CI/CD solutions
  • Experience in on-prem and public cloud platforms (AWS preferred)
  • Experienced in process reviews, continuous improvement, automation, and toil elimination
  • Experienced in high availability (HA), high transaction volume environments, backup/recovery, and disaster recovery
  • Strong background in full-lifecycle support across multiple platforms or languages
  • Ability to work with technical and non-technical teams, including Infrastructure, Network, Development, Business Analysis, and QA
  • Experience in analyzing and recommending solutions for production issues
  • Familiarity with Infrastructure as Code (IaC) and Terraform scripting is a plus

What we offer:

  • Opportunity to work on bleeding-edge projects
  • Work with a highly motivated and dedicated team
  • Competitive salary
  • Flexible schedule
  • Benefits package: medical insurance, sports
  • Corporate social events
  • Professional development opportunities


Placement and staffing agencies need not apply. We do not work with C2C at this time.

At this moment, we are not able to process H1B transfers. Applicants with CPT and OPT visas are welcome to apply.

About Us: 

Grid Dynamics (Nasdaq:GDYN) is a digital-native technology services provider that accelerates growth and bolsters competitive advantage for Fortune 1000 companies. Grid Dynamics provides digital transformation consulting and implementation services in omnichannel customer experience, big data analytics, search, artificial intelligence, cloud migration, and application modernization. Grid Dynamics achieves high speed-to-market, quality, and efficiency by using technology accelerators, an agile delivery culture, and its pool of global engineering talent. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the US, UK, Netherlands, Mexico, and Central and Eastern Europe.

To learn more about Grid Dynamics, please visit our website. Follow us on Facebook, Twitter, and LinkedIn.

