Senior Site Reliability Engineer

San Ramon, CA, US

Our client, a renowned technology giant headquartered in Silicon Valley, is at the forefront of innovation and has a worldwide presence. We're embarking on an exciting project that leverages cutting-edge big data technologies to craft a high-performance data analytics platform capable of managing petabytes of data.

We're looking for a talented Site Reliability Engineer (SRE) to lead the setup and maintenance of a distributed, high-load infrastructure that operates at scale. You will play a pivotal role in improving overall reliability by working closely with our development teams to design and build new services. Your primary focus will be managing Big Data services running on Kubernetes.

Key Responsibilities:

  • Collaborate in the design, development, and implementation of integration processes for Enterprise Data Lake, Data Warehouse, and BI Applications.
  • Work closely with Architects, Engineers, Business Intelligence Developers, and other team members to accomplish shared objectives.
  • Ensure end-to-end service availability and performance of the data platforms.
  • Meet predefined organizational SLAs, and continuously monitor and optimize service availability using established Key Performance Indicators (KPIs).
  • Optimize platform utilization and manage billing costs effectively.
  • Lead incident resolution, Root Cause Analyses (RCAs), blameless post-mortems, and problem management.
  • Actively contribute to Disaster Recovery planning and Business Continuity planning and drills.
  • Analyze incident trends, identify recurring issues, and develop automation to reduce manual tasks.
  • Communicate progress and resolutions to relevant stakeholders and leadership.
  • Lead by example, mentor the team, and establish credibility through high-quality technical execution.
  • Recommend application changes to enhance performance, reliability, and cost-efficiency.
  • Review existing processes and propose changes or introduce new processes where necessary, including observability, alerting, operations, engineering, and system tuning.
  • Produce comprehensive documentation covering the data platform, common practices, runbooks, SOPs, and a knowledge base.
  • Perform other duties as assigned.


Requirements:

  • A degree in computer science/engineering or equivalent experience.
  • 2+ years of experience (5+ preferred) in SRE or similar roles.
  • 2+ years of experience (5+ preferred) in Kubernetes.
  • Hands-on experience with the Big Data stack, including managing Spark jobs.
  • Proficiency in Python, shell scripting, SQL, and PL/SQL.
  • Familiarity with Change & Release processes, GitHub, and CI/CD solutions.
  • Experience with on-prem and public cloud platforms (AWS preferred).
  • Expertise in process reviews, continuous improvement, automation, and toil elimination.
  • Experience with high-availability (HA) and high-transaction-volume environments, backup/recovery, and disaster recovery.
  • Strong background in full-lifecycle support across multiple platforms or languages.
  • Ability to collaborate with both technical and non-technical teams, including Infrastructure, Network, Development, Business Analysis, and QA.
  • Proficiency in analyzing and recommending solutions for production issues.
  • Familiarity with Infrastructure as Code (IaC) and Terraform scripting is a plus.

What we offer:

  • Opportunity to work on bleeding-edge projects
  • Work with a highly motivated and dedicated team
  • Competitive salary
  • Flexible schedule
  • Benefits package: medical insurance, sports
  • Corporate social events
  • Professional development opportunities


Placement and staffing agencies need not apply. We do not work with C2C at this time.

At this time, we are unable to process H-1B transfers. Applicants with CPT or OPT work authorization are welcome to apply.

About Us:

Grid Dynamics (Nasdaq:GDYN) is a digital-native technology services provider that accelerates growth and bolsters competitive advantage for Fortune 1000 companies. Grid Dynamics provides digital transformation consulting and implementation services in omnichannel customer experience, big data analytics, search, artificial intelligence, cloud migration, and application modernization. Grid Dynamics achieves high speed-to-market, quality, and efficiency by using technology accelerators, an agile delivery culture, and its pool of global engineering talent. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the US, UK, Netherlands, Mexico, and Central and Eastern Europe.

To learn more about Grid Dynamics, follow us on Facebook, Twitter, and LinkedIn.

