Principal Site Reliability Engineer

Principal Site Reliability Engineer-December 2024

Vancouver

Dec 26, 2024

About Principal Site Reliability Engineer

　　What is Viva Engage?

　　Viva Engage is the industry-defining social network for the enterprise. We provide a platform for millions of employees, including those from 85% of Fortune 500 companies, to build community and culture, share knowledge, and connect with their leaders and each other.

　　Why Viva Engage?

　　Acquired by Microsoft in 2012, Viva Engage combines the benefits of a startup (rapid innovation, cutting-edge technology, outsized individual impact) with the advantages of working for one of the most successful software companies in the world. We believe in mission-driven work, and in this post-Covid world our platform has become more indispensable than ever because it fosters connection and a sense of belonging among remote teams. #VivaEngage

　　You will have:

　　Autonomy and freedom to innovate

　　Choice of the best of open source and Microsoft-internal technology

　　The ability to experiment, A/B test, and make data-driven decisions

　　Opportunity for outsized impact as part of a small but mighty team on a rapidly growing product needed now more than ever

　　As a Principal Site Reliability Engineer in Viva Engage, you will have two critical accountabilities:

　　The first is driving efforts to fully embrace site reliability engineering principles while building critical infrastructure, optimizing existing systems, and eliminating toil. You will lead efforts that combine software and systems engineering to build, scale, and operate the large-scale conversation platform that powers Viva Engage experiences.

　　The second is improving overall reliability for Viva Engage. This means guiding and influencing peers to develop missing capabilities, and driving changes to our culture and processes to make reliability a critical aspect of how we work. We have been growing rapidly to become a critical workload for many of the world’s largest organizations and are looking for you to help us get to the next level.

　　Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

　　Responsibilities

　　Develop and execute on the observability and telemetry strategy

　　Own the telemetry and monitoring infrastructure

　　Continually seek deeper insights into the performance, reliability, and scalability of our systems

　　Improve service reliability for the entire Viva Engage (formerly Yammer) team by reducing mean time to recovery (MTTR)

　　Help all of Viva Engage prevent service incidents altogether

　　Qualifications

　　Required/Minimum Qualifications

　　8+ years technical experience in software engineering, network engineering, or systems administration

　　OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration

　　OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration

　　OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.

　　6+ years of experience building large-scale distributed systems

　　6+ years of experience in a Site Reliability Engineering role building and operating systems with world-class reliability at huge scale

　　Preferred Qualifications/Attributes

　　Knowledge of log and metrics pipelines (ELK stack or cloud services)

　　Troubleshooting skills and the ability to trace a request through an entire stack

　　Microservices development, deployment, and monitoring

　　Curiosity about reliability and performance at all levels of the stack

　　Experience with large datasets and data migrations

　　Azure, AWS, or GCP automation

　　Site Reliability Engineering IC5 - The typical base pay range for this role across Canada is CAD $132,800 - CAD $247,200 per year.

　　Find additional pay information here:

　　https://careers.microsoft.com/v2/global/en/canada-pay-information.html

　　Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).
