James
Posted May 22, 2022

"Our SRE on call was getting paged daily that one of our SLIs was burning through our SLOs for the GitLab Pages service. It was intermittent and short-lived, but enough to cause user-facing impact which we weren't comfortable with. This turned into alert fatigue because there wasn't enough time for the SRE on call to investigate the issue and it wasn't actionable since it recovered on its own. We decided to open up an investigation issue for these alerts. We had to find out what the issue was since we were showing 502 errors to our users and we needed a DRI that wasn't on call to investigate."

Here: https://about.gitlab.com/blog/2022/05/17/how-we-removed-all-502-errors-by-caring-about-pid-1-in-kubernetes/?utm_id=FAUN_Kaptain321_Link_title