The logs generated by most of our services are unmonitored. This increases the chances that our customers discover problems before we do, which is always a poor experience.
- For each service your squad has claimed:
    - Find the logs. Where are they stored?
    - If needed, gain access to the logs. Work with your assigned SRE on this.
    - Review & clean the logs.
        - Is a particular log too noisy to be useful? Remove it.
        - Is a log’s severity incorrect (e.g. [INFO] when it should be [WARNING], or vice versa)? Fix it.
        - Reserve the error level for actual application errors: real problems that the team will act on. Some examples of things that are not errors:
            - Bad customer input (file size too large, malformed syntax, etc.). It is OK to log this, but not as an error.
            - Occasional connection timeouts.
        - Is there anything else that would make the logs easier to read and more useful for proactively detecting problems?
    - Monitor the logs for Nest issues, i.e. security, privacy, compliance, reliability, performance, and cost issues.
        - Do this manually at first, e.g. in a daily review working session.
        - Over time, reviewing logs manually has diminishing returns. Automate what you can, but continue to periodically review and clean manually.
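As an illustration of the severity guidance above, here is a minimal Python sketch using the standard `logging` module. The service name, the size limit, the `UploadTooLarge` exception, and the timeout threshold are all hypothetical. Escalating a sustained run of timeouts to an error is one possible policy, not an agreed standard; adapt it to your service.

```python
import logging

logging.basicConfig(format="[%(levelname)s] %(name)s: %(message)s", level=logging.INFO)
logger = logging.getLogger("upload-service")  # hypothetical service name

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # hypothetical limit for illustration
TIMEOUT_ERROR_THRESHOLD = 3  # hypothetical: this many timeouts in a row is a real problem


class UploadTooLarge(ValueError):
    """Raised when a customer submits a file over the size limit."""


def validate_upload(size_bytes: int) -> None:
    """Reject oversized uploads without logging an error."""
    if size_bytes > MAX_UPLOAD_BYTES:
        # Bad customer input: OK to log, but not as an error.
        logger.info(
            "Rejected upload: %d bytes exceeds limit of %d bytes",
            size_bytes,
            MAX_UPLOAD_BYTES,
        )
        raise UploadTooLarge(f"{size_bytes} bytes exceeds {MAX_UPLOAD_BYTES}")


def timeout_log_level(consecutive_timeouts: int) -> int:
    """Pick a level: occasional timeouts warn, a sustained run is an error."""
    if consecutive_timeouts >= TIMEOUT_ERROR_THRESHOLD:
        return logging.ERROR
    return logging.WARNING


def record_timeout(consecutive_timeouts: int) -> None:
    """Log a connection timeout at the level chosen by the policy above."""
    logger.log(
        timeout_log_level(consecutive_timeouts),
        "Connection timed out (%d in a row)",
        consecutive_timeouts,
    )
```

With this shape, a single timeout shows up as `[WARNING]`, and the error level stays reserved for conditions the team would actually act on.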
<aside>
❓ If you get stuck, bring your questions to the Nest HQ room in Roam, or post to your #epd-leads channel.
</aside>
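Once manual reviews start to feel repetitive, a small script can take over the first pass. The sketch below is a rough starting point, not a prescribed tool: it assumes log lines carry a bracketed level such as `[INFO]` or `[ERROR]`, and the 5% threshold is an arbitrary placeholder. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumes each log line contains a bracketed level, e.g. "[WARNING] ...".
LEVEL_PATTERN = re.compile(r"\[(DEBUG|INFO|WARNING|ERROR|CRITICAL)\]")


def count_levels(lines: Iterable[str]) -> Counter:
    """Count log lines per severity level; lines without a level are skipped."""
    counts: Counter = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def error_ratio(counts: Counter) -> float:
    """Fraction of levelled lines that are ERROR or CRITICAL."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["ERROR"] + counts["CRITICAL"]) / total


def needs_attention(counts: Counter, max_error_ratio: float = 0.05) -> bool:
    """Flag a batch of logs whose error ratio exceeds the (placeholder) threshold."""
    return error_ratio(counts) > max_error_ratio


if __name__ == "__main__":
    sample = [
        "[INFO] Rejected upload: file too large",
        "[WARNING] Connection timed out (1 in a row)",
        "[ERROR] Failed to write invoice to database",
        "[INFO] Request completed",
    ]
    counts = count_levels(sample)
    print(dict(counts))
    print("needs attention:", needs_attention(counts))
```

A summary like this only helps if severities are already accurate, so it complements the manual review and cleanup rather than replacing it.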