Log Review, Cleaning, and Monitoring Project

The logs generated by most of our services are unmonitored. This increases the chances that our customers discover problems before we do, which is always a poor experience.

For each service your squad has claimed:
- Find the logs. Where are they stored?
- If needed, gain access to the logs. Work with your assigned SRE on this.
- Review & clean the logs.
  - Is a particular log too noisy to be useful? Remove it.
  - Is another log’s severity incorrect? (e.g. [INFO] when it should be [WARNING] or vice versa). Fix it.
  - Reserve Errors for actual application errors, meaning something that indicates a real problem and that the team will act on. Some examples of things that are not errors:
    - Bad customer input (file size too large, malformed syntax, etc.). It is ok to log this, but not as an error.
    - Occasional connection timeouts
  - Would anything else make the logs easier to read and more useful for proactively detecting problems? Make those changes too.
- Monitor the logs for security, privacy, compliance, reliability, performance, and cost issues, i.e. Nest issues.
  - Do this manually at first, e.g. in a daily review working session.
  - Over time, reviewing logs manually has diminishing returns. Automate what you can, but continue to periodically review and clean manually.
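
The severity guidance above can be sketched in code. This is a minimal illustration, assuming a Python service that uses the standard `logging` module; the service name, `handle_upload`, `MAX_UPLOAD_BYTES`, and `StorageUnavailableError` are hypothetical names made up for the example, and your service's language and logging library may differ.

```python
import logging

logger = logging.getLogger("upload_service")  # hypothetical service name

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # hypothetical limit, for illustration only


class StorageUnavailableError(Exception):
    """Hypothetical failure that the team would need to act on."""


def handle_upload(size_bytes: int, store) -> str:
    """Return a short status string and log at a severity that matches the cause."""
    if size_bytes > MAX_UPLOAD_BYTES:
        # Bad customer input: worth recording, but not an application error.
        logger.info("upload rejected: size %d exceeds limit %d",
                    size_bytes, MAX_UPLOAD_BYTES)
        return "rejected"
    try:
        store(size_bytes)
    except TimeoutError:
        # An occasional timeout is expected, so it is logged as a warning.
        logger.warning("storage call timed out; caller may retry")
        return "retry"
    except StorageUnavailableError:
        # A real problem the team will act on, so it is logged as an error.
        logger.error("storage unavailable", exc_info=True)
        return "failed"
    return "stored"
```

With severities assigned this way, an [ERROR] line should always mean that someone needs to look, which makes both the manual review and any later automation more useful.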

<aside> ❓ If you get stuck, bring your questions to the Nest HQ room in Roam, or post them to your #epd-leads channel.

</aside>