Well-Architected Framework
Identify common metrics
Effective monitoring requires selecting the right metrics that align with your specific goals and provide actionable insights into your infrastructure and service health. The metrics you choose should help you understand system performance, identify bottlenecks, plan capacity, and ensure a smooth user experience.
Choosing the right metrics helps you catch problems before they escalate, improve system performance, and avoid wasting money. By monitoring your infrastructure and services, you can prevent bottlenecks and network problems, and stop paying for resources you don't need, while keeping your applications running smoothly.
Implementing comprehensive monitoring requires understanding different metric categories, selecting appropriate thresholds, and establishing monitoring strategies that provide both real-time visibility and long-term trend analysis.
Select infrastructure metrics
Infrastructure metrics show you how healthy your hardware and platform are and how fast they run. These metrics help you see how much of your resources you are using, find when you are running out of capacity, and plan for when you need to grow.
Common infrastructure metrics include CPU usage, memory utilization, disk I/O, network traffic, and server uptime.
Watch CPU usage to find performance problems and see when you need more computing power. Sustained high CPU usage can degrade performance and indicate that your applications need more processing power.
Track memory usage to make sure your applications have enough RAM to run well. High memory usage can mean you need more memory and can cause swapping, which makes everything run much slower.
Watch disk I/O to find storage problems that could slow down your applications. High disk I/O can mean storage bottlenecks and show that your applications are waiting on disk operations.
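The CPU, memory, and disk signals described above are usually compared against thresholds before anyone is alerted. The following is a minimal Python sketch of that idea, not a reference implementation. The metric names, the `Threshold` class, the `classify` function, and every percentage shown are hypothetical values chosen for illustration; pick thresholds that match your own workloads.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alert levels for one utilization metric, in percent."""

    warning: float
    critical: float


# Illustrative values only (assumptions, not recommendations).
THRESHOLDS = {
    "cpu_percent": Threshold(warning=75.0, critical=90.0),
    "memory_percent": Threshold(warning=80.0, critical=95.0),
    "disk_busy_percent": Threshold(warning=70.0, critical=90.0),
}


def classify(metric: str, samples: list[float], sustained: int = 3) -> str:
    """Return "ok", "warning", or "critical" based on the latest samples.

    A level is reported only when the last `sustained` samples all meet or
    exceed it, so a single short spike does not raise an alert.
    """
    threshold = THRESHOLDS[metric]
    recent = samples[-sustained:]
    if len(recent) < sustained:
        return "ok"  # Not enough data yet to call anything sustained.
    if all(value >= threshold.critical for value in recent):
        return "critical"
    if all(value >= threshold.warning for value in recent):
        return "warning"
    return "ok"
```

Requiring several consecutive samples above a level is one simple way to separate sustained pressure from brief spikes. Other approaches, such as averaging over a time window, may suit your environment better.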
Track network traffic to see how much bandwidth you are using and find when you might run out of capacity. Understanding how traffic levels change lets you add resources ahead of expected busy periods and remove them when things are quiet.
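To estimate when you might run out of capacity, you can extrapolate a trend from past usage. The sketch below fits a straight line through equally spaced samples, such as daily peak bandwidth, and estimates how many more sampling intervals remain before the line reaches a capacity limit. The function name `intervals_until_full` and the linear-growth assumption are illustrative; real traffic is often seasonal or bursty, so treat the result as a rough guide.

```python
from __future__ import annotations

from typing import Optional, Sequence


def intervals_until_full(usage: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate how many more sampling intervals until usage reaches capacity.

    Fits a least-squares line through equally spaced samples and extrapolates.
    Returns None when there are fewer than two samples or usage is not growing.
    """
    n = len(usage)
    if n < 2:
        return None

    # Sample positions are 0, 1, ..., n - 1, so their mean is (n - 1) / 2.
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None  # Flat or shrinking usage never reaches the limit.

    intercept = mean_y - slope * mean_x
    # Position where the fitted line crosses capacity, measured from the
    # most recent sample. Clamp at zero if the limit is already reached.
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (n - 1))
```

For example, with hypothetical daily peaks of 100, 200, 300, and 400 Mbps on a 1000 Mbps link, the fitted line grows by 100 Mbps per day, so the function returns 6.0 days of remaining headroom.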
Implement service metrics
Service metrics focus on how well your applications are running and how happy your users are. These metrics help you find problems with your applications and make sure your users have a good experience.
Common service metrics include response times, error rates, throughput, and latency.
Watch response times to see how fast your services answer user requests. Slow response times point to bottlenecks that you need to find and fix.
Track error rates to find when your applications are failing and having reliability problems. High error rates can mean that bugs, configuration problems, or insufficient resources are causing your services to fail.
Measure throughput to see how much work your application can handle under load. Throughput that stays high as traffic increases suggests you have enough capacity, because your services keep performing well under additional load.
Watch latency to find performance problems that affect how your users experience your application. High latency can mean network congestion, routing problems, or service performance issues that need fixing.
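Several of these service metrics can be derived from the same request records. The following Python sketch computes an error rate, throughput, and 95th percentile response time for one time window. The `Request` fields, the `summarize` function, and the choice to count only HTTP 5xx statuses as errors are assumptions made for this example; your services may log different fields or define errors differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One handled request; the field names are hypothetical."""

    duration_ms: float
    status: int


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute error rate, throughput, and p95 response time for one window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")

    total = len(requests)
    # Assumption: only server-side failures (HTTP 5xx) count as errors.
    errors = sum(1 for request in requests if request.status >= 500)
    durations = sorted(request.duration_ms for request in requests)

    # Nearest-rank percentile: the smallest sample with at least 95% of all
    # samples at or below it. Integer math avoids float rounding surprises.
    rank = (95 * total + 99) // 100

    return {
        "error_rate": errors / total,
        "throughput_rps": total / window_seconds,
        "p95_ms": durations[rank - 1],
    }
```

A percentile is used here instead of an average because a small number of very slow requests can hide behind a healthy-looking mean.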
Next steps
In this section of Monitor system health, you learned about identifying and implementing common infrastructure and service metrics, including selecting infrastructure metrics and implementing service metrics. Identify common metrics is part of Optimize systems.
Refer to the following documents to learn more about monitoring and optimization:
- Monitor network traffic to track network performance and connectivity
- Detect configuration drift to maintain infrastructure consistency
- Scale servers to implement server-level scaling strategies
To learn more about monitoring and metrics, refer to the following resources:
- Recommendations for designing and creating a monitoring system - Microsoft's guidance on monitoring system design
- Collect workload performance data - Guide to collecting performance data for analysis