From Breaking to Repairing: Using Performance Tests to Strengthen Your System ❤️🩹
In software development, there is an area some of us are passionate about, but unfortunately it is often overlooked or ignored, whether because of company culture or because it hasn’t yet sparked people’s curiosity. What am I talking about? Quality assurance.
The world of testing and quality assurance seeks to evaluate a system from different angles, ranging from unit tests that verify that a single function works correctly, to end-to-end tests that assess the communication between different systems.
It is crucial to define the purpose of what you want to test, consider your system’s requirements, and understand user expectations. The area of testing is vast, and not everything can be covered in one article; many types of tests exist. Today I will focus on load testing, hoping to pique your curiosity about diving into this realm.
The “Día sin IVA” disaster could have been avoided
Context: During the COVID-19 pandemic in Colombia, there was an online sale event similar to Black Friday, called “Día sin IVA” (in English, something like “Day without VAT”, where IVA refers to the Value Added Tax).
Nationwide, many products were exempt from this tax. This promotion led to numerous Colombians overwhelming websites with their visits. Sadly, many of these sites were not prepared for the massive influx of users, resulting in service interruptions and outages for several hours due to the lack of resources to manage the unexpected load.
Could this have been prevented? Definitely. After that incident, we saw the implementation of “strategies” to regulate the number of simultaneous users, such as “shifts” or “virtual queues”. These measures could have been in place from the first event if load and scalability tests had been conducted, allowing teams to anticipate the limit of concurrent users and prepare the system for it, avoiding failures in production environments.
Load Testing
A test evaluates whether something meets its minimum desired characteristics, in other words, “the system’s expectations” or “the minimum required for correct operation”. Sound complex? What I’m trying to say is that the software you write should behave in a desired way under certain scenarios, and we subject it to various tests to confirm that it acts as expected.
Although many sites describe these tests in different ways, all agree that they aim to evaluate the system through a simulation. In my words, load tests let us assess how a system “behaves” or “performs” under conditions such as a certain number of users accessing it simultaneously, or a specific number of requests, all using specialized tools.
You might wonder: why would I want to know this? As I mentioned earlier, you can anticipate and plan the steps to take under certain scenarios. For example, if your system receives more requests than it can handle, what would be the next step? Implementing a virtual queue is one option. And depending on the test results, you can determine when the system needs to scale, either horizontally or vertically.
Practical Case: Weather Service
Let’s get to work. To better understand load testing, I’ve created a service that provides the current weather for a city. This service connects to a provider (a remote service) to obtain that information.
Repository: https://github.com/AFU92/weather-service
Tools I’ll use for testing
JMeter
URL: https://jmeter.apache.org/
I chose this open-source tool because it is widely used for load testing in the industry. It allows us to simulate multiple scenarios and users to assess an application’s performance. Additionally, since it is a Java application, there is no need for a traditional installation: we can download the binary directly from the Apache website and run it like any other Java app, or use the executable scripts (.sh) it includes. In my case, working on macOS, I installed it via Homebrew with the command:
brew install jmeter
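Once a test plan is saved, JMeter can also run it headless from the terminal, which is the recommended way to execute an actual load test. A minimal sketch (the .jmx file name is my assumption; use whatever you named your plan):

jmeter -n -t weather-service-load-test.jmx -l results.jtl

The -n flag runs JMeter in non-GUI mode, -t points to the test plan, and -l writes the raw results to a .jtl file that the listeners can load afterwards.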
For the test scenarios we’ll develop later, we’ll use the following JMeter features:
- Test Plan: A structured file (.jmx, in XML format) that encompasses all the test settings and details the objectives and specific steps to run a test. In my case, it’s called “Weather Service Load Test”.
- Thread Group: Represents multiple users performing the same actions on the service to be tested. Multiple thread groups can exist and can run simultaneously or consecutively.
- HTTP Request: In this case, I will execute an HTTP GET request, since the service to test is RESTful.
- Listeners: Components in JMeter that let you see and analyze test run results.
- View Results Tree: Shows request and response data for each sample.
- Summary Report: Provides an overview of request and response details, including averages, medians, and error percentages.
- Aggregate Graph: Graphical representation of aggregate response times for all requests.
- Response Time Graph: A graph that visualizes response times for requests over the duration of a test.
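When running in non-GUI mode, JMeter can also generate an HTML dashboard that covers much of what these listeners show (summary statistics, aggregate data, and response times over time). A sketch, reusing the file names assumed above; note that the output folder must be empty or not yet exist:

jmeter -n -t weather-service-load-test.jmx -l results.jtl -e -o report/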
MockServer
URL: https://www.mock-server.com/
When conducting load tests, if the service we want to test connects to other services or external dependencies, it’s common and advisable to use mocks. These mocks simulate the normal behavior of those other services so that we avoid sending the load to the actual services. In this context, I’ll use MockServer to mimic the external provider that my weather-service consumes.
I used Docker to run MockServer inside a container:
docker run -d --name mockserver -p 1080:1080 mockserver/mockserver
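As a quick sanity check (my habit, not a required step), you can confirm the container is up and that MockServer is listening. The status endpoint below is part of MockServer’s REST API, to the best of my knowledge; if in doubt, the container logs alone are enough:

docker logs mockserver
curl -s -X PUT "http://localhost:1080/mockserver/status"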
Additionally, I used the following command to register the mock of the external service (an “expectation”) in MockServer via the terminal:
curl -v -X PUT "http://localhost:1080/mockserver/expectation" -d '{
  "httpRequest": {
    "method": "GET",
    "path": "/your-path"
  },
  "httpResponse": {
    "statusCode": 200,
    "body": {
      "your": "response"
    }
  }
}'
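To confirm the stub is in place, you can call the mocked path directly; you should get the configured body back:

curl -v "http://localhost:1080/your-path"

For the weather case, a hypothetical expectation might look like the one below. The path and payload are invented for illustration; match them to whatever your real provider returns:

curl -v -X PUT "http://localhost:1080/mockserver/expectation" -d '{
  "httpRequest": {
    "method": "GET",
    "path": "/current-weather"
  },
  "httpResponse": {
    "statusCode": 200,
    "body": {
      "city": "Bogota",
      "temperatureCelsius": 18,
      "condition": "cloudy"
    }
  }
}'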
Progressive Load Scenarios
The purpose of this case study is to evaluate the capacity and performance of our weather service by simulating different user loads. Through the progressive application of load scenarios, we aim to identify potential bottlenecks, resource limitations, and failure points.
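A practical way to reuse a single test plan across the scenarios below is to pass the user count, duration, and ramp-up as JMeter properties with the -J flag, and have the Thread Group read them with the __P function (e.g. ${__P(users,100)}). This wiring is my own setup choice, not something the plan requires. For Scenario 1, the invocation might look like this (values in seconds):

jmeter -n -t weather-service-load-test.jmx -l scenario1.jtl -Jusers=100 -Jduration=600 -Jrampup=120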
1. Scenario: Basic Load
Objective: Evaluate the system’s response under a baseline or nominal load.
Configuration:
- Concurrent users: 100
- Duration: 10 minutes
- Ramp-up time: 2 minutes
Expected Outcome: The application is anticipated to handle this load level without showing signs of stress, maintaining consistent response times, and with no significant errors.
Actual Outcome: The application managed to handle the load, with no failed responses (as evidenced in the View Results Tree listener). However, the response times were much higher than expected, indicating some strain on the application (as evidenced in the Response Time Graph).
Improvement Proposal
During Scenario 1, it was found that the current service implementation did not meet the expected results. Given that our service relies on an external provider whose data doesn’t change abruptly, adding a cache with an initial expiry time of 10 minutes is proposed. This would prevent repeated calls to the provider when clients request the same information over and over, improving response times.
Result from repeating Scenario 1: With the caching improvement in place, avoiding repeated requests to the external service, we brought response times down to the expected levels for Scenario 1, approximately 500ms 🎉🎉.
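You can also spot the cache effect outside JMeter with two consecutive timed requests: the first (a cache miss) goes out to the provider, while the second (a hit) is served from memory and should return noticeably faster. The host, port, and endpoint here are hypothetical; adjust them to your service:

time curl -s "http://localhost:8080/weather?city=Bogota" > /dev/null
time curl -s "http://localhost:8080/weather?city=Bogota" > /dev/null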
2. Scenario: Medium Load
Objective: Determine how the system behaves when increasing the load to a medium level, simulating a typical usage surge.
Configuration:
- Concurrent users: 150
- Duration: 15 minutes
- Ramp-up time: 3 minutes
Expected Outcome: There may be an increase in response times, but the application should still function correctly without major errors. This is an opportunity to identify if there is any resource nearing its limits.
Actual Outcome: The expected outcome was met. Response times increased to 800ms, but the application remained stable and processed all requests adequately.
3. Scenario: Peak Load
Objective: Stress the application to its maximum capacity, pushing it to its limits to understand its breaking points.
Configuration:
- Concurrent users: 200
- Duration: 20 minutes
- Ramp-up time: 4 minutes
Expected Outcome: Under this extreme load, significant increases in response times and potentially some failures are likely. The goal is to identify breaking points and understand how the system behaves under maximal stress.
Actual Outcome: As expected, response times increased to 1800ms, and some requests failed without being processed.
Conclusions
After completing the three progressive load scenarios, I gathered insights regarding the behavior and responsiveness of the weather service under varying levels of demand:
- Throughout the tests, the development environment (my machine) showed its limitations. In a production environment with more dedicated resources, my service might benefit from vertical scaling, that is, increasing the power of the machine in terms of memory, CPU, and/or storage. With more resources, the service should serve more simultaneous users without degrading performance or quality.
- Horizontal scaling is also worth considering. By this, I mean adding more instances (or servers) to distribute the workload and improve response capacity. Based on the results, I suggest adding instances when concurrent users surpass 150 or when service latency exceeds 800ms. This would accommodate more users while maintaining consistent and faster response times (see the sketch after these conclusions).
- Since the tests were conducted on my development environment, which is a personal machine, we should consider that the results offer insights into the potential behavior of the service under certain conditions. In practice, a production environment with real scaling capacities might exhibit different behaviors.
- Based on the tests and results obtained, planning an appropriate scaling strategy for the production environment is essential. Decisions on whether to scale vertically or horizontally should be based on specific, objective metrics, taking into account both the business’s unique characteristics and demands and those of the service or application.
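To make the horizontal-scaling suggestion concrete: if the service were containerized and defined in a Docker Compose file behind a load balancer (a hypothetical setup; the service name below is invented), adding instances could be as simple as:

docker compose up -d --scale weather-service=3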
In conclusion, the tests provided me with the information needed to plan and prepare my service for a successful production launch. This allows me to scale it appropriately, either vertically or horizontally, so the service or application can handle future demands efficiently, ensuring an optimal user experience. 💙