load tests for scrape route

2024-05-22 09:30:32 -03:00 · 2024-05-22 09:30:32 -03:00 · 068a240ab4
commit 068a240ab4
parent 75f4e34d8e
15 changed files with 5150 additions and 8 deletions
--- a/apps/api/fly.staging.toml
+++ b/apps/api/fly.staging.toml
@@ -17,20 +17,20 @@ kill_timeout = '5s'
 [http_service]
  internal_port = 8080
  force_https = true
-  auto_stop_machines = false
+  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 2
  processes = ['app']
 [http_service.concurrency]
  type = "requests"
-  hard_limit = 1000
+  hard_limit = 100
-  soft_limit = 1000
+  soft_limit = 50
 [[services]]
  protocol = 'tcp'
  internal_port = 8080
-  processes = ['app']
+  processes = ['worker']
 [[services.ports]]
    port = 80
@@ -43,8 +43,8 @@ kill_timeout = '5s'
  [services.concurrency]
    type = 'connections'
-    hard_limit = 1000
+    hard_limit = 25
-    soft_limit = 1000
+    soft_limit = 20
 [[vm]]
  size = 'performance-1x'
--- a/apps/test-suite/load-test-results/tests-1-5/assets/CPU-utilization-report-test-1.png
+++ b/apps/test-suite/load-test-results/tests-1-5/assets/CPU-utilization-report-test-1.png
--- a/apps/test-suite/load-test-results/tests-1-5/assets/memory-utilization-report-test-1.png
+++ b/apps/test-suite/load-test-results/tests-1-5/assets/memory-utilization-report-test-1.png
--- a/apps/test-suite/load-test-results/tests-1-5/assets/metrics-test-2.png
+++ b/apps/test-suite/load-test-results/tests-1-5/assets/metrics-test-2.png
--- a/apps/test-suite/load-test-results/tests-1-5/assets/metrics-test-3.png
+++ b/apps/test-suite/load-test-results/tests-1-5/assets/metrics-test-3.png
--- a/apps/test-suite/load-test-results/tests-1-5/assets/metrics-test-4.png
+++ b/apps/test-suite/load-test-results/tests-1-5/assets/metrics-test-4.png
--- a/apps/test-suite/load-test-results/tests-1-5/assets/metrics-test-5.png
+++ b/apps/test-suite/load-test-results/tests-1-5/assets/metrics-test-5.png
--- a/apps/test-suite/load-test-results/tests-1-5/assets/test-run-report.json
+++ b/apps/test-suite/load-test-results/tests-1-5/assets/test-run-report.json
--- a/apps/test-suite/load-test-results/tests-1-5/load-test-1.md
+++ b/apps/test-suite/load-test-results/tests-1-5/load-test-1.md
@@ -0,0 +1,98 @@
 # Scraping Load Testing - Test #1
 ## Summary
 The load test successfully processed 600 requests in 60 seconds with all requests returning HTTP 200 status codes. The average response time was 1380.1 ms, with CPU utilization peaking at around 50% on both machines, indicating sufficient CPU resources. However, there was a significant increase in memory usage post-test, which did not return to pre-test levels, suggesting a potential memory leak. Further investigation and additional load tests are recommended to address this issue and optimize the system's performance.
 ## Table of Contents
 - [Scraping Load Testing - Test #1](#scraping-load-testing---test-1)
  - [Summary](#summary)
  - [Table of Contents](#table-of-contents)
  - [Test environment](#test-environment)
    - [Machines](#machines)
  - [Load #1 - 600 reqs 60 secs (initial load only)](#load-1---600-reqs-60-secs-initial-load-only)
    - [Artillery Report](#artillery-report)
    - [CPU Utilization](#cpu-utilization)
    - [Memory Utilization](#memory-utilization)
  - [Conclusions and Next Steps](#conclusions-and-next-steps)
    - [Conclusions](#conclusions)
    - [Next Steps](#next-steps)
 ## Test environment
 ### Machines
 | Machine | Size/CPU |
 |---|---|
 | e286de4f711e86 mia (app) | performance-cpu-1x@2048MB |
 | 73d8dd909c1189 mia (app) | performance-cpu-1x@2048MB |
 ---
 ## Load #1 - 600 reqs 60 secs (initial load only)
 ```yml
 # load-test.yml
 - duration: 60
   arrivalRate: 10  # Initial load
 ```
 ### Artillery Report
 Date: 10:49:39(-0300)
 | Metric                                      | Value   |
 |---------------------------------------------|---------|
 | http.codes.200                              | 600     |
 | http.downloaded_bytes                       | 0       |
 | http.request_rate                           | 10/sec  |
 | http.requests                               | 600     |
 | http.response_time.min                      | 984     |
 | http.response_time.max                      | 2267    |
 | http.response_time.mean                     | 1380.1  |
 | http.response_time.median                   | 1353.1  |
 | http.response_time.p95                      | 1755    |
 | http.response_time.p99                      | 2059.5  |
 | http.responses                              | 600     |
 | vusers.completed                            | 600     |
 | vusers.created                              | 600     |
 | vusers.created_by_name.Scrape a URL         | 600     |
 | vusers.failed                               | 0       |
 | vusers.session_length.min                   | 1053.7  |
 | vusers.session_length.max                   | 2332.6  |
 | vusers.session_length.mean                  | 1447.4  |
 | vusers.session_length.median                | 1436.8  |
 | vusers.session_length.p95                   | 1863.5  |
 | vusers.session_length.p99                   | 2143.5  |
 ### CPU Utilization
 ![](./assets/CPU-utilization-report-test-1.png)
 Both machines peaked at around 50% CPU utilization.
 ### Memory Utilization
 ![](./assets/memory-utilization-report-test-1.png)
 | Machine | Before | After Load Test |
 |---|---|---|
 | e286de4f711e86 | 295 MiB | 358 MiB |
 | 73d8dd909c1189 | 296 MiB | 355 MiB |
 Note that memory utilization did not return to its pre-test values during the check window, which may indicate a memory leak.
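To put a number on the growth, the before/after readings from the table can be compared directly. This is a minimal sketch; the function name `memory_growth` is illustrative, and a single before/after pair cannot by itself confirm a leak.

```python
# Memory readings in MiB, copied from the table above: (before, after load test).
readings = {
    "e286de4f711e86": (295, 358),
    "73d8dd909c1189": (296, 355),
}


def memory_growth(before_mib, after_mib):
    """Return (absolute growth in MiB, relative growth in percent)."""
    delta = after_mib - before_mib
    return delta, round(100 * delta / before_mib, 1)


growth = {machine: memory_growth(*pair) for machine, pair in readings.items()}

for machine, (delta, pct) in growth.items():
    print(f"{machine}: +{delta} MiB ({pct}%)")
```

Both machines retained roughly 60 MiB (about 20%) after the run. Repeating the measurement over several consecutive runs would show whether the baseline keeps climbing.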
 ---
 ## Conclusions and Next Steps
 ### Conclusions
 1. **Performance:** The system handled 600 requests in 60 seconds with a mean response time of 1380.1 ms. All requests were successful (HTTP 200).
 2. **CPU Utilization:** Both machines peaked at around 50% CPU utilization, indicating that the CPU resources were sufficient for the load.
 3. **Memory Utilization:** There was a noticeable increase in memory usage on both machines post-test, and the memory did not re-stabilize to pre-test levels, suggesting a potential memory leak.
 ### Next Steps
 1. **Investigate Memory Leak:** Conduct a detailed analysis to identify and fix the potential memory leak. This may involve profiling the application and reviewing the code for memory management issues.
 2. **Additional Load Tests:** Perform additional load tests with varying request rates and durations to further assess the system's performance and stability.
 3. **Optimize Performance:** Based on the findings, optimize the application to improve response times and resource utilization.
 4. **Monitor in Production:** Implement monitoring in the production environment to ensure that similar issues do not occur under real-world conditions.
 5. **Documentation:** Update the documentation with the findings and any changes made to the system as a result of this test.
 By following these steps, we can ensure that the system is robust, efficient, and ready to handle production workloads.
--- a/apps/test-suite/load-test-results/tests-1-5/load-test-2.md
+++ b/apps/test-suite/load-test-results/tests-1-5/load-test-2.md
@@ -0,0 +1,93 @@
 # Scraping Load Testing - Test #2
 ## Summary
 The load test encountered significant issues, processing 9000 requests with 5473 timeouts and a 61.6% failure rate. The average response time was 3682.1 ms, with a peak response time of 9919 ms. Both machines reached 100% CPU utilization, leading to severe performance bottlenecks and high failure rates. This indicates the need for substantial optimizations, autoscaling, and further investigation.
 ## Table of Contents
 - [Scraping Load Testing - Test #2](#scraping-load-testing---test-2)
  - [Summary](#summary)
  - [Table of Contents](#table-of-contents)
  - [Test environment](#test-environment)
    - [Machines](#machines)
  - [Load #2 - 9000 reqs 7 mins 11 secs (4 phases)](#load-2---9000-reqs-7-mins-11-secs-4-phases)
    - [Artillery Report](#artillery-report)
    - [Metrics](#metrics)
  - [Conclusions and Next Steps](#conclusions-and-next-steps)
    - [Conclusions](#conclusions)
    - [Next Steps](#next-steps)
 ## Test environment
 ### Machines
 | Machine | Size/CPU |
 |---|---|
 | e286de4f711e86 mia (app) | performance-cpu-1x@2048MB |
 | 73d8dd909c1189 mia (app) | performance-cpu-1x@2048MB |
 ---
 ## Load #2 - 9000 reqs 7 mins 11 secs (4 phases)
 ```yml
 # load-test.yml
 - duration: 60
   arrivalRate: 10  # Initial load
 - duration: 120
   arrivalRate: 20  # Increased load
 - duration: 180
   arrivalRate: 30  # Peak load
 - duration: 60
   arrivalRate: 10  # Cool down
 ```
 ### Artillery Report
 Date: 13:50:08(-0300)
 | Metric                                      | Value   |
 |---------------------------------------------|---------|
 | errors.ETIMEDOUT                            | 5473    |
 | errors.Failed capture or match              | 73      |
 | http.codes.200                              | 3454    |
 | http.codes.401                              | 64      |
 | http.codes.402                              | 9       |
 | http.downloaded_bytes                       | 0       |
 | http.request_rate                           | 21/sec  |
 | http.requests                               | 9000    |
 | http.response_time.min                      | 929     |
 | http.response_time.max                      | 9919    |
 | http.response_time.mean                     | 3682.1  |
 | http.response_time.median                   | 3395.5  |
 | http.response_time.p95                      | 8024.5  |
 | http.response_time.p99                      | 9607.1  |
 | http.responses                              | 3527    |
 | vusers.completed                            | 3454    |
 | vusers.created                              | 9000    |
 | vusers.created_by_name.Scrape a URL         | 9000    |
 | vusers.failed                               | 5546    |
 | vusers.session_length.min                   | 1127.6  |
 | vusers.session_length.max                   | 9982.2  |
 | vusers.session_length.mean                  | 3730.6  |
 | vusers.session_length.median                | 3464.1  |
 | vusers.session_length.p95                   | 7865.6  |
 | vusers.session_length.p99                   | 9607.1  |
 ### Metrics
 ![](./assets/metrics-test-2.png)
 Both machines reached 100% CPU utilization, which led to a significant number of request failures (61.6% failure rate).
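The 61.6% figure follows from the virtual-user counters in the report. A minimal sketch of the arithmetic, with `failure_rate` as an illustrative helper name:

```python
def failure_rate(failed, created):
    """Percentage of virtual users that failed, rounded to one decimal."""
    return round(100 * failed / created, 1)


# Counters copied from the Artillery report above.
etimedout = 5473          # errors.ETIMEDOUT
failed_capture = 73       # errors.Failed capture or match
vusers_created = 9000     # vusers.created
vusers_failed = 5546      # vusers.failed

rate = failure_rate(vusers_failed, vusers_created)
print(f"{rate}% of virtual users failed")
```

The two error counters add up to `vusers.failed` (5473 + 73 = 5546), so timeouts account for nearly all of the failures.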
 ---
 ## Conclusions and Next Steps
 ### Conclusions
 1. **Performance:** The system struggled with 9000 requests, resulting in 5473 timeouts and a mean response time of 3682.1 ms.
 2. **CPU Utilization:** Both machines experienced 100% CPU utilization, causing severe performance degradation and high failure rates.
 ### Next Steps
 Implement an autoscaling solution on Fly.io and conduct tests using the same configurations.
--- a/apps/test-suite/load-test-results/tests-1-5/load-test-3.md
+++ b/apps/test-suite/load-test-results/tests-1-5/load-test-3.md
@@ -0,0 +1,107 @@
 # Scraping Load Testing - Test #3
 ## Summary
 The load test involved setting up an autoscaling option and adjusting the hard and soft limits for the Fly.io configuration. The test environment consisted of 5 machines, with 3 machines automatically scaling up during the test. Despite the scaling, there were 653 timeouts (7.3%) and 2 HTTP 502 responses (0.02%). The average response time was 3037.2 ms, with a peak response time of 9941 ms. Further adjustments to the soft limit are recommended to improve performance and reduce errors.
 ## Table of Contents
 - [Scraping Load Testing - Test #3](#scraping-load-testing---test-3)
  - [Summary](#summary)
  - [Table of Contents](#table-of-contents)
  - [Test environment](#test-environment)
    - [Machines](#machines)
  - [Load Test Phases](#load-test-phases)
    - [Configuration](#configuration)
    - [Results](#results)
    - [Metrics](#metrics)
  - [Conclusions and Next Steps](#conclusions-and-next-steps)
    - [Conclusions](#conclusions)
    - [Next Steps](#next-steps)
 ## Test environment
 ### Machines
 | Machine | Size/CPU | Status |
 |---|---|---|
 | e286de4f711e86 mia (app) | performance-cpu-1x@2048MB | always on |
 | 73d8dd909c1189 mia (app) | performance-cpu-1x@2048MB | always on |
 | 6e82050c726358 mia (app) | performance-cpu-1x@2048MB | paused |
 | 4d89505a6e5038 mia (app) | performance-cpu-1x@2048MB | paused |
 | 48ed6e6b74e378 mia (app) | performance-cpu-1x@2048MB | paused |
 ---
 ## Load Test Phases
 ### Configuration
 ```toml
 # fly.staging.toml
 [http_service.concurrency]
  type = "requests"
  hard_limit = 100
  soft_limit = 75
 ```
 ```yml
 # load-test.yml
 - duration: 60
   arrivalRate: 10  # Initial load
 - duration: 120
   arrivalRate: 20  # Increased load
 - duration: 180
   arrivalRate: 30  # Peak load
 - duration: 60
   arrivalRate: 10  # Cool down
 ```
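The 9000-request total follows from the four phases: each phase creates `arrivalRate` virtual users per second for `duration` seconds. The sketch below assumes one request per virtual user, which matches the report (`http.requests` equals `vusers.created`); the helper name `expected_requests` is illustrative.

```python
# Phases copied from load-test.yml above.
phases = [
    {"duration": 60, "arrivalRate": 10},   # Initial load
    {"duration": 120, "arrivalRate": 20},  # Increased load
    {"duration": 180, "arrivalRate": 30},  # Peak load
    {"duration": 60, "arrivalRate": 10},   # Cool down
]


def expected_requests(phases):
    """Total virtual users created, assuming one request per virtual user."""
    return sum(p["duration"] * p["arrivalRate"] for p in phases)


total_requests = expected_requests(phases)
total_seconds = sum(p["duration"] for p in phases)

print(total_requests, "requests over", total_seconds, "seconds")
```

The peak phase alone contributes 5400 of the 9000 requests, so most errors are likely to originate there.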
 ### Results
 Date: 14:53:32(-0300)
 | Metric                                      | Value   |
 |---------------------------------------------|---------|
 | errors.ETIMEDOUT                            | 653     |
 | errors.Failed capture or match              | 2       |
 | http.codes.200                              | 8345    |
 | http.codes.502                              | 2       |
 | http.downloaded_bytes                       | 0       |
 | http.request_rate                           | 11/sec  |
 | http.requests                               | 9000    |
 | http.response_time.min                      | 979     |
 | http.response_time.max                      | 9941    |
 | http.response_time.mean                     | 3037.2  |
 | http.response_time.median                   | 2059.5  |
 | http.response_time.p95                      | 7709.8  |
 | http.response_time.p99                      | 9416.8  |
 | http.responses                              | 8347    |
 | vusers.completed                            | 8345    |
 | vusers.created                              | 9000    |
 | vusers.created_by_name.Scrape a URL         | 9000    |
 | vusers.failed                               | 655     |
 | vusers.session_length.min                   | 1044.5  |
 | vusers.session_length.max                   | 9998.8  |
 | vusers.session_length.mean                  | 3109.7  |
 | vusers.session_length.median                | 2143.5  |
 | vusers.session_length.p95                   | 7709.8  |
 | vusers.session_length.p99                   | 9416.8  |
 ### Metrics 
 ![](./assets/metrics-test-3.png)
 ---
 ## Conclusions and Next Steps
 ### Conclusions
 1. **Performance:** The system handled 9000 requests with a mean response time of 3037.2 ms. There were 653 timeouts and 2 HTTP 502 responses.
 2. **Autoscaling:** Three machines automatically scaled up during the test, but the scaling was not sufficient to prevent all errors.
 3. **Response Times:** The peak response time was 9941 ms, indicating that the system struggled under peak load conditions.
 ### Next Steps
 1. **Adjust Soft Limit:** Keep the hard limit at 100 and lower the soft limit from 75 to 50, to test whether machines start sooner and the number of 502 errors drops.
 2. **Further Load Tests:** Conduct additional load tests with the new configuration to assess improvements.
 By following these steps, we can enhance the system's performance and reliability under varying load conditions.
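When tuning these values, a quick sanity check on the limits can be useful. The sketch below assumes the usual reading of Fly.io's concurrency settings: the soft limit is the point at which traffic is steered to other machines, and the hard limit is the per-machine ceiling. The function names are illustrative and not part of any Fly.io tooling.

```python
def check_limits(soft_limit, hard_limit):
    """Return a list of problems found in a concurrency block (empty if none)."""
    problems = []
    if soft_limit <= 0 or hard_limit <= 0:
        problems.append("limits must be positive")
    if soft_limit > hard_limit:
        # Assumption: a soft limit above the hard limit can never be reached.
        problems.append("soft_limit exceeds hard_limit")
    return problems


def headroom(soft_limit, hard_limit):
    """Requests a machine can still absorb after crossing the soft limit."""
    return hard_limit - soft_limit


print(check_limits(75, 100), headroom(75, 100))  # configuration used in this test
print(check_limits(50, 100), headroom(50, 100))  # a lower soft limit
print(check_limits(100, 50))                     # inverted limits are flagged
```

Under this reading, lowering the soft limit widens the gap between the point where traffic is steered away and the point where a machine stops accepting requests.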
--- a/apps/test-suite/load-test-results/tests-1-5/load-test-4.md
+++ b/apps/test-suite/load-test-results/tests-1-5/load-test-4.md
@@ -0,0 +1,103 @@
 # Scraping Load Testing - Test #4
 ## Summary
 The load test was conducted with the Fly.io configuration set to a hard limit of 100 and a soft limit of 50. The test involved four phases with varying arrival rates. Despite the adjustments, there were 1329 timeouts (14.8%) but no HTTP 502 responses. The average response time was 3547.9 ms, with a peak response time of 9935 ms. Further adjustments to the artillery timeout configuration are recommended to improve performance.
 ## Table of Contents
 - [Scraping Load Testing - Test #4](#scraping-load-testing---test-4)
  - [Summary](#summary)
  - [Table of Contents](#table-of-contents)
  - [Test environment](#test-environment)
    - [Machines](#machines)
  - [Load Test Phases](#load-test-phases)
    - [Configuration](#configuration)
    - [Results](#results)
  - [Metrics](#metrics)
  - [Conclusions and Next Steps](#conclusions-and-next-steps)
    - [Conclusions](#conclusions)
    - [Next Steps](#next-steps)
 ## Test environment
 ### Machines
 | Machine | Size/CPU | Status |
 |---|---|---|
 | e286de4f711e86 mia (app) | performance-cpu-1x@2048MB | always on |
 | 73d8dd909c1189 mia (app) | performance-cpu-1x@2048MB | always on |
 | 6e82050c726358 mia (app) | performance-cpu-1x@2048MB | paused |
 | 4d89505a6e5038 mia (app) | performance-cpu-1x@2048MB | paused |
 | 48ed6e6b74e378 mia (app) | performance-cpu-1x@2048MB | paused |
 ---
 ## Load Test Phases
 ### Configuration
 ```toml
 # fly.staging.toml
 [http_service.concurrency]
  type = "requests"
  hard_limit = 100
  soft_limit = 50
 ```
 ```yml
 # load-test.yml
 - duration: 60
   arrivalRate: 10  # Initial load
 - duration: 120
   arrivalRate: 20  # Increased load
 - duration: 180
   arrivalRate: 30  # Peak load
 - duration: 60
   arrivalRate: 10  # Cool down
 ```
 ### Results
 Date: 15:43:26(-0300)
 | Metric                                      | Value   |
 |---------------------------------------------|---------|
 | errors.ETIMEDOUT                            | 1329    |
 | http.codes.200                              | 7671    |
 | http.downloaded_bytes                       | 0       |
 | http.request_rate                           | 23/sec  |
 | http.requests                               | 9000    |
 | http.response_time.min                      | 999     |
 | http.response_time.max                      | 9935    |
 | http.response_time.mean                     | 3547.9  |
 | http.response_time.median                   | 2836.2  |
 | http.response_time.p95                      | 8352    |
 | http.response_time.p99                      | 9607.1  |
 | http.responses                              | 7671    |
 | vusers.completed                            | 7671    |
 | vusers.created                              | 9000    |
 | vusers.created_by_name.Scrape a URL         | 9000    |
 | vusers.failed                               | 1329    |
 | vusers.session_length.min                   | 1063.4  |
 | vusers.session_length.max                   | 10006.8 |
 | vusers.session_length.mean                  | 3616    |
 | vusers.session_length.median                | 2893.5  |
 | vusers.session_length.p95                   | 8352    |
 | vusers.session_length.p99                   | 9607.1  |
 ## Metrics
 ![](./assets/metrics-test-4.png)
 ---
 ## Conclusions and Next Steps
 ### Conclusions
 1. **Performance:** The system handled 9000 requests with a mean response time of 3547.9 ms. There were 1329 timeouts but no HTTP 502 responses.
 2. **Response Times:** The peak response time was 9935 ms, indicating that the system struggled under peak load conditions.
 ### Next Steps
 1. **Adjust Timeout Configuration:** Raise the Artillery HTTP timeout (`http.timeout` in `load-test.yml`) so that slow responses are measured instead of being recorded as `ETIMEDOUT` errors.
 2. **Further Load Tests:** Conduct additional load tests with the new timeout configuration to assess improvements.
 By following these steps, we can enhance the system's performance and reliability under varying load conditions.
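One way to tell whether timeouts are hiding slow responses is to compare the slowest recorded response with the client timeout. The sketch below is hedged: the 10 000 ms timeout is an assumption (this report does not state the value in effect), and `near_timeout` is an illustrative helper.

```python
ASSUMED_TIMEOUT_MS = 10_000  # assumption: the timeout in effect is not stated in this report


def near_timeout(max_response_ms, timeout_ms, tolerance=0.02):
    """True if the slowest recorded response is within `tolerance` of the timeout."""
    return max_response_ms >= timeout_ms * (1 - tolerance)


# http.response_time.max from the results table above.
capped = near_timeout(9935, ASSUMED_TIMEOUT_MS)
print("latencies look capped by the timeout:", capped)
```

If the check is true, the reported mean and percentiles describe only the requests that finished in time, and the real latency distribution likely has a longer tail.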
--- a/apps/test-suite/load-test-results/tests-1-5/load-test-5.md
+++ b/apps/test-suite/load-test-results/tests-1-5/load-test-5.md
@@ -0,0 +1,94 @@
 # Scraping Load Testing - Test #5
 ## Summary
 The load test was conducted with a higher timeout configuration to address previous timeout issues. The test involved 9000 requests with a timeout set to 30 seconds. The system handled the load well, with only 4 HTTP 502 responses (0.04%). The average response time was 5661.8 ms, with a peak response time of 18924 ms. Further analysis is recommended to optimize response times.
 ## Table of Contents
 - [Scraping Load Testing - Test #5](#scraping-load-testing---test-5)
  - [Summary](#summary)
  - [Table of Contents](#table-of-contents)
  - [Test environment](#test-environment)
    - [Machines](#machines)
  - [Load Test Configuration](#load-test-configuration)
    - [Configuration](#configuration)
    - [Results](#results)
    - [Metrics](#metrics)
  - [Conclusions and Next Steps](#conclusions-and-next-steps)
    - [Conclusions](#conclusions)
    - [Next Steps](#next-steps)
 ## Test environment
 ### Machines
 | Machine | Size/CPU | Status |
 |---|---|---|
 | e286de4f711e86 mia (app) | performance-cpu-1x@2048MB | always on |
 | 73d8dd909c1189 mia (app) | performance-cpu-1x@2048MB | always on |
 | 6e82050c726358 mia (app) | performance-cpu-1x@2048MB | paused |
 | 4d89505a6e5038 mia (app) | performance-cpu-1x@2048MB | paused |
 | 48ed6e6b74e378 mia (app) | performance-cpu-1x@2048MB | paused |
 ---
 ## Load Test Configuration
 ### Configuration
 ```yml
  http:
    timeout: 30
 ```
 ### Results
 Date: 15:59:50(-0300)
 | Metric                                      | Value   |
 |---------------------------------------------|---------|
 | errors.Failed capture or match              | 4       |
 | http.codes.200                              | 8996    |
 | http.codes.502                              | 4       |
 | http.downloaded_bytes                       | 0       |
 | http.request_rate                           | 23/sec  |
 | http.requests                               | 9000    |
 | http.response_time.min                      | 62      |
 | http.response_time.max                      | 18924   |
 | http.response_time.mean                     | 5661.8  |
 | http.response_time.median                   | 5378.9  |
 | http.response_time.p95                      | 11050.8 |
 | http.response_time.p99                      | 12968.3 |
 | http.responses                              | 9000    |
 | vusers.completed                            | 8996    |
 | vusers.created                              | 9000    |
 | vusers.created_by_name.Scrape a URL         | 9000    |
 | vusers.failed                               | 4       |
 | vusers.session_length.min                   | 1079.2  |
 | vusers.session_length.max                   | 18980.3 |
 | vusers.session_length.mean                  | 5734.4  |
 | vusers.session_length.median                | 5487.5  |
 | vusers.session_length.p95                   | 11050.8 |
 | vusers.session_length.p99                   | 12968.3 |
 ### Metrics
 ![](./assets/metrics-test-5.png)
 ---
 ## Conclusions and Next Steps
 ### Conclusions
 1. **Performance:** The system handled 9000 requests with a mean response time of 5661.8 ms. There were only 4 HTTP 502 responses which represent a 0.04% failure rate.
 2. **Response Times:** The peak response time was 18924 ms, indicating that while the system handled the load, there is room for optimization.
 ### Next Steps
 1. **Testing Scraping Strategies:** Conduct further testing on the Playwright instance to ensure it can handle increased load and identify any potential bottlenecks.
 2. **Load Testing Other Functionalities:** Evaluate the performance of other critical routes, such as the crawl route, through additional load tests to ensure comprehensive system reliability.
 3. **Optimize Response Times:** Investigate and implement strategies to reduce the peak response time from 18924 ms. This could involve optimizing database queries, improving server configurations, or enhancing caching mechanisms.
 4. **Error Handling Improvements:** Analyze the causes of the 4 HTTP 502 responses and implement robust error handling and recovery mechanisms to minimize such occurrences in future tests.
 5. **Scalability Assessment:** Assess the system's scalability by gradually increasing the load beyond 9000 requests to determine its breaking point and plan for necessary infrastructure upgrades.
 By following these steps, we can further enhance the system's performance and reliability under varying load conditions.
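For the scalability assessment, heavier runs can be derived from the existing four-phase profile by scaling the arrival rates. This is a sketch only: the scaling factors are hypothetical, `scale_phases` is an illustrative helper, and the totals assume one request per virtual user.

```python
# Baseline phases, as used for the 9000-request runs.
BASE_PHASES = [
    {"duration": 60, "arrivalRate": 10},
    {"duration": 120, "arrivalRate": 20},
    {"duration": 180, "arrivalRate": 30},
    {"duration": 60, "arrivalRate": 10},
]


def scale_phases(phases, factor):
    """Return a copy of the phases with every arrival rate multiplied by `factor`."""
    return [
        {"duration": p["duration"], "arrivalRate": round(p["arrivalRate"] * factor)}
        for p in phases
    ]


def total_requests(phases):
    """Total virtual users created across all phases."""
    return sum(p["duration"] * p["arrivalRate"] for p in phases)


# Hypothetical ramp: 1x, 1.5x, and 2x the current load.
for factor in (1.0, 1.5, 2.0):
    scaled = scale_phases(BASE_PHASES, factor)
    rates = [p["arrivalRate"] for p in scaled]
    print(f"x{factor}: rates={rates}, total={total_requests(scaled)}")
```

Keeping the durations fixed and changing only the rates makes the runs directly comparable with the earlier tests.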
--- a/apps/test-suite/load-test.yml
+++ b/apps/test-suite/load-test.yml
@@ -1,8 +1,16 @@
 config:
  target: "https://staging-firecrawl-scraper-js.fly.dev/v0"
  http:
    timeout: 30
  phases:
    - duration: 60
-      arrivalRate: 10
+      arrivalRate: 10  # Initial load
    - duration: 120
      arrivalRate: 20  # Increased load
    - duration: 180
      arrivalRate: 30  # Peak load
    - duration: 60
      arrivalRate: 10  # Cool down
  defaults:
    headers:
      Authorization: "Bearer {{ $env.TEST_API_KEY }}"
--- a/apps/test-suite/package.json
+++ b/apps/test-suite/package.json
@@ -4,7 +4,7 @@
  "description": "",
  "scripts": {
    "test:suite": "npx jest --detectOpenHandles --forceExit --openHandlesTimeout=120000 --watchAll=false",
-    "test:load": "artillery run load-test.yml",
+    "test:load": "artillery run --output ./load-test-results/tests-1-5/assets/test-run-report.json load-test.yml",
    "test:scrape": "npx jest --detectOpenHandles --forceExit --openHandlesTimeout=120000 --watchAll=false --testPathPattern=tests/scrape.test.ts",
    "test:crawl": "npx jest --detectOpenHandles --forceExit --openHandlesTimeout=120000 --watchAll=false --testPathPattern=tests/crawl.test.ts"
  },