Why Traditional Performance Testing Fails in Modern Distributed Systems


From “Core Activities” to System-Level Reality

Performance testing is traditionally described as a sequence of structured activities:
Requirement analysis  
Test planning  
Script development  
Execution  
Analysis  

This model works well in controlled environments, but in real-world production systems, it breaks down.

In modern architectures—especially those built on microservices, Kubernetes, and ML inference pipelines—performance is no longer just a testing concern.

It is a system behavior problem.

The Gap Between Testing and Production

In lab environments:
Latency ~50ms  
Stable throughput  
Minimal or no failures  

In production:
Latency spikes to 500ms+  
Intermittent timeouts  
Cascading failures  

What changed?

Not the test scripts.  
Not the application logic.  

The system context changed.

Rethinking “Core Activities” in Performance Engineering

1. Requirement Analysis → System Behavior Modeling

Traditionally, requirement analysis focuses on response time targets.

In modern systems, this evolves into modeling end-to-end latency paths, including:
Network hops  
Service dependencies  
External APIs  
Feature stores (in ML systems)  

Performance must be understood as a chain, not a single metric.
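
One way to make the chain concrete is to add up per-hop latency and to estimate how often a request touches at least one slow hop. The sketch below is only illustrative: the hop names and millisecond figures are hypothetical, and it assumes hops run one after another and that slow responses are independent across hops.

```python
# Hypothetical per-hop latency budget (milliseconds) for one request path.
HOPS_MS = {
    "network_ingress": 5.0,
    "api_gateway": 10.0,
    "feature_store": 20.0,
    "model_inference": 40.0,
    "external_api": 60.0,
}


def end_to_end_ms(hops):
    """Total latency when hops are executed one after another."""
    return sum(hops.values())


def chance_of_slow_request(per_hop_slow_fraction, hop_count):
    """Probability that at least one hop is slow, assuming independent hops."""
    return 1.0 - (1.0 - per_hop_slow_fraction) ** hop_count


total = end_to_end_ms(HOPS_MS)
slow = chance_of_slow_request(0.01, len(HOPS_MS))
print(f"end-to-end budget: {total:.0f} ms")
print(f"requests touching a slow hop: {slow:.1%}")
```

Even if each hop is slow only 1% of the time, the share of requests affected grows with every hop added to the path.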

2. Test Planning → Workload Realism Engineering

Traditional test planning emphasizes simulating user load.

Modern approaches focus on recreating real-world conditions:
Traffic spikes  
Burst patterns  
Cache warm vs cold states  
Autoscaling delays  

Synthetic load that ignores these conditions does not represent production traffic.

3. Script Development → Distributed Interaction Simulation

Instead of relying on single-tool scripting (e.g., JMeter), modern performance engineering requires simulating distributed interactions, including:
Service-to-service calls  
Asynchronous messaging  
Retry storms  
Backpressure effects  

Failures emerge from interactions, not individual endpoints.
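
Retry storms are a good example of an interaction effect. As a rough sketch, assume every layer in a call chain retries a failing downstream call a fixed number of times and every attempt fails; the function name and the numbers are hypothetical, and the result is a worst-case bound rather than typical behavior.

```python
def worst_case_attempts(layers, retries_per_layer):
    """Upper bound on calls reaching the deepest service for one user request.

    Assumes each of the `layers` calling layers makes 1 initial attempt plus
    `retries_per_layer` retries, and that every attempt fails.
    """
    return (1 + retries_per_layer) ** layers


for depth in (1, 2, 3, 4):
    print(depth, worst_case_attempts(depth, retries_per_layer=3))
```

The growth is exponential in the depth of the call chain, which is why a single struggling dependency can be overwhelmed by its own callers.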

4. Test Execution → Environment Fidelity

Running tests in staging is no longer sufficient.

Modern execution requires production-like environments:
Same infrastructure (e.g., Kubernetes / EKS)  
Identical autoscaling configurations  
Consistent observability stack  

Many production performance issues originate from:
Resource contention  
Scheduling delays  
Infrastructure constraints  

5. Result Analysis → Root Cause Decomposition

Traditional analysis identifies slow endpoints.

Modern analysis focuses on system-level signals such as:
CPU throttling  
Pod evictions  
Queue buildup  
Cache misses  
Network latency  

Latency is an emergent property of the system, not the product of a single isolated cause.

Hidden Performance Killers (Often Missed)

In production systems, performance degradation is often driven by factors that traditional testing overlooks:
Kubernetes resource contention  
Autoscaling lag (HPA delays)  
Cold cache / feature fetch latency  
Model loading overhead (ML systems)  
Retry amplification in microservices  

These factors are rarely captured in conventional workflows.
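
Autoscaling lag can be reasoned about with simple arithmetic. The sketch below assumes constant arrival and service rates and a fixed scale-up delay; the function names and figures are hypothetical and ignore timeouts, retries and load shedding.

```python
def backlog_during_scale_up(arrival_rps, capacity_rps, lag_seconds):
    """Requests queued while new capacity is still starting up."""
    excess = max(0.0, arrival_rps - capacity_rps)
    return excess * lag_seconds


def drain_seconds(backlog, arrival_rps, new_capacity_rps):
    """Time to clear the backlog once the added capacity is serving traffic."""
    spare = new_capacity_rps - arrival_rps
    if spare <= 0:
        return float("inf")  # the queue never drains
    return backlog / spare


backlog = backlog_during_scale_up(arrival_rps=1500, capacity_rps=1000, lag_seconds=90)
print("queued requests:", backlog)
print("seconds to drain:", drain_seconds(backlog, arrival_rps=1500, new_capacity_rps=2000))
```

A test that never exceeds current capacity for longer than the scale-up delay will not surface this backlog at all.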

From Performance Testing to Performance Engineering

Traditional approach:
Tool-driven  
Script-based  
Pre-production focused  
Endpoint-level metrics  

Modern approach:
System-driven  
Behavior modeling  
Continuous validation  
End-to-end observability  

Key Insight

Your system is not slow because your code is inefficient.  
Your system is slow because your architecture behaves differently under real-world conditions.

Final Takeaway

If performance testing is treated as a checklist, the wrong problem is being solved.

Modern systems require:
  1. Observability-first thinking  
  2. Infrastructure-aware testing  
  3. System-level reasoning  

This is where Performance Engineering and PerfMLOps converge.

Author

I specialize in Performance Engineering and PerfMLOps, focusing on system-level latency optimization in distributed and ML-driven architectures.

Common Performance Problems

Most performance problems revolve around speed, response time, load time and poor scalability. Speed is often one of the most important attributes of an application: a slow-running application will lose potential users. Performance testing is done to make sure an app runs fast enough to keep a user's attention and interest. In the following list of common performance problems, notice how speed is a common factor in many of them:
  • Long load time - Load time is normally the initial time it takes an application to start. This should generally be kept to a minimum. While some applications cannot be made to load in under a minute, load time should be kept to a few seconds where possible.
  • Poor response time - Response time is the time from when a user inputs data into the application until the application outputs a response to that input. Generally this should be very quick. Again, if users have to wait too long, they lose interest.
  • Poor scalability - A software product suffers from poor scalability when it cannot handle the expected number of users or when it does not accommodate a wide enough range of users. Load testing should be done to be certain the application can handle the anticipated number of users.
  • Bottlenecking - Bottlenecks are obstructions in a system that degrade overall system performance. Bottlenecking occurs when coding errors or hardware issues cause throughput to drop under certain loads. It is often caused by one faulty section of code, so the key to fixing it is to find the section of code causing the slowdown and fix it there. Bottlenecks are generally resolved by either fixing poorly running processes or adding hardware. Some common performance bottlenecks are:
    • CPU utilization
    • Memory utilization
    • Network utilization
    • Operating System limitations
    • Disk usage

Types of performance testing

  • Load testing - checks the application's ability to perform under anticipated user loads. The objective is to identify performance bottlenecks before the software application goes live.
  • Stress testing - involves testing an application under extreme workloads to see how it handles high traffic or data processing. The objective is to identify the breaking point of an application.
  • Endurance testing - is done to make sure the software can handle the expected load over a long period of time.
  • Spike testing - tests the software's reaction to sudden large spikes in the load generated by users.
  • Volume testing - a large amount of data is populated in the database and the overall software system's behavior is monitored. The objective is to check the software application's performance under varying database volumes.
  • Scalability testing - The objective of scalability testing is to determine the software application's effectiveness in "scaling up" to support an increase in user load. It helps plan capacity addition to your software system.
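
The test types above differ mainly in the shape of the load applied over time. The following sketch expresses a few of them as a target number of virtual users per minute; the multipliers, step sizes and spike window are illustrative assumptions, not standard values.

```python
def load_profile(test_type, minute, expected_users=100):
    """Target number of virtual users at a given minute for each test type.

    The shapes and multipliers are illustrative assumptions, not standards.
    """
    if test_type == "load":
        return expected_users                       # steady anticipated load
    if test_type == "stress":
        return expected_users * (1 + minute // 10)  # step up every 10 minutes
    if test_type == "endurance":
        return expected_users                       # same load, run for hours
    if test_type == "spike":
        # Ten times the normal load for a five-minute window.
        return expected_users * 10 if 30 <= minute < 35 else expected_users
    raise ValueError(f"unknown test type: {test_type}")


for kind in ("load", "stress", "endurance", "spike"):
    print(kind, [load_profile(kind, m) for m in (0, 15, 32, 45)])
```

Load and endurance tests share the same shape here; what separates them is how long the test is left running.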

Overview of Performance Testing Concepts

Performance Testing: There are many definitions available, but the one given in the IEEE Glossary is as follows:

“Testing conducted to evaluate the compliance of a system or component with specified performance requirements. Often this is performed using an automated test tool to simulate a large number of users. Also known as ‘Load Testing’.”

Or

“The testing performed to determine the degree to which a system or component accomplishes its designated functions within given constraints regarding processing time and throughput rate.”

The purpose of the test is to measure characteristics such as response times, throughput or the mean time between failures (for reliability testing).

Performance testing tool:
A tool to support performance testing and that usually has two main facilities: load generation and test transaction measurement. Load generation can simulate either multiple users or high volumes of input data. During execution, response time measurements are taken from selected transactions and these are logged. Performance testing tools normally provide reports based on test logs and graphs of load against response times.

Features or characteristics of performance-testing tools include support for:
• generating a load on the system to be tested;
• measuring the timing of specific transactions as the load on the system varies;
• measuring average response times;
• producing graphs or charts of responses over time.

Load test:
A test type concerned with measuring the behavior of a component or system under increasing load (e.g. number of parallel users and/or number of transactions) to determine what load the component or system can handle.

While doing Performance testing we measure some of the following:

Characteristic (SLA)                   Measurement (units)
Response Time                          Seconds
Hits per Second                        Hits/sec
Throughput                             Bytes/sec
Transactions per Second (TPS)          Transactions of a specific business process/sec
Total TPS (TTPS)                       Total transactions/sec
Connections per Second (CPS)           Connections/sec
Pages Downloaded per Second (PDPS)     Pages/sec

Definitions and importance of the above:

Response Time

What is Transaction Response Time?

Transaction Response Time represents the time taken for the application to complete a defined transaction or business process.

Why is it important to measure Transaction Response Time?

The objective of a performance test is to ensure that the application is working perfectly under load. However, the definition of “perfectly” under load may vary between systems.
By defining an initial acceptable response time, we can benchmark the application to check whether it is performing as anticipated.

Transaction Response Time is important because it gives the project or application team a time-based view of how the application is performing. With this information, they can tell users and customers how long a request is expected to take, or explain how the application performed.


What does Transaction Response Time encompass?

Transaction Response Time encompasses the time taken for the request to reach the Web Server, be processed there and be sent on to the Application Server, which in most instances makes a request to the Database Server. The same path is then traversed in reverse, from the Database Server through the Application Server and Web Server back to the user. Note that the time the request and data spend in network transmission is also included.

To simplify, Transaction Response Time comprises the following:
1. Processing time on the Web Server
2. Processing time on the Application Server
3. Processing time on the Database Server
4. Network latency between the servers and the client

The following diagram illustrates Transaction Response Time.

 
Transaction Response Time = (t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9) × 2
Note: The factor of 2 accounts for the time taken for the data to return to the client.
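
As a small worked example, the formula above can be applied to a set of segment timings. The nine values below are hypothetical, and doubling the one-way sum assumes the return path takes as long as the request path, which a real measurement would not rely on.

```python
def transaction_response_time(segment_times):
    """Apply the formula above: sum the one-way segments, then double."""
    return sum(segment_times) * 2


# Hypothetical one-way timings t1..t9 in seconds.
segments = [0.010, 0.020, 0.005, 0.030, 0.005, 0.050, 0.005, 0.015, 0.010]
print(f"{transaction_response_time(segments):.3f} s")
```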


How do we measure?

Measurement of Transaction Response Time begins when the defined transaction makes its request to the application and stops when the transaction completes, before the next request (in terms of transactions) is issued.
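
Load testing tools do this timing for you, but the idea can be sketched in a few lines. In the example below, `timed_transaction` is a hypothetical helper and the `time.sleep` call merely stands in for the real request and response.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_transaction(name, results):
    """Record wall-clock seconds for the enclosed transaction under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stop the clock when the transaction completes, even if it raised.
        results.setdefault(name, []).append(time.perf_counter() - start)


results = {}
with timed_transaction("login", results):
    time.sleep(0.05)  # stand-in for the real request/response exchange

print(results)
```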

Difference from Hits Per Second

Hits Per Second measures the number of “hits” made to a web server. These “hits” could be requests made to the web server for data or graphics. However, this counter does not tell users how well their application is performing, as it only measures the number of times the web server is accessed.

How can we use Transaction Response Time to analyze performance issues?

Transaction Response Time allows us to identify abnormalities when performance issues surface. These appear as slow transaction responses that differ significantly (or slightly) from the average Transaction Response Time.
From there, we can drill down by correlating with other measurements, such as the number of virtual users accessing the application at that point in time and system-related metrics (e.g. CPU Utilization), to identify the root cause.
Bringing together all the data collected during the load test, we can correlate the measurements to find trends and bottlenecks between the response time, the amount of load generated and the payload of all the components of the application.
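
A first pass at this kind of correlation can be done with a plain Pearson coefficient. The sketch below uses made-up samples taken at the same timestamps; a strong correlation only hints at where to look and does not by itself prove the root cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps during one test run.
virtual_users = [10, 20, 40, 80, 160]
response_time_s = [0.21, 0.22, 0.25, 0.40, 0.95]
cpu_percent = [12, 20, 38, 71, 97]

print("users vs response time:", round(pearson(virtual_users, response_time_s), 3))
print("CPU vs response time:  ", round(pearson(cpu_percent, response_time_s), 3))
```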

How is it beneficial to the Project Team?

Using Transaction Response Time, the project team can better relate to its users, with transactions serving as a common language that users can comprehend. Users will be able to know that transactions (or business processes) are performing at an acceptable level in terms of time.
Users may be unable to understand the meaning of CPU utilization or memory usage, so the common language of time is ideal for conveying performance-related issues.


Relation between Load, Response Time and Performance:

1. Load is directly related to Response Time: as load rises, response time rises (though rarely in strict proportion).
2. Performance is inversely related to Response Time.

So as the Load increases, the Response Time increases, and as the Response Time increases, the Performance decreases.

Hits Per Second

A Hit is a request of any kind made from the virtual client to the application being tested (Client to Server). It is measured by number of Hits. The higher the Hits Per Second, the more requests the application is handling per second.

A virtual client can request an HTML page, image, file, etc. Testing the application for Hits Per Second will tell you if there is a possible scalability issue with the application. For example, if the stress on an application increases but the Hits Per Second does not, there may be a scalability problem in the application.

One issue with this metric is that Hits Per Second treats all requests equally. Thus a request for a small image and a request for complex HTML generated on the fly are both counted as hits. It is possible that out of a hundred hits on the application, the application server actually answered only one and all the rest were served from a cache on the web server or another caching mechanism.

So when looking at this metric, it is very important to consider what the application is intended to do and how. Will your users be looking for the same piece of information over and over again (a static benefit form), or will the same number of users be engaging the application in a variety of tasks, such as pulling up images, purchasing items or bringing in data from another site? To create the proper test, it is important to understand this metric in the context of the application. If you’re testing an application function that requires the site to ‘work,’ as opposed to presenting static data, use the Pages Per Second measurement.

Pages Per Second

Pages Per Second measures the number of pages requested from the application per second. The higher the Pages Per Second, the more work the application is doing per second. Measuring an explicit request in the script or a frame in a frameset provides a metric on how the application responds to actual work requests. Thus if a script contains a Navigate command to a URL, this request is considered a page. If the HTML that returns includes frames, they will also be considered pages, but any other elements retrieved, such as images or JS files, will be considered hits, not pages. This measurement is key to the end-user’s experience of application performance.

Correlation: If the stress increases but the Pages Per Second count doesn’t, there may be a scalability issue. For example, if you begin with 75 virtual users requesting 25 different pages concurrently and then scale the users to 150, the Pages Per Second count should increase. If it doesn’t, some of the virtual users aren’t getting their pages. This could be caused by a number of issues, and one likely suspect is throughput.
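
The check described above can be written down as a simple comparison of growth rates. In this sketch the function name, the sample figures and the 0.8 tolerance are all assumptions chosen for illustration.

```python
def scales_with_load(users_before, pps_before, users_after, pps_after, tolerance=0.8):
    """True if Pages Per Second kept up with the growth in virtual users.

    `tolerance` is an assumed threshold: 0.8 means pages/sec must reach at
    least 80% of the growth that perfectly linear scaling would give.
    """
    user_growth = users_after / users_before
    pps_growth = pps_after / pps_before
    return pps_growth >= tolerance * user_growth


# Users doubled from 75 to 150 in both runs.
print(scales_with_load(75, 25.0, 150, 48.0))  # pages/sec nearly doubled
print(scales_with_load(75, 25.0, 150, 27.0))  # pages/sec barely moved
```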

Throughput

“The amount of data transferred across the network is called throughput. It considers the amount of data transferred from the server to client only and is measured in Bytes/sec.”

This is an important baseline metric and is often used to check that the application and its server connection are working. Throughput measures the average number of bytes per second transmitted from the application being tested to the virtual clients running the test agenda during a specific reporting interval. This metric is the response data size (sum) divided by the number of seconds in the reporting interval.
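
The calculation described above is a single division. The response sizes and the 20-second interval below are hypothetical values.

```python
def throughput_bytes_per_sec(response_sizes, interval_seconds):
    """Sum of response sizes in the interval divided by the interval length."""
    return sum(response_sizes) / interval_seconds


# Hypothetical response sizes (bytes) received during a 20-second interval.
sizes = [15_000, 2_048, 480_000, 1_024, 15_000]
print(throughput_bytes_per_sec(sizes, 20), "bytes/sec")
```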

Generally, the more stress on an application, the more Throughput. If the stress increases, but the Throughput does not, there may be a scalability issue or an application issue.

Another note about Throughput as a measurement – it generally doesn’t provide any information about the content of the data being retrieved. Thus it can be misleading especially in regression testing. When building regression tests, leave time in the testing plan for comparing returned data quality.


Round Trips

Another useful scalability and performance metric is Round Trips. Round Trips tells you the total number of times the test agenda was executed versus the total number of times the virtual clients attempted to execute the agenda. The more times the agenda is executed, the more work is done by the test and the application.
The test scenario the agenda represents influences the Round Trips measurement. This metric can provide all kinds of useful information, from the benchmarking of an application to the end-user availability of a more complex application. It is not recommended for regression testing because each test agenda may have a different scenario and/or length of scenario.

Hit Time

Hit Time is the average time in seconds it took to successfully retrieve an element of any kind (image, HTML, etc). The time of a hit is the sum of the Connect Time, Send Time, Response Time and Process Time. It represents the responsiveness or performance of the application to the end user. The more stressed the application, the longer it should take to retrieve an average element. But, like Hits Per Second, caching technologies can influence this metric. Getting the most from this metric requires knowledge of how the application will respond to the end user.
This is also an excellent metric for application monitoring after deployment.
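
The breakdown of a hit into its four components can be sketched as a small data structure. The class name, field names and timings here are hypothetical; real tools expose these components under their own names.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    connect_s: float
    send_s: float
    response_s: float
    process_s: float

    @property
    def hit_time_s(self):
        """Hit time as the sum of its four components."""
        return self.connect_s + self.send_s + self.response_s + self.process_s


def average_hit_time(hits):
    """Mean hit time over successfully retrieved elements."""
    return sum(h.hit_time_s for h in hits) / len(hits)


hits = [Hit(0.02, 0.01, 0.10, 0.03), Hit(0.02, 0.01, 0.30, 0.05)]
print(round(average_hit_time(hits), 3), "s")
```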

Time to First Byte

This measurement is important because end users often consider a site malfunctioning if it does not respond fast enough. Time to First Byte measures the number of seconds it takes a request to return its first byte of data to the test software’s Load Generator.
For example, Time to First Byte represents the time from when the user pushes the “enter” button in the browser until the user starts receiving results. Generally, more concurrent user connections will slow the response time of a request. But there are also other possible causes for a slowed response.
For example, there could be issues with the hardware, system software or memory, as well as problems with database structures or slow-responding components within the application.

Page Time

Page Time calculates the average time in seconds it takes to successfully retrieve a page with all of its content. This statistic is similar to Hit Time but relates only to pages. In most cases this is a better statistic to work with because it deals with the true dynamics of the application. Since not all hits can be cached, this data is more helpful in terms of tracking a user’s experience (positive or frustrated). It’s important to note that in many test software application tools you can turn caching on or off depending on your application needs.

Generally, the more stress on the site the slower its response. But since stress is a combination of the number of concurrent users and their activity, greater stress may or may not impact the user experience. It all depends upon the application’s functions and users. A site with 150 concurrent users looking up benefit information will differ from a news site during a national emergency. As always, metrics must be examined within context.

Failed Rounds/Failed Rounds Per Second

During a load test it’s important to know that the application requests perform as expected. The Failed Rounds and Failed Rounds Per Second metrics count the number of rounds that fail.

This is an “indicator metric” that provides QA and test teams with clues to the application's performance and failure status. If you start to see Failed Rounds or Failed Rounds Per Second, you would typically look into the logs to see what types of failures correspond to this metric report. Also, with some software test packages, you can define what counts as a failed round in an application.

Sometimes, basic image or page missing errors (HTTP 404 error codes) could be set to fail a round, which would stop the execution of the test agenda at that point and start at the top of the agenda again, thus not completing that particular round.

Failed Hits/Failed Hits Per Second

This metric offers insight into the application’s integrity during the load test. An example of a request that might fail during execution is a broken link or a missing image from the server. The number of such errors should grow only in proportion to the load, so the percentage of errors stays constant; if there are no errors at low load, there should be none at high load. If the percentage of errors increases only during high loads, the application may have a scalability issue.
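
Comparing the error percentage at different load levels makes this check explicit. The sketch below uses hypothetical sample data and an assumed 1% margin for noise.

```python
def error_rate(failed_hits, total_hits):
    """Fraction of hits that failed."""
    return failed_hits / total_hits if total_hits else 0.0


def error_rate_rises_with_load(samples, margin=0.01):
    """True if the error rate at the highest load exceeds the lowest-load rate.

    `samples` is a list of (virtual_users, failed_hits, total_hits) tuples;
    `margin` is an assumed tolerance for noise.
    """
    ordered = sorted(samples)  # sort by number of virtual users
    low = error_rate(ordered[0][1], ordered[0][2])
    high = error_rate(ordered[-1][1], ordered[-1][2])
    return high - low > margin


steady = [(50, 5, 1000), (200, 20, 4000)]      # 0.5% at both load levels
degrading = [(50, 5, 1000), (200, 400, 4000)]  # 0.5% rising to 10%
print(error_rate_rises_with_load(steady), error_rate_rises_with_load(degrading))
```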

Failed Connections

This metric is simply the number of connections that were refused by the application during the test, and it usually leads to further tests. A failed connection could mean the server was too busy to handle all the requests, so it started refusing them. It could be a memory issue. It could also mean that the user sent bogus or malformed data to which the server couldn’t respond, so it refused the connection.

Introduction to Performance Testing


Why Performance testing?
Performance testing has proved itself crucial to the success of a business. Not only does a poorly performing site face financial losses, it can also lead to legal repercussions at times.
No one wants to put up with a slow, unreliable site when purchasing, taking an online test, paying a bill, etc. With the internet so widely available, the alternatives are immense. It is easier to lose clientele than to gain them, and performance is a key game changer.
Therefore, performance testing is no longer a nominal checkpoint before going live. It is a comprehensive and detailed stage that determines whether the performance of a site or an application meets the needs.
Introduction
The purpose of this test is to understand the performance of an application under load, particularly user load.

Types of Performance Testing

Load Testing
Load testing is a type of performance test in which the application is tested for its performance under normal and peak usage. The application is checked for its response to user requests and its ability to respond consistently within accepted tolerances under different user loads.
The key considerations are:
  1. What is the max load the application is able to hold before it starts behaving unexpectedly?
  2. How much data is the database able to handle before system slowness or a crash is observed?
  3. Are there any network-related issues to be addressed?
Stress Testing
Stress testing looks for ways to break the system. The test also gives an idea of the maximum load the system can hold.
Generally, stress testing takes an incremental approach in which the load is increased gradually. The test starts with a load for which the application has already been tested. More load is then added slowly to stress the system, and the point at which servers stop responding to requests is considered the break point.
During this test all the functionality of the application is tested under heavy load, and on the back end this functionality might be running complex queries, handling data, etc.
The following questions are to be addressed:
  • What is the max load a system can sustain before it breaks down?
  • How does the system break down?
  • Is the system able to recover once it has crashed?
  • In how many ways can the system break, and which are the weak nodes when handling unexpected load?
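The incremental approach to finding the break point can be sketched as a simple loop. Here `probe` is a hypothetical caller-supplied function that applies a given load and reports the observed error rate, `fake_probe` is a stand-in for a real test run, and the 5% failure threshold is an assumption.

```python
def find_break_point(probe, start_users, step, max_users, max_error_rate=0.05):
    """Increase load step by step; return the first load level that breaks.

    `probe(users)` applies the load and returns the observed error rate
    (0.0 to 1.0). Returns None if the system survives up to `max_users`.
    """
    users = start_users
    while users <= max_users:
        if probe(users) > max_error_rate:
            return users
        users += step
    return None


def fake_probe(users):
    """Stand-in for a real test run: errors appear above 700 users."""
    return 0.0 if users <= 700 else 0.25


print(find_break_point(fake_probe, start_users=100, step=100, max_users=2000))
```

A smaller step gives a more precise break point at the cost of a longer test.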
Volume Testing
A volume test verifies that the performance of the application is not affected by the volume of data it handles. To execute a volume test, a huge volume of data is generally entered into the database. The test can be incremental or steady; in an incremental test the volume of data is increased gradually.
Generally the database grows with application usage, so it is necessary to test the application against a heavily populated database. A good example is the website of a new school or college, which has little data to store initially but much more after 5-10 years.
The most common recommendation from this test is tuning the DB queries that access the database. In some cases the response time of DB queries is high for a big database, so they need to be rewritten or supported by indexes, joins, etc.
Capacity Testing
Is the application capable of meeting business volume under both normal and peak load conditions?
Capacity testing is generally done with future growth in mind. It addresses the following:
  1. Will the application be able to support the future load?
  2. Is the environment capable of withstanding the upcoming increased load?
  3. What additional resources are required to make the environment capable enough?
Capacity testing is used to determine how many users and/or transactions a given web application will support while still meeting its performance goals. During this testing, resources such as processor capacity, network bandwidth, memory usage, disk capacity, etc. are considered and altered to meet the goal.
Online Banking is a perfect example of where capacity testing could play a major part.
Reliability/Recovery Testing
Reliability testing, or recovery testing, verifies whether the application is able to return to its normal state after a failure or abnormal behavior, and how long it takes to do so (in other words, a time estimate).
Consider an online trading site that experiences a failure in which users are unable to buy or sell shares at a certain point of the day (peak hours) but are able to do so after an hour or two. In this case, we can say the application is reliable, or has recovered from the abnormal behavior.
In addition to the above sub-forms of performance testing, there are some more fundamental ones that are prominent:
Smoke Test:
  • How is the new version of the application performing when compared to previous ones?
  • Is any performance degradation observed in any area in the new version?
  • What should be the next area where developers should focus to address performance issues in the new version of application?
Component Test:
  • Is the component responsible for the performance issue?
  • Is the component doing what is expected, and has component optimization been done?
Endurance Test:
  • Will the application be able to perform well enough over a period of time?
  • Are there any potential reasons that could slow the system down?
  • Are there third-party tool and/or vendor integrations whose interaction could make the application slower?
How does Functional Testing differ from Performance Testing?
[Figure: Functional vs Performance Testing]

Identification of components for testing

In an ideal scenario, all components should be performance tested. However, due to time & other business constraints that may not be possible. Hence, the identification of components for testing happens to be one of the most important tasks in load testing.
The following components must be included in performance testing:
#1. Functional, business critical features
Components that have a Customer Service Level Agreement or those having complex business logic (and are critical for the business’s success) should be included.
Example:  Checkout and Payment for an E-commerce site like eBay.
#2. Components that process high volumes of data
Components that process high volumes of data, especially background jobs, should definitely be included.
Example: Upload and download feature on a file sharing website.
#3. Components which are commonly used
A component that is frequently used by end-users, jobs scheduled multiple times in a day, etc.
Example: Login and Logout.
#4. Components interfacing with one or more application systems
In a system involving multiple applications that interact with one another, all the interface components must be deemed critical for performance testing.
Example: E-commerce sites interface with online banking sites for payments, which are external third-party applications. This should definitely be part of performance testing.

Tools for performance testing

Sure, you could set up a million computers with a million different credentials, have all of them log in at once and monitor the performance. Clearly that is not practical, and even if we did that, we would still need some sort of monitoring infrastructure.
The best way to handle this situation is through virtual users (VUs). In all our tests the VUs behave just the way a real user would.
Performance testing tools are employed to create as many VUs as you require and to simulate real-time conditions. Beyond that, performance testing also covers peak load usage, breakdown point, long-term usage, etc.
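To give a feel for what a tool does under the hood, here is a minimal sketch of virtual users built on threads. It is not how any particular tool is implemented: `business_transaction` is a hypothetical placeholder for the scripted user actions, and the sleep stands in for a real request to the system under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def business_transaction(user_id):
    """Placeholder for the scripted user actions (login, search, checkout)."""
    time.sleep(0.01)  # stands in for a real request to the system under test


def run_virtual_user(user_id, iterations):
    """One virtual user: repeat the transaction and record each response time."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        business_transaction(user_id)
        timings.append(time.perf_counter() - start)
    return timings


def run_load(virtual_users, iterations):
    """Run all virtual users concurrently and return every response time."""
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        per_user = pool.map(run_virtual_user, range(virtual_users),
                            [iterations] * virtual_users)
        return [t for timings in per_user for t in timings]


timings = run_load(virtual_users=20, iterations=5)
print(len(timings), "transactions, slowest", round(max(timings), 3), "s")
```

Real tools add ramp-up schedules, think time, protocol support and reporting on top of this basic loop.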
Tools are used to achieve all this quickly, with limited resources, and with reliable results. There is a variety of tools available in the market: licensed, freeware and open source.
A few such tools are:
  • HP LoadRunner,
  • JMeter,
  • Silk Performer,
  • NeoLoad,
  • Web Load,
  • Rational Performance Tester (RPT),
  • VSTS,
  • Loadstorm,
  • Web Performance,
  • LoadUI,
  • Loadster,
  • Load Impact,
  • OpenSTA,
  • QEngine,
  • Cloud Test,
  • Httperf,
  • App Loader,
  • Qtest,
  • RTI,
  • Apica LoadTest,
  • Forecast,
  • WAPT,
  • Monitis,
  • Keynote Test Perspective,
  • Agile Load, etc.
The tool selection depends on budget, technology used, purpose of testing, nature of the applications, performance goals being validated, infrastructure, etc.
HP LoadRunner captures the majority of the market due to:
  1. Versatility – can be used on Windows as well as web-based applications. It also works with many kinds of technologies.
  2. Test Results – It provides in-depth insights that can be used for tuning the application.
  3. Easy Integrations – works with diagnostic tools like HP SiteScope and HP Diagnostics.
  4. Analysis utility provides a variety of features which help in deep analysis.
  5. Robust Reports – LoadRunner has a good reporting engine and provides a variety of reporting formats.
  6. Comes with an Enterprise package too.
The only flip side is its license cost. It is a little on the expensive side, which is why other open source or affordably licensed tools, specific to a technology or protocol and with limited analysis and reporting capabilities, have emerged in the market.
Still, HP LoadRunner is a clear winner.

Future in Performance Testing Career

Performance testing is easy to learn but needs lots of dedication to master. It’s like mathematics, where you have to build up your concepts. Once the concepts are clear, they can be applied to most tools: even though the scripting language, the logic and the look and feel differ from tool to tool, the approach to performance testing is almost always the same.
I would highly recommend this hot and booming field, and enhancing your skills by learning it. Mastering PT could be just what you are looking for to move ahead in your software testing career.

Conclusion

In this article we have covered most of the information required to build a base for understanding performance testing. In the next article we will apply these concepts and look at the key activities of performance testing.
LoadRunner is going to be our vehicle on the journey, but the destination we want to reach is to understand everything about performance testing.