Rules to Better Application Performance


  1. Do you know the best time to tackle performance testing?

    Performance, load, and stress testing should be tackled after you have confirmed that everything works functionally (usually after UX testing). Performance testing should only begin once daily errors are down to zero (as reported by Application Insights or Raygun). This way you can be sure that any issues that occur during performance tests are scaling issues, not pre-existing functional bugs.



  2. Do you know where your goal posts are?

    When starting on the path of improving application performance, it is always important to know when you can stop. The goal posts depend on the type of application being written, the number of active users, and the budget. Some examples of performance goals are:

    Example 1: High performance website

    • ​Every page refresh is under 500ms
    • Able to handle 1000 active concurrent users
    • ​Getting at least a score of 80 for Performance on Google Lighthouse - with throttling turned on

    Example 2: Office Intranet Application

    • Every page refresh is under 2 seconds
    • Able to handle 50 active concurrent users
    • Getting at least a score of 95 for Performance on Google Lighthouse - with throttling turned off

    With the goal posts firmly in sight, the developers can begin performance tuning the application.​

  3. Do you avoid reviewing performance without metrics?

    ​If a client says:

    "This application is too slow, I don't really want to put up with such poor performance. Please fix."

    We don't jump in and look at the code and clean it up and reply with something like:

    "I've looked at the code and cleaned it up - not sure if this is suitable - please tell me if you are OK with the performance now."

    A better way is:

    • Ask the client to tell us how slow it is (in seconds) and how fast they ideally would like it (in seconds)
    • Add some code to record the time the function takes to run
    • Reproduce the steps and record the time
    • Change the code
    • Reproduce the steps and record the time again
    • Reply to the customer:
      "It was 22 seconds, you asked for around 10 seconds. It is now 8 seconds."
    Figure: Good example – Add some code to check the timing, before fixing any performance issues (An example from SSW Code Auditor)
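
    For the "add some code to record the time" step, here is a minimal sketch using System.Diagnostics.Stopwatch (SearchProjects is a hypothetical stand-in for the slow function being measured):

    using System;
    using System.Diagnostics;

    public static class PerformanceTimer
    {
      public static void Main()
      {
        var stopwatch = Stopwatch.StartNew(); // Start timing the suspect operation

        SearchProjects(); // Hypothetical slow function under investigation

        stopwatch.Stop();

        // Record the number so it can be quoted back to the client (e.g. "It was 22 seconds...")
        Console.WriteLine($"SearchProjects took {stopwatch.Elapsed.TotalSeconds:F1} seconds");
      }

      private static void SearchProjects()
      {
        // ... the code being measured goes here ...
      }
    }

    Figure: A minimal Stopwatch-based timing sketch - the numbers it produces are what make the before/after conversation with the client possible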

    Also, remember to make incremental changes in your tests!

    For example, if you are trying to measure the optimal number of processors for a server, do not go from 1 processor to 4 processors at once:

    1to4.png
    Figure: Bad Example - Going from 1 to 4 all at once gives you incomplete measurements and data

    Do it incrementally, adding 1 processor each time, measuring the results, and then adding more:

    1234.png
    Figure: Good Example - Going from 1 to 2, then measuring, then incrementally adding one more, measuring...

    This gives you the most complete set of data to work from.

    This is because performance is an emotional thing - sometimes it just *feels* slower. Without numbers, a person cannot really know for sure whether something has become quicker. By making the changes incrementally, you can also be assured that there aren't bad changes cancelling out the effect of good changes.

    Samples​

    For sample code on how to measure performance in a Windows Forms application, please refer to the rule 'Do you have tests for Performance?' on Rules to Better Unit Tests.


  4. Do you know the best way to get metrics out of your browser?

    Lighthouse is an open-source tool built into Google Chrome that can audit for performance, accessibility, progressive web apps, and more, allowing you to improve the quality of your web pages.

    You can run Lighthouse:

    • In Chrome DevTools
    • From the command line​
    • As a Node module

    It runs a series of audits against a URL and then generates a report on how well the page did. From there, you can use the failing audits as indicators of how to improve the page. Each audit has a reference doc explaining why the audit is important, as well as how to fix it.

    lighthouse-100.png
    Figure: Good Example - Google Chrome Lighthouse is showing 100%

    Lighthouse Level 1: Throttling Off

    For applications intended for use on a desktop and from within a well-connected office (such as your intranet or office timesheet application), test with throttling turned off.

    Lighthouse Level 2: Throttling On

    To see how well your website would perform on low-spec devices and with poor internet bandwidth, use the throttling features. This is most important for high volume, customer-facing apps. 


    lighthouse_throttling.png

    Figure: Good Example - Lighthouse can simulate slow networking and CPU when performing tests


    Lighthouse Level 3: Automated testing

    For business-critical pages, you may want to automate Lighthouse testing as part of your Continuous Delivery pipeline. This blog post by Andrejs Abrickis shows how to configure an Azure DevOps build pipeline that performs Lighthouse testing.
    https://medium.com/@andrejsabrickis/continuously-audit-web-apps-performance-using-google-s-lighthouse-and-azure-devops-3e1623372f79​



  5. Do you know where bottlenecks can happen?

    For modern applications, there are many layers and moving parts that need to work together seamlessly to deliver the application to the end user.

    bottleneck.png
    Figure: Bottlenecks can happen anywhere! Call out diagrammatically where you think the bottlenecks are happening

    The issues can be in:

    SQL Server

    • Slow queries 
    • Timeouts
    • Bad configuration 
    • Bad query plans 
    • Lack of resources 
    • Locking

    Business Logic

    • Inefficient code 
    • Chatty code 
    • Long running processes 
    • Not making use of multicore processors

    Front end

    • Too many requests to serve a page 
    • Page size
    • Large images
    • No Caching

    Connection between SQL and Web

    • Lack of bandwidth
    • Too much chatter

    Connection between Web and Internet

    • Poor uplink (e.g. 1 Mbps uploads)
    • Too many hops

    Connection between Web and End users

    • Geographically too far (e.g. US servers, AU users)

    Infrastructure

    • Misconfiguration
    • ​Resource contention
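
    To make the "chatty code" and "too much chatter" bottlenecks concrete, here is a minimal sketch of the classic N+1 query pattern, assuming a hypothetical Entity Framework Core model (AppDbContext, Order, and Customer are stand-ins):

    using System.Collections.Generic;
    using System.Linq;

    public class ReportService
    {
      private readonly AppDbContext context; // Hypothetical EF Core DbContext

      public ReportService(AppDbContext context) => this.context = context;

      // Bad - chatty: one query for the orders, then an extra round trip to SQL Server per order
      public List<string> GetCustomerNamesChatty()
      {
        var names = new List<string>();
        foreach (var order in context.Orders.ToList())
        {
          names.Add(context.Customers.Single(c => c.Id == order.CustomerId).Name);
        }
        return names;
      }

      // Better - chunky: a single query that returns everything needed in one round trip
      public List<string> GetCustomerNamesChunky() =>
        context.Orders.Select(o => o.Customer.Name).ToList();
    }

    Figure: A chatty data-access sketch versus a single chunky query - the chatty version multiplies the effect of any latency between the web and SQL tiers
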
  6. Do you know how to find performance problems with Application Insights?

    Once you have set up your Application Insights as per the rule 'Do you know how to set up Application Insights' and you have your daily failed requests down to zero, you can start looking for performance problems. You will discover that uncovering performance-related problems is relatively straightforward.

    The main focus of the first blade is the 'Overview timeline' chart, which gives you a bird's-eye view of the health of your application.

    performance-1.jpg
    Figure: There are 3 spikes to investigate (one on each graph), but which is the most important? Hint: look at the scales!

    Developers can see the following insights:

    • Number of requests to the server and how many have failed (First blue graph)
    • The breakdown of your page load times (Green Graph)
    • How the application is scaling under different load types over a given period
    • When your key usage peaks occur

    Always investigate the spikes first. Notice how the two blue ones line up? That should be investigated; however, notice that the green peak is actually at 4 hours - that is definitely the first thing we'll look at.

    performance 2.png
    Figure: The 'Average of Browser page load time by URL base' graph will highlight the slowest page.

    We can see in the 'Average of Browser page load time by URL base' graph that a single request took four hours, so it is important to examine this request.

    It would be nice to see the prior week for comparison, however, we're unable to in this section.

    performance-3.png
    Figure: In this case, the user agent string gives away the cause, Baidu (a Chinese search engine) got stuck and failed to index the page.

    At this point, we'll create a PBI to investigate the problem and fix it.

    (Suggestion to Microsoft, please allow annotating the graph to say we've investigated the spike)

    The other spike which requires investigation is in the server response times. To investigate it, click on the blue spike. This will open the Server response blade that allows you to compare the current server performance metrics to the previous weeks. 

    performance-4.jpg
    Figure: In this case, the most important detail to action is the Get Healthcheck issue. Now you should be able to optimise the slowest pages​

    In this view, we find performance related issues when the usage graph shows similarities to the previous week but the response times are higher. When this occurs, click and drag on the timeline to select the spike and then click the magnifying glass to ‘zoom in’. This will reload the ‘Average of Server response time by Operation name’ graph with only data for the selected period.

    Looking beyond the Average Response Times

    High average response times are easy to find and indicate an endpoint that is usually slow - so this is a good metric to start with. But sometimes a low average value can contain many successful fast requests hiding a few much slower requests.

    Application Insights plots the distribution of response time values, allowing potential issues to be spotted.

    distribution.png

    Figure: This distribution graph shows that behind an average value of 54.9ms, 99% of requests were under 23ms but a few requests took up to 32 seconds!

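    If a slow operation is not captured automatically, one option is to record your own dependency timings so they show up alongside these Application Insights graphs. Below is a minimal sketch, assuming the Microsoft.ApplicationInsights package is referenced and a configured TelemetryClient is injected ("SQL" and "GetTimesheets" are placeholder names):

    using System;
    using System.Diagnostics;
    using Microsoft.ApplicationInsights;

    public class TimesheetService
    {
      private readonly TelemetryClient telemetryClient;

      public TimesheetService(TelemetryClient telemetryClient) => this.telemetryClient = telemetryClient;

      public void LoadTimesheets()
      {
        var startTime = DateTimeOffset.UtcNow;
        var stopwatch = Stopwatch.StartNew();
        var success = true;
        try
        {
          // ... call the slow dependency here (e.g. a stored procedure) ...
        }
        catch
        {
          success = false;
          throw;
        }
        finally
        {
          stopwatch.Stop();
          // This appears as a dependency in Application Insights, with its own duration
          telemetryClient.TrackDependency("SQL", "GetTimesheets", "exec GetTimesheets", startTime, stopwatch.Elapsed, success);
        }
      }
    }

    Figure: Manually tracked dependencies appear in Application Insights with their own durations, making slow operations easier to spot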

  7. Do you know how to investigate performance problems in a .NET app?

    Working out why the performance of an application has suddenly degraded can be hard. This rule covers some investigation steps that can help determine the cause of performance problems.

    1. Use Application Insights to determine when the application last had acceptable performance​

    Follow the 'Do you know how to find performance problems with Application Insights?' rule to determine when the decrease in performance began. It's important to determine whether the performance degradation occurred gradually or whether there was a dramatic drop-off in performance.

    2. Look for changes that coincide with the performance issue​​

    There are three general cases that can cause performance issues:

    1. A change to software or hardware. Your deployment tool (such as Octopus) can tell you if there has been a software deployment, and you can work with your network admin to determine if there have been infrastructure changes.
    2. The load factor on the application can change. Application Insights can help you determine if the load factor on the application has increased.
    3. A hardware or network issue can occur that interferes with normal operation. The Windows Event Log and other sys admin monitoring tools can alert you to infrastructure issues like this.

    3. Dealing with Code-Related Issues

    If a software release has caused the performance problems, it is important to work out the code delta between the software release that worked well and the new release with the performance issues.  Your software repository should have the necessary metadata to allow you to trace code deltas between release numbers.  Inspect all the changes that have occurred for obvious performance issues like bad EF code, unnecessary loops and chatty network calls.  See Do you know where bottlenecks can happen?​ for more information on performance issues that can be introduced with code changes.

    4. Dealing with Database-Related Issues

    Application Insights can help determine which tier of an application is performing poorly, and if it is determined that the performance issue is occurring in the database, SQL Server makes finding these performance issues much easier.

    Tip: Azure SQL can provide performance recommendations based on your application usage and can even automatically apply them for you.

    Query Store is like having a light-weight version of SQL Profiler running all the time, and is enabled at a database level using the Database Properties dialog:

    Figure: Read Write indicates that the Query Store is set up to help us a few days later

    Once Query Store has been enabled for a particular database, it needs to run for a number of days to collect performance data.  It is generally a good idea to enable Query Store for important production databases before performance problems occur.  Detailed information on regressed queries, overall resource consumption, the worst performing queries, and detailed information such as query plans for a specific SQL statement can then be retrieved using SQL Server Management Studio (SSMS).

    Figure: A couple of days later… Query Store can now be queried to determine which queries are now performing poorly

    Once Query Store has been collecting performance information on a database for an extended period, a rich collection of information is available.  It is possible to show regressed queries by comparing a Recent time interval (2 weeks in the diagram below) compared to a baseline History period (the Last Year in the diagram below) to see queries that have begun to perform poorly.

    Figure: The query store can show the top 25 regressed queries in the last 2 weeks and give suggestions on how to improve them

    In the diagram we can see the total duration for a query (top left), the execution plans that have been used on a particular query (top right) and the details of a selected execution plan in the bottom pane.  The actual SQL statement that was executed is also visible, allowing the query to be linked back to a particular EF code statement.

    The Top Resource Consuming Queries tab is extremely valuable for performance tuning a database.  You can see the Top 25 Queries by:

    • Duration
    • CPU Time
    • Execution Count
    • Logical Reads
    • Logical Writes
    • Memory consumption
    • Physical Reads

    All of these readings can be broken down using the statistical measures of:

    • Total
    • Average
    • Min
    • Max
    • Std Deviation

    As with the Regressed Queries tab, the query plan history and details of a particular query plan are available for inspection. ​This provides all the required information to track down the part of the application that is calling the poorly performing SQL, and also provides insight into how to fix the poor performance depending on which SQL step is taking the most time.

  8. Do you know when to implement IDisposable?

    If you access unmanaged resources (e.g. files, database connections, etc.) in a class, you should implement IDisposable and implement the Dispose method to allow you to control when the memory is freed.  If not, this responsibility is left to the garbage collector to free the memory when the object containing the unmanaged resources is finalised.  This means the memory will be unnecessarily consumed by resources which are no longer required, which can lead to inefficient performance and potentially running out of memory altogether.


    public class MyClass
    {
      private FileStream myFile = File.Open(...); // This is an unmanaged resource
    }

    // Elsewhere in the project:
    private void UseMyClass()
    {
      var myClass = new MyClass();
      /*
      Here we are using an unmanaged resource without disposing of it, meaning it will hang around
      in memory unnecessarily until the garbage collector finalises it
      */
    }

    Figure: Bad example - Using unmanaged resources without disposing of them when we are done


    public class MyClass : IDisposable
    {
      private FileStream myFile = File.Open(...); // This is an unmanaged resource

      public void Dispose()
      {
        myFile.Dispose(); // Here we dispose of the unmanaged resource
        GC.SuppressFinalize(this); // Prevent a redundant garbage collector finalize call
      }
    }

    Figure: Good example - Implementing IDisposable allows you to dispose of the unmanaged resources deterministically to maximise efficiency


    Now we can use the "using" statement to automatically dispose of the instance when you are finished with it:

    private void UseClass()
    {
      using (var myClass = new MyClass())
      {
        // Do stuff with myClass here...
      } // myClass.Dispose() is automatically run at the end of the using block
    }

    Figure: Good example - With the using statement, the unmanaged resources are disposed of as soon as we are finished with them


    See here for more details.​

  9. Do you know when to use StringBuilder?

    ​​A String object is immutable - this means if you append two Strings together, the result is an entirely new String, rather than the original String being modified in place.  This is inefficient because creating new objects is slower than altering existing objects.  Using StringBuilder to append Strings modifies the underlying Char array rather than creating a new String.  Therefore, whenever you are performing multiple String appends or similar functions, you should always use a StringBuilder to improve performance.


    String s = "";
    for (int i = 0; i < 1000; i++) {
      s += i;
    }

    ​​​Figure: Bad example - This inefficient code results in 1000 new String objects being created unnecessarily.


    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
      sb.Append(i);
    }

    ​Figure: Good example - This efficient code only uses one StringBuilder object.​​

    See here for more details.


  10. Do you know the best load testing tools for web applications?

    Load testing helps you ensure that your apps can scale and do not go down when peak traffic hits. Load testing is typically initiated for seasonal events such as tax filing season, Black Friday, Christmas, summer sales, etc.

    Ultimately the tool used will depend on the number of users you want to target and what infrastructure you have available. 
    Most commercial tools will support up to 50 virtual users in the load test on your own hardware. Alternatively, for more users you should use cloud-based offerings like:

    testingtools.jpg
    Figure: Load Storm results​
  11. Do you stress test your infrastructure before testing your application?

    The infrastructure that your application is deployed to is often never tested, but it can be the culprit for performance issues due to misconfiguration or virtual machine resource contention. We recommend setting up a simple load test on the infrastructure, like setting up a web server that serves one image and having the load tests simply fetch that image.

    This simple test will highlight: 

    • The maximum performance you can expect (are your goals realistic for the infrastructure?)
    • Any network-related issues
    • Uplink bandwidth, DDoS protection, and firewall issues

    infratests.jpg
    Figure: Work out the maximum performance of the infrastructure before starting
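
    Here is a hand-rolled sketch of the "fetch one image" test using HttpClient and Task.WhenAll - not a replacement for a proper load testing tool, but enough to sanity-check the infrastructure (the URL and user count are placeholders):

    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Net.Http;
    using System.Threading.Tasks;

    public static class InfrastructureSmokeTest
    {
      private static readonly HttpClient client = new HttpClient();

      public static async Task Main()
      {
        const string imageUrl = "https://your-web-server/test.png"; // Placeholder - the single image served by the web server
        const int concurrentUsers = 50;                             // Placeholder - match your performance goal

        // Fire all the requests at once and collect the individual timings
        var timings = await Task.WhenAll(Enumerable.Range(0, concurrentUsers).Select(_ => FetchAsync(imageUrl)));

        Console.WriteLine($"Average: {timings.Average():F0} ms, Max: {timings.Max():F0} ms");
      }

      private static async Task<double> FetchAsync(string url)
      {
        var stopwatch = Stopwatch.StartNew();
        using var response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return stopwatch.Elapsed.TotalMilliseconds;
      }
    }

    Figure: A minimal concurrent-fetch sketch - useful as a sanity check before reaching for a full load testing tool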

    ​Note: if you have other servers in the mix, then you can make another simple test to pull records from the database to check the DB server as well.

  12. Do you know the highest hit pages?

    Measuring which pages receive the most hits will help a great deal when looking to optimise your website.

    • Measuring usage will help to determine the current business value of a page or feature
    • Optimising a highly used page will have a higher impact on overall system performance

    A number of great tools exist to find the highest hit pages.

    App-Insights-return-request.png
    Figure: Application Insights can return request counts under the performance screen

    GoogleAnalytics-Stats.png
    Figure: Google Analytics provides powerful usage statistics
  13. Do you prioritize performance optimization for maximum business value?

    Like any development, performance optimization requires development time and should therefore be prioritized for business value.​​

    ​Include the following considerations:

    • What is the preferred performance for this feature?
    • What represents an acceptable performance metric?
    • How many users are expected to use this feature within a timeframe?
    • What is the business impact of poor performance for this feature?
    • Are we planning to significantly change this feature in the near future?

    ​Hi Adam,

    As per our conversation, we have identified 2 slow queries:

    Query 1: On the “Edit Item” screen (admin only) we have identified 4 separate SQL queries that can be rewritten as one. We estimate that this will reduce the response time by 1.5 seconds. Only a few admin users will be affected
    Query 2: On the Home page is a query that currently takes 1 second that we can reduce to ½ a second. This affects all users.

    We optimized the "Edit Item" page because that had the biggest measurable improvement.

    ​Bad example: although the admin page has a bigger potential saving, the home page affects all users and therefore probably has a higher business value. Business value should be determined by the Product Owner, not the developer

    Hi Adam,

    As per our conversation, we have identified a query in the "Save Timesheet" endpoint that often takes more than 2 seconds to complete – well beyond the project’s 800ms target.
    However, this entire feature is scheduled to be migrated from MVC to Angular in the next sprint.

    Recommended actions:
    - We won't optimize the existing implementation
    - Raise the priority of the Angular migration PBI
    - Ensure that performance metrics are included in the acceptance criteria of the migration PBI

    1. Please “reply all” with changes or your acceptance. 

    ​​Good example: there is little business value in optimizing code that will soon be replaced – but the final decision on business value is left to the Product Owner​


  14. Do you use async/await for all IO-bound operations?

    IO-Bound operations are operations where the execution time is not determined by CPU speed but by the time taken for an input/output operation to complete.​

    ​​​​Examples include:

    • Reading from a hard disk
    • Working with a database
    • Sending an email
    • HTTP REST API calls

    It's important to note that all these IO operations are usually several orders of magnitude slower than performing operations against data in RAM.

    Modern .NET applications provide a thread pool for handling many operations in parallel. Threads are pooled to mitigate the expense of thread creation and destruction.

    If an individual thread is waiting for IO to complete, it is IO blocked and cannot be used to handle any more work until that IO operation is finished.

    When using async, the thread is released back to the thread pool while waiting for IO; the await keyword registers a callback that will be executed when the IO completes.

    The async/await pattern is most effective when applied “all the way down”. For ASP.NET web applications this means that the Controller Action – which is usually the entry point for a request into your application – should be async.

    public ActionResult Gizmos()
    {
        var gizmoService = new GizmoService();
        return View("Gizmos", gizmoService.GetGizmos());
    }

    Figure: Bad Example – this MVC Controller Action endpoint is not async so the thread assigned to process it will be blocked for the whole lifetime of the request​


    public async Task<ActionResult> GizmosAsync()
    {
        var gizmoService = new GizmoService();
        return View("Gizmos", await gizmoService.GetGizmosAsync());
    }

    Figure: Good example - this MVC Controller Action is async. The thread will be released back to the thread pool while waiting for any IO operations under the “gizmoService” to complete


    Above code examples are based on: https://docs.microsoft.com/en-us/aspnet/mvc/overview/performance/using-asynchronous-methods-in-aspnet-mvc-4

    With these async/await patterns on .NET​ Core, our applications can handle very high levels of throughput.

    Once an async/await based app is under heavy load, the next risk is from thread-pool starvation. Any blocking operations anywhere in the app can tie up threadpool threads – leaving fewer threads in the pool to handle the ongoing throughput. For this reason, the best practice is to ensure that all IO-bound operations are async and to avoid any other causes of blocking.
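
    A minimal sketch of the kind of blocking call that causes thread pool starvation is shown below (GizmoService is the same hypothetical service used in the examples above):

    // Bad - sync-over-async: .Result blocks a thread pool thread until the IO completes,
    // tying it up (and risking deadlocks) instead of releasing it back to the pool
    public ActionResult Gizmos()
    {
        var gizmoService = new GizmoService();
        var gizmos = gizmoService.GetGizmosAsync().Result; // Blocks - await this instead, as in the good example above
        return View("Gizmos", gizmos);
    }

    Figure: Bad example - blocking on .Result defeats the purpose of async and contributes to thread pool starvation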

    ​For more information on understanding and diagnosing thread pool starvation, read: https://blogs.msdn.microsoft.com/vancem/2018/10/16/diagnosing-net-core-threadpool-starvation-with-perfview-why-my-service-is-not-saturating-all-cores-or-seems-to-stall/
