{"id":6132,"date":"2018-08-03T11:56:34","date_gmt":"2018-08-03T17:56:34","guid":{"rendered":"http:\/\/mattdturner.com\/wordpress\/?p=6132"},"modified":"2018-08-07T12:26:28","modified_gmt":"2018-08-07T18:26:28","slug":"performance-analysis-tools-guide","status":"publish","type":"post","link":"http:\/\/mattdturner.com\/wordpress\/2018\/08\/performance-analysis-tools-guide\/","title":{"rendered":"Quick Guide to Performance Analysis Tools"},"content":{"rendered":"<p>Recently, at work I have been using a lot of performance analysis tools (also known as profiling tools) on the codes that I work on. \u00a0I thought it might be beneficial to provide some information on the performance analysis tools that I have used, as well as give my recommendations for which ones to use. \u00a0All of the tools discussed in this post are capable of profiling OpenMP and\/or MPI codes. \u00a0Note that some of these tools are commercial, and some are open source. \u00a0I don&#8217;t go into detail in this post on how to use these tools; if you would like a post detailing their use, just comment or use the <a href=\"http:\/\/mattdturner.com\/wordpress\/contact-me\/\">contact me<\/a> page.<\/p>\n<p>When running performance analysis, there are two measurement methods that can be used: 1) sampling, and 2) tracing. \u00a0Sampling experiments generally have very little overhead, and are considered to be a very good first step towards identifying performance problems. \u00a0However, the accuracy of the sampling profile can be somewhat low, depending on the sampling frequency used. \u00a0Additionally, sampling does not give any information about the number of times functions are called. \u00a0Tracing, on the other hand, records every function entry and exit. \u00a0Based on the entry and exit times of a function, the time spent inside each function can be calculated, as well as the number of times the function was called. 
\u00a0However, tracing generally has higher overhead than sampling.<!--more--><\/p>\n<p>1. <a href=\"https:\/\/www.arm.com\/products\/development-tools\/hpc-tools\/cross-platform\/forge\">ARM MAP<\/a> (previously Allinea MAP)<\/p>\n<p><strong>Pros<\/strong>: <em>Easy to use<\/em><\/p>\n<p><strong>Cons<\/strong>: <em>Commercial<\/em><\/p>\n<p>This is probably the tool that I recommend most to people who have never used a performance analysis tool. \u00a0It is part of the ARM Forge suite, which includes the graphical debugger DDT (another tool that I highly recommend, but we&#8217;ll save that for another post). \u00a0It is very easy to get started with ARM MAP. \u00a0It doesn&#8217;t provide as much in-depth analysis information as some of the other tools, but it provides more than enough information to start identifying performance bottlenecks in the code. \u00a0MAP provides analysis down to the source line level, and it is designed to be able to profile pthreads, OpenMP, and\/or MPI. \u00a0When using MAP, there is no need to instrument the source files; all you need to do is include the MAP library at link time.<\/p>\n<p><a href=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/arm_map.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-6133\" src=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/arm_map-300x221.png\" alt=\"\" width=\"600\" height=\"441\" srcset=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/arm_map-300x221.png 300w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/arm_map.png 600w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>One concern (perhaps more of a question than a concern) is how MAP determines SIMD usage. \u00a0Recent performance analysis tests with MAP on Intel Haswell processors reported SIMD usage. \u00a0I say that this is a concern because Haswell has SIMD performance counters disabled. 
\u00a0So, I am not sure how MAP can accurately report SIMD usage if the hardware counters for SIMD are not enabled&#8230;<\/p>\n<p>2. CrayPAT<\/p>\n<p><strong>Pros<\/strong>: <em>Easy to use, can provide an immense amount of information<\/em><\/p>\n<p><strong>Cons<\/strong>: <em>Commercial, only available on Cray systems<\/em><\/p>\n<p>Another easy-to-use performance analysis tool is the Cray Performance Analysis Tool (CrayPAT). \u00a0Using CrayPAT is slightly different from using the other performance analysis tools. \u00a0In order to use CrayPAT, you must first compile and link your code after loading the craypat module. \u00a0Once you have an executable, you need to run <em>pat_build<\/em> on the executable. \u00a0The <em>pat_build<\/em> process creates a new executable with either <em>+pat<\/em> or <em>+apa<\/em> appended to the executable name. \u00a0You then run the newly generated executable, which will create profile files. \u00a0Finally, you run <em>pat_report<\/em> on the profile files in order to get the performance analysis report.<\/p>\n<p>As mentioned previously, tracing experiments can provide information that sampling experiments are incapable of providing. \u00a0However, this additional information comes at the cost of increased profiling overhead. \u00a0To allow tracing experiments to be performed with low overhead, CrayPAT includes Automatic Program Analysis (APA). \u00a0Using this feature, you first create an instrumented executable for a sampling experiment. \u00a0When generating the report (via <em>pat_report<\/em>), CrayPAT will also create a <em>.apa<\/em> file that contains suggested options for building an executable for tracing experiments. \u00a0It essentially tells <em>pat_build<\/em> to trace only certain time-consuming functions, as determined by the sampling results. 
\u00a0You then use the <em>*.apa<\/em> file when running <em>pat_build<\/em>, and you can run the newly generated executable to perform a lower-overhead tracing experiment.<\/p>\n<p>CrayPAT reports can be viewed either as a text file or in Cray&#8217;s Apprentice2 software. \u00a0Below is an example of a profile being viewed in Apprentice2.<\/p>\n<p><a href=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/craypat.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-6134\" src=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/craypat-300x158.png\" alt=\"\" width=\"600\" height=\"316\" srcset=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/craypat-300x158.png 300w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/craypat-768x405.png 768w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/craypat-1024x540.png 1024w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/craypat.png 1500w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>3. <a href=\"http:\/\/taucommander.paratools.com\/\">TAU Commander<\/a><\/p>\n<p><strong>Pros<\/strong>: <em>Open source, can provide an immense amount of information<\/em><\/p>\n<p><strong>Cons<\/strong>: <em>Can be difficult to learn as your first performance analysis tool<\/em><\/p>\n<p>Another performance analysis tool that I use frequently is TAU Commander. \u00a0TAU Commander is basically an interface that provides a more intuitive, user-friendly way of using the TAU profiler. \u00a0As such, it provides access to TAU&#8217;s vast array of features. \u00a0In my experience, TAU can provide the most information of the tools discussed in this post (if you know how to use the tool properly). \u00a0It also allows for the most fine-tuning of the performance analysis experiments. 
\u00a0<a href=\"http:\/\/www.paratools.com\/tau\">TAU<\/a> is &#8220;capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements.&#8221; \u00a0The user code can be instrumented using an automatic instrumentor tool, dynamically, at runtime, or manually. \u00a0The profile information can be viewed in the ParaProf software.<\/p>\n<p><a href=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/paraprof.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-6135\" src=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/paraprof-300x210.png\" alt=\"\" width=\"600\" height=\"420\" srcset=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/paraprof-300x210.png 300w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/paraprof-768x538.png 768w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/paraprof.png 802w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/p>\n<p>While TAU is probably capable of providing the most information out of the tools discussed here, I have not used it extensively enough (yet) to comment on all of its capabilities.<\/p>\n<p>4. <a href=\"https:\/\/openspeedshop.org\/\">OpenSpeedShop<\/a><\/p>\n<p><strong>Pros<\/strong>: <em>Open source, easy to use, can compare results between multiple experiments<\/em><\/p>\n<p><strong>Cons<\/strong>: <em>I believe OpenSpeedShop has issues running experiments on KNL<\/em><\/p>\n<div id=\"Signature\">\n<p>OpenSpeedShop is an open source performance analysis tool developed by the <a href=\"https:\/\/www.krellinst.org\/\">Krell Institute<\/a>. \u00a0It provides an easy-to-use GUI, while also allowing users to run from the command line. \u00a0It also provides convenience scripts for some common profiling experiments, such as sampling experiments, hardware counter sampling, I\/O, MPI, etc. 
\u00a0It allows for both sampling and tracing techniques to be used, and works on Intel, AMD, ARM, Intel Xeon Phi, PowerPC, POWER8, and GPU-based systems. \u00a0When running OpenSpeedShop experiments, users do not need to recompile the application in order to get performance data at the function and library level.<\/p>\n<p><span style=\"font-size: small;\"><a href=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/openspeedshop.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-6136\" src=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/openspeedshop-300x184.png\" alt=\"\" width=\"600\" height=\"368\" srcset=\"http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/openspeedshop-300x184.png 300w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/openspeedshop-768x471.png 768w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/openspeedshop-1024x627.png 1024w, http:\/\/mattdturner.com\/wordpress\/wp-content\/uploads\/2018\/08\/openspeedshop.png 1330w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><\/span><span style=\"font-size: small;\">\u00a0<\/span><\/p>\n<p>There are many other performance analysis tools out there (<a href=\"https:\/\/software.intel.com\/en-us\/intel-vtune-amplifier-xe\" class=\"broken_link\">Intel VTune<\/a>, <a href=\"http:\/\/www.mcs.anl.gov\/research\/projects\/darshan\/\">Darshan<\/a>, <a href=\"http:\/\/ipm-hpc.sourceforge.net\/\">IPM<\/a>, <a href=\"http:\/\/valgrind.org\/\">Valgrind<\/a>, etc.), but these are the four that I have used the most. \u00a0Feel free to comment with any questions that you might have, and if you think some other profiler is better, please let me know.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Recently, at work I have been using a lot of performance analysis tools (also known as profiling tools) on the codes that I work on. 
\u00a0I thought it might be beneficial to provide some information for the performance analysis tools that I have used, as well as give my recommendations for which ones to use. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6140,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[64,261,260],"tags":[279,292,284,264,283,282,269,267,266,291,293,281,270,288,278,265,289,290,263,276,287,277,285,286,294,271,272,273,274,280,262,275,71,268],"class_list":["post-6132","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-linux-2","category-optimization","category-performance-analysis","tag-allinea","tag-amd","tag-apa","tag-arm","tag-cray","tag-craypat","tag-darshan","tag-ddt","tag-forge","tag-gpu","tag-hardware-counter","tag-haswell","tag-ipm","tag-knl","tag-krell","tag-map","tag-mpi","tag-openmp","tag-openspeedshop","tag-optimization","tag-paraprof","tag-paratools","tag-pat_build","tag-pat_report","tag-performance","tag-performance-analysis","tag-profiler","tag-profiling","tag-sampling","tag-simd","tag-tau","tag-tracing","tag-valgrind","tag-vtune"],"_links":{"self":[{"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/posts\/6132"}],"collection":[{"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/comments?post=6132"}],"version-history":[{"count":4,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/posts\/6132\/revisions"}],"predecessor-version":[{"id":6141,"href":"htt
p:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/posts\/6132\/revisions\/6141"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/media\/6140"}],"wp:attachment":[{"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/media?parent=6132"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/categories?post=6132"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/mattdturner.com\/wordpress\/wp-json\/wp\/v2\/tags?post=6132"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}