Symmetric multiprocessing (SMP) systems are computers that contain multiple CPUs, where a CPU can be a single core on a multi-core processor or a single-core processor on a multiprocessor computer. Microsoft Windows and other modern operating systems can take advantage of SMP systems to achieve increased performance. SMP systems achieve better performance by executing multiple threads on multiple CPUs concurrently. Performance improvement can come through executing multiple processes concurrently as well as from executing multiple threads within a process concurrently. Generally, when an application is implemented with multiple threads, the operating system attempts to schedule each thread on a separate CPU when possible. In the best-case scenario, an SMP system can execute a multithreaded application n times faster than a non-SMP system, where n is the number of CPUs. An application that exhibits this best-case performance scaling is said to scale linearly with the number of CPUs.

An application scales linearly if the work done in the application is completely divided into an independent part for each CPU, with no data shared between the independent parts. If the parts have to interact with each other or access the same data, the performance increase can be substantially less than linear. In some cases, performance of an application can even be worse on an SMP system than on a non-SMP system. The performance degradation or lack of improvement on SMP systems is caused by a variety of factors. The most typical factors are data sharing, where both reads and writes occur to the same or nearby data in memory, and synchronization interdependencies, where threads must coordinate with each other for the application to execute correctly. The degree to which these factors impact performance depends on the particular CPU architecture in the SMP system. For example, on some quad-core CPUs the memory cache is shared among all of the cores, while on others the cache is separate for each core. Such a difference can result in performance differences when the threads access a large amount of shared data.
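The gap between ideal and actual scaling can be made concrete with Amdahl's law, a standard model (not TestStand-specific) in which only a fraction of the work parallelizes and the rest, such as shared-data access and synchronization, runs serially:

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Predicted speedup when only parallel_fraction of the work
    can run concurrently on n_cpus (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cpus)

# Fully independent work scales linearly with the number of CPUs.
print(amdahl_speedup(1.0, 4))    # 4.0
# If 25% of the time is serial (data sharing, synchronization),
# four CPUs deliver well under a 4x speedup.
print(amdahl_speedup(0.75, 4))   # ~2.29
```

Even a modest serial portion dominates as the CPU count grows, which is why reducing data sharing and synchronization matters more on systems with more CPUs.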

General Strategies for Optimizing Performance on SMP Systems

The best way to optimize an application for an SMP system is to break up the application into as many independent, concurrent tasks as possible and run at least n of them at a time, in n separate threads, where n is the number of CPUs on the computer. The application should avoid sharing data among threads and avoid requiring synchronization among threads as much as possible. Unfortunately, it is often difficult or impossible to design application functionality in a way that does not require threads to share data and does not require thread synchronization.
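The strategy above can be sketched in Python for illustration; `process_chunk` is a hypothetical independent unit of work, and note that for CPU-bound Python code the GIL means you would use processes rather than threads, so this sketch only illustrates the structure of the decomposition:

```python
import concurrent.futures
import os

def process_chunk(chunk):
    # Hypothetical independent task: touches no shared data,
    # requires no synchronization with the other chunks.
    return sum(x * x for x in chunk)

def run_in_parallel(data):
    n = os.cpu_count() or 1
    # Split the work into one independent part per CPU.
    chunks = [data[i::n] for i in range(n)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        # Results are combined only after the independent parts finish.
        return sum(pool.map(process_chunk, chunks))
```

The key property is that the threads interact only at the final combining step, which keeps the shared-data and synchronization costs described above to a minimum.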

Because the performance impact of data sharing and synchronization depends on the CPU architecture, you can benchmark the application on different computer systems to determine the computer architecture on which the application performs best. You can also optimize the application for a particular computer system. If parts of the application perform worse with multiple CPUs, you can tune performance for a particular computer system by limiting certain threads or processes to a subset of the CPUs on the system, and then benchmark to determine how such tuning affects performance.
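As a minimal sketch of constraining a process to a CPU subset, assuming a Linux system where Python exposes `os.sched_setaffinity` (Windows programs use the `SetProcessAffinityMask`/`SetThreadAffinityMask` APIs instead, and TestStand provides the features described below):

```python
import os

def constrain_to_first_cpus(k):
    """Limit the current process to the first k of its currently
    allowed CPUs, returning the previous affinity set so the
    caller can restore it after benchmarking."""
    previous = os.sched_getaffinity(0)      # 0 means the current process
    subset = set(sorted(previous)[:k])
    os.sched_setaffinity(0, subset)
    return previous

if hasattr(os, "sched_setaffinity"):        # Linux-only API
    saved = constrain_to_first_cpus(2)      # benchmark the constrained configuration...
    os.sched_setaffinity(0, saved)          # ...then restore and try another configuration
```

Restoring the saved affinity between runs lets you benchmark several configurations in one session and compare the timings.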

Strategies for Optimizing TestStand Performance on SMP Systems

TestStand applications are often multithreaded. You can execute sequences in parallel by using a Sequence Call step to execute a sequence in a new thread or new execution. You can execute code modules in parallel by executing the sequences that contain them in parallel. You can also create threads in code modules by using your programming environment.

TestStand sequences that execute in parallel threads communicate with the engine and might also share data or synchronize with other sequences, so they might not scale well on SMP systems. Additionally, the code modules that sequences call might perform operations that do not scale well on SMP systems. Therefore, your system might perform better if you constrain some or all threads to a subset of CPUs. If the TestStand sequences spend significant time in code modules that are completely independent, the system might perform best when you use all CPUs for such code modules while constraining the CPUs the rest of the system uses.

To allow you to tune your system to achieve the best performance, TestStand provides several features to control how the TestStand process and its threads utilize the CPUs on your computer. In the descriptions of the features below, the CPU affinity of a process refers to the set of CPUs on which the threads in the process may run. The CPU affinity of a thread refers to the subset of those CPUs on which a particular thread may run.

  • Station Option: Default CPU Affinity for Threads—Determines the default CPU affinity for new threads of executions and for the user interface thread. The default is to allow all CPUs. This option is located on the Preferences tab of the Station Options dialog box.
  • Sequence Call Step: CPU Affinity for New Thread or Execution—The CPU affinity of the new thread in the current execution or the initial thread in a new execution. By default the new thread uses the Default CPU Affinity for Threads station option. This setting allows you to specify that the new thread allows all CPUs, uses the CPU affinity of the calling sequence, or allows specific CPUs that you specify. This option is located on the Sequence Call Advanced Settings window.
  • CPU Affinity Step—Enables you to get or set the CPU affinity of the process or the current execution thread. This step gives you low-level control over how your system utilizes CPUs. This step is located in the Advanced subgroup of the Synchronization group on the Insertion Palette or in the Insert Step submenu of the Steps pane context menu.
Note TestStand represents CPU affinity as an integer, where each bit in the binary representation specifies a CPU, and a CPU can be a single-core processor or a single core on a multi-core processor. The lowest-order bit specifies the first CPU. When specifying a CPU affinity, you might want to use a binary number, which you specify in TestStand with the 0b prefix. For example, setting the CPU affinity for a thread to 0b1010 allows the thread to execute on the second or fourth CPU. You can use -1 for the CPU affinity of a thread to specify that the thread may run on any CPU on which the process may run, and -1 for the CPU affinity of the process to specify that the process may run on any CPU on the computer.
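The bitmask convention in the note above can be illustrated with a short Python sketch; the helper name is hypothetical, but the decoding matches the convention described (bit i allows CPU i+1, and -1 allows every CPU):

```python
def cpus_in_mask(mask, total_cpus):
    """Return the 0-based CPU indices that an affinity bitmask allows.
    A mask of -1 allows every CPU, matching the TestStand convention."""
    if mask == -1:
        return list(range(total_cpus))
    return [i for i in range(total_cpus) if mask & (1 << i)]

# 0b1010 sets bits 1 and 3, i.e. the second and fourth CPUs.
print(cpus_in_mask(0b1010, 4))   # [1, 3]
print(cpus_in_mask(-1, 4))       # [0, 1, 2, 3]
```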

You can use the CPU affinity features together to tune your test system for the best performance on a particular computer. For example, if you have a quad-core computer, you could set the Default CPU Affinity for Threads station option to 0b0011 to restrict the test system to the first two CPUs and benchmark to determine how the performance of the system is affected. If parts of the test system contain code modules that can be highly parallelized and that do a significant amount of work, you could then use the CPU Affinity settings of a Sequence Call step or a CPU Affinity step to allow the execution threads for those parts of the system to use all CPUs. You can try various CPU affinity configurations to determine which gives the test system the best performance on a particular computer. Because performance results depend on CPU architecture, the best tuning for one computer might differ from the best tuning for another.
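The benchmark-and-compare loop described above can be sketched generically; `run_workload` and `apply_mask` are hypothetical hooks (in a real test system, `apply_mask` would correspond to a CPU Affinity step or station option setting, and `run_workload` would run the test sequence):

```python
import time

def best_affinity_config(run_workload, masks, apply_mask):
    """Time a workload under each candidate affinity mask and return
    the fastest mask along with the full timing table."""
    timings = {}
    for mask in masks:
        apply_mask(mask)                     # constrain the system for this trial
        start = time.perf_counter()
        run_workload()                       # execute the benchmark workload
        timings[mask] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings
```

Real benchmarks should repeat each configuration several times and compare representative workloads, because run-to-run variation can easily exceed the difference between two affinity configurations.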