Introduction

By Sergey N. Zheltov and Stanislav V. Bratanov

Anyone designing multithreaded applications will agree: the problem of efficient synchronization is among the most difficult tasks of parallel programming. Modern operating systems provide a multitude of synchronization objects, which determine a variety of synchronization methods and schemes.

This article addresses some of the issues of parallel program synchronization and tries to clarify at least a few of them. Below, several synchronization schemes are discussed; synchronization object behavior is described; the internal system implementation of user-visible synchronization objects or functions is also explained where applicable. Hidden timings and latencies are provided; execution thread layout is shown over time, and non-obvious thread time shifts are pointed out. Readers will also find code examples illustrating important points of the synchronization problem.

All discussions of synchronization objects and algorithms appearing in this article pertain to Microsoft Windows* implementations of such objects and algorithms.

Definitions

Several abbreviations are used throughout the article to denote the following terms:

IPI stands for Inter-Processor Interrupt, an interruption signal sent by one processor to another.

APC stands for Asynchronous Procedure Call, a Microsoft Windows* notification scheme enabling the execution of a specified procedure in the context of a specified thread.
Internal Implementation

This article addresses wake-up and signaling schemes, which are used to directly control the execution of threads. Schemes of mutual exclusion, which are mostly employed to synchronize resource usage (granting access to a resource to only one thread at a time), are beyond the scope of the article. Also, in this article we assume that all necessary resources have already been divided between threads; the task is to wake the threads up and load each of them with an appropriate amount of work.

As various mutual exclusion constructions are not within the scope of this article, only two sorts of synchronization objects typically used for signaling are considered: events and semaphores.

From the operating system kernel's point of view, all objects involved in synchronization (those whose handles may be specified as arguments of the WaitForSingleObject/WaitForMultipleObjects functions) are derived from one basic structure, DISPATCHER_HEADER, declared in the Microsoft Driver Development Kit in the following form:
typedef struct _DISPATCHER_HEADER {
    union {
        struct {
            UCHAR Type;
            UCHAR Absolute;
            UCHAR Size;
            union {
                UCHAR Inserted;
                BOOLEAN DebugActive;
            };
        };
        volatile LONG Lock;
    };
    LONG SignalState;
    LIST_ENTRY WaitListHead;
} DISPATCHER_HEADER;
The most important fields are Type, SignalState, and WaitListHead: they let the system differentiate between synchronization objects and organize a list of waiting threads that are to be woken up when SignalState becomes signaled.
So, event, semaphore, mutex, timer, and thread objects all look similar to the kernel.
The operating system kernel supports two types of events, corresponding to the user-level auto-reset and manual-reset events, namely: synchronization events and notification events. The difference between the two types mirrors that between the user-level events: signaling a synchronization event would cause only one waiting thread to wake up, while setting a notification event to a signaled state would wake as many threads as there might be waiting.
The functionality of a semaphore object is close to that of a notification event: when a semaphore is released, the operating system tries to satisfy as many waits as possible (i.e., to mark as many threads ready for execution as possible), though staying within the specified semaphore limit.
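At user level, these kernel objects surface through the standard Win32 API. The following minimal sketch (error handling omitted; the waiting worker threads are assumed to be created elsewhere) shows how the two signaling behaviors are requested:

#include <windows.h>

void signal_workers(void)
{
    /* Auto-reset (kernel: synchronization) event: SetEvent wakes exactly one waiter. */
    HANDLE hEvent = CreateEvent(NULL, FALSE /* bManualReset */, FALSE /* bInitialState */, NULL);
    SetEvent(hEvent);                /* exactly one waiting thread becomes ready */

    /* Semaphore with a maximum count of 3: one release may satisfy up to three waits. */
    HANDLE hSem = CreateSemaphore(NULL, 0 /* lInitialCount */, 3 /* lMaximumCount */, NULL);
    ReleaseSemaphore(hSem, 3, NULL); /* up to three waiting threads become ready */

    /* Inside each worker thread, the wait call is identical for both objects: */
    /* WaitForSingleObject(hEvent or hSem, INFINITE); */
}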
Thread Wake Algorithms
Two basic methods of triggering the execution of a thread include:
waking a thread by signaling the synchronization object it waits on, and
queuing an APC to the context of an alertable thread.
An attentive reader may think we forgot about the Sleep function, as well as about SuspendThread and ResumeThread. But the former function is equivalent to waiting on a timer and thus does not contradict our classification.
The latter suspension/resumption case is more complex, though still within the above classification: suspending a thread is implemented as a wait on an internal semaphore associated with each thread in the system. When the SuspendThread function is called, the system queues an APC object to the target thread, which in its turn enters a wait state (on the aforementioned semaphore). The ResumeThread function releases the semaphore, and the execution of the thread resumes.
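From the caller's perspective the internal semaphore is invisible; a minimal illustration (hWorker is assumed to be a handle to an already-running thread, opened with THREAD_SUSPEND_RESUME access):

/* Queues the kernel-mode APC that makes the target thread wait on its suspension semaphore. */
DWORD prev = SuspendThread(hWorker);  /* returns the previous suspend count, or (DWORD)-1 on failure */

/* ... the target thread is now blocked, exactly as if it waited on a kernel semaphore ... */

ResumeThread(hWorker);                /* releases the semaphore; execution continues */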
The pseudo-code below exemplifies the process of waking up a waiting thread. The event- or semaphore-driven algorithm, from a general point of view, looks as follows:
OnSetEvent(Object)
{
    switch (Object->Type)
    {
        case SynchronizationEvent:
            new_thread = Object->WaitList[0];
            processor = SelectIdleProcessor();
            SendIPI(processor, new_thread);
            break;

        case NotificationEvent:
            for (i = 0; i < Object->WaitList.size; i++)
            {
                new_thread = Object->WaitList[i];
                processor = SelectIdleProcessor();
                SendIPI(processor, new_thread);
            }
            break;

        case Semaphore:
            for (i = 0; i < min(Object->WaitList.size, Object->Limit); i++)
            {
                new_thread = Object->WaitList[i];
                processor = SelectIdleProcessor();
                SendIPI(processor, new_thread);
            }
    }
}
The APC-based thread-waking algorithm, as opposed to the event-driven scheme, comprises two stages. First, an APC has to be queued to the context of a thread of interest:
QueueAPC(thread, function, type)
{
    if (type == KERNEL_MODE_APC)
    {
        InsertKernelQueueAPC(thread, function);
    }
    else
    {
        InsertNormalQueueAPC(thread, function);
    }
    if (thread.state == WAITING_IN_USER_MODE_AND_ALERTABLE)
    {
        processor = SelectIdleProcessor();
        SendIPI(processor, thread);
    }
}
Then, when a new thread is activated, all APC functions associated with the thread need to be executed:
OnThreadSwitch(new_thread)
{
    if (new_thread->apc_queue.size != 0)
    {
        for (i = 0; i < new_thread->apc_queue.size; i++)
        {
            CallAPCFunction(new_thread->apc_queue.function[i]);
        }
    }
}
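At user level, this two-stage mechanism is exposed through QueueUserAPC and alertable waits. A minimal sketch (the worker routine and the way its handle is obtained are illustrative):

#include <windows.h>
#include <stdio.h>

/* Stage two: the queued routine runs in the context of the alertable thread. */
VOID CALLBACK ApcRoutine(ULONG_PTR param)
{
    printf("APC executed with parameter %lu\n", (unsigned long)param);
}

DWORD WINAPI Worker(LPVOID arg)
{
    /* An alertable wait: pending user-mode APCs are delivered here. */
    SleepEx(INFINITE, TRUE /* bAlertable */);
    return 0;
}

/* Stage one, performed by the controlling thread on the worker's handle: */
/* QueueUserAPC(ApcRoutine, hWorkerThread, 42); */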
You may observe that the execution of the destination thread's 'real' code is delayed until all the queued functions are executed. This, in our opinion, may impose additional overhead in case such a scheme is used for thread synchronization. On the other hand, when employed to notify a running or suspended thread of the completion of an input/output operation or the expiration of a preset time interval, the APC approach seems to be one of the fastest possible solutions.

Useful Timings

Let's prove or rebut the above considerations on the synchronization methods and their usability with actual timing information. A sample OpenMP application built with different compilers (in order to reproduce at least two different synchronization models) and a manual thread control example were investigated. The results are shown below.
Figure 1 gives an overview of the sample application. Three regions separated by dashed lines correspond to the APC-, event-, and semaphore-triggered parts of the code. Each part includes the main thread that controls the operation of three additional threads. The application is executed on an Intel® Xeon™ 2.4 GHz dual-processor machine with Hyper-Threading Technology enabled. Each thread is represented by a horizontal color bar. Small strokes above the upper right corner of some bars denote the fact that the corresponding thread was stopped due to a call to a synchronization function (the absence of a stroke means that the thread was suspended due to expiration of its time quantum). Similar small strokes below the bars correspond to special marks inserted into the code of the application by a programmer. Thin lines underscoring the bars show the duration of the synchronization functions (e.g., Sleep, WaitForSingleObject, QueueUserAPC, SetEvent, ReleaseSemaphore).
Figure 1: Parallel Program Layout

Now we'll proceed with the timings.

APC Invocation Delay

Figure 2 provides an example of APC-based thread control. The main thread inserts three APC objects into the queues of the controlled threads. In the figure, the beginning of APC insertion and the start of the last thread are marked with dashed lines. The marked interval corresponds to 47 microseconds (as typed in the bottom right corner of the figure), which is the APC invocation delay.

Figure 2: APC Timings

Event Triggering Delay

Figure 3 presents similar information for the event-based synchronization method. The execution region ranging from the triggering of the first event to the activation of the last thread covers 33 microseconds, which is faster than the time the APC version takes to execute.

Figure 3: Event Timings

Semaphore Release Delay

Figure 4 exemplifies the case when all three threads are controlled with one semaphore. The marked region, again, covers the time from calling the ReleaseSemaphore function to the resumption of the last thread. The marked time interval corresponds to 32 microseconds (the difference of 1 microsecond between the semaphore and event triggering delays may be attributed to measurement error).
Figure 4: Semaphore Timings

The above timing information confirms our assumptions on the internal algorithms involved in synchronization. As predicted, the event- and semaphore-based methods yield similar performance, and the two-stage APC queuing introduces additional overhead compared with events and semaphores.
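For readers who would like to approximate such measurements without a profiling tool, a rough user-level sketch based on QueryPerformanceCounter is given below. It is a simplification under stated assumptions: one worker, one wake-up, and none of the thread layout shown in the figures.

#include <windows.h>
#include <stdio.h>

static HANDLE g_event;
static LARGE_INTEGER g_signaled, g_woken, g_freq;

static DWORD WINAPI Worker(LPVOID arg)
{
    WaitForSingleObject(g_event, INFINITE);
    QueryPerformanceCounter(&g_woken);    /* timestamp taken right after waking up */
    return 0;
}

int main(void)
{
    QueryPerformanceFrequency(&g_freq);
    g_event = CreateEvent(NULL, FALSE, FALSE, NULL);  /* auto-reset event */
    HANDLE hWorker = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    Sleep(100);                           /* let the worker reach its wait state */
    QueryPerformanceCounter(&g_signaled); /* timestamp taken right before signaling */
    SetEvent(g_event);
    WaitForSingleObject(hWorker, INFINITE);
    printf("event wake-up delay: %.1f microseconds\n",
           (double)(g_woken.QuadPart - g_signaled.QuadPart) * 1e6 / (double)g_freq.QuadPart);
    CloseHandle(hWorker);
    CloseHandle(g_event);
    return 0;
}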
Parallel Execution of Threads

Consider Figure 5. We marked with yellow the regions of code where the actual computation is performed. You will notice that no part of the computational code is executed in parallel. These results are slightly disappointing, because there is no guarantee that a properly threaded application outperforms its sequential analog.

Figure 5: Regions of Parallel Execution

All things considered, to ensure that at least part of an application runs in parallel, the following condition should hold true: the sequential time of an application divided by the number of processors should be greater than 50 microseconds for the worst case (see Figure 2). For example, on a two-processor machine only regions that would take longer than about 100 microseconds to execute sequentially are worth distributing between threads. Now the question: is there a way to trigger multiple threads simultaneously?
A closer look at Figure 4 reveals a discrepancy between the expected behavior of threads (as the type of the user-level synchronization object would suggest) and the actual operation. Indeed, if three threads are waiting for the same synchronization object, one expects them to wake up synchronously as that object acquires a signaled state. But the figure looks exactly as if each thread was triggered independently.

This only confirms the hypothetical operating system kernel synchronization algorithms given above. As one can see, when releasing a semaphore, the operating system picks a new thread from a wait list (associated with the semaphore), selects an idle processor, and then signals to that processor (by means of an inter-processor interrupt) to start executing the new thread. This procedure is repeated until the wait list is empty or the semaphore limit is reached. As the wait list is analyzed on one processor and the other processors are invoked sequentially, the delay between threads becomes greater than or equal to the time of wait list processing plus the time of sending and receiving an IPI.

Currently there is no means to enable a synchronous thread start (at least in the operating systems considered in this article), but the kernel algorithm could easily be amended:

OnSetEvent(Object)
{
    ...
    if (Object->Type == Semaphore)
    {
        min_limit = min(Object->WaitList.size, Object->Limit, NumberOfProcessors - 1);
        SendIPI(ALL_EXCLUDING_SELF, Object->WaitList, min_limit);
        for (i = min_limit; i < min(Object->WaitList.size, Object->Limit); i++)
        {
            new_thread = Object->WaitList[i];
            new_thread->State = Ready;
        }
    }
}
Here we suggest using the advanced IPI control features of modern processors, which allow a shorthand to be specified as the IPI destination. For example, specifying the 'all excluding self' shorthand would cause an IPI to be sent to all active processors but the current one. This would enable the simultaneous start of all waiting threads and thus reduce the sequential part of an application. While this suggestion will work for operating system writers, the rest of us should keep in mind that the biggest component of thread synchronization overhead is not the execution time of system functions, but the time lag between the threads being synchronized. And that time lag is proportional to the number of threads or processors (as each computational thread is supposed to occupy a separate processor).
Conclusion

A variety of synchronization objects and schemes exist, rooted in the basic support of the operating system kernel. The actual efficiency of different user-level synchronization schemes may appear to be the same, as such schemes may serve as mere synonyms of one general kernel-mode construction. On the contrary, some synchronization schemes may support a very specific case, and their performance in other cases of synchronization may degrade. That is why it may be useful, while writing a well-balanced multithreaded application, to keep in mind the above relationships between user- and kernel-mode objects, and to apply only synchronization schemes that have been specifically designed for the purpose at hand.

The article provided examples of both the non-standard usage of a synchronization scheme and the usage of different schemes having similar performance due to the same underlying system kernel algorithm. Non-evident timings of synchronization calls and time shifts between threads were also provided. Such timing information is extremely necessary for planning a well-threaded application, since not taking into account the thread switching overhead and thread scheduling order may adversely affect the scalability of the application, and even decrease its performance as compared with a sequential version.

A point that lacks operating system support was identified: relatively large time shifts between threads that were triggered by a common semaphore. A solution based on broadcast IPIs was proposed.

In summary:
Use synchronization objects and schemes specifically designed by operating system manufacturers to suit your task (it is very likely that such schemes are already optimized for solving your type of synchronization problem).
Avoid redundant API calls (in a synchronization loop).
Pay attention to the amount of work you are going to load your threads with (it may not be worth parallelizing, considering the overhead).
Try to compensate for the time shifts by delegating less work to lagging threads (especially for short threads).

References

http://msdn.microsoft.com/library/*
http://www.orgon.com/w2k_internals/*
http://www.sysinternals.com/*
Microsoft Windows 2003 Driver Development Kit
Sven Schreiber. Undocumented Windows 2000 Secrets: A Programmer's Cookbook
Walter Oney. Programming the Microsoft Windows Driver Model (2nd Edition)
About the Authors

Sergey Zheltov is a project manager in Microprocessor Research, Intel Labs. His research interests include media compression and processing, software and platform architecture, signal processing, and high-order spectra. He received a Diploma in radio-physical engineering and an MS degree in theoretical and mathematical physics from Nizhny Novgorod State University, Russia.

Stanislav Bratanov is a software engineer in Microprocessor Research, Intel Labs. His research interests include multi-processor software platforms, operating system environments, and platform-dependent media data coding. He graduated from Nizhny Novgorod State University, Russia.