I have been developing on UNIX for almost two years, working in the team behind one of our department's major products. The product has been on the market for a year and a half and can fairly be called mature. It runs stably and without problems at most customer sites, but at a few sites the whole system hangs from time to time. The hangs follow no pattern, and apart from the stack traces of all processes captured with pstack we had no other clues. As the hangs kept occurring, we came under heavy pressure from customers. . . We could hardly breathe. We had to submit fault reports to our customers, every one of them about this same problem, and it became a sword hanging over our heads.
I was assigned to solve this problem. In fact we had known about it for a long time, and we knew it was a deadlock. But as to how it happened, we went through the code of almost the entire system and found nothing. . .
At the beginning of this year our system hung once again. When I took the call from our maintenance engineer my mood was very heavy; I felt I had not lived up to my responsibility as a programmer. As before, there were no clues apart from the process stack traces. I still opened the email and went through the stack traces the maintenance engineer had sent back. Perhaps because I had rested for a while, my head was unusually clear. Reading the stack traces alongside the source code, I quickly realized that the deadlock was caused by a classic race condition.
When a process is about to exit, it wakes up one of its internal threads. That thread first acquires a lock, does some follow-up work, then unlocks and exits. However, the main thread of the process did not wait for that thread to exit; it simply exited with return. We know that when the main thread returns from main the process terminates, and the operating system kills the other threads in the process without giving them any chance to exit cleanly. That thread might have just acquired the lock and been killed before it could unlock. Any other process or thread that then tries to acquire the lock blocks forever, and so the system deadlocks.
Of course, once I knew that a race condition was causing the deadlock, the fix was straightforward: have the main thread wait for the thread to exit with pthread_join. That removes the race condition and with it the deadlock, so the problem was actually solved. But is there a way to eliminate the more general problem, where the owner of a lock dies without releasing it and every other thread that tries to acquire the lock deadlocks?
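The pthread_join fix can be sketched roughly as follows. This is a minimal illustration, not the product's code: the names cleanup_thread, shutdown_and_wait and cleanup_done are hypothetical, and for brevity the thread is created here rather than woken up as in the real system.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int cleanup_done = 0;   /* hypothetical state protected by `lock` */

/* The internal thread: acquire the lock, do the follow-up work, unlock, exit. */
static void *cleanup_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    cleanup_done = 1;              /* stands in for the real follow-up work */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Exit path of the process: start the cleanup thread and wait for it.
   Returns 0 on success, or a pthread error code. */
static int shutdown_and_wait(void)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, cleanup_thread, NULL);
    if (rc != 0)
        return rc;
    /* Without this join, returning from main could kill the thread
       while it still holds `lock`. */
    return pthread_join(t, NULL);
}
```

The main thread would call shutdown_and_wait() before returning, so the lock is guaranteed to be released by the time the process exits.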
I had asked many people on ChinaUnix about this problem. Some suggested solving it with a network lock (that is, locking and unlocking by accessing a machine on the network), but those approaches are immature and also very complicated, and the experts seemed to consider the question beneath them. After collecting a lot of material, I found that the solution is actually very simple, about as simple as it could be.
On many systems, when the owner of a lock exits without releasing it, the default behavior is that any other thread trying to acquire the lock blocks, which causes a deadlock. By initializing the lock with different attributes, we can change this default:
    pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);
    pthread_mutexattr_setrobust_np(&mattr, PTHREAD_MUTEX_ROBUST_NP);
Setting these two attributes on the lock changes the default behavior. After the owner of the lock dies, another thread that tries to acquire the lock is no longer blocked; instead the lock call returns EOWNERDEAD, and you can handle that error as follows. First call pthread_mutex_consistent_np to restore the consistency of the lock, then call pthread_mutex_unlock to unlock it, and then lock it again. From then on the lock behaves normally. If the consistency of the lock cannot be restored, just call the unlock function and return without calling the lock function again. The next thread that calls the lock function will then get ENOTRECOVERABLE. At that point you need to call pthread_mutex_destroy to destroy the lock and then call the lock's initialization function again to reinitialize it.
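The recovery steps above can be sketched as a small C program fragment. It is only an illustration under some assumptions: it uses the names later standardized in POSIX.1-2008 (pthread_mutexattr_setrobust, PTHREAD_MUTEX_ROBUST, pthread_mutex_consistent) in place of the _np variants, it leaves out the PTHREAD_PRIO_INHERIT protocol setting, and the helpers init_robust_lock, lock_with_recovery, dying_owner and demo are hypothetical names made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t lock;

/* Initialize `lock` as a robust mutex. Returns 0 or a pthread error code. */
static int init_robust_lock(void)
{
    pthread_mutexattr_t mattr;
    int rc = pthread_mutexattr_init(&mattr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setrobust(&mattr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(&lock, &mattr);
    pthread_mutexattr_destroy(&mattr);
    return rc;
}

/* Acquire `lock`, recovering if the previous owner died.
   Returns 0 with the lock held; sets *recovered when EOWNERDEAD was seen. */
static int lock_with_recovery(int *recovered)
{
    int rc = pthread_mutex_lock(&lock);
    if (rc == EOWNERDEAD) {
        /* We hold the lock, but the data it protects may be half-updated.
           Repair that data here, then mark the mutex consistent. */
        rc = pthread_mutex_consistent(&lock);
        if (rc != 0) {
            /* Could not restore consistency: unlock and give up; the
               next locker will see ENOTRECOVERABLE. */
            pthread_mutex_unlock(&lock);
            return rc;
        }
        *recovered = 1;
        pthread_mutex_unlock(&lock);
        rc = pthread_mutex_lock(&lock);   /* lock again; normal from now on */
    } else if (rc == ENOTRECOVERABLE) {
        /* The mutex is unusable: destroy it, reinitialize it, lock it. */
        pthread_mutex_destroy(&lock);
        rc = init_robust_lock();
        if (rc == 0)
            rc = pthread_mutex_lock(&lock);
    }
    return rc;
}

/* A thread that dies while still owning the lock. */
static void *dying_owner(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    return NULL;                          /* exits without unlocking */
}

/* Returns 1 if the dead owner was detected and the lock recovered,
   0 if no recovery was needed, -1 on error. */
static int demo(void)
{
    pthread_t t;
    int recovered = 0;
    if (init_robust_lock() != 0)
        return -1;
    if (pthread_create(&t, NULL, dying_owner, NULL) != 0)
        return -1;
    pthread_join(t, NULL);
    if (lock_with_recovery(&recovered) != 0)
        return -1;
    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    return recovered;
}
```

Calling demo() exercises the EOWNERDEAD path: the thread exits while holding the lock, and the next locker recovers it instead of blocking. In a real program the interesting part is the repair step before pthread_mutex_consistent, since the dead owner may have left the protected data half-updated.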
The solution above is best suited to the case where the system has only one lock. If the deadlock is caused by contention over multiple locks, this solution cannot help. . .
Note: some of the functions above end in the suffix _np, which stands for "non-portable". It means that these functions are extensions implemented by particular systems and are not part of the POSIX standard.