Oracle Diagnostics Case-Job Task Stop Execution
Last Updated:
Saturday, 2004-11-20 12:47
Eygle
Yesterday received the R & D report, the task was not performed normally, causing some operations to fail.
Start intervention. System environment: Sunos db 5.8 generic_108528-21 Sun4u sparc sunw, Ultra-4 Oracle9i Enterprise Edition Release 9.2.0.3.0 - Production
1. First intervene in the check database task
$ SQLPLUS "/ as sysdba"
SQL * Plus: Release 9.2.0.3.0 - Production On Wed Nov 17 20:23:53 2004
CopyRight (C) 1982, 2002, Oracle Corporation. All Rights Reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.3.3.0 - Production
With the partitioning, OLAP AND ORACLE DATA MINING OPTIONS
JServer Release 9.2.0.3.3.0 - PRODUCTION
SQL> SELECT JOB, LAST_DATE, LAST_SEC, NEXT_DATE, NEXT_SEC, BROKEN, FAILES from DBA_JOBS
Job Last_Date last_sec next_date next_sec b Failures Interval
---------- ---------------------------------------------------------------------------------------------------------------------------------------- ---------- - ------------------------------------------------------
31 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N 0 Trunc (sysdate 1) 1/24
27 16-NOV-04 00:00:04 17-NOV-04 00:00:00 N 0 Trunc (sysdate) 1
35 16-NOV-04 01:00:02 17-NOV-04 01:00:00 N 0 Trunc (sysdate 1) 1/24
29 16 - NOV-04 00:00:04 17-NOV-04 00:00:00 N 0 Trunc (sysdate) 1
30 01-NOV-04 06:00:01 01-DEC-04 06:00:00 N 0 Trunc (add_months (sysdate, 1), 'mm') 6/24
65 16-NOV-04 04:00:03 17-NOV-04 04:00:00 N 0 Trunc (sysdate 1) 4/24
46 16-NOV-04 02:14:27 17-NOV-04 02:14:27 n 0 sysdate 1
66 16-NOV-04 03:00:02 17-NOV-04 18:14:49 N 0 Trunc (sysdate 1) 3/24
Rows SELECTED.
It is found that the job task is not performed normally, the earliest one should be executed at 17-NOV-04 01:00:00. But no execution. 2. Establish test JOB
Create Or Replace Procedure Pining
IS
Begin
NULL;
END;
/
Variable Jobno Number;
Variable instno number;
Begin
SELECT Instance_Number Into: Instno from V $ INSTANCE
DBMS_Job.Submit (: Jobno, 'Pining;', Trunc (sysdate 1/288, 'mi'), 'Trunc (sysdate 1/288,' 'Mi') ', True,: Instno;
END;
/
Found the same, not executed. But there is no problem with DBMS_Job.Run (
3. Restore attempts
Suspected is the CJQ0 process invalid, first set Job_Queue_Processes to 0, Oracle will kill CJQ0 and corresponding JOB process SQL> ALTER system set job_queue_processes = 0;
Wait 2 ~ 3 minutes, reset
SQL> ALTER SYSTEM SET JOB_QUE_PROCESS = 5;
At this point, PMON will restart the CJQ0 process.
You can see the following information in the alert log:
THU NOV 18 11:59:50 2004
Alter system set job_queue_processes = 0 scope = memory;
THU NOV 18 12:01:30 2004
ALTER system set job_queue_processes = 10 scope = memory;
THU NOV 18 12:01:30 2004
Restarting Dead Background Process CJQ0
CJQ0 Started with PID = 8
But Job still does not perform, and when it is again modified, CJQ0 is dead.
Thu Nov 18:52:05 2004
Alter system set job_queue_processes = 0 scope = memory;
THU Nov 18 14:09:30 2004
ALTER system set job_queue_processes = 10 scope = memory;
Thu Nov 18 14:10:27 2004
Alter system set job_queue_processes = 0 scope = memory;
THU NOV 18 14:10:42 2004
ALTER system set job_queue_processes = 10 scope = memory;
Thu Nov 18 14:31:07 2004
Alter system set job_queue_processes = 0 scope = memory;
THU Nov 18 14:40:14 2004
ALTER system set job_queue_processes = 10 scope = memory;
Thu Nov 18 14:40:28 2004
Alter system set job_queue_processes = 0 scope = memory;
THU Nov 18 14:40:33 2004
Alter system set job_queue_processes = 1 scope = memory;
THU NOV 18 14:40:40 2004alter system set job_queue_processes = 10 scope = memory;
THU NOV 18 15:00:42 2004
Alter system set job_queue_processes = 0 scope = memory;
THU NOV 18 15:01:36 2004
Alter system set job_queue_processes = 15 scope = memory;
4. Try to recover the database this must be done at night
PMON Started with PID = 2
DBW0 Started with PID = 3
LGWR Started with PID = 4
Ckpt Started with PID = 5
SMON Started with PID = 6
Reco Started with PID = 7
CJQ0 Started with PID = 8
QMN0 Started with PID = 9
....
CJQ0 is started normally, but Job still does not execute.
5. No way ...
Continue research ... actually discovered that Oralce has such a bug
1. Clear Description of the problem ENCOUNTERED: SLGCSF () / SLGCS () on Solaris Will Stop Incrementing After 497 Days 2 HRS 28 MINS (Approx) Machine Uptime.
2. Pertinent Configuration Information No Special Configuration Other Than Long Machine Uptime.
3. INDICATION OF THE FREQUENCY AND PREDICTABILITY OF THE PROBLEM 100% But Only After 497 Days.
4. Sequence of events leading to the problem If the gethrtime () OS call returns a value> 42949672950000000 nanoseconds then slgcs () stays at 0xffffffff. This can cause some problems in parts of the code which rely on slgcs () to keep moving. EG: IN KKJSSRH () Does "Now = SLGCS (& SE)" And Compares That To a Previous TimeStamp. After 497 Days Uptime SLGCS () Keeps Returning 0xffffffff SO "Now - KKJLSRT" Will Always Return 0..
5. Technical Impact ON The Customer. Include Persistent After Effects. In this case 497 days uptime. Other symptoms could Occur in Various Places in The Code.
Ok, it turns out that the timer overflows, check my host:
BASH-2.03 $ UPTime
10:00 PM Up 500 Day (s), 14:57, 1 User, Load Average: 1.31, 1.09, 1.08
Bash-2.03 $ DATE
Fri Nov 19 22:00:14 CST 2004 is just more than 497 days. FT.
6. Arrange to restart the host system ..
This problem is depressed enough, NND, who once thought of Oracle, ...
Oracle finally claimed:
FIX Made It Into 9.2.0.6 PatchSet
9206 on Solaris has not been released ... dizzy.
Ok, just a matter, if there is any problem, it is very incredible, so boldly doubizes Oracle, it is bug, it may be bug.
7.faq
Some friends ask questions on PUB Q: Is there the same problem for different platforms?
A: For different platforms, there is the same problem because Oracle uses the standard C function gethrtime reference: http://www.eygle.com/unix/man.page.of.Gethrtime.htm
There is a problem with the code that uses the function.
In MetaLink Note: 3427424.8, the platform impact defined by Oracle is: generic (all / most platage forms affected)
Q. The counter overflows. Look at the job is basically 1 day in Job. If you set up Job for 3 days, whether the uPtime should be 497 * 3?
A: The relative time will not be enhanced through the timer inside the Oracle. Because Oracle internal HRTIME_T uses 32-bit count
So maximum is 0xffffffff0xffffffffff = 4294967295
SLGCS () is 1 billion in seconds, overflow at this point at 42949672950000000.
Note that 0xffffffffff, this value is originally unsigned, and now it is -1, then this value is incremented, 1 = 0. Time will stop this. I wrote a small piece of code to verify this content, reference:
[Oracle @ Jumper Oracle] $ CAT unsign.c
#include
INT main (void) {
Unsigned int num = 0xffffffffffffff;
Printf ("NUM IS% D BITS Long / N", SIZEOF (NUM) * 8);
Printf ("NUM = 0x% x / n", NUM);
Printf ("NUM 1 = 0x% x / n", NUM 1);
Return 0;
}
[Oracle @ Jumper Oracle] $ gcc -o unsign.sh unsign.c
[Oracle @ Jumper Oracle] $ ./unsign.sh
Num IS 32 BITS Long
Num = 0xfffffffff
NUM 1 = 0x0
[Oracle @ Jumper Oracle] $
Author: eygle, Oracle technology followers, Oracle technical forum itpub.www.eygle.com from China is the biggest author's personal site you may contact the author by Guoqiang.Gai@gmail.com welcome to explore technical exchanges and links. exchange.
Original source:
http://www.eygle.com/case/Job.can.not.execute.Auto.htm