Aychin's Oracle RDBMS Blog

Only for Advanced level Professionals

Monthly Archives: August 2014

Oracle FILE_WATCHER and its troubleshooting

Database Version: 11.2.0.3.1 on 64bit Linux

Recently we faced issue related to Oracle file watcher, it worked fine on development server but not on production. Problem was found and solved, the reason why I am writing this post is to share some general information about file watcher and troubleshooting steps of it, because there is not so much information on the web about that. I found that many DBAs faced issues related to file watcher and there is still few opens of them. Even on metalink you can not find some really useful information, there is few bugs registered and still in development. Also troubleshooting steps used here can be applied to any other issue and can be considered as general method.

(!) First of all to start troubleshooting anything, you must clearly understand what it is doing and how it is working.

Considering above mentioned rule, lets try to understand what is it FILE_WATHCER and how it works.

About Oracle FILE_WATCHER

FILE_WATCHER used to monitor operating system folder for new files to be arrived, when it will identify such a file it will trigger the scheduler job which you can use to program some actions related to this event.

How it is designed under the hood is really very simple!

There is a scheduler job owned by SYS user called FILE_WATCHER. Next query will give us more information about it:

select owner, program_owner, program_name, schedule_owner, schedule_name, job_class from dba_scheduler_jobs where job_name='FILE_WATCHER';

file_watcher_job_1

 

 

This job used to trigger FILE_WATCHER_PROGRAM according to interval specified in FILE_WATCHER_SCHEDULE schedule. Lets get information about the schedule:

select owner, repeat_interval from dba_scheduler_schedules where schedule_name='FILE_WATCHER_SCHEDULE';

file_watcher_sched

We can see that it will trigger the job every 10 minutes, it is default interval for file watcher, by adjusting this schedule you can control the intensity of checks for new files. Note that you can not find information about FILE_WATCHER job runs in the DBA_SCHEDULER_JOB_LOG view, it is normal because this job is assigned to the SCHED$_LOG_ON_ERRORS_CLASS job class:

select logging_level from dba_scheduler_job_classes where job_class_name='SCHED$_LOG_ON_ERRORS_CLASS';

file_watcher_schedule_jobclass

 

 

Logging level for this class is set to “FAILED RUNS” which means that only failed executions will be logged to the scheduler job logging table, if there is no entries then job runs was successful.

Ok, lets check what file watcher job executes every 10 minutes:

select owner, program_type, program_action, number_of_arguments from dba_scheduler_programs where program_name='FILE_WATCHER_PROGRAM';

file_watcher_program

 

 

It executes stored procedure DBMS_ISCHED.FILE_WATCH_JOB. This procedure gets information from data dictionary about existing file watchers in the database, this information includes watcher name, its status enabled or not, directory path, file name alias to check, credential name used to operate on OS, and so on, this data stored in data dictionary on the time file watcher was created and maintained each time you modify it. You can get this metadata from the DBA_SCHEDULER_FILE_WATCHERS view. After getting the list of enabled file watchers it starts its iteration over them, it checks for new files by file alias and modification date, if there is files with modification date grater than last file watcher check then this file will be queued to the special queue under the SYS schema called  SCHEDULER_FILEWATCHER_Q.

select owner, queue_table, qid, enqueue_enabled, dequeue_enabled from dba_queues where name='SCHEDULER_FILEWATCHER_Q';

file_watcher_queue

 

 

If you are not good familiar with queues, queue tables and event based jobs then before reading further read my post Using events with DBMS_SCHEDULER it will help you to understand how all this stuff works together.

The queue table of the  SCHEDULER_FILEWATCHER_Q queue is SCHEDULER_FILEWATCHER_QT, it is based on the payload type SCHEDULER_FILEWATCHER_RESULT, which have the next attributes defined:

SYS> select attr_name, attr_type_name from DBA_TYPE_ATTRS where owner='SYS' and type_name='SCHEDULER_FILEWATCHER_RESULT' order by attr_no;

ATTR_NAME		       ATTR_TYPE_NAME
------------------------------ ------------------------------
DESTINATION		       VARCHAR2
DIRECTORY_PATH		       VARCHAR2
ACTUAL_FILE_NAME	       VARCHAR2
FILE_SIZE		       NUMBER
FILE_TIMESTAMP		       TIMESTAMP WITH TZ
TS_MS_FROM_EPOCH	       NUMBER
MATCHING_REQUESTS	       SCHEDULER_FILEWATCHER_REQ_LIST

7 rows selected.

From above listing you can see which information about the files matching to the request, defined by MATCHING_REQUESTS attribute, DBMS_ISCHED.FILE_WATCH_JOB procedure passes to the queue.

As soon as new entry comes to the SCHEDULER_FILEWATCHER_Q queue, event based job associated with this queue, will be triggered and all the information about the file will be passed to the program associated with this user job in the payload type SCHEDULER_FILEWATCHER_RESULT, which can be then used for whatever actions you planned.

Also one note to be considered after you will read next listing from oracle documentation which creates file watcher flow, QUEUE_SPEC attribute of the DBMS_SCHEDULER.CREATE_JOB procedure can accept two options one of them is as regular source queue name as we used in the Using events with DBMS_SCHEDULER  and a file watcher name which is the case in the next listing!

begin
  dbms_scheduler.create_credential(
     credential_name => 'watch_credential',
     username        => 'pos1',
     password        => 'jk4545st');
end;
/
 
create table eod_reports (when timestamp, file_name varchar2(100), 
   file_size number, processed char(1));
 
create or replace procedure q_eod_report 
  (payload IN sys.scheduler_filewatcher_result) as 
begin
  insert into eod_reports values 
     (payload.file_timestamp,
      payload.directory_path || '/' || payload.actual_file_name,
      payload.file_size,
      'N');
end;
/
 
begin
  dbms_scheduler.create_program(
    program_name        => 'eod_prog',
    program_type        => 'stored_procedure',
    program_action      => 'q_eod_report',
    number_of_arguments => 1,
    enabled             => false);
  dbms_scheduler.define_metadata_argument(
    program_name        => 'eod_prog',
    metadata_attribute  => 'event_message',
    argument_position   => 1);
  dbms_scheduler.enable('eod_prog');
end;
/
 
begin
  dbms_scheduler.create_file_watcher(
    file_watcher_name => 'eod_reports_watcher',
    directory_path    => '?/eod_reports',
    file_name         => 'eod*.txt',
    credential_name   => 'watch_credential',
    destination       => null,
    enabled           => false);
end;
/
 
begin
  dbms_scheduler.create_job(
    job_name        => 'eod_job',
    program_name    => 'eod_prog',
    event_condition => 'tab.user_data.file_size > 10',
    queue_spec      => 'eod_reports_watcher',
    auto_drop       => false,
    enabled         => false);
  dbms_scheduler.set_attribute('eod_job','parallel_instances',true);
end;
/

exec dbms_scheduler.enable('eod_reports_watcher,eod_job');

Troubleshooting

Now after it is absolutely clear how Oracle file watcher designed and works we can start our troubleshooting. First lets define our trouble! In my case actual issue was:

Issue:

New file arrives to the monitored directory or timestamp of the existing file refreshed by Linux touch command but file watcher do not capture it. The regular trace file of the job slave for FILE_WATCHER job generates strange error message:

FILE_TRANSFER error is:

Some times with some unreadable trash like this, and this symbols is different for each execution:

FILE_TRANSFER error is: `ýÕÿ

 

The file watcher created is same as in the above example from Oracle Documentation.

First what we need to check is well known general prerequisites to use Oracle file watcher:

  • Linux-PAM must be installed on the Linux host
  • JVM is installed and valid
    • select comp_name, status from dba_registry where comp_name like '%JAVA%'
  • $ORACLE_HOME/bin/extjob file must be owned by root:oraclegroup and must be setuid i.e. 4750 (-rwsr-x—)
  • $ORACLE_HOME/bin/extjobo should have normal 755 (rwxr-xr-x) permissions and be owned by oracle:oraclegroup
  • $ORACLE_HOME/bin/jssu should exist with root setuid permissions i.e. owned by root:oraclegroup with 4750 (-rwsr-x—)

 

In my case all the listed are true. Operating system user is etl_user and directory path for file watcher is /u01/files/in. User etl_user is in special OS group etl_group, which is the requirement from security group. To test file watcher I will execute DBMS_ISCHED.FILE_WATCH_JOB procedure directly from client session to not wait until the next run of the FILE_WATCHER job.

Copy or create test file in the /u01/files/in name it eod_test1.txt, it fully corresponds to the defined file name alias eod*.txt

Now login to sqlplus as sysdba from the terminal window and execute:

SYS> oradebug setmypid
SYS> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ora_19205.trc
SYS> exec dbms_isched.file_watch_job;
PL/SQL procedure successfully completed.

Now lets check our queue table SCHEDULER_FILEWATCHER_QT if there is any entries. It is empty! It means that for some reason dbms_isched.file_watch_job was unable to queue the file. In the trace file testdb_ora_19205.trc there is only one message:

FILE_TRANSFER error is:

First thing to check is the scheduler Credential, is it correct and is it possible to execute some external OS command using this Credential. In our case credential name is watch_credential. Next Job can be used to check credential and external execution, it will be executed immediately after creation and automatically dropped after execution:

BEGIN
DBMS_SCHEDULER.CREATE_JOB(
   job_name            => 'TESTCRED',
   JOB_TYPE            => 'EXECUTABLE',
   JOB_ACTION          => '/bin/ls' ,
   CREDENTIAL_NAME     => 'WATCH_CREDENTIAL',
   ENABLED             => TRUE,
   AUTO_DROP           => TRUE);
END;
/
select status, error#, additional_info from dba_scheduler_job_run_details where job_name = 'TESTCRED';
STATUS     ERROR#   ADDITIONAL_INFO
---------  ------   ---------------------------------------------------------
SUCCEEDED       0   EXTERNAL_LOG_ID="job_3338509_2301873",USERNAME="etl_user"

Remember that you can not pass arguments to the commands in JOB_ACTION, for example if you will try to execute “/bash/ls -l” then job will fail with “Command not found” error. Then we need to analyze the output from the DBA_SCHEDULER_JOB_RUN_DETAILS table, if Job status is FAILED then check the reason and take required actions. If credentials defined correctly then status will be SUCCEEDED, as listed above.

Next step that I tried and suggest to try for file watcher issues is to execute file watcher as user oracle, user who installed oracle software. It will help to identify OS user/group related issues. Create the scheduler credential with username oracle and repeat watcher test as above. After this test I found that watcher was successful and it queued the new entry for my file and Job automatically triggered! It is good news, now we know that it works with oracle user. Probably it will work if I will add etl_user to the dba group, but it is not allowed by the security! I need to identify which privileges and to which files or folders required to successfully execute file watcher.

etl_user:etl_group have all rights on the /u01/file/in folder.

Lets trace file watcher, Oracle support recommends to use event 27401 for this purpose on level 262144

-bash-$ cd /u01/file/in
-bash-$ touch eod_test1.txt

SYS> oradebug setmypid
SYS> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ora_17401.trc
SYS> alter system set events '27401 trace name context forever, level 262144';
SYS> exec dbms_isched.file_watch_job;
PL/SQL procedure successfully completed.
SYS> alter system set events '27401 trace name context off';

Now in the testdb_ora_17401.trc file we can see much more information, we will see how watcher iterates over all directories that it finds in data dictionary and how it tries to match file aliases and so on. In my case useful parts was:

file_watcher:: 2014-08-19 17:40:49.196: Processing results
file_watcher:: 2014-08-19 17:40:49.196: Iteration 1
file_watcher:: 2014-08-19 17:40:49.196:   Dir Path: /u01/file/in
file_watcher:: 2014-08-19 17:40:49.196:   File Name: eod_test1.txt
file_watcher:: 2014-08-19 17:40:49.196:   File Size: 0
file_watcher:: 2014-08-19 17:40:49.196:   File Tstamp: 19-AUG-14 03.39.31.000000 PM +00:00
file_watcher:: 2014-08-19 17:40:49.196:   Matching Requests: 
file_watcher:: 2014-08-19 17:40:49.196:     Request: EOD_REPORTS_WATCHER
FILE_TRANSFER error is:  ¾ oÿ
file_watcher:: 2014-08-19 17:40:49.228:     Privilege check failed
file_watcher:: 2014-08-19 17:40:49.228:     No valid matching requests, enqueue not done

First of all we can see that watcher was successfully identified my file eod_test1.txt, but then before queuing it failed with “Privilege check failed” message! Because it was able to identify and read my file it means that privileges of the monitored folder and files are correct.

Lets dive deeper, I want to know what dbms_isched.file_watch_job procedure doing and which privilege checks are failed, in your case it can be some other error message, but only way to identify where is the problem is to find what actually process trying to do.

I will use strace utility as usual to track OS calls.

-bash-$ cd /u01/file/in
-bash-$ touch eod_test1.txt

SYS> oradebug setmypid
SYS> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ora_19494.trc

From the trace file name we can see that OS process ID of our client process is 19494, from the new terminal as user root I will execute strace utility. Because dbms_isched.file_watch_job will use $ORACLE_HOME/bin/jssu utility to execute OS specific commands we need to trace its execution also, by default strace will attach only to the specified process, to force it to trace all child processes we can use -ff option, this option is used with -o option (output file). With this options specified strace will automatically attach to all child processes and generate file for each of them in form of <outputfile>.<PID>

Now as root:

[root]# cd /tmp
[root]# strace -p 19494 -ff -o fwacher.trc

From sqlplus session now time to execute dbms_isched.file_watch_job procedure. After it is executed exit from sqlplus and check trace files generated by strace.

In my case three files was generated, number of files is the our client process plus the number of processes forked by it, this number can be different, it will depend on which stage your issue is, i.e. how far your execution chain can go before fail. Files generated was:

fwacher.trc.19494
fwacher.trc.19946
fwacher.trc.19951

Lets start looking from fwacher.trc.19494 because it is our main client process. Try to analyze the system calls and understand how the process works, you can find description for each call in the man pages. For example getrusage call, to know what is it, execute “man getrusage” and you will get full description of it.

Interesting lines for me in this trace file starts from accessing $ORACLE_HOME/bin/jssu file

access("/u01/app/oracle/product/11.2.0.3/db01/bin/jssu", X_OK) = 0
pipe([11, 13])                          = 0
pipe([14, 15])                          = 0
pipe([16, 17])                          = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6edf689ec0) = 19946
close(15) = 0
close(11) = 0
close(17) = 0
write(13, "etl_user", 13) = 13
write(13, "\n", 1) = 1
write(13, "mytmppasswd", 11) = 11
write(13, "\n", 1) = 1
write(13, "1\n", 2) = 2
write(13, "/u01/app/oracle/product/11.2.0.3"..., 54) = 54
write(13, "\n", 1)                      = 1
write(13, "5\n", 2)                                  = 2
write(13, "-cp\n", 4)                       = 4
write(13, "/u01/app/oracle/product/11.2.0.3"..., 61)                       = 61
write(13, "\n", 1)                       = 1
write(13, "oracle.scheduler.agent.Execution"..., 38)         = 38
write(13, "check_file\n", 11) = 11
write(13, "/u01/file/in"..., 40) = 40
write(13, "\n", 1)      = 1
close(13)      = 0
read(16, "", 4096) = 0
close(16)                               = 0
read(14, "!@#--!@#10#@!--#@!", 200)     = 18
read(14, "", 182)                       = 0
--- SIGCHLD (Child exited) @ 0 (0) ---

Very interesting peace of trace, what we can see from this calls? Our client process creates child process  to execute jssu utility, OS call clone which returns the PID 19946 of the new child, then passes to it some parameters. First of all it is username of the credential we used which is etl_user and next its password!!! Yes, it is the password of our credential, and it is passed in plain text format, now everybody knows that my password is “mytmppasswd” 🙂 Yes, I think it is security issue. But please do not use it for hacking! 🙂

Next we see how it passes to jssu ORALCE_HOME, “-cp” option, Java class name oracle.scheduler.agent.Execution, “check_file” option and our directory /u01/file/in. As we can understand it asks to use specified java class to check files in /u01/file/in directory. Then this child with process ID 19946 exits with strange message “”!@#–!@#10#@!–#@!”. Something happened inside the child process 19946. Lets analyze fwacher.trc.19946 trace file.

I will list only some open calls from process 19946 trace file, we will know which libraries and files it tries to access:

open("/etc/ld.so.cache", O_RDONLY) = 5
open("/lib64/libcrypt.so.1", O_RDONLY)  = 5
open("/lib64/libdl.so.2", O_RDONLY)     = 5
open("/lib64/libm.so.6", O_RDONLY)      = 5
open("/lib64/libpthread.so.0", O_RDONLY) = 5
open("/lib64/libnsl.so.1", O_RDONLY)    = 5
open("/lib64/libc.so.6", O_RDONLY)      = 5
open("/etc/ld.so.cache", O_RDONLY)      = 5
open("/lib64/tls/x86_64/libpam.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib64/tls/libpam.so", O_RDONLY)  = -1 ENOENT (No such file or directory)
open("/lib64/x86_64/libpam.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib64/libpam.so", O_RDONLY)      = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/x86_64/libpam.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/libpam.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/x86_64/libpam.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/libpam.so", O_RDONLY)  = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 5
open("/lib64/libpam.so.0", O_RDONLY)    = 5
open("/lib64/libaudit.so.0", O_RDONLY)  = 5
open("/etc/pam.d/su", O_RDONLY)         = 5
open("/lib64/security/pam_rootok.so", O_RDONLY) = 6
open("/etc/ld.so.cache", O_RDONLY)      = 6
open("/lib64/libselinux.so.1", O_RDONLY) = 6
open("/lib64/libsepol.so.1", O_RDONLY)  = 6
open("/etc/selinux/config", O_RDONLY)   = 6
open("/proc/mounts", O_RDONLY)          = 6
open("/etc/pam.d/system-auth", O_RDONLY) = 6
open("/lib64/security/pam_env.so", O_RDONLY) = 8
open("/lib64/security/pam_tally2.so", O_RDONLY) = 8
open("/lib64/security/pam_lsass.so", O_RDONLY) = 8
open("/opt/pbis/lib64/liblsaauth.so.0", O_RDONLY) = 8
open("/opt/pbis/lib64/liblsaclient.so.0", O_RDONLY) = 8
open("/opt/pbis/lib64/liblwmsg_nothr.so.0", O_RDONLY) = 8
open("/opt/pbis/lib64/liblwadvapi_nothr.so.0", O_RDONLY) = 8
open("/opt/pbis/lib64/liblsacommon.so.0", O_RDONLY) = 8
open("/etc/ld.so.cache", O_RDONLY)      = 8
open("/lib64/librt.so.1", O_RDONLY)     = 8
open("/opt/pbis/lib64/liblwbase_nothr.so.0", O_RDONLY) = 8
open("/opt/pbis/lib64/libuuid.so.0", O_RDONLY) = 8
open("/lib64/security/pam_unix.so", O_RDONLY) = 8
open("/etc/ld.so.cache", O_RDONLY)      = 8
open("/usr/lib64/libcrack.so.2", O_RDONLY) = 8
open("/lib64/security/pam_succeed_if.so", O_RDONLY) = 8
open("/lib64/security/pam_deny.so", O_RDONLY) = 8
open("/etc/pam.d/system-auth", O_RDONLY) = 6
open("/lib64/security/pam_permit.so", O_RDONLY) = 8
open("/etc/pam.d/system-auth", O_RDONLY) = 6
open("/lib64/security/pam_cracklib.so", O_RDONLY) = 8
open("/etc/pam.d/system-auth", O_RDONLY) = 6
open("/lib64/security/pam_keyinit.so", O_RDONLY) = 8
open("/lib64/security/pam_limits.so", O_RDONLY) = 8
open("/lib64/security/pam_xauth.so", O_RDONLY) = 6
open("/etc/pam.d/other", O_RDONLY)      = 5
open("/etc/nsswitch.conf", O_RDONLY)    = 5
open("/etc/ld.so.cache", O_RDONLY)      = 5
open("/lib64/libnss_files.so.2", O_RDONLY) = 5
open("/etc/passwd", O_RDONLY)           = 5
open("/etc/shadow", O_RDONLY)           = 5
open("/var/log/tallylog", O_RDWR)       = 5
open("/etc/pbis/user-ignore", O_RDONLY) = 5
open("/etc/pbis/group-ignore", O_RDONLY) = 5
..
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ac64aaab190) = 19951

We can see that it tries to find libpam.so in all possible locations and finally it finds it under /lib64/libpam.so.0, if file found and opened successfully then open call returns file descriptor if not then error message. Then it opens different PAM libraries and configuration files as well as /etc/passwd and /etc/shadow to verify user. Finally after all verifications passed it forks new child with PID 19951 to do next actions. Now we will open fwacher.trc.19951 trace file.

It again makes some small checks and finally:

setgid(338)                             = 0
setuid(338)                             = 0
execve("/u01/app/oracle/product/11.2.0.3/db01/jdk/jre/bin/java", ["/u01/app/oracle/product/11.2.0.3", "-cp", "/u01/app/oracle/product/11.2.0.3", "oracle.scheduler.agent.Execution", "check_file", "/dbasel/knud_stage/COBA/IN/testf"], [/* 0 vars *
/]) = -1 EACCES (Permission denied)
write(2, "!@#--!@#10#@!--#@!", 18)      = 18

Here it is! 338 is the id of the etl_user and etl_group. It sets userid and groupid successfully, return code is 0. Then it calls execve system call which executes a program /u01/app/oracle/product/11.2.0.3/db01/jdk/jre/bin/java with already known to us parameters! But execution fails with return code -1 EACCES (Permission denied). It is our problem.

Login with etl_user and try to execute /u01/app/oracle/product/11.2.0.3/db01/jdk/jre/bin/java returned same error message! Problem was that etl_user didn’t have execute permission on this java executable, problem was solved by granting required group privileges!

But there is also other possibilities of EACCES return code, you can find them from man page of the execve call:

EACCES Search permission is denied on a component of the path prefix of filename or the name of a script interpreter.  (See also path_resolution(2).)
EACCES The file or a script interpreter is not a regular file.
EACCES Execute permission is denied for the file or a script or ELF interpreter.
EACCES The file system is mounted noexec.

As the conclusion we can say that for successful Oracle FILE_WATCHER working the user used in scheduler credential must have execute privilege on $ORACLE_HOME/jdk/jre/bin/java executable!

Sure your issue may be in different place, but using above mentioned troubleshooting you can find it in almost 95% of cases!

 

.


(c) Aychin Gasimov, 08/2014, Munich, Germany