ERROR: unrecoverable error ORA-29701 raised in ASM I/O path when using Golden Gate replication process
February 19, 2011 by Nakinov · 5 Comments
Yesterday my DBA colleagues had a very challenging day, although I was only a “remote observer” since I couldn’t be on site. I have decided to blog about the situation because we couldn’t find any similar information on either Google or Metalink, so you might find it useful. Thanks to my colleague and friend Baki Sahin from Turkey, who had a great hunch and found a workaround.
We have a system running on HP-UX 11.31i with the latest patch set and Oracle 11g R2 (11.2.0.2) running as a single instance using ASM. Everything had been installed and configured almost 2 months before the problem occurred (separate grid user for CSS, separate oracle user for the Oracle binaries, additional OS groups, etc.). The OS was also inspected, and the kernel parameters were not even 10% utilized. UAT and performance tests were ongoing, along with various administrative tasks on the database and testing of 11gR2 new features, and everything was fine.
However, when we started the Golden Gate replication process, the alert log of the database instance began to show the following ORA error: ERROR: unrecoverable error ORA-29701 raised in ASM I/O path, and Golden Gate processes started to be terminated. What is more, creating new sessions from other modules was very slow, if not impossible, and after a while the instance hung.
ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 6507
Fri Feb 18 13:51:37 2011
Thread 1 cannot allocate new log, sequence 4824
Private strand flush not complete
  Current log# 7 seq# 4823 mem# 0: +REDO00/redo10a.rdo
  Current log# 7 seq# 4823 mem# 1: +REDO01/redo10b.rdo
Thread 1 advanced to log sequence 4824 (LGWR switch)
  Current log# 8 seq# 4824 mem# 0: +REDO00/redo08a.rdo
  Current log# 8 seq# 4824 mem# 1: +REDO01/redo08b.rdo
Fri Feb 18 13:52:34 2011
ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 6509
Fri Feb 18 13:53:37 2011
ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 6506
Fri Feb 18 13:54:28 2011
Thread 1 cannot allocate new log, sequence 4825
Private strand flush not complete
  Current log# 8 seq# 4824 mem# 0: +REDO00/redo08a.rdo
  Current log# 8 seq# 4824 mem# 1: +REDO01/redo08b.rdo
Thread 1 advanced to log sequence 4825 (LGWR switch)
  Current log# 1 seq# 4825 mem# 0: +REDO00/redo01a.rdo
  Current log# 1 seq# 4825 mem# 1: +REDO01/redo01b.rdo
Fri Feb 18 13:54:40 2011
ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 26844
Fri Feb 18 13:55:43 2011
ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 6508
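To match the terminated process ids against the Golden Gate processes, it can help to pull them out of the alert log together with the nearest preceding timestamp. The following is only a rough sketch, not something we ran on site: the function name `terminated_processes` is made up for illustration, and it assumes the alert log uses the timestamp layout shown above and that the Python locale has English day and month names.

```python
import re
from datetime import datetime

# Matches alert-log timestamp lines such as "Fri Feb 18 13:51:37 2011".
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")
# Matches the error line and captures the terminated OS process id.
ERR_RE = re.compile(r"ORA-29701 raised in ASM I/O path; terminating process (\d+)")


def terminated_processes(lines):
    """Return (timestamp, pid) pairs for every ORA-29701 termination.

    The timestamp is the most recent timestamp line seen before the error,
    or None if the error appears before any timestamp in the input.
    """
    current_ts = None
    hits = []
    for raw in lines:
        line = raw.strip()
        if TS_RE.match(line):
            current_ts = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        match = ERR_RE.search(line)
        if match:
            hits.append((current_ts, int(match.group(1))))
    return hits
```

Feeding it the lines of the alert log (for example `terminated_processes(open(path))`, where `path` is wherever your alert log lives) gives a list that can be compared with the process ids from `ps`.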
Okay, we started the witch hunt by checking the permissions of the underlying ASM disks and the user and group permissions, because in most cases Oracle complains with ORA-29701 when communication with the cluster manager cannot be established. This was not the case here: CSS was up and running, the CSS alert log was clean, and crs_stat showed that all services were up and running (except ONS, which we don’t use).
[grid@somehost:+ASM]/home/grid$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....TA00.dg ora....up.type ONLINE    ONLINE    somehost
ora....DO00.dg ora....up.type ONLINE    ONLINE    somehost
ora....DO01.dg ora....up.type ONLINE    ONLINE    somehost
ora....PR.lsnr ora....er.type ONLINE    ONLINE    somehost
ora....SM.lsnr ora....er.type ONLINE    ONLINE    somehost
ora.asm        ora.asm.type   ONLINE    ONLINE    somehost
ora.cssd       ora.cssd.type  ONLINE    ONLINE    somehost
ora.diskmon    ora....on.type ONLINE    ONLINE    somehost
ora.evmd       ora.evm.type   ONLINE    ONLINE    somehost
ora.ons        ora.ons.type   OFFLINE   OFFLINE
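If you want to check this kind of output in a script rather than by eye, a small parser is enough. This is a sketch under assumptions: `offline_resources` is a hypothetical helper name, and it expects the whitespace-separated Name / Type / Target / State / Host columns that `crs_stat -t` printed for us, with the table starting after the dashed line.

```python
def offline_resources(crs_stat_output):
    """Return the names of resources whose State column is not ONLINE."""
    offline = []
    in_table = False
    for line in crs_stat_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            # The dashed separator marks the start of the resource rows.
            in_table = True
            continue
        if not in_table or not stripped:
            continue
        fields = stripped.split()
        if len(fields) < 4:
            # Not a full resource row; skip it rather than guess.
            continue
        name, _type, _target, state = fields[:4]
        if state != "ONLINE":
            offline.append(name)
    return offline
```

For the listing above it should report only ora.ons, which matched what we saw.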
Running out of ideas, we completely shut down the database and CSS and rebooted the machine. After that we started everything up and played around with switching the redo logs, creating very big tables and running parallel queries against huge tables, and everything worked normally. But as soon as we started Golden Gate replication, we got the same error again: ERROR: unrecoverable error ORA-29701 raised in ASM I/O path.
We found two very useful blog posts from Pythian Group (Don Seiler) and Surachart Opun regarding hidden Oracle directories with files related to CSS (I am still puzzled why Oracle did that). Of course we tried all of that, but we were back to square one.
/var/tmp/.oracle hidden directory
Beware the /var/tmp/.oracle Hidden Directory!
Workaround
When Baki disabled auditing (AUDIT_TRAIL = none), the Golden Gate replication process worked perfectly and ORA-29701 disappeared from the alert log. Everybody is very busy with other tasks, so our troubleshooting window is severely limited. Once we find more time, we will gather more information and try to open an SR with Oracle.
We had the exact same problem. Oracle has filed a bug on our behalf (Bug 12576651: ORA-29701 FROM 11.2 DATABASE FOR NON-DBA ACCOUNTS), but based on the strace we found a workaround for our situation.
The issue was that the user who received the error didn’t have osdba (and we couldn’t grant it based on ic controls).
The strace showed errors opening a file in $GRID_HOME/auth/css/
We changed the permissions as follows (note that I am not using my real users for security reasons, and your environment may vary).
From:
root:osdba 750
/auth grid_owner:osdba 770
/auth/css grid_owner:osdba 770
/auth/css/ grid_owner:osdba 770
To:
root:osdba 755
/auth grid_owner:osdba 755
/auth/css grid_owner:osdba 755
/auth/css/ grid_owner:osdba 775
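Before changing anything, it may be worth listing which of these directories actually block a user outside the group. The snippet below is a minimal sketch, not an Oracle tool: `dirs_blocking_others` is a hypothetical name, the grid home path is whatever your environment uses, and by default it only looks at the grid home, auth and auth/css (any deeper directory has to be passed in via `subdirs`).

```python
import os
import stat


def dirs_blocking_others(grid_home, subdirs=("auth", os.path.join("auth", "css"))):
    """Return (path, octal_mode) for each directory where the 'other'
    permission bits lack read or execute, so a user outside the owning
    group cannot list or traverse it."""
    needed = stat.S_IROTH | stat.S_IXOTH
    blocked = []
    paths = [grid_home] + [os.path.join(grid_home, s) for s in subdirs]
    for path in paths:
        # Keep only the permission bits (e.g. 0o750), dropping the file type.
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & needed != needed:
            blocked.append((path, oct(mode)))
    return blocked
```

With the "From" modes above, all three directories would be reported; with the "To" modes the list should come back empty. It only reads the modes and changes nothing.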
Thank you LisaG; I will forward your comment to my colleagues just in case they encounter the same error again.
This was perfect. Changing permissions on these directories resolved this issue on my system. Thank you!!
We had the same problem. The issue looks like it happens only for a user without osdba. Oracle opened a bug, but based on the strace the issue was writing to $GRID_HOME/auth/css/. We changed the permissions of $GRID_HOME to 755, $GRID_HOME/auth, $GRID_HOME/auth/css, $GRID_HOME/auth/css/ all to 775 and it worked for our test case.
man, ran into this today. and this helped us sort out our issue. thanks.
but for us, it was even more mundane, it was connecting to the database through sqlplus, and running an OWB block to run an ETL job. some jobs worked, some failed.
and the permissions “fix” solved the problem. for us, we’re running 11.2.0.3.4. and on the directories noted above, i noticed the following:
drwxr-x--- 71 root oinstall 4096 Feb 5 13:21 . <----- ????!!!!!!!
drwxrwx--- 6 grid oinstall 4096 Feb 5 10:23 auth
drwxrwx--- 6 grid oinstall 4096 Feb 5 10:23 .
drwxr-x--- 71 root oinstall 4096 Feb 5 13:21 ..
drwxrwx--- 3 grid oinstall 4096 Feb 5 10:23 crs
drwxrwx--- 3 grid oinstall 4096 Feb 5 10:23 css
drwxrwx--- 3 grid oinstall 4096 Feb 5 10:23 evm
drwxrwx--- 3 grid oinstall 4096 Feb 5 10:23 ohasd
and that's based on a "clean" install. so the root scripts that help complete the setup are knackered again, it seems. very frustrating. after resetting the permissions, the ETL jobs succeeded, and the error hasn't happened since (touch wood).