Reverse Exadata Elastic Configuration using elastic config marker

As you may already know, the elastic configuration process allows initial IP addresses to be assigned to database servers and storage cells, regardless of the exact customer configuration ordered. The customer-specific configuration can then be applied to the nodes.

Sometimes you can make a mistake and end up assigning the wrong IPs or hostnames to Exadata nodes. You can use the Exadata elastic config marker to revert the applied elastic configuration.

Problem : Wrong IPs applied to Exadata nodes

[root@exdbadm01 linux-x64]# ibhosts
Ca : 0x0010e00001d4f7a8 ports 2 "exadbadm02 S 192.168.10.3,192.168.10.4 HCA-1"
Ca : 0x0010e00001d691f0 ports 2 "exaceladm03 C 192.168.10.9,192.168.10.10 HCA-1"
Ca : 0x0010e00001d68e30 ports 2 "exaceladm01 C 192.168.10.5,192.168.10.6 HCA-1"
Ca : 0x0010e00001d68cd0 ports 2 "exaceladm02 C 192.168.10.7,192.168.10.8 HCA-1"
Ca : 0x0010e00001d60e00 ports 2 "exadbadm01 S 192.168.10.1,192.168.10.2 HCA-1"

Solution : Create the .elasticConfig marker file in the root directory (/) on all Exadata nodes. Please note that all IPs will revert to the factory defaults.

Create the elastic config marker on all nodes and reboot:

[root@node1 /]# cd /
[root@node1 /]# touch .elasticConfig
[root@node1 /]# reboot

Broadcast message from root@exdbadm01.itrans.int
(/dev/pts/0) at 18:38 ...

The system is going down for reboot NOW!
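
If you are reverting more than one or two nodes, you could create the marker everywhere in a single pass with dcli, assuming root SSH equivalency is already in place. A minimal sketch (all_group is a hypothetical file listing every database server and cell):

# create the marker file on every node listed in all_group
dcli -g all_group -l root "touch /.elasticConfig"
# reboot all nodes so they come back up with factory default addresses
dcli -g all_group -l root "reboot"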

Log in again using the factory default IPs:

[root@node8 linux-x64]# ibhosts
Ca : 0x0010e00001d4f7a8 ports 2 "node10 elasticNode 172.16.2.46,172.16.2.46 ETH0"
Ca : 0x0010e00001d691f0 ports 2 "node4 elasticNode 172.16.2.40,172.16.2.40 ETH0"
Ca : 0x0010e00001d68e30 ports 2 "node2 elasticNode 172.16.2.38,172.16.2.38 ETH0"
Ca : 0x0010e00001d68cd0 ports 2 "node1 elasticNode 172.16.2.37,172.16.2.37 ETH0"
Ca : 0x0010e00001d60e00 ports 2 "node8 elasticNode 172.16.2.44,172.16.2.44 ETH0"

Drop cell disks before converting to 1/8th rack

Hi! Today I would like to share my experience with an issue I faced during the deployment of an Exadata Eighth Rack. The following error came up while executing Exadata deployment step 2 (Executing Update Nodes for Eighth Rack). It occurred because the storage cells were shipped with default cell disks, and those cell disks have to be dropped before the Eighth Rack deployment can continue.

Issue : 

[root@node1 linux-x64]# ./install.sh -cf Intellitrans-ex.xml -s 2
Initializing
Executing Update Nodes for Eighth Rack
Error: Storage cell [cellnode1, cellnode2, cellnode3] contains cell disks. Cannot setup 1/8th rack. Drop cell disks before converting to 1/8th rack rack.
Collecting diagnostics...
Errors occurred. Send /u01/onecommand/linux-x64/WorkDir/Diag-180626_150747.zip to Oracle to receive assistance.

ERROR:
Error running oracle.onecommand.deploy.operatingSystem.ResourceControl method setupResourceControl
Error: Errors occured...
Errors occured, exiting...

Reason : Cell disks already exist. You can validate this by logging in to each storage cell.

[root@cellnode1 ~]# cellcli
CellCLI: Release 18.1.4.0.0 - Production on Tue Jun 26 16:13:33 EDT 2018

Copyright (c) 2007, 2016, Oracle and/or its affiliates. All rights reserved.

CellCLI> list celldisk
FD_00_ru02 normal
FD_01_ru02 normal
FD_02_ru02 normal
FD_03_ru02 normal

Solution : drop celldisk all force;

[root@cellnode1 ~]# cellcli
CellCLI: Release 18.1.4.0.0 - Production on Tue Jun 26 16:18:11 EDT 2018

Copyright (c) 2007, 2016, Oracle and/or its affiliates. All rights reserved.

CellCLI> list celldisk
FD_00_ru02 normal
FD_01_ru02 normal
FD_02_ru02 normal
FD_03_ru02 normal

CellCLI> drop celldisk all force;

CellDisk FD_00_ru02 successfully dropped
CellDisk FD_01_ru02 successfully dropped
CellDisk FD_02_ru02 successfully dropped
CellDisk FD_03_ru02 successfully dropped

CellCLI> list celldisk
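
If every cell in the rack is in this state, dcli can drop the cell disks on all cells in one pass instead of logging in to each one. A sketch (cell_group is a hypothetical file listing the storage cells, and root SSH equivalency is assumed):

# drop the factory-created cell disks on every storage cell
dcli -g cell_group -l root "cellcli -e drop celldisk all force"
# confirm no cell disks remain before re-running install.sh step 2
dcli -g cell_group -l root "cellcli -e list celldisk"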

 

Reducing Exadata Active Cores on Compute nodes

Recently, I had an opportunity to deploy an Eighth Rack Exadata Machine. As you might already know, this requires reducing the active CPU cores on both the compute nodes and the storage nodes. As per the Oracle documentation, this can all be done during Exadata deployment; make sure you reduce the active CPU cores in the Capacity on Demand section of the OEDA process. In my case, the Exadata deployment (OEDA) didn't reduce the active cores on the compute nodes, and I had to reduce them manually on both DB nodes.

Problem Description : You can clearly see below that the Exadata deployment process skipped the compute nodes and only reduced the CPU cores on the storage cells.

[root@node1 linux-x64]# ./install.sh -cf Intellitrans-ex.xml -s 2 
Initializing 
Executing Update Nodes for Eighth Rack 

Skip Eighth rack configuration in compute node node1 

running setup on: celadm01 
running setup on: celadm03 
running setup on: celadm02 
cellnode3 total CPU cores set from 20 to 10 
cellnode2 needs total CPU cores set from 20 to 10 
cellnode31 needs total CPU cores set from 20 to 10 

Skip Eighth rack configuration in compute node node2 

Successfully completed execution of step Update Nodes for Eighth Rack [elapsed Time [Elapsed = 36051 mS [0.0 minutes] Fri Jul 13 20:31:36 EDT 2018]]
 
[root@node1 linux-x64]# dbmcli -e LIST DBSERVER attributes coreCount 
24/24 

Solution : alter dbserver pendingCoreCount=10 force (repeat on all DB nodes)

[root@node1 linux-x64]# dbmcli -e alter dbserver pendingCoreCount=10 force

Note : Reboot the Exadata compute nodes for the new core count to take effect.

[root@node1 linux-x64]# dbmcli -e LIST DBSERVER attributes coreCount

         10/24
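
To apply and verify the same change on every compute node in one pass, a dcli sketch (dbs_group is a hypothetical file listing the DB nodes; each node still needs the reboot noted above):

# set the pending core count on all DB nodes
dcli -g dbs_group -l root "dbmcli -e alter dbserver pendingCoreCount=10 force"
# after the reboots, confirm the active core count on each node
dcli -g dbs_group -l root "dbmcli -e list dbserver attributes coreCount"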

 

The vm.min_free_kbytes configuration is not set as recommended

I saw the following issue during an Exachk review of one of my Exadata deployments. After working with Oracle Support and the deployment team, it was declared a bug that will be fixed in a future Exachk release. I would still recommend opening an SR with Oracle Support if this issue is reported in your Exachk report.

Problem Description 
--------------------------------------------------- 
CRITICAL => The vm.min_free_kbytes configuration is not set as recommended 

DATA FROM EXDBADM01 FOR VERIFY THE VM.MIN_FREE_KBYTES CONFIGURATION 

FAILURE: vm.min_free_kbytes is not set as recommended: 
socket count: 1 
minimum size: -1 
in sysctl.conf: 524288 
in active memory: 524288 

Status on node2: 
CRITICAL => The vm.min_free_kbytes configuration is not set as recommended 

DATA FROM node2 FOR VERIFY THE VM.MIN_FREE_KBYTES CONFIGURATION 

FAILURE: vm.min_free_kbytes is not set as recommended: 
socket count: 1 
minimum size: -1 
in sysctl.conf: 524288 
in active memory: 524288 

Error Codes 
--------------------------------------------------- 
FAILURE: vm.min_free_kbytes is not set as recommended:
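
While the SR is open, you can at least confirm what each node is actually running with. A quick sketch (dbs_group is a hypothetical file listing the compute nodes):

# value currently active in the kernel on the local node
sysctl vm.min_free_kbytes
# value persisted for the next boot
grep vm.min_free_kbytes /etc/sysctl.conf
# the same check across all compute nodes
dcli -g dbs_group -l root "sysctl vm.min_free_kbytes"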

 

Clone Oracle Database Home on Exadata Machine

I was asked to clone a database home during one of my Exadata deployment projects. We wanted an additional database home for patching and isolation purposes, but that is a topic for a different blog. You can use the following guidelines to clone a database home on an Exadata machine.

Note : These steps need to be performed on all DB nodes.

Step 1 : Create a directory or a new mount point for the database home. It is best to have a separate mount point for each database home on an Exadata machine.

mkdir -p /u01/app/oracle/product/11.2.0.4/dbhome_2

Step 2 : Copy all files from the existing home to the new database home (dbhome_2) as the root user.

[root@exdbadm01 dbhome_1]# cp * -rp /u01/app/oracle/product/11.2.0.4/dbhome_2/
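
Note that a plain cp with a shell glob like this skips hidden files. If you want an exact copy that also preserves dotfiles and hard links, a tar pipe is a common alternative (a sketch, not the exact command I used):

# copy dbhome_1 into dbhome_2 preserving permissions, links, and hidden files
cd /u01/app/oracle/product/11.2.0.4/dbhome_1
tar cf - . | ( cd /u01/app/oracle/product/11.2.0.4/dbhome_2 && tar xpf - )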

Step 3 : Relink Oracle with RDS (required only on Exadata machines).

Set the ORACLE_HOME environment variable to the new home, then run:
cd $ORACLE_HOME/rdbms/lib
make -f $ORACLE_HOME/rdbms/lib/ins_rdbms.mk ipc_rds ioracle

Step 4 : Clone and relink the new DB home using the Oracle OUI installer in silent mode.

./runInstaller -silent -clone ORACLE_BASE="/u01/app/oracle" ORACLE_HOME="/u01/app/oracle/product/11.2.0.4/dbhome_2" ORACLE_HOME_NAME="OraDb11g_home2"

export ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/dbhome_2

cd $ORACLE_HOME/oui/bin

[oracle@node1 bin]$ ./runInstaller -silent -clone ORACLE_BASE="/u01/app/oracle" ORACLE_HOME="/u01/app/oracle/product/11.2.0.4/dbhome_2" ORACLE_HOME_NAME="OraDb11g_home2"
Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB. Actual 24575 MB Passed
Preparing to launch Oracle Universal Installer from /tmp/OraInstall2018-06-27_05-04-52PM. Please wait ...[oracle@node1 bin]$ Oracle Universal Installer, Version 11.2.0.4.0 Production
Copyright (C) 1999, 2013, Oracle. All rights reserved.

You can find the log of this install session at:
/u01/app/oraInventory/logs/cloneActions2018-06-27_05-04-52PM.log
.................................................................................................... 100% Done.

Installation in progress (Wednesday, June 27, 2018 5:04:57 PM EDT)
............................................................................... 79% Done.
Install successful

Linking in progress (Wednesday, June 27, 2018 5:05:00 PM EDT)
Link successful

Setup in progress (Wednesday, June 27, 2018 5:05:17 PM EDT)
Setup successful

End of install phases.(Wednesday, June 27, 2018 5:05:38 PM EDT)
WARNING:
The following configuration scripts need to be executed as the "root" user.
/u01/app/oracle/product/11.2.0.4/dbhome_2/root.sh
To execute the configuration scripts:
1. Open a terminal window
2. Log in as "root"
3. Run the scripts

The cloning of OraDb11g_home2 was successful.
Please check '/u01/app/oraInventory/logs/cloneActions2018-06-27_05-04-52PM.log' for more details.
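
After the clone completes on each node, run the root.sh script the installer flagged and confirm the new home is registered in the central inventory. A quick check (ContentsXML/inventory.xml is the usual location under the oraInventory path shown above):

# run the configuration script as root
/u01/app/oracle/product/11.2.0.4/dbhome_2/root.sh
# confirm OraDb11g_home2 / dbhome_2 shows up in the central inventory
grep -i dbhome_2 /u01/app/oraInventory/ContentsXML/inventory.xml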

 

Deconfigure/Reconfigure Exadata node from CRS

Problem Description:

A few days back I started working on an Exadata GI upgrade from 12.1 to 12.2 and ran into a problem upgrading node 1. We had to cancel the upgrade and start rolling back CRS to 12.1. This is where we ran into the following problem.

First, we tried to start CRS after restoring the old Grid home from backup, but it appears the upgrade process had deconfigured the 12.1 CRS home on node 1. We could not start the old 12.1 CRS stack from node 1.

[oracle@node1 bin]$ ./crsctl start crs 
CRS-4047: No Oracle Clusterware components configured. 
CRS-4000: Command Start failed, or completed with errors.

We could still see other nodes in the cluster but not node 1.

[oracle@node2 ~]$ olsnodes -n -t 
node2 2 Unpinned 
node3 3 Unpinned

Solution :

We wanted to roll node 1 back to its previous state so we could try the upgrade again. We solved this problem by reconfiguring 12.1 CRS on node 1.

First, make sure the 12.1 CRS stack has been deconfigured properly:

/u01/app/12.1.0.2/grid/crs/install/rootcrs.pl -deconfig -force

Then run root.sh from the 12.1 CRS home:

/u01/app/12.1.0.2/grid/root.sh

Performing root user operation.

The following environment variables are set as:
ORACLE_OWNER= oracle
ORACLE_HOME= /u01/app/12.1.0.2/grid
Copying dbhome to /usr/local/bin ...
Copying oraenv to /usr/local/bin ...
Copying coraenv to /usr/local/bin ...

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Relinking oracle with rac_on option
Using configuration parameter file: /u01/app/12.1.0.2/grid/crs/install/crsconfig_params
2018/07/20 23:16:17 CLSRSC-4001: Installing Oracle Trace File Analyzer (TFA) Collector.

2018/07/20 23:16:17 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.

2018/07/20 23:16:18 CLSRSC-363: User ignored prerequisites during installation

CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'dm01dbadm01'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'node1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2673: Attempting to stop 'ora.evmd' on 'node1'
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'node1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'node1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'node1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'node1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'node1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'node1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'node1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'node1' succeeded
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'nod01' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'node1'
CRS-2677: Stop of 'ora.cssd' on 'node1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'dm01dbadm01'
CRS-2673: Attempting to stop 'ora.diskmon' on 'node1'
CRS-2677: Stop of 'ora.gipcd' on 'node1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'node1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'node1'
CRS-2672: Attempting to start 'ora.evmd' on 'node1'
CRS-2676: Start of 'ora.mdnsd' on 'node1' succeeded
CRS-2676: Start of 'ora.evmd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'node1'
CRS-2676: Start of 'ora.gpnpd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'node1'
CRS-2676: Start of 'ora.gipcd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2672: Attempting to start 'ora.diskmon' on 'node1'
CRS-2676: Start of 'ora.diskmon' on 'node1' succeeded
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'node1'
CRS-2681: Clean of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'node1'
CRS-2676: Start of 'ora.storage' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'node1'
CRS-2676: Start of 'ora.crf' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded
CRS-6023: Starting Oracle Cluster Ready Services-managed resources
CRS-6017: Processing resource auto-start for servers: node1
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'node2'
CRS-2672: Attempting to start 'ora.net1.network' on 'node1'
CRS-2676: Start of 'ora.net1.network' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.ons' on 'node1'
CRS-2673: Attempting to stop 'ora.node1.vip' on 'node3'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'node2' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'node2'
CRS-2677: Stop of 'ora.node1.vip' on 'node3' succeeded
CRS-2672: Attempting to start 'ora.node1.vip' on 'node1'
CRS-2677: Stop of 'ora.scan1.vip' on 'node2' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'node1'
CRS-2676: Start of 'ora.ons' on 'node1' succeeded
CRS-2676: Start of 'ora.node1.vip' on 'node1' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'node1'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'node1' succeeded
CRS-6016: Resource auto-start has completed for server node1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.
2018/07/20 23:18:19 CLSRSC-343: Successfully started Oracle Clusterware stack

clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 12c Release 1.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
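
Once root.sh completes, it is worth confirming that node 1 has really rejoined the cluster before re-attempting the upgrade. A minimal check:

# verify the full Clusterware stack is up on node 1
/u01/app/12.1.0.2/grid/bin/crsctl check crs
# confirm node 1 is listed again from any cluster member
/u01/app/12.1.0.2/grid/bin/olsnodes -n -t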

 

How to clear Exadata Storage Alerts

There are times when you need to clear Exadata storage alerts. It is very important that you investigate and resolve the underlying issue before clearing any storage alerts. Additionally, you should make a note of each alert before you clear it. You can follow the steps below to clear storage alerts on one or all storage cells.

Step 1 : Log in to the CellCLI utility

[root@cell01 ~]# cellcli
CellCLI: Release 18.1.4.0.0 - Production on Wed Jun 27 19:32:28 EDT 2018

Copyright (c) 2007, 2016, Oracle and/or its affiliates. All rights reserved.


Step 2 : Validate cell configuration 

CellCLI> ALTER CELL VALIDATE CONFIGURATION ;
Cell exceladm01 successfully altered

Step 3 : List Exadata Storage Alerts 

CellCLI> list alerthistory
1 2018-06-13T11:09:48-04:00 critical "ORA-00700: soft internal error, arguments: [main_21], [11], [Not enough open file descriptors], [], [], [], [], [], [], [], [], []"
2 2018-06-13T11:35:06-04:00 critical "RS-700 [No IP found in Exadata config file] [Check cellinit.ora] [] [] [] [] [] [] [] [] [] []"
3_1 2018-06-25T13:26:17-04:00 critical "Configuration check discovered the following problems: Verify network configuration: 

3_2 2018-06-26T13:25:17-04:00 clear "The configuration check was successful."

Step 4 : Drop all Storage alerts 

CellCLI> drop alerthistory all
Alert 1 successfully dropped
Alert 2 successfully dropped

Step 5 : List storage alerts to validate they are gone

CellCLI> list alerthistory

CellCLI> exit
quitting

Step 6 : Repeat the above steps on all storage cells (or use the dcli sketch below)
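
Alternatively, dcli can run the same CellCLI commands against every cell in one pass. A sketch (cell_group is a hypothetical file listing the storage cells):

# review (and record) the alerts on every cell first
dcli -g cell_group -l root "cellcli -e list alerthistory"
# then clear them everywhere
dcli -g cell_group -l root "cellcli -e drop alerthistory all"
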

Enabling SSH User Equivalency on Exadata Machine

Passwordless SSH configuration is a mandatory installation requirement. SSH is used during installation to configure cluster member nodes, and SSH is used after installation by configuration assistants, Oracle Enterprise Manager, OPatch, and other features.

In the examples that follow, I used the root user, but the same can be done for the oracle or grid user.

Step 1 : Create an all_group file listing all nodes

[root@node01 oracle.SupportTools]# pwd
/opt/oracle.SupportTools

[root@node01 oracle.SupportTools]# cat all_group
node01
node02
cell01
cell02
cell03

Step 2 : Generate ssh keys

[root@node01 oracle.SupportTools]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
e1:51:4b:ba:7c:c3:48:e8:e9:5f:2b:f4:3c:11:ea:65 
root@node1
The key's randomart image is:
+--[ RSA 2048]----+
| o |
| . + . |
| . = . |
| . = *. |
| o S.+. |
| . o.E. |
| .o =.. |
| .o.+. |
| .... |
+-----------------+

Step 3 : Copy ssh keys to all nodes 

[root@node01 oracle.SupportTools]# dcli -g ./all_group -l root -k -s '-o StrictHostKeyChecking=no'
root@node01's password:
root@node02's password:
root@cell01's password:
root@cell02's password:
root@cell03's password:
node01: ssh key added
node02: ssh key added
cell01: ssh key added
cell02: ssh key added
cell03: ssh key added

 

Step 4 : Validate that passwordless SSH is working

[root@node1 oracle.SupportTools]# dcli -g all_group -l root hostname
node01: XXXXXXX
node02: XXXXXXX
cell01: XXXXXXX
cell02: XXXXXXX
cell03: XXXXXXX
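
The same pattern works for the oracle or grid user; those users typically only need equivalency between the compute nodes, so a smaller host list is enough. A sketch (dbs_group is a hypothetical file listing only the DB nodes):

# run as the oracle user on the first compute node
ssh-keygen -t rsa
# push the oracle user's key to the other compute nodes
dcli -g dbs_group -l oracle -k -s '-o StrictHostKeyChecking=no'
# verify passwordless SSH for the oracle user
dcli -g dbs_group -l oracle hostname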

 

 

CLSRSC-180: An error occurred while executing the command '/bin/rpm -qf /sbin/init'

Hello,

I recently encountered the following error during an Exadata GI home upgrade from 12.1.0.1 to 12.2.0.1. We hit it during the execution of the rootupgrade.sh script on node 1.

2018/06/10 02:59:18 CLSRSC-180: An error occurred while executing the command '/bin/rpm -qf /sbin/init' 
Died at /u01/app/12.2.0.1/grid/crs/install/s_crsutils.pm line 2372. 
The command '/u01/app/12.2.0.1/grid/perl/bin/perl -I/u01/app/12.2.0.1/grid/perl/lib -I/u01/app/12.2.0.1/grid/crs/install /u01/app/12.2.0.1/grid/crs/install/rootcrs.pl -upgrade' execution failed 

I investigated further by running the target command manually and got the following errors. The same errors were also logged in the installation log file.

[root@dm01dbadm01 ~]# /bin/rpm -qf /sbin/init 
rpmdb: Thread/process 261710/140405403039488 failed: Thread died in Berkeley DB library 
error: db3 error(-30974) from dbenv->failchk: DB_RUNRECOVERY: Fatal error, run database recovery 
error: cannot open Packages index using db3 - (-30974) 
error: cannot open Packages database in /var/lib/rpm 
rpmdb: Thread/process 261710/140405403039488 failed: Thread died in Berkeley DB library 
error: db3 error(-30974) from dbenv->failchk: DB_RUNRECOVERY: Fatal error, run database recovery 
error: cannot open Packages database in /var/lib/rpm 
rpmdb: Thread/process 261710/140405403039488 failed: Thread died in Berkeley DB library 
error: db3 error(-30974) from dbenv->failchk: DB_RUNRECOVERY: Fatal error, run database recovery 
error: cannot open Packages database in /var/lib/rpm 
file /sbin/init is not owned by any package 
You have new mail in /var/spool/mail/root 

 

The issue was caused by corruption of the OS-level RPM database. We can validate this by running the following command:

# /bin/rpm -qa | more

 

We fixed the RPM database corruption with the following commands so we could continue the Exadata upgrade.

As the root OS user, run the following: 
# rm -f /var/lib/rpm/__* 
# /bin/rpm --rebuilddb 
# echo $?

 

After rebuilding the corrupted RPM database, use the following command to validate it:

# /bin/rpm -qa | more
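
The command that originally failed is also a good final check; once the RPM database is rebuilt, it should report the package that owns /sbin/init instead of a Berkeley DB error:

# should now resolve to the owning package rather than erroring out
/bin/rpm -qf /sbin/init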


ERROR : 192.168.1.1 is responding to ping request

I recently ran into the following error while running the Exadata checkip script during the Exadata deployment process.

Processing section FACTORY
ERROR : 192.168.1.1 is responding to ping request

I checked and realized the above IP is being used by another device on the network. Good news! As per the Oracle Exadata manual, this is a factory default IP used by older Exadata machines, and we can ignore this error.

As per the Oracle Exadata manual (2.5 Default IP Addresses): In earlier releases, Oracle Exadata Database Machine had default IP addresses set at the factory, and the range of IP addresses was 192.168.1.1 to 192.168.1.203.