Aychin's Oracle RDBMS Blog

Only for Advanced level Professionals

Monthly Archives: July 2015

If you use Exadata

In this post I want to share some knowledge about Exadata gathered from my practice and Oracles official sources. It will be not about physical side of the Exadata machine which is well described in many sources. I will write about some aspects which I find interesting and worth to know if your database runs on Exadata machine.

What Exadata can do.

Main benefits that Exadata storage servers can provide:

  • Predicate filtering. (Predicate pushing)
  • Column filtering.
  • Join processing (offload Bloom Filter checking).
  • Offload encrypted data scan.
  • High level data compression.
  • Storage Indexes.
  • Flash cache.

So called Exadata storage cells are Oracle Linux hosts with directly attached storage and Flash cards. Number of hard disks and Flash cards per storage server depends on Exadata configuration. These storage servers communicate with servers which hosts Databases (so called Compute nodes) over InfiniBand networking hardware. What makes storage servers unique is the software installed on them, main component of this software is CELLSRV process. CELLSRV is multi-threaded  process. It forks also special offload processes, so called offload groups. Usually it forks two offload group processes one to serve version 11 databases and one to serve version 12 databases. CELLSRV process itself has more than 100 threads, each offload process also has threads but much less.

We can check it by querying dynamic view V$CELL_OFL_THREAD_HISTORY (from, this view accessible from any database instance running on compute nodes including ASM instance. Contents will be same independent from the database itself, you can use any instance to query it. This view contains history of CELLSRV and CELLOFL (offload servers) threads last activity.

col cell_name format a30
col group_name format a20
select distinct cell_name, group_name, process_id from  v$cell_ofl_thread_history order by 1,2;
CELL_NAME                      GROUP_NAME           PROCESS_ID
------------------------------ -------------------- ----------;      CELLSRV                   21832;      SYS_112331_141117         23879;      SYS_121211_150316         23876;      CELLSRV                   21672;      SYS_112331_141117         23708;      SYS_121211_150316         23705;     CELLSRV                   21681;     SYS_112331_141117         24136;     SYS_121211_150316         24129

9 rows selected.

Above output comes from 3 cell configuration. Cell name column shows names of the cell in form of InfiniBand interfaces of it, each cell has 2 interfaces ib0 and ib1, starting from X4 machines they work in Active-Active mode, in older versions it was Active-Passive. Group name column shows the CELLSRV process and number of offload servers started by it. Names of the offload server groups is SYS_<Software version>_<release date of the software>. On above example offload server group SYS_112331_141117 runs libraries from Exadata software version released on 17.11.2014. Note that cell itself have software installed. Actually CELLSRV starts two server groups, one of which is running on current cell software version and another one on older version. SYS_112331_141117 uses libcell11.so library, this library used by 11g databases. Offload server group SYS_121211_150316 uses libcell12.so which is also included in $ORACLE_HOME/lib of the 12c databases. We can guess that offload group for older version is used to handle offload requests from pre 12c databases. Third column prcess_id is the OS process ids on the corresponding cells.

We can use ps command on third cell to see all processes, it will be enough to grep only CELLSRV process ID, because offload servers is child of it and will be listed as well:

# ps -aef | grep 21681
root     21681 21679 15 May20 ?        7-12:23:03 /opt/oracle/cell/cellsrv/bin/cellsrv 100 5000 9 5042
root     24129     1  0 May20 ?        04:21:12 /opt/oracle/cell/cellofl- -startup 1 0 1 5042 21681 SYS_121211_150316 cell
root     24136     1  0 May20 ?        04:38:03 /opt/oracle/cell/cellofl- -startup 2 0 1 5042 21681 SYS_112331_141117 cell

To see all threads of each process, you can add -L key to ps command:

# ps -aefL | grep 21681

The output will be big, because only CELLSRV has more than 100 threads. Each offload server also has by 20 threads.

Each server group including CELLSRV has its own SGA, shared memory. This shared memory consists from different size block groups, starting from 512 byte blocks and more, 2048, 4096, 8192, 16384 and so on. Information about how many threads allocated, how much memory allocated for each server group, how many memory blocks is in use and high watermarks can be obtained from  v$cell_config view. This view contains information about CELL itself and for next components CELLDISKS, GRIDDISKS, LUNS, PHYSICALDISKS and IORM (IO resource manager) per cell. All information is in XML format, you can save it in XML file and open with your favorite XML viewer. Cell hash values which you can find also from v$cell view is used in 1st parameter of cell related wait events, it can help you to bind wait event with the particular cell.

About LIBCELL library, it is main library used by the instances to exchange I/O requests with the cells. It is included with the database in $ORACLE_HOME/lib folder. When database instance or ASM instance requires to issue some I/O request on Exadata machine it uses libcell library to convert regular I/O request to iDB protocol format.  iDB is Intelligent Database transport protocol developed by oracle, it is built on Reliable Datagram Sockets (RDS) protocol that runs on InfiniBand networking hardware. iDB includes in each request sent to the cells database identifier from which request sent and resource management group, it helps CELLSRV to enforce IO resource management rules on Database and DB Resource management groups level. All I/O requests coming from the instances have special I/O reason like “RedoLog Write”, “BufferCache Read”, “ASM Cache IO”, “ControlFile Write” or “Smart scan” and so on. Information about I/O reason and related to this I/O information included into iDB protocol request to the cells. You can see active requests on particular cell by executing celcli command LIST ACTIVEREQUEST, to see related information add DETAIL option.

For example IO Reason “RedoLog Write” will have next properties:

name:                   2923
asmDiskGroupNumber:     3
asmFileIncarnation:     861189077
asmFileNumber:          1249
consumerGroupID:        0
dbID:                   3084484146
dbName:                 TDB01
dbRequestID:            2923
fileType:               Onlinelog
id:                     2923
instanceNumber:         1
ioBytes:                4096
ioBytesSofar:           0
ioGridDisk:             D022_160_CD_05_exa01cell02_adm
ioOffset:               19101635
ioReason:               "RedoLog Write"
ioType:                 CachePut
objectNumber:           -2
parentID:               2923
pdbID:                  0
requestState:           "Transitional NSC"
sessionID:              1277
sessionSerNumber:       1
sqlID:                  0000000000000
tableSpaceNumber:       0

We can see that this request includes ASM group number, file incarnation and file number to which write must be applied. The database ID and name from which this I/O request comes, note that dbID indicated here is not the same as the DBID of the database from v$database, this dbID is cell related, you can see all dbIDs on cell by executing cellcli command LIST DATABASE DETAIL. Request also includes SessionID and Serial number from database instance, obvious that it is SID of LGWR process.

All I/O reasons that do not include offloading functionality are handled by CELLSRV process itself, all offload requests passed to offload server processes. For example when instance requires to read one block of index to the Buffer Cache it will send generate I/O request which will be converted by libcell to iDB protocol format and sent to the cell, this request will be handled by CELLSRV process. If it will be table full scan eligible to be offloaded by filtering predicates then operation will be divided by number of cells and sent in parallel to all corresponding cells, offload servers will handle this requests and send back to the database process which will combine them in PGA memory and return result to the user.

In Oracle 12c there is more v$ dynamic views related to the cells, also now they are part of AWR repository, there is DBA_HIST_% table for each of this views! It means that we can generate AWR reports on 12c which will include Exadata Related section! This section includes more than enough information about exadata configuration and dynamic statistics! By my opinion it is excellent report covering all required information from the cells. By default each AWR report will include this Exadata section, it is possible to disable it but it is not possible to generate it separately. I think they will implement this possibility in the future, because it makes sense. Exadata related section do not depend to database related section at all, and there is no need in this additional overhead of generating database related report.

Requirements for Smart Scan.
  1. Only full table or full index scans can benefit from Smart Scan
  2. Only direct path reads can be offloaded by Smart Scan

Index access or single block read or multi-block read for conventional path full scans are handled by CELLSRV as “CashPut” I/O type, it will receive this requests over iDB protocol, apply IO Resource management rules if any and queue this request for regular read or write operation. But if the full table scan or index scan will be required and direct-path read will be used then offload servers will be involved in process of execution of Smart Scan. Last minutes of activity of the threads can be queried from V$CELL_OFL_THREAD_HISTORY view:

select cell_name, group_name, thread_id, job_type, sql_id, database_id, instance_id, session_id, session_serial_num, snapshot_time 
    order by snapshot_time desc;

You can use this query to have an idea how Exadata Smart Scan used by different databases, search for JOB_TYPEs “PredicateCacheGet”, “PredicateOflFilter”, “PredicateDiskRead” and “PredicateOflExec”.

Direct-path reads returns data to PGA of the server process initiated the read request, when conventional path reads data to SGA, Buffer Cache. The reason why Exadata support only direct-path reads for Smart Scan to play is because of it returns Data Set and not blocks of data. It can not write Data Set to Buffer Cache because it is designed to cache data blocks. Smart Scan will do predicate filtering and/or column filtering and will return set of rows, not a blocks.

Direct-path reads used by default in Parallel Query slaves. But also can be used by serial queries in case of full table scans. Serial full scan will use direct-path access if the size of the table in blocks is bigger than the value of “_small_table_threshold” parameter, which is by default set to 4295 blocks (in 8k blocks it will be about 33,5 Mbyte). If table size is less than this value then conventional path will be used. This default setting looks fine, it is good to cache small tables to Buffer Cache, the other sessions will be able to share it without requesting I/O to the storage at all. It is possible to force also small tables to be read by direct-path reads by setting “_serial_direct_read” parameter to TRUE, but I will not suggest to do it in most of cases.

Decision to do Smart Scan or not is made on the run-time execution. Optimizer can predict if Smart Scan can be used and will mention it in execution plan. But if you will see from execution plan that storage access will be used to filter predicates it doesn’t mean that on run-time smart-scan will be implemented. There is special parameter cell_offload_plan_display which has default value AUTO, with this setting execution plans of the queries will show possibility of Smart Scan if database runs on Exadata servers, but you can set it to value ALWAYS, then it will show offload information even on databases running on non-exedata environments.

To check if offload was done, it is good to use session statistics (V$SESSTAT), you will find several . For example, you can after execution of the statement, which by your opinion should use smart scan, query V$SESSTAT for the next two statistics:

cell physical IO interconnect bytes
cell physical IO interconnect bytes returned by smart scan

If values of these two statistics is same, it means that all required data was obtained by smart scan. If bytes returned by smart scan is less than whole interconnect bytes, then part of the data (difference between these two) was returned by smart scan other part was returned in form of blocks. It is possible that combination of this two method will be used, part of the data returned by smart scan and part by the conventional path.  The reason why oracle can’t do smart scan can be different, one of the main reasons may be the fact that some transaction updated part of the rows and still not committed, to maintain consistency oracle should read from undo segments consistent copies of the blocks, it will generate conventional block I/O to the buffer cache. Also if dynamic sampling used for the query, it will be conventional I/O. Also check your tables for chained rows, because cell software can not handle this chains and forces conventional I/O.

Predicate and column filtering.

Predicate filtering occurs when Exadata cell software filters out rows according to query predicate. Unlike conventional path access (through buffer cache) storage (Exadata cells) will return Data Set, i.e. rows already filtered by predicate condition. The amount of data transferred to database memory structures will be decreased dramatically. It means more memory resources for other needs and much less CPU time.

Many common SQL functions can be evaluated on cells, offloaded. To get the list of supported functions query V$SQLFN_METADATA view which has OFFLOADABLE column:

select * from v$sqlfn_metadata where offloadable = 'YES';

Column filtering occurs when we specify specific columns in the query. Cellsrv will send back to PGA only selected columns, it can also save huge amount of the server resources processing tables with large amount of columns. With conventional path access it is not possible, if table contains 200 columns, all this data will be uploaded to buffer cache!

When used together column and predicate filtering can dramatically improve performance of the queries and decrease consumed resources.

Join processing.

Exadata cell software can offload join processing by implementing Bloom Filter. Bloom filter is a very efficient way to join big data sets. It can help to identify if some value not in data set. Bloom filter is the data structure containing digital map of the hashed values from some data set. To join two data sets, usually values from smaller one are hashed and hash values mapped to Bloom Filter. Then, bigger data set scanned and each value from it, after applying same hash function, probed against Bloom Filter data structure made from first data set, this probation can say exactly if value is not included, this way most of not matching values can be eliminated.

How exadata cells involved in this operation? Actually server process can prepare this Bloom Filter from smaller data set and then pass it as additional predicate to the second data set, then as part of predicate filtering cell software will probe each value with this Bloom Filter on cell level, this way it will eliminate not matching values and will send back to PGA only filtered out values.

I will show one example, just to make it clear. I have two tables SH.TEST and SH.COUNTRIES, where SH.COUNTRIES is much smaller table. Lets join this tables by COUNTRY_ID field and show execution plan:

SQL> explain plan for select cust_last_name, cust_gender, country_name from sh.test a, sh.countries b 
                        where a.country_id=b.country_id and 
                              a.cust_id between 1 and 10 and 

SQL> select * from table(dbms_xplan.display());

Plan hash value: 2945719171

| Id  | Operation			 | Name      | Rows  | Bytes | Cost (%CPU)| Time     |	  TQ  |IN-OUT| PQ Distrib |
|   0 | SELECT STATEMENT		 |	     |	  64 |	2240 | 18313   (1)| 00:00:01 |	      |      |		  |
|   1 |  PX COORDINATOR 		 |	     |	     |	     |		  |	     |	      |      |		  |
|   2 |   PX SEND QC (RANDOM)		 | :TQ10001  |	  64 |	2240 | 18313   (1)| 00:00:01 |	Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN			 |	     |	  64 |	2240 | 18313   (1)| 00:00:01 |	Q1,01 | PCWP |		  |
|   4 |     JOIN FILTER CREATE		 | :BF0000   |	   1 |	  15 |	   3   (0)| 00:00:01 |	Q1,01 | PCWP |		  |
|   5 |      PX RECEIVE 		 |	     |	   1 |	  15 |	   3   (0)| 00:00:01 |	Q1,01 | PCWP |		  |
|   6 |       PX SEND BROADCAST 	 | :TQ10000  |	   1 |	  15 |	   3   (0)| 00:00:01 |	Q1,00 | S->P | BROADCAST  |
|   7 |        PX SELECTOR		 |	     |	     |	     |		  |	     |	Q1,00 | SCWC |		  |
|*  8 | 	TABLE ACCESS STORAGE FULL| COUNTRIES |	   1 |	  15 |	   3   (0)| 00:00:01 |	Q1,00 | SCWP |		  |
|   9 |     JOIN FILTER USE		 | :BF0000   |	1216 | 24320 | 18310   (1)| 00:00:01 |	Q1,01 | PCWP |		  |
|  10 |      PX BLOCK ITERATOR		 |	     |	1216 | 24320 | 18310   (1)| 00:00:01 |	Q1,01 | PCWC |		  |
|* 11 |       TABLE ACCESS STORAGE FULL  | TEST      |	1216 | 24320 | 18310   (1)| 00:00:01 |	Q1,01 | PCWP |		  |

Predicate Information (identified by operation id):

   3 - access("A"."COUNTRY_ID"="B"."COUNTRY_ID")
   8 - storage("B"."COUNTRY_NAME"='Germany')
  11 - storage("A"."CUST_ID"<=10 AND "A"."CUST_ID">=1 AND SYS_OP_BLOOM_FILTER(:BF0000,"A"."COUNTRY_ID"))
       filter("A"."CUST_ID"<=10 AND "A"."CUST_ID">=1 AND SYS_OP_BLOOM_FILTER(:BF0000,"A"."COUNTRY_ID"))

Under execution plan there is “Predicate Information” section, this section shows predicate filters for the execution plan steps marked by “*”.

First step 8 of the execution plan will be performed, table SH.COUNTRIES will be accessed, full table scan access method will be used. We can see from “Predicate Information” section that for step 8 there is filter by COUNTRY_NAME field and this filter will be offloaded to cells, it is indicated by “storage(“B”.”COUNTRY_NAME”=’Germany’)” line. If operation “TABLE ACCESS STORAGE FULL” indicated in execution plan without “storage” access path in “Predicate Information” section it means that there is no offload processing planned, cells will return blocks of data as it is, because there is nothing to filter for this table.

Cellsrv will return filtered data set to the server process PGA, there will be two columns in data set COUNTRY_ID and COUNTRY_NAME. Then step 4 will take place, based on the COUNTRY_ID column values server process will create Bloom Filter :BF0000. Then this Bloom filter will be passed to the cells as part of predicate filter for the table SH.USERS, we can see it from “Predicate Information” section, line “storage(“A”.”CUST_ID”<=10 AND “A”.”CUST_ID”>=1 AND SYS_OP_BLOOM_FILTER(:BF0000,”A”.”COUNTRY_ID”))“, internal function SYS_OP_BLOOM_FILTER with two arguments passed, first argument is the filter created in step 4 of the execution plan and second argument is the field which will be checked against this filter. At the end of this step data set will be returned to server process already filtered on storage level! Last step is HASH JOIN of two data sets returned from stages 8 and 11.

Also note that the fact that execution plan shows that offload will take place, it can be so that no actual offload will take place on execution phase. Optimizer only predicts that offload can be used, but actual decision to use it or not made run-time, see previous section.



(c) Aychin Gasimov, 10/2015, Munich, Germany