Comments on Playing With System Statistics : Oracle Sponge -- Now Moved To Wordpress
There is something a little odd about your results.
Despite doing "only a serial tablescan", you have recorded times for single AND multiblock reads - more significantly you have recorded something that I've always assumed was for parallel execution before now, the MAXTHR and SLAVETHR.
I don't know how Oracle calculates SLAVETHR - there are no obvious clues in the traces - but MAXTHR may simply be the bytes per second: which makes it look like your tablescan may have been a parallel one.
System Statistics may become an interesting discussion point for complex systems - Christo Kutrovsky (from Pythian) has pointed out that he often sees cases where the hardware actually supplies multiblock reads faster than singleblock reads - which is bad news because Oracle stops doing CPU costing calculations when mreadtim < sreadtim.
Problem is, it can be perfectly valid for a SAN to be that useless - great for massive tablescans with automatic read-ahead kicking in when you don't always want it; but completely useless for single block reads and OLTP systems in general.
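For reference, the figures being discussed (MAXTHR and SLAVETHR included) can be checked straight from the dictionary once a gather has completed - a minimal sketch, assuming the statistics went to the default location rather than a user stattab:

SELECT pname, pval1
FROM   sys.aux_stats$
WHERE  sname = 'SYSSTATS_MAIN';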
Jonathan Lewis
Despite doing "only a serial tablescan", you have recorded times for single AND multiblock reads - more significantly you have recorded something that I've always assumed was for parallel execution before now, the MAXTHR and SLAVETHR.
I don't know how Oracle calculates SLAVETHR - there are no obvious clues in the traces - but MAXTHR may simply be the bytes per second: which makes it look like your tablescan may have been a parallel one.
System Statistics may become an interesting discussion point for complex systems - Christo Kutrovsky (from Pythian) has pointed out that he often sees cases where the hardware actually supplies multiblock reads faster than singleblock reads - which is bad news because Oracle stops doing CPU costing calculations when mreadtim < sreadtim.
Problem is, it can be perfectly valid for a SAN to be that useless - great for massive tablescans with automatic read-ahead kicking in when you don't always want it; but completely useless for single block reads and OLTP systems in general.
Jonathan Lewis
7:39 AM
Nuno posted an example of MREADTIM less than SREADTIM here: http://dizwell.com/forum/index.php?topic=34.0
An OLTP system with a very small MBRC, so read ahead caching is very likely. It makes me wonder whether the MBRC actually ought to be higher, to approach whatever the read size is of the i/o subsystem. But as he says he's using RBO ...
Anyway, my guess would be that there's some recursive SQL associated with starting and stopping the system stat gathering that falls between the snapshots of the file io that the process presumably takes, and that sequential reads are involved. I'll run a 10046 trace and see, and also make sure that the process is serial (I can usually detect parallel operations on my test system just from HD noise -- didn't notice anything, but I'll check).
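Roughly what I have in mind for the trace - a sketch only, and the gathering mode may differ from what the original post actually used:

ALTER SESSION SET events '10046 trace name context forever, level 8';
EXEC DBMS_STATS.GATHER_SYSTEM_STATS('start');
-- ... let the tablescan workload run ...
EXEC DBMS_STATS.GATHER_SYSTEM_STATS('stop');
ALTER SESSION SET events '10046 trace name context off';

If the recursive SQL theory is right, the single block reads should show up against the dictionary queries around the 'start' and 'stop' calls rather than against the scanned table.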
Thanks Jonathan
8:53 AM
The mreadtim < sreadtim is another whole new area where you may have to end up faking it to tell the truth. I saw the thread on the HJR forum - and this led to your blog - and that kicked me off thinking about the stuff I'm supposed to be writing about system stats.
I think the idea I chatted about with Christo was the need to multiply the MBRC and mreadtim up (to keep them in sync) until the mreadtim was larger than the sreadtim. This way the CBO would use the figures, and the I/O rate would be described nearly truthfully by mreadtim/MBRC.
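A sketch of how that might be applied after a gather - the numbers are purely illustrative (say the gather produced sreadtim=5, mreadtim=4, MBRC=8):

-- double both so that mreadtim > sreadtim while mreadtim/MBRC is unchanged
EXEC DBMS_STATS.SET_SYSTEM_STATS('MREADTIM', 8);
EXEC DBMS_STATS.SET_SYSTEM_STATS('MBRC', 16);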
Jonathan Lewis
9:54 AM
I'm intrigued by the following SQL statement in the trace file:
SELECT S1.VALUE - S2.VALUE MBREADS
FROM V$SYSSTAT S1, V$SYSSTAT S2
WHERE S1.NAME = 'physical reads'
AND S2.NAME = 'physical reads direct'
This may be a knee-jerk reaction but it implies to me that direct reads are being deliberately excluded by something (and I guess that they're not directly sized by DFMBRC either?)
That's an interesting thought, to multiply up the MBRC and MREADTIM. On a system by system basis I'd be interested in exploring changes to the DFMBRC to see what effect that had on those numbers also - if read ahead caching is responsible for the low MREADTIM then an increase to the MBRC ought to be fairly "cheap", I'd think.
And it also ought to show up in a 10046 level 8 trace as alternately high and low elapsed times, I guess, from which the read ahead cache size might usefully be estimated.
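Something along these lines is what I have in mind for that experiment - a sketch only, with a placeholder table name:

ALTER SESSION SET db_file_multiblock_read_count = 16;
ALTER SESSION SET events '10046 trace name context forever, level 8';
SELECT /*+ full(t) */ COUNT(*) FROM big_table t;
ALTER SESSION SET events '10046 trace name context off';

Then compare the "db file scattered read" elapsed times in the trace against the DFMBRC=8 run.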
10:27 AM
"if read ahead caching is responsible for the low MREADTIM then an increase to the MBRC ought to be fairly "cheap", I'd think."
I'll try that tonight. 8 and 16 to see if any major diff from previous numbers from 10046.
Can't run a gather_system_stats with reasonable production load until probably Monday: this server is being tested at this stage and the next test cycle starts then.
Multiplying up MBRC and MREADTIM when mreadtim < sreadtim makes a LOT of sense.
8:40 PM
David,
regarding readahead.
I have seen trace files with "db file scattered read" showing patterns of consecutive lines like:
ela= 12000 p3 = 8
ela= 800 p3 = 8
ela= 800 p3 = 8
ela= 800 p3 = 8
ela= 12000 p3 = 8
ela= 800 p3 = 8
ela= 800 p3 = 8
ela= 800 p3 = 8
ela= 12000 p3 = 8
ela= 800 p3 = 8
ela= 800 p3 = 8
ela= 800 p3 = 8
It's usually been a special feature of SANs where there HASN'T BEEN some sort of destructive interference between striping at the SAN and secondary striping at the host. (Where there has been a collision, the read times are uniformly poor to very bad - there's a hint of one such case on the HJR forum)
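(Working the arithmetic on a pattern like that, purely as an illustration and assuming an 8KB block size: one slow read followed by three cheap ones covers 4 x 8 blocks x 8KB = 256KB, so the array would appear to be prefetching roughly a quarter of a megabyte at a time, with the ~12ms read being the one that pays for the physical transfer.)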
It can be very revealing in 10g to look at v$event_histogram for "db file scattered read" to get an overview of this.
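Something like this gives the overview - a sketch, 10g only:

SELECT wait_time_milli, wait_count
FROM   v$event_histogram
WHERE  event = 'db file scattered read'
ORDER  BY wait_time_milli;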
The Direct Reads and the subtraction in the query against v$sysstat may be how Oracle works out the MAXTHR figure - which is making the assumption that any direct reads are PX slave processes.
Jonathan Lewis
1:44 AM
Nope, not much difference in pattern with 16 in dfmbrc. Two or three slow reads at the start, then they settle quickly into the fast reads - at twice the time per read, as expected.
Same pattern as before. It happens a lot faster though: compared to the 8 or so requests it took before to get into read ahead mode, two or three requests are now enough to trigger the read ahead. My guess is that's because the number of blocks read in each multiblock read has just doubled, so the xserve firmware "detects" a sequential read faster.
This is why I had it set low before: we have a lot of smallish tables that just happen to be around the 50K or so mark and they get a lot of FTS work on them. I don't want the firmware to go into read ahead mode where it just plain doesn't need to.
The reverse also applies: as far as I know 9iR2 does not communicate to the xserve whether it is doing a FTS or just a range scan. So even those can be a problem sometimes, depending on the disk access pattern.
This is where database level optimizations such as dfmbrc can easily conflict with firmware level optimizations. And why, in the absence of any "handshaking" between the two, I prefer to turn one side off and let the other do the work rather than have them tread on each other's turf. It might not be optimal in 100% of the cases, but at least it is predictable.
6:23 AM
I wonder if this is a circumstance where you would be justified in caching those small FTS-prone tables in a separate keep buffer pool?
Or to put it more theoretically: if you have a database with a lot of small tables that are subject to FTS, then in order to avoid read ahead kicking in on the smallest tables you have to set MBRC artificially small, thus disadvantaging FTS of the not-quite-so-small tables. If those smaller tables were cached so that physical reads were rarely required, then you could raise the MBRC and get more efficient scans on the larger tables.
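The mechanics would be something like this - a sketch, with placeholder names and sizes:

ALTER TABLE small_fts_table STORAGE (BUFFER_POOL KEEP);
-- and size the keep pool to hold the full set of small tables
-- (assuming the SGA has the headroom for it)
ALTER SYSTEM SET db_keep_cache_size = 64M;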
Bit of a niche case, admittedly.
Was there any detectable increase in the FTS performance due to the large MBRC size invoking read ahead more quickly, Noons? If you wanted to email me trace files at oracle.sponge@yahoo.com I'd be happy to work the math if you like.
8:20 AM
No appreciable difference on FTS - or create index, for that matter - between 16 and 8. That's why I'm not so keen on letting read-ahead take over. If you look at the times on the dumps in dizwell, the dfmbrc=16 reads take approximately twice as long as the dfmbrc=8 ones once the read ahead has kicked in. That means I'm not really gaining much other than on the initial reads.
What worries me is I can't detect the cycling you folks talk about! That makes me think I'm either getting a very large read ahead "chunk" (something I don't necessarily want!) or the xserve is doing something very weird. Either way, I'd rather stay away from it at the moment. Given that it makes bugger all difference if I have 16 or 8 except for the start, not a problem.
Yes, I'm definitely considering putting those smaller tables undergoing fts into a keep cache. Definitely a niche case. It's not OLTP, it's not DSS, it's not DW. More a data mart, if that's still a valid expression nowadays?
8:47 AM