filesystems - What are the typical application uses of reverse/stride/pread and pwrite? -


If you are impatient, skip to the "Question" headline below.

Context

I work with Unix(like) system administration and infrastructure development, but I think my question is answered best by programmers :o)

What I want to do is learn how to benchmark file systems (plain, volume managed, virtualized, encrypted, etc.) using iozone. As an exercise, I benchmarked a USB pendrive meant to be used as a system disk in my slug (http://www.nslu2-linux.org/), formatted respectively with vfat, ntfs, ext3, ext4 and xfs. The test produced some surprising results, which are posted below. The reason why the results surprised me, though, may be that I am still new to iozone and don't know how to interpret the numbers. Hence, this post.

In my test, iozone ran benchmarks on 11 different file operations, but only on one record size (4k, matching the block size of all the tested file systems) and only on one file size (512MB). The one-sidedness of record size and file size of course leaves the test with a bias. Anyway, the file operations are listed below, each with my own short explanation:

  • Initial write: write new data to disk sequentially, regular file usage
  • Rewrite: append new data to existing data sequentially, regular file usage
  • Read: sequentially read data, regular file usage
  • Re-read: sequentially re-read data (a buffer test, or what?)
  • Reverse read: ???
  • Stride read: ???
  • Random read: non-sequential read, typically database usage
  • Random write: non-sequential write, typically database usage
  • Pread: reading of data at a given position - for indexed databases?
  • Pwrite: writing of data at a given position - for indexed databases?
  • Mixed workload: (obvious)
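
The access patterns named in the list above can be sketched as the order in which records of one file are touched. This is only an illustration under simplifying assumptions, not iozone's implementation: the function name `offsets` and its parameters are made up here, the 4 KiB record size is the one used in this test, and the stride of 17 records is the value iozone reports in the output further down.

```python
import random

RECORD = 4 * 1024   # 4 KiB record size, as used in this test
STRIDE = 17         # stride in records, as reported in the iozone output

def offsets(pattern, n_records, record=RECORD, stride=STRIDE, seed=0):
    """Return the byte offsets that one pass of `pattern` would touch.

    Hypothetical helper for illustration; iozone's real tests differ in detail.
    """
    if pattern == "sequential":
        order = range(n_records)                      # first record to last
    elif pattern == "reverse":
        order = range(n_records - 1, -1, -1)          # last record to first
    elif pattern == "stride":
        order = range(0, n_records, stride)           # every stride-th record
    elif pattern == "random":
        order = list(range(n_records))
        random.Random(seed).shuffle(order)            # every record, shuffled
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return [i * record for i in order]
```

For example, `offsets("reverse", 4)` gives the offsets of records 3, 2, 1, 0, while `offsets("stride", 40)` gives the offsets of records 0, 17 and 34.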

Some of these operations seem straightforward. I guess that initial write, rewrite and read are used for regular file handling, which involves letting a pointer seek until a block is reached, then reading or writing sequentially (often through many blocks), sometimes having to jump forward a little because of fragmented files. The sole objective of the re-read test (I guess) is buffer testing. In parallel, random read/write are typical database operations, where the pointer has to jump from place to place within the same file collecting database records, for example when joining tables.

So what is the question?

So far, so good. I would highly appreciate corrections to the above assumptions, although they seem to be common knowledge. Now the real question: why would you ever do a reverse read? What is a stride read? And the "position" operations pread and pwrite, I've been told, are used for indexed databases, but why not just keep an index in memory? Or is that what actually happens, and pread then comes in handy for jumping to the exact location of a record once given the index? What else do you use pread/pwrite for?

To sum up, most of the time I feel I am only able to interpret the iozone results halfway. I more or less know why high numbers on random operations would make a good file system for a database, but why would I need to read files in reverse order, and what would a good stride read tell me? What would typical application uses of these operations be?

Bonus question

Having asked that, here is a bonus question. As administrator of a given file system, and having gratefully learned how to interpret file system benchmarks from insightful programmers ;) - does anyone have suggestions on how to analyse the actual use of a file system? Experimenting with file system record (block) size is trivial, although time consuming. And concerning the size and distribution of files in a given file system, 'find' is my friend. But what do I do to get counts of actual file system calls like read(), pwrite(), etc.?

I would also appreciate comments on the influence of other resources on file system test results, such as the role of processor power and RAM capacity and speed. For example, what difference does it make that I run the test on a machine with a 1.66GHz Atom processor and 2 gigs of DDR2 RAM, when I want to use the pendrive in a slug with a 266 MHz ARM Intel XScale processor and 32/8 MB SD/flash RAM?

Architecturally minded documentation?

Since I don't like to repeat myself, I don't ask it of others either. So, if these questions cannot be answered in a short manner, I would appreciate links to further documentation. The important thing is not that it explains the above file operations one by one (I can look up the APIs for that), but that the documentation is architecturally minded, that is, that it explains how these operations are typically used in real life applications.

Test results

Right. I promised some results of my rather humble USB pendrive file system test. My main expectation was poor results on writes in general (as a flash drive, given its nature, often has a bigger block size than the file system administering it, meaning that in order to write a small change relatively large amounts of unchanged data have to be rewritten), and nice results on reads. The main points turned out to be:
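
The expectation about writes can be made concrete with a little arithmetic. The sketch below assumes the worst case, where every small write forces the drive to rewrite a whole erase block; the 128 KiB erase block is a hypothetical figure chosen for illustration, not a measured property of this pendrive, and the function name is made up here.

```python
def write_amplification(write_size, erase_block):
    """Bytes the device rewrites per byte the file system asked to write.

    Assumes an aligned write and that each touched erase block is
    rewritten in full (a worst-case simplification).
    """
    blocks_touched = -(-write_size // erase_block)   # ceiling division
    return blocks_touched * erase_block / write_size

# Hypothetical figures: 4 KiB file system block, 128 KiB erase block.
print(write_amplification(4 * 1024, 128 * 1024))   # 32.0
```

Under these assumptions a single 4 KiB change costs the device 32 times as much writing, which is the effect described above.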

  • vfat did well on all operations, except the more obscure (to me, anyway) reverse and stride read. I guess its lack of features eliminates a lot of bookkeeping.

  • ntfs sucks on rewrite (append) and read operations, making it a poor candidate for regular file operation. It also sucks on the pread operation, making it a poor candidate for indexed databases.

  • Surprisingly, ext3 and ext4, the latter marginally better on most operations, both suck at initial write, rewrite, read, random write and pwrite operations, making them poor candidates for regular file usage as well as for intensely updated databases. ext4, though, is a master at random read and pread, making it an excellent candidate for static databases(?). Both ext3 and ext4 score high on the obscure reverse read and stride read operations, whatever that means.

  • The unsurpassed all-over test winner was xfs, whose only weak point seems to be reverse read. On initial write, rewrite, read, random write and pwrite it is among the best, making it an excellent candidate for regular file usage as well as for (intensely updated) databases. On re-read, random read and pread it is among the runners-up, making it a good candidate for (somewhat static) databases. It also does well on stride read - whatever that means!

Any comments on the interpretation of these results are welcome! The numbers are listed beneath (somewhat cut for reasons of length), with one iozone test suite per file system type, all tested on a standard 4GB Verbatim pendrive (orange of colour ;)), docked in a Samsung N105P laptop with an N450 1.66GHz Atom CPU and 2GB DDR2 667 MHz RAM, running a Linux 3.2.0-24 x86 kernel with encrypted swap (yeah, I know, I should install 64bit Linux and leave swap in the clear!).

Regards, Torsten

PS. After writing this I found out that, apparently, the Debian NSLU2 distribution does not support xfs. My questions still stand, though!

--- vfat ---

    iozone: performance test of file i/o
            version $revision: 3.397 $
            compiled 32 bit mode.
            build: linux

    contributors: william norcott, don capps, isom crawford, kirby collins
                  al slater, scott rhine, mike wisner, ken goss
                  steve landherr, brad smith, mark kelly, dr. alain cyr,
                  randy dunlap, mark montague, dan million, gavin brebner,
                  jean-marc zucconi, jeff blomberg, benny halevy, dave boone,
                  erik habbinga, kris strecker, walter wong, joshua root,
                  fabrice bacchella, zhenghua xue, qin li, darren sawyer.
                  ben england.

    run began: mon jun  4 14:23:57 2012

    record size 4 kb
    file size set 524288 kb
    command line used: iozone -l 1 -u 1 -r 4k -s 512m -f /mnt/iozone.tmp
    output in kbytes/sec
    time resolution = 0.000002 seconds.
    processor cache size set 1024 kbytes.
    processor cache line size set 32 bytes.
    file stride size set 17 * record size.
    min process = 1
    max process = 1
    throughput test 1 process
    each process writes 524288 kbyte file in 4 kbyte records

    children see throughput 1 initial writers   =   12864.82 kb/sec
    parent sees throughput 1 initial writers    =    3033.39 kb/sec
    children see throughput 1 rewriters         =   25271.86 kb/sec
    parent sees throughput 1 rewriters          =    2876.36 kb/sec
    children see throughput 1 readers           =  685333.00 kb/sec
    parent sees throughput 1 readers            =  682464.06 kb/sec
    children see throughput 1 re-readers        =  727929.94 kb/sec
    parent sees throughput 1 re-readers         =  726612.47 kb/sec
    children see throughput 1 reverse readers   =  458174.00 kb/sec
    parent sees throughput 1 reverse readers    =  456910.21 kb/sec
    children see throughput 1 stride readers    =  351768.00 kb/sec
    parent sees throughput 1 stride readers     =  351504.09 kb/sec
    children see throughput 1 random readers    =  553705.94 kb/sec
    parent sees throughput 1 random readers     =  552630.83 kb/sec
    children see throughput 1 mixed workload    =  549812.50 kb/sec
    parent sees throughput 1 mixed workload     =  547645.03 kb/sec
    children see throughput 1 random writers    =   19958.66 kb/sec
    parent sees throughput 1 random writers     =    2752.23 kb/sec
    children see throughput 1 pwrite writers    =   13355.57 kb/sec
    parent sees throughput 1 pwrite writers     =    3119.04 kb/sec
    children see throughput 1 pread readers     =  574273.31 kb/sec
    parent sees throughput 1 pread readers      =  572121.97 kb/sec

--- ntfs ---

    iozone: performance test of file i/o
            version $revision: 3.397 $
            compiled 32 bit mode.
            build: linux

    contributors: william norcott, don capps, isom crawford, kirby collins
                  al slater, scott rhine, mike wisner, ken goss
                  steve landherr, brad smith, mark kelly, dr. alain cyr,
                  randy dunlap, mark montague, dan million, gavin brebner,
                  jean-marc zucconi, jeff blomberg, benny halevy, dave boone,
                  erik habbinga, kris strecker, walter wong, joshua root,
                  fabrice bacchella, zhenghua xue, qin li, darren sawyer.
                  ben england.

    run began: mon jun  4 13:59:37 2012

    record size 4 kb
    file size set 524288 kb
    command line used: iozone -l 1 -u 1 -r 4k -s 512m -f /mnt/iozone.tmp
    output in kbytes/sec
    time resolution = 0.000002 seconds.
    processor cache size set 1024 kbytes.
    processor cache line size set 32 bytes.
    file stride size set 17 * record size.
    min process = 1
    max process = 1
    throughput test 1 process
    each process writes 524288 kbyte file in 4 kbyte records

    children see throughput 1 initial writers   =   11153.75 kb/sec
    parent sees throughput 1 initial writers    =    2848.69 kb/sec
    children see throughput 1 rewriters         =    8723.95 kb/sec
    parent sees throughput 1 rewriters          =    2794.81 kb/sec
    children see throughput 1 readers           =   24935.60 kb/sec
    parent sees throughput 1 readers            =   24878.74 kb/sec
    children see throughput 1 re-readers        =  144415.05 kb/sec
    parent sees throughput 1 re-readers         =  144340.90 kb/sec
    children see throughput 1 reverse readers   =   76627.60 kb/sec
    parent sees throughput 1 reverse readers    =   76362.93 kb/sec
    children see throughput 1 stride readers    =  367293.25 kb/sec
    parent sees throughput 1 stride readers     =  366002.25 kb/sec
    children see throughput 1 random readers    =  505843.41 kb/sec
    parent sees throughput 1 random readers     =  500556.16 kb/sec
    children see throughput 1 mixed workload    =  553075.56 kb/sec
    parent sees throughput 1 mixed workload     =  551754.97 kb/sec
    children see throughput 1 random writers    =    9747.23 kb/sec
    parent sees throughput 1 random writers     =    2381.89 kb/sec
    children see throughput 1 pwrite writers    =   10906.05 kb/sec
    parent sees throughput 1 pwrite writers     =    1931.43 kb/sec
    children see throughput 1 pread readers     =   16730.47 kb/sec
    parent sees throughput 1 pread readers      =   16194.80 kb/sec

--- ext3 ---

    iozone: performance test of file i/o
            version $revision: 3.397 $
            compiled 32 bit mode.
            build: linux

    contributors: william norcott, don capps, isom crawford, kirby collins
                  al slater, scott rhine, mike wisner, ken goss
                  steve landherr, brad smith, mark kelly, dr. alain cyr,
                  randy dunlap, mark montague, dan million, gavin brebner,
                  jean-marc zucconi, jeff blomberg, benny halevy, dave boone,
                  erik habbinga, kris strecker, walter wong, joshua root,
                  fabrice bacchella, zhenghua xue, qin li, darren sawyer.
                  ben england.

    run began: sun jun  3 16:05:27 2012

    record size 4 kb
    file size set 524288 kb
    command line used: iozone -l 1 -u 1 -r 4k -s 512m -f /media/verbatim/1/iozone.tmp
    output in kbytes/sec
    time resolution = 0.000001 seconds.
    processor cache size set 1024 kbytes.
    processor cache line size set 32 bytes.
    file stride size set 17 * record size.
    min process = 1
    max process = 1
    throughput test 1 process
    each process writes 524288 kbyte file in 4 kbyte records

    children see throughput 1 initial writers   =    3704.61 kb/sec
    parent sees throughput 1 initial writers    =    3238.73 kb/sec
    children see throughput 1 rewriters         =    3693.52 kb/sec
    parent sees throughput 1 rewriters          =    3291.40 kb/sec
    children see throughput 1 readers           =  103318.38 kb/sec
    parent sees throughput 1 readers            =  103210.16 kb/sec
    children see throughput 1 re-readers        =  908090.88 kb/sec
    parent sees throughput 1 re-readers         =  906356.05 kb/sec
    children see throughput 1 reverse readers   =  744801.38 kb/sec
    parent sees throughput 1 reverse readers    =  743703.54 kb/sec
    children see throughput 1 stride readers    =  623353.88 kb/sec
    parent sees throughput 1 stride readers     =  622295.11 kb/sec
    children see throughput 1 random readers    =  725649.06 kb/sec
    parent sees throughput 1 random readers     =  723891.82 kb/sec
    children see throughput 1 mixed workload    =  734631.44 kb/sec
    parent sees throughput 1 mixed workload     =  733283.36 kb/sec
    children see throughput 1 random writers    =     177.59 kb/sec
    parent sees throughput 1 random writers     =     137.83 kb/sec
    children see throughput 1 pwrite writers    =    2319.47 kb/sec
    parent sees throughput 1 pwrite writers     =    2200.95 kb/sec
    children see throughput 1 pread readers     =   13614.82 kb/sec
    parent sees throughput 1 pread readers      =   13614.45 kb/sec

--- ext4 ---

    iozone: performance test of file i/o
            version $revision: 3.397 $
            compiled 32 bit mode.
            build: linux

    contributors: william norcott, don capps, isom crawford, kirby collins
                  al slater, scott rhine, mike wisner, ken goss
                  steve landherr, brad smith, mark kelly, dr. alain cyr,
                  randy dunlap, mark montague, dan million, gavin brebner,
                  jean-marc zucconi, jeff blomberg, benny halevy, dave boone,
                  erik habbinga, kris strecker, walter wong, joshua root,
                  fabrice bacchella, zhenghua xue, qin li, darren sawyer.
                  ben england.

    run began: sun jun  3 17:59:26 2012

    record size 4 kb
    file size set 524288 kb
    command line used: iozone -l 1 -u 1 -r 4k -s 512m -f /media/verbatim/2/iozone.tmp
    output in kbytes/sec
    time resolution = 0.000005 seconds.
    processor cache size set 1024 kbytes.
    processor cache line size set 32 bytes.
    file stride size set 17 * record size.
    min process = 1
    max process = 1
    throughput test 1 process
    each process writes 524288 kbyte file in 4 kbyte records

    children see throughput 1 initial writers   =    4086.64 kb/sec
    parent sees throughput 1 initial writers    =    3533.34 kb/sec
    children see throughput 1 rewriters         =    4039.37 kb/sec
    parent sees throughput 1 rewriters          =    3409.48 kb/sec
    children see throughput 1 readers           = 1073806.38 kb/sec
    parent sees throughput 1 readers            = 1062541.84 kb/sec
    children see throughput 1 re-readers        =  991162.00 kb/sec
    parent sees throughput 1 re-readers         =  988426.34 kb/sec
    children see throughput 1 reverse readers   =  811973.62 kb/sec
    parent sees throughput 1 reverse readers    =  810333.28 kb/sec
    children see throughput 1 stride readers    =  779127.19 kb/sec
    parent sees throughput 1 stride readers     =  777359.89 kb/sec
    children see throughput 1 random readers    =  796860.56 kb/sec
    parent sees throughput 1 random readers     =  795138.41 kb/sec
    children see throughput 1 mixed workload    =  741489.56 kb/sec
    parent sees throughput 1 mixed workload     =  739544.09 kb/sec
    children see throughput 1 random writers    =     499.05 kb/sec
    parent sees throughput 1 random writers     =     399.82 kb/sec
    children see throughput 1 pwrite writers    =    4092.66 kb/sec
    parent sees throughput 1 pwrite writers     =    3451.62 kb/sec
    children see throughput 1 pread readers     =  840101.38 kb/sec
    parent sees throughput 1 pread readers      =  831083.31 kb/sec

--- xfs ---

    iozone: performance test of file i/o
            version $revision: 3.397 $
            compiled 32 bit mode.
            build: linux

    contributors: william norcott, don capps, isom crawford, kirby collins
                  al slater, scott rhine, mike wisner, ken goss
                  steve landherr, brad smith, mark kelly, dr. alain cyr,
                  randy dunlap, mark montague, dan million, gavin brebner,
                  jean-marc zucconi, jeff blomberg, benny halevy, dave boone,
                  erik habbinga, kris strecker, walter wong, joshua root,
                  fabrice bacchella, zhenghua xue, qin li, darren sawyer.
                  ben england.

    run began: mon jun  4 14:47:49 2012

    record size 4 kb
    file size set 524288 kb
    command line used: iozone -l 1 -u 1 -r 4k -s 512m -f /mnt/iozone.tmp
    output in kbytes/sec
    time resolution = 0.000005 seconds.
    processor cache size set 1024 kbytes.
    processor cache line size set 32 bytes.
    file stride size set 17 * record size.
    min process = 1
    max process = 1
    throughput test 1 process
    each process writes 524288 kbyte file in 4 kbyte records

    children see throughput 1 initial writers   =   21854.47 kb/sec
    parent sees throughput 1 initial writers    =    3836.32 kb/sec
    children see throughput 1 rewriters         =   29420.40 kb/sec
    parent sees throughput 1 rewriters          =    3955.65 kb/sec
    children see throughput 1 readers           =  624136.75 kb/sec
    parent sees throughput 1 readers            =  614326.13 kb/sec
    children see throughput 1 re-readers        =  577542.62 kb/sec
    parent sees throughput 1 re-readers         =  576533.42 kb/sec
    children see throughput 1 reverse readers   =  483368.06 kb/sec
    parent sees throughput 1 reverse readers    =  482598.67 kb/sec
    children see throughput 1 stride readers    =  537227.12 kb/sec
    parent sees throughput 1 stride readers     =  536313.77 kb/sec
    children see throughput 1 random readers    =  525219.19 kb/sec
    parent sees throughput 1 random readers     =  524062.07 kb/sec
    children see throughput 1 mixed workload    =  561513.50 kb/sec
    parent sees throughput 1 mixed workload     =  560142.18 kb/sec
    children see throughput 1 random writers    =   24118.34 kb/sec
    parent sees throughput 1 random writers     =    3117.71 kb/sec
    children see throughput 1 pwrite writers    =   32512.07 kb/sec
    parent sees throughput 1 pwrite writers     =    3825.54 kb/sec
    children see throughput 1 pread readers     =  525244.94 kb/sec
    parent sees throughput 1 pread readers      =  523331.93 kb/sec

The few times I have needed to dig in depth into filesystem performance were on Windows systems. But the general principles apply no matter which OS/filesystem you are using...

Why would you ever do a reverse read?

As a program runs, it reads block 987654 and, using that data, determines that it next needs block 123456. This might happen on a join: the DB might be using an index on table 1 to pick records (using another index) out of table 2. The picking operation might happen in table 1 order, which may be the reverse of table 2 order.

Similar sorts of situations can happen with single-table selects when using two keys.
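
As a minimal sketch of the access pattern itself (not of any particular database), the following reads fixed-size records from the end of a file toward the beginning. The function name and the fixed record size are assumptions made for illustration.

```python
import os

def read_records_reversed(path, record_size):
    """Yield fixed-size records from the last one to the first.

    Hypothetical helper: each record is read forward, but the records
    are visited in descending offset order, as in a reverse read.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        n_records = f.tell() // record_size     # ignore any partial tail
        for i in range(n_records - 1, -1, -1):
            f.seek(i * record_size)
            yield f.read(record_size)
```

A file system that only prefetches forward gets little help from read-ahead on this pattern, which is why a benchmark measures it separately from sequential reads.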

What is a stride read?

Reading every N-th block. For example, reading block 12345600, then block 12345700, then block 12345800 is a stride of 100. Imagine a table with many and/or large columns. This table might have rows that each need several filesystem blocks to hold the data. Typically a database will organize this data as one record per row, with each record occupying several sequential filesystem blocks. If the DB rows occupy 10 filesystem blocks and you are selecting on 2 columns, you might only need to read the 1st and 6th blocks of each 10-block record. Then your query would need to read blocks 10001, 10006, 10011, 10016, 10021, 10026 - a stride of 5.
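
The block numbers in that example follow directly from the starting block and the stride. A small sketch (the function name is made up for illustration):

```python
def strided_blocks(first_block, stride, count):
    """Block numbers touched by a stride read starting at first_block."""
    return [first_block + i * stride for i in range(count)]

# 10-block rows where only the 1st and 6th block of each row are needed.
print(strided_blocks(10001, 5, 6))
# [10001, 10006, 10011, 10016, 10021, 10026]
```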

And the "position" operations pread and pwrite, I've been told, are used for indexed databases, but why not just keep an index in memory?

The size of the index may exceed a reasonable amount of RAM usage. Or prior usage may have pulled other indexes or data into RAM, causing the unused index to be evicted from the filesystem/DB cache.

Or is that what actually happens, and pread then comes in handy for jumping to the exact location of a record once given the index? Yep, that might be what the database is doing.

What else do you use pread/pwrite for?

Some data files have predefined "interesting" locations. These might be the root of a B-tree index, a table header, a log/journal tail or something else, depending on the DB implementation. The pread/pwrite tests measure the performance of hopping to a set of specific locations repeatedly, instead of a uniformly random mix of locations.
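
A toy sketch of both uses, assuming a Unix-like system where Python's os.pread and os.pwrite are available. The file layout (a 4-byte record count at offset 0 followed by 16-byte records) and the names `build` and `lookup` are invented for illustration and do not describe any real database format.

```python
import os
import struct

HEADER = struct.Struct("<I")      # hypothetical header: record count at offset 0
RECORD = struct.Struct("<I12s")   # hypothetical record: 4-byte key + 12-byte payload

def build(path, rows):
    """Write rows with pwrite and return an in-memory index: key -> offset."""
    index = {}
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.pwrite(fd, HEADER.pack(0), 0)              # placeholder header
        for n, (key, payload) in enumerate(rows):
            offset = HEADER.size + n * RECORD.size
            os.pwrite(fd, RECORD.pack(key, payload), offset)
            index[key] = offset
        # Revisit the fixed "interesting" location without moving a file pointer.
        os.pwrite(fd, HEADER.pack(len(rows)), 0)
    finally:
        os.close(fd)
    return index

def lookup(path, index, key):
    """Jump straight to a record's offset with pread; no seek needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, RECORD.size, index[key])
    finally:
        os.close(fd)
    _, payload = RECORD.unpack(raw)
    return payload.rstrip(b"\0")
```

Because pread and pwrite take the offset as an argument and leave the file position untouched, several threads can share one descriptor without coordinating seeks, which is one reason they suit this kind of positioned access.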

Links?

There exist system utilities for the mainstream OSes that can capture every filesystem operation. I think on *nix systems these might be DTrace, SystemTap or strace (which is built on ptrace). You can use the mountains of data (filtered intelligently) from these monitors to see the disk access pattern in your system.
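
Once the offsets of successive reads on one file have been extracted from such a trace (how to extract them depends on the tool and is not shown here), a rough classification of the access pattern is straightforward. The function below is a hypothetical sketch, not part of any tracing tool.

```python
def classify(offsets, record_size):
    """Label a run of read offsets as sequential, reverse, stride or random.

    Rough heuristic: a constant step of +record_size is sequential,
    -record_size is reverse, any other constant step is a stride, and
    everything else (including fewer than two offsets) counts as random.
    """
    steps = {b - a for a, b in zip(offsets, offsets[1:])}
    if len(steps) != 1:
        return "random"
    (step,) = steps
    if step == record_size:
        return "sequential"
    if step == -record_size:
        return "reverse"
    return "stride"
```

For example, offsets 0, 69632, 139264 with 4 KiB records have a constant step of 17 records and are labelled "stride".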

Then the general rule of thumb for DB usage is that obscene amounts of RAM are helpful: then all indexes will reside in RAM all the time.

