caching - Number of banks in Nehalem L2 cache


I was studying the access times of different cache configurations when I stumbled on the term "number of banks" in the CACTI interface.

The number of banks is the number of interleaved modules in a cache; it increases the bandwidth of the cache, i.e. the number of parallel accesses to it.
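As a purely illustrative sketch of that idea: with simple interleaving, consecutive cache lines land in different banks, so accesses that hit different banks can be serviced in parallel. The bank count and line size below are placeholder values, not Nehalem figures.

```c
#include <stdint.h>

/* Toy model of bank interleaving (illustrative values, not Nehalem's). */
#define NUM_BANKS 4
#define LINE_SIZE 64

/* Consecutive lines are spread round-robin across banks; two accesses
   can proceed in parallel only if bank_of(a) != bank_of(b). */
static unsigned bank_of(uint64_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_BANKS);
}
```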

In this context, I wanted to find the number of banks in the caches of the Nehalem architecture. I googled it but did not hit anything useful.

My reasoning was:

  1. The L1 data and instruction caches must each have a single bank, since the access granularity is a word here.
  2. The L2 cache serves misses from both the L1 data and the L1 instruction cache, hence it must support 2 banks.
  3. The L3 cache is shared across all cores in the system and hence must have a large number of banks (32).

Is this intuition correct? Also, does the number of banks change the way data/programs should be structured (ideally it should not, but still ...)?

The overview graphic in the Wikipedia article depicts Nehalem (the first CPU branded "Core i7") as having 256 KByte of L2 cache per core.

I won't use the word "bank" here. Nehalem's cache is 8-way associative with 64 bits (8 bytes) per cache line.

That means for every read/write access to the cache, 8 bytes of data are transferred, which corresponds to the 64-bit architecture, where virtual addresses have 8 bytes. Every time an address has to be retrieved from or stored in memory, 8 bytes have to be transported, so it is a natural fit to design a single entry in a cache way like this. (Other cache entry sizes make sense, too, depending on the application: such as larger sizes for data caches that feed vector processing units.)

The x-way associativity determines the relationship between a memory address and the places where the information at that address can be stored inside the cache. The term "8-way associativity" refers to the fact that the data stored at a memory address can be held in 8 different cache lines. Caches have an address comparison mechanism to select the matching entry within one way, and a replacement strategy to decide which of the x ways is used, possibly expelling a previously valid value.
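As a concrete sketch of that decomposition, using the figures from this answer (256 KByte, 8-way, 8-byte entries - assumptions for illustration, not Intel documentation): an address splits into an offset within the entry, a set index that selects one group of 8 ways, and a tag that is compared against all 8 ways of that set.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters taken from this answer, not from Intel docs. */
#define CACHE_SIZE (256 * 1024)                        /* bytes            */
#define ENTRY_SIZE 8                                   /* bytes per entry  */
#define WAYS       8
#define NUM_SETS   (CACHE_SIZE / (ENTRY_SIZE * WAYS))  /* 4096 sets        */

int main(void)
{
    uint64_t addr   = 0x00007f1234567908ULL;                    /* arbitrary example    */
    uint64_t offset = addr % ENTRY_SIZE;                        /* byte within entry    */
    uint64_t set    = (addr / ENTRY_SIZE) % NUM_SETS;           /* picks 1 of 4096 sets */
    uint64_t tag    = addr / ((uint64_t)ENTRY_SIZE * NUM_SETS); /* compared in 8 ways   */

    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```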

Your use of the term "bank" refers to one such "set" of the 8-way associativity, so the answer to your question is "8". And again, that is one L2 cache per core, and each has this structure.

Your assumption about simultaneous access is a valid one as well. It is documented e.g. for ARM's Cortex A15. However, if and how the sets or banks of this cache can be accessed independently is anyone's guess. The Wikipedia diagram shows a 256-bit bus between the L1 data cache and the L2 cache. This could imply both that it is possible to access 4 ways independently (4*64=256, i.e. more than 1 memory load/store transferred at a given time), and that the slower L2 cache feeds 4 cache lines simultaneously to the faster L1 cache in what one would call a burst.

This assumption is supported by the fact that the system architecture manual that can be found on Intel's page lists, in chapter 2.2.6, the later Sandy Bridge improvements, emphasizing "Internal bandwidth of two loads and one store each cycle." So CPUs before Sandy Bridge should have a smaller number of concurrent loads/stores.

Note that there is a difference between the number of "in flight" loads/stores and the amount of data actually transmitted. "In flight" operations are those currently being executed. In the case of a load, that can entail waiting for memory to yield data after all caches have reported misses. So you can have many loads going on in parallel, yet still have the data bus between any two caches used only once at a given time. The Sandy Bridge improvement above actually widens that data bus to 2 loads and 1 store transmitting data at the same time, which Nehalem (one "tock", or one architecture, before Sandy Bridge) cannot do.

Your intuition is not correct on a few accounts:

  1. Hyper-threading, and multithreading in general, allows the CPU to execute more than one instruction per cycle (Nehalem, chapter 2.2.5: "Provides two hardware threads (logical processors) per core. Takes advantage of 4-wide execution engine"). So it makes sense to support multiple concurrent loads/stores to the L1 cache as well.
  2. The L2 cache serves both the L1 data and the L1 instruction cache - you're correct on that part. For the reason given in (1), it may make sense to support more than 2 simultaneous operations.
  3. While you could generally scale that number up for the L3 cache, in practice it does not make sense. I don't know where you got the number 32 from - maybe it is just a guess. Every additional access point ("bank" in your terminology) needs its own address decoders and tag arrays (for handling the address comparisons against the cache lines, the replacement strategy, and the cache data flags (dirty bit, etc.)). So every access port requires overhead in transistors, and thus area and power, on the silicon, and every port that exists slows down cache access even when it is not in use (the details are out of scope of this answer). It is a delicate design decision, and 32 is way too high. For any kind of memory inside a CPU, the numbers range from 1 to 6-8 read ports and 1 to 2-4 write ports. There may be exceptions, of course.

Regarding your point about software optimizations: worry about it if you are a low-level hardware/firmware developer. Otherwise just follow the high-level ideas: if you can, keep the innermost loop of intense operations small enough that it fits into the L3 cache. Do not start more threads doing intense computing on local data than you have cores. If you do start to worry about such speed implications, start compiling/optimizing your code with the matching CPU switches, and control the other tasks on the machine (even infrastructure services).
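A minimal sketch of the "keep the working set small" idea, using a hypothetical blocked matrix transpose; the tile size is an assumption you would tune to your actual cache sizes:

```c
#include <stddef.h>

#define N    4096
#define TILE 64   /* chosen so one TILE x TILE tile fits in cache (assumption) */

/* Transpose src into dst tile by tile, so each tile stays cache-resident
   instead of streaming entire rows/columns through the caches. */
void transpose_blocked(const double *src, double *dst)
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    dst[j * N + i] = src[i * N + j];
}
```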

In summary:

  • Nehalem's L2 cache is 8-way associative.
  • It supports fewer than 2 simultaneous loads plus 1 store - likely just one operation at a time. Each load/store can transmit 256 bits at a time to/from the L1 data cache.
  • The number of simultaneous load/store operations does not scale to 32 for the L3 cache, due to physical design restrictions (timing/area/power).
  • You should not worry about these details in your applications - except when you know for sure that you have to (e.g. in high performance computing).
