Iterating over a large data set in a long-running Python process - memory issues?


I am working on a long-running Python program (one part is a Flask API, the other a real-time data fetcher).

Both long-running processes iterate quite often (the API one might do so hundreds of times per second) over large data sets (second-by-second observations of economic series, for example 1-5 MB worth of data or more). They interpolate, compare, and do calculations between series, etc.

What techniques, for the sake of keeping the processes alive, can I practice when iterating over / passing parameters for / processing these large data sets? For instance, should I use the gc module and collect manually?

Update

I am a C/C++ developer and would have no problem (and would enjoy) writing parts of this in C++, but I have zero experience doing so from Python. How do I get started?

Any advice is appreciated. Thanks!

Working with large datasets isn't necessarily going to cause memory complications. As long as you use sound approaches when you view and manipulate your data, you can typically make frugal use of memory.

There are two concepts you need to consider as you build the models that process your data.

  1. What is the smallest element of your data that you need access to in order to perform a given calculation? For example, you might have a 300 GB text file filled with numbers. If you're looking to calculate the average of those numbers, read one number at a time and keep a running average. In this example, the smallest element is a single number in the file, since that is the only element of the data set we need to consider at any point in time.

  2. How can you model your application so that you access these elements iteratively, one at a time, during that calculation? In our example, instead of reading the entire file at once, we read one number from the file at a time. With this approach we use a tiny amount of memory, but can process an arbitrarily large data set. Instead of passing a reference to the whole dataset around in memory, pass a view of the dataset which knows how to load specific elements on demand (and which can be freed once they have been worked with). This is similar in principle to buffering and is the approach many iterators take (e.g., xrange, open's file object, etc.); a minimal sketch of this pattern follows this list.
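
To make both points concrete, here is a minimal sketch, assuming a hypothetical file called observations.txt with one number per line. The generator acts as the "view" that loads elements on demand, and the consumer keeps only constant-sized state:

    def stream_numbers(path):
        # A lazy "view" of the dataset: yields one number at a time,
        # so the whole file is never held in memory.
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield float(line)

    def running_average(numbers):
        # Constant-sized state (a count and a running total),
        # regardless of how many numbers are streamed through.
        total, count = 0.0, 0
        for x in numbers:
            total += x
            count += 1
        return total / count if count else 0.0

    # Memory use stays flat no matter how large the file is.
    average = running_average(stream_numbers("observations.txt"))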

In general, the trick is understanding how to break your problem down into tiny, constant-sized pieces, and then stitching those pieces together one by one to calculate your result. You'll find that these tenets of data processing go hand-in-hand with building applications that support massive parallelism, as well.

Looking towards gc is jumping the gun. You've only provided a high-level description of what you are working on, but from what you've said, there is no reason you need to complicate things by poking around in memory management yet. Depending on the type of analytics you are doing, consider investigating numpy, which aims to lighten the burden of heavy statistical analysis.
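
For instance, here is a minimal sketch of the kind of series comparison the question mentions, done with numpy (the timestamps and data below are made up purely for illustration): one series is interpolated onto the other's timestamps, and the comparison happens in vectorized operations rather than a Python-level loop.

    import numpy as np

    # Hypothetical series: one hour of second-by-second observations,
    # sampled at slightly different timestamps.
    t_a = np.arange(0, 3600, 1.0)       # timestamps of series A
    a = np.random.random(t_a.size)      # observations of series A
    t_b = np.arange(0.5, 3600, 1.0)     # timestamps of series B
    b = np.random.random(t_b.size)      # observations of series B

    # Interpolate B onto A's timestamps, then compare the two series
    # with vectorized operations instead of a Python-level loop.
    b_on_a = np.interp(t_a, t_b, b)
    spread = a - b_on_a
    print(spread.mean(), spread.std())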

