algorithm - Optimal Bucket Size and No. of Buckets


Sorry, this post is not strictly about coding; it is more about data structures and algorithms. I have a large amount of data, with each value having a different frequency. An approximate plot of it looks like a bell curve. I want to display the data in ranges that describe the frequency of each range precisely. E.g. a single range covering the entire data set only gives the total frequency, so the range or bucket size is not precise and could be made more precise. (E.g. if the data is more concentrated in a particular frequency zone, I could build a bucket there that covers a smaller range of data with more closely related frequencies.)
Regarding the algorithm: I thought of something related to binary search. Any ideas, folks?

I'm not sure I'm following, but it seems you are looking for k bins such that, for each two bins, the probability of a data point falling in one bin is identical to the probability of it falling in the other.

From your description, the data seems to be normally distributed, or t-distributed.

One can evaluate the mean and standard deviation of the data; let the extracted S.D. be s and the mean be u.

The standard formulas for evaluating the mean and S.D. of a sample are (1):

    u = (x1 + x2 + ... + xn) / n            (simple average)
    s^2 = sigma((xi - u)^2) / (n - 1)       (sum over i = 1..n)
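As a rough illustration of these two formulas, here is a minimal Python sketch. The function name `sample_mean_and_sd` and the `data` values are made up for the example; the result is cross-checked against the standard library's `statistics.stdev`, which also uses the n - 1 denominator.

```python
import math
import statistics


def sample_mean_and_sd(data):
    """Return (u, s): the simple average and the sample S.D. (n - 1 denominator)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points for a sample S.D.")
    u = sum(data) / n
    s_squared = sum((x - u) ** 2 for x in data) / (n - 1)
    return u, math.sqrt(s_squared)


# Hypothetical sample, only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u, s = sample_mean_and_sd(data)

# The hand-written formula should agree with the standard library.
print(u, s, statistics.stdev(data))
```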

Given this information, you can estimate the distribution of the data, N(u, s^2), and create a random variable: X ~ N(u, s^2) (2)

Now all that is left is finding a, b, ... as follows (assuming 10 buckets; this can be modified as you wish):

    P(X < a) = 0.1
    P(X < b) = 0.2
    P(X < c) = 0.3
    ...

After finding a, b, c, ... you have your bins: (-infinity, a], (a, b], (b, c], ...
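The whole procedure can be sketched in Python roughly as follows. This is only an illustration under the normality assumption above, not a definitive implementation: the helper names `equal_probability_edges` and `bucket_index` and the sample values are made up for the example. It relies on `statistics.NormalDist.inv_cdf` (Python 3.8+) to solve P(X < a) = 0.1, P(X < b) = 0.2, and so on, and on `bisect` (a binary search) to place a value into its bin.

```python
from bisect import bisect_left
from statistics import NormalDist, mean, stdev


def equal_probability_edges(data, k=10):
    """Return the k - 1 interior bin edges a, b, c, ... of k equal-probability bins.

    Assumes the data is roughly normal: fits N(u, s^2) and inverts its CDF.
    """
    dist = NormalDist(mean(data), stdev(data))
    return [dist.inv_cdf(i / k) for i in range(1, k)]


def bucket_index(edges, x):
    """Index of the bin (-inf, a], (a, b], ..., (last, +inf) that contains x."""
    return bisect_left(edges, x)


# Hypothetical sample, only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
edges = equal_probability_edges(data, k=10)
print(edges)
print(bucket_index(edges, 6.3))
```

The bins come out narrow near the mean, where the data is concentrated, and wide in the tails, which is the behaviour the question asks for. If the data turns out not to be bell-shaped, the same idea works with empirical quantiles of the sorted data instead of the fitted normal.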


(1) Evaluating the variance: http://en.wikipedia.org/wiki/variance#population_variance_and_sample_variance
(2) The real distribution of the variable is a t-distribution, since the variance is unknown and is extracted from the data. However, for large enough n, the t-distribution approaches the normal distribution.

