C# - Create document-term matrix from dictionary


I'm trying to pre-process a text file in which each line holds the bi-gram words of one document together with their frequencies in that document. Here is an example of a line:

i_like 1 you_know 2 .... not_good 1

I managed to create a dictionary for the whole corpus. Now I want to read the corpus line by line and, using the dictionary, create a document-term matrix in which each element (i, j) is the frequency of term "j" in document "i".

Create a function that generates an integer index for each word using a dictionary:

    Dictionary<string, int> m_wordIndexes = new Dictionary<string, int>();

    // Returns the index already assigned to the word, or assigns the next free one.
    int GetWordIndex(string word)
    {
        int result;
        if (!m_wordIndexes.TryGetValue(word, out result))
        {
            result = m_wordIndexes.Count;
            m_wordIndexes.Add(word, result);
        }
        return result;
    }
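To see how the indices come out, here is a minimal self-contained sketch. The class name `WordIndexer` and the sample words are only for illustration and are not part of the original answer; the lookup logic is the same as above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper class, used only for this illustration.
class WordIndexer
{
    private readonly Dictionary<string, int> m_wordIndexes = new Dictionary<string, int>();

    // Returns the existing index of a word, or assigns the next free one.
    public int GetWordIndex(string word)
    {
        int result;
        if (!m_wordIndexes.TryGetValue(word, out result))
        {
            result = m_wordIndexes.Count;
            m_wordIndexes.Add(word, result);
        }
        return result;
    }

    static void Main()
    {
        var indexer = new WordIndexer();
        Console.WriteLine(indexer.GetWordIndex("i_like"));   // 0
        Console.WriteLine(indexer.GetWordIndex("you_know")); // 1
        Console.WriteLine(indexer.GetWordIndex("i_like"));   // 0 again: the word is already known
    }
}
```

Indices are handed out in first-seen order, so the column order of the matrix depends on the order in which words appear in the corpus.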

The resulting matrix is:

    List<List<int>> m_matrix = new List<List<int>>();

Processing each line of the text file generates one row of the matrix:

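    List<int> ProcessLine(string line)
    {
        List<int> result = new List<int>();
        // Split the line into alternating word / number-of-occurrences tokens.
        string[] tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < tokens.Length; i += 2)
        {
            string word = tokens[i];
            int numberOfOccurrences = int.Parse(tokens[i + 1]);
            int index = GetWordIndex(word);
            // Pad with zeros until position `index` exists.
            while (index >= result.Count)
            {
                result.Add(0);
            }
            // Assign in place; Insert() would shift the entries that follow.
            result[index] = numberOfOccurrences;
        }
        return result;
    }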

Read the text file one line at a time, calling ProcessLine() on each line and adding the resulting list to m_matrix.
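Putting the pieces together, the whole procedure could look roughly like the sketch below. It assumes every line has the form "word count word count ..." separated by whitespace, as in the example line from the question. The class name `DocumentTermMatrix`, the methods `AddLine` and `PadRows`, and the two sample lines are made up for illustration. Because the vocabulary grows while the file is read, earlier rows end up shorter than later ones, so the sketch pads all rows to the same width at the end.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class for illustration; not part of the original answer.
class DocumentTermMatrix
{
    private readonly Dictionary<string, int> m_wordIndexes = new Dictionary<string, int>();
    private readonly List<List<int>> m_matrix = new List<List<int>>();

    public List<List<int>> Rows { get { return m_matrix; } }

    // Returns the existing index of a word, or assigns the next free one.
    private int GetWordIndex(string word)
    {
        int result;
        if (!m_wordIndexes.TryGetValue(word, out result))
        {
            result = m_wordIndexes.Count;
            m_wordIndexes.Add(word, result);
        }
        return result;
    }

    // Converts one "word count word count ..." line into one matrix row.
    public void AddLine(string line)
    {
        var row = new List<int>();
        // A null separator array makes Split break on any whitespace.
        string[] tokens = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < tokens.Length; i += 2)
        {
            int index = GetWordIndex(tokens[i]);
            while (index >= row.Count)
            {
                row.Add(0);
            }
            row[index] += int.Parse(tokens[i + 1]);
        }
        m_matrix.Add(row);
    }

    // Pads every row with zeros up to the final vocabulary size.
    public void PadRows()
    {
        foreach (var row in m_matrix)
        {
            while (row.Count < m_wordIndexes.Count)
            {
                row.Add(0);
            }
        }
    }

    static void Main()
    {
        // Stand-in for the lines of the real file (e.g. from File.ReadLines); sample data is made up.
        string[] corpus = { "i_like 1 you_know 2", "you_know 1 not_good 3" };

        var dtm = new DocumentTermMatrix();
        foreach (string line in corpus)
        {
            dtm.AddLine(line);
        }
        dtm.PadRows();

        foreach (var row in dtm.Rows)
        {
            Console.WriteLine(string.Join(" ", row));
        }
        // Prints:
        // 1 2 0
        // 0 1 3
    }
}
```

For a large vocabulary most entries will be zero, so a dense list per document can become memory-hungry; whether that matters depends on the size of the corpus.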

