C# - Create document-term matrix from dictionary
I'm trying to pre-process a text file in which each line holds the bi-grams of one document together with their frequency in that document. Here is an example of a line:
i_like 1 you_know 2 .... not_good 1
I managed to create a dictionary for the whole corpus. Now I want to read the corpus line by line and, using the dictionary, create a document-term matrix in which each element (i,j) is the frequency of term "j" in document "i".
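Create a function that generates an integer index for each word, using a dictionary:

    // Requires: using System.Collections.Generic;
    // Maps each word to its column index in the matrix.
    Dictionary<string, int> m_wordindexes = new Dictionary<string, int>();

    int getwordindex(string word)
    {
        int result;
        if (!m_wordindexes.TryGetValue(word, out result))
        {
            // First time this word is seen: assign it the next free index.
            result = m_wordindexes.Count;
            m_wordindexes.Add(word, result);
        }
        return result;
    }

The result matrix is: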

    // One inner list per document (row); position j holds the count of word j.
    List<List<int>> m_matrix = new List<List<int>>();

Processing each line of the text file generates one row of the matrix:
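
    // Requires: using System; using System.Collections.Generic;
    List<int> processline(string line)
    {
        List<int> result = new List<int>();
        // Split the line into a sequence of word / number of occurrences
        // (assumes whitespace-separated "word count" pairs, as in the example line).
        string[] tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        // For each word / number of occurrences:
        for (int t = 0; t + 1 < tokens.Length; t += 2)
        {
            string word = tokens[t];
            int numberofoccurences = int.Parse(tokens[t + 1]);
            int index = getwordindex(word);
            // Pad with zeros until position 'index' exists, then store the count.
            // (Insert() would shift existing entries whenever index < result.Count.)
            while (result.Count <= index)
            {
                result.Add(0);
            }
            result[index] = numberofoccurences;
        }
        return result;
    }

Then read the text file one line at a time, calling processline() on each line and adding the resulting list to m_matrix. Each row is only as long as the highest word index seen on that line, so treat any missing trailing entries as zero.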
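To show how the pieces fit together, here is a minimal self-contained sketch of the same idea. It is not the code from the answer: the class name DocumentTermMatrixSketch and the method Build are made up for illustration, and it assumes every line consists of whitespace-separated "word count" pairs. It also goes one step further than the answer by padding every row to the full vocabulary size, which is one possible way to get a rectangular matrix.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class DocumentTermMatrixSketch
{
    // Returns one int[] per input line; cell [i][j] is the count of word j in line i.
    // 'wordIndexes' may be pre-filled from the corpus; unseen words are appended to it.
    public static List<int[]> Build(IEnumerable<string> lines, Dictionary<string, int> wordIndexes)
    {
        // First pass: keep each row sparse (column index -> count),
        // because the final number of columns is not known yet.
        var sparseRows = new List<Dictionary<int, int>>();
        foreach (string line in lines)
        {
            // Assumption: tokens alternate between a word and its count.
            string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            var row = new Dictionary<int, int>();
            for (int t = 0; t + 1 < tokens.Length; t += 2)
            {
                int index;
                if (!wordIndexes.TryGetValue(tokens[t], out index))
                {
                    index = wordIndexes.Count;
                    wordIndexes.Add(tokens[t], index);
                }
                int count = int.Parse(tokens[t + 1]);
                int previous;
                row.TryGetValue(index, out previous); // stays 0 if the word is new to this row
                row[index] = previous + count;
            }
            sparseRows.Add(row);
        }

        // Second pass: the vocabulary is complete, so every row gets the same width.
        var matrix = new List<int[]>();
        foreach (Dictionary<int, int> row in sparseRows)
        {
            var dense = new int[wordIndexes.Count];
            foreach (KeyValuePair<int, int> cell in row)
            {
                dense[cell.Key] = cell.Value;
            }
            matrix.Add(dense);
        }
        return matrix;
    }
}
```

A call could look like DocumentTermMatrixSketch.Build(File.ReadLines("corpus.txt"), new Dictionary<string, int>()), where "corpus.txt" is a placeholder path and File.ReadLines comes from System.IO. For a large vocabulary a dense int[] per document wastes memory, so keeping the sparse rows from the first pass may be the better choice.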