I am working on a database self project. I have an input file got from: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
After processing into 1400 separate file, each named 00001.txt,... 01400.txt...) and after applying Stemming on them, I will store them separately in a specific folder lets call it StemmedFolder with the following format:
in StemmedFolder: 00001.txt includes:
investig
aerodynam
wing
slipstream
brenckman
experiment
investig
aerodynam
wing
in StemmedFolder: 00756.txt includes:
remark
eddi
viscos
compress
mix
flow
lu
ting
And so on....
I wrote the codes that do:
- get the StemmedFolder, Count the Unique words
- Sort Alphabetically
- Add the ID of the document
- save each to a new file 00001.txt to 01400.txt as will be described
{I can provide my codes for these 4 sections in case somebody needs to see how is the implementation or change or any edit}
output of each file will be result to a separate file. (1400, each named 00001.txt, 00002.txt...) in a specific folder lets call it FrequenceyFolder with the following format:
in FrequenceyFolder: 00001.txt includes:
00001,aerodynam,2
00001,agre,3
00001,angl,1
00001,attack,7
00001,basi,4
....
in FrequenceyFolder: 00999.txt includes:
00999,aerodynam,5
00999,evalu,1
00999,lift,3
00999,ratio,2
00999,result,9
....
in FrequenceyFolder: 01400.txt includes:
01400,subtract,1
01400,support,1
01400,theoret,1
01400,theori,1
01400,.....
______________
Now my question:
I need to combine these 1400 files again to output a txt file that looks like this format with some calculation:
'aerodynam' totalFrequency=3docs: [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
'book' totalFrequncy=2docs: [[Doc_00562,6],[Doc_01111,1]
....
....
'result' totalFrequency=1doc: [[Doc_00010,5]]
....
....
'zzzz' totalFrequency=1doc: [[Doc_01235,1]]
Thanks for spending time reading this long post
You can use a
Map
ofList
.Map<String,List<FileInformation>> statistics = new HashMap<>()
In the above map, the key will be the word and the value will be a
List<FileInformation>
object describing the statistics of individual files containing the word. TheFileInformation
class can be declared as follows :To populate the above Map, use the following steps :
FrequencyFolder
Map
.FileInformation
object and set theoccurrenceCount
to the number of occurrences found and set thefileName
to the name of the file it was found in. Add this object in theList<FileInformation>
corresponding to the key created in step 2.FileInfomation
object and add it to theList<FileInformation>
corresponding to the entry in the map for the word.Once you have the
Map
populated, printing the statistics should be a piece of cake.