I have a huge File around 2 GB having more then 20million rows
what i want is
Input File will be like this
07.SHEKHAR@GMAIL.COM,1
07SHIBAJI@GMAIL.COM,1
07.SHINDE@GMAIL.COM,1
07.SHINDE@GMAIL.COM,2
07.SHINDE@GMAIL.COM,3
07.SHINDE@GMAIL.COM,4
07.SHINDE@GMAIL.COM,5
07.SHINDE@GMAIL.COM,6
07.SHINDE@GMAIL.COM,7
07.SHOBHIT@GMAIL.COM,1
07SKERCH@RUSKIN.AC.UK,1
07SONIA@GMAIL.COM,1
07SONIA@GMAIL.COM,2
07SONIA@GMAIL.COM,3
07SRAM@GMAIL.COM,1
07SRAM@GMAIL.COM,2
07.SUMANTA@GMAIL.COM,1
07SUPRIYO@GMAIL.COM,1
07SUPRIYO@GMAIL.COM,2
07SUPRIYO@GMAIL.COM,3
07.SUSHMA@GMAIL.COM,1
07.SWETA@GMAIL.COM,1
07.SWETA@GMAIL.COM,2
07.SWETA@GMAIL.COM,3
07.TEENA@GMAIL.COM,1
07.TEENA@GMAIL.COM,2
07.UDAY@GMAIL.COM,1
07.UMESH@GMAIL.COM,1
07VAISHALISINGH@GMAIL.COM,1
07.VISHAL@GMAIL.COM,1,1
07.VISHAL@GMAIL.COM,2
07.VISHAL@GMAIL.COM,3
07.VISHAL@GMAIL.COM,4
07.VISHAL@GMAIL.COM,5
07.VISHAL@GMAIL.COM,6
07.VISHAL@GMAIL.COM,7
07.YASH@GMAIL.COM,1
07.YASH@GMAIL.COM,2
07.YASH@GMAIL.COM,3
07.YASH@GMAIL.COM,4
Output File Needed:-
07.SHEKHAR@GMAIL.COM,1,1
07SHIBAJI@GMAIL.COM,1,1
07.SHINDE@GMAIL.COM,1,7
07.SHINDE@GMAIL.COM,2,7
07.SHINDE@GMAIL.COM,3,7
07.SHINDE@GMAIL.COM,4,7
07.SHINDE@GMAIL.COM,5,7
07.SHINDE@GMAIL.COM,6,7
07.SHINDE@GMAIL.COM,7,7
07.SHOBHIT@GMAIL.COM,1,1
07SKERCH@RUSKIN.AC.UK,1,1
07SONIA@GMAIL.COM,1,3
07SONIA@GMAIL.COM,2,3
07SONIA@GMAIL.COM,3,3
07SRAM@GMAIL.COM,1,2
07SRAM@GMAIL.COM,2,2
07.SUMANTA@GMAIL.COM,1,1
07SUPRIYO@GMAIL.COM,1,3
07SUPRIYO@GMAIL.COM,2,3
07SUPRIYO@GMAIL.COM,3,3
07.SUSHMA@GMAIL.COM,1,1
07.SWETA@GMAIL.COM,1,3
07.SWETA@GMAIL.COM,2,3
07.SWETA@GMAIL.COM,3,3
07.TEENA@GMAIL.COM,1,2
07.TEENA@GMAIL.COM,2,2
07.UDAY@GMAIL.COM,1,1
07.UMESH@GMAIL.COM,1,1
07VAISHALISINGH@GMAIL.COM,1,1
07.VISHAL@GMAIL.COM,1,7
07.VISHAL@GMAIL.COM,2,7
07.VISHAL@GMAIL.COM,3,7
07.VISHAL@GMAIL.COM,4,7
07.VISHAL@GMAIL.COM,5,7
07.VISHAL@GMAIL.COM,6,7
07.VISHAL@GMAIL.COM,7,7
07.YASH@GMAIL.COM,1,4
07.YASH@GMAIL.COM,2,4
07.YASH@GMAIL.COM,3,4
07.YASH@GMAIL.COM,4,4
i,e 1 more column containing maximum no of entries corresponding to a particular email in each column so that every row now contains maximum occurence of each email. I am looking for a feasible soln for such a large file preferably in python or shell script and complexity of O(n) or O(nlogn) O(n**2) wont do in this case
I understand that you want to have a third column containing the maximun value of the second column for each mail.
I that case, I would use a map to store the maximun second column value found for each mail:
Pseudo code:
Lets try a python script since you may be more familiar with that language, doesn't require huge memory or hard disk space. Tested on Python 2.7 and 3.2
Anyone want to try an
awk
script?