Print the maximum entry of every email next to each of its rows

Posted 2019-07-25 00:36

I have a huge file, around 2 GB, with more than 20 million rows.

Here is what I want.

The input file looks like this:

07.SHEKHAR@GMAIL.COM,1
07SHIBAJI@GMAIL.COM,1
07.SHINDE@GMAIL.COM,1
07.SHINDE@GMAIL.COM,2
07.SHINDE@GMAIL.COM,3
07.SHINDE@GMAIL.COM,4
07.SHINDE@GMAIL.COM,5
07.SHINDE@GMAIL.COM,6
07.SHINDE@GMAIL.COM,7
07.SHOBHIT@GMAIL.COM,1
07SKERCH@RUSKIN.AC.UK,1
07SONIA@GMAIL.COM,1
07SONIA@GMAIL.COM,2
07SONIA@GMAIL.COM,3
07SRAM@GMAIL.COM,1
07SRAM@GMAIL.COM,2
07.SUMANTA@GMAIL.COM,1
07SUPRIYO@GMAIL.COM,1
07SUPRIYO@GMAIL.COM,2
07SUPRIYO@GMAIL.COM,3
07.SUSHMA@GMAIL.COM,1
07.SWETA@GMAIL.COM,1
07.SWETA@GMAIL.COM,2
07.SWETA@GMAIL.COM,3
07.TEENA@GMAIL.COM,1
07.TEENA@GMAIL.COM,2
07.UDAY@GMAIL.COM,1
07.UMESH@GMAIL.COM,1
07VAISHALISINGH@GMAIL.COM,1
07.VISHAL@GMAIL.COM,1
07.VISHAL@GMAIL.COM,2
07.VISHAL@GMAIL.COM,3
07.VISHAL@GMAIL.COM,4
07.VISHAL@GMAIL.COM,5
07.VISHAL@GMAIL.COM,6
07.VISHAL@GMAIL.COM,7
07.YASH@GMAIL.COM,1
07.YASH@GMAIL.COM,2
07.YASH@GMAIL.COM,3
07.YASH@GMAIL.COM,4

Required output file:

07.SHEKHAR@GMAIL.COM,1,1
07SHIBAJI@GMAIL.COM,1,1
07.SHINDE@GMAIL.COM,1,7
07.SHINDE@GMAIL.COM,2,7
07.SHINDE@GMAIL.COM,3,7
07.SHINDE@GMAIL.COM,4,7
07.SHINDE@GMAIL.COM,5,7
07.SHINDE@GMAIL.COM,6,7
07.SHINDE@GMAIL.COM,7,7
07.SHOBHIT@GMAIL.COM,1,1
07SKERCH@RUSKIN.AC.UK,1,1
07SONIA@GMAIL.COM,1,3
07SONIA@GMAIL.COM,2,3
07SONIA@GMAIL.COM,3,3
07SRAM@GMAIL.COM,1,2
07SRAM@GMAIL.COM,2,2
07.SUMANTA@GMAIL.COM,1,1
07SUPRIYO@GMAIL.COM,1,3
07SUPRIYO@GMAIL.COM,2,3
07SUPRIYO@GMAIL.COM,3,3
07.SUSHMA@GMAIL.COM,1,1
07.SWETA@GMAIL.COM,1,3
07.SWETA@GMAIL.COM,2,3
07.SWETA@GMAIL.COM,3,3
07.TEENA@GMAIL.COM,1,2
07.TEENA@GMAIL.COM,2,2
07.UDAY@GMAIL.COM,1,1
07.UMESH@GMAIL.COM,1,1
07VAISHALISINGH@GMAIL.COM,1,1
07.VISHAL@GMAIL.COM,1,7
07.VISHAL@GMAIL.COM,2,7
07.VISHAL@GMAIL.COM,3,7
07.VISHAL@GMAIL.COM,4,7
07.VISHAL@GMAIL.COM,5,7
07.VISHAL@GMAIL.COM,6,7
07.VISHAL@GMAIL.COM,7,7
07.YASH@GMAIL.COM,1,4
07.YASH@GMAIL.COM,2,4
07.YASH@GMAIL.COM,3,4
07.YASH@GMAIL.COM,4,4

That is, one more column holding the maximum entry for that row's email, so that every row also carries the maximum for its email. I am looking for a feasible solution for such a large file, preferably in Python or a shell script, with O(n) or O(n log n) complexity; O(n**2) won't do in this case.

2 Answers
别忘想泡老子
#2 · 2019-07-25 00:43

I understand that you want a third column containing the maximum value of the second column for each mail.

In that case, I would use a map to store the maximum second-column value found for each mail:

Pseudo code:

  1. Create an empty map keyed by mail string -> M
  2. For each line (l) of the input file:
    1. if (l.mail not in M) OR (l.secondColumn > M[l.mail]) Then: M[l.mail] = l.secondColumn
  3. Create a new file -> fOut
  4. Read the input file a second time (for each line l):
    1. Append l, with M[l.mail] as its third column, to fOut.
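
The map idea can be sketched in Python roughly as follows. Since the requested output keeps every input row, this sketch reads the file twice: once to fill the map and once to write each row with its maximum appended. The function name `append_max_column` and the file paths are made up for illustration, and the sketch assumes the second column is always an integer. Memory use grows with the number of distinct mails, not with the number of rows.

```python
def append_max_column(in_path, out_path):
    """Write every input row to out_path with the per-mail maximum appended."""
    max_by_mail = {}

    # Pass 1: remember the largest second-column value seen for each mail.
    with open(in_path) as src:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            mail, value = line.split(",")[:2]
            value = int(value)
            if mail not in max_by_mail or value > max_by_mail[mail]:
                max_by_mail[mail] = value

    # Pass 2: re-read the input and append the stored maximum to each row.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            mail, value = line.split(",")[:2]
            dst.write("%s,%s,%d\n" % (mail, value, max_by_mail[mail]))
```

Both passes are linear in the number of rows, so the whole thing stays O(n) as long as dictionary lookups behave as expected.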
Deceive 欺骗
#3 · 2019-07-25 00:56

Let's try a Python script, since you may be more familiar with that language. It doesn't require much memory or disk space. Tested on Python 2.7 and 3.2.

#!/usr/bin/python
import fileinput

# Assumes all rows for one email are adjacent and numbered 1..count.
email = ""  # Email of the group currently being counted
count = 0   # Number of rows seen so far for that email

for line in fileinput.input("word.txt"):  # Iterator: process one line at a time
    fields = line.strip().split(",")
    if email != fields[0]:  # New email: print the finished group, then reset
        for n in range(count):
            print(email + "," + str(n + 1) + "," + str(count))
        email = fields[0]
        count = 1
    else:  # Same email: increment the count
        count = count + 1

# Print the final group
for n in range(count):
    print(email + "," + str(n + 1) + "," + str(count))
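
A variation on the same streaming idea, sketched with `itertools.groupby`: instead of counting rows, it takes the actual maximum of the second column within each run of identical emails. It still assumes that all rows for one email are adjacent, as in the sample input, and that the second column is an integer. The name `with_group_max` is made up for illustration.

```python
from itertools import groupby


def with_group_max(lines):
    """Yield 'email,value,max' for each row, one run of equal emails at a time."""
    # Keep only the first two fields and skip blank lines.
    rows = (line.strip().split(",")[:2] for line in lines if line.strip())
    for mail, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)  # only one email's rows are held in memory
        top = max(int(value) for _, value in group)
        for _, value in group:
            yield "%s,%s,%d" % (mail, value, top)
```

To use it on a file, something like `for row in with_group_max(open("word.txt")): print(row)` should do; only the rows of the current email are buffered at any time.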

Anyone want to try an awk script?
