Hi all: I have a 3 GB Tomcat access log named urls, where each line is a URL. I want to count the occurrences of each URL and sort the URLs by their counts. I did it this way:
awk '{print $0}' urls | sort | uniq -c | sort -nr >> output
But it is taking a really long time to finish: it has already been running for 30 minutes and is still going. The log file looks like this:
/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/loan/recent_apply_info?passportId=Y20151206000011745
/loan/recent_apply_info?passportId=Y20160331000000423
/open_api/borrow_business/get_apply_by_user
...
Is there any other way I could process and sort a 3 GB file? Thanks in advance!
I'm not sure why you're using awk at the moment - '{print $0}' just passes every line through unchanged, so it's not doing anything useful.
I would suggest using something like this:
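# count each distinct line in a single pass, then sort the counts in descending order
awk '{ count[$0]++ } END { for (url in count) print count[url], url }' urls | sort -rn > output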
This builds up a count of each URL as the file is read, so only the unique URLs need to be sorted at the end, rather than the entire 3 GB of input.
I generated a sample file of 3,200,000 lines, amounting to 3 GB, using Perl along these lines:
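# roughly 1,000 bytes per line times 3,200,000 lines comes to about 3 GB;
# the path shape here is made up - it just needs enough repetition to make counting meaningful
perl -e 'for (1..3200000) { printf "/open_api/fake_path/%06d/%s\n", int(rand(50000)), "x" x 970 }' > BigBoy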
I then tried sorting it in one step; then splitting it into 2 halves, sorting the halves separately, and merging the results; then the same with 4 parts; and again with 8.
This resulted, on my machine at least, in a very significant speedup.
Needless to say, the resulting sorted files are identical :-) Here is the script. The filename is hard-coded as BigBoy, but it could easily be changed, and the number of parts to split the file into must be supplied as a parameter.
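A sketch of that script, assuming GNU split and sort, with the per-part sorts run in parallel:

#!/bin/bash
# Split BigBoy into N line-aligned pieces (N is the first argument),
# sort the pieces in parallel, then merge the already-sorted
# results with sort -m.
parts=$1
split --number=l/"$parts" BigBoy piece-
for f in piece-??; do
    sort "$f" -o "$f.sorted" &
done
wait
sort -m piece-??.sorted > BigBoy.sorted
rm piece-?? piece-??.sorted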