I'm using the command below, wrapped in an alias, to print the sum of all file sizes by owner in a directory:
```sh
ls -l $dir | awk ' NF>3 { file[$3]+=$5 } \
END { for( i in file) { ss=file[i]; \
if(ss >=1024*1024*1024 ) {size=ss/1024/1024/1024; unit="G"} else \
if(ss>=1024*1024) {size=ss/1024/1024; unit="M"} else {size=ss/1024; unit="K"}; \
format="%.2f%s"; res=sprintf(format,size,unit); \
printf "%-8s %12d\t%s\n",res,file[i],i }}' | sort -k2 -nr
```
but it doesn't always seem to be fast.
Is it possible to get the same output in some other way, but faster?
Parsing output from `ls` is a bad idea. How about using `find` instead? Roughly (see the sketch after this list):

- `${dir} -maxdepth 1` — do not descend into sub-directories
- `-type f` — regular files only
- `-printf "%u %s\n"` — print owner name and file size for each file
- `-a` — the (implicit) and between the `find` tests
- awk `END {...}` — print out the hash contents, sorted by key, i.e. user name
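A sketch of that pipeline (assuming GNU `find` for `-printf`; the trailing `sort` provides the by-user-name ordering):

```sh
find "${dir}" -maxdepth 1 -type f -printf "%u %s\n" |
    awk '{ sum[$1] += $2 }                       # per-owner byte totals
         END { for (u in sum) printf "%-8s %12d\n", u, sum[u] }' |
    sort -k1,1                                   # order by user name
```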
A solution using Perl:
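The Perl program itself isn't preserved in this excerpt; a minimal sketch that produces the same per-owner totals for a single directory (readdir-based, so dot files and symlinked files get counted too) could look like this:

```perl
#!/usr/bin/perl
# sketch only -- sum file sizes per owner for the top level of a directory
use strict;
use warnings;

my $dir = shift // '.';
my %sum;

opendir my $dh, $dir or die "Cannot open $dir: $!";
while (defined(my $entry = readdir $dh)) {
    my $path = "$dir/$entry";
    next unless -f $path;                       # regular files (follows symlinks)
    my ($uid, $size) = (stat _)[4, 7];          # reuse the stat done by -f
    my $owner = getpwuid($uid) // $uid;         # fall back to the numeric uid
    $sum{$owner} += $size;
}
closedir $dh;

printf "%-8s %12d\n", $_, $sum{$_} for sort keys %sum;
```

A readdir-based scan counts dot files and symlinked files that `find -type f` skips, which is one possible source of small differences in file counts.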
Test run:
Interesting difference: the Perl solution discovers 3 more files in my test directory than the `find` solution. I have to ponder why that is...

Get a listing, add up sizes, and sort it by owner (with Perl)
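The listing itself is not shown here; a sketch of what the heading describes, assuming a `glob` over a single directory and `getpwuid` for the owner name:

```perl
#!/usr/bin/perl
# sketch: glob one directory, stat each file, total sizes per owner, sort by owner
use strict;
use warnings;

my $dir = shift // '.';
my %sizes;

for my $file (glob "$dir/*") {                  # assumes no glob metacharacters in $dir
    next unless -f $file;                       # skip subdirectories, sockets, etc.
    my ($uid, $size) = (stat _)[4, 7];          # reuse the stat done by -f
    my $owner = getpwuid($uid) // $uid;         # numeric uid if there is no name
    $sizes{$owner} += $size;
}

printf "%-8s %12d\n", $_, $sizes{$_} for sort keys %sizes;
```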
I didn't get to benchmark it, and it'd be worth trying it out against an approach where the directory is iterated over, as opposed to `glob`-ed (while I found `glob` much faster in a related problem).

I expect good runtimes in comparison with `ls`, which slows down dramatically as the file list in a single directory gets long. This is due to the system, so Perl will be affected as well, but as far as I recall it handles it far better. However, I've seen a dramatic slowdown only once entries get to half a million or so, not a few thousand, so I am not sure why it runs slow on your system.

If this needs to be recursive, then use File::Find. For example:
This scans a directory with 2.4 Gb of mostly small files, spread over a hierarchy of subdirectories, in a little over 2 seconds. `du -sh` took around 5 seconds (the first time round).

It is reasonable to bring these two into one script:
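A sketch of such a combined script, with a hypothetical `--recurse` switch to choose between the one-directory `glob` and the File::Find walk (non-recursive by default, to match the note below):

```perl
#!/usr/bin/perl
# sketch of a combined script (not the original)
use strict;
use warnings;
use File::Find qw(find);
use Getopt::Long qw(GetOptions);

GetOptions('recurse' => \my $recurse) or die "Usage: $0 [--recurse] [dir]\n";
my $dir = shift // '.';

my %sizes;
my $tally = sub {
    my ($file) = @_;
    return unless -f $file;                     # regular files only
    my ($uid, $size) = (stat _)[4, 7];          # reuse the stat done by -f
    my $owner = getpwuid($uid) // $uid;
    $sizes{$owner} += $size;
};

if ($recurse) {
    find( sub { $tally->($_) }, $dir );         # File::Find chdirs; $_ is the basename
}
else {
    $tally->($_) for glob "$dir/*";             # assumes no glob metacharacters in $dir
}

printf "%-8s %12d\n", $_, $sizes{$_} for sort keys %sizes;
```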
I find this to perform about the same as the one-dir-only code above, when run non-recursively (default as it stands).
Note that the File::Find::Rule interface has many conveniences but is slower in some important use cases, which clearly matters here. (That analysis should be redone since it's a few years old.)
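For reference, the File::Find::Rule interface mentioned above looks roughly like this (a convenience-only sketch, not used in any timing here):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find::Rule;

my $dir = shift // '.';
# declarative filters, then ->in() returns the matching paths
my @files = File::Find::Rule->file->in($dir);
print scalar(@files), " files under $dir\n";
```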
Using `datamash` (and Stefan Becker's `find` code):
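The exact command isn't preserved above; a sketch, assuming GNU datamash grouping on the first (tab-separated) field and summing the second:

```sh
find "${dir}" -maxdepth 1 -type f -printf "%u\t%s\n" |
    datamash -s -g 1 sum 2          # sort, group by owner, sum the sizes
```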
Did I see some awk in the OP? Here is one in GNU awk using the filefuncs extension:
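The script itself is missing from this excerpt; a sketch of the idea, assuming GNU awk with the bundled filefuncs extension (its `stat()` fills an array with entries such as "uid" and "size") and file names supplied on standard input, e.g. `find "$dir" -maxdepth 1 -type f | gawk -f sum_by_owner.awk` (the script name is just an example):

```awk
#!/usr/bin/gawk -f
@load "filefuncs"                   # provides stat(path, array)

BEGIN {
    # map uid -> user name from /etc/passwd (assumption: local accounts only)
    while ((getline line < "/etc/passwd") > 0) {
        split(line, f, ":")
        user[f[3]] = f[1]
    }
    close("/etc/passwd")
}

{   # each input line is one file name
    if (stat($0, st) == 0)
        sum[st["uid"]] += st["size"]
}

END {
    for (u in sum)
        printf "%-8s %12d\n", (u in user ? user[u] : u), sum[u]
}
```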
Sample outputs:
Another:
Yet another test with a million empty files:
Another perl one that displays total sizes sorted by user:
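The code isn't shown in this excerpt; a sketch matching that description, meant to be run from inside the directory of interest (totals in bytes, output sorted by user name):

```sh
perl -e '
    for (glob "*") {                        # entries in the current directory
        next unless -f;                     # regular files only
        my ($uid, $size) = (stat _)[4, 7];
        my $owner = getpwuid($uid) // $uid; # fall back to the numeric uid
        $total{$owner} += $size;
    }
    printf "%-9s %12d\n", $_, $total{$_} for sort keys %total;   # sorted by user
'
```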
Not sure why the question is tagged perl when awk is being used.
Here's a simple perl version:
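The script is not preserved here; the sketch below is reconstructed from the explanation that follows, so the exact names and output format are assumptions:

```perl
#!/usr/bin/perl
# reconstructed sketch -- see the notes below for what each piece does
use strict;
use warnings;

chdir( shift // '.' ) or die "chdir: $!";         # assumption: directory as first argument

my %h;
for (glob '.* *') {                               # glob gets our file list (dot files too)
    next if m/^\.\.?$/;                           # m// discards . and ..
    if ( my ($uid, $size) = (stat $_)[4, 7] ) {   # stat the size and uid
        $h{$uid} += $size;                        # %h holds the per-owner totals (bytes)
    }
}

for my $uid (sort keys %h) {
    my $owner = getpwuid($uid) // $uid;           # // provides a fallback (numeric uid)
    printf "%-9s %12d\n", $owner, $h{$uid} >> 10; # >>10 is integer divide by 1024
}
```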
`glob` gets our file list, `m//` discards `.` and `..`, `stat` gives the size and uid, and `%h` accumulates the per-owner totals (`>>10` is integer divide by 1024; `//` provides a fallback).

To exclude symlinks, subdirectories, etc., change the `if` to appropriate `-X` tests (e.g. `(-f $_)`, `(!-d $_ and !-l $_)`, etc.). See the perl docs on the `_` filehandle optimisation for caching stat results.