A good way to merge files according to the file na

Posted 2019-09-14 17:47

I have thousands of zipped csv files named like this:

result-20120705-181535.csv.gz

Here 181535 means 18:15:35. I want to merge these files on a daily basis (I have more than a week of data, all named like the example above), where a day runs from 2:00 am until 2:00 am the next day, and then move the processed files into a folder called merged.

So the current folder contains tons of .csv.gz files. I want to scan the names, merge everything matching 20120705-02*, 20120705-03*, ... up to 20120706-01* into 20120705-result.csv.gz, move those source files into a folder called merged, and then continue with the next day's data: 20120706-02* ... 20120707-01*.

I am wondering whether to use Python or a bash script for this, and how to do it.

Tags: python bash
2 answers
ら.Afraid
#2 · 2019-09-14 18:14

This answer is completely untested, but hopefully it will give a place to work from:

import datetime
import glob
import gzip
import os
import shutil
from collections import defaultdict

def day(fname):
    """
    Return the "logical" day for a file name, i.e. the timestamp minus
    2 hours, since your days run from 2 AM to 2 AM.
    """
    stamp = datetime.datetime.strptime(fname, 'result-%Y%m%d-%H%M%S.csv.gz')
    return (stamp - datetime.timedelta(hours=2)).strftime('%Y%m%d')

# Group the file names by logical day; sorting keeps each day in time order.
cat_together = defaultdict(list)
for f in sorted(glob.glob('result-*.csv.gz')):
    cat_together[day(f)].append(f)

os.makedirs('merged', exist_ok=True)
for d, files in cat_together.items():
    # The output name does not start with "result-", so the glob above skips it.
    with gzip.open(d + '-result.csv.gz', 'wb') as outfile:
        for f in files:
            with gzip.open(f, 'rb') as gfile:
                shutil.copyfileobj(gfile, outfile)  # stream instead of reading it all
            shutil.move(f, 'merged')
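Since the answer above is untested, here is a minimal sketch that checks only the "subtract two hours" bucketing on a few made-up file names, without touching any files. The function name logical_day and the sample names are assumptions for illustration; only the naming scheme comes from the question.

```python
import datetime
from collections import defaultdict

def logical_day(fname):
    # Parse the timestamp embedded in the name, then shift it back by
    # two hours so that 02:00:00 becomes the start of the day.
    stamp = datetime.datetime.strptime(fname, 'result-%Y%m%d-%H%M%S.csv.gz')
    return (stamp - datetime.timedelta(hours=2)).strftime('%Y%m%d')

# Hypothetical sample names around the 2 AM boundaries.
names = [
    'result-20120705-015959.csv.gz',
    'result-20120705-020000.csv.gz',
    'result-20120705-181535.csv.gz',
    'result-20120706-015959.csv.gz',
    'result-20120706-020000.csv.gz',
]

groups = defaultdict(list)
for name in sorted(names):
    groups[logical_day(name)].append(name)

for day_key in sorted(groups):
    print(day_key, groups[day_key])
```

A file stamped 01:59:59 on 6 July lands in the 20120705 bucket, while one stamped 02:00:00 opens the 20120706 bucket, which matches the window described in the question.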
我只想做你的唯一
#3 · 2019-09-14 18:17

Create a text file containing these lines:

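#!/bin/bash

# -p: do not fail if the folder already exists
mkdir -p merged
shopt -s extglob

# d1 is the day given on the command line (YYYYMMDD), d2 the day after it
d1=$1
d2=$(date +%Y%m%d -d "$d1 +1 day")

# Hours 02-29 of d1 (only 02-23 exist) plus hours 00-01 of d2
for f in result-@($d1-@(0[2-9]|[1-2][0-9])|$d2-0[01])*.csv.gz ; do
  gzip -cd "$f"
  mv "$f" merged/
done | gzip > "$d1-result.csv.gz"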

and save it with a .sh extension (say, myscript.sh). Next, in a terminal, type

chmod +x myscript.sh

Now you can type things like

./myscript.sh 20120705

which will then do as you described.
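To see which names the extglob pattern in the script is meant to select, here is a rough Python sketch of the same window test using a regular expression. The function name in_window is hypothetical, and the regex is slightly stricter than the shell pattern: it requires exactly four digits (minutes and seconds) after the hour, where the shell pattern accepts anything.

```python
import re

def in_window(name, d1, d2):
    # Hours 02-29 of d1 (only 02-23 occur in practice) or hours 00-01 of d2,
    # followed by minutes and seconds and the fixed suffix.
    pattern = r'result-(?:%s-(?:0[2-9]|[12][0-9])|%s-0[01])\d{4}\.csv\.gz' % (
        re.escape(d1), re.escape(d2))
    return re.fullmatch(pattern, name) is not None

# Hypothetical sample names checked against the 20120705 window.
for name in ['result-20120705-015959.csv.gz',
             'result-20120705-181535.csv.gz',
             'result-20120706-010000.csv.gz',
             'result-20120706-020000.csv.gz']:
    print(name, in_window(name, '20120705', '20120706'))
```

Only the second and third names fall inside the window for 20120705; the first belongs to the previous day and the last to the next one.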

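To automatically execute this on a daily basis, you can put a line in your /etc/crontab file. Cron does not start in your data folder, so change into it first (/path/to/csv/files is a placeholder for wherever the files and the script live), something like

2 2 * * * root cd /path/to/csv/files && ./myscript.sh

This runs at 02:02 every day, assuming creating the last .csv.gz file takes 1 minute, plus 1 extra minute just to be sure :)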

For this way of automation to work properly, the script above needs to be modified a bit, because cron passes it no date argument. At 02:02 the window that has just closed started yesterday and ended today, so change the two lines defining the dates:

d1=$(date +%Y%m%d -d "now -1 day")
d2=$(date +%Y%m%d)

That should do. As always, test it thoroughly before automating it!
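For anyone who prefers the Python route, the same two dates can be derived with datetime. This is only a sketch: the function name window_dates is hypothetical, and the current time is passed in as a parameter so the logic can be checked against fixed timestamps.

```python
import datetime

def window_dates(now):
    # d2 is the calendar day on which the job runs, d1 the day before it.
    d2 = now.date()
    d1 = d2 - datetime.timedelta(days=1)
    return d1.strftime('%Y%m%d'), d2.strftime('%Y%m%d')

# A run at 02:02 on 6 July 2012 should process the window that began on 5 July.
print(window_dates(datetime.datetime(2012, 7, 6, 2, 2)))
```

In a real job you would call window_dates(datetime.datetime.now()). Date arithmetic with timedelta also handles month and year boundaries.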
