Facing an issue with "wget" in Python

Posted 2019-06-09 09:49

I am new to Python. I am having a problem with "wget" as well as with "urllib.urlretrieve(str(myurl), tail)".

When I run the script it downloads the files, but the filenames end with "?".

My complete code:

import os
import wget
import urllib
import subprocess
with open('/var/log/na/na.access.log') as infile, open('/tmp/reddy_log.txt', 'w') as outfile:
    results = set()
    for line in infile:
        if ' 200 ' in line:
            tokens = line.split()
            results.add(tokens[6]) # 7th token
    for result in sorted(results):
        print >>outfile, result
with open('/tmp/reddy_log.txt') as infile:
    results = set()
    for line in infile:
        head, tail = os.path.split(line)
        print tail
        myurl = "http://data.xyz.com" + str(line)
        print myurl
        wget.download(str(myurl))
        # urllib.urlretrieve(str(myurl), tail)

Output:

# python last.py
0011400026_recap.xml

http://data.na.com/feeds/mobile/android/v2.0/video/games/high/0011400026_recap.xml

latest_1.xml

http://data.na.com/feeds/mobile/iphone/article/league/news/latest_1.xml

currenttime.js

Listing the files:

# ls
0011400026_recap.xml?                   currenttime.js?  latest_1.xml?      today.xml?

Tags: python wget
1 answer
一纸荒年 Trace。
Reply #2 · 2019-06-09 10:30

A likely explanation for the behaviour you see is that you do not sanitize your input line before using it:

with open ('/tmp/reddy_log.txt') as infile:
     ...
     for line in infile:
         ...
         myurl = "http://data.xyz.com" + str(line)
         wget.download(str(myurl))

When you iterate over a file object (for line in infile:), each string you get is terminated by a newline ('\n') character. If you do not remove the newline before using line, it is still there in whatever you build from line: here, both the URL and the name of the downloaded file.
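A minimal sketch of the fix, assuming the same base URL as in the question: strip the trailing newline from each line before building the URL and the local filename. The helper name build_target is hypothetical, the code is written for Python 3, and the actual download call is left as a comment so the example runs without network access.

```python
import os

BASE_URL = "http://data.xyz.com"  # base URL taken from the question


def build_target(line, base_url=BASE_URL):
    """Return (url, filename) for one line of the log, newline removed."""
    # strip() drops the trailing '\n' and any other surrounding whitespace
    path = line.strip()
    url = base_url + path
    filename = os.path.basename(path)
    return url, filename


sample = "/feeds/mobile/iphone/article/league/news/latest_1.xml\n"
url, filename = build_target(sample)
print(repr(url))
print(repr(filename))
# With the wget package used in the question, the download would then be:
# wget.download(url, out=filename)
```

Printing with repr() makes it easy to confirm that neither string still ends in '\n'.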

To illustrate, here is the transcript of a test I ran:

08:28 $ cat > a_file
a
b
c
08:29 $ cat > test.py
data = open('a_file')
for line in data:
    new_file = open(line, 'w')
    new_file.close() 
08:31 $ ls
a_file  test.py
08:31 $ python test.py
08:31 $ ls
a?  a_file  b?  c?  test.py
08:31 $ ls -b
a\n  a_file  b\n  c\n  test.py
08:31 $

As you can see, I read lines from a file and create files using line as the filename, and the filenames as listed by ls have a ? at the end. We can do better, as explained in the manual page of ls:

  -b, --escape
         print C-style escapes for nongraphic characters

And, as you can see in the output of ls -b, the filenames are not terminated by a question mark (that is just a placeholder the ls program uses by default for nongraphic characters) but by a newline character.
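The same check can be done from Python, where repr() shows the escape sequence explicitly, much like ls -b. The sketch below is an illustration only: it assumes a POSIX filesystem (where a newline is a legal filename character), works in a throwaway temporary directory, and uses the hypothetical helper name names_with_stray_whitespace.

```python
import os
import tempfile


def names_with_stray_whitespace(directory):
    """Return the names in `directory` that start or end with whitespace."""
    return sorted(name for name in os.listdir(directory) if name != name.strip())


with tempfile.TemporaryDirectory() as workdir:
    # Reproduce the problem: one name keeps its trailing newline, one does not.
    for name in ("a\n", "b"):
        open(os.path.join(workdir, name), "w").close()

    suspicious = names_with_stray_whitespace(workdir)
    for name in suspicious:
        print(repr(name))  # repr() makes the newline visible as \n
```

Run against the real download directory, a helper like this would list the files that need renaming.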

While I'm at it, I should say that you should avoid using a temporary file to store the intermediate results of your computation.

A nice feature of Python is generator expressions; if you want, you can write your code as follows:

import wget

# you matched on a '200' on the whole line, I assume that what
# you really want is to match a specific column, the 'error_column'
# that I symbolically load from an external resource
from my_constants import error_column, payload_column

# here it is a sequence of generator expressions, each one relying
# on the previous one

# 1. the lines in the file, stripped of the white space
#    on the right (the newline is considered white space)
#    === not strictly necessary, just convenient because
#    === below we want to test for non-empty lines
lines = (line.rstrip() for line in open('whatever.csv'))

# 2. the lines are converted to a list of 'tokens' 
all_tokens = (line.split() for line in lines if line)

# 3. for each 'tokens' in the 'all_tokens' generator expression, we
#    check for the code '200' and possibly generate a new target
targets = (tokens[payload_column] for tokens in all_tokens if tokens[error_column]=='200')

# finally, use the 'targets' generator to proceed with the downloads
for target in targets: wget.download(target)

Don't be fooled by the amount of comments; without them my code is just:

import wget
from my_constants import error_column, payload_column

lines = (line.rstrip() for line in open('whatever.csv'))
all_tokens = (line.split() for line in lines if line)
targets = (tokens[payload_column] for tokens in all_tokens if tokens[error_column]=='200')

for target in targets: wget.download(target)
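For a version that runs on its own, here is a sketch that replaces my_constants with explicit column indexes and reads from an in-memory sample instead of a file. The indexes (6 for the request path, as in the question, and 8 for the status code), the sample log lines, and the function name select_targets are assumptions for illustration; check them against your real log format. It also assumes, as in the question, that the log contains paths that must be prefixed with the base URL.

```python
BASE_URL = "http://data.xyz.com"  # base URL from the question
PAYLOAD_COLUMN = 6  # request path, the 7th token as in the question
STATUS_COLUMN = 8   # assumed position of the HTTP status code


def select_targets(lines, base_url=BASE_URL):
    """Yield one URL per distinct path whose request returned status 200."""
    stripped = (line.strip() for line in lines)
    all_tokens = (line.split() for line in stripped if line)
    seen = set()
    for tokens in all_tokens:
        if len(tokens) <= STATUS_COLUMN:
            continue  # skip lines too short to hold a status code
        path = tokens[PAYLOAD_COLUMN]
        if tokens[STATUS_COLUMN] == "200" and path not in seen:
            seen.add(path)
            yield base_url + path


# Made-up sample lines in a common access-log layout
sample_log = [
    '10.0.0.1 - - [09/Jun/2019:09:49:00 +0000] "GET /feeds/latest_1.xml HTTP/1.1" 200 512\n',
    '10.0.0.2 - - [09/Jun/2019:09:49:01 +0000] "GET /feeds/missing.xml HTTP/1.1" 404 0\n',
    '\n',
    '10.0.0.3 - - [09/Jun/2019:09:49:02 +0000] "GET /feeds/latest_1.xml HTTP/1.1" 200 512\n',
]

targets = list(select_targets(sample_log))
for target in targets:
    print(target)  # in real use: wget.download(target)
```

Because split() discards all whitespace, the newline never reaches the URL, and the seen set plays the role of the results set in the question, so each file is downloaded only once.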