Smartest way to store huge amounts of data

Published 2019-07-07 04:23

Question:

I want to access the Flickr API with a REST request and download the metadata of approximately 1 million photos (maybe more). I want to store them in a .csv file and then import them into a MySQL database for further processing.

I am wondering what the smartest way to handle this much data is. What I am not sure about is how to store the records after fetching them in Python, pass them to the .csv file, and from there get them into the database. That is one big question mark for me.

What's happening now (as far as I understand it, see the code below) is that a dictionary is created for every photo (250 per requested URL). This way I would end up with as many dictionaries as photos (1 million or more). Is that possible? All of these dictionaries get appended to a list. Can I append that many dictionaries to a list? The only reason I want to append the dictionaries to a list is that it seems much easier to save from a list, row by row, to a .csv file.

What you should know is that I am a complete beginner to programming, Python, or anything of the sort. My profession is a completely different one, and I have only just started learning. If you need any further explanation, please let me know!

# accessing the website
from urllib.request import urlopen
from bs4 import BeautifulSoup

photos = []  # avoid shadowing the built-in name "list"
url = "https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5...1b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description"
soup = BeautifulSoup(urlopen(url))  # soup it up
for data in soup.find_all('photo'):
    photo = {  # avoid shadowing the built-in name "dict"
        "id": data.get('id'),
        "title": data.get('title'),
        "tags": data.get('tags'),
        "latitude": data.get('latitude'),
        "longitude": data.get('longitude'),
    }
    print(photo)
    photos.append(photo)  # append inside the loop, one dictionary per photo
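
For what it's worth, this is roughly how I imagined writing that list to the .csv file afterwards (an untested sketch using the standard library's csv.DictWriter; the filename photos.csv is just made up):

import csv

fieldnames = ["id", "title", "tags", "latitude", "longitude"]
with open("photos.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for photo in photos:  # the list built in the loop above
        writer.writerow(photo)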

I am working with Python 3.3. The reason I do not pass the data directly into the database is that I cannot get the Python connector for MySQL to run on my OS X 10.6.

Any help is greatly appreciated. Thank you, folks!

Answer 1:

I recommend using SQLite for prototyping this rather than messing with CSV. SQLite works very well with Python, and you don't have to go through all the headache of setting up a separate database server.
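
For example, something along these lines gives you a working database with nothing extra to install, since the sqlite3 module ships with Python (a minimal sketch; the filename, table, and column names are just placeholders matching the fields you're collecting):

import sqlite3

conn = sqlite3.connect("photos.db")  # creates the file if it doesn't exist
conn.execute("""
    CREATE TABLE IF NOT EXISTS photos (
        id TEXT PRIMARY KEY,
        title TEXT,
        tags TEXT,
        latitude REAL,
        longitude REAL
    )
""")
conn.commit()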

Also, I don't think you want to use BeautifulSoup for this, since it doesn't sound like scraping is what you really want. It sounds like you want to access the REST API directly. For that you'll want to use something like the requests library or, better yet, one of the Flickr Python bindings.

Once you have that up and running, I would write to the DB during each iteration of the loop, saving as you go. That way you're not using tons of memory, and if something crashes you don't lose the data you've pulled so far.
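
Put together, the whole thing might look roughly like this. This is only a sketch: it assumes the photos table from the snippet above, uses a placeholder API key, and the paging/field names (pages, photo, latitude, etc.) come from the Flickr API's JSON response format, so double-check them against the docs before relying on it.

import sqlite3
import requests

API_URL = "https://api.flickr.com/services/rest/"
params = {
    "method": "flickr.photos.search",
    "api_key": "YOUR_KEY_HERE",  # placeholder, use your own key
    "per_page": 250,
    "accuracy": 1,
    "has_geo": 1,
    "extras": "geo,tags,views,description",
    "format": "json",
    "nojsoncallback": 1,
}

conn = sqlite3.connect("photos.db")  # assumes the table created above

page = 1
while True:
    params["page"] = page
    data = requests.get(API_URL, params=params).json()
    photos = data["photos"]["photo"]
    if not photos:
        break
    # insert the whole page, then commit so a crash doesn't lose earlier pages
    conn.executemany(
        "INSERT OR IGNORE INTO photos VALUES (?, ?, ?, ?, ?)",
        [(p["id"], p["title"], p.get("tags"),
          p.get("latitude"), p.get("longitude")) for p in photos],
    )
    conn.commit()
    if page >= data["photos"]["pages"]:
        break
    page += 1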