How to read a large json in pandas?

2020-06-12 05:53发布

问题:

My code is :data_review=pd.read_json('review.json') I have the data review as fllow:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

But I got the follow error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument

My jsonfile do not contain any comments and 3.8G! I just download the file from here to practice link

When I use the follow code,throw the same error:

import json
with open('review.json') as json_file:
    data = json.load(json_file)

回答1:

Perhaps, the file you are reading contains multiple json objects rather and than a single json or array object which the methods json.load(json_file) and pd.read_json('review.json') are expecting. These methods are supposed to read files with single json object.

From the yelp dataset I have seen, your file must be containing something like:

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on.

Hence, it is important to realize that this is not single json data rather it is multiple json objects in one file.

To read this data into pandas data frame the following solution should work:

import pandas as pd

with open('review.json') as json_file:      
    data = json_file.readlines()
    # this line below may take at least 8-10 minutes of processing for 4-5 million rows. It converts all strings in list to actual json objects. 
    data = list(map(json.loads, data)) 

pd.DataFrame(data)

Assuming the size of data to be pretty large, I think your machine will take considerable amount of time to load the data into data frame.



回答2:

If you don't want to use a for-loop, the following should do the trick:

import pandas as pd

df = pd.read_json("foo.json", lines=True)

This will handle the case where your json file looks similar to this:

{"foo": "bar"}
{"foo": "baz"}
{"foo": "qux"}

And will turn it into a DataFrame consisting of a single column, foo, with three rows.

You can read more at Panda's docs