Using JSON or regex when processing tweets

Posted 2019-07-28 07:51

Question:

Which method is faster for obtaining the relevant data: a JSON parser (Python 2.6) or regex? Since the amount of data is huge, I presume there will be a considerable difference in time between the two methods.

Answer 1:

Assuming what you are asking...

I believe you're asking if it's faster to obtain information from a serialized JSON string by deserializing it or searching for the relevant match via regex.

Quick answer

In my informal experience of looking for a single key-value pair in a serialized activity streams object (tweet, retweet or quote), regex scales better than parsing the entire JSON object.

Why?

This is because tweets are pretty big, and when you're working with hundreds of thousands of them, deserializing the entire JSON string and randomly accessing the resulting JSON object for a single key-value pair is like using a sledgehammer to crack a nut.
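To make the two approaches concrete, here is a minimal sketch of both. The tweet is a hypothetical, heavily trimmed payload, and the key name `body` and the function names are assumptions chosen for illustration; a real activity streams object is far larger.

```python
import json
import re

# Hypothetical, heavily trimmed tweet; real payloads are much larger.
raw = '{"id": 1, "body": "hello world", "actor": {"preferredUsername": "alice"}}'

# Approach 1: deserialize the whole string, then look the key up.
def body_via_json(serialized):
    return json.loads(serialized)["body"]

# Approach 2: scan the raw string for the key and capture its value.
# The pattern tolerates backslash-escaped characters inside the value,
# but it returns the raw (still escaped) text and only handles string values.
BODY_RE = re.compile(r'"body"\s*:\s*"((?:[^"\\]|\\.)*)"')

def body_via_regex(serialized):
    match = BODY_RE.search(serialized)
    return match.group(1) if match else None
```

Both functions return the same text for this sample. The regex version stops at the first match and never builds the full object, which is where the savings would come from.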

Potential pitfalls...

The problem arises, however, when keys are repeated at different levels of nesting.

For example, quotes have a root level attribute called twitter_quoted_status which contains a copy of the tweet this quote object refers to.

That means any attribute name shared by both tweets and quotes would return at least 2 matches if you searched a serialized quote object with regex.

Since you cannot and should not rely on the order of attributes within a JSON object (dictionary keys are supposed to be unordered!), you can't count on the match you want being the first or second (or whatever) match, even if that is the pattern you've observed so far.
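The ambiguity can be reproduced with two hand-written quote objects that hold the same data in a different key order. These strings are hypothetical and trimmed to the two relevant keys; `body` is again an assumed attribute name.

```python
import json
import re

# Same hypothetical quote, serialized with the keys in two different orders.
quote_a = '{"body": "my comment", "twitter_quoted_status": {"body": "the original tweet"}}'
quote_b = '{"twitter_quoted_status": {"body": "the original tweet"}, "body": "my comment"}'

BODY_RE = re.compile(r'"body"\s*:\s*"((?:[^"\\]|\\.)*)"')

def all_bodies_via_regex(serialized):
    # Returns every match in textual order, with no notion of nesting level.
    return BODY_RE.findall(serialized)

def root_body_via_json(serialized):
    # The parser knows which "body" belongs to the root object.
    return json.loads(serialized)["body"]
```

Here `all_bodies_via_regex` returns the root-level value first for `quote_a` and second for `quote_b`, while `root_body_via_json` returns `"my comment"` for both.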

The only evidence I can share at the moment is that retrieving a single key-value pair from 100,000 original tweet objects (no quotes or retweets) tended to take 8-14 seconds on my desktop with the deserialization method, and 0-2 seconds with regex.
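If you want to check this on your own machine, a rough timing harness might look like the sketch below. It is not the setup behind the numbers above: the tweets are synthetic, the filler keys only imitate the bulk of a real payload, every name here is made up for illustration, and it uses `time.perf_counter`, which assumes Python 3 rather than 2.6.

```python
import json
import re
import time

BODY_RE = re.compile(r'"body"\s*:\s*"((?:[^"\\]|\\.)*)"')

def make_fake_tweets(count):
    # Synthetic stand-ins for serialized tweets, padded with filler keys.
    tweets = []
    for i in range(count):
        obj = {"id": i, "body": "tweet number %d" % i}
        for j in range(50):
            obj["filler_%d" % j] = "x" * 40
        tweets.append(json.dumps(obj))
    return tweets

def via_json(serialized):
    return json.loads(serialized)["body"]

def via_regex(serialized):
    return BODY_RE.search(serialized).group(1)

def time_it(func, items):
    # Wall-clock time for applying func to every item, plus the results.
    start = time.perf_counter()
    results = [func(item) for item in items]
    return time.perf_counter() - start, results

def compare(count=100000):
    tweets = make_fake_tweets(count)
    json_seconds, json_results = time_it(via_json, tweets)
    regex_seconds, regex_results = time_it(via_regex, tweets)
    # The last value confirms both methods extracted identical data.
    return json_seconds, regex_seconds, json_results == regex_results
```

Calling `compare()` returns the two elapsed times in seconds and a flag saying whether the results agree. The absolute figures will depend on hardware, Python version and how large your real tweets are.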

Disclaimer

Numbers are approximate and from memory. Sorry, this is just a quick answer; I don't have the tools at my disposal right now to test this and post findings.



Answer 2:

You can't use regex to parse JSON.

As an example, if you wanted to select an item from a JSON list, you would have to count the number of elements that come before it. This would require you to know what an element is and to be smart about matching braces and so forth. Pretty soon you'll have implemented a JSON parser, but one that depends on lots of tiny regexes that probably aren't very efficient.
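A small sketch of the list problem, using a hypothetical `hashtags` array (the key names and sample values are assumptions for illustration). The naive regex approach grabs the array contents and splits on commas, which breaks as soon as an element contains a comma or a nested structure.

```python
import json
import re

# Hypothetical payload: a list of two objects, one with a comma inside a string.
raw = '{"hashtags": [{"text": "a,b", "indices": [0, 4]}, {"text": "c", "indices": [5, 7]}]}'

def second_hashtag_via_json(serialized):
    # The parser tracks brackets and quoting, so indexing just works.
    return json.loads(serialized)["hashtags"][1]["text"]

def naive_elements_via_regex(serialized):
    # Capture everything between the outer brackets, then split on commas.
    # This cannot tell element separators from commas inside strings
    # or inside nested lists and objects.
    inner = re.search(r'"hashtags"\s*:\s*\[(.*)\]', serialized).group(1)
    return inner.split(",")
```

The list has two elements, yet `naive_elements_via_regex` yields more than two fragments, none of which is a complete element. Fixing that means tracking quotes and bracket depth, which is the start of writing a parser.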