Parsing JSON with awk/sed in bash to get key-value pairs

Published 2019-04-02 01:04

Question:

I have read many existing questions on SO, but none of them answers what I am looking for. I know it is difficult to parse JSON in bash using sed/awk, but I only need a few key-value pairs per record out of the whole list of key-value pairs per record. I want to do this because it should be faster, as the main JSON is pretty big, with millions of records.

The JSON format is like following:

{
    "documents":
    [
        {
            "title":"a",   //needed
            "description":"b",  //needed
            "id":"c",  //needed
            ....(some more:not useful)....
            "conversation":
            [
                {
                    "message":"",
                    "id":"d",   //not needed
                    .....(some more)....
                    "createDate":"e",   //not needed
                },
                ...(some more messages)....
            ],
            "createDate":"f",  //needed
            ....(many more labels).....
        }
    ],
    ....(some more global attributes)....
}

Now I require the attributes marked as needed, but the same keys (id, createDate) also appear nested inside conversation, which makes them hard to get at with simple sed/awk. Could anyone suggest whether this can be done with sed/awk? If possible, any help to achieve this would be appreciated.

P.S.: I know about jsawk, but I do not want to introduce any dependencies, so if possible please suggest a sed/awk approach.

EDIT: The expected output is multiple entries of the format given below (since documents is a list):

"title":"a",
"description":"b"
"id":"c"
"createDate":"f"

EDIT: The actual JSON contains no whitespace; it has been formatted above for readability.

Answer 1:

I would advise that you use jq or another real JSON parser. You can't reliably "parse" JSON with regular expressions. You could hack something together with awk, but it will break easily if your input takes a form you didn't anticipate.

So, the answer is: introduce a cheap dependency (jq or a similar tool) and script around that. Unless you're running this script on a router or an embedded computer, chances are you can easily install jq.
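
For illustration, a minimal jq sketch (input.json is a placeholder name, and this assumes the real file is valid JSON, i.e. without the //needed annotations shown in the question):

# Keep only the four wanted top-level keys of each document.
jq '.documents[] | {title, description, id, createDate}' input.json

Here {title, description, id, createDate} is jq shorthand for {title: .title, description: .description, ...}; it selects just those four keys of each document, so the duplicate id/createDate keys nested inside conversation are never touched.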



Answer 2:

If the structural characters {, }, [, and ] always appear as the last field on a line (optionally followed by a comma), this would work:

#!/usr/bin/awk -f

# Recursively read the input one line at a time, tracking nesting depth.
# "level" is the current depth; "end" is a regex for the token that closes
# the current scope: "},?" for objects, "],?" for arrays.
function walk(level, end) {
    while (getline > 0) {
        # The last field closes the current scope: pop back to the caller.
        if (level && $NF ~ end) {
            return
        }
        if ($NF == "{") {            # descend into an object
            walk(level + 1, "},?")
        } else if ($NF == "[") {     # descend into an array
            walk(level + 1, "],?")
        } else if (level == 3 && match($0, /"(title|description|id|createDate)":"[^"]*"/)) {
            # Depth 3 is the document level, so the same keys nested deeper
            # (e.g. "id" inside "conversation") are never printed.
            print substr($0, RSTART, RLENGTH)
        }
    }
}

BEGIN {
    walk(0)   # a plain getline in BEGIN still reads from the input files
    exit
}
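
For reference, assuming the script is saved as extract.awk (a placeholder name), it can be run with:

awk -f extract.awk input.json

or made executable with chmod +x extract.awk and invoked directly, thanks to the shebang line.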

Input:

{
"documents":
[
{
"title":"a",   //needed
"description":"b",  //needed
"id":"c",  //needed
....(some more:not useful)....
"conversation":
[
{
"message":"",
"id":"d",   //not needed
.....(some more)....
"createDate":"e",   //not needed
},
...(some more messages)....
],
"createDate":"f",  //needed
....(many more labels).....
}
],
....(some more global attributes)....
}

Output:

"title":"a"
"description":"b"
"id":"c"
"createDate":"f"


Answer 3:

Well, if you're going to use a regex to parse JSON, the result will by nature be quick, dirty, and heavily reliant on the exact layout of the input file. That said, you could write something that keys off the amount of whitespace occurring before the key-value pairs you're interested in. Depending on the kind of output you're looking for, you could use something along the lines of:

awk '/^ {12}"title/
/^ {12}"description/
/^ {12}"id/
/^ {12}"createDate/' input_file.json

Not great, but it does the trick on your example input...
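
For what it's worth, run against the pretty-printed sample from the question (where the document-level keys are indented by exactly 12 spaces), it should print something like:

            "title":"a",   //needed
            "description":"b",  //needed
            "id":"c",  //needed
            "createDate":"f",  //needed

The id and createDate entries inside conversation are indented further, so ^ {12}" does not match them. Note that the {12} interval syntax needs an awk with POSIX interval-expression support (e.g. a modern gawk).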