Get JSON string from within JavaScript on an HTML page

Posted 2019-03-01 08:51

There's valid JSON inside a JavaScript block on an HTML page that I want to parse with a shell script. First I would like to extract the entire JSON string, from { to }, and then I can parse it with jq, for example.

This is basically how my HTML looks:

<!DOCTYPE html>
<html>
  <head>
    <title>foobar</title>

  </head>

  <body>

  <script type="text/javascript" src="resources/script.js" charset="UTF-8"></script>
  <script type="text/javascript" src="resources/resources.js" charset="UTF-8"></script>

    <script type="text/javascript">
    if( foo.foobar.getInstance().isbar() ) 
    {
        foo.bar.Processor.message( {"head":{"url":"anotherfoo;barid=347EDAFA2B136D7825745B0A490DE32"},...});
    }
    else
    {....}
    </script>
  </body>
</html>

In the end I want to get the ID that follows "barid=". I was playing around with grep for foo.bar.Processor.message followed by sed or cut, but I think there are better ways to do it. If you could point me in the right direction, that would be great! Thank you!

3 answers
Anthone
#2 · 2019-03-01 08:56

As mentioned in earlier posts, nested formats should be parsed by tools that understand their nesting.

In this case, one correct way to approach the problem is to convert the HTML to JSON, find the required field with a JSON-aware parser, and then strip off the parts that are not needed.

As the developer of both Unix utilities needed for this task (without being exposed to false positives), I propose this solution (assuming your HTML is in file.html):

bash $ cat file.html | jtm | jtc -w'<barid=>R' | sed -E 's/.*barid=([0-9A-F]+).*/\1/g'
347EDAFA2B136D7825745B0A490DE32
bash $ 

That way, no false positives are expected.

PS. You can find both jtm (an HTML <-> JSON converter) and jtc (a JSON tool) by searching for the keywords jtm, json, jtc; both utilities are FOSS.

戒情不戒烟
#3 · 2019-03-01 09:07

Using Unix command-line tools to parse HTML is usually not recommended. But if you know your marker string, foo.bar.Processor.message, you can use this sed + jq solution:

sed -n 's/foo\.bar\.Processor\.message(\([^)]*\).*/\1/p' file.html |
jq -r '.head.url | split(";")[1] | split("=")[1]'

347EDAFA2B136D7825745B0A490DE32

In the absence of jq, you can use this sed + GNU grep solution:

sed -n 's/foo\.bar\.Processor\.message(\([^)]*\).*/\1/p' file.html |
grep -oP ';barid=\K\w+'
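
One thing to watch with the sed pattern above: [^)]* stops at the first ")", so it cuts the JSON short if a string value happens to contain a closing parenthesis. The sketch below shows this on a hypothetical input line (the "note" field is made up for illustration) and tries a variant that matches up to the ");" at the end of the line instead. It assumes the whole call sits on one line and that nothing follows the ");".

```shell
# Hypothetical input: the "note" value contains a ")" (not in the original HTML).
line='foo.bar.Processor.message( {"head":{"url":"anotherfoo;barid=347EDAFA2B136D7825745B0A490DE32"},"note":"a (b) c"});'

# Pattern from the answer: [^)]* ends at the first ")", truncating the JSON.
truncated=$(printf '%s\n' "$line" |
  sed -n 's/foo\.bar\.Processor\.message(\([^)]*\).*/\1/p')

# Variant: skip optional spaces after "(" and match greedily up to the
# ");" that terminates the line, so parentheses inside strings survive.
complete=$(printf '%s\n' "$line" |
  sed -n 's/.*foo\.bar\.Processor\.message( *\(.*\)); *$/\1/p')

printf 'truncated: %s\n' "$truncated"
printf 'complete:  %s\n' "$complete"
```

The variant trades one assumption for another: it no longer cares about ")" inside the JSON, but it breaks if more code follows the call on the same line.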
ゆ 、 Hurt°
#4 · 2019-03-01 09:20

One option might be to use pup, at least for parsing the HTML:

< input.html pup 'script:not(:empty) text{}' |
  grep -F 'foo.bar.Processor.message' | grep -o '{.*}' |
  jq -r '.head.url
         | split(";")[]
         | select(test("barid="))
         | sub("barid=";"")'

With your HTML (adjusted to ensure the JSON in the HTML is valid), this produces:

347EDAFA2B136D7825745B0A490DE32

Of course there are many caveats. YMMV.
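
One such caveat, sketched here under assumptions: grep works line by line, so grep -o '{.*}' finds nothing if the JSON argument is pretty-printed over several lines. The script text below is a hypothetical multi-line variant of the call from the question; joining the lines with tr before matching is one possible workaround, not the only one.

```shell
# Hypothetical script text: the JSON argument spans several lines.
script_text='foo.bar.Processor.message( {
    "head": {
        "url": "anotherfoo;barid=347EDAFA2B136D7825745B0A490DE32"
    }
});'

# Line-by-line matching: no single line holds both "{" and a later "}".
per_line=$(printf '%s\n' "$script_text" | grep -o '{.*}')

# Joining the lines first lets the same grep see the whole object.
flattened=$(printf '%s\n' "$script_text" | tr '\n' ' ' | grep -o '{.*}')

printf 'per line:  [%s]\n' "$per_line"
printf 'flattened: [%s]\n' "$flattened"
```

Because '{.*}' is greedy, the flattened match runs from the first "{" to the last "}" in the text, so any braces in code after the call would be swept in as well.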
