There's valid JSON inside a JavaScript block on an HTML page that I want to parse with a shell script.
First of all I would like to get the entire JSON string, from { to }, and then I can parse it with jq, for example.
This is basically how my HTML looks:
<!DOCTYPE html>
<html>
<head>
<title>foobar</title>
</head>
<body>
<script type="text/javascript" src="resources/script.js" charset="UTF-8"></script>
<script type="text/javascript" src="resources/resources.js" charset="UTF-8"></script>
<script type="text/javascript">
if( foo.foobar.getInstance().isbar() )
{
foo.bar.Processor.message( {"head":{"url":"anotherfoo;barid=347EDAFA2B136D7825745B0A490DE32"},...});
}
else
{....}
</script>
</body>
</html>
In the end I want to get the ID that's at "barid=...".
I was playing around with grep foo.bar.Processor.message followed by sed or cut, but I think there are better ways to do it.
If you could point me in the right direction that'd be great!
Thank you!
As mentioned in prior posts, nested formats should be parsed by processors that are aware of the nesting.
In this case, one correct way to approach the problem would be: convert the HTML to JSON, find the required field with a JSON-aware parser, and then strip off the parts that are not required.
As the developer of both Unix utilities required for this task (without being exposed to false positives), I propose this solution (assuming your HTML is in file.html):
That way no false positives are expected.
PS. You can find both jtm (an HTML <-> JSON converter) and jtc (a JSON tool) by googling the keywords jtm, json, jtc; both utilities are FOSS.

Usually it is not recommended to use Unix command-line tools for parsing HTML. But if you know your marker string foo.bar.Processor.message, then you may use this sed + jq solution:

In the absence of jq, you may use this sed + GNU grep solution:

One option might be to use pup, at least for parsing the HTML:
With your HTML (adjusted to ensure the JSON in the HTML is valid), this produces:
Of course there are many caveats. YMMV.
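
For illustration, the marker-string idea mentioned above can be sketched with plain sed. This is a minimal sketch rather than the exact commands from any of the answers. It assumes the foo.bar.Processor.message( ... ); call sits on a single line, as in the question, and it writes a trimmed copy of the question's page to a temporary file as hypothetical sample input. The names tmp, payload and barid are made up for the example.

```shell
#!/bin/sh
# Hypothetical sample input: a trimmed copy of the page from the question.
# The "..." inside the payload is the elision from the question, kept as-is.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<script type="text/javascript">
if( foo.foobar.getInstance().isbar() )
{
foo.bar.Processor.message( {"head":{"url":"anotherfoo;barid=347EDAFA2B136D7825745B0A490DE32"},...});
}
</script>
EOF

# Step 1: keep only the argument of foo.bar.Processor.message( ... );
# -n suppresses default output; the p flag prints only lines where the
# substitution matched. Assumes the whole call is on one line.
payload=$(sed -n 's/.*foo\.bar\.Processor\.message( *\(.*\) *);.*/\1/p' "$tmp")

# Step 2: pull out everything after "barid=" up to the next double quote.
barid=$(printf '%s\n' "$payload" | sed -n 's/.*barid=\([^"]*\)".*/\1/p')

printf '%s\n' "$payload"
printf '%s\n' "$barid"

rm -f "$tmp"
```

If the extracted payload is complete, valid JSON (the sample above is not, because of the elision), step 2 could instead be done in a JSON-aware way by piping it to jq -r '.head.url' and then cutting the barid value out of that string. The sed pattern in step 1 breaks if the call spans several lines or the marker appears more than once on a line.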