I'm working on an AppleScript right now and I'm stuck. Let's take this snippet of HTML as an example:
<body><div>Apple don't behave accordingly <a href = "http://apple.com">apple</a></div></body>
What I need is to get the text without the HTML tags, either by deleting each angle bracket along with everything inside it, or by some other way of converting HTML to plain text.
The result should be:
Apple don't behave accordingly apple
How about using textutil?
on run -- example (don't forget to escape quotes)
	removeMarkup from "<body><div>Apple don't behave accordingly <a href = \"http://apple.com\">apple</a></div></body>"
end run

to removeMarkup from someText -- strip HTML using textutil
	set someText to quoted form of ("<!DOCTYPE HTML PUBLIC>" & someText) -- fake an HTML document header
	return (do shell script "echo " & someText & " | /usr/bin/textutil -stdin -convert txt -stdout") -- strip HTML
end removeMarkup
I thought I'd add an extra answer because of a problem I ran into. If you don't want to lose UTF-8 characters, you need:
set plain_text to do shell script "echo " & quoted form of ("<!DOCTYPE HTML PUBLIC><meta charset=\"UTF-8\">" & html_string) & space & "| textutil -convert txt -stdin -stdout"
Basically, you need to add the <meta charset="UTF-8"> tag (with the quotes escaped as \" inside the AppleScript string) so that textutil treats the input as a UTF-8 document.
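For comparison, the other option mentioned in the question, deleting each bracket with everything in it, can be sketched in plain AppleScript with text item delimiters. This is only a rough sketch, not a real HTML parser: the handler name stripTags is made up for illustration, it assumes every "<" starts a tag, it silently drops an unclosed tag along with whatever follows it up to the next "<", and it does not decode entities such as &amp;.

```applescript
-- Hypothetical helper (not from the answers above): remove every "<...>" span from someText.
on stripTags(someText)
	if someText is "" then return ""
	set savedDelimiters to AppleScript's text item delimiters
	try
		-- Split on "<": every chunk after the first one starts inside a tag.
		set AppleScript's text item delimiters to "<"
		set chunks to text items of someText
		set keptPieces to {item 1 of chunks}
		-- Split each chunk on ">" and keep only what follows the first ">".
		set AppleScript's text item delimiters to ">"
		repeat with i from 2 to (count of chunks)
			set parts to text items of (item i of chunks)
			if (count of parts) > 1 then
				-- Joining with ">" restores any literal ">" in the text itself.
				set end of keptPieces to (items 2 thru -1 of parts) as text
			end if
		end repeat
		set AppleScript's text item delimiters to ""
		set plainText to keptPieces as text
	on error errorMessage number errorNumber
		-- Restore the delimiters before passing the error on.
		set AppleScript's text item delimiters to savedDelimiters
		error errorMessage number errorNumber
	end try
	set AppleScript's text item delimiters to savedDelimiters
	return plainText
end stripTags

stripTags("<body><div>Apple don't behave accordingly <a href = \"http://apple.com\">apple</a></div></body>")
-- returns "Apple don't behave accordingly apple"
```

This avoids the shell call, but textutil is probably the safer choice for real-world HTML, since it also handles entities and character encodings.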
on findStrings(these_strings, search_string)
	set the foundList to {}
	repeat with this_string in these_strings
		considering case
			-- this_string is a reference to the list item, so store its contents
			if the search_string contains this_string then set the end of the foundList to contents of this_string
		end considering
	end repeat
	return the foundList
end findStrings

findStrings({"List", "Of", "Strings", "To", "find..."}, "...in String to search")