I have an Excel sheet with cells containing html. How can I batch convert them to plaintext? At the moment there are so many useless tags and styles. I want to write it from scratch but it will be far easier if I can get the plain text out.
I can write a script to convert HTML to plain text in PHP, so if you can't think of a solution in VBA then maybe you can suggest how I might pass the cell data to a website and retrieve the result.
Set a reference to "Microsoft HTML object library".
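With that reference in place, the conversion can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the names HtmlToText and ConvertSelectionToText are illustrative, and it assumes the innerText produced by the MSHTML parser is acceptable plain text for your markup.

```vba
' Requires a reference to "Microsoft HTML Object Library" (early binding).
Function HtmlToText(ByVal sHTML As String) As String
    Dim oDoc As HTMLDocument
    Set oDoc = New HTMLDocument
    ' Let the HTML parser build the document, then read back only the text
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText
End Function

' Hypothetical helper: converts every text cell in the current selection in place.
Sub ConvertSelectionToText()
    Dim c As Range
    For Each c In Selection.Cells
        ' Skip empty cells, numbers and error values
        If VarType(c.Value) = vbString Then
            c.Value = HtmlToText(c.Value)
        End If
    Next c
End Sub
```

Select the cells that contain HTML and run ConvertSelectionToText. Work on a copy of the sheet first, since the cells are overwritten.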
Tim
Tim's solution was great, worked like a charm.
I'd like to contribute: use this code to add the "Microsoft HTML Object Library" reference at runtime:
It worked on Windows XP and Windows 7.
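Adding the reference from code might look something like the sketch below. The procedure name is illustrative, and the GUID and version number are what I believe identify the MSHTML type library, so treat them as assumptions and check them against your own References dialog. It also assumes "Trust access to the VBA project object model" is enabled in the Trust Center; without it, access to VBProject raises an error.

```vba
Sub AddHtmlLibraryReference()
    ' Assumption: GUID of the "Microsoft HTML Object Library" (MSHTML), version 4.0
    Const MSHTML_GUID As String = "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}"
    Dim ref As Object

    ' Do nothing if the library is already referenced
    For Each ref In ThisWorkbook.VBProject.References
        If StrComp(ref.GUID, MSHTML_GUID, vbTextCompare) = 0 Then Exit Sub
    Next ref

    ' Arguments: GUID, major version, minor version
    ThisWorkbook.VBProject.References.AddFromGuid MSHTML_GUID, 4, 0
End Sub
```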
Yes! I managed to solve my problem as well. Thanks everybody!
In my case, I had this sort of input:
And I did not want the result to be all jammed together without line breaks.
So I first split my input at every <p> tag into an array 'paragraphs', then for each element I used Tim's answer to get the text out of the HTML (very sweet answer btw). In addition, I concatenated each cleaned 'paragraph' with the line break character Chr(10) for VBA/Excel. The final code is:
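A sketch of the approach just described, under a few assumptions: the function name HtmlParagraphsToText is illustrative, the split only matches a bare <p> tag (paragraph tags with attributes would need a different split), and it uses late binding through CreateObject("htmlfile") so it runs without setting the library reference.

```vba
Function HtmlParagraphsToText(ByVal sHTML As String) As String
    Dim oDoc As Object
    Dim paragraphs() As String
    Dim i As Long
    Dim sText As String
    Dim sResult As String

    Set oDoc = CreateObject("htmlfile")
    ' Case-insensitive split so both <p> and <P> are matched
    paragraphs = Split(sHTML, "<p>", -1, vbTextCompare)

    For i = LBound(paragraphs) To UBound(paragraphs)
        ' Strip the remaining tags from this piece with the HTML parser
        oDoc.body.innerHTML = paragraphs(i)
        sText = Trim$(oDoc.body.innerText & "")
        If Len(sText) > 0 Then
            ' Separate non-empty paragraphs with a line feed
            If Len(sResult) > 0 Then sResult = sResult & Chr(10)
            sResult = sResult & sText
        End If
    Next i

    HtmlParagraphsToText = sResult
End Function
```

For the line breaks to show in the worksheet, the target cells need "Wrap Text" turned on.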
A very simple way to extract text is to scan the HTML character by character, and accumulate characters outside of angle brackets into a new string.
This can leave lots of extraneous whitespace, but it will help in removing the tags.
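The character scan described above could be sketched like this. The function name StripTags is illustrative. Note that this simple version does not decode entities such as &amp;amp;, and it also drops any stray > that appears outside a tag.

```vba
Function StripTags(ByVal sHTML As String) As String
    Dim i As Long
    Dim ch As String
    Dim insideTag As Boolean
    Dim sOut As String

    For i = 1 To Len(sHTML)
        ch = Mid$(sHTML, i, 1)
        If ch = "<" Then
            insideTag = True          ' a tag starts: stop copying
        ElseIf ch = ">" Then
            insideTag = False         ' the tag ends: resume copying
        ElseIf Not insideTag Then
            sOut = sOut & ch          ' keep characters outside angle brackets
        End If
    Next i

    StripTags = sOut
End Function
```

It needs no library reference, so it can also be called straight from a worksheet formula, for example =StripTags(A1).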
Tim's answer is excellent. However, a minor adjustment can be added to avoid one foreseeable error response.