I have approx. 12000 cells in excel containing RTF (including formatting tags). I need to parse them to get to the unformatted text.
This is the example of one of the cells with text:
{\rtf1\ansi\deflang1060\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset238
Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs24\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw11908\paperh16833\margl1800\margr1800\margt1440\margb1440\headery720\footery720
\deftab720\formshade\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn11908\pghsxn16833\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440
\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par}
And all I really need is this:
TPR 0160 000
IPR 0160 000
OB-R-02-28
The problem with simple looping over the cells and removing unnecessary formatting is, that not everything in those 12000 cells is as straightforward as this is. So I would need to manually inspect many different versions and write several variations; and still at the end there would be a lot of manual work to do.
But if I copy the contents of one cell to empty text document and save it as RTF, then open it with MS Word, it instantly parses the text and I get exactly what I want. Unfortunately it's extremely inconvenient to do so for a 12000 cells.
So I was thinking about VBA macro, to move cell contents to Word, force parsing and then copy the result back to the originating cell. Unfortunately I'm not really sure how to do it.
Does anybody has any idea? Or a different approach? I will be really grateful for a solution or a push in the right direction.
TNX!
Your post made it sound as if each RTF document was stored in a single Excell cell. If so, then
Solution using .Net Framework RichTextBox control
will convert the RTF in each cell to plain text in 2 lines of code (after a little system configuration to get the right .tlb file to allow reference to the .Net Framework). Put the cell value in rtfsample and
Some of the solutions here require a reference to the MS Word Object Library. Playing with the cards I am dealt, I found a solution that does not rely on it. It strips RTF tags, and other fluff like font tables and stylesheets, all in VBA. It might be helpful to you. I ran it across your data, and other than the whitespace, I get the same output as what you expected.
Here is the code.
First, something to check if a string is alphanumeric or not. Give it a string that's one character long. This function is used to work out delimitation here and there.
Next up is to remove and entire group. I use this to remove font tables and other rubbish.
Okay, and this function removes any tags.
We can remove curly braces in the obvious way:
Once you have the functions above copy-pasted into your module, you can create a function that uses them to strip away any stuff you don't need or want. The following works perfectly in my case.
I hope this helps. I wouldn't use it in a word processor or anything, but it might do for scraping data if that's what you're doing.
You can try to parse every cell with regular expression and leave only the content you need.
Every RTF control code start with "\" and ends with space, without any additional space between. "{}" are use for grouping. If your text won't contain any, you can just remove them (the same for ";"). So now you stay with your text and some unnecessary words as "Arial", "Normal" etc. You can build the dictionary to remove them also. After some tweaking, you will stay with only the text you need.
Look at http://www.regular-expressions.info/ for more information and great tool to write RegExp's (RegexBuddy - unfortunately it isn't free, but it's worth the money. AFAIR there is also trial).
UPDATE: Of course, I don't encourage you to do it manually for every cell. Just iterate through active range: Refer this thread: SO: About iterating through cells in VBA
Personally, I'll give a try to this idea:
And how to use RegExp's in VBA (Excel)?
Refer: Regex functions in Excel and Regex in VBA
Basically you've to use VBScript.RegExp object through COM.
If you did want to go down the route of using Word to parse the text, this function should help you out. As the comments suggest, you'll need a reference to the MS Word Object Library.
You could call it for each of your 12,000 cells using something similar to this:
The ParseRTF function takes about a second to run (on my machine at least), so for 12,000 cells this will work out at about three and a half hours.
Having thought about this problem over the weekend, I was sure there was a better (quicker) solution for this.
I remembered the RTF capabilities of the clipboard, and realised that a class could be created that would copy RTF data to the clipboard, paste to a word doc, and output the resulting plain text. The benefit of this solution is that the word doc object would not have to be opened and closed for each rtf string; it could be opened before the loop and closed after.
Below is the code to achieve this. It is a Class module named clsRTFParser.
You could call it for each of your 12,000 cells using something similar to this:
I have simulated this using example RTF strings on my machine. For 12,000 cells it took two and a half minutes, a much more reasonable time frame!