How to extract text from such type of html source?

2019-07-19 07:07发布

I have html source containing about 1000 microblogs (one tweet per line). Most of the tweets are like the below. I am using delphi memo to try to strip html marks by using Pos function and delete function but failed.

<div id='tweetText'> RT <a onmousedown="return touch(this.href,0)" href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> when we get 150</div>

I want to strip html marks and only have:

RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150 

How can I extract such text in delphi?

Thank you very much in advance.

Update:

Cosmin Prund is right. I mistakenly skipped a part. What I want is :

RT @HighfashionUK  RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150 

Cosmin Prund is great.

1条回答
We Are One
2楼-- · 2019-07-19 07:55

Since all HTML markup is between < and >, a routine to strip markup can be trivially written like this. Hopefully this is what you want because, as you see in my comment, there's a issue with @HighfashionUK - your example skipped that, don't know why.

function StripHtmlMarkup(const source:string):string;
var i, count: Integer;
    InTag: Boolean;
    P: PChar;
begin
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i:=1 to Length(source) do
    if InTag then
      begin
        if source[i] = '>' then InTag := False;
      end
    else
      if source[i] = '<' then InTag := True
      else
        begin
          P[count] := source[i];
          Inc(count);
        end;
  SetLength(Result, count);
end;
查看更多
登录 后发表回答