How to extract text from such type of html source?

2019-07-19 07:07发布

I have html source containing about 1000 microblogs (one tweet per line). Most of the tweets are like the below. I am using delphi memo to try to strip html marks by using Pos function and delete function but failed.

<div id='tweetText'> RT <a onmousedown="return touch(this.href,0)" href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> when we get 150</div>

I want to strip html marks and only have:

RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150

How can I extract such text in delphi?

Thank you very much in advance.

Update:

Cosmin Prund is right. I mistakenly skipped a part. What I want is :

RT @HighfashionUK  RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150

Cosmin Prund is great.

标签： html delphi parsing

1条回答

We Are One

2楼-- · 2019-07-19 07:55

Since all HTML markup is between < and >, a routine to strip markup can be trivially written like this. Hopefully this is what you want because, as you see in my comment, there's a issue with @HighfashionUK - your example skipped that, don't know why.

function StripHtmlMarkup(const source:string):string;
var i, count: Integer;
    InTag: Boolean;
    P: PChar;
begin
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i:=1 to Length(source) do
    if InTag then
      begin
        if source[i] = '>' then InTag := False;
      end
    else
      if source[i] = '<' then InTag := True
      else
        begin
          P[count] := source[i];
          Inc(count);
        end;
  SetLength(Result, count);
end;

0人赞添加讨论(0) 举报

How to extract text from such type of html source?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间