Extract elements from an HTML page

Posted 2019-08-03 19:16

I downloaded some YouTube comment pages and I want to extract the username (or user display name) and the link from code blocks like the following:

 <p class="metadata">
      <span class="author ">
        <a href="/channel/UCuoJ_C5xNTrdnc4motXPHIA" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKG174zFqbQCFZmaIQodtmyE0A%3D%3D" dir="ltr">Sabil Muhammad</a>
      </span>
        <span class="time" dir="ltr">
          <a dir="ltr" href="http://www.youtube.com/comment?lc=S2ZH2gSPYaef43vTRkLDxUzo2fYicVUc3SFvmYq2jrs">
            il y a 1 jour
          </a>
        </span>
    </p>

I want to extract /channel/UCuoJ_C5xNTrdnc4motXPHIA and Sabil Muhammad.

There are of course many other lines in the HTML page, but I only want to focus on blocks like the one above, extract all usernames and their corresponding links, and write them to a log file.

Are there any good scripts for this? I know bash and C/C++.

Thanks!

3 answers
爱情/是我丢掉的垃圾
#2 · 2019-08-03 19:58

If you use jQuery, it's quite easy. If you're doing it in bash or C/C++, however, you'll need to read the content of the page and parse out the elements you are interested in. If the markup is well-formed, you could treat it as XML and read the attributes fairly easily.

You could also use a regex, or simple text matching with substrings.
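As a rough sketch of the regex route, something like the following sed one-liner might do. It assumes each author link sits on a single line, that href comes before class as in the snippet from the question, and that the class list contains yt-user-name. The function name list_authors is made up for illustration.

```shell
#!/bin/sh
# list_authors (hypothetical name): read HTML on stdin and print
# "link name" for every anchor whose class list contains yt-user-name.
# Assumes the whole <a ...>...</a> element is on one line and that
# href is the first attribute, as in the sample markup.
list_authors() {
    sed -n 's|.*<a href="\([^"]*\)" class="[^"]*yt-user-name[^>]*>\([^<]*\)</a>.*|\1 \2|p'
}
```

It could be run as `list_authors < comments.html >> authors.log`, where both file names are placeholders. The link never contains a space, so everything after the first space on each output line is the display name.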

你好瞎i
#3 · 2019-08-03 20:00

With awk (if you are comfortable with bash) you can read the page line by line, start copying when a line matches <p class="metadata">, and stop copying when you reach </p>.

Then work on the extracted part, and so on...
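The start/stop filter described above could be sketched roughly like this. It is only an illustration under a few assumptions: the author anchor fits on one line, href is its first attribute, and the function name extract_authors is invented here.

```shell
#!/bin/sh
# extract_authors (hypothetical name): read HTML on stdin and print
# "link<TAB>name" for each author anchor found inside a
# <p class="metadata"> ... </p> block.
extract_authors() {
    awk '
        /<p class="metadata">/ { inside = 1 }   # entering a metadata block
        inside && /class="[^"]*yt-user-name/ {
            link = $0
            name = $0
            sub(/.*<a href="/, "", link)        # drop everything up to the href value
            sub(/".*/, "", link)                # drop everything after the href value
            sub(/<\/a>.*/, "", name)            # drop the closing tag and what follows
            sub(/.*>/, "", name)                # drop the opening tag and what precedes
            print link "\t" name
        }
        /<\/p>/ { inside = 0 }                  # leaving the block
    '
}
```

For example, `extract_authors < comments.html >> authors.log` would append one line per comment author to the log (both file names are placeholders).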

Juvenile、少年°
#4 · 2019-08-03 20:08

You could use jQuery to accomplish this by iterating over all of the elements with the 'metadata' class and pulling out the contents that you need:

//After including jQuery within your page
$(document).ready(function()
{
    //Iterates through each of the metadata tags
    $('.metadata').each(function()
    {
          //Pulls the username
          var username = $('.yt-user-name', this).text();
          //Pulls the channel link from the same anchor as the username
          var link = $('.yt-user-name', this).attr('href');
          //Process each accordingly
          alert(username + ':' + link);
    });
});

