Extract elements from an HTML page

Posted 2019-08-03 19:17

Question:

I downloaded some YouTube comment pages, and I want to extract the username (or user display name) and the link from code blocks like the following:

<p class="metadata">
  <span class="author ">
    <a href="/channel/UCuoJ_C5xNTrdnc4motXPHIA" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKG174zFqbQCFZmaIQodtmyE0A%3D%3D" dir="ltr">Sabil Muhammad</a>
  </span>
  <span class="time" dir="ltr">
    <a dir="ltr" href="http://www.youtube.com/comment?lc=S2ZH2gSPYaef43vTRkLDxUzo2fYicVUc3SFvmYq2jrs">
      il y a 1 jour
    </a>
  </span>
</p>

From this block I want to extract /channel/UCuoJ_C5xNTrdnc4motXPHIA and Sabil Muhammad.

The HTML page of course contains many other lines, but I only care about blocks like the one above. I want to extract every username with its corresponding link and write them to a log file.

Are there any good scripts for this? I know bash and C/C++.

Thanks!

Answer 1:

You could use jQuery for this: iterate over every element with the class 'metadata' and pull out the values you need:

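// After including jQuery within your page
$(document).ready(function()
{
    // Iterates through each element with the 'metadata' class
    $('.metadata').each(function()
    {
          // The author anchor holds both the display name and the channel link
          var author = $('.yt-user-name', this);
          // Pulls the username, e.g. "Sabil Muhammad"
          var username = author.text().trim();
          // Pulls the channel link, e.g. "/channel/UCuoJ_C5xNTrdnc4motXPHIA"
          // (the anchor under '.time' is the comment permalink, not the channel)
          var link = author.attr('href');
          // Process each pair as needed; here it is written to the console
          console.log(username + ':' + link);
    });
});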

Answer 2:

If you use jQuery, it's quite easy. However, if you're doing it in bash or C/C++, you'll need to read the content of the page yourself and parse it for the elements you are interested in. If the markup is well-formed, you could treat the elements as XML and read the attributes fairly easily.

Alternatively, you could use a regular expression or simple substring matching.
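
As a rough illustration of the regex route, here is a minimal Node.js sketch (JavaScript, to stay with the language used in the other answer). It assumes every author link carries the class yt-user-name and lists href before class, exactly as in the snippet from the question. The names AUTHOR_LINK, extractAuthors and writeAuthorLog are made up for this example, and a regex like this will miss anchors whose attributes are ordered differently.

```javascript
const fs = require('fs');

// Matches an <a> whose class list contains "yt-user-name" and captures
// its href (group 1) and its link text (group 2).
// Assumption: href appears before class inside the tag, as in the question.
const AUTHOR_LINK =
  /<a\s+href="([^"]*)"[^>]*class="[^"]*\byt-user-name\b[^"]*"[^>]*>([^<]*)<\/a>/g;

// Returns one { link, name } object per author anchor found in the HTML text.
function extractAuthors(html) {
  const authors = [];
  for (const match of html.matchAll(AUTHOR_LINK)) {
    authors.push({ link: match[1], name: match[2].trim() });
  }
  return authors;
}

// Hypothetical helper: reads a saved page and writes "name<TAB>link" lines
// to a log file, one line per author found.
function writeAuthorLog(htmlPath, logPath) {
  const html = fs.readFileSync(htmlPath, 'utf8');
  const lines = extractAuthors(html).map((a) => `${a.name}\t${a.link}`);
  fs.writeFileSync(logPath, lines.join('\n') + '\n');
}
```

Calling writeAuthorLog('comments.html', 'authors.log') would then produce the log file; both file names are placeholders.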



Answer 3:

With awk (if you are comfortable with bash), you can read the page line by line: start copying lines when you reach <p class="metadata">, and stop copying when you reach the closing </p>.

Then work on each extracted block to pull out the name and the link, and repeat for the rest of the page.
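
The same start/stop filter can be sketched outside awk. The version below is a minimal Node.js illustration, not an awk script. It assumes each block opens with the literal text <p class="metadata"> and that the author link is the first anchor inside the block, as in the question. The function names collectMetadataBlocks and parseBlock are invented for this example.

```javascript
// Collects every run of lines from <p class="metadata"> through the next </p>.
function collectMetadataBlocks(html) {
  const blocks = [];
  let current = null; // null while we are outside a metadata block
  for (const line of html.split(/\r?\n/)) {
    if (current === null && line.includes('<p class="metadata">')) {
      current = []; // start copying
    }
    if (current !== null) {
      current.push(line);
      if (line.includes('</p>')) {
        blocks.push(current.join('\n')); // stop copying
        current = null;
      }
    }
  }
  return blocks;
}

// Pulls the href and the text of the first <a> in a block.
// Assumption: the author link is the first anchor in each metadata block.
function parseBlock(block) {
  const match = /<a\s[^>]*href="([^"]*)"[^>]*>([^<]*)<\/a>/.exec(block);
  return match ? { link: match[1], name: match[2].trim() } : null;
}
```

Mapping parseBlock over the result of collectMetadataBlocks gives one link and name pair per comment (or null for a block without an anchor), which can then be written to the log file line by line.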