Bash - Read HTML & find div based on two different

Posted 2019-08-02 09:29

Question:

I am trying to get information from a div based on a date that I hold in a variable. I then want to filter the returned results by a second variable to narrow the list down to a single match, so that I can extract the URL.

Below is an example of the page's HTML. The page has another 10 items like this with different information, and the same date may appear more than once.

<div class="bhangra-artist details ">

    <div class="bhangra-artist card">
        <div class="bhangra-artist-title" style="text-overflow: none;">
        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>
        </div>

        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" data-trackid="Artist Name goes here" title="Artist Name">Artist Name</a>                            </div>


        <time>
            September 08, 2018            </time>

        <div class="release-information">
                                <a class="date-of-release" href="releases-today" data-trackid="releases today" title="releases today">
                    <span class="label-left-box">releases today</span>
                    <span class="label-text">releases today</span>
                </a>
                            <span class="label-hd "></span>
        </div>
    </div>

In my script I am running

DATE=$(cat html.txt | sed -n -e '/bhangra-artist card/,/<\/time>/ p'  )
echo "${DATE}"

This returns the output below, but for all results, so there are about 10 matches returned. I am showing only 3 as an example.

    <div class="bhangra-artist card">
        <div class="bhangra-artist-title" style="text-overflow: none;">
        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>
        </div>

        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" data-trackid="Artist Name goes here" title="Artist Name">Artist Name</a>                            </div>


        <time>
            September 08, 2018            </time>
    <div class="bhangra-artist card">
        <div class="bhangra-artist-title" style="text-overflow: none;">
        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>
        </div>

        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" data-trackid="Name Artist goes here" title="Name Artist">Name Artist</a>                            </div>


        <time>
            September 08, 2018            </time>
    <div class="bhangra-artist card">
        <div class="bhangra-artist-title" style="text-overflow: none;">
        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>
        </div>

        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" data-trackid="Artist1 Name & Artist2 Name" title="Artist1 Name & Artist2 Name">Artist1 Name & Artist2 Name</a>                            </div>


        <time>
            September 05, 2018            </time>

With the returned results, I am now attempting to narrow them down to one result. I have a variable called $ReleaseDate, which will have the value September 08, 2018. Now that ${DATE} holds 10 different divs with dates, I need to match all the ones containing the date in $ReleaseDate. This is the part I am not sure how to do.

I'd expect the results to be narrowed down by the date variable, so with the above example I'd expect the 3 results to be reduced to 2.

    <div class="bhangra-artist card">
        <div class="bhangra-artist-title" style="text-overflow: none;">
        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>
        </div>

        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" data-trackid="Artist Name goes here" title="Artist Name">Artist Name</a>                            </div>


        <time>
            September 08, 2018            </time>
    <div class="bhangra-artist card">
        <div class="bhangra-artist-title" style="text-overflow: none;">
        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>
        </div>

        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" data-trackid="Name Artist goes here" title="Name Artist">Name Artist</a>                            </div>


        <time>
            September 08, 2018            </time>
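
One possible way to do this date filter is sketched below. It is only a sketch under stated assumptions: every card starts on a line containing class="bhangra-artist card" and ends on the line containing </time>, as in the example above. The sample file is a cut-down stand-in for html.txt, and the album ids 111, 222 and 333 are invented so that the cards can be told apart.

```shell
#!/bin/bash
# Hypothetical stand-in for html.txt: three cards reduced to the lines that matter.
# The album ids 111/222/333 are made up for this example.
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
    <div class="bhangra-artist card">
        <a href="/bhangra/artist/album/id/111/title-of-the-album/" title="Title Of The Album">
            Title Of The Album            </a>
        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" title="Artist Name">Artist Name</a>                            </div>
        <time>
            September 08, 2018            </time>
    <div class="bhangra-artist card">
        <a href="/bhangra/artist/album/id/222/title-of-the-album/" title="Title Of The Album">
            Title Of The Album            </a>
        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" title="Name Artist">Name Artist</a>                            </div>
        <time>
            September 08, 2018            </time>
    <div class="bhangra-artist card">
        <a href="/bhangra/artist/album/id/333/title-of-the-album/" title="Title Of The Album">
            Title Of The Album            </a>
        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" title="Artist1 Name & Artist2 Name">Artist1 Name & Artist2 Name</a>                            </div>
        <time>
            September 05, 2018            </time>
EOF

ReleaseDate="September 08, 2018"

# Collect each card into a buffer and print the buffer only if it contains the date.
filterDATE=$(awk -v date="$ReleaseDate" '
    /class="bhangra-artist card"/ { buf = ""; inside = 1 }  # a new card starts here
    inside { buf = buf $0 "\n" }                            # remember every line of the card
    inside && /<\/time>/ {                                  # the card ends at </time>
        if (index(buf, date)) printf "%s", buf              # literal substring match on the date
        inside = 0
    }
' "$html_file")
rm -f "$html_file"

echo "$filterDATE"
```

index() does a plain substring comparison, so the comma and spaces in the date need no escaping. With the sample above, the two September 08 cards are kept and the September 05 card is dropped.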

Once I have narrowed down the results from 10 to the ones matching my date variable, there will be 1-3 results left. I then need to filter these down to 1 result.

I have my final variable, $artistName. This unfortunately contains "Artist Name Name Of The Album", so what I am looking to do is simply match the first word, which will always be an artist name.

So I am looking to match $artistName against the line "Artist Name ". Once this has been done, I want to keep the containing div and remove all other divs so that I am left with one result.

    <div class="bhangra-artist card">
        <div class="bhangra-artist-title" style="text-overflow: none;">
        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>
        </div>

        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" data-trackid="Artist Name goes here" title="Artist Name">Artist Name</a>                            </div>


        <time>
            September 08, 2018            </time>
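
A rough sketch of this artist step follows. It assumes the artist's display name is the link text on the line whose href contains /artist/profile/, and that comparing the first word of $artistName against the start of that text is enough to tell the cards apart. The variable names firstWord and filterARTIST and the album ids 111 and 222 are hypothetical.

```shell
#!/bin/bash
# Hypothetical input: the two cards that survived the date filter.
# The album ids 111/222 are made up so the cards can be told apart.
filterDATE=$(cat <<'EOF'
    <div class="bhangra-artist card">
        <a href="/bhangra/artist/album/id/111/title-of-the-album/" title="Title Of The Album">
            Title Of The Album            </a>
        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" title="Artist Name">Artist Name</a>                            </div>
        <time>
            September 08, 2018            </time>
    <div class="bhangra-artist card">
        <a href="/bhangra/artist/album/id/222/title-of-the-album/" title="Title Of The Album">
            Title Of The Album            </a>
        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" title="Name Artist">Name Artist</a>                            </div>
        <time>
            September 08, 2018            </time>
EOF
)

artistName="Artist Name Name Of The Album"
firstWord=${artistName%% *}   # everything before the first space: "Artist"

# Keep only the card whose artist link text starts with the first word.
filterARTIST=$(printf '%s\n' "$filterDATE" | awk -v first="$firstWord" '
    /class="bhangra-artist card"/ { buf = ""; inside = 1; keep = 0 }
    inside { buf = buf $0 "\n" }
    inside && /\/artist\/profile\// && match($0, />[^<]*<\/a>/) {
        name = substr($0, RSTART + 1, RLENGTH - 5)   # text between ">" and "</a>"
        if (index(name, first) == 1) keep = 1        # the name begins with the first word
    }
    inside && /<\/time>/ {
        if (keep) printf "%s", buf
        inside = 0
    }
')

echo "$filterARTIST"
```

Because this is a prefix test, "Name Artist" is rejected even though it contains the word "Artist". A first word that is also the beginning of a longer, different artist name would still match, so more than one card could remain in that case.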

Once I have only one result, I want to get the link for this album. I believe I can already target the link, but I am matching it against the whole HTML, so I get all instances, because I cannot yet filter the divs based on the variables I have.

        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>

End Result: /bhangra/artist/album/id/123456/title-of-the-album/
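
A minimal sketch of this last step, assuming a single card is already held in a variable (called card here, a hypothetical name) and that href is the first attribute of the album link, as it is in the example HTML:

```shell
#!/bin/bash
# The single remaining card, copied from the example above.
card=$(cat <<'EOF'
    <div class="bhangra-artist card">
        <div class="bhangra-artist-title" style="text-overflow: none;">
        <a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
            Title Of The Album            </a>
        </div>
        <div class="artist-names">
            <a href="/bhangra/artist/profile/id/1234/artist-name/" data-trackid="Artist Name goes here" title="Artist Name">Artist Name</a>                            </div>
        <time>
            September 08, 2018            </time>
EOF
)

# Split the album link line on double quotes; field 2 is then the href value.
GETURL=$(printf '%s\n' "$card" | awk -F'"' '/<a href="[^"]*\/artist\/album\// { print $2; exit }')

echo "$GETURL"   # prints /bhangra/artist/album/id/123456/title-of-the-album/
```

Splitting on the quote character avoids having to strip the surrounding attributes afterwards. The artist profile link is skipped because its href does not contain /artist/album/.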

I have certain elements done but I am totally stuck on how to piece it all together.

So to Recap:-

My Variables and Values are:-

DATE="September 08, 2018"
artistName="Artist Name Name Of The Album"

The code I have so far.

#!/bin/bash

echo "date : ${DATE}" #This has the value September 08, 2018
echo "artist: ${artistName}" #This has the value Artist Name Name Of The Album
# Get HTML and find the DIV's containing the information that is required.
# GetContainer Reads the html file html.txt, using sed to target the entire bhangra artist panel until the time section. This returns 10 results.
GetContainer=$(cat html.txt | sed -n -e '/bhangra-artist details/,/<\/time>/ p' )
IFS=$OIFS
# Pass GetContainer into another variable called filterDATE, search it for the date, keep the containing div and remove all other results.
# This step should take the results down from 10 to only a handful, usually 1-3, as the date match would only return 1-3 results in total.
filterDATE=$(echo -n "$GetContainer" )
filterDATE=$(echo -n "$filterDATE" ) # Unsure how to do this, so this step is a placeholder: I do not know how to check the selected date against my variable $DATE and then get all of the outer div.

# Now that I only have 1-3 results, I want to narrow this down to one result.
# Using $artistName, keep the containing div and remove all other results. After this step I should be left with only 1 result.

# Once I have narrowed the results down to 1, strip away the HTML so that only the link remains. I guess at this point I need to use the results from filterDATE, but for now I am checking that I can pull back the link from the raw HTML.
# Within each title block, print only the value of the href attribute.
GETURL=$(sed -n '/bhangra-artist-title/,/<\/a>/ s/.*<a href="\([^"]*\)".*/\1/p' html.txt)

echo "${DATE}"
echo "${filterDATE}"
echo "${GETURL}"

Any help would be appreciated.