I'm trying to parse an HTML file that I converted to a TXT file in Automator.
I previously downloaded the HTML file from a website using Automator, and I am now struggling to parse its source.
Ideally, I want to extract just the table, and I need to repeat this for 1,800 different HTML files.
Here is an example of the source code:
</head>
<body>
<div id="header">
<div class="wrapper">
<span class="access">
<div id="fb-root"></div>
<span class="access">
Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a> Logged in as Edward | <a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>
</span>
</span>
</div><!-- /wrapper -->
</div><!-- /header -->
<div id="masthead">
<div class="wrapper">
<a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
<div id="navigation">
<ul>
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li> <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>
</div><!-- /navigation -->
</div><!-- /wrapper -->
</div><!-- /masthead -->
<div id="content">
<div class="wrapper">
<div id="main-content">
<!-- per Project stuff -->
<span class="section">
<img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
<h1><span id="profile-name-104947" >Christian Sieling</span></h1>
<ul class="gbutton-group right">
<li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">« Back </a></li>
<li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752" id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
</ul>
<div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
<span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
<a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
</div>
<h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>
</span>
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
<tr>
<th>Role</th>
<td>
<p>Other</p> </td>
</tr>
<tr>
<th>Organisation Type</th>
<td>
<p>Asset Manager</p> </td>
</tr>
<tr>
<th>Email</th>
<td><a href="mailto:cs@lumixcapital.com" title="cs@lumixcapital.com" >cs@lumixcapital.com</a></td>
</tr>
<tr>
<th>Website</th>
<td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
</tr>
<tr>
<th>Phone</th>
<td>41 78 616 7334</td>
</tr>
<tr>
<th>Fax</th>
<td></td>
</tr>
<tr>
<th>Mailing Address</th>
<td>Birrenstrasse 30</td>
</tr>
<tr>
<th>City</th>
<td>Schindellegi</td>
</tr>
<tr>
<th>State</th>
<td>CH</td>
</tr>
<tr>
<th>Country</th>
<td>Switzerland</td>
</tr>
<tr>
<th class="lastrow" >Zip/ Postal Code</th>
<td class="lastrow" >8834</td>
</tr>
</table>
</div><!-- /main-content -->
<div id="sidebar" >
</div>
<div id="similar_sidebar" class="similar_refine" >
</div>
</div><!-- /wrapper -->
</div><!-- /content -->
<div id="footer">
</div>
My AppleScript attempt, which uses text item delimiters to extract the table:
set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the table
to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText
	set endItems to text of text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text of text item 1 of endItems
	set AppleScript's text item delimiters to tid
	return beginningToEnd
end extractBetween
How can I parse the table from the HTML file?
Rather than writing your own HTML parser, you can use the one built into Safari via the do JavaScript command. JavaScript has built-in functionality (the DOM) for working with HTML elements and their data.
This script gets the HTML for just the first table in a page:
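A minimal sketch of such a script, assuming the page is already open as the front document in Safari and that "Allow JavaScript from Apple Events" is enabled in Safari's Develop menu (recent Safari versions require it). The variable name tableHTML is illustrative:

```applescript
tell application "Safari"
	-- Ask Safari's own HTML parser for the first <table> element on the page.
	-- outerHTML includes the <table> tag itself; innerHTML would return only its rows.
	set tableHTML to do JavaScript "document.getElementsByTagName('table')[0].outerHTML" in document 1
end tell
```

If the page has no table, the JavaScript expression throws and the script gets no usable text back, so a batch run over many files would probably want a try block around the call.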
You can use this technique to apply basic DOM scripting to any page and pull out whatever data you want, such as just the values of the table cells.
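For example, reading out just the cell values might look like the sketch below. It assumes the table has the id profile-table and that every row holds one th followed by one td, as in the sample page from the question. It also assumes the page is the front document in Safari with "Allow JavaScript from Apple Events" enabled. The variable names js and cellText are illustrative:

```applescript
-- JavaScript that walks the table rows and joins each label/value pair with a tab.
-- The doubled backslashes reach JavaScript as \t and \n escapes.
set js to "Array.prototype.map.call(document.getElementById('profile-table').rows, function (r) { return r.cells[0].textContent.trim() + '\\t' + r.cells[1].textContent.trim(); }).join('\\n');"

tell application "Safari"
	-- do JavaScript returns the value of the last expression, here one string
	set cellText to do JavaScript js in document 1
end tell

-- One "label<tab>value" entry per table row
set rowList to paragraphs of cellText
```

Returning one delimited string keeps the hand-off between JavaScript and AppleScript simple; the rows can then be split again with paragraphs of, or with text item delimiters set to tab.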
A one-liner that works:
tell application "Safari" to set sourceCode to characters (offset of "<table" in (source of document 1 as string)) thru ((offset of "/table" in (source of document 1 as string)) + (count of "/table")) of (source of document 1 as string) as string
You're really close. The problem is your startText value: the literal text <table> never appears in the HTML, so it can't be found. The line that starts the table is actually...
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
So I modified your code to look for that tag in two steps. First...
<table
And then this separately...
>
That way we can ignore all of the attributes inside the table tag (width, border, etc.), which I assume will vary between files. After doing this we get only the code of the table. Try this...
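A minimal sketch of that two-step approach, written for Automator's Run AppleScript action. It assumes input carries the page source as text. The handler name extractTable and the variable names are hypothetical, not taken from the original answer:

```applescript
on run {input, parameters}
	-- Assumption: input holds the page source; Automator passes a list,
	-- so coerce it to a single string first
	set pageSource to input as text
	return extractTable(pageSource)
end run

-- extractTable is a hypothetical handler name
to extractTable(searchText)
	set tid to AppleScript's text item delimiters
	try
		-- Step 1: keep what follows the first "<table"
		set AppleScript's text item delimiters to "<table"
		set afterOpen to text item 2 of searchText
		-- Step 2: drop the attributes (width, border, ...) by skipping past the first ">"
		set AppleScript's text item delimiters to ">"
		set afterTag to (text items 2 thru -1 of afterOpen) as text
		-- Step 3: cut everything from "</table>" onward
		set AppleScript's text item delimiters to "</table>"
		set tableCode to text item 1 of afterTag
	on error
		-- No "<table" (or no closing ">") in the text
		set tableCode to ""
	end try
	-- Always restore the delimiters, even after an error
	set AppleScript's text item delimiters to tid
	return tableCode
end extractTable
```

This returns only the markup between the opening and closing table tags, and only for the first table in the text, which matches the sample page. Returning an empty string on failure lets a batch run over many files continue instead of stopping at the first page without a table.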