Retrieve data from HTML table in C#

Posted 2020-04-08 13:51

Question:

I want to retrieve data from an HTML document. I am scraping data from a website and am almost done, but I ran into an issue when trying to retrieve data from a table. Here is the HTML:

<div id="middle_column">
<form action="url?" method="post" name="inquirydetail">
    <input type="hidden" name="ServiceName" value="SurgeWebService">
    <input type="hidden" name="TemplateName" value="Inpat_AvailableResponses.htm">
    <input type="hidden" name="CurrentPage" value="inquirydetail">
    <form method="post" action="url" name="ResponseSel" onSubmit="return EditPage(document.forms[3])">    
<TABLE>
<tBody>
 <table>
....
</table>

 <table>
....
</table>
 <table border="0" width="90%">
                    <tr>
                      <td width="10%" valign="bottom" class="content"> Service Number</td>
                      <td width="30%" valign="bottom" class="content"> Status</td>
                      <td width="50%" valign="bottom" class="content"> Status Date</td>
                    </tr>
                    <tr>
                      <td width="20%" bgcolor="white" class="subtitle">1</td>
                      <td width="40%" bgcolor="white" class="subtitle">Approved</td>
                      <td width="40%" bgcolor="white" class="subtitle">03042014</td>
                    </tr>
                    <tr>
                      <td></td>
                    </tr>
                  </table>
</tbody>
</TABle>
</div>

I need to retrieve the value of the Status field (here it is "Approved") and write it to a SQL database. There are many tables inside the form tag, and the tables do not have IDs. How can I get the correct table, row, and cell? Here is my code:

HtmlElement tBody = WB.Document.GetElementById("middle_column");
if (tBody != null)
{
    string sURL = WB.Url.ToString();
    int iTableCount = tBody.GetElementsByTagName("table").Count;
}
for (int i = 0; i <= iTableCount; i++)
{
    HtmlElement tb = tBody.GetElementsByTagName("table")[i];
}

Something is wrong here. Please help with this.

Answer 1:

Don't you have any control over the page being displayed within the WebBrowser control? If you do, it's better to add an id attribute to the Status TD. Then your life would be much easier.
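
If adding an id is possible, the lookup reduces to a single call. A minimal sketch, assuming a hypothetical id of "statusCell" on the Status cell (the page in the question has no such id, and the class and method names here are only for illustration):

```csharp
using System.Windows.Forms;

static class StatusById
{
    // "statusCell" is a hypothetical id; it must be added to the
    // Status <td> on the page before this lookup can succeed.
    public static string ReadStatus(HtmlDocument document)
    {
        HtmlElement cell = document.GetElementById("statusCell");

        // GetElementById returns null when no element has that id.
        return cell == null ? null : cell.InnerText;
    }
}
```

It could be called as StatusById.ReadStatus(WB.Document) once the page has finished loading.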

Anyway, here's how you could search for a value within a table.

HtmlElementCollection tables = this.WB.Document.GetElementsByTagName("table");

foreach (HtmlElement TBL in tables)
{
    // All <tr> elements under this table (including rows of any nested tables).
    foreach (HtmlElement ROW in TBL.GetElementsByTagName("tr"))
    {
        // All <td> elements under this row.
        foreach (HtmlElement CELL in ROW.GetElementsByTagName("td"))
        {
            // Now you are looping through all cells in each table.

            // Here you could use CELL.InnerText to search for "Status" or "Approved".
        }
    }
}

But this is not a good approach, since you loop through every table and every cell within each table to find your text. Keep it as a last resort.

Hope this helps you to get an idea.
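
To turn the loop above into an actual lookup for the question's table, one option is to find the header cell whose text is "Status" and read the cell in the same column of the next row. This is only a sketch under assumptions: the class and method names are made up for illustration, the header text is assumed to match exactly after trimming, and the value row is assumed to directly follow the header row. Because the tables are nested, a row list taken from an outer table also contains the rows of its inner tables, so a layout different from the one in the question may need extra checks.

```csharp
using System.Windows.Forms;

static class StatusScraper
{
    // Returns the text of the cell directly below the header cell whose
    // trimmed text equals headerText, or null if no such header is found.
    public static string FindValueBelowHeader(HtmlDocument document, string headerText)
    {
        foreach (HtmlElement table in document.GetElementsByTagName("table"))
        {
            HtmlElementCollection rows = table.GetElementsByTagName("tr");

            // Stop one row early so that rows[r + 1] always exists.
            for (int r = 0; r + 1 < rows.Count; r++)
            {
                HtmlElementCollection headerCells = rows[r].GetElementsByTagName("td");
                for (int c = 0; c < headerCells.Count; c++)
                {
                    string text = headerCells[c].InnerText;
                    if (text == null || text.Trim() != headerText)
                        continue;

                    // Header found: read the same column in the next row.
                    HtmlElementCollection valueCells = rows[r + 1].GetElementsByTagName("td");
                    if (c < valueCells.Count)
                    {
                        string value = valueCells[c].InnerText;
                        return value == null ? null : value.Trim();
                    }
                }
            }
        }
        return null;
    }
}
```

For the HTML in the question, StatusScraper.FindValueBelowHeader(WB.Document, "Status") would be expected to pick the second column of the header row and return the text of the second cell in the following row.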



Answer 2:

I prefer using the dynamic type and the DomElement property, but you must be using .NET 4 or later.

For tables, the main advantage here is that you don't have to loop through everything. If you know the row and column that you are looking for, then you can just target the important data by row and column numbers instead of looping through the whole table.

The other big advantage is that you can basically use the entire DOM, reading more than just the contents of the table. Make sure you spell the property names exactly as you would in JavaScript (rows, cells, innerText), even though you are writing C#.

HtmlElement myTableElement;
//Set myTableElement using any GetElement...  method.
//Use a loop or square bracket index if the method returns an HtmlElementCollection.
dynamic myTable = myTableElement.DomElement;
for (int i = 0; i < myTable.rows.length; i++)
{
    for (int j = 0; j < myTable.rows[i].cells.length; j++)
    {
        string CellContents = myTable.rows[i].cells[j].innerText;

        //You are not limited to innerText; you have the whole DOM available.

        //Do something with the CellContents.

    }
}
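
If the row and column are known in advance, the loops above can be skipped. A minimal sketch of that direct lookup, with bounds checks added; the class and method names are hypothetical, and the project needs a reference to Microsoft.CSharp for dynamic to compile:

```csharp
using System.Windows.Forms;

static class TableCellReader
{
    // Reads one cell by zero-based row and column index through the
    // underlying DOM object. Returns null if either index is out of range.
    public static string ReadCell(HtmlElement tableElement, int rowIndex, int columnIndex)
    {
        dynamic table = tableElement.DomElement;

        if (rowIndex < 0 || rowIndex >= (int)table.rows.length)
            return null;

        dynamic row = table.rows[rowIndex];

        if (columnIndex < 0 || columnIndex >= (int)row.cells.length)
            return null;

        // Same DOM property names as in JavaScript.
        return (string)row.cells[columnIndex].innerText;
    }
}
```

For the page in the question, and assuming the status table is the last table in the document, a call might look like this: take tables = WB.Document.GetElementsByTagName("table"), then TableCellReader.ReadCell(tables[tables.Count - 1], 1, 1) to read the Status value from the second row, second column.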