-->

Parsing complex HTML tables

2020-08-24 02:51发布

问题:

I'm trying to parse the class schedule provided by my university in order to import the information into some kind of calendar. An example of the schedule can be seen here:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIA13-7.%20Block.html

The auto-generated HTML-content is, in my opinion, a mess and very hard to grasp. E.g. the tables are mainly built with rowspans and colspans (the positions of cells in the code compared with their actual visual position in the browser seem partially arbitrary).

What I've already tried:

  1. Asking the university's administration office to provide a simpler, easier to read file separately. Of course this wasn't possible, after all it would mean one minute of additional effort.
  2. Researching the original tool used to generate the HTML. It is called "sked Stundenplan Software". I couldn't find any hints or tools to "reverse" the generation process.
  3. Looking for an existing solution, at which point I found some tools (e.g. http://code.google.com/p/skd-schedule-parser/) that do not work for my schedule. After studying the codes of these tools I concluded that they must have been designed for an other/outdated version of sked.
  4. Parsing the HTML with PHP (mostly using DOMDocument). That worked sometimes, but was way too unreliable...The exceptions to take into account seem indefinite.

Right now I don't think that conventional HTML parsing will get me far, at least not in an acceptable developing time. What I am looking for are other methods to fetch information from complex HTML tables, something like YQL, or maybe utilities that can normalize such tables with col-/rowspans. Because I don't have anything concrete in mind, I am mainly asking for some tips or hints for another approach.

Are there other, more suitable methods to parse such tables or am I stuck with conventional HTML parsing?

Edit:

On behalf of a request, I'll paste an example of raw code...

This week:

Results from this code:
http://pastebin.com/BJduUVtU

Edit 2:
Because of some parsing discussions I'll also add my PHP code. It's my first time with PHP so it's not very sophisticated. It should rather give an insight on how far I've come with parsing the tables in theory. The actual work happens in the function parseSkedTable(), please concentrate on this one. Also, I would like to point out the term "double courses" appearing in the comments, which describes two different courses happening at the same time (the class would be split in such moments). An example of these courses can be found here in week two:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIB14-4.%20Block.html

It looks like this:

The corresponding HTML-code of that week can also be accessed here:
http://pastebin.com/gLTWz5KU

And now the PHP-code (I had a hard time translating the comments since I already struggled expressing them in my first language...I hope they may still be helpful):
http://pastebin.com/Nzi8m2v8

Update

So far, there have been some solutions to my parsing problem, each of them using JavaScript. Since JavaScript (being especially powerful here because of the ability to use browser-rendered data) seems to be the only efficient way to retrieve reliable information from the HTML, I am now looking for a way to implement some kind of headless browser or rendering engine on my free server at x10hosting.com. Sadly, I am neither able to install software other than provided by softaculous nor allowed to use PHP's exec() command.
Any idea would be appreciated!

For the sake of completeness, I'll post both solutions, existing until now:

  1. jQuery parser by Pierre Dubois:

    (function ($) { $(document).ready(function() {

        var _pe = window.pe || {
            fn : {}
        };
    
        var tblNumber = 0; // Just a incremental number to identify the schedule item with the table
    
        // For each table
        $('table').each(function () {
    
            $('#output').append('Parsing the table number: ' + tblNumber + '<br>');
            // console.log('Parsing the table number: ' + tblNumber);
            tblNumber += 1;
    
            var currentTable = this;
    
    
            // Parser the complex table
            _pe.fn.parsertable.parse($(currentTable));
    
            // Retrieve the parsed data
            var parsedData = $(currentTable).data().tblparser;
    
            //
            // Information about the column structure, nice that is consistent
            //
    
            // Day: Cell index position (0 based)
            // Mo: 3
            // Di: 7
            // Mi: 11
            // Do: 15
            // Fr: 19
            // Sa: 23
    
            // Title Location at Row index position "0"
    
            // "i" represent the middle column position
            for (var i = 3; i < 24; i += 4) {
    
                var currentDay;
    
                // Get the day
                currentDay = $(parsedData.row[0].cell[i].elem).text();
    
                $('#output').append('  Day: ' + currentDay + '<br>');
                // console.log('Day: ' + currentDay);
    
                // Get all the events for that day, excluding the first row and the last row
                for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) {
    
                    // First column 
                    if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i - 1].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
    
                    // Second Column
                    if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid &&
                        parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
    
                    // Third Column
                    if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid &&
                        parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i + 1].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
                } 
    
            }
    
        });
    
    
    });
    

    }(jQuery));

  2. JS parser using positional information by me, realizing rambo coder's idea

回答1:

You could make use of a browsers rendering/layout engine here.

Use http://phantomjs.org/ to get access to a headless browser that lets you execute javascript on a webpage's dom.

A dash of jquery would make the remaining pseudocode easy to implement:

foreach (td.t as dateElement) {
    //parse date from element text
    //use pixel position + dimensions to calc pixel coord of center
    // save this center in a list along with the date
}

foreach (td.v as calendarEntryElement) {
    //parse time + other stuff from element text
    //use pixel position to find the closest date element in that list(it must be the closest one above)
}

I feel positional information would be very reliable here, because everything is a nested rectangle and its all done via tables.

You don't need to use phantomjs, you could just as easily execute a browser manually, and let it send a request to a local server to collect the results.

Some shell command roughly like

firefox file://foo123.html

Where you've appended some custom <script> to the end of one of their webpages and saved it.



回答2:


I study at the same university and a few weeks ago I faced the same problem to parse this time table and convert it to an ICS file. Finally I found my own solution and generalized the code, so that students from other universities, using the Sked software and have a much more complex time table, can import their time table too.
I also created a website, where students can sign up and configure the urls to the time tables which they want to subscribe. In the background runs a cronjob which ensures, that the subscribed calendars are always up to date. You can find the result of the project on my website:
http://calendar.pineappledeveloper.com/
(it is only in German available).



回答3:

Many developers use HTML Agility Pack to parse HTML:

http://htmlagilitypack.codeplex.com/



回答4:

I have some hope for your problem since the html is well formed and is x-html compliant.

The following Java program parse it successfully but without extracting information.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public final class AgendaParser {
   public static void main( String[] args ) throws Throwable {
      long atStart = System.currentTimeMillis();
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      DocumentBuilder db = dbf.newDocumentBuilder();
      Document doc = db.parse( "WIA13-7. Block.html" );
      System.err.println( "Elapsed time: " + ( System.currentTimeMillis() - atStart ) / 1000L + " seconds" );
      System.err.println( doc.getElementsByTagName( "table" ));
   }
}

it outputs:

Elapsed time: 118 seconds (very long, no?)
com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@7faea002

I thing a couple of hours of work gives you 80% of the extraction.



回答5:

Interresting complex table. A mix between a layout table and a data table.

This may help you, that solution is in jQuery and use the same complex table parser available in the Web Experience Toolkit Project. With that table parser, you will be able to retreive your schedule data. The only thing that need to be done would be to parse the schedule item content for the importation to your calendar apps.

This solution use the column instead of the row to retrieve the schedule items

Working example : http://jsfiddle.net/3t2A8/2/

Here the javascript code that is used to extract and show the schedule items

(function ($) {
    $(document).ready(function() {

        var _pe = window.pe || {
            fn : {}
        };

        var tblNumber = 0; // Just a incremental number to identify the schedule item with the table

        // For each table
        $('table').each(function () {

            $('#output').append('Parsing the table number: ' + tblNumber + '<br>');
            // console.log('Parsing the table number: ' + tblNumber);
            tblNumber += 1;

            var currentTable = this;


            // Parser the complex table
            _pe.fn.parsertable.parse($(currentTable));

            // Retrieve the parsed data
            var parsedData = $(currentTable).data().tblparser;

            //
            // Information about the column structure, nice that is consistent
            //

            // Day: Cell index position (0 based)
            // Mo: 3
            // Di: 7
            // Mi: 11
            // Do: 15
            // Fr: 19
            // Sa: 23

            // Title Location at Row index position "0"

            // "i" represent the middle column position
            for (var i = 3; i < 24; i += 4) {

                var currentDay;

                // Get the day
                currentDay = $(parsedData.row[0].cell[i].elem).text();

                $('#output').append('  Day: ' + currentDay + '<br>');
                // console.log('Day: ' + currentDay);

                // Get all the events for that day, excluding the first row and the last row
                for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) {

                    // First column 
                    if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) {

                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i - 1].cell[j].elem).text().trim();

                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }

                    // Second Column
                    if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid &&
                        parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) {

                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i].cell[j].elem).text().trim();

                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }

                    // Third Column
                    if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid &&
                        parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) {

                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i + 1].cell[j].elem).text().trim();

                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
                } 

            }

        });


    });

}(jQuery));

it would output for the "double courses" table

Parsing the table number: 0
  Day: Mo, 22.10.2012
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      DATMOD Gr. 1
      HG: 13
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      PROG III Gr. 2
      HG: 15
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      DATMOD Gr. 1
      HG: 13
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      PROG III Gr. 2
      HG: 15
  + Event: 16:15 - 17:45 Uhr
      Vorlesung
      DATMOD Gr. 2
      HG: 13
  + Event: 16:15 - 17:45 Uhr
      Vorlesung
      PROG III Gr. 1
      HG: 15
  + Event: 18:00 - 19:30 Uhr
      Vorlesung
      DATMOD Gr. 2
      HG: 13
  + Event: 18:00 - 19:30 Uhr
      Vorlesung
      PROG III Gr. 1
      HG: 15
  Day: Di, 23.10.2012
  + Event: 9:00 - 10:30 Uhr
      Vorlesung
      DATMOD Gr. 2
      HG: 13
  + Event: 10:45 - 12:15 Uhr
      Vorlesung
      DATMOD Gr. 2
      HG: 13
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      DATMOD Gr. 1
      HG: 13
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      DATMOD Gr. 1
      HG: 13
  + Event: 16:15 - 17:45 Uhr
      Vorlesung
      PROG III Gr. 1
      HG: 15
  + Event: 18:00 - 19:30 Uhr
      Vorlesung
      PROG III Gr. 1
      HG: 15
  Day: Mi, 24.10.2012
  + Event: 9:00 - 10:30 Uhr
      Vorlesung
      DATMOD
      HG: 09
  + Event: 10:45 - 12:15 Uhr
      Vorlesung
      DATMOD
      HG: 09
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      IuF
      HG: 09
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      IuF
      HG: 09
  Day: Do, 25.10.2012
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      PROG III Gr. 2
      HG: 15
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      PROG III Gr. 2
      HG: 15
  + Event: 16:15 - 17:45 Uhr
      Vorlesung
      Linux Gr. 2
      HG: 15
  + Event: 18:00 - 19:30 Uhr
      Vorlesung
      Linux Gr. 2
      HG: 15
  Day: Fr, 26.10.2012
  Day: Sa, 27.10.2012
  + Event: 9:00 - 11:15 Uhr
      Klausur
      MP INT RW
      HG: 14

:-)



回答6:

pseudoPHP

class ScheduleTableParser {
  buildTimetableFromTable() {
    //Trivial
    Parse Day Rowspans, Day Names, Dates into $this->days;
    e.g. $days[0]['rowspan'] = 4 for Monday 22/10/2012

    //Extract Lessons      
    $tr = 0;
    foreach tr {
      $td = 0;
      foreach td{
        if(td.class = 'v') {
          parseClass($td,$tr,$tdDOMObject);
        }
        $td++;
      }
    }
 }
 parseClass($td,$tr,$tdDOMObject) {
   //Trivial
   Get the Class Name Etc   

   //Calculate Time
   $time = 9:00 + 5mins*tr;
   $tr = $tr - 2;
   $i = 0;
   while($tr > 0) {
     $tr - $this->days[$i]['rowspan'];
     $day = $this->days[$i]['name'];
     $date = $this->days[$i]['date'];
     $i++;
   }
 }
}