I fetch an HTML string using curl:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html_string = curl_exec($ch);
When I echo it, I see perfectly good HTML, which is exactly what I need for parsing.
But when I pass this string to the HTML DOM Parser function str_get_html($html_string), it fails to load it (the call returns false).
I tried saving it to a file and calling file_get_html on that file, but the same thing happens.
What could be causing this? As I said, the HTML looks perfectly fine when I echo it.
Thanks a lot.
The code itself:
$html = file_get_html("http://www.bgu.co.il/tremp.aspx");
$v = $html->find('input[id=__VIEWSTATE]');
$viewState = $v[0]->attr['value'];
$e = $html->find('input[id=__EVENTVALIDATION]');
$event = $e[0]->attr['value'];
$html->clear();
unset($html);
$body = " A_STRING_THAT_CONTAINS_SOME_DATA ";
$ch = curl_init("http://www.bgu.co.il/tremp.aspx");
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html_string = curl_exec($ch);
$file_handle = fopen("file.txt", "w");
fwrite($file_handle, $html_string);
fclose($file_handle);
curl_close($ch);
$html = str_get_html($html_string);
The page behind your curl URL seems to contain many elements (it is a large file).
I was parsing a string (file) about as large as yours and ran into the same problem.
After reading the source code, I found the cause, and the fix works for me!
simple_html_dom.php limits the size of the input it will read.
You have to raise the default size limit (it is defined at the top of simple_html_dom.php).
Maybe change it to 100000000? It's up to you.
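To check whether the size limit is really what you are hitting, you can compare the length of the curl response against the limit before parsing. This is only a sketch: I am assuming the limit is the MAX_FILE_SIZE constant (600000 in the copies of simple_html_dom.php I have seen), and exceeds_parser_limit() is a hypothetical helper, not part of the library.

```php
<?php
// Assumption: simple_html_dom.php defines MAX_FILE_SIZE near the top of the file
// (600000 in the copies I have seen). Defining it yourself before the include only
// helps if your copy guards its own define(); otherwise edit the value in the file.
if (!defined('MAX_FILE_SIZE')) {
    define('MAX_FILE_SIZE', 100000000);
}

// Hypothetical helper: mirrors the kind of size check that makes
// str_get_html() give up and return false on an empty or oversized string.
function exceeds_parser_limit(string $html, int $limit = MAX_FILE_SIZE): bool
{
    return $html === '' || strlen($html) > $limit;
}
```

If exceeds_parser_limit($html_string, 600000) is true for your response, the size limit is the likely reason str_get_html returns false.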
I assume that you are using curl + str_get_html instead of simply calling file_get_html with the URL because of the POST parameters you need to send.
You can use the W3C validator (http://validator.w3.org/#validate_by_input+with_options) to validate the returned HTML. Once you are sure the result is 100% valid HTML, you can report a bug here: http://sourceforge.net/p/simplehtmldom/bugs/.
Did you check whether the HTML is encoded in a way the HTML DOM Parser doesn't expect? E.g. with HTML entities like
&lt;html&gt;
instead of <html>
That would still be displayed as correct HTML in your browser, but it wouldn't parse.
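If that turns out to be the case, one possible workaround is to decode the entities before handing the string to the parser. A minimal sketch using PHP's built-in html_entity_decode; the helper name decode_if_entity_encoded and the "starts with &lt;" heuristic are my own assumptions, not something the parser provides.

```php
<?php
// Hypothetical helper: undo entity-encoding only when the markup looks escaped.
function decode_if_entity_encoded(string $html): string
{
    // Heuristic (assumption): escaped markup starts with "&lt;" rather than "<".
    if (strncmp(ltrim($html), '&lt;', 4) === 0) {
        return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    // Already real markup: leave it untouched so legitimate entities survive.
    return $html;
}

$escaped = '&lt;html&gt;&lt;body&gt;Hi&lt;/body&gt;&lt;/html&gt;';
$plain = decode_if_entity_encoded($escaped);
```

You would then pass $plain (rather than the raw curl result) to str_get_html.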