I've written a script in PHP to scrape a title, visible as hair fall shamboo, from a webpage. When I execute the script below, I get the following error:
Notice: Trying to get property 'nodeValue' of non-object in C:\xampp\htdocs\runcode\testfile.php on line 16.
Script I've tried with:
<?php
function get_content($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    return $htmlContent;
}
$link = "https://www.purplle.com/search?q=hair%20fall%20shamboo";
$xml = get_content($link);
$dom = @DOMDocument::loadHTML($xml);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//h1[@class="br-hdng"]/span')->item(0)->nodeValue;
echo "{$title}";
?>
My expected output is:
hair fall shamboo
Although the XPath I used in the script above seems to be correct, here is the relevant portion of the HTML in which the title can be found:
<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>
P.S.: The title I wish to parse gets loaded dynamically. As I'm new to PHP, I don't understand whether the way I tried is correct. If not, what should I do instead?
The following are scripts I've created in two other languages, and both work like magic.
I got success using JavaScript:
const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://www.purplle.com/search?q=hair%20fall%20shamboo");
            let urls = await page.evaluate(() => {
                let items = document.querySelector('h1.br-hdng span');
                return items.innerText;
            })
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);
Again, I got success using Python:
import requests_html
with requests_html.HTMLSession() as session:
    r = session.get('https://www.purplle.com/search?q=hair%20fall%20shamboo')
    r.html.render()
    item = r.html.find("h1.br-hdng span", first=True).text
    print(item)
What's wrong with PHP then?
PHP doesn't run JavaScript. Presumably puppeteer from your JavaScript code, as well as requests_html from your Python code, both run JavaScript.
Your problem is that this page loads the br-hdng title and the products with JavaScript; they are not part of the HTML at all. It's all actually loaded from https://www.purplle.com/api/shop/itemsv3, with a bunch of GET parameters. You need to do JSON parsing here, not HTML parsing :)
But before you can access that API, you need the cookies given by the search page, and the search string must match the API search string (otherwise the API will just return errors). Check this:
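A rough sketch of that flow, under stated assumptions: the endpoint is the one named above, but the answer does not list the actual GET parameters, so the single `q` parameter sent to the API here is a hypothetical placeholder. The helper names (`build_api_url`, `fetch_url`, `fetch_search_results`) are also made up for illustration, and the shape of the returned JSON is not known, so the result is only decoded, not interpreted.

```php
<?php
// Sketch only: the API's real GET parameters are not known here;
// 'q' is a hypothetical placeholder, as are all function names.

/** Build a URL from an endpoint and query parameters (spaces become %20). */
function build_api_url(string $endpoint, array $params): string
{
    return $endpoint . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

/** Fetch a URL on an existing cURL handle so cookies persist between calls. */
function fetch_url($ch, string $url): string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

/** Visit the search page for cookies, then request the JSON API. */
function fetch_search_results(string $query): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_USERAGENT      => 'Mozilla/5.0',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => '', // empty string turns on the in-memory cookie engine
    ]);

    // 1) Load the search page first; its cookies stay on the handle.
    fetch_url($ch, build_api_url('https://www.purplle.com/search', ['q' => $query]));

    // 2) Call the API with the same search string (parameter name assumed).
    $json = fetch_url($ch, build_api_url('https://www.purplle.com/api/shop/itemsv3', ['q' => $query]));

    // 3) Parse JSON instead of HTML.
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new RuntimeException('Response was not valid JSON: ' . json_last_error_msg());
    }
    return $data;
}
```

Calling `fetch_search_results('hair fall shamboo')` and running `var_dump()` on the result would be one way to find where the title sits in the response. The real parameter list can be copied from the browser's network tab while the search page loads.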

It could very well be that there are more issues with your code than I have covered in this answer, but the most prominent issue that I see is the following:
DOMDocument::loadHTML() is not a static method, but an instance method (which returns a boolean). Static calls were deprecated in PHP 7 and throw an Error as of PHP 8. You should first create an instance of DOMDocument and then call loadHTML() on that instance.
However, since you have suppressed errors with the @ operator on that particular line, you are not receiving a warning about this. And although the error suppression operator @ is very commonly used to suppress HTML validation errors like this, you should look into using libxml_use_internal_errors() [1] instead, as this does not suppress general PHP errors.
As a final note: it's possible to load a DOM document from a URL directly (without the need for cURL) with DOMDocument::loadHTMLFile(), if your PHP installation is configured to allow loading of URLs via the configuration setting allow_url_fopen. Be aware though that this setting is often disabled for security reasons, so use it with care if you plan on using it.
Here's a simple test-case which should work as expected:
See this example interpreted online on 3v4l.org
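A minimal sketch of such a test case, assuming `$html` holds the markup quoted in the question (in a real run it would come from `get_content()`). It creates a DOMDocument instance, uses libxml_use_internal_errors() in place of `@`, and checks the XPath result before reading `nodeValue`, which is where the "non-object" notice comes from when nothing matches.

```php
<?php
// Assumption: $html stands in for the fetched page; it is the
// fragment quoted in the question, not a live response.
$html = '<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>';

// Collect HTML parse warnings internally instead of suppressing with @.
libxml_use_internal_errors(true);

$dom = new DOMDocument;   // create an instance first...
$dom->loadHTML($html);    // ...then call the instance method on it
libxml_clear_errors();    // discard any collected parse warnings

$xpath = new DOMXPath($dom);
$node  = $xpath->query('//h1[@class="br-hdng"]/span')->item(0);

// item(0) returns null when the query matches nothing, so guard the access.
$title = $node !== null ? trim($node->nodeValue) : null;

echo $title ?? 'title not found', PHP_EOL;
```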
You should replace the contents of $html with the output of your get_content() call. If it doesn't work, then either:
- there's something wrong with fetching the HTML with cURL (do var_dump( $html ); before loading it into DOMDocument, for instance, to see the contents you retrieved), or...
- perhaps you are working inside a namespace, in which case you should prepend a backslash before DOMDocument and DOMXPath, i.e.: new \DOMDocument; and new \DOMXPath( $dom );.

1. LibXML is the XML library that is used by DOMDocument to parse XML/HTML documents.
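
The loadHTMLFile() note above could look roughly like the sketch below. The helper name `load_dom_from` is hypothetical. It refuses http(s) locations when allow_url_fopen is off and otherwise hands the path or URL to DOMDocument::loadHTMLFile(). This only changes how the HTML is fetched; content that the page injects with JavaScript would still be missing.

```php
<?php
// Hypothetical helper for illustration; not part of the original answer.
function load_dom_from(string $location): ?DOMDocument
{
    // Remote locations need the allow_url_fopen setting to be enabled.
    $isUrl = preg_match('#^https?://#i', $location) === 1;
    if ($isUrl && !filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null;
    }

    // Keep libxml parse warnings internal, then restore the previous mode.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $loaded = $dom->loadHTMLFile($location);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $loaded ? $dom : null;
}
```

With URL loading enabled, `load_dom_from('https://www.purplle.com/search?q=hair%20fall%20shamboo')` would return a document that can be passed to `new DOMXPath(...)`; a `null` result means the location could not be loaded.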