I am building a search engine and webcrawler using PHP, and i would like to detect the language of a website, how would i detect the language of a page by:
- Checking the URL for https://twitter.com/?lang=jap
if that is not set then i would like to: - Check the URL https://www.google.co.jp/
if i still can't find anything then i would to set default to English
the code i have so far for scraping pages is:
function crawl($url){
$html = file_get_html($url);
if($html && is_object($html) && isset($html->nodes)){
$weblinks[]=$url;
foreach($html->find('a') as $element) {
global $weblinks;
$link = $element->href;
$base_url = parse_url($url, PHP_URL_HOST);
if(substr($link,0,7)=="http://"){
$link = $link;
}else if(substr($link,0,8)=="https://"){
$link = $link;
}else if(substr($link,0,2)=="//"){
$link = substr($link, 2);
}else if(substr($link,0,1)=="#"){
$link = $html;
}else if(substr($link,0,7)=="mailto:"){
$link = "";
}else if(substr($link,0,11)=="javascript:"){
$link = "";
}else{
if(substr($link, 0, 1) != "/"){
$link = $base_url."/".$link;
}else{
$link = $base_url . $link;
}
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && $link != ""){
if(substr($url, 0, 8) == "https://"){
$link = "https://".$link;
}else{
$link = "http://".$link;
}
}
if(!in_array($link, $weblinks)){
$weblinks[]=$link;
}
}
$html->clear();
}else{
}
}
function info($weblinks){
foreach($weblinks as $link) {
$linkhtml = file_get_html("$link");
if($linkhtml && is_object($linkhtml) && isset($linkhtml->nodes)){
$titleraw = $linkhtml->find('title',0);
$title = $titleraw->innertext;
$des = $linkhtml->find("meta[name='description']",0)->content;
//detect language here
echo "<tr><td>".$title."</td><td>".$link."</td><td>".$des."</td></tr>";
$sql = mysql_query("INSERT into web once");
$title = "";
$des = "";
$linkhtml->clear();
}
}
}
To get the language from
?lang=
:You should then validate this with a switch/case statement and a list of languages that you support, like this:
After that, you can use this logic to determine whether you need to check the URL for the language:
Hopefully this will help get you started, although this is a big can of worms you're opening up. There are other things you need to check in order to be certain of the language of a website in addition to what you've mentioned. One of them is the language meta tag, which is like this:
<meta name="language" content="english">
and goes in the head of the webpage, though not all websites use it.Some multilingual websites, like mine, use a subdomain like
http://it.website.com
orhttp://fr.website.com
Others use query strings that are different from
?lang=
. So you'll need to do a significant amount of research to cover all your bases.