I have written some code in PHP that returns the HTML content from .edu domains. A brief introduction is given here: Errors regarding Web Crawler in PHP
The crawler works fine when the number of links to crawl is small (around 40 URLs), but I get a "MySQL server has gone away" error beyond that.
I am storing the HTML content as longtext in MySQL tables, and I do not understand why the error appears after 40-50 insertions.
Any help in this regard is highly appreciated.
Please note that I have already altered wait_timeout and max_allowed_packet to accommodate my queries and the PHP code, and now I don't know what to do. Please help me in this regard.
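A quick way to confirm that those settings actually apply to the connection the crawler uses is to read them back from the running server. A minimal sketch using the same mysql_* API that appears later in this thread (credentials and database name are placeholders):

$conn = mysql_connect('localhost', 'root', '');
mysql_select_db('crawler1', $conn);

// Read back the two server variables the crawler depends on
$res = mysql_query(
    "SHOW VARIABLES WHERE Variable_name IN ('wait_timeout', 'max_allowed_packet')",
    $conn
);
while ($row = mysql_fetch_assoc($res)) {
    echo $row['Variable_name'] . ' = ' . $row['Value'] . "\n";
}
mysql_close($conn);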
You might be inclined to handle this problem by "pinging" the mysql server before a query. This is a bad idea. For more on why, check this SO post: Should I ping mysql server before each query?
The best way to handle the issue is by wrapping queries inside try/catch
blocks and catching any database exceptions so that you can handle them appropriately. This is especially important in long running and/or daemon type scripts. So, here's a very basic example using a "connection manager" to control access to DB connections:
class DbPool {
    private $connections = array();

    function addConnection($id, $dsn) {
        $this->connections[$id] = array(
            'dsn'  => $dsn,
            'conn' => null
        );
    }

    function getConnection($id) {
        if (!isset($this->connections[$id])) {
            throw new Exception('Invalid DB connection requested');
        } elseif (isset($this->connections[$id]['conn'])) {
            return $this->connections[$id]['conn'];
        } else {
            try {
                // for mysql you need to supply user/pass as well
                $conn = new PDO($this->connections[$id]['dsn']);
                // Tell PDO to throw an exception on error
                // (like "MySQL server has gone away")
                $conn->setAttribute(
                    PDO::ATTR_ERRMODE,
                    PDO::ERRMODE_EXCEPTION
                );
                $this->connections[$id]['conn'] = $conn;
                return $conn;
            } catch (PDOException $e) {
                return false;
            }
        }
    }

    function close($id) {
        if (!isset($this->connections[$id])) {
            throw new Exception('Invalid DB connection requested');
        }
        $this->connections[$id]['conn'] = null;
    }
}
class Crawler {
    private $dbPool;

    function __construct(DbPool $dbPool) {
        $this->dbPool = $dbPool;
    }

    function crawl() {
        // crawl and store data in the $crawledData variable
        $this->saveData($crawledData);
    }

    function saveData($crawledData) {
        if (!$conn = $this->dbPool->getConnection('write_conn')) {
            // doh! couldn't retrieve DB connection ... handle it
        } else {
            try {
                // perform query on the $conn database connection
            } catch (Exception $e) {
                $msg = $e->getMessage();
                if (strstr($msg, 'MySQL server has gone away')) {
                    // drop the stale connection and retry the write once
                    $this->dbPool->close('write_conn');
                    $this->saveData($crawledData);
                } else {
                    // some other error occurred
                }
            }
        }
    }
}
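Wiring the two classes together could look roughly like the sketch below. The DSN and the 'write_conn' id are placeholders, and, as the comment in getConnection() notes, for MySQL the PDO constructor would also need user/password arguments:

// Rough usage sketch; DSN and connection id are placeholders
$dbPool = new DbPool();
$dbPool->addConnection('write_conn', 'mysql:host=localhost;dbname=crawler1');

$crawler = new Crawler($dbPool);
$crawler->crawl();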
I have another answer that deals with what I think is a similar problem, and it would require a similar answer. Basically, you can use the mysql_ping()
function to test the connection before your insert. Before MySQL 5.0.14, mysql_ping()
would automatically reconnect to the server, but now you have to build your own reconnect logic. Something similar to this should work for you:
function check_dbconn($connection) {
    if (!mysql_ping($connection)) {
        mysql_close($connection);
        $connection = mysql_connect('server', 'username', 'password');
        mysql_select_db('db', $connection);
    }
    return $connection;
}

foreach ($array as $value) {
    $dbconn = check_dbconn($dbconn);
    // escape the value before interpolating it into the query
    $sql = "insert into collected values('" . mysql_real_escape_string($value, $dbconn) . "')";
    $res = mysql_query($sql, $dbconn);
    //then some extra code.
}
I was facing the "MySQL server has gone away" error while using MySQL Connector 5.X; replacing the DLL with the latest version solved the problem.
Are you opening a single DB connection and reusing it? Is it possible that it's a simple timeout? You might be better served by opening a new DB connection for each of your read/write operations (i.e. contact .edu, get text, open DB, write text, close DB, repeat).
Also, how are you using the handle? Is it possible that it has hit an error and has 'gone away' for that reason?
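A minimal sketch of that per-operation pattern, using the same mysql_* API as elsewhere in the thread (credentials, database, table, and column names are placeholders):

// Open a fresh connection per write, then close it immediately
function store_page($url, $html) {
    $conn = mysql_connect('localhost', 'root', '');
    mysql_select_db('crawler1', $conn);

    $sql = "INSERT INTO html (sourceLink, sourceCode) VALUES ('"
         . mysql_real_escape_string($url, $conn) . "', '"
         . mysql_real_escape_string($html, $conn) . "')";
    $ok = mysql_query($sql, $conn);

    mysql_close($conn);
    return $ok;
}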
Well, this is what I am doing now based on rdlowrey's suggestion, and I guess this is also right.
public function url_db_html($sourceLink = NULL, $source) {
    $source = mysql_real_escape_string($source);
    $query = "INSERT INTO html (id, sourceLink, sourceCode)
              VALUES (NULL, ('$sourceLink'), ('$source'))";
    try {
        if (mysql_query($query, $this->connection) == FALSE) {
            $msg = mysql_errno($this->connection) . ": " . mysql_error($this->connection);
            throw new DbException($msg);
        }
    } catch (DbException $e) {
        echo "<br><br>Caught!!!<br><br>";
        if (strstr($e->getMessage(), 'MySQL server has gone away')) {
            $this->connection = mysql_connect("localhost", "root", "");
            mysql_select_db("crawler1", $this->connection);
        }
    }
}
So once a query has failed to execute, the script will skip it but will make sure the connection is re-established.
However, my web crawler crashes when it encounters files such as .jpg, .bmp, .pdf, etc. Is there a way to skip URLs containing these extensions? I am using preg_match and have specified pdf and doc to match, yet I want the function to skip all links containing extensions such as mp3, pdf, etc. Is this possible?
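A minimal sketch of such a filter with preg_match (the extension list is illustrative and would need extending for other file types):

// Skip URLs whose path ends in a binary/media extension
function should_skip_url($url) {
    return preg_match('/\.(jpe?g|bmp|gif|png|pdf|docx?|mp3|zip)$/i', $url) === 1;
}

// usage inside the crawl loop
if (should_skip_url($url)) {
    continue; // don't fetch or store this link
}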