可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am running many instances of a webcrawler in parallel.

Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling the domain.

Other parallel crawlers check the log table to see what domains are already being crawled before selecting their own domain to crawl.

I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database from all other read/writes while one crawler selects a domain and inserts a row in the log table (two queries).

How the heck does one do this? I'm afraid this is terribly complex and relies on many other things. Please help get me started.

This code seems like a good solution (see the error below, however):

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT companies.id FROM companies
        LEFT OUTER JOIN crawlLog
        ON companies.id = crawlLog.companyId
        WHERE crawlLog.companyId IS NULL
        LIMIT 1
    ),
    now()
)

but I keep getting the following mysql error:

You can't specify target table 'crawlLog' for update in FROM clause

Is there a way to accomplish the same thing without this problem? I've tried a couple different ways. Including this:

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT id
        FROM companies
        WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
    ),
    now()
)

回答1:

You can lock tables using the MySQL LOCK TABLES command like this:

LOCK TABLES tablename WRITE;

# Do other queries here

UNLOCK TABLES;

See:

http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html

回答2:

You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."

Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:

START TRANSACTION;
SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
INSERT INTO loggingtable SET url=@URL;
COMMIT;

Or something close to that.

[edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:

INSERT INTO loggingtable (url) SELECT url FROM tablewithurls u LEFT JOIN loggingtable l ON l.url=t.url WHERE {some criterion used to pick the url to work on} AND l.url IS NULL.

回答3:

Well, table locks are one way to deal with that; but this makes parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.

BEGIN;

SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE

# do whatever you have to do

COMMIT;

Please note that you will need an index on domainname (or whatever column you use in the WHERE-clause) for this to work, but this makes sense in general and I assume you will have that anyway.

回答4:

I wouldn't use locking, or transactions.

The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.

Assume you have tblcrawels (cra_id) that is filled with your crawlers and tblurl (url_id) that is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) for your logfile.

You would run the following query if crawler 1 wants to start crawling url 2:

INSERT INTO tbllogging (log_cra_id, log_url_id) 
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging on url_id=log_url 
WHERE url_id=2 AND log_url_id IS NULL;

The next step is to check whether this record has been inserted.

SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1

If you get any results then crawler 1 can crawl this url. If you don't get any results this means that another crawler has inserted in the same line and is already crawling.

回答5:

I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.

INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, now()
FROM companies
WHERE id NOT IN
(
    SELECT companyId
    FROM crawlLog AS crawlLogAlias
)
LIMIT 1