I'm pretty new to Python and I'm working on a web scraping project using the Scrapy library. I'm not using the built-in domain restriction because I want to check whether any of the links to pages outside the domain are dead. However, I still want to treat pages within the domain differently from those outside it, so I'm trying to determine manually whether a page is within the domain before parsing the response.
Response URL:
http://www.siteSection1.domainName.com
If Statement:
    if 'domainName.com' and ('siteSection1' or 'siteSection2' or 'siteSection3') in response.url:
        parsePageInDomain()
The above statement is true (the page is parsed) if 'siteSection1' appears first in the chain of or's, but the page is not parsed if the response URL is the same and the if statement is the following:
    if 'domainName.com' and ('siteSection2' or 'siteSection1' or 'siteSection3') in response.url:
        parsePageInDomain()
What am I doing wrong here? I haven't been able to think through what is going on with the logical operators very clearly and any guidance would be greatly appreciated. Thanks!
`or` doesn't work that way. Try `any`.

What's going on here is that `or` returns a logical or of its two arguments - `x or y` returns `x` if `x` evaluates to `True`, which for a string means it's not empty, or `y` if `x` does not evaluate to `True`. So `('siteSection1' or 'siteSection2' or 'siteSection3')` evaluates to `'siteSection1'` because `'siteSection1'` is `True` when considered as a boolean.

Moreover, you're also using `and` to combine your criteria. `and` returns its first argument if that argument evaluates to `False`, or its second if the first argument evaluates to `True`. Therefore, `if x and y in z` does not test to see whether both `x` and `y` are in `z`. `in` has higher precedence than `and` - and I had to look that up - so that tests `if x and (y in z)`. Again, `'domainName.com'` evaluates as `True`, so this will return just `y in z`.

`any`, conversely, is a built-in function that takes an iterable of booleans and returns `True` or `False` - `True` if any of them are `True`, `False` otherwise. It stops its work as soon as it hits a `True` value, so it's efficient. I'm using a generator expression to tell it to keep checking your three different possible strings to see if any of them are in your response url.
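
Here is a small sketch of the behaviour described above, with a plain string standing in for Scrapy's `response.url`. The names `SECTIONS` and `is_in_domain` are made up for illustration, and the `any` check is one plausible way to write the fix, not necessarily the exact code originally posted.

```python
# Stand-in for response.url from the question (no Scrapy needed here).
url = "http://www.siteSection1.domainName.com"

# `or` returns its first truthy operand, so only one string survives.
first = 'siteSection1' or 'siteSection2' or 'siteSection3'
second = 'siteSection2' or 'siteSection1' or 'siteSection3'
print(first)   # siteSection1
print(second)  # siteSection2

# `in` binds tighter than `and`, and a non-empty string is truthy,
# so each original condition reduces to a single membership test.
original_a = 'domainName.com' and first in url
original_b = 'domainName.com' and second in url
print(original_a)  # True
print(original_b)  # False

# Hypothetical names, not from the original post.
SECTIONS = ('siteSection1', 'siteSection2', 'siteSection3')

def is_in_domain(page_url):
    """Return True if the URL contains the domain and any known section."""
    return 'domainName.com' in page_url and any(
        section in page_url for section in SECTIONS
    )

print(is_in_domain(url))                                       # True
print(is_in_domain("http://www.siteSection2.domainName.com"))  # True
print(is_in_domain("http://www.example.com/siteSection1"))     # False
```

In the spider, the condition would then presumably read `if is_in_domain(response.url):` before the call to `parsePageInDomain()`. Substring matching is kept here because that is what the question uses; it would also match a URL that merely contains those strings somewhere in its path.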