Can't differentiate the two expressions suppos

Posted 2019-08-30 05:04

Question:

A few days ago I created this post to ask how I could make my script loop over a few links and check, up to four times per link, whether my defined title (supposed to be extracted from each link) is empty. If the title is still empty after four tries, the script should break out of the loop and move on to the next link to repeat the same process.

This is how I got it working: by changing fetch_data(link) to return fetch_data(link), and by setting counter=0 outside the while loop but inside the if statement.

Rectified script:

import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4"
]
counter = 0

def fetch_data(link):
    global counter
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        title = soup.select_one("p.tcode").text
    except AttributeError:
        title = ""

    if not title:
        while counter<=3:
            time.sleep(1)
            print("trying {} times".format(counter))
            counter += 1
            return fetch_data(link) #First fix
        counter=0 #Second fix

    print("tried with this link:",link)

if __name__ == '__main__':
    for link in links:
        fetch_data(link)

This is the output the above script produces (as desired):

trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4

I deliberately used a wrong selector in the script so that it always meets the condition I defined above.

Why should I use return fetch_data(link) instead of fetch_data(link), given that the two expressions behave identically most of the time?

Answer 1:

The while loop inside your function makes a recursive call whenever it fails to fetch the title. It works with return fetch_data(link) because, as long as the counter is less than or equal to 3 (while counter<=3), each call leaves the function as soon as its recursive call returns. The loop body therefore runs at most once per call, and execution never falls through to the line that resets the counter (counter=0). Since the counter is a global variable that increases by 1 at each recursion level, there are at most four recursive calls: once the counter is larger than 3, the call skips the while loop, so it does not call fetch_data(link) again. That deepest call is the only one that resets the counter and prints the URL.

fetch_data (counter=0)
  --> fetch_data (counter=1)
    --> fetch_data (counter=2)
      --> fetch_data (counter=3)
        --> fetch_data (counter=4) 
        - does not enter the while loop, resets counter, prints url
        - return to above function
      - return to above function
    - return to above function
  - return to above function
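
The trace above can be reproduced without any network access. The following is a minimal sketch, not the original script: it assumes the selector never matches (so title is always empty), drops the requests/BeautifulSoup calls and the sleep, and adds two hypothetical helpers, a depth parameter for indentation and a calls counter.

```python
counter = 0  # global retry counter, as in the question
calls = 0    # hypothetical helper: number of fetch_data invocations per link


def fetch_data(link, depth=0):
    global counter, calls
    calls += 1
    title = ""  # assumption: the selector never matches, so the title is empty

    if not title:
        while counter <= 3:
            print("  " * depth + "trying {} times".format(counter))
            counter += 1
            # 'return' leaves this frame as soon as the recursive call finishes,
            # so the loop body runs at most once per call.
            return fetch_data(link, depth + 1)
        counter = 0  # reached only in the deepest call, where counter is 4

    print("  " * depth + "tried with this link: " + link)


for link in ["page-2", "page-3"]:  # placeholder names, not real URLs
    calls = 0
    fetch_data(link)
    print("calls for {}: {}, counter afterwards: {}".format(link, calls, counter))
```

Each link should report five calls (the first attempt plus four retries) and leave counter at 0, which is why the next link starts again from "trying 0 times".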

If you use fetch_data(link) without return, the function still makes a recursive call inside the while loop, but it does not exit afterwards. Once the counter reaches 4, the deepest call skips the while loop and resets the counter to 0. This is dangerous: control then returns to the previous call, which is still inside its while loop, and because the counter is now 0 (which is <= 3) that loop does not end and keeps making further recursive calls. Each round goes deeper than the one before, so the program eventually exceeds the maximum recursion depth and crashes.

fetch_data (counter=0)
  --> fetch_data (counter=1)
    --> fetch_data (counter=2)
      --> fetch_data (counter=3)
        --> fetch_data (counter=4) 
        - does not enter the while loop, !!!resets counter!!!, prints url
        - returns to the calling function
      - does not return to its caller yet
      - since counter = 0, continues the while loop
        --> fetch_data (counter=1)
          --> fetch_data (counter=2)
            --> fetch_data (counter=3)
...
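
The runaway recursion can also be checked offline. This is a rough sketch under the same assumptions as the question's test setup (the title is always empty); the network calls, the sleep and the prints are removed so that it finishes quickly, and max_depth and run_without_return are hypothetical helpers added only for the demonstration.

```python
counter = 0    # global retry counter, as in the question
max_depth = 0  # hypothetical helper: deepest recursion level reached


def fetch_data(link, depth=0):
    global counter, max_depth
    max_depth = max(max_depth, depth)
    title = ""  # assumption: the selector never matches, so the title is empty

    if not title:
        while counter <= 3:
            counter += 1
            # No 'return': when this call comes back, the loop condition is
            # checked again, and by then a deeper call has reset counter to 0.
            fetch_data(link, depth + 1)
        counter = 0  # runs in every frame that leaves the loop


def run_without_return():
    """Return True if the variant without 'return' hits the recursion limit."""
    try:
        fetch_data("page-2")  # placeholder name, not a real URL
    except RecursionError:
        return True
    return False


overflowed = run_without_return()
print("RecursionError raised:", overflowed)
print("deepest level reached:", max_depth)
```

With the one-second sleep of the original script the crash would take many minutes to appear, which may be why the two forms seem to behave identically in short runs. A plain for loop over a fixed number of attempts would avoid both the global counter and the recursion, though that is a different design from the one asked about.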