I am learning Scrapy, a Python framework, to scrape a few web pages I am interested in. Along the way I extract the links in a page, but in most cases those links are relative. I used urljoin_rfc, which lives in scrapy.utils.url, to get the absolute URL. It worked fine.
While learning I came across a feature called Item Loaders. Now I want to do the same thing using an Item Loader. My urljoin_rfc() call is in a user-defined function, _urljoin(url, response). I want my loader to use that function now, so in my loader class I set link_in = _urljoin(). To make that work I changed the declaration to _urljoin(url, response=loader_context.response).
But I get an error saying NameError: name 'loader_context' is not defined
I need help here. I am doing it this way because _urljoin() is called not only while loading but from other parts of my code too. If I am doing something terribly wrong, please point it out.
If you're using _urljoin(url, response) elsewhere, you can keep it as it is, accepting a response as the 2nd argument.
Now, processors for Item Loaders can accept a context, but the context is a dict of arbitrary key/values which is shared among all input and output processors (from the docs). It is not a global name, which is why using loader_context in a default argument raises the NameError.
So you could have a wrapping function that takes loader_context as a parameter and calls your _urljoin(url, response) with the response stored in that context. Then use the wrapper as the input processor in your ItemLoader definition, and finally, in your callback code, pass the response reference when you instantiate your ItemLoader.