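Changes to the parameters of user-defined methods
After upgrading Scrapy to 2.14.0, the signatures of the process_request(), process_response() and process_exception() methods of custom downloader middlewares change: the spider parameter is deprecated and should be dropped. The Scrapy release notes say: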
- For the following user-defined functions and methods requiring a spider argument is deprecated, if you need a Spider instance inside them you should get it from the Crawler instance (you may need to refactor your code to save that instance in e.g. the from_crawler() method): (issue 6927, issue 6984, issue 7006, issue 7037)
  - the process_request(), process_response() and process_exception() methods of custom downloader middlewares
  - the process_spider_input(), process_spider_output(), process_spider_output_async() and process_spider_exception() methods of custom spider middlewares
  - the process_item() method of custom pipelines
  - the fetch() method of a custom DOWNLOADER
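When migrating a larger project, it can help to find out which classes still declare a spider parameter on the methods listed above. The following is a minimal sketch using only the standard library; the helper name methods_with_spider_arg and the sample class OldStyleMiddleware are hypothetical and not part of Scrapy, and the check only looks at parameter names.

```python
import inspect

# Method names taken from the release-note list above.
AFFECTED_METHODS = (
    "process_request",
    "process_response",
    "process_exception",
    "process_spider_input",
    "process_spider_output",
    "process_spider_output_async",
    "process_spider_exception",
    "process_item",
    "fetch",
)


def methods_with_spider_arg(cls):
    """Return the affected method names on cls that still declare `spider`.

    Hypothetical helper for illustration; it is not a Scrapy API.
    """
    found = []
    for name in AFFECTED_METHODS:
        method = getattr(cls, name, None)
        if callable(method) and "spider" in inspect.signature(method).parameters:
            found.append(name)
    return found


class OldStyleMiddleware:
    """Sample class: one old-style method, one already migrated."""

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response):
        return response


print(methods_with_spider_arg(OldStyleMiddleware))
```

Running the helper over each of your own middleware and pipeline classes gives a short to-do list of signatures to update.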
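Because of this, the way variables are passed from the spider file spider.py to a downloader middleware changes as well.
The old way of passing variables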
```python
# Spider: spider.py
import scrapy


class SimpleSpider(scrapy.Spider):
    name = "simple"  # built-in attribute, overrides the default
    allowed_domains = ["simple.com"]  # built-in attribute, overrides the default
    start_urls = ["https://simple.com"]  # built-in attribute, overrides the default
    custom_settings = {  # built-in attribute, empty by default
        'var1': 1,
        'var2': 2,
        'var3': 3,
    }
    html_downloaded = 0  # additional custom attribute

    def parse(self, response):
        pass
```
```python
# Middleware: middlewares.py
from scrapy import signals


class ADownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Read the value of var1 from custom_settings
        var1 = spider.settings.attributes['var1'].value
        # Update the custom attribute
        spider.html_downloaded = spider.html_downloaded + 1
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
```
Through its spider parameter, the process_request method can use the variables defined in spider.py.
The new way of passing variables
```python
# Downloader middleware class in middlewares.py
from scrapy import signals


class ADownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # New __init__ method
    def __init__(self, crawler):
        # New instance attribute self.crawler, set to the crawler
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        # cls is the ADownloaderMiddleware class itself. Instantiating it
        # here calls __init__ above, which is how the crawler argument of
        # from_crawler reaches __init__.
        s = cls(crawler)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Read the value of var1 from custom_settings
        var1 = self.crawler.spider.settings.attributes['var1'].value
        # Update the custom attribute
        self.crawler.spider.html_downloaded = self.crawler.spider.html_downloaded + 1
        return None

    def process_response(self, request, response):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
```
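- Add an __init__ method that stores the crawler on the instance.
- Change the instantiation in from_crawler to s = cls(crawler).
- Remove the spider parameter from the process_request, process_response and process_exception methods.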
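The release notes list process_item() of custom pipelines among the affected methods, so the same three changes should carry over to an item pipeline. The following is a minimal sketch under that assumption: CountingPipeline and the items_seen counter are hypothetical names, and the SimpleNamespace objects are stand-ins used only to exercise the class outside a crawl, not Scrapy classes.

```python
from types import SimpleNamespace


class CountingPipeline:
    """Item pipeline that reaches the spider through the stored crawler."""

    def __init__(self, crawler):
        # Keep the crawler so the spider can be looked up later
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Pass the crawler on to __init__
        return cls(crawler)

    def process_item(self, item):  # no `spider` parameter
        spider = self.crawler.spider
        # Update a custom counter attribute on the spider
        spider.items_seen = getattr(spider, "items_seen", 0) + 1
        return item


# Stand-ins for a quick check outside a crawl (assumption: only the
# `spider` attribute of the crawler is needed by this pipeline)
fake_spider = SimpleNamespace(name="simple")
fake_crawler = SimpleNamespace(spider=fake_spider)

pipeline = CountingPipeline.from_crawler(fake_crawler)
pipeline.process_item({"title": "a"})
pipeline.process_item({"title": "b"})
print(fake_spider.items_seen)
```

In a real project Scrapy would call from_crawler() itself and supply its own Crawler object, so the stand-in lines at the bottom would not be needed.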