wking's blog


Passing spider variables to a downloader middleware after upgrading to Scrapy 2.14.0

2026-05-21

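Changes to the parameters of user-defined methods

As of Scrapy 2.14.0, the signatures of the process_request(), process_response() and process_exception() methods of custom downloader middlewares have changed: the spider parameter is deprecated, so these methods should no longer declare it. The release notes say: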

  • For the following user-defined functions and methods requiring a spider argument is deprecated, if you need a Spider instance inside them you should get it from the Crawler instance (you may need to refactor your code to save that instance in e.g. the from_crawler() method):
    • the process_request(), process_response() and process_exception() methods of custom downloader middlewares
    • the process_spider_input(), process_spider_output(), process_spider_output_async() and process_spider_exception() methods of custom spider middlewares
    • the process_item() method of custom pipelines
    • the fetch() method of a custom DOWNLOADER
    (issue 6927, issue 6984, issue 7006, issue 7037)
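
The same pattern applies to the other components in that list. As a minimal sketch, assuming a hypothetical pipeline named `CountingPipeline` and a hypothetical spider attribute `items_seen` (neither comes from Scrapy or from this post), a pipeline could keep the crawler it receives in `from_crawler()` and reach the spider through `crawler.spider`:

```python
class CountingPipeline:
    """Hypothetical item pipeline; the class name and `items_seen` are illustrative."""

    def __init__(self, crawler):
        # Keep the Crawler so the running spider can be reached later.
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline and passes the Crawler in.
        return cls(crawler)

    def process_item(self, item):
        # No spider parameter: fetch the spider from the stored crawler.
        spider = self.crawler.spider
        # `items_seen` is an assumed custom attribute, created on first use.
        spider.items_seen = getattr(spider, "items_seen", 0) + 1
        return item
```

The pipeline itself holds no spider reference; it only asks the crawler for one when an item arrives.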

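As a result, the way variables are passed from the spider file spider.py to a downloader middleware has to change.

The old way of passing variables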

spider.py
# Spider: spider.py
import scrapy


class SimpleSpider(scrapy.Spider):
    name = "simple"  # built-in attribute, overrides the default
    allowed_domains = ["simple.com"]  # built-in attribute, overrides the default
    start_urls = ["https://simple.com"]  # built-in attribute, overrides the default
    custom_settings = {  # built-in attribute, empty by default
        "var1": 1,
        "var2": 2,
        "var3": 3,
    }

    html_downloaded = 0  # additional custom attribute

    def parse(self, response):
        pass

middlewares.py
# Middleware: middlewares.py
from scrapy import signals


class ADownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Read the value of var1 from custom_settings
        var1 = spider.settings.get("var1")

        # Update the custom attribute
        spider.html_downloaded += 1

        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)

Through its spider parameter, process_request can read and update the variables defined in spider.py.

The new way of passing variables

middlewares.py
# Downloader middleware class in middlewares.py
from scrapy import signals


class ADownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # New __init__ method
    def __init__(self, crawler):
        # New instance attribute self.crawler, holding the Crawler
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.

        # cls is the ADownloaderMiddleware class itself. Instantiating it
        # here calls the __init__ method above, which is how the crawler
        # argument of from_crawler reaches __init__.
        s = cls(crawler)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Read the value of var1 from custom_settings
        var1 = self.crawler.spider.settings.get("var1")

        # Update the custom attribute
        self.crawler.spider.html_downloaded += 1
        return None

    def process_response(self, request, response):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        # Signal handlers still receive the spider as an argument.
        spider.logger.info("Spider opened: %s" % spider.name)

Add an __init__ method that stores the crawler.

Change the instantiation in from_crawler to s = cls(crawler).

Remove the spider parameter from the process_request, process_response and process_exception methods.
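
These three changes can be checked without starting a crawl. The sketch below is a stand-in built on assumptions, not Scrapy's own test tooling: it uses `types.SimpleNamespace` objects in place of the real Crawler, spider and request, and a trimmed-down hypothetical middleware `CountingDownloaderMiddleware` that only counts requests and skips the signal hookup.

```python
from types import SimpleNamespace


class CountingDownloaderMiddleware:
    """Hypothetical, reduced version of the middleware above."""

    def __init__(self, crawler):
        # Step 1: __init__ stores the crawler.
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Step 2: pass the crawler through to __init__.
        return cls(crawler)

    def process_request(self, request):
        # Step 3: no spider parameter; use the stored crawler instead.
        spider = self.crawler.spider
        spider.html_downloaded += 1
        return None


# Stand-ins for the real spider and Crawler (assumptions for this sketch).
fake_spider = SimpleNamespace(name="simple", html_downloaded=0)
fake_crawler = SimpleNamespace(spider=fake_spider)

mw = CountingDownloaderMiddleware.from_crawler(fake_crawler)
for url in ("https://simple.com/a", "https://simple.com/b"):
    mw.process_request(SimpleNamespace(url=url))

print(fake_spider.html_downloaded)  # 2
```

Because the middleware only touches `crawler.spider`, any object exposing that attribute is enough to exercise it in isolation.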

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Last updated: 2026-05-21

wking


COPYRIGHT © 2010-2025 wkings.blog. ALL RIGHTS RESERVED.
