First Taste of Scrapy
Scrapy is a useful Python screen-scraping and web-crawling framework. This article assumes you already know what Scrapy is and what it can do. If not, you should first read the documentation on Scrapy's website, http://doc.scrapy.org. If you really don't want to read all of it, reading the sections listed below should be enough to follow what I am going to say.
Basic Concepts http://doc.scrapy.org/en/0.16/#basic-concepts
You should be familiar with Items, Spiders, Selectors, and Item Pipelines. Items are data structures; Spiders define how to crawl a website; Selectors are the tools you use to parse web pages and generate Items; and Item Pipelines are how you persist the data you collect.
Extending Scrapy http://doc.scrapy.org/en/0.16/#extending-scrapy
Architecture Overview, Downloader Middleware, and Spider Middleware are the three most important sections to read if you want to hack on Scrapy and add features to it.
OK, let's jump into the journey of exploring Scrapy.
I drew a picture to help me explain how Scrapy handles a crawl task. You can see Item, Spider, Selector, and Pipeline in the picture.
I will start with the Spider. The Spider is where you decide which websites/URLs you want to crawl and how you parse their pages. A Spider has an important method, start_requests, which generates Request objects; each Request can specify a URL, request headers, and a callback method. When Scrapy fetches a web page for a Request, it passes the response to the callback specified in that Request. The callback can generate an Item from the page content, or generate another Request object. A Selector is used to parse the web page and populate the Item. An Item may have several fields, like database columns. Once Items are created, they can be passed to a Pipeline to be persisted in a database, or to a Feed Export to be saved to the filesystem.
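The loop described above can be sketched in plain Python (a toy model of the flow, not Scrapy's actual code; the Request class and the fetch function here are simplified stand-ins):

```python
from collections import deque

# Simplified stand-in for Scrapy's Request: a URL plus a callback.
class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

def run(start_requests, fetch):
    """Drive the crawl: fetch each request, then let its callback
    yield either items (plain dicts here) or follow-up Requests."""
    queue = deque(start_requests)
    items = []
    while queue:
        request = queue.popleft()
        response = fetch(request.url)          # "download" the page
        for result in request.callback(response):
            if isinstance(result, Request):    # follow-up request: schedule it
                queue.append(result)
            else:                              # scraped item: collect it
                items.append(result)
    return items
```

In real Scrapy the engine drives this loop asynchronously on top of Twisted, and the middlewares described below sit between each of these steps.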
Understanding Scrapy’s architecture
To explain what Scrapy's architecture looks like, I have removed most of the details and kept only the significant components, to simplify the problem.
In the picture above there are two components I haven't mentioned before: the Crawler and the Engine. Every time we start a Scrapy task, we start a Crawler to run it, and the Crawler has an Engine to drive its flow. When the Crawler starts, it takes a Spider from its queue, which means a Crawler can have more than one Spider. The Spider is started by the Crawler and scheduled by the Engine to crawl web pages. To drive the Crawler's flow, the Engine runs many middlewares, organized in chains that process requests and responses. There are three different kinds of middleware: SpiderMiddleware, DownloaderMiddleware, and SchedulerMiddleware.
- SpiderMiddleware handles the input and output of the Spider. Input means the pages crawled by the Spider, while output means the Items and Requests the Spider generates. You can add your own spider middleware in the settings file with the SPIDER_MIDDLEWARES setting.
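A minimal sketch of what such a middleware can look like (the class and its counter are hypothetical; real spider middlewares implement these hook methods and are registered under SPIDER_MIDDLEWARES):

```python
# A hypothetical spider middleware that counts everything the spider yields.
# Register it in settings.py, e.g.:
#   SPIDER_MIDDLEWARES = {'myproject.middlewares.CountingSpiderMiddleware': 543}
class CountingSpiderMiddleware:
    def __init__(self):
        self.seen = 0

    def process_spider_input(self, response, spider):
        # Called for each response going into the spider;
        # returning None continues processing normally.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with whatever the spider yields (Items or Requests);
        # must return an iterable of the same.
        for r in result:
            self.seen += 1
            yield r
```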
- DownloaderMiddleware monitors the download process, which includes sending the request and fetching the response. The DOWNLOADER_MIDDLEWARES setting is used to configure downloader middleware.
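A sketch of a downloader middleware that tags every outgoing request (the class and header name are invented for the example; real middlewares are enabled through DOWNLOADER_MIDDLEWARES):

```python
# A hypothetical downloader middleware adding a custom header to every
# request before it is downloaded. Enable it in settings.py, e.g.:
#   DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.TaggingDownloaderMiddleware': 500}
class TaggingDownloaderMiddleware:
    def process_request(self, request, spider):
        # Returning None lets the request continue down the chain.
        request.headers['X-Crawl-Tag'] = spider.name
        return None

    def process_response(self, request, response, spider):
        # Pass the response back up the chain unchanged.
        return response
```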
- SchedulerMiddleware decides whether a request should be scheduled at all. A useful example: Scrapy's default settings include a DuplicatesFilterMiddleware that sweeps out duplicate requests generated by the spider. The SCHEDULER_MIDDLEWARES setting is used for this.
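The idea behind the duplicates filter is simple enough to sketch (a toy version; Scrapy's real filter fingerprints requests rather than comparing raw URLs):

```python
# Toy duplicates filter: drop any request whose URL was already seen.
class DuplicatesFilter:
    def __init__(self):
        self.seen = set()

    def should_schedule(self, url):
        if url in self.seen:
            return False        # duplicate: sweep it out
        self.seen.add(url)
        return True             # first time: let it through
```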
Another tip for understanding middleware: middlewares are executed according to priorities. We assign a number to each middleware in the settings file. Scrapy's documentation says the greater the number, the closer the middleware is to the downloader. But I'd like to remember it another way: when handling a request, the greater the number, the later the middleware will be executed; and when handling a response or exception, the greater the number, the sooner the middleware will be executed.
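That mnemonic can be checked with a toy chain (plain Python, just to demonstrate the ordering; the classes and priority numbers are invented):

```python
# Toy middleware chain demonstrating execution order by priority.
class LoggingMiddleware:
    def __init__(self, priority, log):
        self.priority = priority
        self.log = log

    def process_request(self, request):
        self.log.append(('request', self.priority))

    def process_response(self, response):
        self.log.append(('response', self.priority))

def run_chain(middlewares, request):
    # Requests flow toward the downloader: ascending priority.
    chain = sorted(middlewares, key=lambda m: m.priority)
    for m in chain:
        m.process_request(request)
    response = 'response-to-' + request   # pretend download happens here
    # Responses flow back toward the engine: descending priority.
    for m in reversed(chain):
        m.process_response(response)
    return response
```

Running a chain with priorities 900, 100, and 500 shows requests hitting 100 first and 900 last, and responses coming back in the reverse order.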
Lessons I have learned
1) HTTPS through proxies is not supported
If you try to scrape HTTPS URLs through proxies, you probably won't succeed, because Scrapy doesn't support it yet. But there are two ways to work around it:
- a) apply a patch to Scrapy: https://github.com/scrapy/scrapy/pull/45
- b) write your own downloader client factory to support this feature, and set it in settings with DOWNLOADER_HTTPCLIENTFACTORY.
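If you take route b), the wiring looks roughly like this (a settings sketch; the factory class path is hypothetical, so substitute the class you actually write):

```python
# settings.py -- point Scrapy at your own HTTP client factory.
# 'myproject.httpclient.ProxyAwareClientFactory' is a made-up path.
DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.httpclient.ProxyAwareClientFactory'
```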
2) Three kinds of Settings
There are three kinds of Settings in Scrapy: Settings, CrawlerSettings, and SpiderSettings. In some cases we use the overrides settings to override some global settings. Keep in mind that SpiderSettings doesn't have an overrides property, and you should not hack the spider's settings, because when the spider is bound to a crawler, the spider generates its own settings from the crawler's.
3) Don't mix all the logic into one spider; write more spiders and have each do one simple thing
When we have one huge spider that does everything, it turns out to be difficult to maintain: when a change happens, its effects spill everywhere.
4) Middlewares and Pipelines
When we add a custom middleware or pipeline, it may be better to treat all spiders and items the same way than to fill the middleware or pipeline with if clauses. If the middleware or pipeline treats each spider differently, it becomes hard to extend when we add a new spider or item. Letting the differences stay closer to their origin seems better.
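For example, a pipeline can stay spider-agnostic by persisting every item through the same code path (a sketch; the class is hypothetical and the "storage" is just an in-memory list here):

```python
# A hypothetical pipeline that treats every item identically,
# instead of branching on spider name or item type.
class CollectPipeline:
    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        # Same code path for every spider and every item type.
        self.stored.append(dict(item))
        return item
```

Any per-spider quirks then live in the spider that produces the item, not in the shared pipeline.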