Short tutorial: crawling websites in Python with Scrapy (2024)
Introduction
I’ve spent a good chunk of my coding life figuring out the best ways to collect and use data from the web. It’s an adventure that started out with simple scripts and has since evolved into more complex projects involving web crawling frameworks like Scrapy. Through trial and error, I’ve learned not only how to extract the data I need but also the best practices for doing so respectfully (i.e. respecting webmasters’ wishes, for instance by inspecting a site’s robots.txt) and efficiently.
Web Crawling and Scrapy
Gathering data from websites is a common task for many projects, such as data analysis, machine learning, or keeping an eye on competitors’ pricing. This is where web crawling comes into play. Web crawling, put simply, is the process of programmatically traversing the web and collecting data from websites. While you could do this manually, for larger scale projects it would be like trying to fill a swimming pool with a teaspoon. And when you’re ready to move on from basic scripts using requests and BeautifulSoup in Python, Scrapy is the tool that takes things to the next level.
Scrapy, an open-source web crawling framework created by Scrapinghub (now Zyte), is powerful yet user-friendly. It’s designed to handle the heavy lifting of web scraping, leaving you, the developer, to focus on the intricacies of the data you’re interested in. You can check out the Scrapy GitHub repository (https://github.com/scrapy/scrapy) to see how actively it’s maintained and improved by the community.
Here’s a quick insight into getting started with Scrapy. Install Scrapy using pip (make sure you’re working within a virtual environment for any Python development):
pip install scrapy
Afterwards, you can begin by creating a new Scrapy project where you’ll do all of your crawling work.
scrapy startproject myproject
My first encounter with Scrapy was when I needed to extract data from a site that listed local events. To capture that data, I used Spiders, which are classes that you define and that Scrapy uses to scrape information from websites. You have immense control over what gets fetched.
Creating a Spider involves subclassing scrapy.Spider
and defining the initial requests to make. As a brief example, below is how you’d start to scrape quotes from ‘http://quotes.toscrape.com’:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }
Once you have your Spider, you tell Scrapy to crawl using your spider:
scrapy crawl quotes
Scrapy is efficient because it’s asynchronous; it handles requests in a non-blocking fashion, allowing you to make multiple requests concurrently, which is a huge time-saver. However, respect websites’ robots.txt
policies and their servers’ capacities—you don’t want to be the one crashing someone’s website with your bot.
A good tip for anyone crawling for the first time: scraping and crawling responsibly is more than just good manners—it’s essential for ensuring your spiders aren’t blocked or, worse, the source of legal issues. Research a site’s rules before you set your spiders loose on it.
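Scrapy lets you codify much of that good behaviour in settings.py. Here’s a conservative baseline I tend to start from; the values are only illustrative and should be tuned for the site you’re crawling, but ROBOTSTXT_OBEY, DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN and AUTOTHROTTLE_ENABLED are all standard Scrapy settings:
# settings.py: a polite baseline (values are illustrative)
ROBOTSTXT_OBEY = True                # honour the site's robots.txt rules
DOWNLOAD_DELAY = 1.0                 # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # keep per-domain concurrency modest
AUTOTHROTTLE_ENABLED = True          # back off automatically if the server slows down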
Through Scrapy, you can also easily extract data into various formats, such as CSV, JSON, or XML. The framework includes a powerful way to follow links within pages to crawl the entire site. But remember, we aren’t covering that part here; that’s for you to explore in the ‘Following Links and Recursive Crawling’ section.
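As a quick taste of those feed exports, the crawl command accepts an output flag. A brief sketch (the filenames are just examples; the overwriting -O flag needs a reasonably recent Scrapy release, while older versions only offer the appending -o):
# write scraped items straight to a file; the format is inferred from the extension
scrapy crawl quotes -O quotes.json
scrapy crawl quotes -O quotes.csv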
Starting with Scrapy might be overwhelming at first, but once you get the hang of it, you’ll likely find it’s one of the most robust and rewarding web crawling frameworks out there. For any beginner, patience will be a valuable virtue as you navigate through the practicalities of web scraping.
Remember, this guide is just your launching pad. There’s a whole community out there creating and sharing inspirational Scrapy projects. Good luck on your scraping journey!
Setting Up Your Development Environment
Setting up your development environment is a critical first step in your journey into web crawling with Scrapy. For beginners, it can feel daunting, but by breaking it down into bite-sized steps, you’ll find it to be straightforward. I’ve been through this process many times, so I’ll guide you through it.
Step 1: Install Python
Scrapy runs on Python, so that’s our starting point. If you don’t have Python installed, download the latest version from the official Python website (https://www.python.org/downloads/). Make sure to tick the box that says ‘Add Python to PATH’ during installation to avoid some common pitfalls.
# You can check Python installation using
python --version
Step 2: Set Up a Virtual Environment
Using a virtual environment keeps dependencies required by different projects separate. Trust me, it’s a lifesaver. I prefer using venv
, which comes bundled with Python.
# Create a new virtual environment
python -m venv scrapy_env
# Activate the environment
source scrapy_env/bin/activate # On Unix or MacOS
scrapy_env\Scripts\activate # On Windows
# Your command prompt should now reflect the active environment
Step 3: Install Scrapy
With your virtual environment active, install Scrapy with pip, Python’s package installer. It’s always a good idea to check Scrapy’s GitHub repository for the latest installation instructions or potential troubleshooting advice.
# This will install Scrapy within your virtual environment
pip install Scrapy
You shouldn’t encounter any errors, but if you do, check your internet connection or the console log for clues.
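A quick way to confirm the install landed inside your virtual environment is to ask Scrapy for its version:
# both commands should print a version number rather than an import error
scrapy version
python -c "import scrapy; print(scrapy.__version__)"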
Step 4: IDE Setup
Your development environment is not complete without an IDE or code editor. PyCharm is a great choice for Python development, but for simplicity and quick editing, Visual Studio Code (VS Code) with the Python extension is tough to beat.
You can download VS Code from the official website (https://code.visualstudio.com/).
After installation, install the Python extension by searching within the Extensions view (Ctrl+Shift+X in VS Code).
Step 5: Create a Scrapy Project
It’s finally time to initialize your first Scrapy project. In your command line, navigate to a directory where you want your project to reside, and create a new Scrapy project.
# Navigate to the directory of your choice
cd path_to_your_desired_directory
# Create a new Scrapy project
scrapy startproject tutorial
# This will create a 'tutorial' directory with the following structure
# tutorial/
#     scrapy.cfg            # deploy configuration file
#     tutorial/             # Python module where you’ll later add your code
#         __init__.py
#         items.py          # project items definition file
#         middlewares.py    # project middlewares file
#         pipelines.py      # project pipelines file
#         settings.py       # project settings file
#         spiders/          # a directory where you’ll later put your spiders
#             __init__.py
Navigate into your project’s directory and ensure everything works by listing the available commands.
cd tutorial
scrapy list  # lists the project's spiders (none yet, but the command should run without errors)
Congratulations! You’re now ready to start your web crawling journey with Scrapy. Remember, if you stumble along the way, don’t hesitate to look for answers from the Scrapy documentation or ask for help from Scrapy’s active community forums and Stack Overflow. Happy coding!
Creating Your First Scrapy Spider
After setting up the Scrapy framework, it’s time to get our hands dirty by creating our first Scrapy spider. I remember the excitement when I wrote my first one; the idea of automating data collection from large websites captivated me. If you are also thrilled by this, let’s kick off the journey together.
In technical terms, a spider is a class that you define and that Scrapy uses to scrape information from a website or a group of websites. To create a new spider, I usually navigate to the project’s main directory and execute the following command:
scrapy genspider example example.com
This creates a spider named “example” which will target the domain “example.com”. Scrapy automatically generates a spider file under the “spiders” directory in our project.
Now, let’s define the spider with its first basic features. I open the generated example.py
file and start sketching out the spider’s behaviors. Below is a simple version of what the contents might look like:
import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        self.log('Visited %s' % response.url)
        # Here we will extract and process the data later on
In this spider, name
is a unique identifier for the spider, allowed_domains
serves as a filter for the requests, and start_urls
includes the list of URLs where the spider will begin to crawl from.
The parse
method is the callback invoked by Scrapy when a response is received from a request made to URLs defined in start_urls
. Inside this method, I usually start by logging the URL that was just visited—helpful for debugging later on.
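When I’m chasing those log lines, I often run the spider with a more verbose log level; the -L (or --loglevel) option of scrapy crawl controls this:
# show DEBUG-level output, including every request Scrapy makes
scrapy crawl example -L DEBUG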
Next, we’ll focus on extracting data. The response
parameter is of immense importance as it holds the page’s content. Let’s say we want to scrape quotes from ‘http://quotes.toscrape.com’ (you’d also point allowed_domains and start_urls at that site). The parse
method might look something like this:
# ... other parts of the ExampleSpider remain the same
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }
In this snippet, we are using CSS selectors to isolate elements that contain quotes and their authors. The ::text
syntax gets the text inside a CSS-selected element, and get()
extracts it as a string (or None if nothing matched). With yield, we hand each dictionary to Scrapy as a scraped item, ready for further processing or export.
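Before committing selectors to a spider, I like to test them interactively in the Scrapy shell. A short session against the quotes site might look like this (the >>> lines are typed at the shell prompt):
scrapy shell 'http://quotes.toscrape.com/page/1/'
>>> response.css('span.text::text').get()       # first quote's text
>>> quote = response.css('div.quote')[0]
>>> quote.xpath('span/small/text()').get()      # that quote's author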
I ensure that my spider runs correctly by performing a quick scrape:
scrapy crawl example
And there you have it, our first data scrape is alive and kicking! Remember, this is your starting point—scraping is as simple or complex as you make it. We can always refine data extraction and navigate to different pages to scale up our spider.
One final piece of advice: be respectful with the crawling and adhere to the website’s robots.txt
rules and Terms of Service. Happy scraping!
Extracting and Storing Data
Once you have your Scrapy spider skimming through web pages, extracting the data is the real jackpot. I’ve found that handling the raw data properly can make or break your project. Let’s get into how to extract and store data efficiently.
First, you’ll need to define the data structure. In Scrapy, you do this by creating an Item class. It’s where you outline the fields you intend to scrape. Here’s a simple example:
import scrapy
class MyItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
With your item structure laid out, head over to your spider’s parse method to fill in these fields. A typical pattern is using the response.xpath
or response.css
selectors to pinpoint the data.
def parse(self, response):
    for product in response.css('div.product-container'):
        yield MyItem(
            title=product.css('div.product-title::text').get(),
            price=product.css('p.price::text').get(),
            stock=product.xpath('./p[@class="stock"]/text()').get(),
        )
Notice I’m using yield here; Scrapy will take care of processing the items.
Next, consider where to store the data. Scrapy supports JSON, CSV, and XML directly through the command line. But, bumping up the complexity slightly, you might want to push this data to a database. For simplicity, let’s stick with SQLite here.
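If you do want the built-in exports but prefer to configure them once rather than on the command line, newer Scrapy versions (2.1 and later) accept a FEEDS setting in settings.py. A minimal sketch, with purely illustrative file paths:
# settings.py
FEEDS = {
    'output/items.json': {'format': 'json'},
    'output/items.csv': {'format': 'csv'},
}
For the SQLite route, though, a pipeline is the right tool.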
You will need to set up a pipeline for storing items. Add this class in your pipelines.py file:
import sqlite3
class SQLitePipeline(object):
    def open_spider(self, spider):
        self.connection = sqlite3.connect('mydatabase.db')
        self.c = self.connection.cursor()
        self.c.execute('''
            CREATE TABLE IF NOT EXISTS products(
                title TEXT,
                price TEXT,
                stock TEXT
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        self.c.execute('''
            INSERT INTO products (title, price, stock) VALUES (?, ?, ?)
        ''', (item['title'], item['price'], item['stock']))
        self.connection.commit()
        return item
Activate the pipeline by adding it to the ITEM_PIPELINES
setting in your settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.SQLitePipeline': 300,
}
The open_spider
method sets up the database; here, it creates a table, but this is where you could add connection details for a remote server or a more complex setup.
The process_item
method gets called with each item that your spider yields. In our case, it’s inserting the data into SQLite. Always remember to commit
after making changes, so they’re saved.
This SQLite pipeline is stripped down but should get you started. More complex scenarios may require handling duplicates, data validation, or even setting up an ORM like SQLAlchemy.
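As one example of those more complex scenarios, a second pipeline can drop duplicates before they ever reach the database. This is only a sketch, keyed on the title field from the item above, but DropItem from scrapy.exceptions is the standard way to discard an item mid-pipeline:
from scrapy.exceptions import DropItem

class DedupPipeline:
    def open_spider(self, spider):
        # titles seen during this crawl
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = item.get('title')
        if title in self.seen_titles:
            raise DropItem(f'Duplicate item: {title!r}')
        self.seen_titles.add(title)
        return item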
Remember, there’s tons that can go wrong when storing data – lost connections, handling duplicates, blocking – so error handling and logging are crucial. Start simple, check your data, iterate, and scale up when you’re confident everything is smooth.
One last tip: While you’re developing, keep the feedback loop tight. Scrape a few items, check your database, rinse, and repeat. Troubleshooting a few rows of data beats untangling a database with thousands of messy entries.
So that’s it – your data’s extracted and stored, ready for analysis or to fuel your application. Next, we’ll talk about scaling your spider to handle larger websites, but you’ve got the gist of data handling in Scrapy.
Following Links and Recursive Crawling
Building on what we’ve covered so far, I’ll guide you through the next stages of web crawling with Scrapy – following links and recursive crawling. This part of the process can be quite intriguing, as you’ll see the spider you’ve created autonomously navigate through the web, collecting data as it goes along. The objective here is to traverse the target website comprehensively and efficiently.
When I first approached this topic, I understood that following links signifies instructing the spider to move from one page to another within a website. This is especially handy when you want to collect data from pages with a similar structure but different content, like product listings on an e-commerce site or articles from a blog.
Here’s a snippet that showcases how to follow links in Scrapy:
import scrapy
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://example.com/categories/', self.parse_category)

    def parse_category(self, response):
        # Extract the links to the individual items
        for href in response.css('.item a::attr(href)'):
            yield response.follow(href, self.parse_item)

    def parse_item(self, response):
        # Extract data from the item page
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            # More fields...
        }
Each response.follow call generates a new request with a callback, in this case parse_item, which is responsible for processing the data on the linked page.
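If the item page also needs context from the category page (say, the category name), you don’t have to re-scrape it there: Request and response.follow accept a cb_kwargs dictionary whose entries are passed to the callback as keyword arguments. A small sketch, using hypothetical selectors:
def parse_category(self, response):
    category = response.css('h1::text').get()
    for href in response.css('.item a::attr(href)'):
        # every entry in cb_kwargs becomes a keyword argument of the callback
        yield response.follow(href, self.parse_item, cb_kwargs={'category': category})

def parse_item(self, response, category):
    yield {
        'category': category,
        'title': response.css('h1::text').get(),
    }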
Recursive crawling goes a step further. It’s about following links, then following more links found on those pages, creating a loop of sorts that gathers data until no more new links are found, or until your spider reaches a condition that tells it to stop.
And here’s code incorporating recursion:
def parse_category(self, response):
    # Extract the links to the individual items
    for href in response.css('.item a::attr(href)'):
        yield response.follow(href, self.parse_item)

    # Follow pagination links and repeat the process for the next page
    next_page = response.css('.pagination-next::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, self.parse_category)
With the above code, your spider will now follow pagination links and continue crawling through category pages.
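For larger sites you can also let Scrapy discover the links for you: the framework ships a CrawlSpider base class whose Rule and LinkExtractor objects describe which links to follow and which callback to run on them. The snippet below is only a sketch of that alternative style; the CSS classes mirror the hypothetical markup used above:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/categories/']

    rules = (
        # keep following pagination links, but don't parse them as items
        Rule(LinkExtractor(restrict_css='.pagination-next'), follow=True),
        # parse every item link with parse_item
        Rule(LinkExtractor(restrict_css='.item'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }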
The real power of Scrapy comes into play when you realize how automated the process can be. Yet, there’s responsibility that rides alongside it. Always remember to respect the robots.txt
file of websites and their terms of service. Crawling aggressively can get your spider blocked and puts real strain on the site, so do it with consideration for the website’s resources.
As you experiment with your Scrapy spiders, you might find certain challenges or need to use middlewares to handle cookies, user agents, and different kinds of redirects. This is all part of the learning process, and there’s an active community and resources available to help. Check the Scrapy documentation and the related GitHub repository for more in-depth guidance and to troubleshoot any issues that arise.
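Many of those tweaks start out as plain settings before you ever need a custom middleware. The names below are real Scrapy settings, though the values here are only illustrative:
# settings.py
USER_AGENT = 'mybot (+https://example.com/contact)'  # identify your crawler honestly
COOKIES_ENABLED = True    # cookie handling is on by default; turn off if you don't need it
RETRY_ENABLED = True
RETRY_TIMES = 2           # retry failed requests a couple of times
REDIRECT_ENABLED = True   # follow HTTP redirects (the default)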
Crawling websites and following links recursively with Scrapy can genuinely be a rewarding experience, not only in the data you harvest but also in the growth of your coding skills and understanding of the web. Go ahead, try it out, and let the data reveal its stories!