minimal-web-scraper#

Objectif#

This example project scrapes informations from this website: https://books.toscrape.com/. We want to collect the name of the products on that site, their price and availability to display them in a table.

The project name for this example is my_first_scraper, feel free to replace this by your own project name.

Note

This tutorial will not explain how to use BeautifulSoup, see BeautifulSoup documentation.

Parsers#

The file parser_example.py hosts the code of our parsers.

The purpose of this function is to grab the informations we want from the page and to pack it in a dictionary we can easily manipulate.

This is the minimal required in a parser from the library.

from minimal_web_scraper.parsers import BaseParser

class Parser(BaseParser):
    scope_urls = []

    @classmethod
    def parse(cls, html_content, encoding):
        return []

In our case, the parser parses the data from https://books.toscrape.com/, so it is our scope_urls.

from minimal_web_scraper.parsers import BaseParser

class Parser(BaseParser):
    scope_urls = ["https://books.toscrape.com"]

    @classmethod
    def parse(cls, html_content, encoding):
        return []

The parser method receives a html_content as an argument. It is the text file that contains the informations we want to extract. To easily navigate in this file, we use the secrets ingredients: BeautifulSoup and re. It will do most of the work for us.

from bs4 import BeautifulSoup
import re

from minimal_web_scraper.parsers import BaseParser

class Parser(BaseParser):
    scope_urls = ["https://books.toscrape.com"]

    @classmethod
    def parse(cls, html_content, encoding):
        soup = BeautifulSoup(html_content, "html.parser", from_encoding=encoding)
        return []

The page https://books.toscrape.com/ contains multiple books. BeautifulSoup give us a list books. Each element book of this list contains informations on one book. In a for loop, we create a dictionary item that contains the informations we want to return.

from bs4 import BeautifulSoup
import re

from minimal_web_scraper.parsers import BaseParser

class Parser(BaseParser):
    scope_urls = ["https://books.toscrape.com"]

    @classmethod
    def parse(cls, html_content, encoding):
        scraped_items = []
        soup = BeautifulSoup(html_content, "html.parser", from_encoding=encoding)
        books = soup.find_all("article", class_="product_pod")

        for book in books:
            item = {
                "name": None,
                "url": None,
                "availability": None,
                "img_url": None,
                "price": None,
            }
            scraped_items.append(item)

        return scraped_items

We extract the informations we want.

from bs4 import BeautifulSoup
import re

from minimal_web_scraper.parsers import BaseParser

class Parser(BaseParser):
    base_url = "https://books.toscrape.com/"
    scope_urls = ["https://books.toscrape.com"]

    @classmethod
    def parse(cls, html_content, encoding):
        scraped_items = []
        soup = BeautifulSoup(html_content, "html.parser", from_encoding=encoding)
        books = soup.find_all("article", class_="product_pod")

        for book in books:
            item = {
                "name": None,
                "url": None,
                "availability": None,
                "img_url": None,
                "price": None,
            }
            item["name"] = book.h3.a["title"]
            item["url"] = cls.base_url + book.h3.a["href"]
            item["availability"] = book.find("p", class_="availability").get_text()
            item["img_url"] = cls.base_url + book.img["src"]
            item["price"] = book.find("p", class_="price_color").get_text()
            scraped_items.append(item)

        return scraped_items

Optionally, we format the data:

from bs4 import BeautifulSoup
import re

from minimal_web_scraper.parsers import BaseParser

class Parser(BaseParser):
    base_url = "https://books.toscrape.com/"
    scope_urls = ["https://books.toscrape.com"]

    @classmethod
    def parse(cls, html_content, encoding):
        scraped_items = []
        soup = BeautifulSoup(html_content, "html.parser", from_encoding=encoding)
        books = soup.find_all("article", class_="product_pod")

        for book in books:
            item = {
                "name": None,
                "url": None,
                "availability": None,
                "img_url": None,
                "price": None,
            }
            item["name"] = book.h3.a["title"]
            item["url"] = cls.base_url + book.h3.a["href"]
            item["availability"] = book.find("p", class_="availability").get_text()
            item["img_url"] = cls.base_url + book.img["src"]
            item["price"] = book.find("p", class_="price_color").get_text()

            if item["price"]:
                item["price"] = float(item["price"][2:])
            if item["availability"]:
                matched = re.search("(In stock|Out of stock)", item["availability"])
                item["availability"] = matched[0]

            scraped_items.append(item)

        return scraped_items

Here what it looks like at the end:

from bs4 import BeautifulSoup
import re

from minimal_web_scraper.parsers import BaseParser

class Parser(BaseParser):
    base_url = "https://books.toscrape.com/"
    scope_urls = ["https://books.toscrape.com"]

    @classmethod
    def parse(cls, html_content, encoding):
        scraped_items = []
        soup = BeautifulSoup(html_content, "html.parser", from_encoding=encoding)
        books = soup.find_all("article", class_="product_pod")

        for book in books:
            item = {
                "name": None,
                "url": None,
                "availability": None,
                "img_url": None,
                "price": None,
            }
            item["name"] = book.h3.a["title"]
            item["url"] = cls.base_url + book.h3.a["href"]
            item["availability"] = book.find("p", class_="availability").get_text()
            item["img_url"] = cls.base_url + book.img["src"]
            item["price"] = book.find("p", class_="price_color").get_text()

            if item["price"]:
                item["price"] = float(item["price"][2:])
            if item["availability"]:
                matched = re.search("(In stock|Out of stock)", item["availability"])
                item["availability"] = matched[0]

            scraped_items.append(item)

        return scraped_items

Scraper#

The file main.py hosts the script that call the library and the parser.

# example.py

import pandas as pd

import minimal_web_scraper as scraper
from minimal_web_scraper import parsers

# import the parsers modules to add them to the scraper list of parsers
import parser_example

# add all parsers imported (which are subclass of BaseParser)
parsers.add_parser()

# or add them manually
# parsers.add_parser(parser_example.BookParser)
# parsers.add_parser(parser_example.BooksParser)

# scrape the URL in argument and return a dictionary of parsed data
data = scraper.scrape("https://books.toscrape.com/")

# Pretty output formatting with pandas
books = pd.DataFrame(data=data, columns=["name", "price"])
print(books.head(5))

Execution#

Now the project structure is:

my_first_scraper
|
|- main.py
|- parser_example.py
|- venv

Head back to the terminal, we need to install the libraries we used.

(venv)$ pip install bs4 pandas

Then, we can finally run our program:

(venv)$ python main.py

Here is the result:

                                    name  price
0                   A Light in the Attic  51.77
1                     Tipping the Velvet  53.74
2                             Soumission  50.10
3                          Sharp Objects  47.82
4  Sapiens: A Brief History of Humankind  54.23

Next#

Now that you have a working parser, you can try to write a new one for single products pages of the same site (example). If you need help, you can check a possible answer at the Github repository.