minimal-web-scraper 0.1.1 documentation#
minimal-web-scraper is a Python library that provides tools for scraping data from the web. The library supports Python >=3.10,<4.0.
Note
Please be aware of the legal implications of scraping information from the internet: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
Note
The library is not published on PyPI, so it is installed from its GitHub repository URL, which requires git to be installed. See VCS support in the pip documentation.
Dependencies#
minimal-web-scraper depends on the following libraries:
requests (>=2.20,<3.0)
Quick Start#
Here is a small script that demonstrates how to use the library. Note that the library does not provide any parser. See the example in the GitHub repository for a working scraper.
# example.py
import pandas as pd
import minimal_web_scraper as scraper
from minimal_web_scraper import parsers
# import the parser modules so their parsers can be added to the scraper's list of parsers
import parser_example
# add all imported parsers (subclasses of BaseParser)
parsers.add_parser()
# or add them manually
# parsers.add_parser(parser_example.BookParser)
# parsers.add_parser(parser_example.BooksParser)
# scrape the URL passed as argument and return a dictionary of parsed data
data = scraper.scrape("https://books.toscrape.com/")
# Pretty output formatting with pandas
books = pd.DataFrame(data=data, columns=["name", "price"])
print(books.head(5))
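The script above relies on a parser_example module that this page does not show. As a rough, self-contained sketch of the kind of extraction such a module might perform, the following uses only the standard library's html.parser to collect name and price rows. It does not use or imitate the library's BaseParser interface, which is not documented here. The class BookListingParser, the function parse_books, and the SAMPLE_HTML markup are all hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, loosely modeled on a book listing page.
# A real page will differ; adjust the tag and attribute checks accordingly.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="a.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="b.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""


class BookListingParser(HTMLParser):
    """Collect {"name", "price"} rows from the sample markup (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._name = None       # title of the book currently being read
        self._in_price = False  # True while inside <p class="price_color">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "title" in attrs:
            self._name = attrs["title"]
        elif tag == "p" and attrs.get("class") == "price_color":
            self._in_price = True

    def handle_data(self, data):
        # Emit one row per book, pairing the last seen title with its price.
        if self._in_price and self._name is not None:
            self.rows.append({"name": self._name, "price": data.strip()})
            self._name = None

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_price = False


def parse_books(html):
    """Return a list of {"name", "price"} dictionaries found in html."""
    parser = BookListingParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_books(SAMPLE_HTML)
```

A list of dictionaries shaped like rows can be passed to pd.DataFrame(rows, columns=["name", "price"]) in the same way as data in the script above.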
Note
This project is still in development, and the author does not guarantee the stability of its API.
This is a very small library. If you need a comprehensive and proven web scraper in your favorite language, check out the Scrapy framework.
Get started#
Check out Get started for step-by-step instructions on setting up a project using minimal-web-scraper.
How-to guides#
Check out How-to guides for guides on specific tasks with the library.