minimal_web_scraper.parsers#

Subpackage of minimal_web_scraper. Provides the API for building and managing parsers.

Overview#

Classes#

BaseParser

Base class for creating custom parsers.

Functions#

add_parser(parser)

Add the given parser to the list of parsers that the scraper checks when parsing data.

Exceptions#

ParserNotFound

Exception raised when the scraper does not find a parser associated with a URL.

Classes#

class minimal_web_scraper.parsers.BaseParser#

Base class for creating custom parsers.

Subclasses must override parse() and scope_urls. Parsers are not intended to be instantiated.

Overview

Attributes#

scope_urls

Define which URLs the parser is intended to parse.

Methods#

parse(html_content, encoding)

Abstract class method to parse HTML chunks.

Members

scope_urls: list[str]#

Define which URLs the parser is intended to parse.

abstract classmethod parse(html_content: bytes, encoding: str | None) → Any#

Abstract method to parse HTML chunks.

Parameters:
  • html_content – the raw HTML to parse

  • encoding – the associated encoding of the HTML

Returns:

the extracted elements
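A minimal sketch of a custom parser may help make the contract concrete. To stay runnable without the package installed, it defines a local stand-in for BaseParser that mirrors the signature documented above; with the real package you would import BaseParser from minimal_web_scraper.parsers instead. The names TitleParser and _TitleCollector, the example URL, the format of the scope_urls entries, and the UTF-8 fallback are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import Any, Optional


class BaseParser(ABC):
    """Local stand-in mirroring the documented interface (an assumption)."""

    scope_urls: list[str] = []

    @classmethod
    @abstractmethod
    def parse(cls, html_content: bytes, encoding: Optional[str]) -> Any:
        ...


class _TitleCollector(HTMLParser):
    """Hypothetical helper that collects the text inside <title>."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


class TitleParser(BaseParser):
    """Hypothetical parser that extracts the page title."""

    # Assumption: entries are plain URL strings; the exact matching
    # rules for scope_urls are not specified in this reference.
    scope_urls = ["https://example.com/"]

    @classmethod
    def parse(cls, html_content: bytes, encoding: Optional[str]) -> Any:
        # Assumption: fall back to UTF-8 when no encoding is supplied.
        text = html_content.decode(encoding or "utf-8", errors="replace")
        collector = _TitleCollector()
        collector.feed(text)
        collector.close()
        return {"title": "".join(collector.title_parts).strip()}
```

Because parse() is a class method, it can be called as TitleParser.parse(raw_bytes, None) without creating an instance, which is consistent with the note that parsers are not intended to be instantiated.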

Functions#

minimal_web_scraper.parsers.add_parser(parser: Type[BaseParser] | None = None) → None#

Add the given parser to the list of parsers that the scraper checks when parsing data.

Parameters:

parser – the parser to add; it must be a subclass of BaseParser. If no parser is provided (default None), all imported parsers are added.

Raises:

TypeError – raised when the parser argument is not a subclass of BaseParser

Exceptions#

exception minimal_web_scraper.parsers.ParserNotFound#

Bases: BaseParserException

Exception raised when the scraper does not find a parser associated with a URL.
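The sketch below shows one way calling code might handle ParserNotFound. Everything in it is a local stand-in: the exception classes only reproduce the documented hierarchy, and find_parser, describe, DocsParser, and the prefix-matching rule are hypothetical, since this reference does not say how the scraper matches a URL against scope_urls.

```python
class BaseParserException(Exception):
    """Stand-in for the documented base exception."""


class ParserNotFound(BaseParserException):
    """Stand-in for the documented exception."""


class DocsParser:
    """Hypothetical parser; only scope_urls matters for this example."""

    scope_urls = ["https://example.com/docs/"]


def find_parser(url, parsers):
    """Hypothetical lookup that raises ParserNotFound on a miss."""
    for parser in parsers:
        # Assumption: a URL is in scope when it starts with a scope entry.
        if any(url.startswith(scope) for scope in parser.scope_urls):
            return parser
    raise ParserNotFound(f"no parser registered for {url}")


def describe(url):
    """Report which parser would handle the URL, or why it was skipped."""
    try:
        return find_parser(url, [DocsParser]).__name__
    except ParserNotFound as exc:
        return f"skipped: {exc}"
```

Since ParserNotFound derives from BaseParserException, catching the base class instead would also cover any other parser-related errors that share that base.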