woob.browser.pages

class woob.browser.pages.AbstractPage(browser, *args, **kwargs)

Bases: woob.browser.pages.Page

Accept any arguments, necessary for AbstractPage __new__ override.

AbstractPage, in its overridden __new__, removes itself from the class hierarchy so that its __new__ is called only once. In Python 3, the default (object) __new__ is then used for subsequent instantiations, but it is a slot ("fixed") version that supports only one argument (the type to instantiate).

BROWSER_ATTR = None
PARENT = None
PARENT_URL = None
exception woob.browser.pages.AbstractPageError

Bases: Exception

class woob.browser.pages.ChecksumPage

Bases: object

Compute a checksum of raw content before parsing it.

build_doc(content)
checksum = None
hashfunc(*, usedforsecurity=True)

Returns an MD5 hash object, optionally initialized with a string.

hashlib = <module 'hashlib' from '/usr/lib/python3.9/hashlib.py'>
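As a rough illustration of the idea behind ChecksumPage, the sketch below hashes the raw bytes before handing them to a parser. The class name ChecksumDemo and the choice of JSON as the parsed format are assumptions made for this example; only the names checksum, hashfunc and build_doc come from the listing above.

```python
import hashlib
import json


class ChecksumDemo:
    """Hypothetical stand-in, not the woob class."""

    # staticmethod keeps the hash constructor from being bound to the instance
    hashfunc = staticmethod(hashlib.md5)
    checksum = None

    def build_doc(self, content):
        # Hash the raw bytes first, then parse them.
        self.checksum = self.hashfunc(content).hexdigest()
        return json.loads(content)


page = ChecksumDemo()
doc = page.build_doc(b'{"balance": 42}')
```

After the call, page.checksum holds the hexadecimal digest of the unparsed content, which can be compared between two fetches to detect a change.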
class woob.browser.pages.CsvPage(*args, **kwargs)

Bases: woob.browser.pages.Page

Page which parses CSV files.

DIALECT = 'excel'

Dialect given to the csv module.

ENCODING = 'utf-8'

Encoding of the file.

FMTPARAMS = {}

Parameters given to the csv module.

HEADER = None

If not None, the line at this index is treated as a header, and the rows are also available as dictionaries.

NEWLINES_HACK = True

Convert all strange newlines to unix ones.

build_doc(content)

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

decode_row(row, encoding)

Method called by CsvPage.parse() to decode a row using the given encoding.

parse(data, encoding=None)

Method called by the constructor of CsvPage to parse the document.

Parameters
  • data (BytesIO) – file stream

  • encoding (str) – if given, use it to decode cell strings
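The interplay of DIALECT, ENCODING, HEADER and NEWLINES_HACK can be pictured with the standard csv module alone. This is a minimal sketch under assumptions: the module-level constants and the parse function below are stand-ins written for the example, not the CsvPage implementation.

```python
import csv
import io

DIALECT = "excel"
ENCODING = "utf-8"
HEADER = 0  # index of the header line; None would mean "no header"


def parse(data, encoding=ENCODING):
    # Decode the byte stream, then turn "\r\n" and lone "\r" into "\n"
    # (the idea behind NEWLINES_HACK).
    text = data.read().decode(encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    rows = list(csv.reader(io.StringIO(text, newline=""), dialect=DIALECT))
    if HEADER is None:
        return rows
    header = rows[HEADER]
    # With a header, each following row is exposed as a dictionary.
    return [dict(zip(header, row)) for row in rows[HEADER + 1:]]


stream = io.BytesIO("date,label,amount\r2024-01-02,Café,-3.50\r".encode("utf-8"))
rows = parse(stream)
```

Here the input uses bare carriage returns as line endings, so the rows only parse correctly once the newlines have been normalized.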

class woob.browser.pages.Form(page, el, submit_el=None)

Bases: collections.OrderedDict

Represents a form of an HTML page.

It is used as a dict with pre-filled values from HTML. You can set new values as strings by setting an item value.

It is recommended not to instantiate this class yourself, but to call HTMLPage.get_form() instead.

Parameters
  • page (Page) – the page where the form is located

  • el – the form element on the page

  • submit_el – allows you to only consider one submit button (which is what browsers do). If set to None, it takes all of them, and if set to False, it takes none.

property request

Get the Request object from the form.

submit(**kwargs)

Submit the form and make the browser navigate to the resulting page.

Parameters
  • data_encoding (basestring) – force the encoding used to submit form data (defaults to the current page encoding)
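The submit_el rule described in the parameters of Form (None keeps every submit button, False keeps none, anything else keeps only the chosen one) can be sketched as follows. The build_form helper and the use of button names in place of HTML elements are assumptions made for the example.

```python
from collections import OrderedDict


def build_form(fields, submits, submit_el=None):
    """Hypothetical helper: fields and submits are lists of (name, value)."""
    form = OrderedDict(fields)
    if submit_el is None:
        kept = submits            # take all submit buttons
    elif submit_el is False:
        kept = []                 # take none of them
    else:
        kept = [(n, v) for n, v in submits if n == submit_el]
    form.update(kept)
    return form


fields = [("login", ""), ("password", "")]
submits = [("ok", "Sign in"), ("cancel", "Cancel")]

form = build_form(fields, submits, submit_el="ok")
form["login"] = "alice"  # values are set like items of a dict
```

Keeping a single submit button mirrors what a web browser sends, which is why sending several of them is the case that FormSubmitWarning warns about.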

exception woob.browser.pages.FormNotFound

Bases: Exception

Raised when HTMLPage.get_form() can’t find a form.

exception woob.browser.pages.FormSubmitWarning

Bases: UserWarning

A form has more than one submit element selected, and will likely generate an invalid request.

class woob.browser.pages.GWTPage(*args, **kwargs)

Bases: woob.browser.pages.Page

GWT page where the “doc” attribute is a list.

More information about the GWT protocol: https://goo.gl/GP5dv9

build_doc(content)

The response starts with “//” followed by “OK” or “EX”. The last two elements in the list are the protocol and the flag. The list must be read in reverse order.
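The framing described for build_doc can be sketched like this. It is only an illustration: parse_gwt is a hypothetical function, and it assumes the payload after the status is valid JSON, which real GWT responses are not guaranteed to be.

```python
import json


def parse_gwt(content):
    """Hypothetical parser for the '//OK[...]' framing described above."""
    if not content.startswith("//"):
        raise ValueError("missing '//' prefix")
    status, payload = content[2:4], content[4:]
    if status not in ("OK", "EX"):
        raise ValueError("unexpected status: %r" % status)
    items = json.loads(payload)
    trailer = items[-2:]                   # the two trailing control values
    body = list(reversed(items[:-2]))      # the rest, read in reverse order
    return status, trailer, body


status, trailer, body = parse_gwt('//OK[3,2,1,["a","b"],0,7]')
```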

get_date(data)

Get date from string

get_elements(type='String')

Get elements of specified type

class woob.browser.pages.HTMLPage(*args, **kwargs)

Bases: woob.browser.pages.Page

HTML page.

Parameters
  • browser (woob.browser.browsers.Browser) – browser used to go on the page

  • response (Response) – response object

  • params (dict) – optional dictionary containing parameters given to the page (see woob.browser.url.URL)

  • encoding (basestring) – optional parameter to force the encoding of the page

Make links URLs absolute.

FORM_CLASS

The class to instantiate when using HTMLPage.get_form(). Defaults to Form.

alias of woob.browser.pages.Form

REFRESH_MAX = None

When handling a “Refresh” meta header, the page follows it only if the sleep time is less than this value.

The default value is None, which means refreshes are not handled.
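The REFRESH_MAX rule can be illustrated with a small function that looks at the content attribute of a refresh meta tag. The function name refresh_target and the decision to return None when no target URL is present are assumptions for this sketch, not the behavior of handle_refresh().

```python
def refresh_target(content_attr, refresh_max):
    """Return the URL to follow, or None if the refresh is ignored."""
    if refresh_max is None:
        return None  # refreshes are not handled at all
    delay, _, rest = content_attr.partition(";")
    try:
        sleep = float(delay.strip())
    except ValueError:
        return None
    if sleep >= refresh_max:
        return None  # only delays strictly below the limit are followed
    rest = rest.strip()
    if rest.lower().startswith("url="):
        return rest[4:].strip("'\" ")
    return None


followed = refresh_target("5; url=/next", refresh_max=10)
too_slow = refresh_target("30; URL=/next", refresh_max=10)
disabled = refresh_target("5; url=/next", refresh_max=None)
```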

REFRESH_XPATH = '//head/meta[lower-case(@http-equiv)="refresh"]'

Default XPath, which is also the most common one; override it if needed.

build_doc(content)

Method to build the lxml document from response and given encoding.

define_xpath_functions(ns)

Define XPath functions on the given lxml function namespace.

This method is called in the constructor of HTMLPage and can be overridden by child classes to add extra functions.

detect_encoding()

Look for encoding in the document “http-equiv” and “charset” meta nodes.
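To make the two kinds of meta nodes concrete, here is a deliberately simple sketch that searches the raw bytes with a regular expression. The pattern and the function are assumptions written for this example; the real method works on the parsed document rather than on a regex.

```python
import re

# Matches both <meta charset="..."> and the charset=... part of an
# http-equiv Content-Type meta tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_encoding(content):
    """Return the declared charset in lower case, or None if absent."""
    match = META_CHARSET.search(content)
    return match.group(1).decode("ascii").lower() if match else None


html5 = b'<html><head><meta charset="UTF-8"></head></html>'
legacy = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=iso-8859-1"></head></html>')
```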

get_form(xpath='//form', name=None, id=None, nr=None, submit=None)

Get a Form object from a selector. The form will be analyzed and its parameters extracted. In the case there is more than one “submit” input, only one of them should be chosen to generate the request.

Parameters
  • xpath (str) – xpath string to select forms

  • name (str) – if supplied, select a form with the given name

  • nr (int) – if supplied, take the selected form at this zero-based index (nr=0 is the first form)

  • submit (str) – if supplied, xpath string to select the submit element from the form

Return type

Form

Raises

FormNotFound if no form is found

handle_refresh()
on_load()

Event called when browser loads this page.

class woob.browser.pages.JsonPage(*args, **kwargs)

Bases: woob.browser.pages.Page

Json Page.

Notes on the JSON format: JSON must be UTF-8 encoded when used for open systems interchange (https://tools.ietf.org/html/rfc8259), so all JSON can safely be assumed to be UTF-8. One subtlety is that JSON Unicode surrogate escape sequences (used for characters > U+FFFF) are UTF-16 style, but that should be handled by libraries (some don’t, even though JSON is one of the simplest formats around).
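Both points can be checked with the standard library. The sample bytes below are made up for the example: they start with a UTF-8 byte order mark, which the 'utf-8-sig' codec strips, and they contain a UTF-16 style surrogate pair escape, which the json module joins into a single character.

```python
import json

# BOM + a JSON object whose value is the escaped surrogate pair for U+1F600
raw = b'\xef\xbb\xbf{"emoji": "\\ud83d\\ude00"}'

text = raw.decode("utf-8-sig")   # the BOM is removed while decoding
doc = json.loads(text)           # the surrogate pair becomes one code point
```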

ENCODING = 'utf-8-sig'
build_doc(text)

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

property data

Data passed to build_doc().

get(path, default=None)
path(path, context=None)
class woob.browser.pages.LoggedPage

Bases: object

A page that only logged users can reach. If we did not get a redirection for this page, we are sure that the login is still active.

Do not use this class for pages with mixed content (logged/anonymous) or for pages with a login form.

logged = True
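Since LoggedPage derives from object and only sets logged = True, it reads as a mixin to combine with a concrete page class. The sketch below shows that pattern with stand-in classes; DemoPage, DemoLoggedPage and the two page classes are hypothetical and only mimic the logged attribute.

```python
class DemoPage:
    """Stand-in for a base page: anonymous by default."""
    logged = False


class DemoLoggedPage:
    """Stand-in for the mixin: marks a page as reachable only when logged in."""
    logged = True


class LoginFormPage(DemoPage):
    pass  # a page with a login form must not use the mixin


class AccountsPage(DemoLoggedPage, DemoPage):
    pass  # the mixin comes first so its attribute wins in the MRO
```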
class woob.browser.pages.LoginPage

Bases: object

on_load()
exception woob.browser.pages.NextPage(request)

Bases: Exception

Exception raised, for example, in a Page to tell PagesBrowser.pagination to go to the next page.

See PagesBrowser.pagination() or decorator pagination().

class woob.browser.pages.PDFPage(*args, **kwargs)

Bases: woob.browser.pages.Page

Parse a PDF and write raw data in the “doc” attribute as a string.

build_doc(content)

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

class woob.browser.pages.Page(*args, **kwargs)

Bases: object

Represents a page.

Encoding can be forced by setting the ENCODING class-wide attribute, or by passing an encoding keyword argument, which overrides ENCODING. Finally, it can be changed manually by assigning a new value to the encoding instance attribute. A Unicode version of the response content is accessible in text, decoded with the specified encoding.

Parameters
  • browser (woob.browser.browsers.Browser) – browser used to go on the page

  • response (Response) – response object

  • params (dict) – optional dictionary containing parameters given to the page (see woob.browser.url.URL)

  • encoding (basestring) – optional parameter to force the encoding of the page, overrides ENCODING
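The three ways of choosing an encoding described for Page can be pictured with a small stand-in class. DemoPage and LatinPage are hypothetical, and the fallback to UTF-8 when nothing is specified is an assumption of this sketch (the real class relies on detection instead).

```python
class DemoPage:
    """Stand-in showing the precedence: keyword argument, then ENCODING."""
    ENCODING = None

    def __init__(self, content, encoding=None):
        self.content = content
        # The keyword argument overrides the class-wide attribute.
        self.encoding = encoding or self.ENCODING or "utf-8"

    @property
    def text(self):
        # Decoded lazily, so a later change of self.encoding is honored.
        return self.content.decode(self.encoding)


class LatinPage(DemoPage):
    ENCODING = "latin-1"


by_class = LatinPage("é".encode("latin-1"))
by_kwarg = LatinPage("é".encode("utf-8"), encoding="utf-8")

by_assignment = LatinPage("é".encode("cp1252"))
by_assignment.encoding = "cp1252"  # manual change on the instance
```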

Accept any arguments, necessary for AbstractPage __new__ override.

AbstractPage, in its overridden __new__, removes itself from the class hierarchy so that its __new__ is called only once. In Python 3, the default (object) __new__ is then used for subsequent instantiations, but it is a slot ("fixed") version that supports only one argument (the type to instantiate).

ENCODING = None

Force a page encoding. It is recommended to use None for autodetection.

absurl(url)

Get an absolute URL from a partial URL, relative to the Page URL.
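Resolving a partial URL against the page URL is what urllib.parse.urljoin does in the standard library, so it gives a fair picture of the expected results. Whether absurl() is implemented this way is an assumption, and the URLs below are made up.

```python
from urllib.parse import urljoin

page_url = "https://example.com/accounts/list.html?page=2"

sibling = urljoin(page_url, "detail.html")      # relative to the directory
rooted = urljoin(page_url, "/login")            # relative to the host
other_host = urljoin(page_url, "//cdn.example.com/app.js")  # scheme-relative
```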

build_doc(content)

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

property content

Raw content from response.

property data

Data passed to build_doc().

detect_encoding()

Override this method to implement detection of a document-level encoding declaration, if any (e.g. HTML5’s <meta charset="some-charset">).

property encoding
logged = False

If True, the page is in a restricted area of the website. Useful with LoginBrowser and the need_login() decorator.

normalize_encoding(encoding)

Make sure we can easily compare encodings by formatting them the same way.
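One way to get comparable encoding names is to ask Python's codec registry for the canonical name. This sketch is an assumption about how such a normalization could be done, not a description of what normalize_encoding() does internally.

```python
import codecs


def normalize_encoding(encoding):
    """Hypothetical normalizer: map any known alias to one canonical name."""
    if encoding is None:
        return None
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        # Unknown codec: fall back to a plain lower-case comparison key.
        return encoding.lower()


same = normalize_encoding("ISO-8859-1") == normalize_encoding("latin_1")
```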

on_leave()

Event called when browser leaves this page.

on_load()

Event called when browser loads this page.

property text

Content of the response, in unicode, decoded with encoding.

class woob.browser.pages.PartialHTMLPage(*args, **kwargs)

Bases: woob.browser.pages.HTMLPage

HTML page for broken pages with multiple roots.

This class should typically be used for requests which return only a part of a full document, to insert in another document. Such a sub-document can have multiple root tags, so this class is required in this case.
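The multiple-roots problem is easy to reproduce with a strict parser. The sketch below uses the standard xml.etree module, not lxml, and wraps the fragment in a synthetic root element; both choices are assumptions made for illustration and say nothing about how PartialHTMLPage parses its input.

```python
import xml.etree.ElementTree as ET

fragment = "<li>One</li><li>Two</li>"  # two root elements, no enclosing tag

try:
    ET.fromstring(fragment)
    parsed_directly = True
except ET.ParseError:
    # A document may have only one root, so the second <li> is rejected.
    parsed_directly = False

# Wrapping the fragment in an artificial root makes it a single document.
root = ET.fromstring("<root>%s</root>" % fragment)
items = [li.text for li in root.findall("li")]
```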

build_doc(content)

Method to build the lxml document from response and given encoding.

class woob.browser.pages.RawPage(*args, **kwargs)

Bases: woob.browser.pages.Page

Raw page where the “doc” attribute is the content string.

build_doc(content)

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

class woob.browser.pages.XLSPage(*args, **kwargs)

Bases: woob.browser.pages.Page

XLS Page.

HEADER = None

If not None, will consider the line represented by this index as a header.

SHEET_INDEX = 0

Specify the index of the worksheet to use.

build_doc(content)

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

parse(data)

Method called by the constructor of XLSPage to parse the document.

class woob.browser.pages.XMLPage(*args, **kwargs)

Bases: woob.browser.pages.Page

XML Page.

build_doc(content)

Abstract method to be implemented by subclasses to build structured data (HTML, JSON, CSV…) from the data property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.

detect_encoding()

Override this method to implement detection of a document-level encoding declaration, if any (e.g. HTML5’s <meta charset="some-charset">).

woob.browser.pages.pagination(func)

This helper decorator can be used to handle pagination pages easily.

When the decorated function raises a NextPage exception, the browser goes to the requested page and the function is called again.

The NextPage constructor accepts a URL or a Request object.

>>> class Page(HTMLPage):
...     @pagination
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> from .browsers import PagesBrowser
>>> from .url import URL
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL(r'/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1) 
<woob.browser.pages.Page object at 0x...>
>>> list(b.page.iter_values())
['One', 'Two', 'Three', 'Four']
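The control flow of the decorator can also be sketched without woob at all. In this toy version, NextPage, pagination, SITE and iter_values are all stand-ins defined for the example; the real decorator works on page methods and asks the browser to load the next location, which is simulated here by a dictionary lookup.

```python
import functools


class NextPage(Exception):
    """Stand-in exception carrying the location of the next page."""

    def __init__(self, request):
        super().__init__(request)
        self.request = request


# Hypothetical in-memory "site": page name -> (items, next page or None)
SITE = {
    "list-1": (["One", "Two"], "list-2"),
    "list-2": (["Three", "Four"], None),
}


def pagination(func):
    @functools.wraps(func)
    def wrapper(page_name):
        while True:
            try:
                yield from func(page_name)
            except NextPage as exc:
                page_name = exc.request  # "go to" the requested page
            else:
                return                   # no NextPage raised: last page done
    return wrapper


@pagination
def iter_values(page_name):
    items, next_page = SITE[page_name]
    yield from items
    if next_page is not None:
        raise NextPage(next_page)


values = list(iter_values("list-1"))
```

The caller sees one continuous stream of values, while the loop in the wrapper restarts the generator function on each new page.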