woob.browser.pages
¶
-
class
woob.browser.pages.
AbstractPage
(browser, *args, **kwargs)¶ Bases:
woob.browser.pages.Page
Accept any arguments, necessary for AbstractPage __new__ override.
AbstractPage, in its overridden __new__, removes itself from the class hierarchy so that its __new__ is called only once. In Python 3, the default (object) __new__ is then used for subsequent instantiations, but it is a slot ("fixed") version that supports only one argument (the type to instantiate).
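The one-argument restriction described above can be reproduced with plain Python, without woob. The sketch below is only an illustration: the class names `Forwarding` and `Tolerant` are hypothetical and do not exist in woob.

```python
class Forwarding:
    """Hypothetical class whose __new__ forwards every argument to object.__new__."""

    def __new__(cls, *args, **kwargs):
        # object.__new__ rejects extra arguments when __new__ is overridden.
        return super().__new__(cls, *args, **kwargs)


class Tolerant:
    """Hypothetical class that accepts any arguments in __new__."""

    def __new__(cls, *args, **kwargs):
        # Only the type is passed on; the extra arguments are left to __init__.
        return super().__new__(cls)

    def __init__(self, *args, **kwargs):
        self.args = args


try:
    Forwarding("browser")
    forwarding_failed = False
except TypeError:
    forwarding_failed = True

tolerant = Tolerant("browser", "response")
```

Accepting `*args` and `**kwargs` and passing only the type to `object.__new__` is what lets a constructor with arbitrary arguments work here.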
-
BROWSER_ATTR
= None¶
-
PARENT
= None¶
-
PARENT_URL
= None¶
-
-
exception
woob.browser.pages.
AbstractPageError
¶ Bases:
Exception
-
class
woob.browser.pages.
ChecksumPage
¶ Bases:
object
Compute a checksum of raw content before parsing it.
-
build_doc
(content)¶
-
checksum
= None¶
-
hashfunc
(*, usedforsecurity=True)¶ Returns an MD5 hash object, optionally initialized with a string.
-
hashlib
= <module 'hashlib' from '/usr/lib/python3.9/hashlib.py'>¶
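A minimal sketch of the "checksum before parsing" idea, using only the standard library. The classes `ChecksumSketch`, `JsonSketch` and `ChecksummedJson` are hypothetical stand-ins, and storing the hex digest in `checksum` is an assumption rather than confirmed woob behaviour.

```python
import hashlib
import json


class ChecksumSketch:
    """Hypothetical mixin: hash the raw content, then delegate parsing."""

    checksum = None

    def build_doc(self, content):
        # Hash the raw bytes before any parsing happens.
        self.checksum = hashlib.md5(content).hexdigest()
        return super().build_doc(content)


class JsonSketch:
    """Hypothetical stand-in for a page class that parses JSON bytes."""

    def build_doc(self, content):
        return json.loads(content.decode("utf-8"))


class ChecksummedJson(ChecksumSketch, JsonSketch):
    """The mixin comes first so that its build_doc runs before the parser."""


page = ChecksummedJson()
doc = page.build_doc(b'{"balance": 42}')
```

Placing the mixin first in the base class list lets its `build_doc` see the raw bytes before the parser does.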
-
-
class
woob.browser.pages.
CsvPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.Page
Page which parses CSV files.
-
DIALECT
= 'excel'¶ Dialect given to the
csv
module.
-
ENCODING
= 'utf-8'¶ Encoding of the file.
-
FMTPARAMS
= {}¶ Parameters given to the
csv
module.
-
HEADER
= None¶ If not None, the line at this index is treated as a header. This means the rows will also be available as dictionaries.
-
NEWLINES_HACK
= True¶ Convert all strange newlines to unix ones.
-
build_doc
(content)¶ Abstract method to be implemented by subclasses to build structured data (HTML, Json, CSV…) from
data
property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
-
decode_row
(row, encoding)¶ Method called by
CsvPage.parse()
to decode a row using the given encoding.
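The class attributes above map naturally onto the standard `csv` module. The following sketch shows one way they could interact. `parse_csv` is a hypothetical helper, not woob code, and it simplifies the header case by returning only dictionaries.

```python
import csv
import io


def parse_csv(content, encoding="utf-8", dialect="excel", fmtparams=None,
              header=None, newlines_hack=True):
    """Hypothetical helper mirroring the CsvPage class attributes."""
    text = content.decode(encoding)
    if newlines_hack:
        # Convert Windows and old Mac line endings to unix ones.
        text = text.replace("\r\n", "\n").replace("\r", "\n")
    reader = csv.reader(io.StringIO(text, newline=""), dialect, **(fmtparams or {}))
    rows = list(reader)
    if header is None:
        return rows
    # Rows after the header line become dictionaries keyed by the header cells.
    keys = rows[header]
    return [dict(zip(keys, row)) for row in rows[header + 1:]]


raw = b"date;label;amount\r2024-01-02;Coffee;-3.50\r\n2024-01-03;Salary;1500.00\n"
rows = parse_csv(raw, fmtparams={"delimiter": ";"}, header=0)
```

In this sketch, `fmtparams` plays the role of FMTPARAMS and `header=0` the role of HEADER.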
-
-
class
woob.browser.pages.
Form
(page, el, submit_el=None)¶ Bases:
collections.OrderedDict
Represents a form of an HTML page.
It is used as a dict with pre-filled values from HTML. You can set new values as strings by setting an item value.
It is recommended to not use this class by yourself, but call
HTMLPage.get_form()
.
- Parameters
page (
Page
) – the page where the form is located
el – the form element on the page
submit_el – allows you to only consider one submit button (which is what browsers do). If set to None, it takes all of them, and if set to False, it takes none.
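The three-way meaning of submit_el (None, False, or one chosen button) can be sketched as follows. This is an illustration under assumptions: `form_values` and its `(name, value, is_submit)` tuples are hypothetical, whereas the real class works on HTML elements.

```python
from collections import OrderedDict


def form_values(inputs, submit_el=None):
    """Collect form values from hypothetical (name, value, is_submit) tuples.

    submit_el=None keeps every submit input, submit_el=False keeps none,
    and any other value keeps only the submit input with that name.
    """
    values = OrderedDict()
    for name, value, is_submit in inputs:
        if is_submit:
            if submit_el is False:
                continue
            if submit_el is not None and name != submit_el:
                continue
        values[name] = value
    return values


inputs = [
    ("login", "", False),
    ("password", "", False),
    ("ok", "Log in", True),
    ("cancel", "Cancel", True),
]
all_buttons = form_values(inputs)
no_button = form_values(inputs, submit_el=False)
one_button = form_values(inputs, submit_el="ok")
one_button["login"] = "alice"  # new values are set like dictionary items
```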
-
property
request
¶ Get the Request object from the form.
-
submit
(**kwargs)¶ Submit the form and make the browser go to the resulting page.
- Parameters
data_encoding (
basestring
) – force encoding used to submit form data (defaults to the current page encoding)
-
exception
woob.browser.pages.
FormNotFound
¶ Bases:
Exception
Raised when
HTMLPage.get_form()
can’t find a form.
-
exception
woob.browser.pages.
FormSubmitWarning
¶ Bases:
UserWarning
A form has more than one submit element selected, and will likely generate an invalid request.
-
class
woob.browser.pages.
GWTPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.Page
GWT page where the “doc” attribute is a list
More information about the GWT protocol: https://goo.gl/GP5dv9
-
build_doc
(content)¶ The response starts with “//” followed by “OK” or “EX”. The last two elements of the list are the protocol and the flag. The list must be read in reverse order.
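The response layout described above can be sketched with the standard library. `parse_gwt` is a hypothetical function, and it assumes the text after the status marker is valid JSON, which may not hold for every real GWT response.

```python
import json


def parse_gwt(text):
    """Hypothetical parser for a response shaped like '//OK[...]' or '//EX[...]'."""
    prefix, status, payload = text[:2], text[2:4], text[4:]
    if prefix != "//" or status not in ("OK", "EX"):
        raise ValueError("not a GWT response")
    items = json.loads(payload)
    # The last two items carry the protocol and the flag; they are kept apart.
    tail = items[-2:]
    # Everything before them is read in reverse order.
    body = items[-3::-1]
    return status, body, tail


status, body, tail = parse_gwt('//OK[1,2,["a"],0,7]')
```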
-
get_date
(data)¶ Get date from string
-
get_elements
(type='String')¶ Get elements of specified type
-
-
class
woob.browser.pages.
HTMLPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.Page
HTML page.
- Parameters
browser (
woob.browser.browsers.Browser
) – browser used to go on the page
response (
Response
) – response object
params (
dict
) – optional dictionary containing parameters given to the page (see
woob.browser.url.URL
)
encoding (
basestring
) – optional parameter to force the encoding of the page
-
ABSOLUTE_LINKS
= False¶ Make links URLs absolute.
-
FORM_CLASS
¶ The class to instantiate when using
HTMLPage.get_form()
. Defaults to
Form
. Alias of
woob.browser.pages.Form
-
REFRESH_MAX
= None¶ When handling a “Refresh” meta header, the page considers it only if the sleep time is less than this value.
The default value is None, which means refreshes aren’t handled.
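The REFRESH_MAX rule can be illustrated with a short standalone function. This is a sketch, not the library's code: `refresh_target` is a hypothetical name, and the regular expression only covers the common `N; url=...` form of the meta content.

```python
import re


def refresh_target(content, refresh_max):
    """Return the URL to follow, or None when the refresh is ignored."""
    if refresh_max is None:
        return None  # refreshes are not handled at all
    match = re.match(r"\s*(\d+)\s*;\s*url\s*=\s*(\S+)", content, re.IGNORECASE)
    if match is None:
        return None
    delay, url = int(match.group(1)), match.group(2)
    # Follow the refresh only if the sleep time is below the limit.
    return url if delay < refresh_max else None


ignored_by_default = refresh_target("5; url=/home", None)
followed = refresh_target("5; url=/home", 10)
too_slow = refresh_target("30; URL=/home", 10)
```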
-
REFRESH_XPATH
= '//head/meta[lower-case(@http-equiv)="refresh"]'¶ Default XPath, which is also the most common one. Override it if needed.
-
build_doc
(content)¶ Method to build the lxml document from response and given encoding.
-
define_xpath_functions
(ns)¶ Define XPath functions on the given lxml function namespace.
This method is called in the constructor of
HTMLPage
and can be overridden by child classes to add extra functions.
-
detect_encoding
()¶ Look for encoding in the document “http-equiv” and “charset” meta nodes.
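As a rough illustration of looking at both kinds of meta node, here is a standard-library sketch. `MetaCharsetFinder` and this standalone `detect_encoding` function are hypothetical. The actual method works on the parsed lxml document.

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Hypothetical parser that records the first encoding declared in a meta tag."""

    def __init__(self):
        super().__init__()
        self.encoding = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.encoding is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # <meta charset="...">
            self.encoding = attrs["charset"]
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            match = re.search(r"charset=([\w-]+)", attrs.get("content") or "", re.I)
            if match:
                self.encoding = match.group(1)


def detect_encoding(html_text):
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.encoding


from_charset = detect_encoding('<html><head><meta charset="iso-8859-15"></head></html>')
from_http_equiv = detect_encoding(
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
)
```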
-
get_form
(xpath='//form', name=None, id=None, nr=None, submit=None)¶ Get a
Form
object from a selector. The form will be analyzed and its parameters extracted. If there is more than one “submit” input, only one of them should be chosen to generate the request.
- Parameters
xpath (
str
) – xpath string to select forms
name (
str
) – if supplied, select a form with the given name
nr (
int
) – if supplied, take the (n+1)th selected form
submit (
str
) – if supplied, xpath string to select the submit element from the form
- Return type
Form
- Raises
FormNotFound
if no form is found
-
handle_refresh
()¶
-
on_load
()¶ Event called when browser loads this page.
-
class
woob.browser.pages.
JsonPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.Page
Json Page.
Notes on the JSON format: JSON must be UTF-8 encoded when used for open systems interchange (https://tools.ietf.org/html/rfc8259), so all JSON can safely be assumed to be UTF-8. One subtlety is that JSON Unicode surrogate escape sequences (used for characters > U+FFFF) are UTF-16 style. Libraries should handle that, although some don’t, even though JSON is one of the simplest formats around.
-
ENCODING
= 'utf-8-sig'¶
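Both points from the notes above, the `utf-8-sig` encoding and the UTF-16 style surrogate escapes, can be checked with the standard `json` module alone. The payload below is made up for the illustration.

```python
import json

# A made-up payload: a UTF-8 byte order mark followed by JSON that contains
# a surrogate pair escape for a character above U+FFFF.
raw = b'\xef\xbb\xbf{"emoji": "\\ud83d\\ude00"}'

# Decoding with plain utf-8 keeps the BOM, which the json module rejects.
try:
    json.loads(raw.decode("utf-8"))
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

# utf-8-sig strips the BOM when there is one and behaves like utf-8 otherwise.
doc = json.loads(raw.decode("utf-8-sig"))

# The json module joins the two escapes into a single code point.
emoji = doc["emoji"]
```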
-
build_doc
(text)¶ Abstract method to be implemented by subclasses to build structured data (HTML, Json, CSV…) from
data
property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
-
property
data
¶ Data passed to
build_doc()
.
-
get
(path, default=None)¶
-
path
(path, context=None)¶
-
-
class
woob.browser.pages.
LoggedPage
¶ Bases:
object
A page that only logged users can reach. If we did not get a redirection for this page, we are sure that the login is still active.
Do not use this class for pages with mixed content (logged/anonymous) or for pages with a login form.
-
logged
= True¶
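Because LoggedPage only sets a class attribute, it works as a mixin. The sketch below uses hypothetical stand-in classes (`PageSketch`, `LoggedSketch`, and the two example pages) instead of the real woob ones.

```python
class PageSketch:
    """Hypothetical stand-in for Page: pages are anonymous by default."""

    logged = False


class LoggedSketch:
    """Hypothetical stand-in for LoggedPage."""

    logged = True


class LoginPageSketch(PageSketch):
    """A page with a login form: it must not use the mixin."""


class AccountsPageSketch(LoggedSketch, PageSketch):
    """A page that only logged users can reach; the mixin is listed first."""


login_is_logged = LoginPageSketch.logged
accounts_is_logged = AccountsPageSketch.logged
```

Listing the mixin before the page base class matters: Python looks attributes up left to right, so `logged = True` takes precedence.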
-
-
exception
woob.browser.pages.
NextPage
(request)¶ Bases:
Exception
Exception raised, for example, in a Page to tell PagesBrowser.pagination to go to the next page.
See
PagesBrowser.pagination()
or the
pagination()
decorator.
-
class
woob.browser.pages.
PDFPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.Page
Parse a PDF and write raw data in the “doc” attribute as a string.
-
build_doc
(content)¶ Abstract method to be implemented by subclasses to build structured data (HTML, Json, CSV…) from
data
property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
-
-
class
woob.browser.pages.
Page
(*args, **kwargs)¶ Bases:
object
Represents a page.
Encoding can be forced by setting the
ENCODING
class-wide attribute, or by passing an encoding keyword argument, which overrides
ENCODING
. Finally, it can be manually changed by assigning a new value to the
encoding
instance attribute. A unicode version of the response content is accessible in
text
, decoded with the specified
encoding
.
- Parameters
browser (
woob.browser.browsers.Browser
) – browser used to go on the page
response (
Response
) – response object
params (
dict
) – optional dictionary containing parameters given to the page (see
woob.browser.url.URL
)
encoding (
basestring
) – optional parameter to force the encoding of the page, overrides
ENCODING
-
ENCODING
= None¶ Force a page encoding. It is recommended to use None for autodetection.
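The precedence between the ENCODING attribute, the encoding keyword argument and the encoding instance attribute can be sketched as follows. `PageSketch` and `LatinPage` are hypothetical, and falling back to UTF-8 stands in for the autodetection that the real class performs.

```python
class PageSketch:
    """Hypothetical stand-in that only models the encoding precedence."""

    ENCODING = None

    def __init__(self, content, encoding=None):
        self.content = content
        # Keyword argument first, then the class attribute, then a fallback.
        # The UTF-8 fallback is an assumption made for this sketch.
        self.encoding = encoding or self.ENCODING or "utf-8"

    @property
    def text(self):
        return self.content.decode(self.encoding)


class LatinPage(PageSketch):
    ENCODING = "latin-1"


latin_bytes = "café".encode("latin-1")
utf8_bytes = "café".encode("utf-8")

from_class_attribute = LatinPage(latin_bytes).text
from_keyword = LatinPage(utf8_bytes, encoding="utf-8").text

page = PageSketch(latin_bytes)
page.encoding = "latin-1"  # manual change on the instance
from_instance = page.text
```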
-
absurl
(url)¶ Get an absolute URL from a partial URL, relative to the page URL.
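Resolving a partial URL against the page URL behaves like the standard `urljoin` function. The snippet below only illustrates that resolution with made-up URLs. It does not claim to be how absurl is implemented.

```python
from urllib.parse import urljoin

# Made-up page URL used as the base for every resolution below.
page_url = "https://example.org/accounts/list.html"

relative = urljoin(page_url, "detail.html?id=3")
rooted = urljoin(page_url, "/logout")
already_absolute = urljoin(page_url, "https://other.example/x")
```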
-
build_doc
(content)¶ Abstract method to be implemented by subclasses to build structured data (HTML, Json, CSV…) from
data
property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
-
property
content
¶ Raw content from response.
-
property
data
¶ Data passed to
build_doc()
.
-
detect_encoding
()¶ Override this method to implement detection of a document-level encoding declaration, if any (e.g. HTML5’s <meta charset="some-charset">).
-
property
encoding
¶
-
logged
= False¶ If True, the page is in a restricted area of the website. Useful with
LoginBrowser
and the
need_login()
decorator.
-
normalize_encoding
(encoding)¶ Make sure we can easily compare encodings by formatting them the same way.
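One possible way to give encoding names a common form is to ask the `codecs` registry for the canonical name. This is a sketch of the idea under that assumption. The actual method may format names differently.

```python
import codecs


def normalize_encoding(encoding):
    """Hypothetical normalizer: map aliases of an encoding to one spelling."""
    if encoding is None:
        return None
    try:
        # The codec registry knows the usual aliases (UTF8, utf_8, Latin1, ...).
        return codecs.lookup(encoding).name
    except LookupError:
        # Unknown names are only lowercased so they still compare consistently.
        return encoding.strip().lower()


same_utf8 = normalize_encoding("UTF8") == normalize_encoding("utf-8")
same_latin = normalize_encoding("Latin1") == normalize_encoding("ISO-8859-1")
```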
-
on_leave
()¶ Event called when browser leaves this page.
-
on_load
()¶ Event called when browser loads this page.
-
class
woob.browser.pages.
PartialHTMLPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.HTMLPage
HTML page for broken pages with multiple roots.
This class should typically be used for requests which return only a part of a full document, to insert in another document. Such a sub-document can have multiple root tags, so this class is required in this case.
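The multiple-roots problem is easy to reproduce with a strict parser. The sketch below uses the standard `xml.etree.ElementTree` module and a made-up wrapper element purely for illustration. The real class relies on lxml's HTML parsing.

```python
import xml.etree.ElementTree as ET

# A fragment meant to be inserted in another document: it has two root tags.
fragment = "<li>One</li><li>Two</li>"

# A strict parser refuses a document with more than one root element.
try:
    ET.fromstring(fragment)
    parsed_directly = True
except ET.ParseError:
    parsed_directly = False

# Wrapping the fragment in a synthetic root (a made-up <fragment> tag) works.
root = ET.fromstring("<fragment>" + fragment + "</fragment>")
items = [el.text for el in root]
```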
-
build_doc
(content)¶ Method to build the lxml document from response and given encoding.
-
-
class
woob.browser.pages.
RawPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.Page
Raw page where the “doc” attribute is the content string.
-
build_doc
(content)¶ Abstract method to be implemented by subclasses to build structured data (HTML, Json, CSV…) from
data
property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
-
-
class
woob.browser.pages.
XLSPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.Page
XLS Page.
-
HEADER
= None¶ If not None, will consider the line represented by this index as a header.
-
SHEET_INDEX
= 0¶ Specify the index of the worksheet to use.
-
build_doc
(content)¶ Abstract method to be implemented by subclasses to build structured data (HTML, Json, CSV…) from
data
property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
-
-
class
woob.browser.pages.
XMLPage
(*args, **kwargs)¶ Bases:
woob.browser.pages.Page
XML Page.
-
build_doc
(content)¶ Abstract method to be implemented by subclasses to build structured data (HTML, Json, CSV…) from
data
property. It can also be overridden in module pages to preprocess or postprocess data. It must return an object, which will be assigned to doc.
-
detect_encoding
()¶ Override this method to implement detection of a document-level encoding declaration, if any (e.g. HTML5’s <meta charset="some-charset">).
-
-
woob.browser.pages.
pagination
(func)¶ This helper decorator can be used to handle paginated pages easily.
When the called function raises a
NextPage
exception, it goes to the requested page and calls the function again.
The NextPage constructor can take a URL or a Request object.
>>> class Page(HTMLPage):
...     @pagination
...     def iter_values(self):
...         for el in self.doc.xpath('//li'):
...             yield el.text
...         for next in self.doc.xpath('//a'):
...             raise NextPage(next.attrib['href'])
...
>>> from .browsers import PagesBrowser
>>> from .url import URL
>>> class Browser(PagesBrowser):
...     BASEURL = 'https://woob.tech'
...     list = URL('/tests/list-(?P<pagenum>\d+).html', Page)
...
>>> b = Browser()
>>> b.list.go(pagenum=1)
<woob.browser.pages.Page object at 0x...>
>>> list(b.page.iter_values())
['One', 'Two', 'Three', 'Four']
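To make the control flow of such a decorator concrete, here is a self-contained sketch that runs without woob. Everything in it is a stand-in: the `SITE` dictionary, `FakeBrowser`, and this simplified `pagination` are hypothetical and much smaller than the real helper.

```python
import functools


class NextPage(Exception):
    """Stand-in exception carrying the request (here a URL) for the next page."""

    def __init__(self, request):
        super().__init__()
        self.request = request


# Hypothetical in-memory site: URL -> (items on the page, next URL or None).
SITE = {
    "/list-1": (["One", "Two"], "/list-2"),
    "/list-2": (["Three", "Four"], None),
}


class FakeBrowser:
    """Stand-in browser that only remembers the current URL."""

    def __init__(self):
        self.url = None

    def go(self, url):
        self.url = url


def pagination(func):
    """Simplified sketch: re-run func after each NextPage until none is raised."""

    @functools.wraps(func)
    def wrapper(browser, *args, **kwargs):
        while True:
            try:
                yield from func(browser, *args, **kwargs)
            except NextPage as exc:
                browser.go(exc.request)  # move on, then call func again
            else:
                return

    return wrapper


@pagination
def iter_values(browser):
    items, next_url = SITE[browser.url]
    yield from items
    if next_url is not None:
        raise NextPage(next_url)


browser = FakeBrowser()
browser.go("/list-1")
values = list(iter_values(browser))
```

The generator yields the items of one page, raises NextPage, and the wrapper restarts it on the following page, so the caller sees one continuous sequence.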