woob.browser.filters.standard

class woob.browser.filters.standard.Async(name, selector=None)

Bases: woob.browser.filters.base.Filter

Selector that uses another page fetched earlier.

Often used in combination with AsyncLoad filter. Requires that the other page’s URL is matched with a Page by the Browser.

Example:

class item(ItemElement):
    load_details = Field('url') & AsyncLoad

    obj_description = Async('details') & CleanText('//h3')
Parameters

default – default value in case the filter fails to find or parse the requested value

filter(*args)

This method has to be overridden by children classes.

loaded_page(item)
class woob.browser.filters.standard.AsyncLoad(selector=None, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Load a page asynchronously for later use.

Often used in combination with Async filter.

Parameters

default – default value in case the filter fails to find or parse the requested value

class woob.browser.filters.standard.Base(base, selector=None, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Change the base element used in filters.

>>> Base(Env('header'), CleanText('./h1'))  
Parameters

default – default value in case the filter fails to find or parse the requested value

class woob.browser.filters.standard.BrowserURL(url_name, **kwargs)

Bases: woob.browser.filters.standard.MultiFilter

Parameters

default – default value in case the filter fails to find or parse the requested value

filter(values)

This method has to be overridden by children classes.

class woob.browser.filters.standard.CleanDecimal(selector=None, replace_dots=False, sign=None, legacy=True, default=NO_DEFAULT)

Bases: woob.browser.filters.standard.CleanText

Get a cleaned Decimal value from an element.

replace_dots is False by default. A dot is interpreted as a decimal separator.

If replace_dots is set to True, all dots are removed and ‘,’ is used as the decimal separator (often useful for French values).

If replace_dots is a tuple, the first element will be used as the thousands separator, and the second as the decimal separator.

See http://en.wikipedia.org/wiki/Thousands_separator#Examples_of_use

For example, for the UK style (as in 1,234,567.89):

>>> CleanDecimal('./td[1]', replace_dots=(',', '.'))  
Parameters

sign – function accepting the text as param and returning the sign

classmethod French(*args, **kwargs)
classmethod Italian(*args, **kwargs)
classmethod SI(*args, **kwargs)
classmethod US(*args, **kwargs)
filter(text)

This method has to be overridden by children classes.
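
The three forms of replace_dots described above can be sketched with the standard library alone. The helper parse_decimal below is hypothetical and is not woob's implementation; in particular, how the real filter treats other stray characters is not covered here.

```python
from decimal import Decimal


def parse_decimal(text, replace_dots=False):
    """Hypothetical helper mirroring the documented replace_dots forms."""
    if isinstance(replace_dots, tuple):
        # (thousands separator, decimal separator)
        thousands, decimal_sep = replace_dots
    elif replace_dots:
        # True: dots are dropped, ',' is the decimal separator
        thousands, decimal_sep = '.', ','
    else:
        # False: a dot is the decimal separator, nothing else is touched
        thousands, decimal_sep = '', '.'
    if thousands:
        text = text.replace(thousands, '')
    text = text.replace(decimal_sep, '.')
    return Decimal(text.strip())


uk_style = parse_decimal('1,234,567.89', replace_dots=(',', '.'))
french_style = parse_decimal('1.234,56', replace_dots=True)
```

Here uk_style is Decimal('1234567.89') and french_style is Decimal('1234.56').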

class woob.browser.filters.standard.CleanText(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)

Bases: woob.browser.filters.base.Filter

Get a cleaned text from an element.

It first replaces all tabs and runs of multiple spaces (including newlines if newlines is True) with a single space and strips the resulting string.

The result is coerced into unicode, and optionally normalized according to the normalize argument.

Then it removes all symbols given in the symbols argument.

>>> CleanText().filter('coucou ') == u'coucou'
True
>>> CleanText().filter(u'coucou coucou') == u'coucou coucou'
True
>>> CleanText(newlines=True).filter(u'coucou\r\n coucou ') == u'coucou coucou'
True
>>> CleanText(newlines=False).filter(u'coucou\r\n coucou ') == u'coucou\ncoucou'
True
Parameters
  • symbols (list) – list of strings to remove from text

  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform

  • children (bool) – whether to get text from child elements of the selected elements

  • newlines (bool) – if True, newlines will be converted to space too

  • normalize (str or None) – Unicode normalization to perform

  • transliterate (bool) – Transliterates unicode characters into ASCII characters

classmethod clean(txt, children=True, newlines=True, normalize='NFC', transliterate=False)
filter(txt)

This method has to be overridden by children classes.

classmethod remove(txt, symbols)
classmethod replace(txt, replace)
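
The cleaning steps described for CleanText can be approximated with the standard library. This is a minimal sketch of the newlines=True case only; clean_text is a hypothetical stand-in, not the library's code, and the final strip after symbol removal is an assumption.

```python
import re
import unicodedata


def clean_text(txt, symbols='', normalize='NFC'):
    """Hypothetical stand-in for the documented steps (newlines=True case)."""
    # Collapse runs of whitespace (tabs, spaces, newlines) into one space.
    txt = re.sub(r'\s+', ' ', txt).strip()
    # Optional Unicode normalization, e.g. 'NFC'.
    if normalize:
        txt = unicodedata.normalize(normalize, txt)
    # Remove every requested symbol, then strip again (assumed).
    for symbol in symbols:
        txt = txt.replace(symbol, '')
    return txt.strip()


cleaned = clean_text('coucou\r\n coucou ')
amount = clean_text(' 12 € ', symbols='€')
```

Here cleaned is 'coucou coucou' and amount is '12'.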
class woob.browser.filters.standard.Coalesce(*args, **kwargs)

Bases: woob.browser.filters.standard.MultiFilter

Returns the first value that is not falsy, or default if all values are falsy.

Parameters

default – default value in case the filter fails to find or parse the requested value

filter(values)

This method has to be overridden by children classes.
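
The "first value that is not falsy" rule is plain Python truthiness. A minimal sketch, with coalesce as a hypothetical function name:

```python
def coalesce(values, default=None):
    """Return the first truthy value, or default if there is none."""
    for value in values:
        if value:
            return value
    return default


first = coalesce(['', None, 'x', 'y'])
fallback = coalesce([0, ''], default='n/a')
```

Note that 0 and the empty string both count as falsy, so fallback is 'n/a' while first is 'x'.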

class woob.browser.filters.standard.CombineDate(date, time)

Bases: woob.browser.filters.standard.MultiFilter

Combine separate Date and Time filters into a single datetime.

filter(values)

This method has to be overridden by children classes.
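
The combination itself corresponds to what datetime.datetime.combine does in the standard library. The sketch below only shows that operation; whether CombineDate uses this call internally is an assumption.

```python
import datetime

day = datetime.date(2024, 1, 31)
hour = datetime.time(18, 30)

# Merge a date and a time into a single datetime.
combined = datetime.datetime.combine(day, hour)
```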

class woob.browser.filters.standard.Currency(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)

Bases: woob.browser.filters.standard.CleanText

Parameters
  • symbols (list) – list of strings to remove from text

  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform

  • children (bool) – whether to get text from child elements of the selected elements

  • newlines (bool) – if True, newlines will be converted to space too

  • normalize (str or None) – Unicode normalization to perform

  • transliterate (bool) – Transliterates unicode characters into ASCII characters

filter(txt)

This method has to be overridden by children classes.

class woob.browser.filters.standard.Date(selector=None, default=NO_DEFAULT, translations=None, parse_func=<function parse>, strict=True, **kwargs)

Bases: woob.browser.filters.standard.DateTime

Parse date.

Parameters
  • dayfirst (bool) – if True, the day is the first element in the string to parse

  • parse_func – the function to use for parsing the datetime

  • translations (list[tuple[str, str]]) – string replacements from site locale to English

  • tzinfo (datetime.tzinfo) – timezone to set if none was parsed

filter(txt)

This method has to be overridden by children classes.

class woob.browser.filters.standard.DateGuesser(selector, date_guesser, **kwargs)

Bases: woob.browser.filters.base.Filter

Parameters

default – default value in case the filter fails to find or parse the requested value

class woob.browser.filters.standard.DateTime(selector=None, default=NO_DEFAULT, translations=None, parse_func=<function parse>, strict=True, tzinfo=None, **kwargs)

Bases: woob.browser.filters.base.Filter

Parse date and time.

Parameters
  • dayfirst (bool) – if True, the day is the first element in the string to parse

  • parse_func – the function to use for parsing the datetime

  • translations (list[tuple[str, str]]) – string replacements from site locale to English

  • tzinfo (datetime.tzinfo) – timezone to set if none was parsed

filter(txt)

This method has to be overridden by children classes.

class woob.browser.filters.standard.Decode(selector=None, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Filter that decodes URL-encoded strings.

>>> Decode(Env('_id'))  
<woob.browser.filters.standard.Decode object at 0x...>
>>> from .html import Link
>>> Decode(Link('./a'))  
<woob.browser.filters.standard.Decode object at 0x...>
Parameters

default – default value in case the filter fails to find or parse the requested value

filter(txt)

This method has to be overridden by children classes.
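
For reference, URL-decoding in the standard library looks like the following. This sketch assumes the filter behaves like urllib.parse.unquote; it does not show the filter's own code.

```python
from urllib.parse import unquote

encoded = 'report%202024%2Fq1'
# Percent-escapes are decoded as UTF-8 by default.
decoded = unquote(encoded)
```

Here decoded is 'report 2024/q1'.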

class woob.browser.filters.standard.Duration(selector=None, default=NO_DEFAULT)

Bases: woob.browser.filters.standard.Time

Parse a duration as timedelta.

Parameters

default – default value in case the filter fails to find or parse the requested value

klass

alias of datetime.timedelta

kwargs = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}
class woob.browser.filters.standard.Env(name, default=NO_DEFAULT)

Bases: woob.browser.filters.base._Filter

Filter to get environment value of the item.

It is used for example to get page parameters, or when there is a parse() method on ItemElement.

class woob.browser.filters.standard.Eval(func, *args)

Bases: woob.browser.filters.standard.MultiFilter

Evaluate a function with given ‘deferred’ arguments.

>>> F = Field; Eval(lambda a, b, c: a * b + c, F('foo'), F('bar'), F('baz')) 
>>> Eval(lambda x, y: x * y + 1).filter([3, 7])
22

Example:

obj_ratio = Eval(lambda x: x / 100, Env('percentage'))
Parameters

func – function to apply to all filters. The function should accept as many args as there are filters passed to Eval.

filter(values)

This method has to be overridden by children classes.

class woob.browser.filters.standard.Field(name)

Bases: woob.browser.filters.base._Filter

Get an attribute of the object.

Example:

obj_foo = CleanText('//h1')
obj_bar = Field('foo')

will make “bar” field equal to “foo” field.

class woob.browser.filters.standard.Filter(selector=None, default=NO_DEFAULT)

Bases: woob.browser.filters.base._Filter

Class used to filter an HTML element, given as the call parameter, and return matching elements.

Filters can be chained, so the parameter supplied to the constructor can be either an XPath selector string or another filter to be called first.

>>> from lxml.html import etree
>>> f = CleanDecimal(CleanText('//p'), replace_dots=True)
>>> f(etree.fromstring('<html><body><p>blah: <span>229,90</span></p></body></html>'))
Decimal('229.90')
Parameters

default – default value in case the filter fails to find or parse the requested value

filter(value)

This method has to be overridden by children classes.

select(selector, item)
exception woob.browser.filters.standard.FilterError

Bases: woob.exceptions.ParseError

class woob.browser.filters.standard.Format(fmt, *args)

Bases: woob.browser.filters.standard.MultiFilter

Combine multiple filters with string-format.

Example:

obj_title = Format('%s (%s)', CleanText('//h1'), CleanText('//h2'))

will concatenate the text from all <h1> and all <h2> (but put the latter between parentheses).

Parameters
  • fmt (str) – string format suitable for “%”-formatting

  • args – other filters to insert in fmt string. There should be as many args as there are “%” in fmt.

filter(values)

This method has to be overridden by children classes.
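
The fmt argument follows ordinary “%”-formatting. A minimal sketch of that step, where format_values is a hypothetical name and the list stands in for the values produced by the other filters:

```python
def format_values(fmt, values):
    """Apply %-formatting to an ordered list of already-extracted values."""
    return fmt % tuple(values)


title = format_values('%s (%s)', ['Main heading', 'Subheading'])
```

Here title is 'Main heading (Subheading)'. A mismatch between the number of placeholders and the number of values raises TypeError in plain Python.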

exception woob.browser.filters.standard.FormatError

Bases: woob.browser.filters.base.FilterError

class woob.browser.filters.standard.FromTimestamp(selector, millis=False, tz=None, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Parse a timestamp into a datetime.

Parameters

default – default value in case the filter fails to find or parse the requested value

filter(txt)

This method has to be overridden by children classes.
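
The millis and tz parameters can be illustrated with the standard library. In this sketch from_timestamp is a hypothetical function, and it assumes millis=True means the input is in milliseconds since the epoch.

```python
import datetime


def from_timestamp(value, millis=False, tz=None):
    """Convert a Unix timestamp (seconds, or milliseconds) to a datetime."""
    seconds = float(value)
    if millis:
        seconds /= 1000
    return datetime.datetime.fromtimestamp(seconds, tz=tz)


utc = datetime.timezone.utc
from_seconds = from_timestamp('1700000000', tz=utc)
from_millis = from_timestamp(1700000000000, millis=True, tz=utc)
```

Both calls give 2023-11-14 22:13:20 UTC. With tz=None, the standard library returns a naive datetime in local time.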

class woob.browser.filters.standard.Join(pattern, selector=None, textCleaner=<class 'woob.browser.filters.standard.CleanText'>, newline=False, addBefore='', addAfter='', default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Join multiple results from a selector.

>>> Join(' - ', '//div/p')

>>> Join(pattern=', ').filter([u"Oui", u"bonjour", ""]) == u"Oui, bonjour"
True
>>> Join(pattern='-').filter([u"Au", u"revoir", ""]) == u"Au-revoir"
True
>>> Join(pattern='-').filter([]) == u""
True
>>> Join(pattern='-', default=u'empty').filter([]) == u'empty'
True
Parameters

default – default value in case the filter fails to find or parse the requested value

filter(el)

This method has to be overridden by children classes.

class woob.browser.filters.standard.Lower(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)

Bases: woob.browser.filters.standard.CleanText

Extract text with CleanText and convert to lower-case.

Parameters
  • symbols (list) – list of strings to remove from text

  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform

  • children (bool) – whether to get text from child elements of the selected elements

  • newlines (bool) – if True, newlines will be converted to space too

  • normalize (str or None) – Unicode normalization to perform

  • transliterate (bool) – Transliterates unicode characters into ASCII characters

filter(txt)

This method has to be overridden by children classes.

class woob.browser.filters.standard.Map(selector, map_dict, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Map selected value to another value using a dict.

Example:

TYPES = {
    'Concert': CATEGORIES.CONCERT,
    'Cinéma': CATEGORIES.CINE,
}

obj_type = Map(CleanText('./li'), TYPES)
Parameters

selector – key from map_dict to use

filter(txt)
Raises

ItemNotFound if key does not exist in dict

class woob.browser.filters.standard.MapIn(selector, map_dict, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Map the pattern of a selected value to another value using a dict.

Parameters

selector – key from map_dict to use

filter(txt)
Raises

ItemNotFound if key pattern does not exist in dict
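
The difference between Map and MapIn can be sketched with plain dictionaries. This assumes that "pattern" in MapIn means a substring match, which the text above does not spell out; the function names are hypothetical, and LookupError stands in for ItemNotFound.

```python
def map_exact(value, mapping):
    """Map-like lookup: the value must be a key of the dict."""
    if value not in mapping:
        raise LookupError(value)
    return mapping[value]


def map_in(value, mapping):
    """MapIn-like lookup (assumed): the first key contained in the value wins."""
    for pattern, result in mapping.items():
        if pattern in value:
            return result
    raise LookupError(value)


TYPES = {'Concert': 'CONCERT', 'Cinéma': 'CINE'}

exact = map_exact('Concert', TYPES)
contained = map_in('Concert en plein air', TYPES)
```

Here exact and contained are both 'CONCERT', while map_exact('Concert en plein air', TYPES) would raise.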

class woob.browser.filters.standard.MultiFilter(*args, **kwargs)

Bases: woob.browser.filters.base.Filter

Parameters

default – default value in case the filter fails to find or parse the requested value

filter(values)

This method has to be overridden by children classes.

class woob.browser.filters.standard.MultiJoin(*args, **kwargs)

Bases: woob.browser.filters.standard.MultiFilter

Join multiple filters.

>>> MultiJoin(Field('field1'), Field('field2'))

>>> MultiJoin(pattern=u', ').filter([u"Oui", u"bonjour", ""]) == u"Oui, bonjour"
True
>>> MultiJoin(pattern=u'-').filter([u"Au", u"revoir", ""]) == u"Au-revoir"
True
>>> MultiJoin(pattern=u'-').filter([]) == u""
True
>>> MultiJoin(pattern=u'-', default=u'empty').filter([]) == u'empty'
True
>>> MultiJoin(pattern=u'-').filter([1, 2, 3]) == u'1-2-3'
True
Parameters

default – default value in case the filter fails to find or parse the requested value

filter(values)

This method has to be overridden by children classes.

exception woob.browser.filters.standard.NumberFormatError

Bases: woob.browser.filters.standard.FormatError, decimal.InvalidOperation

class woob.browser.filters.standard.QueryValue(selector, key, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Extract the value of a parameter from a URL with a query string.

>>> from lxml.html import etree
>>> from .html import Link
>>> f = QueryValue(Link('//a'), 'id')
>>> f(etree.fromstring('<html><body><a href="http://example.org/view?id=1234"></a></body></html>')) == u'1234'
True
Parameters

default – default value in case the filter fails to find or parse the requested value

filter(url)

This method has to be overridden by children classes.
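
The extraction itself can be reproduced with urllib.parse. A minimal sketch, with query_value as a hypothetical name; returning the first value when a key is repeated is an assumption.

```python
from urllib.parse import parse_qs, urlparse


def query_value(url, key):
    """Return the first value of `key` in the query string of `url`."""
    params = parse_qs(urlparse(url).query)
    return params[key][0]


item_id = query_value('http://example.org/view?id=1234', 'id')
```

Here item_id is '1234'. A missing key raises KeyError in this sketch.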

class woob.browser.filters.standard.RawText(selector=None, children=False, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Get raw text from an element.

Unlike CleanText, whitespace is kept as is.

Parameters

children (bool) – whether to get text from child elements of the selected elements

filter(el)

This method has to be overridden by children classes.

class woob.browser.filters.standard.Regexp(selector=None, pattern=None, template=None, nth=0, flags=0, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Apply a regex.

>>> from lxml.html import etree
>>> doc = etree.fromstring('<html><body><p>Date: <span>13/08/1988</span></p></body></html>')
>>> Regexp(CleanText('//p'), r'Date: (\d+)/(\d+)/(\d+)', '\\3-\\2-\\1')(doc) == u'1988-08-13'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=1))(doc) == u'08'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=-1))(doc) == u'1988'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', template='[\\1]', nth='*'))(doc) == [u'[13]', u'[08]', u'[1988]']
True
>>> (Regexp(CleanText('//body'), r'Date:.*'))(doc) == u'Date: 13/08/1988'
True
>>> (Regexp(CleanText('//body'), r'^(?!Date:).*', default=None))(doc)
>>>
Parameters

default – default value in case the filter fails to find or parse the requested value

expand(m)
filter(txt)
Raises

RegexpError if pattern was not found
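
The template and nth arguments in the doctests above correspond to familiar operations in the standard re module. The sketch below reproduces the same results on plain strings; it illustrates the semantics and is not the filter's implementation.

```python
import re

text = 'Date: 13/08/1988'

# template: back-references reorder the captured groups.
match = re.search(r'Date: (\d+)/(\d+)/(\d+)', text)
reordered = match.expand(r'\3-\2-\1')

# nth: pick one match among all matches of the pattern.
matches = list(re.finditer(r'(\d+)', text))
second = matches[1].group(1)    # like nth=1
last = matches[-1].group(1)     # like nth=-1

# nth='*' with a template: expand every match.
wrapped = [m.expand(r'[\1]') for m in matches]
```

Here reordered is '1988-08-13', second is '08', last is '1988', and wrapped is ['[13]', '[08]', '[1988]'].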

exception woob.browser.filters.standard.RegexpError

Bases: woob.browser.filters.base.FilterError

class woob.browser.filters.standard.Slugify(selector=None, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Parameters

default – default value in case the filter fails to find or parse the requested value

filter(label)

This method has to be overridden by children classes.

class woob.browser.filters.standard.Time(selector=None, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Parse time.

Parameters

default – default value in case the filter fails to find or parse the requested value

filter(txt)

This method has to be overridden by children classes.

klass

alias of datetime.time

kwargs = {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}
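
The klass and kwargs attributes of Time and Duration suggest a shared mechanism: named regex groups (hh, mm, ss) are mapped onto the constructor arguments of klass. The following sketch shows that idea under stated assumptions; PATTERN and build are hypothetical, and the pattern the library actually uses may differ.

```python
import datetime
import re

# Hypothetical pattern exposing the group names used in kwargs.
PATTERN = re.compile(r'(?P<hh>\d+):(?P<mm>\d+)(?::(?P<ss>\d+))?')


def build(klass, kwargs, text):
    """Instantiate klass from the named groups listed in kwargs."""
    match = PATTERN.search(text)
    # A group that did not match (e.g. missing seconds) counts as 0.
    values = {name: int(match.group(group) or 0)
              for name, group in kwargs.items()}
    return klass(**values)


as_time = build(datetime.time,
                {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}, '18:30')
as_duration = build(datetime.timedelta,
                    {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}, '1:05:30')
```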
class woob.browser.filters.standard.Title(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)

Bases: woob.browser.filters.standard.CleanText

Extract text with CleanText and apply title() to it.

Parameters
  • symbols (list) – list of strings to remove from text

  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform

  • children (bool) – whether to get text from child elements of the selected elements

  • newlines (bool) – if True, newlines will be converted to space too

  • normalize (str or None) – Unicode normalization to perform

  • transliterate (bool) – Transliterates unicode characters into ASCII characters

filter(txt)

This method has to be overridden by children classes.

class woob.browser.filters.standard.Type(selector=None, type=None, minlen=0, default=NO_DEFAULT)

Bases: woob.browser.filters.base.Filter

Get a cleaned value of any type from an element's text. The type argument can be any callable (class, function, etc.). By default, an empty string is not parsed; this can be changed by specifying minlen=False. Otherwise, a minimum length can be specified.

>>> Type(CleanText('./td[1]'), type=int)  
>>> Type(type=int).filter(42)
42
>>> Type(type=int).filter('42')
42
>>> Type(type=int, default='NaN').filter('')
'NaN'
>>> Type(type=list, minlen=False, default=list('ab')).filter('')
[]
>>> Type(type=list, minlen=0, default=list('ab')).filter('')
['a', 'b']
Parameters

default – default value in case the filter fails to find or parse the requested value

filter(txt)

This method has to be overridden by children classes.
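
The interaction between minlen and default shown in the doctests can be modelled in a few lines. The convert function below is hypothetical; the length comparison is inferred from the doctests (with minlen=0 an empty string yields the default) and is not taken from the library's source.

```python
def convert(value, type_func, minlen=0, default=None):
    """Apply type_func unless the string is too short (assumed rule)."""
    too_short = (
        isinstance(value, str)
        and minlen is not False
        and len(value) <= minlen
    )
    if too_short:
        return default
    return type_func(value)


parsed = convert('42', int)
skipped = convert('', int, default='NaN')
forced = convert('', list, minlen=False, default=list('ab'))
```

Here parsed is 42, skipped is 'NaN', and forced is [] because minlen=False lets the empty string through to list().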

class woob.browser.filters.standard.Upper(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)

Bases: woob.browser.filters.standard.CleanText

Extract text with CleanText and convert to upper-case.

Parameters
  • symbols (list) – list of strings to remove from text

  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform

  • children (bool) – whether to get text from child elements of the selected elements

  • newlines (bool) – if True, newlines will be converted to space too

  • normalize (str or None) – Unicode normalization to perform

  • transliterate (bool) – Transliterates unicode characters into ASCII characters

filter(txt)

This method has to be overridden by children classes.