urltools

Some functions to parse and normalize URLs.

NOTE and credit

This is based on the original work that used to live at github.com/rbaier/urltools.git, which is no longer available.

Functions

Normalize

>>> urltools.normalize("Http://exAMPLE.com./foo")
'http://example.com/foo'

Rules that are applied to normalize a URL (see the combined example after this list):

  • lowercase the scheme
  • lowercase the host (also works with IDNs)
  • remove the default port
  • remove ':' without a port
  • remove the DNS root label (trailing dot)
  • unquote the path, query, and fragment
  • collapse the path (remove '//', '/./', '/../')
  • sort query parameters and remove parameters without a value
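
Several of these rules can apply at once. A sketch of what to expect, based on the rules listed above:

>>> urltools.normalize("HTTP://www.Example.com:80/a/b/../c?y=2&x=1&z=")
'http://www.example.com/a/c?x=1&y=2'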

Parse

The result of parse and extract is a ParseResult named tuple containing scheme, username, password, subdomain, domain, tld, port, path, query, and fragment.

>>> urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
ParseResult(scheme='http', username='', password='', subdomain='', domain='example', tld='co.uk', port='', path='/foo/bar', query='x=1', fragment='abc')
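
Since ParseResult is a named tuple, the individual parts can be accessed by name:

>>> result = urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
>>> result.domain, result.tld
('example', 'co.uk')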

If the scheme is missing, parse interprets the URL as relative.

>>> urltools.parse("www.example.co.uk/abc")
ParseResult(scheme='', username='', password='', subdomain='', domain='', tld='', port='', path='www.example.co.uk/abc', query='', fragment='')

Extract

extract, by contrast, does not treat scheme-less URLs as relative and always tries to extract as much information as possible.

>>> urltools.extract("www.example.co.uk/abc")
ParseResult(scheme='', username='', password='', subdomain='www', domain='example', tld='co.uk', port='', path='/abc', query='', fragment='')
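
This makes extract handy for getting the registered domain of arbitrary URLs. A minimal helper built on top of it (registered_domain is a hypothetical example, not part of urltools):

>>> def registered_domain(url):
...     result = urltools.extract(url)
...     return ".".join(part for part in (result.domain, result.tld) if part)
...
>>> registered_domain("www.example.co.uk/abc")
'example.co.uk'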

Additional functions

Besides the main functions described above, urltools has some more functions for manipulating individual segments of a URL.

  • encode (IDNA, see RFC 3490)

      >>> urltools.encode("http://müller.de")
      'http://xn--mller-kva.de/'
    
  • assemble a new URL from a ParseResult (see the round-trip sketch after this list)

  • normalize_host

  • normalize_port

  • normalize_path

      >>> normalize_path("/a/b/../../c")
      '/c'
    
  • normalize_query

      >>> normalize_query("x=1&y=&z=3")
      'x=1&z=3'
    
  • normalize_fragment

  • unquote

  • split (basically the same as urlparse.urlparse)

      >>> split("http://www.example.com/abc?x=1&y=2#foo")
      SplitResult(scheme='http', netloc='www.example.com', path='/abc', query='x=1&y=2', fragment='foo')
    
  • split_netloc

      >>> split_netloc("foo:bar@www.example.com:8080")
      ('foo', 'bar', 'www.example.com', '8080')
    
  • split_host

      >>> split_host("www.example.ac.at")
      ('www', 'example', 'ac.at')
    

Installation

Note: installing from PyPI via pip is not working yet. Once the package is published, you can install urltools from the Python Package Index (PyPI):

pip install urltools

... or get the newest version directly from GitHub:

pip install -e git+https://github.com/itzik-h/urltools.git#egg=urltools

Public Suffix List

urltools uses the Public Suffix List to split domain names correctly. E.g. the TLD of example.co.uk would be .co.uk and not .uk.

I recommend using a local copy of this list; otherwise it will be downloaded on every import of urltools.

export PUBLIC_SUFFIX_LIST="/path/to/effective_tld_names.dat"
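
The same can be done from within Python, as long as the variable is set before urltools is imported for the first time. A minimal sketch (the path is a placeholder):

import os
os.environ["PUBLIC_SUFFIX_LIST"] = "/path/to/effective_tld_names.dat"

import urltools  # picks up the local copy instead of downloading the list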

For more information see http://publicsuffix.org/

Tests

To run the tests I use pytest:

py.test -vrxs
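
A minimal test in that style, based on the normalize example above (the file name and test are illustrative, not part of the existing suite):

# test_normalize.py
import urltools

def test_normalize():
    assert urltools.normalize("Http://exAMPLE.com./foo") == "http://example.com/foo"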