urlup
Urlup is a utility program and Python 3 package to dereference URLs and determine their final destinations after following redirections. Urlup can be pronounced "urrrl-up".
Authors: Michael Hucka
Repository: https://github.com/caltechlibrary/urlup
License: BSD 3-clause license – see the LICENSE file for more information
☀ Introduction
Sometimes we have a list of URLs and we need to find out the ultimate destinations after any redirections have taken place. Urlup is a simple program to dereference a list of URLs for that purpose. It handles EZproxy proxied URLs (a common type of proxy used by institutional libraries). It provides diagnostics and HTTP status codes if desired. It can be used from the command line, and it also provides a Python 3 module that can be called programmatically.
✺ Installation instructions
The instructions below assume you have a Python interpreter installed on your computer; if that's not the case, please first install Python version 3 and familiarize yourself with running Python programs on your system.
On Linux, macOS, and Windows operating systems, you should be able to install urlup with pip. To install urlup from the Python package repository (PyPI), run the following command:
python3 -m pip install urlup
As an alternative to getting it from PyPI, you can use pip to install urlup directly from GitHub, like this:
python3 -m pip install git+https://github.com/caltechlibrary/urlup.git
▶︎ Basic operation
Urlup provides a command-line utility as well as a library.
Command-line use
The command-line utility is called urlup and can be used from a terminal shell. It prints help text when given the -h option (/h on Windows). For a quick check of a few URLs, you can provide the URLs directly on the command line:
# urlup http://sbml.org
No output file given -- results won't be saved.
http://sbml.org ==> http://sbml.org/Main_Page [301]
Done.
The output produced by urlup consists of one line for each URL given, showing the original and final URLs along with the HTTP status code received when the original URL is first contacted. If given the -e option (/e on Windows), it will also print more details about the meaning of the HTTP status received. For example:
# urlup -e caltech.edu www.caltech.edu
No output file specified; results won't be saved.
caltech.edu ==> http://www.caltech.edu/
[status code 302 = This response code means that the URI of
requested resource has been changed temporarily. New changes in the
URI might be made in the future. Therefore, this same URI should be
used by the client in future requests.]
www.caltech.edu ==> http://www.caltech.edu
[status code 200 = The request has succeeded.]
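The explanations printed with -e are Urlup's own wording. As a rough illustration of the idea only, and not Urlup's implementation, the sketch below uses the status table in Python's standard http module to attach a short phrase to a status code. The function name explain_status is hypothetical.

```python
from http import HTTPStatus


def explain_status(code):
    """Return a bracketed, human-readable note for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes missing from the standard library's table end up here.
        return f'[status code {code} = unknown status code]'
    return f'[status code {code} = {status.phrase}]'


for code in (200, 301, 302):
    print(explain_status(code))
```

The standard library phrases are much shorter than the explanations shown above, so this only approximates what the -e option prints.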
The typical usage for urlup is to provide it with a list of URLs in a file (one per line) with the -i option (/i on Windows), and to tell it to write the results to a CSV file with the -o option (/o on Windows):
# urlup -i original_urls.txt -o final_urls.csv
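If the list of URLs comes from another program, the input file can be prepared and the command assembled from Python. The following is a minimal sketch under stated assumptions: the helper names write_url_list, urlup_command, and run_urlup are hypothetical, the -i and -o spellings are the non-Windows forms described above, and actually running the command requires urlup to be installed and on the search path.

```python
import subprocess
from pathlib import Path


def write_url_list(urls, path):
    """Write URLs to a text file, one per line, skipping blanks and repeats."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    Path(path).write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return len(lines)


def urlup_command(input_path, output_path):
    """Build the argument list for the invocation shown above."""
    return ['urlup', '-i', str(input_path), '-o', str(output_path)]


def run_urlup(input_path, output_path):
    """Run urlup as a subprocess; needs urlup installed and network access."""
    return subprocess.run(urlup_command(input_path, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.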
Proxy handling
If the URLs to be dereferenced involve a proxy server (such as EZproxy, a common type of proxy used by libraries), Urlup will need login credentials for the proxy server. The first time Urlup needs proxy credentials, it prompts you for them and then stores them using your operating system's keyring/keychain functionality, so that it does not have to prompt again in the future. You can disable the use of the keyring/keychain by running Urlup with the -X (or /X on Windows) command-line flag. It is also possible to supply the information directly on the command line using the -u and -p options (or /u and /p on Windows), but this is discouraged because it is insecure on multiuser computer systems.
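The "prompt once, then reuse" behavior described above can be pictured with the keyring package, which Urlup lists among its dependencies. This is only a sketch of the general pattern and not Urlup's actual code: the function name proxy_password, the service name 'urlup-proxy-example', and the store and ask parameters are all hypothetical.

```python
import getpass


def proxy_password(user, service='urlup-proxy-example', store=None,
                   ask=getpass.getpass):
    """Return a proxy password, prompting only if none is stored yet."""
    if store is None:
        import keyring  # third-party package, imported only when needed
        store = keyring
    password = store.get_password(service, user)
    if password is None:
        # Nothing stored yet: ask without echoing, then remember the answer.
        password = ask(f'Proxy password for {user}: ')
        store.set_password(service, user, password)
    return password
```

Prompting with getpass keeps the password out of the shell history and process list, which is the concern behind discouraging the -p option.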
Module API
Urlup provides a single function, updated_urls(), that can be called from other Python programs to dereference one or more URLs. If given a single URL, it returns a single result; if given a list, it returns a list of results. Each result is in the form of a named tuple called UrlData. The tuple has 4 fields:
- original: the given URL
- final: the URL after dereferencing and following redirections
- status: the integer HTTP status code obtained on the original URL
- error: the error (if any) encountered while trying to dereference the URL
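To make the distinction between the status of the original URL and the final URL concrete, the sketch below fills a tuple with the same four fields from a requests-style response object. It is an illustration under assumptions, not Urlup's implementation: UrlData here is a local stand-in, url_data_from_response is a hypothetical helper, and the responses are simulated so the code runs without network access.

```python
from collections import namedtuple
from types import SimpleNamespace

# Stand-in with the four documented fields; the real class comes from urlup.
UrlData = namedtuple('UrlData', ['original', 'final', 'status', 'error'])


def url_data_from_response(original, response):
    """Build a UrlData from a requests-style response object."""
    # With requests, response.history holds the intermediate redirect
    # responses in order, so its first entry (if any) is the reply
    # to the original URL.
    first = response.history[0] if response.history else response
    return UrlData(original=original, final=response.url,
                   status=first.status_code, error=None)


# Simulated responses: one redirect followed by the landing page.
redirect = SimpleNamespace(url='http://caltech.edu', status_code=302, history=[])
landing = SimpleNamespace(url='http://www.caltech.edu/', status_code=200,
                          history=[redirect])
print(url_data_from_response('http://caltech.edu', landing))
```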
Here is a simple example of using updated_urls() in Python:
from urlup import updated_urls

for result in updated_urls(['http://caltech.edu', 'http://notarealurl.nowhere']):
    print('Original URL: ' + result.original)
    if result.error:
        print('Error: ' + result.error)
    else:
        print('Final URL: ' + result.final)
        print('Status code: ' + str(result.status))
    print('')
The code above will print the following output when run:
Original URL: http://caltech.edu
Final URL: http://www.caltech.edu/
Status code: 302
Original URL: http://notarealurl.nowhere
Error: Cannot resolve host name
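Results like these can also be saved from Python instead of printed. The sketch below writes them with the standard csv module. The column layout is an assumption made for illustration and may not match the file produced by the -o option; UrlData is a local stand-in for the tuple returned by updated_urls(), and write_results_csv is a hypothetical helper.

```python
import csv
import io
from collections import namedtuple

# Local stand-in for the named tuple described above.
UrlData = namedtuple('UrlData', ['original', 'final', 'status', 'error'])


def write_results_csv(results, stream):
    """Write one CSV row per result; the column layout is an assumption."""
    writer = csv.writer(stream)
    writer.writerow(['original', 'final', 'status', 'error'])
    for r in results:
        # Missing values (None) are written as empty cells.
        writer.writerow([r.original, r.final or '', r.status or '', r.error or ''])


sample = [
    UrlData('http://caltech.edu', 'http://www.caltech.edu/', 302, None),
    UrlData('http://notarealurl.nowhere', None, None, 'Cannot resolve host name'),
]
buffer = io.StringIO()
write_results_csv(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file rather than a StringIO buffer, open it with newline='' as the csv module documentation recommends.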
The function updated_urls() takes the following arguments:
- urls: a single URL or a list of URLs
- cookies: a dictionary of key-value pairs representing cookies to be set when making network connections
- headers: a dictionary of headers to add to every URL lookup
- proxy_user: a user login for a proxy, if a proxy will be encountered
- proxy_pswd: a password for the proxy, if a proxy will be encountered
- use_keyring: whether the system keyring/keychain should be used (default is False)
- quiet: whether to suppress messages while working (default is True, meaning, don't print a lot of messages)
- explain: whether to explain HTTP status codes encountered (default is False, meaning, don't print explanations)
- colorize: whether to color-code any messages printed (default is False)
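As a sketch of how several of these arguments might be combined, the wrapper below passes a custom header and turns off keyring access and status explanations. It assumes the arguments can be given by keyword under the names listed above; the wrapper name dereference_quietly, its lookup parameter, and the User-Agent value are hypothetical.

```python
def dereference_quietly(urls, lookup=None):
    """Call updated_urls() with a custom header and no keyring access."""
    if lookup is None:
        # Imported lazily so the sketch can be read without urlup installed.
        from urlup import updated_urls as lookup
    return lookup(
        urls,
        headers={'User-Agent': 'example-link-checker'},  # hypothetical value
        use_keyring=False,
        quiet=True,
        explain=False,
    )
```

A call such as dereference_quietly(['http://caltech.edu']) would then return the same kind of results as the earlier example, provided urlup is installed and the network is reachable.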
⁇ Getting help and support
If you find an issue, please submit it in the GitHub issue tracker for this repository.
★ Do you like it?
If you like this software, don't forget to give this repo a star on GitHub to show your support!
☺︎ Acknowledgments
The vector artwork used as a logo for Urlup was created by Eynav Raphael and obtained from The Noun Project. It is licensed under the Creative Commons CC-BY 3.0 license.
Urlup makes use of numerous open-source packages, without which it would have been effectively impossible to develop Urlup with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:
- colorama – makes ANSI escape character sequences work under MS Windows terminals
- ipdb – the IPython debugger
- keyring – access the system keyring service from Python
- plac – a command line argument parser
- PyInstaller – a packaging program that creates standalone applications from Python programs for Windows, macOS, Linux and other platforms
- requests – an HTTP library for Python
- setuptools – library for setup.py
- termcolor – ANSI color formatting for output in terminal
- uritools – RFC 3986 compliant, Unicode-aware, scheme-agnostic replacement for urlparse
- validators – Python data validators for humans
☮︎ Copyright and license
Copyright (C) 2018-2021, Caltech. This software is freely distributed under a BSD 3-clause license. Please see the LICENSE file for more information.