urlup

Dereference HTTP addresses to determine ultimate destinations


Keywords
http, redirects, URL, redirection
License
BSD-3-Clause
Install
pip install urlup==1.5.1


Urlup is a utility program and Python 3 package to dereference URLs and determine their final destinations after following redirections. Urlup can be pronounced "urrrl-up".

Authors: Michael Hucka
Repository: https://github.com/caltechlibrary/urlup
License: BSD 3-clause license – see the LICENSE file for more information


Introduction

Sometimes we have a list of URLs and we need to find out the ultimate destinations after any redirections have taken place. Urlup is a simple program to dereference a list of URLs for that purpose. It handles EZproxy proxied URLs (a common type of proxy used by institutional libraries). It provides diagnostics and HTTP status codes if desired. It can be used from the command line, and it also provides a Python 3 module that can be called programmatically.

✺ Installation instructions

The instructions below assume you have a Python interpreter installed on your computer; if that's not the case, please first install Python version 3 and familiarize yourself with running Python programs on your system.

On Linux, macOS, and Windows operating systems, you should be able to install urlup with pip. To install urlup from the Python package repository (PyPI), run the following command:

python3 -m pip install urlup

As an alternative to getting it from PyPI, you can use pip to install urlup directly from GitHub, like this:

python3 -m pip install git+https://github.com/caltechlibrary/urlup.git

▶︎ Basic operation

Urlup provides a command-line utility as well as a library.

Command-line use

The command-line utility is called urlup and can be used from a terminal shell. It prints help text when given the -h option (/h on Windows). For a quick check of a few URLs, provide them directly on the command line:

# urlup http://sbml.org
No output file given -- results won't be saved.
http://sbml.org ==> http://sbml.org/Main_Page [301]
Done.
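
To make the two pieces of information in that output line concrete (the status code of the first response, and the final URL after redirection), here is a minimal sketch that uses only the Python standard library. It is not Urlup's implementation; Urlup builds on the requests package. The local test server and the names DemoHandler, first_status(), final_url(), and demo() are all hypothetical and exist only for this illustration.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server for the demo: /old redirects permanently to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(301)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def first_status(url):
    """Status code of the first response, without following redirects.

    Simplification: plain http only, and any query string is ignored.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=5)
    try:
        conn.request("GET", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()


def final_url(url):
    """URL reached after the standard library follows any redirects."""
    # ProxyHandler({}) ignores proxies configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.geturl()


def demo():
    """Start the local server, dereference /old, and return what was found."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base = "http://127.0.0.1:%d" % server.server_port
        start = base + "/old"
        return start, final_url(start), first_status(start), base
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the starting URL, the final URL, and the first status code, which are the same three items Urlup prints on each output line.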

Urlup prints one line for each URL given, showing the original and final URLs along with the HTTP status code received when the original URL is first contacted. If given the -e option (/e on Windows), it also prints more details about the meaning of the HTTP status received. For example:

# urlup -e caltech.edu www.caltech.edu
No output file specified; results won't be saved.
caltech.edu ==> http://www.caltech.edu/
   [status code 302 = This response code means that the URI of
   requested resource has been changed temporarily. New changes in the
   URI might be made in the future. Therefore, this same URI should be
   used by the client in future requests.]
www.caltech.edu ==> http://www.caltech.edu
   [status code 200 = The request has succeeded.]
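
Python's standard library carries a short phrase and description for each registered status code, which is enough to sketch the kind of annotation the -e option prints. The wording will differ from Urlup's own explanations, and explain_status() is a hypothetical helper written for this illustration, not part of Urlup.

```python
from http import HTTPStatus


def explain_status(code):
    """Return a one-line explanation of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one the standard library knows about.
        return "status code %d = (unrecognized status code)" % code
    # Some statuses have an empty description, so join only non-empty parts.
    details = ": ".join(part for part in (status.phrase, status.description) if part)
    return "status code %d = %s" % (code, details)


if __name__ == "__main__":
    for code in (200, 301, 302, 404):
        print(explain_status(code))
```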

The typical usage for urlup is to provide it with a list of URLs in a file (one per line) with the -i option (/i on Windows), and to tell it to write the results to a CSV file with the -o option (/o on Windows):

# urlup  -i original_urls.txt  -o final_urls.csv
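
The same file-in, CSV-out workflow can be approximated from Python. The sketch below is an illustration under assumptions: the column layout of the CSV file that Urlup itself writes is not described in this document, so the header row here is invented, and read_urls(), write_csv(), and dereference_file() are hypothetical helpers. The lookup parameter stands in for a function such as updated_urls() that returns results with the fields original, final, status, and error.

```python
import csv
from collections import namedtuple

# Stand-in for the result tuples described in the Module API section.
Result = namedtuple("Result", "original final status error")


def read_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_csv(results, path):
    """Write results to a CSV file. The column layout is an assumption."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["original", "final", "status", "error"])
        for r in results:
            writer.writerow([r.original, r.final, r.status, r.error or ""])


def dereference_file(in_path, out_path, lookup):
    """Dereference every URL listed in in_path and save the results."""
    urls = read_urls(in_path)
    write_csv(lookup(urls), out_path)
    return len(urls)
```

With Urlup installed, passing updated_urls as the lookup argument should give roughly the behavior of the command above.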


Proxy handling

If the URLs to be dereferenced involve a proxy server (such as EZproxy, a common type of proxy used by libraries), Urlup needs login credentials for the proxy server. The first time it needs them, Urlup prompts you for the credentials and stores them using your operating system's keyring/keychain functionality, so that it does not have to prompt again in the future. You can disable the use of the keyring/keychain by running Urlup with the -X option (/X on Windows). It is also possible to supply the credentials directly on the command line using the -u and -p options (/u and /p on Windows), but this is discouraged because it is insecure on multiuser computer systems.
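
The prompt-once-then-reuse pattern described above can be sketched as follows. This is an assumption about the general approach, not Urlup's actual code. The keyring package exposes get_password(service, username) and set_password(service, username, password), so the keyring module itself can be passed as the store argument. MemoryStore, get_proxy_password(), and the service name used in the example are hypothetical.

```python
class MemoryStore:
    """In-memory stand-in offering the same two calls as the keyring module."""

    def __init__(self):
        self._data = {}

    def get_password(self, service, username):
        return self._data.get((service, username))

    def set_password(self, service, username, password):
        self._data[(service, username)] = password


def get_proxy_password(store, service, username, ask):
    """Return the stored password, prompting via ask() only when none is stored."""
    password = store.get_password(service, username)
    if password is None:
        password = ask()
        store.set_password(service, username, password)
    return password


if __name__ == "__main__":
    import getpass

    store = MemoryStore()  # with keyring installed: import keyring; store = keyring
    secret = get_proxy_password(store, "example-proxy", "someuser", getpass.getpass)
    print("Password obtained (%d characters)" % len(secret))
```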

Module API

Urlup provides a single function, updated_urls(), that can be called from other Python programs to dereference one or more URLs. If given a single URL, it returns a single result; if given a list, it returns a list of results. Each result is in the form of a named tuple called UrlData. The tuple has four fields:

  • original: the given URL
  • final: the URL after dereferencing and following redirections
  • status: the integer HTTP status code obtained on the original URL
  • error: the error (if any) encountered while trying to dereference the URL

Here is a simple example of using updated_urls() in Python:

from urlup import updated_urls

for result in updated_urls(['http://caltech.edu', 'http://notarealurl.nowhere']):
    print('Original URL: ' + result.original)
    if result.error:
        # str() keeps this working even if the error is not a plain string
        print('Error: ' + str(result.error))
    else:
        print('Final URL: ' + result.final)
        print('Status code: ' + str(result.status))
    print('')

The code above will print the following output when run:

Original URL: http://caltech.edu
Final URL: http://www.caltech.edu/
Status code: 302

Original URL: http://notarealurl.nowhere
Error: Cannot resolve host name
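
When processing many URLs, it can help to sort the results into groups instead of printing each one. The sketch below assumes results shaped like the UrlData tuples described above; the UrlData defined here is a local stand-in so that the example runs without Urlup installed, and summarize() is a hypothetical helper.

```python
from collections import namedtuple

# Local stand-in with the four documented fields.
UrlData = namedtuple("UrlData", "original final status error")


def summarize(results):
    """Group results into failed, moved, and unchanged URLs."""
    summary = {"failed": [], "moved": [], "unchanged": []}
    for r in results:
        if r.error:
            summary["failed"].append(r.original)
        elif r.final != r.original:
            summary["moved"].append((r.original, r.final))
        else:
            summary["unchanged"].append(r.original)
    return summary


if __name__ == "__main__":
    sample = [
        UrlData("http://caltech.edu", "http://www.caltech.edu/", 302, None),
        UrlData("http://notarealurl.nowhere", None, None, "Cannot resolve host name"),
    ]
    print(summarize(sample))
```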

The function updated_urls takes the following arguments:

  • urls: a single URL or a list of URLs
  • cookies: a dictionary of key-value pairs representing cookies to be set when making network connections
  • headers: a dictionary of headers to add to every URL lookup
  • proxy_user: a user login for a proxy, if a proxy will be encountered
  • proxy_pswd: a password for the proxy, if a proxy will be encountered
  • use_keyring: whether the system keyring/keychain should be used (default is False)
  • quiet: whether to suppress messages while working (default is True, meaning don't print a lot of messages)
  • explain: whether to explain HTTP status codes encountered (default is False, meaning, don't print explanations)
  • colorize: whether to color-code any messages printed (default is False)
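
As an illustration of passing these arguments by keyword, here is a small wrapper. It assumes that updated_urls() accepts the argument names listed above as keyword arguments, which the list suggests but does not spell out. The wrapper lookup_quietly() and its lookup and extra_headers parameters are hypothetical; lookup exists only so that the sketch can be tried without Urlup installed or a network connection.

```python
def lookup_quietly(urls, lookup=None, extra_headers=None):
    """Dereference urls without printing messages or touching the keyring."""
    if lookup is None:
        # Imported lazily so the sketch can be exercised without Urlup installed.
        from urlup import updated_urls as lookup
    return lookup(
        urls,
        headers=extra_headers or {},  # added to every URL lookup
        use_keyring=False,            # do not consult the system keyring
        quiet=True,                   # keep messages to a minimum
        explain=False,                # no status code explanations
    )
```

A call such as lookup_quietly(['http://caltech.edu'], extra_headers={'User-Agent': 'my-script'}) would then return the results for further processing.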

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

★ Do you like it?

If you like this software, don't forget to give this repo a star on GitHub to show your support!

☺︎ Acknowledgments

The vector artwork used as a logo for Urlup was created by Eynav Raphael and obtained from The Noun Project. It is licensed under the Creative Commons CC-BY 3.0 license.

Urlup makes use of numerous open-source packages, without which it would have been effectively impossible to develop Urlup with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:

  • colorama – makes ANSI escape character sequences work under MS Windows terminals
  • ipdb – the IPython debugger
  • keyring – access the system keyring service from Python
  • plac – a command line argument parser
  • PyInstaller – a packaging program that creates standalone applications from Python programs for Windows, macOS, Linux and other platforms
  • requests – an HTTP library for Python
  • setuptools – library for setup.py
  • termcolor – ANSI color formatting for output in terminal
  • uritools – RFC 3986 compliant, Unicode-aware, scheme-agnostic replacement for urlparse
  • validators – Python data validators for humans

☮︎ Copyright and license

Copyright (C) 2018-2021, Caltech. This software is freely distributed under a BSD 3-clause license. Please see the LICENSE file for more information.