ekrhizoc
ekrhizoc (E6c): A web crawler
Definition
εκρίζωση (Greek) ekrízosi / uprooting, eradication
Also known as E6c.
Use Case
Implementation of a simple Python web crawler.
Input: URL (seed).
Output: Simple textual sitemap (to show links between pages).
Requirements
- The crawler is limited to one subdomain (exclude external links).
- No use of web crawling libraries/frameworks (e.g. scrapy).
- (Optional) Use of HTML handling Libraries/Frameworks.
- Production-ready code.
Assumptions
- Only one input URL (seed) is accepted per run.
- The targeted URL(s) are static pages (no JavaScript execution is required to obtain their content).
- Links are extracted from HTML anchor <a> elements.
- Valid links include
  - Valid URL
    - Non empty
    - Matches a valid URL pattern
    - Does not exceed the E6C_MAX_URL_LENGTH length in characters
    - Possible to convert a relative URL to a full URL
  - Link is not visited before
  - Link is not part of an ignored file type
  - Link has the same domain as the seed URL
  - Link is not restricted by the robots.txt file
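The link checks listed above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names (`normalise_link`, `is_valid_link`, `allowed_by_robots`) and the constants are hypothetical, and the constant values simply mirror the documented defaults.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical constants mirroring the documented defaults.
MAX_URL_LENGTH = 300
IGNORED_FILETYPES = (".png", ".pdf", ".txt", ".doc", ".jpg", ".gif")


def normalise_link(page_url: str, href: str) -> str:
    """Convert a possibly relative href into a full URL."""
    return urljoin(page_url, href.strip())


def is_valid_link(url: str, seed_url: str, visited: set) -> bool:
    """Apply the checks from the list above, except the robots.txt rule."""
    if not url or len(url) > MAX_URL_LENGTH:
        return False
    parts = urlparse(url)
    # Assumption: only http(s) URLs with a host count as a valid pattern.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    # Same domain as the seed (external links are excluded).
    if parts.netloc != urlparse(seed_url).netloc:
        return False
    if parts.path.lower().endswith(IGNORED_FILETYPES):
        return False
    return url not in visited


def allowed_by_robots(url: str, robots_lines: list, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

A link found on a page would first go through `normalise_link`, then `is_valid_link`, and finally `allowed_by_robots` before being queued for crawling.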
Design
This project implements a basic universal crawler based on breadth-first search (BFS) graph traversal: pages are nodes, links are edges, and pages closest to the seed are visited first.
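A breadth-first crawl of this kind can be sketched as below. This is an illustrative sketch, not the project's implementation: `crawl`, `get_links` and the in-memory `SITE` dictionary are hypothetical, and the dictionary stands in for real HTTP requests and HTML parsing so the example runs offline.

```python
from collections import deque


def crawl(seed, get_links, max_urls=10000):
    """Breadth-first traversal starting at the seed page.

    get_links is a hypothetical callable returning the links found on a page.
    Returns a dict mapping each crawled page to the links found on it.
    """
    sitemap = {}
    visited = {seed}
    queue = deque([seed])
    while queue and len(sitemap) < max_urls:
        page = queue.popleft()  # FIFO order gives breadth-first traversal
        links = get_links(page)
        sitemap[page] = links
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return sitemap


# A tiny in-memory "site" used instead of fetching real pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

sitemap = crawl("/", lambda page: SITE.get(page, []))
for page, links in sitemap.items():
    print(page, "->", ", ".join(links) or "(no links)")
```

The `visited` set ensures that each page is queued at most once, even when several pages link to it, and `max_urls` bounds the crawl in the same spirit as E6C_MAX_URLS.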
Configuration
Behaviour of the application can be configured via Environment Variables.
| Environment Variable | Description | Type | Default Value |
|---|---|---|---|
| E6C_LOG_LEVEL | Level of logging (overrides the verbose/quiet flag) | string | - |
| E6C_LOG_DIR | Directory to save logs | string | - |
| E6C_BIN_DIR | Directory to save any output (bin) | string | bin |
| E6C_IGNORE_FILETYPES | File types to ignore (e.g. ".filetype1,.filetype2") | string | ".png,.pdf,.txt,.doc,.jpg,.gif" |
| E6C_URL_REQUEST_TIMER | Time (in seconds) to wait per request, to avoid flooding the server with requests | float | 0.1 |
| E6C_MAX_URLS | Maximum number of URLs to fetch/crawl | integer | 10000 |
| E6C_MAX_URL_LENGTH | Maximum length (character count) of a URL to fetch/crawl | integer | 300 |
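As an illustration of how these variables might be consumed, the sketch below reads them with the documented defaults. The `load_config` function and the dictionary keys are hypothetical; only the variable names, types and default values come from the configuration table.

```python
import os


def load_config(env=os.environ):
    """Read the E6C_* variables, falling back to the documented defaults."""
    ignore = env.get("E6C_IGNORE_FILETYPES", ".png,.pdf,.txt,.doc,.jpg,.gif")
    return {
        "log_level": env.get("E6C_LOG_LEVEL"),  # no default documented
        "log_dir": env.get("E6C_LOG_DIR"),      # no default documented
        "bin_dir": env.get("E6C_BIN_DIR", "bin"),
        # Split the comma-separated list and drop empty entries.
        "ignore_filetypes": tuple(
            item.strip() for item in ignore.split(",") if item.strip()
        ),
        "url_request_timer": float(env.get("E6C_URL_REQUEST_TIMER", "0.1")),
        "max_urls": int(env.get("E6C_MAX_URLS", "10000")),
        "max_url_length": int(env.get("E6C_MAX_URL_LENGTH", "300")),
    }
```

Passing a plain dictionary instead of `os.environ` (for example `load_config({"E6C_MAX_URLS": "50"})`) makes the parsing easy to exercise in tests.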
Development
Configure your local development
- Clone repo on your local machine
- Install conda or miniconda
- Create your local project environment (based on conda, poetry, pre-commit):
$ make env
- (Optional) Update existing local project environment:
$ make env-update
Run locally
In a terminal, run the following from the project's root directory:
- Activate project environment:
$ . ./scripts/helpers/environment.sh
- Run the CLI using poetry:
$ ekrhizoc
Contribute
[ Not Available ]
Testing
(part of CI/CD)
[ Work in progress... ]
To run the tests, open a terminal and run the following from the project's root directory:
- Activate project environment:
$ . ./scripts/helpers/environment.sh
- To run pytest:
$ make test
- To check test coverage:
$ make test-coverage
Versioning
Increment the version number:
$ poetry version {bump rule}
where valid bump rules are:
- patch
- minor
- major
- prepatch
- preminor
- premajor
- prerelease
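To illustrate what the three stable bump rules do to a `major.minor.patch` version number, here is a small sketch. The `bump` function is hypothetical and is not Poetry's implementation; the pre-release rules (prepatch, preminor, premajor, prerelease) are deliberately left out.

```python
def bump(version: str, rule: str) -> str:
    """Return the next version for the patch, minor or major bump rule."""
    major, minor, patch = (int(part) for part in version.split("."))
    if rule == "major":
        return f"{major + 1}.0.0"  # lower components reset to zero
    if rule == "minor":
        return f"{major}.{minor + 1}.0"
    if rule == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unsupported bump rule: {rule}")
```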
Changelog
Use CHANGELOG.md to track the evolution of this package.
The [UNRELEASED] tag at the top of the file should always be there to log the work until a release occurs.
Work should be logged under one of the following subtitles:
- Added
- Changed
- Fixed
- Removed
On release, move all the current unreleased changes in the file under a version heading of the following format:
## [major.minor.patch] - YYYY-MM-DD
Deployment
Pip package
In a terminal, run the following from the project's root directory:
- Activate project environment:
$ . ./scripts/helpers/environment.sh
- To build pip package:
$ make build-package
- To publish pip package (requires credentials to PyPI):
$ make publish-package
Docker image
In a terminal, run the following from the project's root directory:
- Activate project environment:
$ . ./scripts/helpers/environment.sh
- To build docker image:
$ make build-docker
Production
For production, a Docker image is used. This image is published publicly on Docker Hub.
- First, pull the image from Docker Hub:
$ docker pull nichelia/ekrhizoc:{version}
- Execute CLI via docker run:
$ docker run --rm -it -v ~/ekrhizoc_bin:/tmp/bin nichelia/ekrhizoc:{version} {command}
This command mounts the application's bin (output) directory to the ekrhizoc_bin folder in the user's home directory. Here, {version} is the published application version.