Alt Job
Scraping alternative websites for jobs.
Alt Job scrapes a set of green/social/alternative websites and sends a digest of new job postings by email. It also generates an Excel file with job posting information.
The scraped data include: job title, type, salary, week_hours, date posted, apply-before date and full description. Additionally, a set of keyword matches is automatically checked against all jobs and added as a new column. (See screenshots)
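The keyword check described above can be pictured with a short sketch. The helper below is hypothetical and only illustrates the idea (case-insensitive matching of a keyword list against a job's text fields); it is not Alt Job's actual implementation.

```python
import re

def match_keywords(job, keywords):
    """Return the subset of `keywords` found in a job's title or description.

    `job` is a dict with "title" and "description" keys. This is an
    illustrative sketch, not Alt Job's real code.
    """
    text = f"{job.get('title', '')} {job.get('description', '')}".lower()
    return [kw for kw in keywords if re.search(re.escape(kw.lower()), text)]

job = {"title": "Community organizer",
       "description": "Non-profit seeking a bilingual organizer in Montréal."}
print(match_keywords(job, ["bilingual", "remote", "organizer"]))
# ['bilingual', 'organizer']
```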
🔥
Job postings mailing list - Montréal / Québec:
alt_job_mtl Google Group. Join it to receive a daily digest of new job postings from Montréal and the Province of Québec.
Supported websites
Alt Job is written in an extensible way: only 30 lines of code are required to support a new job posting site! The focus is on Canada/Québec for now; please contribute to improve the software or expand its scope.
Supports the following websites:
- arrondissement.com: Québec (full parsing)
- cdeacf.ca: Québec (full job PDFs parsing)
- chantier.qc.ca: Québec (full parsing)
- goodwork.ca: Québec and Canada (full parsing, form search still TODO)
- engages.ca: Québec (paging TODO)
- enviroemplois.org: Québec (full parsing)
- charityvillage.com: Québec and Canada (full parsing, requires chromedriver)
- aqoci.qc.ca: Québec, International (full parsing)
Support for the following websites is on the TODO list:
- undpjobs.net: World Wide
- eco.ca
- novae.ca
- quebecmunicipal.qc.ca
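To give a feel for what "30 lines for a new site" means: a scraper's core job is turning a listing page into job dicts (title, URL, and so on). The real Alt Job scrapers are Scrapy spiders with their own base class, which is not shown here; the stdlib-only sketch below, with hypothetical class and field names, only illustrates the parsing step.

```python
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Illustrative sketch of the parsing a new scraper performs:
    extract {"url", "title"} dicts from <a class="job"> links in a
    listing page. Class and attribute names are hypothetical."""

    def __init__(self):
        super().__init__()
        self.jobs = []          # completed job dicts
        self._current = None    # job being built while inside an <a> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "job":
            self._current = {"url": attrs.get("href", ""), "title": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.jobs.append(self._current)
            self._current = None

html = '<ul><li><a class="job" href="/jobs/1">Project officer</a></li></ul>'
parser = ListingParser()
parser.feed(html)
print(parser.jobs)  # [{'url': '/jobs/1', 'title': 'Project officer'}]
```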
Install
Install all requirements (see setup.py for more details)
python3 -m pip install 'alt_job[all]'
Requires Python >= 3.6
Configure
Sample config file
[alt_job]
##### General config #####
# Logging
log_level=INFO
scrapy_log_level=ERROR
# Jobs data file, default is ~/jobs.json
# jobs_datafile=/home/user/Jobs/jobs-mtl.json
# Asynchronous workers: number of sites to scan at the same time.
# Defaults to 5.
# workers=10
##### Mail sender #####
# Email server settings
smtphost=smtp.gmail.com
mailfrom=you@gmail.com
smtpuser=you@gmail.com
smtppass=password
smtpport=587
smtptls=Yes
# Email notification settings
mailto=["you@gmail.com"]
##### Scrapers #####
# Website domain
[goodwork.ca]
# URL to start the scraping, required for all scrapers
url=https://www.goodwork.ca/jobs.php?prov=QC
[cdeacf.ca]
url=http://cdeacf.ca/recherches?f%5B0%5D=type%3Aoffre_demploi
# Load full job details: if supported by the scraper,
# this will follow each job posting link in the listing and parse the full job description.
# Turn on to parse all job information.
# Default to False!
load_full_jobs=True
[arrondissement.com]
url=https://www.arrondissement.com/tout-list-emplois/
# Load all new pages: if supported by the scraper,
# this will follow each "next page" link and parse the next listing page
# until job postings already in the database are found.
# Default to False!
load_all_new_pages=True
[chantier.qc.ca]
url=https://chantier.qc.ca/decouvrez-leconomie-sociale/offres-demploi
load_full_jobs=Yes
# Disabled scraper
# [engages.ca]
# url=https://www.engages.ca/emplois?search%5Bkeyword%5D=&search%5Bjob_sector%5D=&search%5Bjob_city%5D=Montr%C3%A9al
[enviroemplois.org]
# Crawl from multiple start URLs
start_urls=["https://www.enviroemplois.org/offres-d-emploi?sector=®ion=6&job_kind=&employer=",
"https://www.enviroemplois.org/offres-d-emploi?sector=®ion=3&job_kind=&employer="]
Run it
python3 -m alt_job -c /home/user/Jobs/alt_job.conf
Arguments
Some config options can be overridden with CLI arguments.
-c <File path> [<File path> ...], --config_file <File path> [<File path> ...]
configuration file(s). Default locations will be
checked and loaded if file exists:
`~/.alt_job/alt_job.conf`, `~/alt_job.conf` or
`./alt_job.conf` (default: [])
-t, --template_conf print a template config file and exit. (default:
False)
-V, --version print Alt Job version and exit. (default: False)
-x <File path>, --xlsx_output <File path>
Write all NEW jobs to Excel file (default: None)
-s <Website> [<Website> ...], --enabled_scrapers <Website> [<Website> ...]
List of enabled scrapers. By default it's all scrapers
configured in config file(s) (default: [])
-j <File path>, --jobs_datafile <File path>
JSON file to store ALL jobs data. Default is
'~/jobs.json'. Use the 'null' keyword to disable the
storage of the datafile; all jobs will then be
considered new and loaded (default: )
--workers <Number> Number of websites to scrape asynchronously (default:
5)
--full, --load_all_jobs
Load the full job description page to parse
additional data. This setting is applied to all
scrapers (default: False)
--all, --load_all_new_pages
Load new job listing pages until older jobs are found.
This setting is applied to all scrapers (default:
False)
--quick, --no_load_all_jobs
Do not load the full job description page to parse
additional data (much faster). This setting is
applied to all scrapers (default: False)
--first, --no_load_all_new_pages
Load only the first job listing page. This setting is
applied to all scrapers (default: False)
--mailto <Email> [<Email> ...]
Emails to notify of new job postings (default: [])
--log_level <String> Alt Job log level. Example: DEBUG (default: INFO)
--scrapy_log_level <String>
Scrapy log level. Example: DEBUG (default: ERROR)
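The jobs datafile is what lets Alt Job tell new postings from ones it has already seen and mailed out. The exact deduplication key is not documented here, so the sketch below assumes the posting URL is used; function and field names are hypothetical illustrations, not Alt Job's real code.

```python
import json
import os
import tempfile

def filter_new_jobs(scraped, datafile):
    """Return the scraped jobs not already recorded in `datafile`,
    then append them to it. Keying on the posting URL is an
    assumption for illustration only."""
    try:
        with open(datafile) as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = []  # no datafile yet: every job is new
    known_urls = {job["url"] for job in seen}
    fresh = [job for job in scraped if job["url"] not in known_urls]
    with open(datafile, "w") as f:
        json.dump(seen + fresh, f)
    return fresh

path = os.path.join(tempfile.mkdtemp(), "jobs.json")
batch = [{"url": "https://example.org/jobs/1", "title": "Coordinator"}]
print(len(filter_new_jobs(batch, path)))  # 1 -- first run: everything is new
print(len(filter_new_jobs(batch, path)))  # 0 -- second run: already stored
```

This also explains the `-j null` behaviour documented above: with no datafile, nothing is ever remembered, so every scraped job counts as new on each run.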