Hackernews-Scraping
Business Requirements:
- Scrape thehackernews.com and store the results (Description, Image, Title, URL) in MongoDB
- Maintain two collections: one mapping each blog post's URL to its title, and another mapping each URL to its metadata (Description, Image, Title, Author), as sketched below
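For illustration, here is a minimal sketch of the two document shapes. Field names are taken from the requirements above and collection names from the queries at the end of this README; the notebook's actual schema may differ:

```python
# Hypothetical example documents; field and collection names follow this
# README, but the notebook's actual schema may differ.
url_title_doc = {  # stored in the 'url-title' collection
    "Url": "https://thehackernews.com/2021/01/example-post.html",  # illustrative URL
    "Title": "Example Post Title",
}

url_others_doc = {  # stored in the 'url-others' collection
    "Url": "https://thehackernews.com/2021/01/example-post.html",
    "Description": "One-line summary of the post.",
    "Image": "https://thehackernews.com/images/example.jpg",
    "Title": "Example Post Title",
    "Author": "Example Author",
}
```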
Requirements:
- Python 3
- pip
- Python libraries: requests, BeautifulSoup4, pymongo, jupyterlab, notebook
- MongoDB
- git
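The install step below reads requirements.txt from the repository; if you need to recreate it, a plausible version based on the library list above would be (the actual file may pin specific versions):

```
requests
beautifulsoup4
pymongo
jupyterlab
notebook
```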
To run the application on your local machine:
- Clone the repository by typing the following in your terminal:

      git clone https://github.com/pushp1997/Hackernews-Scraping.git

- Change into the repository directory:

      cd ./Hackernews-Scraping

- Create a Python virtual environment:

      python3 -m venv ./scrapeVenv
- Activate the virtual environment you created:
  - On Linux / macOS:

        source ./scrapeVenv/bin/activate

  - On Windows (cmd):

        scrapeVenv\Scripts\activate.bat

  - On Windows (PowerShell):

        .\scrapeVenv\Scripts\Activate.ps1
- Install the Python requirements:

      pip install -r requirements.txt

- Open the notebook in Jupyter:

      jupyter notebook "Hackernews Scraper.ipynb"
- Run the notebook. You will be prompted for the number of pages to scrape and for the MongoDB URI where the scraped posts should be stored (a minimal sketch of this step appears after these instructions).
- Open a MongoDB shell connected to the same URI you provided to the notebook.
- Switch to the database:

      use hackernews
- Print the documents in the 'url-title' collection:

      db["url-title"].find().pretty()

- Print the documents in the 'url-others' collection:

      db["url-others"].find().pretty()