je-suis-tm/web-scraping — repository offers Python web scrapers and tutorials to

Version	Commit	Size	Downloads	Date
latestLatest	HEAD	1.3 MB	0	17d ago

Web Scraping

Intro

My understanding of web scraping is patience and attention to details. Scraping is not rocket science (deep learning is). When I do scraping, I typically spend 50% of my time in analyzing the source (navigate through HTML parse tree or inspect element to find the post form) and the rest 50% in ETL. The most useful tools for me are requests, bs4 and re. Some people may recommend selenium for non-static website. To be honest, I have never used selenium throughout my career, but dynamic websites like Facebook and Twitter are still within my grasp. You see? patience and attention to details matter.

This repository contains a couple of python web scrapers. These scrapers mainly target at different commodity future exchanges and influential media websites (or so-called fake news, lol). Most scripts were written during my early days of Python learning. Since this repository gained unexpected popularity, I have restructured everything to make it more user-friendly. All the scripts featured in this repository are ready for use. Each script is designed to feature a unique technique that I found useful throughout my experience of data engineering.

Scripts inside this repository are classified into two groups, beginner and advanced. At the beginning, the script is merely about some technique to extract the data. As you progress, the script leans more towards data architect and other functions to improve the end product. If you are experienced or simply come to get scrapers for free, you may want to skip the content and just look at <a href= https://github.com/je-suis-tm/web-scraping#available-scrapers>available scrapers. If you are here to learn, you may look at <a href= https://github.com/je-suis-tm/web-scraping#table-of-contents>table of contents to determine which suits you best. In addition, there are some <a href= https://github.com/je-suis-tm/web-scraping#notes>notes on the annoying issues such as proxy authentication (usually corporate or university network) and legality (hopefully you won't come to that).

Beginner

<a href=https://github.com/je-suis-tm/web-scraping#1-html-parse-tree-search-cme1>1. HTML Parse Tree Search (CME1)

<a href=https://github.com/je-suis-tm/web-scraping#2-json-cme2>2. JSON (CME2)

<a href=https://github.com/je-suis-tm/web-scraping#3-regular-expression-shfe>3. Regular Expression (SHFE)

Advanced

<a href=https://github.com/je-suis-tm/web-scraping#1-sign-in-cqf>1. Sign-in (CQF)

<a href=https://github.com/je-suis-tm/web-scraping#2-database-lme>2. Database (LME)

web-scraping

Quick Overview

What is this?

What problem does it solve?

Who should use it?

Pros

Cons

Scores

Trust Score

Maintenance

Popularity

Star History

Snapshot Versions

Alternatives

listmonk

BillionMail

responsive-html-email-template

flowy

mautic

mailtrain

Community Reviews

README

Web Scraping

Intro

Table of Contents

Beginner

Advanced