Primer on Web Scraping
March 12, 2025
“If you can see things in your web browser, you can scrape them.”
Clients are the typical web user’s internet-connected devices (e.g., a computer connected to Wi-Fi) and web-accessing software available on those devices (e.g., Firefox, Chrome).
Servers are computers that store webpages, sites, or apps.
When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user’s web browser.
Hypertext Transfer Protocol (HTTP) is a language for clients and servers to speak to each other.
Hypertext Transfer Protocol Secure (HTTPS) is an encrypted version of HTTP that provides secure communication between them.
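The client/server exchange described above can be sketched with Python's standard library alone. This is a minimal illustration, not how a real site is served: it starts a throwaway server on the local machine, then acts as the client and downloads a copy of the page. The names PAGE, PageHandler, and fetch_from_local_server are made up for this example, and the sketch uses plain HTTP rather than HTTPS to avoid needing a certificate.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page stored on the "server".
PAGE = b"<!DOCTYPE html><html><body><h1>Hello</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Toy server logic: answer every GET request with the same page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_from_local_server():
    """Start a local server, request its page as a client, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: the OS picks a free port
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # The client side: open a connection and send an HTTP GET request.
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        result = (response.status, response.read())
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_from_local_server()
print(status)
print(body.decode("utf-8"))
```

The body returned to the client is the same HTML the server holds, which is the copy a browser would go on to render.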
When we type a web address with “https://” into our browser:
A uniform resource locator (URL), commonly known as a “web address”, specifies the location of a resource (such as a web page) on the internet.
A URL is usually composed of five parts.
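One way to see a URL broken into parts is Python's urllib.parse.urlsplit, which returns a five-way split: scheme, network location, path, query, and fragment. This is offered as an illustration under assumptions: the split shown is the one urlsplit uses and may not match the five parts these notes had in mind, the URL is hypothetical, and treating id and cat as query parameter names is a guess.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL, made up for illustration.
url = "https://www.example.com/products/item.php?id=7&cat=books#reviews"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /products/item.php
print(parts.query)     # id=7&cat=books
print(parts.fragment)  # reviews

# The query string can be split further into named parameters.
query_params = parse_qs(parts.query)
print(query_params)    # {'id': ['7'], 'cat': ['books']}
```

When scraping, the query parameters are often the part worth varying, since many sites use them to select which record or category a page displays.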
id and cat.
The <!DOCTYPE html> declaration defines that this document is an HTML document.
The <html> element is the root element of an HTML page.
The <head> element contains meta information about the HTML page.
The <title> element specifies a title for the HTML page.
The <title> content is shown in the browser’s title bar or in the page’s tab.
The <body> element defines the document’s body and is a container for all the visible contents, such as headings, paragraphs, images, hyperlinks, tables, and lists.
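The document skeleton described above can be read back programmatically. The sketch below uses html.parser from Python's standard library to record the doctype, the order of opening tags, and the title text. The sample page and the SkeletonParser class are hypothetical, written only for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical page, written for this example.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>My First Page</title>
</head>
<body>
<h1>A Heading</h1>
<p>A paragraph.</p>
</body>
</html>"""


class SkeletonParser(HTMLParser):
    """Record the doctype, the opening tags in order, and the <title> text."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.tags = []
        self.title = ""
        self._in_title = False

    def handle_decl(self, decl):
        self.doctype = decl  # text inside the <!DOCTYPE ...> declaration

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


parser = SkeletonParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.doctype)  # DOCTYPE html
print(parser.tags)     # ['html', 'head', 'title', 'body', 'h1', 'p']
print(parser.title)    # My First Page
```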
The <h1> element defines a large heading.
The <p> element defines a paragraph.
The <a> tag defines an HTML link.
The <img> tag defines an HTML image.
The <table> tag defines an HTML table.
The <tr> tag defines each table row.
The <th> tag defines each table header cell.
The <td> tag defines each table data cell.
The <ul> tag defines an unordered list.
The <li> tag defines each list item.
The <ol> tag defines an ordered list.
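Tables and lists are common scraping targets because their tags map directly onto rows and items. The following is a minimal sketch using html.parser that turns each <tr> into a list of cell strings and each <ul> or <ol> into a list of item strings. The sample markup and the TableAndListParser class are hypothetical, and the sketch assumes the tables and lists are not nested.

```python
from html.parser import HTMLParser

# Hypothetical markup, written for this example.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>London</td><td>England</td></tr>
  <tr><td>Paris</td><td>France</td></tr>
</table>
<ul><li>Coffee</li><li>Tea</li></ul>
<ol><li>First</li><li>Second</li></ol>
"""


class TableAndListParser(HTMLParser):
    """Collect table rows and list items (assumes no nesting)."""

    def __init__(self):
        super().__init__()
        self.rows = []         # one list of cell strings per <tr>
        self.lists = []        # one (tag, items) pair per <ul> or <ol>
        self._capture = None   # "cell" inside <th>/<td>, "item" inside <li>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self.rows[-1].append("")
            self._capture = "cell"
        elif tag in ("ul", "ol"):
            self.lists.append((tag, []))
        elif tag == "li" and self.lists:
            self.lists[-1][1].append("")
            self._capture = "item"

    def handle_endtag(self, tag):
        if tag in ("th", "td", "li"):
            self._capture = None

    def handle_data(self, data):
        if self._capture == "cell":
            self.rows[-1][-1] += data
        elif self._capture == "item":
            self.lists[-1][1][-1] += data


parser = TableAndListParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.rows)
print(parser.lists)
```

The first row holds the <th> header cells, so in practice it can serve as the column names for the remaining rows.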
In an ordered list, too, the <li> tag defines each list item.
<div>: Block-Level Container
<div style="background-color:black;color:white;padding:20px;">
<h2>London</h2>
<p>London is the capital city of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
</div>
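A <div> like the one above can be picked apart in code: its attributes arrive as name/value pairs, and the text of the elements grouped inside it can be collected together. The sketch below is one possible approach using html.parser. DivParser and parse_style are hypothetical helpers, the paragraph text is shortened from the example, and the parser assumes <div> elements are not nested.

```python
from html.parser import HTMLParser

# The <div> example from the notes, with the paragraph shortened.
DIV_HTML = """<div style="background-color:black;color:white;padding:20px;">
<h2>London</h2>
<p>London is the capital city of England.</p>
</div>"""


class DivParser(HTMLParser):
    """Collect each <div>'s attributes and the text inside it (no nesting)."""

    def __init__(self):
        super().__init__()
        self.divs = []    # one dict per <div>: {"attrs": {...}, "text": [...]}
        self._depth = 0   # number of <div> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._depth += 1
            self.divs.append({"attrs": dict(attrs), "text": []})

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth > 0 and text:
            self.divs[-1]["text"].append(text)


def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {name.strip(): value.strip() for name, value in pairs}


parser = DivParser()
parser.feed(DIV_HTML)
parser.close()
first_div = parser.divs[0]
style = parse_style(first_div["attrs"]["style"])
print(style)
print(first_div["text"])
```

The same attribute dictionary would expose class and id values when they are present, which is usually how a scraper singles out one <div> among many.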
The style, class, and id attributes are commonly used with <div>.
<div> is used when grouping multiple elements together.
<span>: Inline Container
<p>My mother has <span style="color:blue;font-weight:bold">blue</span> eyes and my father has <span style="color:darkolivegreen;font-weight:bold">dark green</span> eyes.</p>
<span> is used when styling or modifying a small part of text inside a block.
To view a page’s HTML source, right-click the page and select View Page Source (in Chrome), or similar in other browsers.
Right-click an element and choose Inspect or Inspect Element to see what elements are made up of.
To parse HTML, it is convenient to represent our HTML document as a tree-like structure that contains information in nodes and links information through branches.
This tree-like structure is called the Document Object Model (DOM).
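The tree idea can be made concrete with a small sketch. It uses xml.etree.ElementTree from Python's standard library, which is an XML parser rather than a browser DOM; it works here only because the hypothetical sample is well-formed, and real pages usually need a dedicated HTML parser. The outline function and the sample page are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page written for this example.
SAMPLE = """<html>
  <head><title>Demo</title></head>
  <body>
    <h1>Links</h1>
    <p>Visit <a href="https://www.example.com/">Example</a>.</p>
    <img src="logo.png" alt="Logo" />
  </body>
</html>"""


def outline(node, depth=0):
    """Return one indented line per element, walking the tree depth-first."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
tree_lines = outline(root)
print("\n".join(tree_lines))

# Walking the tree also makes it easy to collect specific nodes.
links = [a.get("href") for a in root.iter("a")]
images = [img.get("src") for img in root.iter("img")]
print(links)   # ['https://www.example.com/']
print(images)  # ['logo.png']
```

Each indentation level in the printed outline corresponds to one branch down from the root <html> node, so <a> sits under <p>, which sits under <body>.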