Lecture 15

Primer on Web-scrapping

Byeong-Hak Choe

SUNY Geneseo

March 12, 2025

Premier on Web-scrapping

Ethical considerations

“If you can see things in your web browser, you can scrape them.”

  • Just because you can scrape them doesn’t mean you should.
  • It is important to realize that the web scrapping tools are very powerful.
    • It is pretty straightforward to write up a program that can overwhelm a host server.
    • The host server tends to have built-in safeguards that will block you in case of a suspected malicious attack.

Ethical considerations

  • Respect for website owners’ terms of service.
  • Respect for copyright and intellectual property.
  • Respect for client servers.
  • Respect for privacy.

Clients and servers

  • Computers connected to the web are called clients and servers.
  • Clients are the typical web user’s internet-connected devices (e.g., a computer connected to Wi-Fi) and web-accessing software available on those devices (e.g., Firefox, Chrome).

  • Servers are computers that store webpages, sites, or apps.

  • When a client device wants to access a webpage, a copy of the webpage is downloaded from the server onto the client machine to be displayed in the user’s web browser.

Hypertext Transfer Protocol Secure (HTTPS)

  • Hypertext Transfer Protocol (HTTP) is a language for clients and servers to speak to each other.

  • Hypertext Transfer Protocol Secure (HTTPS) is an encrypted version of HTTP that provides secure communication between them.

  • When we type a web address with “https://” into our browser:

    1. The browser finds the address of the server that hosts the website.
    2. The browser and server establish a secure connection.
    3. The browser sends an HTTPS request to the server, asking for the website’s content.
    4. If the server approves the request, it responds with a 200 OK message and sends the encrypted website data to the client.
    5. The browser decrypts and displays the website securely.

HTTP Status Codes

# library for making HTTPS requests in Python
import requests  
p = 'https://bcdanl.github.io/210'
response = requests.get(p)  
print(response.status_code)  
print(response.reason)      
  • 200 OK → The request is successful, and the server sends the requested content.
p = 'https://bcdanl.github.io/2100'
response = requests.get(p)  
print(response.status_code)  
print(response.reason)       
  • 404 Not Found → The server cannot find the requested webpage.
  • This may happen due to a broken link, mistyped URL, or deleted page.

URL

  • An uniform resource locator (URL)—commonly know as a “web address”, specifies the location of a resource (such as a web page) on the internet.

  • An URL is usually composed of 5 parts

    • The 4th part, the “query string”, contains one or more parameters — in this case, there are two parameters: id and cat.
    • The 5th part, the “fragment”, is an internal page reference and may not be present.

HTML

  • Hyper Text Markup Language (HTML) is the standard markup language for creating Web pages.
    • HTML describes the structure of a Web page.
    • HTML consists of a series of elements.
    • HTML elements tell the browser how to display the content.
  • HTML elements label pieces of content such as “this is a heading”, “this is a paragraph”, “this is a link”, etc.

HTML Example

  • Below is a simple HTML document:
<!DOCTYPE html>
  <html>
    <head>
      <title>Page Title</title>
    </head>
    <body>
      <h1>My First Heading</h1>
      <p>My first paragraph.</p>
    </body>
  </html>

HTML Elements

tagname

  • An HTML element is defined by a start tag, some content, and an end tag:
<tagname>Content goes here...</tagname>

HTML Elements

  • <!DOCTYPE html> declaration defines that this document is an HTML document.
  • <html> element is the root element of an HTML page.
  • <head> element contains meta information about the HTML page.
  • <title> element specifies a title for the HTML page.
    • <title> is shown in the browser’s title bar or in the page’s tab.
  • <body> element defines the document’s body, and is a container for all the visible contents, such as headings, paragraphs, images, hyperlinks, tables, lists, etc.
    • <h1> element defines a large heading.
    • <p> element defines a paragraph.

HTML body

<a>

<a href="https://www.w3schools.com">This is a link</a>
  • The <a> tag defines an HTML link.

<img>

<img src="w3schools.jpg" alt="W3Schools.com" width="104" height="142">
  • The <img> tag defines an HTML image.

HTML Tables

  • <table> tag defines an HTML table.
    • <tr> tag defines each table row.
    • <th> tag defines each table header.
    • <td> tag defines each data/cell.

HTML Table Examples

  • Below is the example HTML table:
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table>

HTML Lists

Unordered lists

<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>
  • <ul> tag defines an unordered lists.
    • <li> tag defines each list item.
    • The list items will be marked with bullets (small black circles) by default.

HTML Lists

Ordered lists

<ol>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>
  • <ol> tag defines an ordered lists.
    • <li> tag defines each list item.
    • The list items will be marked with numbers by default.

<div> – Block-Level Container

<div style="background-color:black;color:white;padding:20px;">
  <h2>London</h2>
  <p>London is the capital city of England. It is the most populous city in the United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
</div>
  • Block-level element → Starts on a new line and stretches to the full width.
    • Used as a container for structuring content.
    • Has no required attributes, but style, class, and id are commonly used.
  • <div> is used when grouping multiple elements together.

<span> – Inline Container

<p>My mother has <span style="color:blue;font-weight:bold">blue</span> eyes and my father has <span style="color:darkolivegreen;font-weight:bold">dark green</span> eyes.</p>
  • Inline element → Does not start on a new line; only takes up as much space as necessary.
    • Used to apply styling or scripting to specific text portions.
    • Has no required attributes, but style, class, and id are commonly used.
  • <span> is used when styling or modifying a small part of text inside a block.

HTML source code

  • HTML elements can be nested (this means that elements can contain other elements).
    • All HTML documents consist of nested HTML elements.
  • We can view HTML source for web page.
    • Hit F12 key (in Chrome or FireFox).
    • Right-click in an HTML page and select View Page Source (in Chrome), or similar in other browsers.
    • Right-click on an element (or a blank area), and choose Inspect or Inspect Element to see what elements are made up of.

HTML Source Code: DOM

  • To parse HTML, it is convenient to represent our HTML document as a tree-like structure that contains information in nodes and links information through branches.

  • This tree-like structure is called the Document Object Model (DOM).

    • DOM is a cross-platform and language-independent interface that treats an XML (eXtensible Markup Language) or HTML document as a tree structure wherein each node is an object representing a part of the document.

HTML Source Code: DOM

HTML Source Code: DOM

JavaScript (JS) – Adding Interactivity

  • JavaScript (JS) is a client-side programming language that enhances web interactivity.
    • It runs directly in the user’s browser, reducing server load.
    • JS is often used with third-party libraries to expand website functionality.
    • It enables dynamic content, user interactions, and animations.

Cascading Style Sheets (CSS) – Styling the Web

  • CSS is used to design and format the layout of a webpage.
    • It controls colors, fonts, text size, background, spacing, and display properties.
    • Analogy:
      • HTML = The structure of a house (walls, roof, foundation).
      • CSS = The decoration (paint, carpet, wallpapers).
      • JS = The interactive elements (automatic doors, lights, smart devices).
  • Together, HTML, CSS, and JS form the foundation of modern web development.