Web Scraping Using Python

Web scraping is a form of data scraping used to extract data from websites. Scraping a web page involves two steps: fetching the page (downloading it) and extracting data from it. Web crawling is therefore a main component of web scraping, since it fetches pages for later processing. A typical example is finding names and telephone numbers, companies and their URLs, or e-mail addresses, and copying them to a list.

To extract data using web scraping with Python, follow these basic steps:

  1. Find the URL that you want to scrape.
  2. Inspect the page.
  3. Find the data you want to extract.
  4. Write the code.
  5. Run the code and extract the data.
  6. Store the data in the required format.

Python Libraries Used For Web Scraping:

  1. Requests is an HTTP library for the Python programming language. The goal of the project is to make HTTP requests simpler and more human-friendly. It allows you to send HTTP/1.1 requests extremely easily: there’s no need to manually add query strings to your URLs or to form-encode your POST data.
  2. Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
  3. BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
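
To illustrate the point about query strings, here is a minimal sketch of how Requests builds a URL from a dictionary of parameters. The URL and parameter names are hypothetical and used only for illustration; the request is prepared but never sent, so the example runs without a network connection.

```python
import requests

# Build (but do not send) a GET request.
# The URL and the parameters below are hypothetical examples.
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "web scraping", "page": 2},
)
prepared = request.prepare()

# Requests has encoded the dictionary into the query string of the URL.
print(prepared.url)
```

In everyday use the same params dictionary can be passed straight to requests.get, which prepares and sends the request in one step.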

Import the required Python libraries:

import requests
from bs4 import BeautifulSoup
import pandas as pd

Create an empty list to store the scraped data:

Text = []

Now enter the URL you want to scrape data from. The Requests library is used to make HTTP requests to the server, and BeautifulSoup parses the HTML that comes back.

# Download the page, then parse its HTML with Python's built-in parser
res = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
soup = BeautifulSoup(res.text, 'html.parser')
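
The list Text still has to be filled before it can be saved. Here is a minimal sketch of one way to do that, assuming the data of interest is the text inside the page's <p> tags (which elements to extract is an assumption made for this example). A small inline HTML string stands in for res.text so the example runs offline.

```python
from bs4 import BeautifulSoup

# Stand-in for res.text so the example runs without a network connection.
sample_html = """
<html><body>
  <h1>Web scraping</h1>
  <p>Web scraping extracts data from websites.</p>
  <p>   </p>
  <p>It usually involves <b>fetching</b> and parsing pages.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

Text = []
# Assumption: the text we want lives in <p> elements.
for paragraph in soup.find_all("p"):
    content = paragraph.get_text().strip()
    if content:  # skip paragraphs that contain only whitespace
        Text.append(content)

print(Text)
```

With the real page, the same loop can run over the soup object built from res.text.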

After extracting the data, you will usually want to store it. The format depends on your requirements; here, we store the extracted data in CSV format.

# Put the scraped text into a one-column table and write it to disk
df = pd.DataFrame({'Text': Text})
df.to_csv('Web_Scrap.csv', index=False, encoding='utf-8')

Running the above code generates the CSV file Web_Scrap.csv. Here is a snapshot of the generated file.
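
The output can also be checked programmatically by reading the file back with pandas. The sketch below assumes two placeholder strings in place of real scraped text, and writes to a temporary directory so that it leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Placeholder data standing in for the scraped paragraphs.
Text = ["First paragraph.", "Second paragraph."]

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Web_Scrap.csv")
    pd.DataFrame({"Text": Text}).to_csv(csv_path, index=False, encoding="utf-8")

    # Read the file back to confirm that every row survived the round trip.
    loaded = pd.read_csv(csv_path, encoding="utf-8")

print(loaded.shape)
print(loaded["Text"].tolist() == Text)
```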

Hence, we have learned how to scrape data from the internet and format it for further analysis. You can see the whole code here: