How to extract data from the internet by web scraping using BeautifulSoup4 with Python

Updated on Oct. 16, 2017 | 462 | python

Python has always been amazing for extracting information from the internet. In this article, I am going to describe how to extract information using the BeautifulSoup4 library with Python. I expect you to have a basic to intermediate level of Python knowledge to follow this tutorial.

 

Step 1: We will first install the requests and bs4 libraries using pip from our Windows terminal. Linux and Mac users can install these libraries the same way from their own terminal. It’s easy!

pip install requests

pip install bs4


Step 2: Import the requests library and the BeautifulSoup class.

import requests

from bs4 import BeautifulSoup

 

Step 3: Create a function called CrawlWords(url), which is responsible for all the scraping activities. Here, url is the parameter that will receive the link of the website’s HTML page we want to scrape.

def CrawlWords(url):



Step 4: Request the HTML page and store its contents as text inside a variable.


    resource = requests.get(url).text

 


Step 5: Make the text data ready to analyze by passing it to the BeautifulSoup constructor.

 

    souped_resource = BeautifulSoup(resource, "html.parser")
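To see what the parser gives us without hitting the network, here is a small sketch on an inline HTML string (the snippet and variable names are invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny HTML document standing in for the downloaded page
sample_html = "<html><head><title>Demo</title></head><body><h1>Hello World</h1></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")

# The soup object is a parse tree we can navigate by tag name
print(soup.title.string)  # Demo
print(soup.h1.string)     # Hello World
```

The second argument, "html.parser", selects Python’s built-in parser, so no extra dependency is needed.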

 

 

Step 6: Find all the h1 tags inside the parsed data.


    all_h1 = souped_resource.find_all('h1')
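find_all() returns a list of Tag objects, one per match. A quick offline sketch (the HTML snippet is made up for the example):

```python
from bs4 import BeautifulSoup

page = "<h1>First heading</h1><p>body text</p><h1>Second heading</h1>"
soup = BeautifulSoup(page, "html.parser")

# find_all returns a list of Tag objects, in document order
headings = soup.find_all('h1')
print(len(headings))                    # 2
print([h.string for h in headings])     # ['First heading', 'Second heading']
```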

 

 

Step 7: Now, we want to strip the HTML tags from each h1 result we found and keep only its text, iterating over the results with a loop.

 

    for each_result in all_h1:

        each_result_str = each_result.string

        if each_result_str is None:  # .string is None for h1 tags with nested markup
            continue
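One thing to be aware of: .string returns None when the tag contains nested markup, while .get_text() always returns the combined text. A small sketch of the difference (the HTML is invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Plain</h1><h1><em>Nested</em> text</h1>", "html.parser")
plain, nested = soup.find_all('h1')

print(plain.string)       # Plain
print(nested.string)      # None, because the tag has more than one child
print(nested.get_text())  # Nested text
```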

 


Step 8: Now, we want to lowercase each string and split it into words inside the loop. Finally, we will print the separated words. Don’t forget to call our function!

 

        word_result = each_result_str.lower().split()

        print(word_result)


CrawlWords(r'https://www.ygencoder.com')

You are done! You can run your program now. You can also customize this program as much as you want to suit your needs.
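As one possible customization (the helper name and sample HTML below are made up for the example), the same split words can be fed into collections.Counter to count how often each word appears across all h1 tags:

```python
from collections import Counter
from bs4 import BeautifulSoup

def count_h1_words(html):
    # Hypothetical helper: same parsing steps as CrawlWords,
    # but counts the words instead of printing them
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all('h1'):
        text = tag.string
        if text is None:  # skip h1 tags that contain nested markup
            continue
        counts.update(text.lower().split())
    return counts

sample = "<h1>Python Web Scraping</h1><h1>Python Basics</h1>"
result = count_h1_words(sample)
print(result['python'])  # 2
```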

 


Full Source Code:

import requests
from bs4 import BeautifulSoup


def CrawlWords(url):
    resource = requests.get(url).text
    souped_resource = BeautifulSoup(resource, "html.parser")
    all_h1 = souped_resource.find_all('h1')

    for each_result in all_h1:
        each_result_str = each_result.string
        if each_result_str is None:  # .string is None for h1 tags with nested markup
            continue
        word_result = each_result_str.lower().split()
        print(word_result)


CrawlWords(r'https://www.ygencoder.com')


If you like this tutorial, don't forget to subscribe to my YouTube channel:

https://www.youtube.com/YGenCoder