Beautiful Soup is an HTML and XML parser available for Python 2.6 and newer. The name comes from "tag soup", the unstructured, noisy HTML that real-world documents are often made of and that is hard to understand. It parses HTML and XML documents into a tree from which the data can be extracted.
In this article we will go through the functions that help us extract data from an HTML document. We will use a toy HTML document to explain how Beautiful Soup works and walk through the steps involved in scraping data, one of the techniques of data mining, from a website's HTML.
With the help of a browser automation tool such as Selenium and a headless browser such as PhantomJS, one can easily practice scraping data out of a website. With these tools it becomes easy to scrape through multiple pages or extract a large amount of data from a site. Because a headless browser never renders the page on screen, it also tends to run faster and consume less memory; PhantomJS, for instance, runs each browser instance in its own process.
Installing Beautiful Soup 4
The Beautiful Soup library can be installed using pip with a very simple command, and it is available on almost all platforms. Here is one way to install it from a Jupyter Notebook.
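A minimal way to do this from a notebook cell (the leading ! passes the command to the shell):

```python
# Install Beautiful Soup 4 from a Jupyter Notebook cell; drop the "!" in a terminal.
!pip install beautifulsoup4
```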
We can then import the library and assign the parsed document to a BeautifulSoup object with the following code.
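A short sketch of the import; the HTML string here is just a placeholder, and the toy document used in the rest of the article is introduced below:

```python
from bs4 import BeautifulSoup

# Parse a small HTML snippet with the default parser and keep the
# resulting parse tree in a BeautifulSoup object.
soup = BeautifulSoup("<html><body><p>Hello, Soup!</p></body></html>", "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```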
Installing An Alternative Parser
Beautiful Soup uses a default parser that ships with the Python standard library, but we can use a different parser depending on the objective. The most common alternative parsers are “lxml” and “html5lib”. These can be installed with the help of the following code:
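Again as a notebook-cell sketch (drop the ! if running in a terminal):

```python
# Install the two most common alternative parsers.
!pip install lxml
!pip install html5lib
```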
Below is a comparison of the main parsers with their advantages and disadvantages:

html.parser (Python's built-in parser): needs no external dependency and is reasonably fast, but it is less lenient with badly broken HTML than the alternatives.
lxml: very fast and quite lenient, but it depends on the external lxml C library.
html5lib: extremely lenient and parses pages the same way a web browser does, producing valid HTML5, but it is the slowest option and requires an external Python package.
Getting Started
We will be using the following basic HTML document to parse data with Beautiful Soup.
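As a stand-in for the article's toy document, the sketch below uses the "three sisters" snippet from the Beautiful Soup documentation, stored in a variable named html_doc:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

# Build the parse tree that the rest of the examples work on.
soup = BeautifulSoup(html_doc, "html.parser")
```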
The following code prints the HTML expanded into its hierarchy:
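A one-line sketch with prettify(), which prints the document with one tag per line and indentation that mirrors the nesting:

```python
# Pretty-print the parse tree so the nesting of tags is visible.
print(soup.prettify())
```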
Exploring The Parse Tree
To navigate through the tree, we can use the following commands:
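A few of the navigation attributes, applied to the soup object built above:

```python
print(soup.title)         # <title>The Dormouse's story</title>
print(soup.title.name)    # 'title'
print(soup.title.string)  # "The Dormouse's story"
print(soup.p)             # the first <p> tag in the document
print(soup.a)             # the first <a> tag in the document
print(soup.p["class"])    # ['title'] -- attributes behave like a dictionary
```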
Beautiful Soup tags have many attributes which can be accessed and edited, and the extracted, parsed data can be saved to a text file.
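A sketch of both ideas; the attribute edit and the file name output.txt are illustrative choices, not anything prescribed here:

```python
# Tag attributes can be read and edited like dictionary entries.
first_link = soup.a
first_link["id"] = "first-link"

# Save the (possibly modified) parse tree to a text file.
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(soup.prettify())
```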
To extract only the text from the document, we can use the get_text() command.
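Applied to the toy document:

```python
# Return all the human-readable text in the document, with the tags stripped out.
print(soup.get_text())
```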
Strings: How To Remove Whitespace
The strings in the document can be accessed using the .strings generator, but its output also includes whitespace, which can be stripped easily.
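A sketch of the raw output; repr() is used so the whitespace is visible:

```python
# .strings yields every string in the tree, including whitespace-only ones.
for s in soup.strings:
    print(repr(s))
```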
Since the above output has a lot of whitespace, the .stripped_strings generator will help us remove it.
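The same loop with .stripped_strings:

```python
# .stripped_strings skips whitespace-only strings and strips the rest.
for s in soup.stripped_strings:
    print(repr(s))
```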
Parent And Siblings
We can obtain the parent of a particular tag with the .parent attribute, like here:
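For instance, the <title> tag in the toy document sits inside <head>:

```python
print(soup.title.parent.name)  # 'head'
print(soup.title.parent)       # the whole <head> tag
```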
To access the siblings, the previous as well as the next one, we can use the following attributes:
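A sketch against the same toy document; note that the sibling of a tag can be a plain text node such as a comma or a newline:

```python
first_link = soup.a                          # the "Elsie" link
print(first_link.next_sibling)               # the text node right after it (",\n")
print(first_link.next_sibling.next_sibling)  # the "Lacie" link

second_link = soup.find(id="link2")
print(second_link.previous_sibling)          # the text node right before it
```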
Find And FindAll
These functions are used to search for a particular tag or attribute throughout the HTML document. They are among the key features required while data mining or scraping data from a website with the help of Selenium and PhantomJS.
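A short sketch of both, still against the toy document; find() returns the first match and find_all() returns every match:

```python
print(soup.find("a"))                      # the first <a> tag
print(soup.find(id="link3"))               # the tag whose id attribute is "link3"

for link in soup.find_all("a"):            # every link in the document
    print(link.get("href"))

print(soup.find_all("p", class_="story"))  # all <p> tags with class "story"
```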
Conclusion
Since finding the right tags in the HTML source is hard, scraping data takes a lot of time, and the effort also depends on the amount of data extracted from a page. That is why a wait time is necessary so the browser can load the data before it is scraped. Depending on the computation speed and the resources available, one can scrape data from almost any website using the right tools.