
Building a Web Scraper for citypopulation.de

  • Rohan Dawar
  • Jul 27, 2021
  • 2 min read


Introduction

In this project I will be scraping the website citypopulation.de with BeautifulSoup to create a dataframe and CSV file of populations for sub-national entities.

What is scraping? Web scraping is the process of extracting content and data from a website through its HTML code.

What is https://www.citypopulation.de/ ? This website provides up-to-date data on population and area for all countries of the world, including territories and subdivisions.


What is Beautiful Soup? Beautiful Soup (a.k.a. BS4) is a Python library for pulling data out of HTML pages. In this project, I will be using BS4 to get the country pages within a continent, as well as to parse the population data from the subdivisions of each country.

What are sub-national entities? Sub-national entities are any administrative or census divisions within a country, such as provinces, states, territories and municipalities. citypopulation.de aims to keep up-to-date population data for all national and sub-national entities on Earth.



Outline & Narrative

Download webpage using requests

To begin, we'll use the requests library to download the webpage and create some simple functions to help parse URL strings:
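Something like the following might do the job (a minimal sketch; the helper names get_html, strip_suffix and last_segment are my own reconstructions, not the original code):

    import requests

    BASE_URL = 'https://www.citypopulation.de'

    def get_html(url):
        # Download a page and return its raw HTML text
        response = requests.get(url)
        response.raise_for_status()
        return response.text

    def strip_suffix(href):
        # Drop a trailing '.html' from a URL path, if present
        return href[:-len('.html')] if href.endswith('.html') else href

    def last_segment(href):
        # Return the final path segment, e.g. '/en/italy/' -> 'italy'
        return href.rstrip('/').split('/')[-1]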

Next, we will create a Country class for easy access to each country's attributes: its URL, its name, whether its URL carries an .html suffix, and the continent it belongs to:
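A sketch of such a class, assuming exactly the four attributes described above:

    class Country:
        def __init__(self, url, name, is_html, continent):
            self.url = url              # full URL of the country page
            self.name = name            # country name, e.g. 'Italy'
            self.is_html = is_html      # whether the URL ends in an .html suffix
            self.continent = continent  # continent the country belongs to

        def __repr__(self):
            return f'Country({self.name!r}, {self.continent!r})'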

Next, a helper function ContinentDict that takes a list of continents and returns a dictionary whose keys are continents and whose values are the lists of countries belonging to them, parsed from each continent's HTML page on citypopulation.de:
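One possible implementation, reusing get_html from above. The '/en/<continent>/' URL pattern and the link filter are assumptions about the site's markup, so they may need adjusting against the live pages:

    from bs4 import BeautifulSoup

    def ContinentDict(continents):
        result = {}
        for continent in continents:
            # e.g. https://www.citypopulation.de/en/europe/
            html = get_html(f'{BASE_URL}/en/{continent.lower()}/')
            soup = BeautifulSoup(html, 'html.parser')
            countries = []
            for link in soup.find_all('a', href=True):
                name = link.get_text(strip=True)
                href = link['href']
                # crude filter: keep named links that point into the English section
                if name and href.startswith('/en/'):
                    countries.append((name, href))
            result[continent] = countries
        return result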

Next, a simple function to create Country objects from the dictionary returned by the ContinentDict function:
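A sketch, assuming the (name, href) tuples produced by the ContinentDict sketch above:

    def make_countries(continent_dict):
        # Hypothetical helper: turn the continent dictionary into Country objects
        countries = []
        for continent, entries in continent_dict.items():
            for name, href in entries:
                url = BASE_URL + href
                countries.append(Country(url, name, href.endswith('.html'), continent))
        return countries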


Test Parsing

Using Beautiful Soup objects to inform our function building:

To find the date, we can use the find function to search for the class 'rpop prio1':
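For example (the country page URL here is just an illustration, and whether the matching cell's text is exactly the reference date is an assumption based on the site's markup):

    from bs4 import BeautifulSoup

    html = get_html('https://www.citypopulation.de/en/france/cities/')
    soup = BeautifulSoup(html, 'html.parser')

    # The site marks its most recent population column with the class 'rpop prio1'
    date_cell = soup.find(class_='rpop prio1')
    print(date_cell.get_text(strip=True) if date_cell else 'not found')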


Pandas Dataframe

Writing functions to handle Beautiful Soup objects and return a dataframe:

We can start with the 'deepest' function that adds a city (or any subdivision) to the passed dataframe:
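A sketch of that function; the column names are my assumptions about the final schema:

    import pandas as pd

    def add_city(df, name, population, date, country, continent):
        # Append one subdivision as a single-row DataFrame; the index is cleaned up later
        row = pd.DataFrame([{'Name': name, 'Population': population, 'Date': date,
                             'Country': country, 'Continent': continent}])
        return pd.concat([df, row])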

Our next 'layer' up is adding a table (i.e. a set of subdivisions) to the passed dataframe:
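A possible version, assuming the name cells use a 'rname' class alongside the 'rpop prio1' population cells seen earlier (both class names should be verified against the live markup):

    def add_table(df, table, date, country, continent):
        for row in table.find_all('tr'):
            name_cell = row.find(class_='rname')
            pop_cell = row.find(class_='rpop prio1')
            if name_cell and pop_cell:
                pop_text = pop_cell.get_text(strip=True).replace(',', '')
                population = int(pop_text) if pop_text.isdigit() else None
                df = add_city(df, name_cell.get_text(strip=True),
                              population, date, country, continent)
        return df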

Next we can write a function that takes the Country object and finds all the subdivisions to be added to the passed dataframe:
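Sketched under the same assumptions, taking the reference date from the first 'rpop prio1' cell on the page:

    def add_country(df, country):
        soup = BeautifulSoup(get_html(country.url), 'html.parser')
        date_cell = soup.find(class_='rpop prio1')
        date = date_cell.get_text(strip=True) if date_cell else None
        for table in soup.find_all('table'):
            df = add_table(df, table, date, country.name, country.continent)
        return df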

And finally, our 'top' layer is a function that takes the list of Country objects and the passed dataframe, and feeds them through the functions above:
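Which might look as simple as:

    def build_dataframe(countries, df):
        # Top layer: add every country in the list to the passed dataframe
        for country in countries:
            df = add_country(df, country)
        return df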

Testing our functions, we can see in the output that our dataframe is built:
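A test run over a single continent might look like this (the column names again follow the schema assumed above):

    continents = ['Europe']
    countries = make_countries(ContinentDict(continents))
    df = build_dataframe(countries, pd.DataFrame(
        columns=['Name', 'Population', 'Date', 'Country', 'Continent']))
    print(df.head())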

Now we can reset the index to get our final dataframe:
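Because every row was concatenated with its own zero index, reset_index replaces the repeated 0s with a clean sequential index:

    df = df.reset_index(drop=True)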


Example Analysis

Preliminary analyses on our dataframe using pandas:
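A couple of illustrative queries, assuming the column names sketched earlier:

    # Make sure Population is numeric before ranking
    df['Population'] = pd.to_numeric(df['Population'], errors='coerce')

    # Ten most populous subdivisions in the dataframe
    print(df.nlargest(10, 'Population'))

    # Total recorded subdivision population per country, largest first
    print(df.groupby('Country')['Population'].sum().sort_values(ascending=False))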

As we can see, this dataframe accurately indexes the data from the HTML page.


Conclusion

Exporting, Summary, Future Work & References:
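The finished dataframe can be written out with pandas' built-in CSV writer (the filename here is arbitrary):

    df.to_csv('subnational_populations.csv', index=False)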


 
 
 
