Pandas Data Analysis short tutorial:

This is an assignment, a part of the course "Data Analysis with Python: Zero to Pandas"

This tutorial demonstrates data analysis using example from two data sets using the Pandas Library. All the important operations are described in markdown cells.

!pip install pandas --upgrade
Requirement already satisfied: pandas in /opt/conda/lib/python3.9/site-packages (1.3.4)
Requirement already satisfied: numpy>=1.17.3 in /opt/conda/lib/python3.9/site-packages (from pandas) (1.20.3)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.9/site-packages (from pandas) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.9/site-packages (from pandas) (2.8.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
import pandas as pd

In this tutorial, we're going to analyze an operate on data from a CSV file. Let's begin by downloading the CSV file.

from urllib.request import urlretrieve

urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/countries.csv', 
            'countries.csv')
('countries.csv', <http.client.HTTPMessage at 0x7ff1c18db130>)

Let's load the data from the CSV file into a Pandas data frame.

countries_df = pd.read_csv('countries.csv')
countries_df
location continent population life_expectancy hospital_beds_per_thousand gdp_per_capita
0 Afghanistan Asia 38928341.0 64.83 0.50 1803.987
1 Albania Europe 2877800.0 78.57 2.89 11803.431
2 Algeria Africa 43851043.0 76.88 1.90 13913.839
3 Andorra Europe 77265.0 83.73 NaN NaN
4 Angola Africa 32866268.0 61.15 NaN 5819.495
... ... ... ... ... ... ...
205 Vietnam Asia 97338583.0 75.40 2.60 6171.884
206 Western Sahara Africa 597330.0 70.26 NaN NaN
207 Yemen Asia 29825968.0 66.12 0.70 1479.147
208 Zambia Africa 18383956.0 63.89 2.00 3689.251
209 Zimbabwe Africa 14862927.0 61.49 1.70 1899.775

210 rows × 6 columns

Q1: How many countries does the dataframe contain?

num_countries,colparameters = countries_df.shape
print('There are {} countries in the dataset'.format(num_countries))
There are 210 countries in the dataset

Q2: Retrieve a list of continents from the dataframe?

continents = pd.Series(countries_df.continent).unique()    

    
continents
array(['Asia', 'Europe', 'Africa', 'North America', 'South America',
       'Oceania'], dtype=object)

Q3: What is the total population of all the countries listed in this dataset?

total_population = countries_df.population.sum()
print('The total population is {}.'.format(int(total_population)))
The total population is 7757980095.

Q: What is the overall life expectancy across in the world?

print((countries_df.population*countries_df.life_expectancy).sum()/countries_df.population.sum())
72.72165193409664
overall_life = countries_df['life_expectancy'].mean()
overall_life
73.52985507246376

Q4: Create a dataframe containing 10 countries with the highest population.

most_populous_df = (countries_df.sort_values(by=['population'],ascending = False)).head(10)
most_populous_df
location continent population life_expectancy hospital_beds_per_thousand gdp_per_capita
41 China Asia 1.439324e+09 76.91 4.34 15308.712
90 India Asia 1.380004e+09 69.66 0.53 6426.674
199 United States North America 3.310026e+08 78.86 2.77 54225.446
91 Indonesia Asia 2.735236e+08 71.72 1.04 11188.744
145 Pakistan Asia 2.208923e+08 67.27 0.60 5034.708
27 Brazil South America 2.125594e+08 75.88 2.20 14103.452
141 Nigeria Africa 2.061396e+08 54.69 NaN 5338.454
15 Bangladesh Asia 1.646894e+08 72.59 0.80 3523.984
157 Russia Europe 1.459345e+08 72.58 8.05 24765.954
125 Mexico North America 1.289328e+08 75.05 1.38 17336.469

Q5: Add a new column in countries_df to record the overall GDP per country (product of population & per capita GDP).

countries_df['gdp'] = countries_df.population * countries_df.gdp_per_capita
countries_df
location continent population life_expectancy hospital_beds_per_thousand gdp_per_capita gdp
0 Afghanistan Asia 38928341.0 64.83 0.50 1803.987 7.022622e+10
1 Albania Europe 2877800.0 78.57 2.89 11803.431 3.396791e+10
2 Algeria Africa 43851043.0 76.88 1.90 13913.839 6.101364e+11
3 Andorra Europe 77265.0 83.73 NaN NaN NaN
4 Angola Africa 32866268.0 61.15 NaN 5819.495 1.912651e+11
... ... ... ... ... ... ... ...
205 Vietnam Asia 97338583.0 75.40 2.60 6171.884 6.007624e+11
206 Western Sahara Africa 597330.0 70.26 NaN NaN NaN
207 Yemen Asia 29825968.0 66.12 0.70 1479.147 4.411699e+10
208 Zambia Africa 18383956.0 63.89 2.00 3689.251 6.782303e+10
209 Zimbabwe Africa 14862927.0 61.49 1.70 1899.775 2.823622e+10

210 rows × 7 columns

Q6: Create a data frame that counts the number countries in each continent?

country_counts_df = countries_df.groupby('continent').count()
country_counts_df
location population life_expectancy hospital_beds_per_thousand gdp_per_capita gdp
continent
Africa 55 55 55 40 53 53
Asia 47 47 47 43 45 45
Europe 51 51 48 43 42 42
North America 36 36 36 23 27 27
Oceania 8 8 8 3 4 4
South America 13 13 13 12 12 12

Q7: Create a data frame showing the total population of each continent.

continent_populations_df = countries_df.groupby('continent').sum()
continent_populations_df = continent_populations_df['population']
continent_populations_df
continent
Africa           1.339424e+09
Asia             4.607388e+09
Europe           7.485062e+08
North America    5.912425e+08
Oceania          4.095832e+07
South America    4.304611e+08
Name: population, dtype: float64

Let's download another CSV file containing overall Covid-19 stats for various countires, and read the data into another Pandas data frame.

urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/covid-countries-data.csv', 
            'covid-countries-data.csv')
('covid-countries-data.csv', <http.client.HTTPMessage at 0x7ff1789d6c10>)
covid_data_df = pd.read_csv('covid-countries-data.csv')
covid_data_df
location total_cases total_deaths total_tests
0 Afghanistan 38243.0 1409.0 NaN
1 Albania 9728.0 296.0 NaN
2 Algeria 45158.0 1525.0 NaN
3 Andorra 1199.0 53.0 NaN
4 Angola 2729.0 109.0 NaN
... ... ... ... ...
207 Western Sahara 766.0 1.0 NaN
208 World 26059065.0 863535.0 NaN
209 Yemen 1976.0 571.0 NaN
210 Zambia 12415.0 292.0 NaN
211 Zimbabwe 6638.0 206.0 97272.0

212 rows × 4 columns

Q8: Count the number of countries for which the total_tests data is missing.

Hint: Use the .isna method.

total_tests_miss = covid_data_df.isna()
#total_tests_missing = (total_tests_miss[total_tests_miss.total_tests == True]).count()
total_tests_missing = total_tests_miss['total_tests'].values.sum()
print("The data for total tests is missing for {} countries.".format(int(total_tests_missing)))
The data for total tests is missing for 122 countries.

Let's merge the two data frames, and compute some more metrics.

Q9: Merge countries_df with covid_data_df on the location column.

combined_df = countries_df.merge(covid_data_df,on='location')
combined_df
location continent population life_expectancy hospital_beds_per_thousand gdp_per_capita gdp total_cases total_deaths total_tests
0 Afghanistan Asia 38928341.0 64.83 0.50 1803.987 7.022622e+10 38243.0 1409.0 NaN
1 Albania Europe 2877800.0 78.57 2.89 11803.431 3.396791e+10 9728.0 296.0 NaN
2 Algeria Africa 43851043.0 76.88 1.90 13913.839 6.101364e+11 45158.0 1525.0 NaN
3 Andorra Europe 77265.0 83.73 NaN NaN NaN 1199.0 53.0 NaN
4 Angola Africa 32866268.0 61.15 NaN 5819.495 1.912651e+11 2729.0 109.0 NaN
... ... ... ... ... ... ... ... ... ... ...
205 Vietnam Asia 97338583.0 75.40 2.60 6171.884 6.007624e+11 1046.0 35.0 261004.0
206 Western Sahara Africa 597330.0 70.26 NaN NaN NaN 766.0 1.0 NaN
207 Yemen Asia 29825968.0 66.12 0.70 1479.147 4.411699e+10 1976.0 571.0 NaN
208 Zambia Africa 18383956.0 63.89 2.00 3689.251 6.782303e+10 12415.0 292.0 NaN
209 Zimbabwe Africa 14862927.0 61.49 1.70 1899.775 2.823622e+10 6638.0 206.0 97272.0

210 rows × 10 columns

Q10: Add columns tests_per_million, cases_per_million and deaths_per_million into combined_df.

combined_df['tests_per_million'] = combined_df['total_tests'] * 1e6 / combined_df['population']
combined_df['cases_per_million'] = combined_df['total_cases'] * 1e6 / combined_df['population']
combined_df['deaths_per_million'] = combined_df['total_deaths'] * 1e6 / combined_df['population']
combined_df
location continent population life_expectancy hospital_beds_per_thousand gdp_per_capita gdp total_cases total_deaths total_tests tests_per_million cases_per_million deaths_per_million
0 Afghanistan Asia 38928341.0 64.83 0.50 1803.987 7.022622e+10 38243.0 1409.0 NaN NaN 982.394806 36.194710
1 Albania Europe 2877800.0 78.57 2.89 11803.431 3.396791e+10 9728.0 296.0 NaN NaN 3380.359997 102.856349
2 Algeria Africa 43851043.0 76.88 1.90 13913.839 6.101364e+11 45158.0 1525.0 NaN NaN 1029.804468 34.776824
3 Andorra Europe 77265.0 83.73 NaN NaN NaN 1199.0 53.0 NaN NaN 15518.022390 685.950948
4 Angola Africa 32866268.0 61.15 NaN 5819.495 1.912651e+11 2729.0 109.0 NaN NaN 83.033462 3.316470
... ... ... ... ... ... ... ... ... ... ... ... ... ...
205 Vietnam Asia 97338583.0 75.40 2.60 6171.884 6.007624e+11 1046.0 35.0 261004.0 2681.403324 10.745996 0.359570
206 Western Sahara Africa 597330.0 70.26 NaN NaN NaN 766.0 1.0 NaN NaN 1282.373228 1.674116
207 Yemen Asia 29825968.0 66.12 0.70 1479.147 4.411699e+10 1976.0 571.0 NaN NaN 66.250993 19.144391
208 Zambia Africa 18383956.0 63.89 2.00 3689.251 6.782303e+10 12415.0 292.0 NaN NaN 675.317108 15.883415
209 Zimbabwe Africa 14862927.0 61.49 1.70 1899.775 2.823622e+10 6638.0 206.0 97272.0 6544.605918 446.614587 13.859989

210 rows × 13 columns

Q11: Create a dataframe with 10 countires that have highest number of tests per million people.

highest_tests_df = combined_df.nlargest(10,["tests_per_million"])
highest_tests_df
location continent population life_expectancy hospital_beds_per_thousand gdp_per_capita gdp total_cases total_deaths total_tests tests_per_million cases_per_million deaths_per_million
197 United Arab Emirates Asia 9890400.0 77.97 1.200 67293.483 6.655595e+11 71540.0 387.0 7177430.0 725696.635121 7233.276713 39.128852
14 Bahrain Asia 1701583.0 77.29 2.000 43290.705 7.366273e+10 52440.0 190.0 1118837.0 657527.137965 30818.361490 111.660730
115 Luxembourg Europe 625976.0 82.25 4.510 94277.965 5.901574e+10 7928.0 124.0 385820.0 616349.508607 12665.022301 198.090662
122 Malta Europe 441539.0 82.53 4.485 36513.323 1.612206e+10 1931.0 13.0 188539.0 427004.183096 4373.339614 29.442473
53 Denmark Europe 5792203.0 80.90 2.500 46682.515 2.703946e+11 17195.0 626.0 2447911.0 422621.755488 2968.645954 108.076323
96 Israel Asia 8655541.0 82.97 2.990 33132.320 2.867782e+11 122539.0 969.0 2353984.0 271962.665303 14157.289533 111.951408
89 Iceland Europe 341250.0 82.99 2.910 46482.958 1.586231e+10 2121.0 10.0 88829.0 260304.761905 6215.384615 29.304029
157 Russia Europe 145934460.0 72.58 8.050 24765.954 3.614206e+12 1005000.0 17414.0 37176827.0 254750.159763 6886.653091 119.327539
199 United States North America 331002647.0 78.86 2.770 54225.446 1.794877e+13 6114406.0 185744.0 83898416.0 253467.507769 18472.377957 561.155633
10 Australia Oceania 25499881.0 83.44 3.840 44648.710 1.138537e+12 25923.0 663.0 6255797.0 245326.517406 1016.592979 26.000121

Q12: Create a dataframe with 10 countires that have highest number of positive cases per million people.

highest_cases_df =  combined_df.nlargest(10,["cases_per_million"])
highest_cases_df
location continent population life_expectancy hospital_beds_per_thousand gdp_per_capita gdp total_cases total_deaths total_tests tests_per_million cases_per_million deaths_per_million
155 Qatar Asia 2881060.0 80.23 1.20 116935.600 3.368985e+11 119206.0 199.0 634745.0 220316.480740 41375.743650 69.071800
14 Bahrain Asia 1701583.0 77.29 2.00 43290.705 7.366273e+10 52440.0 190.0 1118837.0 657527.137965 30818.361490 111.660730
147 Panama North America 4314768.0 78.51 2.30 22267.037 9.607710e+10 94084.0 2030.0 336345.0 77952.047480 21805.112117 470.477208
40 Chile South America 19116209.0 80.18 2.11 22767.037 4.352194e+11 414739.0 11344.0 2458762.0 128621.841287 21695.671982 593.423100
162 San Marino Europe 33938.0 84.97 3.80 56861.470 1.929765e+09 735.0 42.0 NaN NaN 21657.139490 1237.550828
9 Aruba North America 106766.0 76.29 NaN 35973.781 3.840777e+09 2211.0 12.0 NaN NaN 20708.839893 112.395332
105 Kuwait Asia 4270563.0 75.49 2.00 65530.537 2.798523e+11 86478.0 535.0 621616.0 145558.325682 20249.789079 125.276222
150 Peru South America 32971846.0 76.74 1.60 12236.706 4.034668e+11 663437.0 29259.0 584232.0 17719.117092 20121.318048 887.393445
27 Brazil South America 212559409.0 75.88 2.20 14103.452 2.997821e+12 3997865.0 123780.0 4797948.0 22572.268255 18808.224105 582.331314
199 United States North America 331002647.0 78.86 2.77 54225.446 1.794877e+13 6114406.0 185744.0 83898416.0 253467.507769 18472.377957 561.155633

Q13: Create a dataframe with 10 countires that have highest number of deaths cases per million people?

highest_deaths_df = combined_df.nlargest(10,["deaths_per_million"])
highest_deaths_df
location continent population life_expectancy hospital_beds_per_thousand gdp_per_capita gdp total_cases total_deaths total_tests tests_per_million cases_per_million deaths_per_million
162 San Marino Europe 33938.0 84.97 3.80 56861.470 1.929765e+09 735.0 42.0 NaN NaN 21657.139490 1237.550828
150 Peru South America 32971846.0 76.74 1.60 12236.706 4.034668e+11 663437.0 29259.0 584232.0 17719.117092 20121.318048 887.393445
18 Belgium Europe 11589616.0 81.63 5.64 42658.576 4.943965e+11 85817.0 9898.0 2281853.0 196887.713967 7404.645676 854.040375
3 Andorra Europe 77265.0 83.73 NaN NaN NaN 1199.0 53.0 NaN NaN 15518.022390 685.950948
177 Spain Europe 46754783.0 83.56 2.97 34272.360 1.602397e+12 479554.0 29194.0 6416533.0 137238.001939 10256.790198 624.406705
198 United Kingdom Europe 67886004.0 81.32 2.54 39753.244 2.698689e+12 338676.0 41514.0 13447568.0 198090.434075 4988.892850 611.525168
40 Chile South America 19116209.0 80.18 2.11 22767.037 4.352194e+11 414739.0 11344.0 2458762.0 128621.841287 21695.671982 593.423100
97 Italy Europe 60461828.0 83.51 3.18 35220.084 2.129471e+12 271515.0 35497.0 5214766.0 86248.897403 4490.684602 587.097697
27 Brazil South America 212559409.0 75.88 2.20 14103.452 2.997821e+12 3997865.0 123780.0 4797948.0 22572.268255 18808.224105 582.331314
182 Sweden Europe 10099270.0 82.80 2.22 46949.283 4.741535e+11 84532.0 5820.0 NaN NaN 8370.109919 576.279276
Back to top of page