Pandas Data Analysis short tutorial:

This is an assignment, a part of the course "Data Analysis with Python: Zero to Pandas"

This tutorial demonstrates data analysis using example from two data sets using the Pandas Library. All the important operations are described in markdown cells.

!pip install pandas --upgrade

Requirement already satisfied: pandas in /opt/conda/lib/python3.9/site-packages (1.3.4)
Requirement already satisfied: numpy>=1.17.3 in /opt/conda/lib/python3.9/site-packages (from pandas) (1.20.3)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.9/site-packages (from pandas) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.9/site-packages (from pandas) (2.8.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)

import pandas as pd

In this tutorial, we're going to analyze an operate on data from a CSV file. Let's begin by downloading the CSV file.

from urllib.request import urlretrieve

urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/countries.csv', 
            'countries.csv')

('countries.csv', <http.client.HTTPMessage at 0x7ff1c18db130>)

Let's load the data from the CSV file into a Pandas data frame.

countries_df = pd.read_csv('countries.csv')

countries_df

Q1: How many countries does the dataframe contain?

num_countries,colparameters = countries_df.shape

print('There are {} countries in the dataset'.format(num_countries))

There are 210 countries in the dataset

Q2: Retrieve a list of continents from the dataframe?

continents = pd.Series(countries_df.continent).unique()

continents

array(['Asia', 'Europe', 'Africa', 'North America', 'South America',
       'Oceania'], dtype=object)

Q3: What is the total population of all the countries listed in this dataset?

total_population = countries_df.population.sum()

print('The total population is {}.'.format(int(total_population)))

The total population is 7757980095.

Q: What is the overall life expectancy across in the world?

print((countries_df.population*countries_df.life_expectancy).sum()/countries_df.population.sum())

72.72165193409664

overall_life = countries_df['life_expectancy'].mean()

overall_life

73.52985507246376

Q4: Create a dataframe containing 10 countries with the highest population.

most_populous_df = (countries_df.sort_values(by=['population'],ascending = False)).head(10)

most_populous_df

Q5: Add a new column in countries_df to record the overall GDP per country (product of population & per capita GDP).

countries_df['gdp'] = countries_df.population * countries_df.gdp_per_capita

countries_df

Q6: Create a data frame that counts the number countries in each continent?

country_counts_df = countries_df.groupby('continent').count()

country_counts_df

Q7: Create a data frame showing the total population of each continent.

continent_populations_df = countries_df.groupby('continent').sum()

continent_populations_df = continent_populations_df['population']

continent_populations_df

continent
Africa           1.339424e+09
Asia             4.607388e+09
Europe           7.485062e+08
North America    5.912425e+08
Oceania          4.095832e+07
South America    4.304611e+08
Name: population, dtype: float64

Let's download another CSV file containing overall Covid-19 stats for various countires, and read the data into another Pandas data frame.

urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/covid-countries-data.csv', 
            'covid-countries-data.csv')

('covid-countries-data.csv', <http.client.HTTPMessage at 0x7ff1789d6c10>)

covid_data_df = pd.read_csv('covid-countries-data.csv')

covid_data_df

Q8: Count the number of countries for which the total_tests data is missing.

Hint: Use the .isna method.

total_tests_miss = covid_data_df.isna()
#total_tests_missing = (total_tests_miss[total_tests_miss.total_tests == True]).count()
total_tests_missing = total_tests_miss['total_tests'].values.sum()

print("The data for total tests is missing for {} countries.".format(int(total_tests_missing)))

The data for total tests is missing for 122 countries.

Let's merge the two data frames, and compute some more metrics.

Q9: Merge countries_df with covid_data_df on the location column.

combined_df = countries_df.merge(covid_data_df,on='location')

combined_df

Q10: Add columns tests_per_million, cases_per_million and deaths_per_million into combined_df.

combined_df['tests_per_million'] = combined_df['total_tests'] * 1e6 / combined_df['population']

combined_df['cases_per_million'] = combined_df['total_cases'] * 1e6 / combined_df['population']

combined_df['deaths_per_million'] = combined_df['total_deaths'] * 1e6 / combined_df['population']

combined_df

Q11: Create a dataframe with 10 countires that have highest number of tests per million people.

highest_tests_df = combined_df.nlargest(10,["tests_per_million"])

highest_tests_df

Q12: Create a dataframe with 10 countires that have highest number of positive cases per million people.

highest_cases_df =  combined_df.nlargest(10,["cases_per_million"])

highest_cases_df

Q13: Create a dataframe with 10 countires that have highest number of deaths cases per million people?

highest_deaths_df = combined_df.nlargest(10,["deaths_per_million"])

highest_deaths_df

	location	continent	population	life_expectancy	hospital_beds_per_thousand	gdp_per_capita
0	Afghanistan	Asia	38928341.0	64.83	0.50	1803.987
1	Albania	Europe	2877800.0	78.57	2.89	11803.431
2	Algeria	Africa	43851043.0	76.88	1.90	13913.839
3	Andorra	Europe	77265.0	83.73	NaN	NaN
4	Angola	Africa	32866268.0	61.15	NaN	5819.495
...	...	...	...	...	...	...
205	Vietnam	Asia	97338583.0	75.40	2.60	6171.884
206	Western Sahara	Africa	597330.0	70.26	NaN	NaN
207	Yemen	Asia	29825968.0	66.12	0.70	1479.147
208	Zambia	Africa	18383956.0	63.89	2.00	3689.251
209	Zimbabwe	Africa	14862927.0	61.49	1.70	1899.775

	location	continent	population	life_expectancy	hospital_beds_per_thousand	gdp_per_capita
41	China	Asia	1.439324e+09	76.91	4.34	15308.712
90	India	Asia	1.380004e+09	69.66	0.53	6426.674
199	United States	North America	3.310026e+08	78.86	2.77	54225.446
91	Indonesia	Asia	2.735236e+08	71.72	1.04	11188.744
145	Pakistan	Asia	2.208923e+08	67.27	0.60	5034.708
27	Brazil	South America	2.125594e+08	75.88	2.20	14103.452
141	Nigeria	Africa	2.061396e+08	54.69	NaN	5338.454
15	Bangladesh	Asia	1.646894e+08	72.59	0.80	3523.984
157	Russia	Europe	1.459345e+08	72.58	8.05	24765.954
125	Mexico	North America	1.289328e+08	75.05	1.38	17336.469

	location	continent	population	life_expectancy	hospital_beds_per_thousand	gdp_per_capita	gdp
0	Afghanistan	Asia	38928341.0	64.83	0.50	1803.987	7.022622e+10
1	Albania	Europe	2877800.0	78.57	2.89	11803.431	3.396791e+10
2	Algeria	Africa	43851043.0	76.88	1.90	13913.839	6.101364e+11
3	Andorra	Europe	77265.0	83.73	NaN	NaN	NaN
4	Angola	Africa	32866268.0	61.15	NaN	5819.495	1.912651e+11
...	...	...	...	...	...	...	...
205	Vietnam	Asia	97338583.0	75.40	2.60	6171.884	6.007624e+11
206	Western Sahara	Africa	597330.0	70.26	NaN	NaN	NaN
207	Yemen	Asia	29825968.0	66.12	0.70	1479.147	4.411699e+10
208	Zambia	Africa	18383956.0	63.89	2.00	3689.251	6.782303e+10
209	Zimbabwe	Africa	14862927.0	61.49	1.70	1899.775	2.823622e+10

	location	total_cases	total_deaths	total_tests
0	Afghanistan	38243.0	1409.0	NaN
1	Albania	9728.0	296.0	NaN
2	Algeria	45158.0	1525.0	NaN
3	Andorra	1199.0	53.0	NaN
4	Angola	2729.0	109.0	NaN
...	...	...	...	...
207	Western Sahara	766.0	1.0	NaN
208	World	26059065.0	863535.0	NaN
209	Yemen	1976.0	571.0	NaN
210	Zambia	12415.0	292.0	NaN
211	Zimbabwe	6638.0	206.0	97272.0

	location	continent	population	life_expectancy	hospital_beds_per_thousand	gdp_per_capita	gdp	total_cases	total_deaths	total_tests
0	Afghanistan	Asia	38928341.0	64.83	0.50	1803.987	7.022622e+10	38243.0	1409.0	NaN
1	Albania	Europe	2877800.0	78.57	2.89	11803.431	3.396791e+10	9728.0	296.0	NaN
2	Algeria	Africa	43851043.0	76.88	1.90	13913.839	6.101364e+11	45158.0	1525.0	NaN
3	Andorra	Europe	77265.0	83.73	NaN	NaN	NaN	1199.0	53.0	NaN
4	Angola	Africa	32866268.0	61.15	NaN	5819.495	1.912651e+11	2729.0	109.0	NaN
...	...	...	...	...	...	...	...	...	...	...
205	Vietnam	Asia	97338583.0	75.40	2.60	6171.884	6.007624e+11	1046.0	35.0	261004.0
206	Western Sahara	Africa	597330.0	70.26	NaN	NaN	NaN	766.0	1.0	NaN
207	Yemen	Asia	29825968.0	66.12	0.70	1479.147	4.411699e+10	1976.0	571.0	NaN
208	Zambia	Africa	18383956.0	63.89	2.00	3689.251	6.782303e+10	12415.0	292.0	NaN
209	Zimbabwe	Africa	14862927.0	61.49	1.70	1899.775	2.823622e+10	6638.0	206.0	97272.0

	location	continent	population	life_expectancy	hospital_beds_per_thousand	gdp_per_capita	gdp	total_cases	total_deaths	total_tests	tests_per_million	cases_per_million	deaths_per_million
197	United Arab Emirates	Asia	9890400.0	77.97	1.200	67293.483	6.655595e+11	71540.0	387.0	7177430.0	725696.635121	7233.276713	39.128852
14	Bahrain	Asia	1701583.0	77.29	2.000	43290.705	7.366273e+10	52440.0	190.0	1118837.0	657527.137965	30818.361490	111.660730
115	Luxembourg	Europe	625976.0	82.25	4.510	94277.965	5.901574e+10	7928.0	124.0	385820.0	616349.508607	12665.022301	198.090662
122	Malta	Europe	441539.0	82.53	4.485	36513.323	1.612206e+10	1931.0	13.0	188539.0	427004.183096	4373.339614	29.442473
53	Denmark	Europe	5792203.0	80.90	2.500	46682.515	2.703946e+11	17195.0	626.0	2447911.0	422621.755488	2968.645954	108.076323
96	Israel	Asia	8655541.0	82.97	2.990	33132.320	2.867782e+11	122539.0	969.0	2353984.0	271962.665303	14157.289533	111.951408
89	Iceland	Europe	341250.0	82.99	2.910	46482.958	1.586231e+10	2121.0	10.0	88829.0	260304.761905	6215.384615	29.304029
157	Russia	Europe	145934460.0	72.58	8.050	24765.954	3.614206e+12	1005000.0	17414.0	37176827.0	254750.159763	6886.653091	119.327539
199	United States	North America	331002647.0	78.86	2.770	54225.446	1.794877e+13	6114406.0	185744.0	83898416.0	253467.507769	18472.377957	561.155633
10	Australia	Oceania	25499881.0	83.44	3.840	44648.710	1.138537e+12	25923.0	663.0	6255797.0	245326.517406	1016.592979	26.000121

	location	continent	population	life_expectancy	hospital_beds_per_thousand	gdp_per_capita	gdp	total_cases	total_deaths	total_tests	tests_per_million	cases_per_million	deaths_per_million
155	Qatar	Asia	2881060.0	80.23	1.20	116935.600	3.368985e+11	119206.0	199.0	634745.0	220316.480740	41375.743650	69.071800
14	Bahrain	Asia	1701583.0	77.29	2.00	43290.705	7.366273e+10	52440.0	190.0	1118837.0	657527.137965	30818.361490	111.660730
147	Panama	North America	4314768.0	78.51	2.30	22267.037	9.607710e+10	94084.0	2030.0	336345.0	77952.047480	21805.112117	470.477208
40	Chile	South America	19116209.0	80.18	2.11	22767.037	4.352194e+11	414739.0	11344.0	2458762.0	128621.841287	21695.671982	593.423100
162	San Marino	Europe	33938.0	84.97	3.80	56861.470	1.929765e+09	735.0	42.0	NaN	NaN	21657.139490	1237.550828
9	Aruba	North America	106766.0	76.29	NaN	35973.781	3.840777e+09	2211.0	12.0	NaN	NaN	20708.839893	112.395332
105	Kuwait	Asia	4270563.0	75.49	2.00	65530.537	2.798523e+11	86478.0	535.0	621616.0	145558.325682	20249.789079	125.276222
150	Peru	South America	32971846.0	76.74	1.60	12236.706	4.034668e+11	663437.0	29259.0	584232.0	17719.117092	20121.318048	887.393445
27	Brazil	South America	212559409.0	75.88	2.20	14103.452	2.997821e+12	3997865.0	123780.0	4797948.0	22572.268255	18808.224105	582.331314
199	United States	North America	331002647.0	78.86	2.77	54225.446	1.794877e+13	6114406.0	185744.0	83898416.0	253467.507769	18472.377957	561.155633