Pandas Data Analysis Basics Tutorial
A short tutorial on basic pandas operations using Python.
This is an assignment from the course "Data Analysis with Python: Zero to Pandas".
This tutorial demonstrates data analysis with the Pandas library, using examples from two data sets. All the important operations are described in markdown cells.
!pip install pandas --upgrade
import pandas as pd
In this tutorial, we're going to analyze and operate on data from a CSV file. Let's begin by downloading the CSV file.
from urllib.request import urlretrieve
urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/countries.csv',
'countries.csv')
Let's load the data from the CSV file into a Pandas data frame.
countries_df = pd.read_csv('countries.csv')
countries_df
Q1: How many countries does the dataframe contain?
num_countries,colparameters = countries_df.shape
print('There are {} countries in the dataset'.format(num_countries))
Q2: Retrieve a list of continents from the dataframe.
continents = countries_df.continent.unique()
continents
Q3: What is the total population of all the countries listed in this dataset?
total_population = countries_df.population.sum()
print('The total population is {}.'.format(int(total_population)))
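Q: What is the overall life expectancy across the world?
# Population-weighted average: each country's life expectancy counts in proportion to its population
print((countries_df.population * countries_df.life_expectancy).sum() / countries_df.population.sum())
# Simple (unweighted) average across countries, for comparison
overall_life = countries_df['life_expectancy'].mean()
overall_life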
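The two figures above answer slightly different questions, so a small sketch may help show why they differ. The data frame `toy_df` and its values are hypothetical and are not taken from `countries.csv`.

```python
import pandas as pd

# Hypothetical toy data (not from countries.csv): one large and one small country
toy_df = pd.DataFrame({
    'location': ['A', 'B'],
    'population': [9_000_000, 1_000_000],
    'life_expectancy': [70.0, 80.0],
})

# Simple mean: every country counts equally
simple_mean = toy_df.life_expectancy.mean()

# Weighted mean: every person counts equally
weighted_mean = (toy_df.population * toy_df.life_expectancy).sum() / toy_df.population.sum()

print(simple_mean)    # 75.0
print(weighted_mean)  # 71.0
```

The weighted value sits closer to the larger country's figure. One caveat to keep in mind: `sum()` skips missing values by default, so rows with a missing life expectancy would drop out of the numerator while their population still counts in the denominator.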
Q4: Create a dataframe containing 10 countries with the highest population.
most_populous_df = countries_df.sort_values(by='population', ascending=False).head(10)
most_populous_df
Q5: Add a new column in countries_df to record the overall GDP per country (product of population & per capita GDP).
countries_df['gdp'] = countries_df.population * countries_df.gdp_per_capita
countries_df
Q6: Create a data frame that counts the number of countries in each continent.
country_counts_df = countries_df.groupby('continent').count()
country_counts_df
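Note that `count()` reports the number of non-missing values in each column, so the columns of the result can disagree when data is missing. The sketch below contrasts it with `size()`, which counts rows. The toy data and the names `per_column_counts` and `row_counts` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data with one missing population value
toy_df = pd.DataFrame({
    'location': ['A', 'B', 'C'],
    'continent': ['X', 'X', 'Y'],
    'population': [1.0, None, 3.0],
})

# count(): non-missing values per column, per group
per_column_counts = toy_df.groupby('continent').count()

# size(): number of rows per group, regardless of missing values
row_counts = toy_df.groupby('continent').size()

print(per_column_counts)
print(row_counts)
```

Here continent `X` has two rows, but its `population` count is 1 because one value is missing.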
Q7: Create a data frame showing the total population of each continent.
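# Select the population column before summing so that text columns are not aggregated;
# the double brackets keep the result a data frame rather than a series
continent_populations_df = countries_df.groupby('continent')[['population']].sum()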
continent_populations_df
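Whether a grouped aggregation returns a series or a data frame depends on how the column is selected. A minimal sketch, using hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data (not from countries.csv)
toy_df = pd.DataFrame({
    'location': ['A', 'B', 'C'],
    'continent': ['X', 'X', 'Y'],
    'population': [10, 20, 5],
})

# Single brackets: the result is a Series indexed by continent
as_series = toy_df.groupby('continent')['population'].sum()

# Double brackets: the result is a DataFrame with one column
as_frame = toy_df.groupby('continent')[['population']].sum()

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

Selecting the column before calling `sum()` also avoids aggregating columns that are not needed, such as text columns.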
Let's download another CSV file containing overall Covid-19 stats for various countries, and read the data into another Pandas data frame.
urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/covid-countries-data.csv',
'covid-countries-data.csv')
covid_data_df = pd.read_csv('covid-countries-data.csv')
covid_data_df
Q8: Count the number of countries for which the total_tests data is missing.
Hint: Use the .isna method.
# isna() marks missing entries as True; summing the boolean mask counts them
total_tests_missing = covid_data_df['total_tests'].isna().sum()
print("The data for total tests is missing for {} countries.".format(int(total_tests_missing)))
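Counting missing values works because `True` is treated as 1 and `False` as 0 when a boolean series is summed. A small sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of three countries lack a total_tests value
toy_df = pd.DataFrame({
    'location': ['A', 'B', 'C'],
    'total_tests': [100.0, np.nan, np.nan],
})

# Boolean mask: True where total_tests is missing
missing_mask = toy_df['total_tests'].isna()

# Summing the mask counts the True entries
missing_count = int(missing_mask.sum())

print(missing_count)  # 2
```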
Let's merge the two data frames, and compute some more metrics.
Q9: Merge countries_df with covid_data_df on the location column.
combined_df = countries_df.merge(covid_data_df,on='location')
combined_df
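By default, `merge` performs an inner join, so any location that appears in only one of the two data frames is dropped from the result. The sketch below, on hypothetical data, shows one way to check for such rows using the `how` and `indicator` parameters of `merge`. The names `left_df`, `right_df` and `unmatched` are illustrative.

```python
import pandas as pd

# Hypothetical toy data: 'A' appears only on the left, 'D' only on the right
left_df = pd.DataFrame({'location': ['A', 'B', 'C'], 'population': [1, 2, 3]})
right_df = pd.DataFrame({'location': ['B', 'C', 'D'], 'total_cases': [20, 30, 40]})

# Default inner join: keeps only locations present in both frames
inner_df = left_df.merge(right_df, on='location')

# Outer join with indicator=True adds a '_merge' column showing each row's origin
outer_df = left_df.merge(right_df, on='location', how='outer', indicator=True)
unmatched = outer_df[outer_df['_merge'] != 'both']

print(len(inner_df))  # 2
print(unmatched[['location', '_merge']])
```

If `unmatched` is not empty on the real data, the cause may be differences in how the two files spell country names.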
Q10: Add columns tests_per_million, cases_per_million and deaths_per_million into combined_df.
combined_df['tests_per_million'] = combined_df['total_tests'] * 1e6 / combined_df['population']
combined_df['cases_per_million'] = combined_df['total_cases'] * 1e6 / combined_df['population']
combined_df['deaths_per_million'] = combined_df['total_deaths'] * 1e6 / combined_df['population']
combined_df
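The per-million columns are computed element-wise, so a missing total yields a missing rate for that country rather than an error. A brief sketch with hypothetical numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country 'B' has no reported total_cases
toy_df = pd.DataFrame({
    'location': ['A', 'B'],
    'population': [2_000_000, 500_000],
    'total_cases': [1000.0, np.nan],
})

# Scale the raw total to a rate per one million people
toy_df['cases_per_million'] = toy_df['total_cases'] * 1e6 / toy_df['population']

print(toy_df)
```

Country `A` gets 500 cases per million, while `B` stays missing (`NaN`).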
Q11: Create a dataframe with the 10 countries that have the highest number of tests per million people.
highest_tests_df = combined_df.nlargest(10,["tests_per_million"])
highest_tests_df
Q12: Create a dataframe with the 10 countries that have the highest number of positive cases per million people.
highest_cases_df = combined_df.nlargest(10,["cases_per_million"])
highest_cases_df
Q13: Create a dataframe with the 10 countries that have the highest number of deaths per million people.
highest_deaths_df = combined_df.nlargest(10,["deaths_per_million"])
highest_deaths_df
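`nlargest` selects the same rows as sorting in descending order and taking the head, as used earlier for the most populous countries. A short sketch on hypothetical data with no ties:

```python
import pandas as pd

# Hypothetical toy data (values are illustrative only)
toy_df = pd.DataFrame({
    'location': ['A', 'B', 'C', 'D'],
    'tests_per_million': [5.0, 50.0, 20.0, 1.0],
})

# Two ways to pick the top 2 rows by tests_per_million
top2_nlargest = toy_df.nlargest(2, 'tests_per_million')
top2_sorted = toy_df.sort_values('tests_per_million', ascending=False).head(2)

print(list(top2_nlargest['location']))  # ['B', 'C']
print(list(top2_sorted['location']))    # ['B', 'C']
```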