Charalambos Themistocleous
1 Introduction to Python
2 Data Manipulation with Pandas
2.1 Series
2.2 DataFrame
2.3 Merging DataFrames
2.4 Date and Time
2.5 Importing data
2.6 Save Data
3 Creating Plots
4 Basic Descriptive Statistics using Pandas
This chapter provides and introduction to Python.
Python 2.7 (or higher) (including Python 3)
pandas 0.11.1 (or higher) and its dependencies
NumPy 1.6.1 (or higher)
matplotlib 1.0.0 (or higher)
IPython 0.12 (or higher)
NLTK
**
import pandas as pd import numpy as np import matplotlib.pyplot as plt import nltk
A Series is a single vector of data with an index for each element. A similar structure in numpy is the array.
measurements = pd.Series([328259, 22781, 30857, 4164, 328387]) measurements
The printed output is the following:
Out[1]: 0 328259 1 22781 2 30857 3 4164 4 328387 dtype: int64
Series consist of [values ]and [indexes], we can call them separately in the following manner:
measurements.values Out[2]: array([328259, 22781, 30857, 4164, 328387])
measurements.index Out[3]: RangeIndex(start=0, stop=5, step=1)
This says that we have numbers from 0 to 5. The following selects the number in position [3]:
measurements[3] Out[4]: 4164
We can select values based on logical operations as well
measurements[measurements < 20000] Out[5]: Ecuador 4164 dtype: int64
or
measurements[measurements == 22781] Out[6]: Argentina 22781 dtype: int64
These numbers are not very informative so we want to provide labels. So, if we know that these numbers represent the number of books published in 2010 we might want to provide the name of the country as an index.
measurements = pd.Series([328259, 22781, 30857, 4164, 328387], index=[’USA’, ’Argentina’, ’Sweden’, ’Ecuador’, ’China’]) measurements Out[7]: USA 328259 Argentina 22781 Sweden 30857 Ecuador 4164 China 328387 dtype: int64
We can use these names to select the value.
measurements[’USA’] 328259
Also, we can provide both the array of values and the index the own labels:
measurements.name = ’Book␣Counts’ measurements.index.name = ’Countries’ measurements Out[71]: Countries USA 328259 Argentina 22781 Sweden 30857 Ecuador 4164 China 328387 Name: Book Counts, dtype: int64
We might be interested to select only the countries whose name ends in letter ’a’:
measurements[[name.endswith(’a’) or name.endswith(’A’) for name in measurements.index]] USA 328259 Argentina 22781 China 328387 dtype: int64
The following provides information about the position of these numbers:
[name.endswith(’a’) or name.endswith(’A’) for name in measurements.index] [True, True, False, False, True]
NumPy’s math functions and statistics can be applied to Series, e.g.,
np.mean(measurements) 142889.6
Series are very common objects to standard dictionaries (dict) in Python:
Bookpublications = ({’Italy’:59743, ’Argentina’:22781, ’Poland’: 31500, ’Vietnam’: 24589, ’Indonesia’: 24000}) pd.Series(Bookpublications) Argentina 22781 Indonesia 24000 Italy 59743 Poland 31500 Vietnam 24589 dtype: int64
data = pd.DataFrame( ({’counts’:[ 328259, 22781, 30857, 4164, 328387, 59743, 31500, 24589], ’year’:[2010, 2010, 2010, 2010, 2010, 2005, 2010, 2009], ’country’:[’USA’, ’Argentina’, ’Sweden’, ’Ecuador’, ’China’, ’Italy’, ’Poland’, ’Vietnam’]})) data
The output now is a table as we expect it to be:
country counts year 0 USA 328259 2010 1 Argentina 22781 2010 2 Sweden 30857 2010 3 Ecuador 4164 2010 4 China 328387 2010 5 Italy 59743 2005 6 Poland 31500 2010 7 Vietnam 24589 2009
To select the values of the column, we can use its name:
data[’counts’] 0 328259 1 22781 2 30857 3 4164 4 328387 5 59743 6 31500 7 24589 Name: counts, dtype: int64
data.counts Out[95]: 0 328259 1 22781 2 30857 3 4164 4 328387 5 59743 6 31500 7 24589 Name: counts, dtype: int64
By using the order of the columns, we can use their names again:
data[[’country’, ’year’, ’counts’]]
The index of columns is provided by the following:
data.columns Out[91]: Index([’country’, ’counts’, ’year’], dtype=’object’)
Types and selections:
type(data.counts) pandas.core.series.Series
type(data[[’counts’]]) pandas.core.frame.DataFrame
To select a row in a DataFrame, we index its [ix ]attribute in the following way:
data.ix[3] Out[98]: country Ecuador counts 4164 year 2010 Name: 3, dtype: object
We might create DataFrames using dictionaries
Alternatively, we can create a DataFrame with a dict of dicts: In [111]: data = pd.DataFrame( ({0:({’AA’: 1, ’gender’: ’Male’, ’height’: 168}), 1: ({’AA’: 2, ’gender’: ’Male’, ’height’: 180}), 2: ({’AA’: 3, ’gender’: ’Female’, ’height’: 170}), 3: ({’AA’: 4, ’gender’: ’Female’, ’height’: 169}), 4: ({’AA’: 5, ’gender’: ’Female’, ’height’: 170}), 5: ({’AA’: 6, ’gender’: ’Male’, ’height’: 165})})) In [112]: data Out[112]: 0 1 2 3 4 5 AA 1 2 3 4 5 6 gender Male Male Female Female Female Male height 168 180 170 169 170 165
To get a similar output we need to transpose the code:
data = data.T data Out[113]: AA gender height 0 1 Male 168 1 2 Male 180 2 3 Female 170 3 4 Female 169 4 5 Female 170 5 6 Male 165
Series and DataFrames have indexes and values which are called in the following way:
data.values
The output is following
array([[1, 2, 3, 4, 5, 6], [’Male’, ’Male’, ’Female’, ’Female’, ’Female’, ’Male’], [168, 180, 170, 169, 170, 165]], dtype=object)
and the index is called by data.index and the result is:
Index([’AA’, ’gender’, ’height’], dtype=’object’)
We cannot change the index, if we try, e.g., data.index[1] = 5 Python will provide the following message: “Index does not support mutable operations”.
To select a column:
heights = data.height heights Out[116]: 0 168 1 180 2 170 3 169 4 170 5 165 Name: height, dtype: object
To change a value
heights[5] = 191 heights Out[117]: 0 168 1 180 2 170 3 169 4 170 5 191 Name: height, dtype: object
data Out[118]: AA gender height 0 1 Male 168 1 2 Male 180 2 3 Female 170 3 4 Female 169 4 5 Female 170 5 6 Male 191
ht = data.height.copy() ht[5] = 180 data Out[141]: AA gender height 0 1 Male 168 1 2 Male 180 2 3 Female 177 3 4 Female 169 4 5 Female 170 5 6 Male 191
Create/ modify columns by assignment:
data.height[2] = 177 data Out[122]: AA gender height 0 1 Male 168 1 2 Male 180 2 3 Female 177 3 4 Female 169 4 5 Female 170 5 6 Male 180
data[’Status’] = ’Printed’ data Out[143]: AA gender height Status 0 1 Male 168 Printed 1 2 Male 180 Printed 2 3 Female 177 Printed 3 4 Female 169 Printed 4 5 Female 170 Printed 5 6 Male 191 Printed
The following method does not create a column:
data.libraryNo = 999 data Out[146]: AA gender height Status 0 1 Male 168 Printed 1 2 Male 180 Printed 2 3 Female 177 Printed 3 4 Female 169 Printed 4 5 Female 170 Printed 5 6 Male 191 Printed
data.libraryNo 999
We can define a Series object as column in a DataFrame
test = pd.Series([0]*2 + [3]*2) test data[’test’] = test data
We created a Series of 4 numbers. Note however that the DataFrame contains six rows. This is not a problem when we use numbers because Python automatically add NaN to fill the empty rows. Nevertheless, when we employ other data structures such as strings Python will show an error message: ValueError: Length of values does not match length of index.
# Popular Authors authors = [’Stephen␣King’, ’J.K.␣Rowling’, ’Mark␣Twain’, ’George␣R.␣R.␣Martin’] data[’authors’] = authors
To correct the error, we simply add a string Series that has the same length as the DataFrame
authors = [’Stephen␣King’, ’J.K.␣Rowling’, ’Mark␣Twain’, ’George␣R.␣R.␣Martin’, ’Charles␣Dickens’, ’Arthur␣Conan␣Doyle’] data[’favorite_authors’] = authors
This time the output is correct:
data AA gender height Status test favorite_authors 0 1 Male 168 Printed 0.0 Stephen King 1 2 Male 180 Printed 0.0 J.K. Rowling 2 3 Female 177 Printed 3.0 Mark Twain 3 4 Female 169 Printed 3.0 George R. R. Martin 4 5 Female 170 Printed NaN Charles Dickens 5 6 Male 165 Printed NaN Arthur Conan Doyle
To delete the column test from the DataFrame data
del data[’test’] data AA gender height Status authors 0 1 Male 168 Printed Stephen King 1 2 Male 180 Printed J.K. Rowling 2 3 Female 177 Printed Mark Twain 3 4 Female 169 Printed George R. R. Martin 4 5 Female 170 Printed Charles Dickens 5 6 Male 165 Printed Arthur Conan Doyle
To get the data as a simple narray we need to employ the attribute values.
array([[1, ’Male’, 168, ’Printed’, ’Stephen␣King’], [2, ’Male’, 180, ’Printed’, ’J.K.␣Rowling’], [3, ’Female’, 177, ’Printed’, ’Mark␣Twain’], [4, ’Female’, 169, ’Printed’, ’George␣R.␣R.␣Martin’], [5, ’Female’, 170, ’Printed’, ’Charles␣Dickens’], [6, ’Male’, 165, ’Printed’, ’Arthur␣Conan␣Doyle’]], dtype=object)
The dtype here is object because we have numeric and string data and differs when we have numeric or other type of data.
df1 = pd.DataFrame(’A’: [’A0’, ’A1’, ’A2’, ’A3’], ’B’: [’B0’, ’B1’, ’B2’, ’B3’], ’C’: [’C0’, ’C1’, ’C2’, ’C3’], ’D’: [’D0’, ’D1’, ’D2’, ’D3’], index=[0, 1, 2, 3])
Figure1: Example from https://pandas.pydata.org/pandas-docs/stable/merging.html
For more information see http://pandas.pydata.org/pandas-docs/stable/merging.html.
Python can manipulate date and time objects using the datetime module. It allows the production of calculations using time and date objects and also provides classes for controlling the output (see also, https://docs.python.org/2/library/datetime.html)
from datetime import datetime #%% now = datetime.now() now
and the result is
To get the date only
now.date()
and the output in this case is datetime.date(2017, 1, 6). To find the day
now.day
and the output is 6. Also, for the time
now.time()
and the output is datetime.time(14, 41, 4, 481168). We can also ask which is the week day:
now.weekday()
that will generate the output 4
from datetime import date, time
time(3, 24)
age = now - datetime(1980, 8, 16) age/365
days=(datetime(2017, 3, 10) - datetime(2017, 8, 16)) days.days
We suggest that you use comma-separated value or CSV files when interacting with Python and other statistical software. In computing, CSV files stores tabular data (numbers and text) in plain text. Columns are separated by commas; rows are terminated by newlines. This file format is not proprietary, the files can be edited in text editors and spreadsheet software, such as Excel and Calc.
dur = pd.read_csv("data/duration.csv", sep=";") dur Out[153]: experiment duration 0 A 199 1 A 184 2 A 242 3 A 236 4 A 216 5 A 176 6 A 223 7 A 186 8 A 210 9 A 220 .. ... ... 95 C 221 96 C 239 97 C 235 98 C 248 99 C 204 100 C 226 101 C 206 102 C 194 103 C 205 104 C 182 [105 rows x 2 columns]
We can also import another dataframe and add a column titled AA.
fricative = pd.read_table("data/fricatives.csv", sep=’,’) fricative[’AA’] = pd.Series(range(1,8827))
fricative.head() # %%
0 1 0.060398 32.671794 757.605236 1104.704765 13.835014 210.523631 1 2 0.045656 38.906220 732.582945 1065.089424 12.654465 186.856393 2 3 0.050907 47.209304 647.696728 1627.357767 7.647966 61.615315 3 4 0.051049 41.703970 1017.179353 2318.797907 5.570367 33.783925 4 5 0.028408 44.345609 1132.524942 848.894793 7.105495 108.453910 Segment Vowel Variety Stress Voice Position AA 0 d a CG Unstressed Voiced Middle 1.0 1 d a CG Unstressed Voiced Middle 2.0 2 d a CG Unstressed Voiced Middle 3.0 3 d a CG Unstressed Voiced Middle 4.0 4 d a CG Unstressed Voiced Middle 5.0
list(range(1,len(fricative.index)))
We can skip rows if we do not want them in the analysis:
testfric=pd.read_csv("data/fricatives.csv", skiprows=[2,3,4,5,6]) len(testfric.index)
To import a small number of rows from, we can use nrows:
pd.read_csv("data/fricatives.csv", nrows=4)
get_ipython().system(’cat␣data/fricatives.csv’)
pd.read_csv("data/fricatives.csv").head(20)
pd.isnull(pd.read_csv("data/fricatives.csv")).head(20)
When we import data Python identifies empty cells, or NA values as NA data; to designated that specific values or symbols should be considered NA values, we can specify this as follows
pd.read_csv("data/fricatives.csv", na_values=[’?’, -9999999]).head(20)
There are different methods to save data. To save data in CSV format
# ## Writing Data to Files fricative.to_csv("fricative-01.csv")
Using pandas we can also make some basic plotting.
fricative[’duration’].plot()
############################################################## # %% fricative = pd.read_csv("data/fricatives.csv", sep=’,’) fricative[’duration’].plot() # %% fricative[’duration’].plot(kind=’hist’) # %% fricative[’duration’].plot(kind=’box’,showfliers=False)
Duration
Figure2: Duration in sec.
————————————————————————
import matplotlib.pyplot as plt plt.plot([1,2,3,4]) plt.ylabel(’some numbers’) plt.show()
In [208]: fricative.sum() Out[208]: duration 827.811 intensity 346024 cog 5.05981e+07 sdev 2.40776e+07 skew 21699.9 kurt 328392 Segment dddddddddddddddddddddddddddddddddddddddddddddd... Vowel aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa... Variety CGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG... Stress UnstressedUnstressedUnstressedUnstressedUnstre... Voice VoicedVoicedVoicedVoicedVoicedVoicedVoicedVoic... Position MiddleMiddleMiddleMiddleMiddleMiddleMiddleMidd... AA 3.89536e+07 dtype: object
In [209]: fricative.mean() Out[209]: duration 0.093782 intensity 39.254023 cog 5732.200660 sdev 2727.724598 skew 2.458354 kurt 37.203150 AA 4413.500000 dtype: float64
In [211]: fricative.std() Out[211]: duration 0.031759 intensity 8.272744 cog 3425.508087 sdev 1339.636724 skew 4.785687 kurt 138.622132 AA 2547.991071 dtype: float64
In [212]: fricative.count() Out[212]: duration 8827 intensity 8815 cog 8827 sdev 8827 skew 8827 kurt 8827 Segment 8827 Vowel 8827 Variety 8827 Stress 8827 Voice 8827 Position 8827 AA 8826 dtype: int64
fricative.intensity.hasnans Out[215]: True In [221]: fricative.intensity.isnull().sum() Out[221]: 12
Describe:
In [222]: fricative.describe() /Users/haristhemistocleous/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile RuntimeWarning) Out[222]: duration intensity cog sdev skew \ count 8827.000000 8815.000000 8827.000000 8827.000000 8827.000000 mean 0.093782 39.254023 5732.200660 2727.724598 2.458354 std 0.031759 8.272744 3425.508087 1339.636724 4.785687 min 0.020333 5.278827 419.757883 228.697624 -5.250996 25% 0.071596 NaN 2385.869561 1771.421219 -0.113557 50% 0.091452 NaN 6175.724355 2368.203536 0.925865 75% 0.112412 NaN 8344.008050 3595.757817 2.953676 max 0.316844 69.455969 18606.542539 9253.436646 59.853567 kurt AA count 8827.000000 8826.000000 mean 37.203150 4413.500000 std 138.622132 2547.991071 min -1.892874 1.000000 25% 0.512395 NaN 50% 3.432032 NaN 75% 12.453753 NaN max 3999.613892 8826.000000 describe can detect non-numeric data and sometimes yield useful information about it.
fricative.sdev.describe() Out[224]: count 8827.000000 mean 2727.724598 std 1339.636724 min 228.697624 25% 1771.421219 50% 2368.203536 75% 3595.757817 max 9253.436646 Name: sdev, dtype: float64