Charalambos Themistocleous

2.1 Series

2.2 DataFrame

2.6 Save Data

### Introduction to Python

This chapter provides an introduction to Python. To follow the examples, you will need the following software:

1. Python 2.7 (or higher) (including Python 3)

2. pandas 0.11.1 (or higher) and its dependencies

3. NumPy 1.6.1 (or higher)

4. matplotlib 1.0.0 (or higher)

5. IPython 0.12 (or higher)

6. NLTK

• Pandas is a package that provides functionality for analyzing data in the form of tables, such as those we have in Excel or LibreOffice Calc. Its most important data structure is the DataFrame, which is very similar to R data frames. Pandas also provides functionality for reshaping, sorting, and otherwise manipulating data.
• The second library we will be using is NumPy, which offers the basic functionality for numerical computing, including statistics, linear algebra, and Fourier transforms.
• Matplotlib provides functionality for creating plots and graphs.
• NLTK is the Natural Language Toolkit, implemented in Python.

To start an analysis, add the following code to your code file. The code imports the libraries and gives each one a short alias. To call a pandas function, for instance, we write the alias pd followed by a period and the name of the function. This will become clearer soon.


[Python code:1.0.0.]
``````
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk

``````

### Data Manipulation with Pandas

#### Series

A Series is a single vector of data with an index label for each element. The closest structure in NumPy is the array.

[Python code:2.1.0.]
``````
measurements = pd.Series([328259, 22781, 30857, 4164, 328387])
measurements

``````

The printed output is the following:

[Python code:2.1.1.]
``````
Out:
0    328259
1     22781
2     30857
3      4164
4    328387
dtype: int64

``````

A Series consists of values and an index; we can access them separately in the following manner:

[Python code:2.1.2.]
``````
measurements.values
Out:
array([328259,  22781,  30857,   4164, 328387])

``````


[Python code:2.1.3.]
``````
measurements.index
Out:
RangeIndex(start=0, stop=5, step=1)

``````

This says that the index runs from 0 up to, but not including, 5. The following selects the value at position 3:

[Python code:2.1.4.]
``````
measurements[3]
Out:
4164

``````

We can also select values with logical conditions:

[Python code:2.1.5.]
``````
measurements[measurements < 20000]
Out:
3    4164
dtype: int64

``````

or

[Python code:2.1.6.]
``````
measurements[measurements == 22781]
Out:
1    22781
dtype: int64

``````
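The logical selections above rely on a Boolean mask: the comparison returns a Series of True/False values, which is then used to pick elements. The following is a minimal sketch with the same numbers; the variable names mask, small, and mid_range are illustrative and do not come from the original text.

```python
import pandas as pd

measurements = pd.Series([328259, 22781, 30857, 4164, 328387])

# The comparison yields a Boolean Series with the same index.
mask = measurements < 20000

# Indexing with the mask keeps only the elements where it is True.
small = measurements[mask]
print(small)

# Conditions can be combined with & (and) and | (or);
# each condition needs its own parentheses.
mid_range = measurements[(measurements > 20000) & (measurements < 100000)]
print(mid_range)
```

Writing the mask out separately makes it easier to inspect which elements a condition matches before using it for selection.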

These numbers are not very informative on their own, so we want to provide labels. If we know that they represent the number of books published in 2010, we might use the name of each country as the index.

[Python code:2.1.7.]
``````
measurements = pd.Series([328259, 22781, 30857, 4164, 328387],
                         index=['USA', 'Argentina', 'Sweden', 'Ecuador', 'China'])
measurements
Out:
USA          328259
Argentina     22781
Sweden        30857
Ecuador        4164
China        328387
dtype: int64

``````

We can use these names to select the value.

[Python code:2.1.8.]
``````
measurements['USA']
328259

``````

We can also give names to both the array of values and the index:

[Python code:2.1.9.]
``````
measurements.name = 'Book Counts'
measurements.index.name = 'Countries'
measurements
Out:
Countries
USA          328259
Argentina     22781
Sweden        30857
Ecuador        4164
China        328387
Name: Book Counts, dtype: int64

``````

We might be interested in selecting only the countries whose names end in the letter 'a':

[Python code:2.1.10.]
``````
measurements[[name.endswith('a') or name.endswith('A') for name in measurements.index]]
USA          328259
Argentina     22781
China        328387
dtype: int64

``````

The list comprehension on its own returns a Boolean mask that marks the positions of the matching labels:

[Python code:2.1.11.]
``````
[name.endswith('a') or name.endswith('A') for name in measurements.index]

[True, True, False, False, True]

``````
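The same mask can probably be built without a list comprehension by using the vectorized string methods that pandas exposes through the .str accessor of an index. This is a sketch, not the author's method; the name ends_in_a is illustrative, and the data are the five counts and country labels used in this chapter.

```python
import pandas as pd

measurements = pd.Series(
    [328259, 22781, 30857, 4164, 328387],
    index=['USA', 'Argentina', 'Sweden', 'Ecuador', 'China'],
)

# Lower-case the labels first so that 'A' and 'a' are treated alike,
# then test the ending of every label in one call.
ends_in_a = measurements.index.str.lower().str.endswith('a')

print(ends_in_a.tolist())
print(measurements[ends_in_a])
```

Lower-casing the labels replaces the two separate endswith tests of the list comprehension with a single one.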

NumPy’s math functions and statistics can be applied to Series, e.g.,

[Python code:2.1.12.]
``````
np.mean(measurements)
142889.6

``````

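Series are very similar to standard dictionaries (dict) in Python, and a Series can be created directly from one:

[Python code:2.1.13.]
``````
Bookpublications = {'Italy': 59743, 'Argentina': 22781, 'Poland': 31500,
                    'Vietnam': 24589, 'Indonesia': 24000}

# Older pandas versions sort the keys alphabetically, as in the output below;
# recent versions keep the insertion order of the dictionary.
pd.Series(Bookpublications)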
Argentina    22781
Indonesia    24000
Italy        59743
Poland       31500
Vietnam      24589
dtype: int64

``````
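The dictionary analogy extends to lookups and membership tests. The sketch below reuses the dictionary from the text under the illustrative name book_publications; the label 'Sweden' is added only to show what happens when a requested label is missing from the dictionary.

```python
import pandas as pd

book_publications = {'Italy': 59743, 'Argentina': 22781, 'Poland': 31500,
                     'Vietnam': 24589, 'Indonesia': 24000}

# Keys become the index, values become the data.
series = pd.Series(book_publications)

# Passing an explicit index selects and orders the keys;
# a label that is missing from the dict ('Sweden' here) gets NaN.
subset = pd.Series(book_publications, index=['Poland', 'Italy', 'Sweden'])
print(subset)

# Dictionary-style access and membership tests work on the labels.
print(series['Italy'])
print('Poland' in series)
```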

#### DataFrame

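A DataFrame is a table of rows and columns in which every column is a Series. One way to create a DataFrame is from a dictionary of lists, where each key becomes a column name: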

[Python code:2.2.0.]
``````
data = pd.DataFrame(
    {'counts': [328259, 22781, 30857, 4164, 328387, 59743, 31500, 24589],
     'year': [2010, 2010, 2010, 2010, 2010, 2005, 2010, 2009],
     'country': ['USA', 'Argentina', 'Sweden', 'Ecuador', 'China', 'Italy',
                 'Poland', 'Vietnam']})
data

``````

The output now is a table as we expect it to be:

[Python code:2.2.1.]
``````
     country  counts  year
0        USA  328259  2010
1  Argentina   22781  2010
2     Sweden   30857  2010
3    Ecuador    4164  2010
4      China  328387  2010
5      Italy   59743  2005
6     Poland   31500  2010
7    Vietnam   24589  2009

``````

To select the values of a column, we can use its name:

[Python code:2.2.2.]
``````
data['counts']

0    328259
1     22781
2     30857
3      4164
4    328387
5     59743
6     31500
7     24589
Name: counts, dtype: int64

``````

or

[Python code:2.2.3.]
``````
data.counts
Out:
0    328259
1     22781
2     30857
3      4164
4    328387
5     59743
6     31500
7     24589
Name: counts, dtype: int64

``````

To change the order of the columns, we can list their names in the order we want:

[Python code:2.2.4.]
``````
data[['country', 'year', 'counts']]

``````

The index of columns is provided by the following:

[Python code:2.2.5.]
``````
data.columns
Out:
Index(['country', 'counts', 'year'], dtype='object')

``````

Selecting a single column by name returns a Series, whereas selecting with a list of names returns a DataFrame:

[Python code:2.2.6.]
``````
type(data.counts)
pandas.core.series.Series

``````

[Python code:2.2.7.]
``````
type(data[['counts']])
pandas.core.frame.DataFrame

``````

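To select a row in a DataFrame, we index its loc attribute with the row label (older pandas versions also offered an ix attribute, which has since been removed):

[Python code:2.2.8.]
``````
data.loc[3]
Out:
country    Ecuador
counts        4164
year          2010
Name: 3, dtype: object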

``````
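Row selection can be label-based or position-based. The following sketch uses loc and iloc, the indexers available in current pandas releases, on a shortened version of the table above; the names row_by_label, row_by_position, and cell are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'country': ['USA', 'Argentina', 'Sweden', 'Ecuador'],
    'counts': [328259, 22781, 30857, 4164],
    'year': [2010, 2010, 2010, 2010],
})

# Label-based: the row whose index label is 3.
row_by_label = data.loc[3]

# Position-based: the fourth row, whatever its label is.
row_by_position = data.iloc[3]

# A row label and a column label together select a single cell.
cell = data.loc[3, 'counts']

print(row_by_label)
print(cell)
```

With the default integer index the two indexers return the same row; they differ as soon as the index holds other labels, such as country names.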

We can also create a DataFrame from a dictionary of dictionaries:

[Python code:2.2.9.]
``````
data = pd.DataFrame(
    {0: {'AA': 1, 'gender': 'Male', 'height': 168},
     1: {'AA': 2, 'gender': 'Male', 'height': 180},
     2: {'AA': 3, 'gender': 'Female', 'height': 170},
     3: {'AA': 4, 'gender': 'Female', 'height': 169},
     4: {'AA': 5, 'gender': 'Female', 'height': 170},
     5: {'AA': 6, 'gender': 'Male', 'height': 165}})
data
Out:
           0     1       2       3       4     5
AA         1     2       3       4       5     6
gender  Male  Male  Female  Female  Female  Male
height   168   180     170     169     170   165

``````

To get the familiar layout, with one row per observation, we transpose the DataFrame:

[Python code:2.2.10.]
``````
data = data.T
data
Out:
   AA  gender height
0  1    Male    168
1  2    Male    180
2  3  Female    170
3  4  Female    169
4  5  Female    170
5  6    Male    165

``````
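As an alternative to transposing, pandas offers DataFrame.from_dict, whose orient='index' option treats the outer keys as row labels. This is a sketch with the first three records from the text; the names records and people are illustrative.

```python
import pandas as pd

records = {
    0: {'AA': 1, 'gender': 'Male', 'height': 168},
    1: {'AA': 2, 'gender': 'Male', 'height': 180},
    2: {'AA': 3, 'gender': 'Female', 'height': 170},
}

# orient='index' uses the outer keys as row labels and the inner keys
# as column names, so no transpose is needed afterwards.
people = pd.DataFrame.from_dict(records, orient='index')

print(people)
```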

Series and DataFrames both have values and an index, which are accessed in the following way:

[Python code:2.2.11.]
``````
data.values

``````

For the DataFrame as it was before transposing, the output is the following:

[Python code:2.2.12.]
``````
array([[1, 2, 3, 4, 5, 6],
       ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
       [168, 180, 170, 169, 170, 165]], dtype=object)

``````

The index is returned by data.index; again for the untransposed DataFrame, the result is:

[Python code:2.2.13.]
``````
Index(['AA', 'gender', 'height'], dtype='object')

``````

We cannot change individual elements of an index. If we try, e.g., data.index[0] = 5, Python raises a TypeError with the message: “Index does not support mutable operations”.

To select a column:

[Python code:2.2.14.]
``````
heights = data.height
heights
Out:
0    168
1    180
2    170
3    169
4    170
5    165
Name: height, dtype: object

``````

To change a value, we assign to the corresponding position:

[Python code:2.2.15.]
``````
heights[5] = 191
heights
Out:
0    168
1    180
2    170
3    169
4    170
5    191
Name: height, dtype: object

``````

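In the pandas version used here, heights is a view on the column rather than a copy, so the change also appears in the original DataFrame:

[Python code:2.2.16.]
``````
data
Out:
   AA  gender height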
0  1    Male    168
1  2    Male    180
2  3  Female    170
3  4  Female    169
4  5  Female    170
5  6    Male    191

``````

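To modify values without affecting the DataFrame, we can work on an explicit copy:

[Python code:2.2.17.]
``````
ht = data.height.copy()
ht[5] = 180  # change the copy only (position 5 chosen for illustration)
data
Out:
   AA  gender height
0  1    Male    168
1  2    Male    180
2  3  Female    170
3  4  Female    169
4  5  Female    170
5  6    Male    191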

``````
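The difference between changing a copy and changing the DataFrame itself can be checked directly. The sketch below uses a small stand-in table, and the variable original_after_copy_edit is illustrative; assigning through .loc is the form that current pandas documentation recommends for changing a DataFrame in place.

```python
import pandas as pd

data = pd.DataFrame({'AA': [1, 2, 3], 'height': [168, 180, 170]})

# An explicit copy is independent of the DataFrame it came from.
ht = data['height'].copy()
ht[2] = 177

# Snapshot of the original column after editing the copy.
original_after_copy_edit = data['height'].tolist()
print(ht.tolist())
print(original_after_copy_edit)

# To change the DataFrame itself, assign through .loc
# with a row label and a column name.
data.loc[2, 'height'] = 177
print(data['height'].tolist())
```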

Columns can be created or modified by assignment. To modify a value in an existing column:

[Python code:2.2.18.]
``````
data.loc[2, 'height'] = 177
data
Out:
   AA  gender height
0  1    Male    168
1  2    Male    180
2  3  Female    177
3  4  Female    169
4  5  Female    170
5  6    Male    191

``````

Assigning a value to a new column name creates that column:

[Python code:2.2.19.]
``````
data['Status'] = 'Printed'
data
Out:
   AA  gender height   Status
0  1    Male    168  Printed
1  2    Male    180  Printed
2  3  Female    177  Printed
3  4  Female    169  Printed
4  5  Female    170  Printed
5  6    Male    191  Printed

``````

Assigning to a new attribute, by contrast, does not create a column:

[Python code:2.2.20.]
``````
data.libraryNo = 999
data
Out:
   AA  gender height   Status
0  1    Male    168  Printed
1  2    Male    180  Printed
2  3  Female    177  Printed
3  4  Female    169  Printed
4  5  Female    170  Printed
5  6    Male    191  Printed

``````

The assignment merely sets an ordinary Python attribute on the DataFrame object:

[Python code:2.2.21.]
``````
data.libraryNo
999

``````

We can define a Series object as a column in a DataFrame:

[Python code:2.2.22.]
``````
test = pd.Series([0]*2 + [3]*2)
test

data[’test’] = test
data

``````

We created a Series of four numbers, whereas the DataFrame contains six rows. This is not a problem when we assign a Series, because pandas aligns it with the DataFrame on the index and automatically fills the unmatched rows with NaN. When we assign a plain list of the wrong length, however, pandas raises an error: ValueError: Length of values does not match length of index.

[Python code:2.2.23.]
``````
# Popular authors: four values for six rows, so this raises a ValueError
authors = ['Stephen King', 'J.K. Rowling', 'Mark Twain', 'George R. R. Martin']
data['authors'] = authors

``````
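The contrast between index alignment and a length mismatch can be reproduced on a small stand-in table. This is a sketch under the assumption of a six-row DataFrame with a default integer index; the variable message is illustrative and only stores the error text.

```python
import pandas as pd

data = pd.DataFrame({'AA': [1, 2, 3, 4, 5, 6]})

# A shorter Series is aligned on the index; unmatched rows become NaN.
data['test'] = pd.Series([0] * 2 + [3] * 2)
print(data)

# A plain list has no index to align on, so its length must match.
authors = ['Stephen King', 'J.K. Rowling', 'Mark Twain', 'George R. R. Martin']
message = None
try:
    data['authors'] = authors
except ValueError as err:
    message = str(err)
    print('ValueError:', message)
```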

To correct the error, we simply supply a list that has the same length as the DataFrame:

[Python code:2.2.24.]
``````
authors = ['Stephen King', 'J.K. Rowling', 'Mark Twain',
           'George R. R. Martin', 'Charles Dickens', 'Arthur Conan Doyle']
data['favorite_authors'] = authors

``````

This time the output is correct:

[Python code:2.2.25.]
``````
data
   AA  gender height   Status  test     favorite_authors
0  1    Male    168  Printed   0.0         Stephen King
1  2    Male    180  Printed   0.0         J.K. Rowling
2  3  Female    177  Printed   3.0           Mark Twain
3  4  Female    169  Printed   3.0  George R. R. Martin
4  5  Female    170  Printed   NaN      Charles Dickens
5  6    Male    191  Printed   NaN   Arthur Conan Doyle

``````

To delete the column test from the DataFrame data:

[Python code:2.2.26.]
``````
del data['test']
data
   AA  gender height   Status     favorite_authors
0  1    Male    168  Printed         Stephen King
1  2    Male    180  Printed         J.K. Rowling
2  3  Female    177  Printed           Mark Twain
3  4  Female    169  Printed  George R. R. Martin
4  5  Female    170  Printed      Charles Dickens
5  6    Male    191  Printed   Arthur Conan Doyle

``````

To get the data as a plain NumPy ndarray, we use the values attribute:

[Python code:2.2.27.]
``````
data.values
Out:
array([[1, 'Male', 168, 'Printed', 'Stephen King'],
       [2, 'Male', 180, 'Printed', 'J.K. Rowling'],
       [3, 'Female', 177, 'Printed', 'Mark Twain'],
       [4, 'Female', 169, 'Printed', 'George R. R. Martin'],
       [5, 'Female', 170, 'Printed', 'Charles Dickens'],
       [6, 'Male', 191, 'Printed', 'Arthur Conan Doyle']], dtype=object)

``````

The dtype here is object because the DataFrame mixes numeric and string data; a DataFrame that holds only numeric data yields a numeric dtype instead.

#### Merging DataFrames

The following DataFrame is used as an example in the pandas documentation on merging:

    df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                       index=[0, 1, 2, 3])

Figure 1: Example from https://pandas.pydata.org/pandas-docs/stable/merging.html
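Two common ways of combining DataFrames are stacking them with pd.concat and joining them on a shared column with pd.merge. The following is a minimal sketch, not taken from the original chapter: the frames df2, left, and right and the column name key are illustrative assumptions in the style of the pandas documentation.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

# Stack the two frames on top of each other (rows are appended).
stacked = pd.concat([df1, df2])

# Join two frames on a shared key column, as in a database join.
left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K0', 'K1'], 'C': ['C0', 'C1']})
merged = pd.merge(left, right, on='key')

print(stacked)
print(merged)
```

concat is suited to frames that share the same columns, whereas merge matches rows by the values of a key.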

#### Date and Time

Python can manipulate date and time objects using the datetime module. It supports calculations with time and date objects and also provides classes for controlling the output (see also https://docs.python.org/2/library/datetime.html).

[Python code:2.4.0.]
``````
from datetime import datetime
#%%
now = datetime.now()
now

``````

and the result is

[Python code:2.4.1.]
``````
datetime.datetime(2017, 1, 6, 14, 41, 4, 481168)

``````

To get the date only:

[Python code:2.4.2.]
``````
now.date()

``````

and the output in this case is datetime.date(2017, 1, 6). To find the day:

[Python code:2.4.3.]
``````
now.day

``````

and the output is 6. Also, for the time:

[Python code:2.4.4.]
``````
now.time()

``````

and the output is datetime.time(14, 41, 4, 481168). We can also ask which is the week day:

[Python code:2.4.5.]
``````
now.weekday()

``````

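which generates the output 4; weekdays are numbered from Monday (0), so 4 is a Friday. The module also provides separate date and time classes: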

[Python code:2.4.6.]
``````
from datetime import date, time

``````

For example, a time of 03:24 is created as follows:

[Python code:2.4.7.]
``````
time(3, 24)

``````

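Subtracting one datetime from another gives a timedelta, from which we can compute, for example, an approximate age in years:

[Python code:2.4.8.]
``````
age = now - datetime(1980, 8, 16)
age.days / 365  # approximate age in years (leap days ignored)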

``````

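The days attribute of a timedelta gives the difference in whole days; it is negative here because the first date is the earlier one: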

[Python code:2.4.9.]
``````
days=(datetime(2017, 3, 10) - datetime(2017, 8, 16))
days.days

``````
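The date arithmetic above can be collected into one self-contained sketch. It uses a fixed reference date instead of datetime.now() so that the result does not depend on when it is run; the names start, end, born, reference, and age_years are illustrative.

```python
from datetime import datetime

start = datetime(2017, 3, 10)
end = datetime(2017, 8, 16)

# Subtracting two datetimes gives a timedelta.
delta = end - start
print(delta.days)                    # whole days between the two dates
print(delta.total_seconds() / 3600)  # the same span expressed in hours

# An approximate age in years on a fixed reference date;
# dividing by 365.25 roughly accounts for leap years.
born = datetime(1980, 8, 16)
reference = datetime(2017, 1, 6)
age_years = (reference - born).days / 365.25
print(round(age_years, 1))
```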

#### Importing data

We suggest that you use comma-separated values (CSV) files when exchanging data between Python and other statistical software. A CSV file stores tabular data (numbers and text) in plain text: columns are separated by commas and rows are terminated by newlines. The format is not proprietary, and the files can be edited in text editors and in spreadsheet software such as Excel and Calc. Pandas reads such files with the read_csv function. A DataFrame dur that was read from a CSV file looks like this:

[Python code:2.5.0.]
``````
dur
Out:
    experiment  duration
0            A       199
1            A       184
2            A       242
3            A       236
4            A       216
5            A       176
6            A       223
7            A       186
8            A       210
9            A       220
..         ...       ...
95           C       221
96           C       239
97           C       235
98           C       248
99           C       204
100          C       226
101          C       206
102          C       194
103          C       205
104          C       182

[105 rows x 2 columns]

``````
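A minimal sketch of reading CSV data with pd.read_csv follows. To stay self-contained it reads from an in-memory string instead of a file on disk; the rows in csv_text are made-up values in the same two-column layout as the table above, not the original data set.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; with a real file, pass its path instead.
csv_text = """experiment,duration
A,199
A,184
B,242
C,221
"""

dur = pd.read_csv(io.StringIO(csv_text))

print(dur.head())   # the first rows
print(dur.shape)    # (number of rows, number of columns)
print(dur.dtypes)   # column types inferred by pandas
```

The first line of the file is used as the header, and pandas infers a type for each column from its contents.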

We can also import another DataFrame, here called fricative, and add a column titled AA:

[Python code:2.5.1.]
``````
fricative['AA'] = pd.Series(range(1,8827))

``````


[Python code:2.5.2.]
``````
# %%

``````


[Python code:2.5.3.]
``````
0   1  0.060398  32.671794   757.605236  1104.704765  13.835014  210.523631
1   2  0.045656  38.906220   732.582945  1065.089424  12.654465  186.856393
2   3  0.050907  47.209304   647.696728  1627.357767   7.647966   61.615315
3   4  0.051049  41.703970  1017.179353  2318.797907   5.570367   33.783925
4   5  0.028408  44.345609  1132.524942   848.894793   7.105495  108.453910

  Segment Vowel Variety      Stress   Voice Position   AA
0       d     a      CG  Unstressed  Voiced   Middle  1.0
1       d     a      CG  Unstressed  Voiced   Middle  2.0
2       d     a      CG  Unstressed  Voiced   Middle  3.0
3       d     a      CG  Unstressed  Voiced   Middle  4.0
4       d     a      CG  Unstressed  Voiced   Middle  5.0

``````


[Python code:2.5.4.]
``````
list(range(1,len(fricative.index)))

``````

We can skip rows if we do not want them in the analysis:

[Python code:2.5.5.]
``````
len(testfric.index)

``````

To import only the first few rows of a file, we can use the nrows argument of read_csv:

[Python code:2.5.6.]
``````

``````
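Skipping rows and limiting the number of rows are both handled by arguments of pd.read_csv. The sketch below uses a small in-memory CSV with made-up values; the names skipped and first_two are illustrative.

```python
import io

import pandas as pd

csv_text = "experiment,duration\nA,199\nA,184\nA,242\nC,221\nC,239\n"

# skiprows takes line numbers counted from 0, where line 0 is the header;
# here the first two data rows (lines 1 and 2) are skipped.
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# nrows reads only the first n data rows after the header.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(len(skipped.index))
print(first_two)
```

Reading only a few rows with nrows is a quick way to inspect the structure of a large file before loading all of it.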

[Python code:2.5.7.]
``````
get_ipython().system('cat data/fricatives.csv')

``````


[Python code:2.5.8.]
``````

``````


[Python code:2.5.9.]
``````

``````

When we import data, pandas treats empty cells and standard markers such as NA as missing (NA) values. To designate additional values or symbols that should be considered NA, we can pass them to read_csv through its na_values argument:

[Python code:2.5.10.]
``````

``````
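The effect of declaring extra missing-value markers can be sketched with the na_values argument of pd.read_csv. The column names and the markers -999 and missing are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

csv_text = "speaker,intensity\ns1,32.6\ns2,\ns3,-999\ns4,missing\n"

# By default only empty cells and standard markers such as 'NA' become NaN.
default = pd.read_csv(io.StringIO(csv_text))

# na_values adds our own markers to that list.
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=['-999', 'missing'])

print(default['intensity'].isnull().sum())
print(cleaned['intensity'].isnull().sum())
print(cleaned['intensity'].dtype)
```

Once the markers are declared, the column contains only numbers and NaN, so pandas can store it as a floating-point column.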

#### Save Data

There are different methods to save data. To save data in CSV format:

[Python code:2.6.0.]
``````
# Write the DataFrame to a CSV file
fricative.to_csv("fricative-01.csv")

``````
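A quick way to check a saved file is to read it back. The sketch below writes a small made-up DataFrame to a temporary folder and reloads it; the file name example.csv and the variable names are illustrative. By default to_csv also writes the row index as the first column, and index=False leaves it out.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'experiment': ['A', 'C'], 'duration': [199, 221]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.csv')

    # index=False leaves out the row labels, so reading the file back
    # does not produce an extra unnamed column.
    frame.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```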

### Creating Plots

Using pandas, we can also create some basic plots.

[Python code:3.0.0.]
``````

fricative['duration'].plot()

``````

[Python code:3.0.1.]
``````
# Line plot of the duration values
fricative['duration'].plot()
# Histogram of the same column
fricative['duration'].plot(kind='hist')
# Box plot without outlier markers
fricative['duration'].plot(kind='box', showfliers=False)

``````

Figure 2: Duration in sec.

Matplotlib can also be called directly:

    import matplotlib.pyplot as plt
    plt.plot([1, 2, 3, 4])
    plt.ylabel('some numbers')
    plt.show()

### Basic Descriptive Statistics using Pandas

The sum method adds up each numeric column; for string columns, it concatenates the values:

[Python code:4.0.0.]
``````
fricative.sum()
Out:
duration                                               827.811
intensity                                               346024
cog                                                5.05981e+07
sdev                                               2.40776e+07
skew                                                   21699.9
kurt                                                    328392
Segment      dddddddddddddddddddddddddddddddddddddddddddddd...
Vowel        aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
Variety      CGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG...
Stress       UnstressedUnstressedUnstressedUnstressedUnstre...
Voice        VoicedVoicedVoicedVoicedVoicedVoicedVoicedVoic...
Position     MiddleMiddleMiddleMiddleMiddleMiddleMiddleMidd...
AA                                                 3.89536e+07
dtype: object

``````

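The mean of each numeric column:

[Python code:4.0.1.]
``````
# numeric_only=True skips the string columns (needed in recent pandas versions)
fricative.mean(numeric_only=True)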
Out:
duration        0.093782
intensity      39.254023
cog          5732.200660
sdev         2727.724598
skew            2.458354
kurt           37.203150
AA           4413.500000
dtype: float64

``````

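The standard deviation of each numeric column:

[Python code:4.0.2.]
``````
# numeric_only=True skips the string columns (needed in recent pandas versions)
fricative.std(numeric_only=True)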
Out:
duration        0.031759
intensity       8.272744
cog          3425.508087
sdev         1339.636724
skew            4.785687
kurt          138.622132
AA           2547.991071
dtype: float64

``````

The count method returns the number of non-missing values in each column:

[Python code:4.0.3.]
``````
fricative.count()
Out:
duration     8827
intensity    8815
cog          8827
sdev         8827
skew         8827
kurt         8827
Segment      8827
Vowel        8827
Variety      8827
Stress       8827
Voice        8827
Position     8827
AA           8826
dtype: int64

``````

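The intensity column has fewer values than the others, so it contains missing data. We can check for missing values and count them:

[Python code:4.0.4.]
``````
fricative.intensity.hasnans
Out:
True

fricative.intensity.isnull().sum()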
Out:
12

``````

The describe method summarizes each numeric column:

[Python code:4.0.5.]
``````
fricative.describe()
/Users/haristhemistocleous/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
RuntimeWarning)
Out:
          duration    intensity           cog         sdev         skew  \
count  8827.000000  8815.000000   8827.000000  8827.000000  8827.000000
mean      0.093782    39.254023   5732.200660  2727.724598     2.458354
std       0.031759     8.272744   3425.508087  1339.636724     4.785687
min       0.020333     5.278827    419.757883   228.697624    -5.250996
25%       0.071596          NaN   2385.869561  1771.421219    -0.113557
50%       0.091452          NaN   6175.724355  2368.203536     0.925865
75%       0.112412          NaN   8344.008050  3595.757817     2.953676
max       0.316844    69.455969  18606.542539  9253.436646    59.853567

              kurt           AA
count  8827.000000  8826.000000
mean     37.203150  4413.500000
std     138.622132  2547.991071
min      -1.892874     1.000000
25%       0.512395          NaN
50%       3.432032          NaN
75%      12.453753          NaN
max    3999.613892  8826.000000

``````

describe can also detect non-numeric data and sometimes yields useful information about it.

It can also be applied to a single column:

[Python code:4.0.6.]
``````
fricative.sdev.describe()
Out:
count    8827.000000
mean     2727.724598
std      1339.636724
min       228.697624
25%      1771.421219
50%      2368.203536
75%      3595.757817
max      9253.436646
Name: sdev, dtype: float64

``````
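The descriptive methods above can be tried on a small mixed table. The sketch below uses made-up values with the column names duration, intensity, and Voice borrowed from the fricative data; it restricts the mean to numeric columns, counts missing values, and summarizes a string column with describe.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'duration': [0.06, 0.05, 0.05, 0.03],
    'intensity': [32.7, np.nan, 47.2, 41.7],
    'Voice': ['Voiced', 'Voiced', 'Voiceless', 'Voiced'],
})

# Restrict the statistics to numeric columns explicitly.
means = frame.mean(numeric_only=True)

# count() ignores missing values; isnull().sum() counts them.
n_missing = frame['intensity'].isnull().sum()

# For a string column, describe reports count, unique, top and freq.
voice_summary = frame['Voice'].describe()

print(means)
print(n_missing)
print(voice_summary)
```

For string data, top is the most frequent value and freq is the number of times it occurs.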