Pandas

Pandas Objects

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

  • Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.

  • Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.

We will start our code sessions with the standard NumPy and Pandas imports:


import numpy as np

import pandas as pd

1.1 The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data.

It can be created from a list or array as follows:


data = pd.Series([0.25, 0.5, 0.75, 1.0])

data

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.

The values are simply a familiar NumPy array:


data.values

The index is an array-like object of type pd.Index, which we'll discuss in more detail momentarily.


data.index

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:


data[1]


data[1:3]

The Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

Series as generalized NumPy array

  • From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array.

  • The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

  • This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.

For example, if we wish, we can use strings as an index:


data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data

And the item access works as expected:


data['a']


data.index

The dtype comes from NumPy: it describes the type of the elements in an ndarray. Every element in an ndarray must occupy the same number of bytes; for int64 and float64 that is 8 bytes. Strings, however, have variable length, so rather than storing the string bytes in the ndarray directly, Pandas uses an ndarray of pointers to Python objects. That is why the dtype of such an array is reported as object.
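
As a quick illustration (the dtypes shown in the comments are what we would expect from a default, recent Pandas installation):

s_num = pd.Series([1.0, 2.0, 3.0])
s_str = pd.Series(['a', 'bb', 'ccc'])
print(s_num.dtype)   # float64: fixed-size 8-byte values stored directly in the array
print(s_str.dtype)   # object: the array stores pointers to Python string objects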

We can even use non-contiguous or non-sequential indices:


data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

data


data[5]


data.index

Series as specialized dictionary

  • In this way, you can think of a Pandas Series as a specialization of a Python dictionary.

  • A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values.

  • This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
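
As a minimal sketch of what this means in practice (the variable names here are purely illustrative): doubling every value in a plain dictionary requires an explicit Python-level loop, whereas the same operation on a Series is a single vectorized expression executed by compiled NumPy code.

d = {'a': 0.25, 'b': 0.5, 'c': 0.75, 'd': 1.0}
s = pd.Series(d)
doubled_dict = {key: value * 2 for key, value in d.items()}  # Python-level loop over items
doubled_series = s * 2                                       # vectorized operation on the underlying array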

The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:


population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)

population

By default, a Series will be created where the index is drawn from the dictionary keys (older versions of Pandas sorted the keys; recent versions preserve the dictionary's insertion order).

From here, typical dictionary-style item access can be performed:


population['California']

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:


population['California':'Florida']


population.index

Constructing Series objects

We've already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:


>>> pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.

For example, data can be a list or NumPy array, in which case index defaults to an integer sequence:


pd.Series([2, 4, 6])

data can be a scalar, which is repeated to fill the specified index:


pd.Series(5, index=[100,200,300])

data can be a dictionary, in which case index defaults to the dictionary keys:


pd.Series({2:'a', 1:'b', 3:'c'})

The index can be explicitly set if a different result is preferred:


pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

Notice that in this case, the Series is populated only with the explicitly identified keys.
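
Conversely, if the explicit index contains a key that is not present in the dictionary, that entry is filled with NaN (a small illustrative example):

pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 4])  # index 4 is not a dictionary key, so its value is NaN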

Q1: What is the difference between a series and a one dimensional array?

Q2: What is the difference between a series and a dictionary?

1.2 The Pandas DataFrame Object

The next fundamental structure in Pandas is the DataFrame.

Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

We'll now take a look at each of these perspectives.

DataFrame as a generalized NumPy array

  • If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

  • Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

To demonstrate this, let's first construct a new Series listing the area of each of the five states discussed in the previous section:


area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

area = pd.Series(area_dict)

area


population

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:


states = pd.DataFrame({'population': population,
                       'area': area})

states

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:


states.index

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:


states.columns

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary.

  • A dictionary maps a key to a value

  • A DataFrame maps a column name to a Series of column data.

For example, asking for the 'area' attribute returns the Series object containing the areas we saw earlier:


states['area']

You can also get a specific column as a Series with the following code:


states[states.columns[0]]  # states.columns[0] is the label of the first column

Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways.

Here we'll give several examples.

From a single Series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:


pd.DataFrame(population)


pd.DataFrame(population, columns=['population'])

From a list of dicts

Any list of dictionaries can be made into a DataFrame.

We'll use a simple list comprehension to create some data:


data = [{'a': i, 'b': 2 * i} for i in range(3)]

print(data)

pd.DataFrame(data)

Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values:


pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

From a dictionary of Series objects

As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:


pd.DataFrame({'population': population,
              'area': area})

From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names.

If omitted, an integer index will be used for each:


import numpy as np

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])


pd.DataFrame(np.random.rand(4))

Q3: What is the difference between a data frame and a two dimensional array?

1.3 The Pandas Index Object

  • We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data.

  • This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values).

As a simple example, let's construct an Index from a list of integers:


ind = pd.Index([2, 3, 5, 7, 11])

ind

Index as immutable array

The Index in many ways operates like an array.

For example, we can use standard Python indexing notation to retrieve values or slices:


ind[1]


ind[::2]

Index objects also have many of the attributes familiar from NumPy arrays:


print(ind.size, ind.shape, ind.ndim, ind.dtype)

One difference between Index objects and NumPy arrays is that indices are immutable–that is, they cannot be modified via the normal means:


ind[1] = 0

This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.
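
A small sketch of what this enables (assuming, as is generally the case, that Pandas reuses an existing Index object passed to a constructor rather than copying it):

shared_index = pd.Index(['a', 'b', 'c'])
s1 = pd.Series([1, 2, 3], index=shared_index)
s2 = pd.Series([4, 5, 6], index=shared_index)
s1.index is s2.index  # True: both Series share the same immutable Index object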

Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.

The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:


indA = pd.Index([1, 3, 5, 7, 9])

indB = pd.Index([2, 3, 5, 7, 11])


indA & indB # intersection


indA | indB # union


indA ^ indB # symmetric difference

These operations may also be accessed via object methods, for example indA.intersection(indB). In recent versions of Pandas the operator forms shown above are deprecated or have changed meaning, so the method forms are preferred.
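
For example, the same results as above can be obtained with:

indA.intersection(indB)
indA.union(indB)
indA.symmetric_difference(indB)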

2 Data Indexing and Selection

In the previous topic, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays.

These included

  • Indexing (e.g., arr[2, 1])

  • Slicing (e.g., arr[:, 1:5])

  • Masking (e.g., arr[arr > 0])

  • Fancy indexing (e.g., arr[0, [1, 5]])

and combinations thereof (e.g., arr[:, [1, 5]]).

Here we'll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects.

We'll start with the simple case of the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.

2.1 Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.

If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:


import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data


data['b']

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:


'a' in data


data.keys()


list(data.items())

Series objects can even be modified with a dictionary-like syntax.

Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:


data['e'] = 1.25

data

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.

Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing.

Examples of these are as follows:


# slicing by explicit index

data['a':'c']


# slicing by implicit integer index

data[0:2]


# masking

data[(data > 0.3) & (data < 0.8)]


# fancy indexing

data[['a', 'e']]

Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.

Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion.

For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.


data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

data


# explicit index when indexing

data[1]


# implicit index when slicing

data[1:3]

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes.

These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the loc attribute allows indexing and slicing that always references the explicit index:


data.loc[1]


data.loc[1:3]

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:


data.iloc[1]


data.iloc[1:3]

One guiding principle of Python code is that "explicit is better than implicit."

The explicit nature of loc and iloc makes them very useful for maintaining clean and readable code; especially in the case of integer indexes, it is recommended to use them, both to make code easier to read and understand and to prevent subtle bugs due to the mixed indexing/slicing convention.

2.2 Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

These analogies can be helpful to keep in mind as we explore data selection within this structure.

DataFrame as a dictionary

The first analogy we will consider is the DataFrame as a dictionary of related Series objects.

Let's return to our example of areas and populations of states:


area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

data = pd.DataFrame({'area':area, 'pop':pop})

data

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:


data['area']

Equivalently, we can use attribute-style access with column names that are strings:


data.area

This attribute-style column access actually accesses the exact same object as the dictionary-style access:


data.area is data['area']


data.area = 10000

data

Though this is a useful shorthand, keep in mind that it does not work for all cases!

For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible.

For example, the DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column:


data.pop is data['pop']

In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = 1000 rather than data.pop = 1000).

Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:


data['density'] = data['pop'] / data['area']

data

DataFrame as two-dimensional array

As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array.

We can examine the raw underlying data array using the values attribute:


data.values

With this picture in mind, many familiar array-like observations can be done on the DataFrame itself.

For example, we can transpose the full DataFrame to swap rows and columns:


data


data.T

When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array.

In particular, passing a single index to an array accesses a row:


data.values[0]

and passing a single "index" to a DataFrame accesses a column:


data['area']

Thus for array-style indexing, we need another convention.

Here Pandas again uses the loc and iloc indexers mentioned earlier.

Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:


data.iloc[:3, :2]

Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:


data.loc[:'Illinois', :'pop']

Any of the familiar NumPy-style data access patterns can be used within these indexers.

For example, in the loc indexer we can combine masking and fancy indexing as in the following:


data.loc[data.density > 100, ['pop', 'density']]

Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:


data.iloc[0, 2] = 90

data

Additional indexing conventions

There are a couple of extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.

  • First, while indexing refers to columns, slicing refers to rows:

data['Florida':'Illinois']

Such slices can also refer to rows by number rather than by index:


data[1:3]

  • Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

data[data.density > 100]

These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.

3 Operating on Data in Pandas

  • One of the essential pieces of NumPy is the ability to perform quick element-wise operations, e.g.,

  • with basic arithmetic (addition, subtraction, multiplication, etc.)

  • with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).

  • Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced in the NumPy computation section are key to this.

  • Pandas includes a couple of useful twists:

  • for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output

  • for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc.

This means that keeping the context of data and combining data from different sources–both potentially error-prone tasks with raw NumPy arrays–become essentially foolproof ones with Pandas.

We will additionally see that there are well-defined operations between one-dimensional Series structures and two-dimensional DataFrame structures.

3.1 Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects.

Let's start by defining a simple Series and DataFrame on which to demonstrate this:


rng = np.random.RandomState(42)

ser = pd.Series(rng.randint(0, 10, 4))

ser


df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])

df

If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved:


np.exp(ser)

Or, for a slightly more complex calculation:


np.sin(df * np.pi / 4)

Any of the ufuncs discussed in the Computation on NumPy Arrays section can be used in a similar manner.
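
For instance, applying yet another ufunc to the same DataFrame (an extra illustrative example):

np.sqrt(df)  # again returns a DataFrame with the 'A'-'D' column labels and integer index preserved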

3.2 UFuncs: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation.

This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

Index alignment in Series

As an example, suppose we are combining two different data sources, and find only the top three US states by area and the top three US states by population:


# The name argument allows you to give a name to a Series object, i.e. to the column,
# so that when you put it in a DataFrame, the column will be named according to the name parameter.

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

Let's see what happens when we divide these to compute the population density:


dfarea = pd.DataFrame(area)

dfpopulation = pd.DataFrame(population)

dfarea


dfpopulation


df_result = pd.DataFrame(population / area)

df_result

The resulting array contains the union of indices of the two input arrays.

Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how Pandas marks missing data.

This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:


A = pd.Series([2, 4, 6], index=[0, 1, 2])

B = pd.Series([1, 3, 5], index=[1, 2, 3])

A + B

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.

For example, calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing:


A.add(B, fill_value=0)

Index alignment in DataFrame

A similar type of alignment takes place for both columns and indices when performing operations on DataFrames:


A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))

A


B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))

B


A + B

Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.

As was the case with Series, we can use the associated object's arithmetic method and pass any desired fill_value to be used in place of missing entries.

Here we'll fill with the mean of all values in A (computed by first stacking the rows of A):


# The stack() method pivots a DataFrame's columns into the innermost level of the index, producing a Series.
# http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html

fill = A.stack().mean()

A.add(B, fill_value=fill)

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                 |
|-----------------|----------------------------------|
| +               | add()                            |
| -               | sub(), subtract()                |
| *               | mul(), multiply()                |
| /               | truediv(), div(), divide()       |
| //              | floordiv()                       |
| %               | mod()                            |
| **              | pow()                            |
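
As a small illustration of one of these method equivalents, reusing the A and B DataFrames from above:

A.sub(B, fill_value=0)  # equivalent to A - B, except missing entries are treated as 0 instead of producing NaN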

3.3 Ufuncs: Operations Between DataFrame and Series

When performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained.

Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-dimensional NumPy array.

Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:


A = rng.randint(10, size=(3, 4))

A


A - A[0]

According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:


df = pd.DataFrame(A, columns=list('QRST'))

df


df - df.iloc[0]

If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword:


# Axis to target. Can be either the axis name ('index') or number (0).

df.subtract(df['R'], axis=0)


df.subtract(df['R'],axis='index')

Note that these DataFrame/Series operations, like the operations discussed above, will automatically align indices between the two elements:


halfrow = df.iloc[0, ::2]

halfrow


df - halfrow

This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.

4 Import and inspect data

  • In this case, we will analyze a ready-made dataset from Gapminder: https://www.gapminder.org/

  • More data can be found here: https://www.gapminder.org/data/


# A common way to import "ready-made" or harvested data, normally in CSV format.
# The data file used here comes from this tutorial: "https://storage.googleapis.com/learn_pd_like_tidyverse/gapminder.csv"
# Special thanks to the original author for compiling and sharing this dataset.

gapminder = pd.read_csv('./Data/gapminder_cleaned_2.csv')

  • Using Excel, open this csv file and check its contents.

  • Be careful that you don't make changes to this csv file in Excel.

  • A safe way is to make a copy of the csv file, and then open the copy with Excel.

Note: reading and writing (external) data with Pandas (Nelli, 2018, p. 142)
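
Pandas can also write a DataFrame back out to a file; for example (the output path here is hypothetical):

# write the DataFrame to a new CSV file, omitting the index column
gapminder.to_csv('./Data/gapminder_copy.csv', index=False)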


# checking the type and taking a glance at the head

print(type(gapminder))

gapminder.head(3) # show the first 3 rows of data


gapminder.tail(3) # Show the last 3 rows of data

4A. Examining the attributes of the DataFrame (standard procedures)

  • df.info()

  • df.shape (similar to "dim" in R)

  • df.columns (check the variables, similar to "names" in R)

  • df.index (check the index of the "rows")

  • df.describe() (descriptive statistics for numerical variables)

  • Note that methods (functions) have parentheses, and attributes (values) don't.


gapminder.info()

  

# Shows info of the columns (name and data type) and

# number of rows (compare with results from the tail() command)


gapminder.shape

# (the number of cases/observations/rows, the number of variables/columns)


gapminder.columns


gapminder.index


gapminder.describe()
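
By default, describe() summarizes only the numerical columns; to include the non-numeric (e.g., string) columns as well, we can pass the include argument (a small extra example):

gapminder.describe(include='all')  # adds count, unique, top, and freq for non-numeric columns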