Introduction
pandas is a software library written for the Python programming language. It provides data structures and tools for data manipulation and analysis. pandas is normally used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts many coding idioms from NumPy; the biggest difference is that pandas is designed for working with tabular or heterogeneous data, whereas NumPy is best suited for working with homogeneous numerical array data. Since becoming an open-source project in 2010, pandas has matured into a quite large library that is applicable in a broad set of real-world use cases.
Description
To get started with pandas, we need to get comfortable with its two workhorse data structures:
- Series
- DataFrame
While they are not a universal solution for every problem, they provide a solid and easy-to-use basis for most applications.
Series
A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:
In [11]: obj = pd.Series([4, 7, -5, 3])

In [12]: obj
Out[12]:
0    4
1    7
2   -5
3    3
dtype: int64
The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. We can get the array representation and the index object of the Series through its values and index attributes, respectively:
In [13]: obj.values
Out[13]: array([ 4,  7, -5,  3])

In [14]: obj.index  # like range(4)
Out[14]: RangeIndex(start=0, stop=4, step=1)
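As a runnable sketch of the default-index behavior described above, the following script builds the same Series and inspects its data and index. It uses `to_numpy()`, which recent pandas documentation recommends over the `values` attribute; the variable names `values` and `index_labels` are only illustrative.

```python
import pandas as pd

# A Series built from a plain list gets a default integer index 0..N-1.
obj = pd.Series([4, 7, -5, 3])

# Underlying data as a NumPy array (equivalent to obj.values here).
values = obj.to_numpy()

# The default index is a RangeIndex; list() expands it to its labels.
index_labels = list(obj.index)

print(values)
print(obj.index)
print(index_labels)
```

The index always has the same length as the data, so `len(obj.index) == len(obj)` holds for any Series.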
Often it is desirable to create a Series with an index identifying each data point with a label:

In [15]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [16]: obj2
Out[16]:
d    4
b    7
a   -5
c    3
dtype: int64

In [17]: obj2.index
Out[17]: Index(['d', 'b', 'a', 'c'], dtype='object')
Compared with NumPy arrays, we can use labels in the index when selecting single values or a set of values:
In [18]: obj2['a']
Out[18]: -5

In [19]: obj2['d'] = 6

In [20]: obj2[['c', 'a', 'd']]
Out[20]:
c    3
a   -5
d    6
dtype: int64
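The same label-based selection can be sketched as a standalone script. This version uses the `.loc` accessor, which is pandas' explicit label-based indexer, rather than the plain brackets shown in the session above; for a string-labeled Series like this one the two should behave the same. The names `single` and `subset` are illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# A single label returns a scalar value.
single = obj2.loc["a"]

# Assignment by label changes the value stored under that label.
obj2.loc["d"] = 6

# A list of labels returns a new Series in the requested order.
subset = obj2.loc[["c", "a", "d"]]

print(single)
print(subset)
```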
Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers. Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:
In [21]: obj2[obj2 > 0]
Out[21]:
d    6
b    7
c    3
dtype: int64

In [22]: obj2 * 2
Out[22]:
d    12
b    14
a   -10
c     6
dtype: int64

In [23]: np.exp(obj2)
Out[23]:
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
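To make the preserved index-value link concrete, here is a small self-contained sketch of the three operations above. It starts from the Series as it stands after 'd' was set to 6; the names `positive`, `doubled`, and `exponentiated` are illustrative.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

# Boolean filtering keeps only the rows where the mask is True,
# together with their original labels.
positive = obj2[obj2 > 0]

# Scalar multiplication is applied element-wise; labels are unchanged.
doubled = obj2 * 2

# A NumPy ufunc applied to a Series returns a Series with the same index.
exponentiated = np.exp(obj2)

print(positive)
print(doubled)
print(exponentiated)
```

In each case the result is a new Series, and every value can still be looked up by the label it had in `obj2`.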
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which may be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row index and a column index. It may be thought of as a dict of Series all sharing the same index. Internally, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. Although a DataFrame is two-dimensional, we can use it to represent higher dimensional data in a tabular format using hierarchical indexing.
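The "dict of Series sharing one index" view and the idea of hierarchical indexing can both be sketched in a few lines. The column names `count` and `flag`, the labels, and the variable names below are hypothetical, chosen only for illustration; the state, year, and population figures are a subset of the sample data used later in this section.

```python
import pandas as pd

labels = ["a", "b", "c"]

# Each column is a Series with its own value type; all share the same index.
frame = pd.DataFrame({
    "count": pd.Series([1, 2, 3], index=labels),
    "flag": pd.Series([True, False, True], index=labels),
})

# Selecting a column gives back a Series carrying the frame's row index.
count_column = frame["count"]
print(frame.dtypes)

# Hierarchical indexing: (state, year) pairs form a two-level row index,
# so a two-dimensional table can hold data keyed by two dimensions.
pop = pd.DataFrame(
    {"pop": [2.4, 1.5, 1.7]},
    index=pd.MultiIndex.from_tuples(
        [("Nevada", 2001), ("Ohio", 2000), ("Ohio", 2001)],
        names=["state", "year"],
    ),
)

# Selecting on the outer level returns the rows for that state, indexed by year.
ohio = pop.loc["Ohio"]
print(ohio)
```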
There are many ways to construct a DataFrame. One of the most common is from a dict of equal-length lists or NumPy arrays:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
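The resulting DataFrame has its index assigned automatically, as with Series, and the columns are placed in sorted order (this is the behavior of older pandas releases; since pandas 0.23 on Python 3.6+, the columns keep the dict's insertion order):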
In [45]: frame
Out[45]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003

If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table. For large DataFrames, the head method selects only the first five rows:

In [46]: frame.head()
Out[46]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
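The construction and the head call can be reproduced with the short script below. As one possible approach, it passes the `columns` argument to `pd.DataFrame` so that the column order is stated explicitly and does not depend on the pandas version; the names `first_rows` and `first_two` are illustrative.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

# columns= fixes the column order regardless of the dict's key order.
frame = pd.DataFrame(data, columns=["pop", "state", "year"])

# head() returns the first five rows by default; head(n) returns the first n.
first_rows = frame.head()
first_two = frame.head(2)

print(frame)
print(first_rows)
```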