What is Pandas Data Structure?


pandas is a software library written for the Python programming language. It provides data structures and tools for data manipulation and analysis. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts many coding idioms from NumPy, but the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data. Since becoming an open-source project in 2010, pandas has matured into a quite large library that is applicable in a broad set of real-world use cases.
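
To make the contrast between homogeneous and heterogeneous data concrete, here is a minimal sketch. The column names and values are made up for illustration. It only shows that a NumPy array settles on one dtype for all of its elements, while each DataFrame column keeps its own type.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixed ints and floats are upcast.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# A DataFrame is tabular: each column can have its own type.
# (Hypothetical column names and values, for illustration only.)
table = pd.DataFrame(
    {
        "name": ["a", "b"],
        "score": [1.5, 2.0],
        "passed": [True, False],
    }
)
print(table.dtypes)
```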


To get started with pandas, we need to get comfortable with its two workhorse data structures:

  1. Series
  2. DataFrame

While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications. The examples below assume that pandas and NumPy have been imported as pd and np, respectively.


Series

A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [11]: obj = pd.Series([4, 7, -5, 3])
In [12]: obj
Out[12]:
0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one is created, consisting of the integers 0 through N - 1, where N is the length of the data. We can get the array representation and the index object of the Series through its values and index attributes, respectively:

In [13]: obj.values
Out[13]: array([ 4,  7, -5,  3])
In [14]: obj.index  # like range(4)
Out[14]: RangeIndex(start=0, stop=4, step=1)
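
As a rough, self-contained sketch of the same steps outside an interactive session: the variable names values and index are chosen here for illustration, and to_numpy() is used as an alternative to the values attribute.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

values = obj.to_numpy()  # NumPy array of the data (obj.values also works)
index = obj.index        # default index, one integer label per value

print(values)
print(list(index))                # [0, 1, 2, 3]
print(len(obj) == len(index))     # True
```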

Often it is desirable to create a Series with an index that identifies each data point with a label:

In [15]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [16]: obj2
Out[16]:
d      4
b      7
a     -5
c      3
dtype: int64
In [17]: obj2.index
Out[17]: Index(['d', 'b', 'a', 'c'], dtype='object')
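
One detail worth knowing when passing a custom index is that it needs one label per value. The sketch below is illustrative only; the names labels and mismatch_rejected are not part of pandas.

```python
import pandas as pd

labels = ["d", "b", "a", "c"]
obj2 = pd.Series([4, 7, -5, 3], index=labels)
print(obj2.index.tolist())  # ['d', 'b', 'a', 'c']

# Four values with only two labels: pandas rejects this with a ValueError.
try:
    pd.Series([4, 7, -5, 3], index=["d", "b"])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```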

Unlike with NumPy arrays, we can use labels in the index when selecting single values or a set of values:

In [18]: obj2['a']
Out[18]: -5
In [19]: obj2['d'] = 6
In [20]: obj2[['c', 'a', 'd']]
Out[20]:
c      3
a     -5
d      6
dtype: int64
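
The same label-based selection can be written as a short script. This is a sketch rather than the only way to do it: the .loc accessor is shown as a more explicit spelling of label lookup, and the variable names are illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

single = obj2["a"]                # select one value by label
obj2["d"] = 6                     # assign by label
subset = obj2[["c", "a", "d"]]    # a list of labels returns a new Series

# .loc states explicitly that the lookup is by label.
same_subset = obj2.loc[["c", "a", "d"]]

# Membership tests check the index labels, not the values.
has_b = "b" in obj2

print(single)           # -5
print(subset.tolist())  # [3, -5, 6]
print(has_b)            # True
```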

Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers. Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserves the index-value link:

In [21]: obj2[obj2 > 0]
Out[21]:
d    6
b    7
c    3
dtype: int64
In [22]: obj2 * 2
Out[22]:
d    12
b    14
a   -10
c     6
dtype: int64
In [23]: np.exp(obj2)
Out[23]:
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
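
To check that the labels really do travel with the values, a small sketch along these lines can help. The Series is rebuilt here with 'd' already set to 6 so the block runs on its own; the result names are illustrative.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

positive = obj2[obj2 > 0]   # boolean filter keeps the labels of surviving rows
doubled = obj2 * 2          # scalar multiplication keeps every label
exped = np.exp(obj2)        # a NumPy ufunc returns a Series, not a bare array

print(positive.index.tolist())  # ['d', 'b', 'c']
print(doubled["a"])             # -10
print(type(exped).__name__)     # Series
```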

DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row index and a column index. It can be thought of as a dict of Series all sharing the same index. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or some other collection of one-dimensional arrays. Although a DataFrame is two-dimensional, we can use it to represent higher-dimensional data in a tabular format using hierarchical indexing.
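
The "dict of Series" picture and the idea of hierarchical indexing can be sketched as follows. The data is a small made-up sample, and using set_index with two columns is just one way to build a hierarchical (multi-level) index.

```python
import pandas as pd

# Three Series that share the same default index 0, 1, 2.
state = pd.Series(["Ohio", "Ohio", "Nevada"])
year = pd.Series([2000, 2001, 2001])
pop = pd.Series([1.5, 1.7, 2.4])

frame = pd.DataFrame({"state": state, "year": year, "pop": pop})

# Pulling a column back out gives a Series with the frame's row index.
column = frame["pop"]
shares_index = column.index.equals(frame.index)

# Hierarchical indexing: two index levels (state, year) on a 2D table.
indexed = frame.set_index(["state", "year"]).sort_index()
ohio_2001 = indexed.loc[("Ohio", 2001), "pop"]

print(shares_index)            # True
print(indexed.index.nlevels)   # 2
print(ohio_2001)               # 1.7
```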

There are many ways to construct a DataFrame. One of the most common is from a dict of equal-length lists or NumPy arrays:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series. In recent versions of pandas the columns follow the order of the dict keys (older versions placed them in sorted order):

In [45]: frame
Out[45]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table.

For large DataFrames, the head method selects only the first five rows:

In [46]: frame.head()
Out[46]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

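
Putting the construction and head steps together, a self-contained sketch might look like this. The columns argument is included as one way to fix the column order explicitly instead of relying on the default; the variable names are illustrative.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)

first_five = frame.head()   # head() defaults to the first five rows
first_two = frame.head(2)   # pass n to choose a different number of rows

# Passing columns= arranges the columns in exactly the given order.
reordered = pd.DataFrame(data, columns=["year", "state", "pop"])

print(len(first_five))             # 5
print(first_two["state"].tolist()) # ['Ohio', 'Ohio']
print(reordered.columns.tolist())  # ['year', 'state', 'pop']
```
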
Mansoor Ahmed is a chemical engineer, web developer, and writer currently living in Pakistan. My interests range from technology to web development. I am also interested in programming, writing, and reading.