Wednesday, August 8, 2018

Pandas 1. Hands-on Python pandas intro.

To install pandas:
pip install pandas

[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> type(test_data)
<class 'pandas.core.frame.DataFrame'>
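
To reproduce the session without creating a file on disk, the same rows can be read from an in-memory string. This is only a minimal sketch: the names CSV_TEXT and missing_per_column are introduced here for illustration and are not part of the original session.

```python
import io

import pandas as pd

# Same content as test.csv above, kept in a string (illustrative name).
CSV_TEXT = """Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95
"""

# read_csv accepts any file-like object, not only a path.
test_data = pd.read_csv(io.StringIO(CSV_TEXT))

# Number of rows and columns that were parsed.
print(test_data.shape)

# Count the missing (NaN) cells in each column.
missing_per_column = test_data.isna().sum()
print(missing_per_column)
```

Empty CSV fields become NaN, which is why Rooms and Price report missing values here.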
>>> help(pd.DataFrame)
class DataFrame(pandas.core.generic.NDFrame)
 |  Two-dimensional size-mutable, potentially heterogeneous tabular data
 |  structure with labeled axes (rows and columns). Arithmetic operations
 |  align on both row and column labels. Can be thought of as a dict-like
 |  container for Series objects. The primary pandas data structure.
>>> test_data
<output omitted - NaN (Not a Number) marks a missing value>
>>> test_data.describe()
<output omitted - to read about mean, std, percentile go to it-tuff.blogspot.com/math-1 >
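
Since the describe() output is omitted above, the following sketch shows how individual statistics can be pulled out of it. It assumes the same rows as test.csv, loaded from a string so the block runs on its own; CSV_TEXT, summary and rooms_mean are illustrative names.

```python
import io

import pandas as pd

# Same content as test.csv above (illustrative name).
CSV_TEXT = """Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95
"""

test_data = pd.read_csv(io.StringIO(CSV_TEXT))

# describe() returns a DataFrame indexed by statistic name.
summary = test_data.describe()

# "count" is the number of non-missing values in each column.
print(summary.loc["count"])

# The mean is computed over non-missing values only.
rooms_mean = test_data["Rooms"].mean()
print(rooms_mean, summary.loc["mean", "Rooms"])
```

The count row is a quick way to see which columns contain NaN: it is lower than the number of rows for Rooms and Price.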
>>> test_data.columns
Index([u'Rooms', u'Price', u'Floors', u'Area'], dtype='object')
>>> test_data.columns.values
array(['Rooms', 'Price', 'Floors', 'Area'], dtype=object)
>>> help(test_data.dropna)
dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Remove missing values.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
>>> test_data = test_data.dropna(axis=0)
>>> test_data
<output omitted - rows with NaN values are removed>
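
The effect of the dropna parameters described in the help text can be compared side by side. This is a sketch under the assumption that the data matches test.csv; the variable names are made up for the example.

```python
import io

import pandas as pd

# Same content as test.csv above (illustrative name).
CSV_TEXT = """Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95
"""

test_data = pd.read_csv(io.StringIO(CSV_TEXT))

# axis=0: drop every row that has at least one NaN.
rows_complete = test_data.dropna(axis=0)

# axis=1: drop every column that has at least one NaN.
cols_complete = test_data.dropna(axis=1)

# subset: only look at the listed columns when deciding which rows to drop.
priced_rows = test_data.dropna(subset=["Price"])

print(rows_complete.shape)
print(list(cols_complete.columns))
print(priced_rows.shape)

# inplace defaults to False, so the original frame is left untouched.
print(test_data.shape)
```

Because dropna returns a new DataFrame by default, the session above assigns the result back to test_data.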
>>> test_data.describe()
<output omitted>
>>> test_data.columns.values
array(['Rooms', 'Price', 'Floors', 'Area'], dtype=object)
>>> price = test_data.Price
>>> type(price)
<class 'pandas.core.series.Series'>
>>> help(pd.Series)
class Series(pandas.core.base.IndexOpsMixin, pandas.core.generic.NDFrame)
 |  One-dimensional ndarray with axis labels (including time series).
>>> price
<output omitted>
>>> price.describe()
<output omitted - summary statistics for the Price column only>
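
A column can be read either as an attribute (test_data.Price) or with brackets (test_data["Price"]). The sketch below, which assumes the same rows as test.csv with incomplete rows already dropped, checks that both give the same Series; price_attr and price_item are illustrative names.

```python
import io

import pandas as pd

# Same content as test.csv above (illustrative name).
CSV_TEXT = """Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95
"""

test_data = pd.read_csv(io.StringIO(CSV_TEXT)).dropna(axis=0)

# Two equivalent ways to select one column as a Series.
price_attr = test_data.Price
price_item = test_data["Price"]
print(price_attr.equals(price_item))

# A Series has its own aggregation methods.
print(price_item.mean(), price_item.max())

# The row labels of the original frame are kept as the Series index.
print(list(price_item.index))
```

The bracket form is the safer habit, since attribute access fails for column names that contain spaces or clash with DataFrame methods.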
>>> test_data.columns.values
array(['Rooms', 'Price', 'Floors', 'Area'], dtype=object)
>>> test_data_features=['Rooms','Price']
>>> features = test_data[test_data_features]
>>> type(features)
<class 'pandas.core.frame.DataFrame'>
>>> features
<output omitted - only copies of Rooms and Price columns are shown>
>>> features.describe()
<output omitted>
>>> features.head(n=3)
<output omitted - only first 3 rows are shown>
>>> help(features.head)
head(self, n=5) method of pandas.core.frame.DataFrame instance
    Return the first `n` rows.
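
The type of the result depends on what goes inside the brackets: a single name gives a Series, a list of names gives a DataFrame. A short sketch, assuming the same data as test.csv; rooms and first_three are names chosen for the example.

```python
import io

import pandas as pd

# Same content as test.csv above (illustrative name).
CSV_TEXT = """Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95
"""

test_data = pd.read_csv(io.StringIO(CSV_TEXT)).dropna(axis=0)

# A list of column names selects a DataFrame with just those columns.
features = test_data[["Rooms", "Price"]]
print(type(features))

# A single column name selects a Series.
rooms = test_data["Rooms"]
print(type(rooms))

# head(3) keeps the first three rows and all selected columns.
first_three = features.head(3)
print(first_three)
```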
>>> test_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
Rooms     5 non-null float64
Price     5 non-null float64
Floors    5 non-null int64
Area      5 non-null int64
dtypes: float64(2), int64(2)
memory usage: 200.0 bytes
>>> 
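
In the info() output Rooms and Price are float64 although the file holds whole numbers. That is because NaN is a floating-point value, so a column with missing entries is read as float. One possible way to get integers back after the incomplete rows are dropped is sketched below; raw and clean are illustrative names, and the conversion is an addition to the original session.

```python
import io

import pandas as pd

# Same content as test.csv above (illustrative name).
CSV_TEXT = """Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95
"""

raw = pd.read_csv(io.StringIO(CSV_TEXT))

# Columns with missing values are parsed as float64.
print(raw.dtypes)

# Once the NaN rows are gone, the columns can be cast to integers.
clean = raw.dropna(axis=0).astype({"Rooms": "int64", "Price": "int64"})
print(clean.dtypes)
```

The cast only works after dropna; calling astype on a column that still contains NaN raises an error.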




