Pandas doc: https://pandas.pydata.org/docs/index.html
Cookbook (useful strips of code): https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook
Overview
Pandas is a powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of NumPy. It is widely used for data cleaning, transformation, analysis, and visualization in Python.
Axis:
Axis:
- Default is by columns, Axis=0
- Can override to be by rows, Axis=1
A B C D F E
2013 0.00 0.00 -1.50 5.0 NaN 1.0
2014 1.21 -0.17 0.11 5.0 1.0 1.0
2015 -0.86 -2.10 -0.49 5.0 2.0 NaN
2016 0.72 -0.70 -1.03 5.0 3.0 NaN
df.mean():
A -0.00
B -0.38
C -0.68
D 5.00
F 3.00
df.mean(axis=1):
2013 0.87
2014 1.43
2015 0.70
2016 1.39Key Features
- Data structures for efficiently storing large datasets
- Tools for reshaping and merging datasets
- Support for time series data
- High-performance operations for aggregating and filtering data
Core Data Structures
- Series: 1-dimensional labeled array capable of holding any data type.
# Default series
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
--------
0 10
1 20
2 30
3 40
dtype: int64# Series with custom index
data = [10, 20, 30, 40]
index = ['a', 'b', 'c', 'd']
series = pd.Series(data, index=index)
print(series)
- - - - - - - - - - -
a 10
b 20
c 30
d 40
dtype: int64- DataFrame: 2-dimensional labeled data structure with columns of potentially different types.
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'] }
df = pd.DataFrame(data)
print(df)
- - - - - - - - - - -
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicagoindex = ['person1', 'person2', 'person3']
df = pd.DataFrame(data, index=index)
print(df)
- - - - - - - - - - -
Name Age City
person1 Alice 25 New York
person2 Bob 30 Los Angeles
person3 Charlie 35 ChicagoImportant Functions and Methods
Series
pd.Series(data, index=index): Create a Series. Shown above, start with list / listsseries.head(n): Returns the firstnelements.series.tail(n): Returns the lastnelements.series.describe(): Generates descriptive statistics.series.value_counts(): Returns a Series with counts of unique values.
DataFrame
pd.DataFrame(data, index=index, columns=columns): Create a DataFrame.df.head(n): Returns the firstnrows.df.tail(n): Returns the lastnrows.df.describe(): Generates descriptive statistics.df.info(): Provides a concise summary of the DataFrame.df.shape: Returns a tuple representing the dimensionality of the DataFrame.df.columns: Returns the column labels.df.index: Returns the row labels.df.dtypes: Returns the data type of each column
Quick Indexing
df["A"]: Passing a single label selects a column and return a Series equivalent to df.A. Effectively, plucking a column out of the df.df[0:3]: Will slice the rows, in this case returning the first three rows. When using non-index slicesdf['viper':'cobra'], both start and stop are included.df.loc()anddf.at(): Both finding values, use at if only searching for one value.df.loc["20130102":"20130104", ["A", "B"]]: Arg 1 is rows, arg 2 is cols.df.iloc()anddf.iat(): Integer version of loc and at
- Boolean Indexing: Selects if expression is true
df[df["A"] > 0]: Selects rows where value of A is greater than 0df[df > 0]: Returns values of df that are positive.s.isin([]): For a series, returns a boolean value depending on if value is in the array-like input. Can invert with the~operator. Strings and ints are not comparable.- Usage for indexing:
df2[df2["E"].isin(["two", "four"])]
- Usage for indexing:
s.isin(['cow', 'llama'])
- - - - - - - - - - -
0 True
1 True
2 True
3 False
4 True
5 FalseData Manipulation
df.sort_values(by, ascending=True): Sorts DataFrame by values.df.groupby(by): Groups DataFrame using a mapper or by a Series of columns.df.merge(right, how='inner', on=None): Merges DataFrame with another DataFrame.df.pivot_table(values, index, columns, aggfunc): Creates a pivot table.df.apply(func, axis=0): Applies a function along an axis of the DataFrame.
Data Cleaning
df.dropna(): Removes missing values.df.fillna(value): Fills NA/NaN values with a specified value.df.replace(to_replace, value): Replaces values in DataFrame.
Data Input/Output
pd.read_csv(filepath_or_buffer, sep=','): Reads a CSV file into a DataFrame.df.to_csv(path_or_buf, sep=','): Writes DataFrame to a CSV file.pd.read_excel(io, sheet_name=0): Reads an Excel file into a DataFrame.df.to_excel(excel_writer, sheet_name='Sheet1'): Writes DataFrame to an Excel file.
Documentation & Resources
- Official Documentation
- Pandas GitHub Repository
- Pandas Tutorials & Examples
- Pandas Cheat Sheet
Common Use Cases
- Data cleaning and preparation
- Exploratory data analysis (EDA)
- Data manipulation and transformation
- Feature engineering
- Data aggregation and summarization