[Python ML Guide] Section 1.3: Pandas

Jae. 2023. 8. 24. 09:25


https://www.inflearn.com/course/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EC%99%84%EB%B2%BD%EA%B0%80%EC%9D%B4%EB%93%9C

[Revised Edition] Python Machine Learning Perfect Guide - Inflearn | Course


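0. Pandas Overview

Pandas

The most popular Python library for data processing
Most datasets are two-dimensional, so they are organized into rows and columns
Two-dimensional data is popular because it is the structure people find easiest to understand while still holding data effectively
Pandas provides a rich set of features for efficiently transforming and processing this kind of row-and-column data

Pandas Features

A Python library for working with structured data
The Excel of Python!
Integrates with NumPy, the high-performance array computation library, to provide powerful spreadsheet-style processing
Built on top of NumPy, so NumPy's high-performance operations remain available
Provides indexing, computation functions, preprocessing functions, and more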

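1. Pandas Components

Main Pandas Components

DataFrame: a two-dimensional dataset made of rows and columns
Series: a one-dimensional dataset holding the values of a single column
Index: the object holding the key values that identify the rows of a DataFrame/Series

DataFrame

The object that holds an entire data table
Made up of an Index (vertical axis, row labels) and Columns (horizontal axis, column labels)
- Index + Column -> Value
Pass the data to DataFrame()
- Usually a dictionary; a 2D list also works
  - {column_name1 : list1, column_name2 : list2, column_name3 : list3,...}
  - Dictionary key = column name
  - Dictionary value = the row values that go into that column
- 2D list: each inner list is always laid out as one row
pd.DataFrame(dictionary, columns = [ ], index = [ ]) sets the column order (adding any new columns) and the index labels at creation time
Difference from a Series: a Series holds one column of values under a single name, while a DataFrame holds one or more named columns; Series.to_frame() converts a Series into a one-column DataFrame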

NumPy array-like
Each column can have a different type
Row & Column index
Size mutable: insert & delete columns

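Series

The object holding the data of a single column of a DataFrame
Made up of an Index and values: Index -> Value
The Index can be set to arbitrary labels
Unlike a DataFrame, a Series has no column labels; it carries only an optional name
Values can be read and modified by index label, much like a dictionary

Backed by a NumPy array (in current pandas, Series is not a subclass of numpy.ndarray)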
Data: any type
Index labels need not be ordered
Duplicates are possible
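
The Series properties above can be tried out with a minimal sketch like the one below. The values and labels are made up for illustration; only the pandas calls themselves (Series construction, label access, to_frame()) are standard.

```python
import pandas as pd

# A Series with a custom (string) index and a name.
years = pd.Series([2011, 2016, 2015], index=["one", "two", "three"], name="Year")

# Dictionary-style read and update by index label.
first_year = years["one"]
years["two"] = 2017

# Having a name does not make a Series a DataFrame;
# to_frame() is what converts it into a one-column DataFrame.
years_df = years.to_frame()

print(type(years).__name__, type(years_df).__name__)
print(years_df)
```

The name of the Series becomes the column label of the resulting DataFrame.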

Creating a DataFrame

Create it from a dictionary
Add a new column name
Assign new values to the Index
NaN: Not a Number (used interchangeably with Null)

import pandas as pd

dic1 = {
    "Name": ["Chulmin", "Eunkyung", "Jinwoong", "Soobeom"],
    "Year": [2011, 2016, 2015, 2015],
    "Gender": ["Male", "Female", "Male", "Male"],
}

# Convert the dictionary to a DataFrame
data_pf = pd.DataFrame(dic1)
print(data_pf)
print()

# Add a new column name ("Age" is not in the dictionary, so it is filled with NaN)
data_pf = pd.DataFrame(dic1, columns=["Name", "Year", "Gender", "Age"])
print(data_pf)
print()

# Assign new values to the index
data_pf = pd.DataFrame(dic1, index=["one", "two", "three", "four"])
print(data_pf)
print()

#       Name  Year  Gender
# 0   Chulmin  2011    Male
# 1  Eunkyung  2016  Female
# 2  Jinwoong  2015    Male
# 3   Soobeom  2015    Male

#        Name  Year  Gender  Age
# 0   Chulmin  2011    Male  NaN
# 1  Eunkyung  2016  Female  NaN
# 2  Jinwoong  2015    Male  NaN
# 3   Soobeom  2015    Male  NaN

#            Name  Year  Gender
# one     Chulmin  2011    Male
# two    Eunkyung  2016  Female
# three  Jinwoong  2015    Male
# four    Soobeom  2015    Male

To convert a DataFrame or Series to an ndarray, use the .values attribute (no parentheses) or the to_numpy() method

print("columns:", titanic_df.columns)
print("index:", titanic_df.index)

# .values is an attribute, not a method, so it takes no parentheses
print("index value:", titanic_df.index.values)
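
As a small illustration of the conversion described above, the sketch below uses a made-up two-column DataFrame (not the Titanic data) and compares the .values attribute with the to_numpy() method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to show the conversion.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

values_attr = df.values            # attribute access, no parentheses
values_method = df.to_numpy()      # equivalent method form
index_array = df.index.to_numpy()  # works on an Index as well
column_array = df["a"].to_numpy()  # and on a single Series

print(type(values_attr), values_attr.shape)
print(index_array, column_array)
```

Writing df.values() with parentheses would fail, because it tries to call the returned ndarray.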

Index

A Pandas Index object identifies each record of a DataFrame or Series, similar to a primary key (PK) in an RDBMS (but a pandas Index is not a regular column)
DataFrame/Series objects contain an Index object, but the Index is excluded when a computation is applied to a Series
The Index is used only for identification
To extract just the Index object, use the DataFrame.index or Series.index attribute
A Pandas Index does not have to be numeric; string or datetime labels are fine as long as they identify the records uniquely

# Reload the original file
titanic_df = pd.read_csv("titanic_train.csv")
# Extract the index object
indexes = titanic_df.index
print(indexes)
# Convert the index object to an ndarray of its values
print("index object array values: \n", indexes.values)

# RangeIndex(start=0, stop=891, step=1)

print(type(indexes.values))
print(indexes.values.shape)
print(indexes[:5].values)
print(indexes.values[:5])
print(indexes[6])

# <class 'numpy.ndarray'>
# (891,)
# [0 1 2 3 4]
# [0 1 2 3 4]
# 6

# The Index never takes part in a computation: it only serves as an identifier

series_fare = titanic_df["Fare"]
print("Fare Series max:", series_fare.max())
print("Fare Series sum:", series_fare.sum())
print("sum() Fare Series:", sum(series_fare))
print("Fare Series + 3:\n", (series_fare + 3).head(3))

# Fare Series max: 512.3292
# Fare Series sum: 28693.9493
# sum() Fare Series: 28693.949299999967
# Fare Series + 3:
#  0    10.2500
# 1    74.2833
# 2    10.9250
# Name: Fare, dtype: float64

Replacing only the Columns / Index

Reassign with df.columns = [modify_list]

Reassign with df.index = [modify_list]

2. Basic Pandas API

Basic API

read_csv()
head()
shape
info()
describe()
value_counts()
sort_values()
reset_index()
rename()

WHEN?

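describe(): check the overall distribution of the data in a DataFrame

value_counts(): check the distribution of values in a specific column

info(): check each feature's dtype and non-null count (from which the number of nulls follows)

read_csv()

Conveniently loads a CSV file into a DataFrame
Changing the sep argument of read_csv() from ',' to another delimiter lets it load other file types as well
Comma-delimited fields (','): read_csv()
Tab-delimited fields ('\t'): read_table() or read_csv('file_name', sep='\t')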

titanic_df = pd.read_csv("titanic_train.csv")
print('titanic_df type:', type(titanic_df))
titanic_df
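
The separator handling described above can be checked without any file on disk. The sketch below is an illustration only: the two in-memory strings are hypothetical stand-ins for a comma-delimited file and a tab-delimited file.

```python
import io

import pandas as pd

# Hypothetical file contents, held in memory instead of on disk.
comma_text = "name,year\nChulmin,2011\nEunkyung,2016\n"
tab_text = "name\tyear\nChulmin\t2011\nEunkyung\t2016\n"

comma_df = pd.read_csv(io.StringIO(comma_text))        # default sep=","
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")  # explicit tab separator
table_df = pd.read_table(io.StringIO(tab_text))        # tab is the default here

print(comma_df)
print(tab_df.equals(table_df))
```

All three calls produce the same two-row DataFrame; only the way the delimiter is specified differs.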

head() and tail()

head(): returns the first 5 rows of the DataFrame
- Pass a different number to choose how many rows to return
tail(): returns the last 5 rows of the DataFrame
- Pass a different number to choose how many rows to return

# First 3 rows
titanic_df.head(3)

# Last 4 rows
titanic_df.tail(4)

In Jupyter Notebook, only the last expression of a cell is rendered automatically, so a bare .head() or .tail() shows up as a formatted table only when it is the last line
print(titanic_df.head()): prints plain text, which is harder to read
display(titanic_df.head()): renders the same formatted table from anywhere in the cell

# Plain-text output that is harder to read
print(titanic_df.head())

# Formatted table
display(titanic_df.tail())

A bare titanic_df.head() or .tail() must come after any display() call, as the last line of the cell, for both DataFrames to appear
Prefer display(titanic_df.head()) or display(titanic_df.tail()) whenever possible

# Both DataFrames are rendered

display(titanic_df.tail())
titanic_df.head()

# The more robust form: use display() for both

display(titanic_df.tail())
display(titanic_df.head())

display()

To show up to n rows of a DataFrame before the output is truncated:
- pd.set_option('display.max_rows', n)
To show up to n columns of a DataFrame before the output is truncated:
- pd.set_option('display.max_columns', n)
To raise the maximum displayed width of each column to n characters:
- pd.set_option('display.max_colwidth', n)

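shape

An attribute holding the number of rows and columns of the DataFrame
DataFrame.shape: returned as (rows, columns)

info

df.info()
Reports each column's name, data type, and non-null count, along with memory usage

describe

df.describe()
Reports the count, mean, standard deviation, min/max, and quartiles of the data
By default only numeric columns are summarized

value_counts(dropna=True)

Reports the distribution of individual values
Returns how many times each distinct value occurs in a Series (or DataFrame)
By default Null values are ignored in the result
Sorted from the most frequent value to the least frequent
dropna
- The dropna argument decides whether Null values are counted
- dropna defaults to True, in which case Null values are ignored when counting

value_counts() sorts in descending order of count by default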

value_counts = titanic_df["Pclass"].value_counts()
print(value_counts)

# 3    491
# 1    216
# 2    184
# Name: Pclass, dtype: int64

type(titanic_df['Pclass'])

# pandas.core.series.Series

print("titanic_df row count:", titanic_df.shape[0])
print("value_counts() with the default dropna=True")

# value_counts() defaults to dropna=True, so it is the same as value_counts(dropna=True)
print(titanic_df['Embarked'].value_counts())
print(titanic_df['Embarked'].value_counts(dropna=False))


# titanic_df row count: 891
# value_counts() with the default dropna=True
# S    644
# C    168
# Q     77
# Name: Embarked, dtype: int64
# S      644
# C      168
# Q       77
# NaN      2
# Name: Embarked, dtype: int64

value_counts() can also be applied to a DataFrame
- Each combination of column values is treated as one distinct value, and the number of occurrences of each combination is reported
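
To make the DataFrame case concrete, here is a minimal sketch on a made-up five-row table (the column names mirror the Titanic data, but the rows are invented). It assumes a pandas version that provides DataFrame.value_counts().

```python
import pandas as pd

# Hypothetical rows, only for illustration.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 3, 1],
    "Sex": ["male", "female", "male", "female", "female"],
})

# Each distinct (Pclass, Sex) pair is counted as one value.
combo_counts = df.value_counts()

print(combo_counts)
print(combo_counts.loc[(3, "male")])
```

The result is a Series whose index has one level per column, so a single combination is looked up with a tuple.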

reset_index(drop=False, inplace=False)

Calling reset_index() on a DataFrame or Series assigns a new consecutive integer index and adds the old index as a new column named 'index' (or under the index's own name, if it has one)
drop
- Defaults to False, in which case the old index is added as a new column
- With drop=True the old index is discarded instead of being added as a column
inplace
- inplace=False (default): the original object is left untouched and the result is returned as a new object
- inplace=True: the result is applied to the original object and None is returned

print("### before reset_index ###")
value_counts = titanic_df["Pclass"].value_counts()
print(value_counts)
print("value_counts type and shape:", type(value_counts), value_counts.shape)

new_value_counts_01 = value_counts.reset_index(inplace=False)
print("### After reset_index ###")
print(new_value_counts_01)
print(
    "new_value_counts_01 type and shape:",
    type(new_value_counts_01),
    new_value_counts_01.shape,
)


new_value_counts_02 = value_counts.reset_index(drop=True, inplace=False)
print("### After reset_index ###")
print(new_value_counts_02)
print(
    "new_value_counts_02 type and shape:",
    type(new_value_counts_02),
    new_value_counts_02.shape,
)


# Output below is from pandas 1.x; pandas 2.0 and later label the columns 'Pclass' and 'count'
# ### before reset_index ###
# 3    491
# 1    216
# 2    184
# Name: Pclass, dtype: int64
# value_counts type and shape: <class 'pandas.core.series.Series'> (3,)
# ### After reset_index ###
#    index  Pclass
# 0      3     491
# 1      1     216
# 2      2     184
# new_value_counts_01 type and shape: <class 'pandas.core.frame.DataFrame'> (3, 2)
# ### After reset_index ###
# 0    491
# 1    216
# 2    184
# Name: Pclass, dtype: int64
# new_value_counts_02 type and shape: <class 'pandas.core.series.Series'> (3,)

rename(columns=dict)

Given index as a dictionary, DataFrame.rename() maps 'old index label' : 'new index label'
Given columns as a dictionary, DataFrame.rename() maps 'old column name' : 'new column name'

melb_data = melb_data.rename(columns={'Price':'PriceLevel'})

# Given columns as a dictionary, rename() maps 'old column name' : 'new column name'
new_value_counts_01 = titanic_df["Pclass"].value_counts().reset_index()
print(new_value_counts_01)
new_value_counts_01.rename(
    index={0: 1, 1: 2, 2: 3}, columns={"index": "Pclass", "Pclass": "Pclass_count"}
)

#    index  Pclass
# 0      3     491
# 1      1     216
# 2      2     184

#    Pclass  Pclass_count
# 1       3           491
# 2       1           216
# 3       2           184
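
The column names produced by value_counts().reset_index() differ between pandas versions, so a rename() keyed on 'index' may silently do nothing. One possible way to sidestep that, sketched here on a made-up Series rather than the Titanic data, is to name both columns explicitly with rename_axis() and reset_index(name=...).

```python
import pandas as pd

# Hypothetical class labels, only for illustration.
pclass = pd.Series([3, 1, 3, 2, 3, 1], name="Pclass")

counts_df = (
    pclass.value_counts()
    .rename_axis("Pclass")             # name the index explicitly
    .reset_index(name="Pclass_count")  # name the counts column explicitly
)

print(counts_df)
```

Because neither name is left to the library default, the resulting columns are 'Pclass' and 'Pclass_count' regardless of how value_counts() labels its output.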

sort_values(by, ascending=True)

DataFrame.sort_values() takes the column(s) to sort by in the by argument and sorts the DataFrame by those values
Ascending order is the default (ascending=True); set ascending=False for descending order
A single column can be passed as a string; multiple columns must be passed as a list, by = ["column_name1", "column_name2", ...]
titanic_sorted = titanic_df.sort_values(by=['Name'], ascending=True)

# Sort by Name in descending order
titanic_sorted = titanic_df.sort_values(by=["Name"], ascending=False)
titanic_sorted.head(3)

# Sort by Pclass, then Name (ascending, the default)
titanic_sorted = titanic_df.sort_values(by=["Pclass", "Name"])
titanic_sorted.head(3)
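
When several sort columns need different directions, ascending also accepts a list with one flag per column. The sketch below uses a small made-up table to illustrate this.

```python
import pandas as pd

# Hypothetical rows, only for illustration.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Name": ["Braund", "Cumings", "Allen", "Futrelle"],
})

# Pclass ascending, then Name descending within each class.
sorted_df = df.sort_values(by=["Pclass", "Name"], ascending=[True, False])

print(sorted_df)
```

sort_values() returns a new DataFrame and keeps the original index labels, so df itself stays in its original order.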

3. Selection & Drop

DataFrame Indexing and Filtering

[ ]
- Provides column-based selection or Boolean indexing
- A single column name in [ ] returns the matching Series
- A single column name wrapped in a list returns a DataFrame
- Several column names in a list return a DataFrame with those columns

When selecting a single column and a DataFrame is needed, use double brackets [[ ]]

DataFrame['column_name']: returns a Series

DataFrame[['column_name']]: returns a DataFrame

DataFrame[int] is not commonly used, but it works when the column name itself is an integer

# A single column name inside [] returns a Series
series = titanic_df["Name"]
print(series.head(3))
print("## type:", type(series), "shape:", series.shape)

# A list of several column names inside [] returns a DataFrame made of those columns
filtered_df = titanic_df[["Name", "Age"]]
display(filtered_df.head())
print("## type:", type(filtered_df), "shape:", filtered_df.shape)

# A list with a single column name inside [] returns a one-column DataFrame
one_col_df = titanic_df[["Name"]]
display(one_col_df.head(3))
print("## type:", type(one_col_df), "shape:", one_col_df.shape)

.loc[ ]
- Label-based indexing
- .loc[ x(value, slicing, fancy_index), y(value, slicing, fancy_index) ]
- .loc[boolean_index, y(value, slicing, fancy_index)]
  - Label-based indexing selects by the names of the index and the columns
  - Columns are addressed by name such as 'column name' (rows are addressed by Index label)
  - Slicing with ':' in loc[ ] includes the end label

Supports Boolean indexing (the condition may involve columns other than the ones selected)

Order matters: the Boolean index comes first, followed by the column selector y

Using a Boolean index together with a comma inside [ ] requires .loc

.iloc[ ]
- .iloc[ x(value, slicing, fancy_index), y(value, slicing, fancy_index) ]
  - Position-based indexing selects by zero-based row and column positions
  - Does not accept a Boolean Series (a plain Boolean array such as mask.values is accepted)
  - Row and column selectors are therefore integers (Index labels are not used)

When slicing X_features or y_labels out of a DataFrame:

Always use .iloc[ ]: DataFrame.iloc[:, :-1] or DataFrame.iloc[:, -1]
~~DataFrame[:, :-1] or DataFrame[:, -1]~~
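
The feature/label split mentioned above might look like the following sketch. The table is made up, and it assumes the label sits in the last column, which is a convention rather than anything pandas enforces.

```python
import pandas as pd

# Hypothetical data; the label is assumed to be the last column.
df = pd.DataFrame({
    "age": [22, 38, 26],
    "fare": [7.25, 71.28, 7.92],
    "survived": [0, 1, 1],
})

X_features = df.iloc[:, :-1]  # every column except the last -> DataFrame
y_labels = df.iloc[:, -1]     # the last column only -> Series

print(X_features.columns.tolist())
print(y_labels.name, y_labels.tolist())
```

Position-based slicing keeps working even if the columns are renamed, which is why .iloc is the usual choice here.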

data = {
    "Name": ["Chulmin", "Eunkyung", "Jinwoong", "Soobeom"],
    "Year": [2011, 2016, 2015, 2015],
    "Gender": ["Male", "Female", "Male", "Male"],
}

data_df = pd.DataFrame(data, index=["one", "two", "three", "four"])
data_df

iloc[ ]

data_df.iloc[0, 0]

# 'Chulmin'

# The line below raises an error (iloc does not accept column labels)
data_df.iloc[0, "Name"]

# The line below raises an error (iloc does not accept index labels)
data_df.iloc["one", 0]

print("\n iloc[1, 0] value at row 2, column 1:", data_df.iloc[1, 0])
print("\n iloc[2, 1] value at row 3, column 2:", data_df.iloc[2, 1])

print("\n iloc[0:2, [0,1]] rows 1-2, columns 1 and 2:\n", data_df.iloc[0:2, [0, 1]])
print("\n iloc[0:2, 0:3] rows 1-2, columns 1 through 3:\n", data_df.iloc[0:2, 0:3])

print("\n All data [:] \n", data_df.iloc[:])
print("\n All data [:, :] \n", data_df.iloc[:, :])

print("\n Last column [:, -1] \n", data_df.iloc[:, -1])
print("\n All data except the last column [:, :-1] \n", data_df.iloc[:, :-1])

# iloc[] does not accept a Boolean Series, so the line below raises an error.
print("\n iloc[data_df.Year >= 2014] \n", data_df.iloc[data_df.Year >= 2014])

loc[ ]

data_df.loc["one", "Name"]

# 'Chulmin'

# The following line raises an error: the index has no label 0.
data_df.loc[0, 'Name']

# KeyError

# Slicing with loc[ ] includes the end label

print("Position-based iloc slicing\n", data_df.iloc[0:1, 0], "\n")
print("Label-based loc slicing\n", data_df.loc["one":"two", "Name"])

# Position-based iloc slicing
#  one    Chulmin
# Name: Name, dtype: object

# Label-based loc slicing
#  one     Chulmin
# two    Eunkyung
# Name: Name, dtype: object

print("Name of the row with index label three:", data_df.loc["three", "Name"])
print(
    "\nName and Year for index labels one through two:\n",
    data_df.loc["one":"two", ["Name", "Year"]],
)
print(
    "\nName through Gender for index labels one through three:\n",
    data_df.loc["one":"three", "Name":"Gender"],
)
print("\nAll data:\n", data_df.loc[:])
print("\nBoolean indexing:\n", data_df.loc[data_df.Year >= 2014])

# Name of the row with index label three: Jinwoong

# Name and Year for index labels one through two:
#           Name  Year
# one   Chulmin  2011
# two  Eunkyung  2016

# Name through Gender for index labels one through three:
#             Name  Year  Gender
# one     Chulmin  2011    Male
# two    Eunkyung  2016  Female
# three  Jinwoong  2015    Male

# All data:
#             Name  Year  Gender
# one     Chulmin  2011    Male
# two    Eunkyung  2016  Female
# three  Jinwoong  2015    Male
# four    Soobeom  2015    Male

# Boolean indexing:
#             Name  Year  Gender
# two    Eunkyung  2016  Female
# three  Jinwoong  2015    Male
# four    Soobeom  2015    Male


Boolean Indexing

Filters rows by a condition
No position-based or label-based indexing is needed: write the condition inside [ ] to filter
Several conditions can be combined
- Each condition must be wrapped in (): df[ (condition_1) & (condition_2) & (condition_3) ]
- Use & and | (and / or)

1. Series[Boolean index]: returns a Series containing only the entries where the condition is True

2. DF[Boolean index]: returns a DataFrame containing only the rows where the condition is True

3. DF.loc[Boolean index, y(value, slicing, fancy_index)]: use when combining a Boolean index with a column selector after a comma

titanic_df[titanic_df["Pclass"] == 3].head(3)

pd.set_option("display.max_colwidth", 200)
titanic_df = pd.read_csv("titanic_train.csv")
titanic_boolean = titanic_df[titanic_df["Age"] > 60]
print(type(titanic_boolean))
titanic_boolean

titanic_df[titanic_df['Age']>60][['Name', 'Age']].head(3)

titanic_df.loc[titanic_df["Age"] > 60, ["Name", "Age"]].head(3)

# Each condition must be wrapped in ()!
titanic_df[
    (titanic_df["Age"] > 60)
    & (titanic_df["Pclass"] == 1)
    & (titanic_df["Sex"] == "female")
]

# Keeping the conditions in separate variables is often more readable
cond1 = titanic_df["Age"] > 60
cond2 = titanic_df["Pclass"] == 1
cond3 = titanic_df["Sex"] == "female"
titanic_df[cond1 & cond2 & cond3]
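
Beyond &, the same masks can be combined with | (or) and negated with ~ (not). The following is a minimal sketch on made-up rows; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical passengers, only for illustration.
df = pd.DataFrame({
    "Age": [70, 25, 65, 4],
    "Pclass": [1, 3, 3, 2],
    "Sex": ["female", "male", "male", "female"],
})

is_senior = df["Age"] > 60
is_first_class = df["Pclass"] == 1

either = df[is_senior | is_first_class]  # OR: at least one condition holds
not_senior = df[~is_senior]              # NOT: invert the mask
# Boolean mask first, then the column selector, inside .loc
senior_not_first = df.loc[is_senior & ~is_first_class, ["Age", "Sex"]]

print(either.index.tolist(), not_senior.index.tolist())
print(senior_not_first)
```

Python's and / or / not keywords do not work element-wise on a Series, which is why the operator forms are needed.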

# Use a Boolean index with .loc to update matching rows

titanic = pd.read_csv("titanic_train.csv")
titanic.loc[(titanic['Sex'] == 'male'), 'Sex'] = 0
titanic.loc[(titanic['Sex'] == 'female'), 'Sex'] = 1

titanic['Sex'].value_counts()


# 0    577
# 1    314
# Name: Sex, dtype: int64

Selection

DataFrame -> DataFrame / value
- df[ ['column_name1', 'column_name2'] ][:2]: DataFrame
- df.loc[0, 'PassengerId']: single value
- df.iloc[0, 1]: single value

DataFrame -> Series
- df['column_name']
- df.column_name

DataFrame -> row
- df.loc['index_name']: index location, selects by label (name)
- df.iloc[index_number(int)]: index position, selects by order
- ~~df[number(int)]: same as df.iloc[]~~ Avoid this; df[int] looks up a column name, so it is easily confused

print('An integer inside [ ] raises a KeyError:\n', titanic_df[0])

# KeyError

titanic_df[0:2]

# Runs normally (a slice inside [ ] selects rows by position)

Series -> Series / value
- series.loc['index_name']: index location, selects by label (name)
- series.iloc[index_number]: index position, selects by order
- series[index]: returns the value

Basic loc, iloc selection
- Column name & index position: df[["name", "street"]][:2]
- Column position & index position: df.iloc[:2, :2]
- Column name & index name: df.loc[[211829, 320563], ['name', 'street']]
- Boolean index w/ condition

Add & Drop

Changing columns
- Modify
  - Reassign with df.columns = [modify_list]
- Add
  - df['new_column'] = a value, a list, a condition, or an expression built from df['existing_column']
  - df['new_column'] = df['existing_column'].apply(function)
  - Redefine with pd.DataFrame(dict, columns=[ ])
  - df.insert(loc, column, value, allow_duplicates=False): inserts a column at a specific position in the DataFrame
    - loc : position of the new column (integer, starting at 0)
    - column : name of the new column
    - value : values of the new column
    - allow_duplicates : {True or False}, defaults to False; True allows inserting a column whose name already exists
- Delete
  - df.drop('column_name', axis=1, inplace= )
  - del df['column_name']: modifies the original DataFrame in place

titanic_df["Age_0"] = 0
titanic_df.head(3)

titanic_df["Age_by_10"] = titanic_df["Age"] + 10
titanic_df["Family_No"] = titanic_df["SibSp"] + titanic_df["Parch"] + 1
titanic_df.head(3)
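
Plain assignment always appends the new column at the end. When the position matters, insert() from the list above can be used instead; the sketch below shows both on a made-up two-row table.

```python
import pandas as pd

# Hypothetical rows, only for illustration.
df = pd.DataFrame({"Name": ["Chulmin", "Eunkyung"], "Year": [2011, 2016]})

# Assignment appends the derived column at the end.
df["Year_plus_1"] = df["Year"] + 1

# insert() places the column at position 1 and modifies df in place.
df.insert(1, "Gender", ["Male", "Female"])

print(df.columns.tolist())
```

insert() returns None, so its result should not be assigned back to df.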

Changing the index
- Modify
  - Reassign with df.index = [modify_list]
- Delete
  - df.drop('index_name', axis=0, inplace= )
  - df.drop('index_name'): with axis omitted, rows are dropped by index label
  - df.drop([0,1,2,3]): drop several rows by passing a list of index labels

drop()

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
axis: axis=0 drops rows, axis=1 drops columns (axis=1 is the more common use)
inplace
- To keep the original DataFrame and receive the dropped result as a new object, use inplace=False (the default)
  - titanic_df.drop('Age_0', axis=1, inplace=False)
  - titanic_df.drop([0,1,2], axis=0, inplace=False)
- To apply the drop to the original DataFrame, use inplace=True
  - titanic_df.drop('Age_0', axis=1, inplace=True)
  - titanic_df.drop([0,1,2], axis=0, inplace=True)
- Assigning the result back to the same variable (with inplace=False) has the same effect as inplace=True (the old DataFrame object is later freed from memory)
  - titanic_df = titanic_df.drop('Age_0', axis=1, inplace=False)
  - Do not write titanic_df = titanic_df.drop('Age_0', axis=1, inplace=True): the call returns None, so titanic_df becomes None

drop_duplicates()

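drop_duplicates() removes rows whose contents are duplicated.

df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

subset : the columns checked for duplicates. All columns are checked by default.
keep : {'first', 'last', False} which duplicate to keep. 'first' keeps the first occurrence, 'last' keeps the last, and False drops every duplicated row.
inplace : whether to modify the original.
ignore_index : whether to discard the original index. If True, the result is relabeled 0, 1, 2, ..., n-1.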
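
The effect of subset, keep, and ignore_index is easier to see on a tiny example. The sketch below uses a made-up four-row table.

```python
import pandas as pd

# Hypothetical rows: row 1 repeats row 0 exactly.
df = pd.DataFrame({
    "name": ["a", "a", "b", "b"],
    "score": [1, 1, 2, 3],
})

exact = df.drop_duplicates()                         # compare every column
by_name_first = df.drop_duplicates(subset=["name"])  # first row per name
by_name_last = df.drop_duplicates(
    subset=["name"], keep="last", ignore_index=True  # last row per name, new index
)

print(exact)
print(by_name_first)
print(by_name_last)
```

Without ignore_index=True the surviving rows keep their original index labels, which is visible in the first two results.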

titanic_drop_df = titanic_df.drop("Age_0", axis=1)
titanic_drop_df.head(3)

titanic_df.head(3)

# To drop several columns, pass them to drop() as a list
# With inplace=True the drop is applied to the calling DataFrame and the return value is None

drop_result = titanic_df.drop(["Age_0", "Age_by_10", "Family_No"], axis=1, inplace=True)
print("Value returned by drop with inplace=True:", drop_result)
titanic_df.head(3)

# Value returned by drop with inplace=True: None

pd.set_option("display.width", 1000)
pd.set_option("display.max_colwidth", 15)
print("#### before axis 0 drop ####")
print(titanic_df.head(6))

titanic_df.drop([0, 1, 2], axis=0, inplace=True)
print("#### after axis 0 drop ####")
print(titanic_df.head(3))

titanic_df = titanic_df.drop("Fare", axis=1, inplace=False)
titanic_df

4. DataFrame Operations

pd.concat

pd.concat([df1, df2, df3, ...], axis=0, ignore_index=False, join='outer')

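df1, df2, ... must be pandas objects (DataFrame or Series), and two or more can be passed at once in a list
axis=0: row-bind, axis=1: column-bind
The index is not reset (ignore_index=False by default)
Outer join by default (everything is kept whether or not labels overlap)
- All data is concatenated even when index or column labels do not match (non-overlapping parts are filled with NaN)
- With join='inner', only the index or column labels shared by all inputs are kept

Convenient when data has to be combined as a union

Especially good when the datasets share no key values and simply need to be joined by row-bind or column-bind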
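
The concat options above can be compared side by side with a short sketch. The two DataFrames are made up and share only the column "b".

```python
import pandas as pd

# Hypothetical inputs that overlap on column "b" only.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

outer_rows = pd.concat([df1, df2], axis=0)                     # union of columns
inner_rows = pd.concat([df1, df2], axis=0, join="inner")       # shared columns only
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)  # fresh 0..n-1 index
side_by_side = pd.concat([df1, df2], axis=1)                   # column-bind on the index

print(outer_rows)
print(inner_rows)
print(side_by_side)
```

In the default row-bind, the index values 0 and 1 appear twice because the index is not reset; ignore_index=True is what renumbers the rows.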

pd.merge

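pd.merge(df1, df2, on='shared_column', how='inner')

Use when the key column has the same name in df1 and df2

df1.merge(df2, left_on='key column_name in df1', right_on='key column_name in df2', how='inner')

Use when the key columns mean the same thing in df1 and df2 but have different names

how='inner' is the default; left, right, inner, and outer are all accepted
- Respectively: join keyed on the left table, join keyed on the right table, intersection, union
merge joins exactly two DataFrames at a time

merge is convenient for joining on a specific shared column and bringing the remaining columns along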
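
The difference between the how options shows up in the row counts. The sketch below uses two made-up tables whose key columns have different names, so left_on and right_on are needed; all table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables: customer 2 has no orders, and order customer 4 is unknown.
customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Kim", "Lee", "Park"]})
orders = pd.DataFrame({"customer": [1, 1, 3, 4], "amount": [10, 20, 30, 40]})

inner = customers.merge(orders, left_on="cust_id", right_on="customer", how="inner")
left = customers.merge(orders, left_on="cust_id", right_on="customer", how="left")
outer = customers.merge(orders, left_on="cust_id", right_on="customer", how="outer")

print(len(inner), len(left), len(outer))
print(left)
```

The inner join keeps only matched keys, the left join keeps every customer (with NaN for the one without orders), and the outer join additionally keeps the unmatched order.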

Converting between DataFrame and list, dictionary, ndarray

list -> DataFrame
- df_list = pd.DataFrame(list, columns=col_name1)
- Pass the list object and the matching column names to the DataFrame constructor
- A 1D list becomes the values of a single column
ndarray -> DataFrame
- df_array2 = pd.DataFrame(array2, columns=col_name2)
- Pass the ndarray and the matching column names to the DataFrame constructor
- A 1D ndarray becomes the values of a single column
Always pass columns as a list: the parameter is the plural columns, so even a single name goes in a list
A 1D list or ndarray passed with columns=[column_name] produces a one-column DataFrame rather than a Series

import numpy as np
import pandas as pd

col_name1 = ['col1']
list1 = [1,2,3]
array1 = np.array(list1)

print('array1 shape', array1.shape)
df_list1 = pd.DataFrame(list1, columns=col_name1)
print('DataFrame from a 1D list: \n', df_list1)
df_array1 = pd.DataFrame(array1, columns=col_name1)
print('DataFrame from a 1D ndarray:\n', df_array1)

# array1 shape (3,)
# DataFrame from a 1D list: 
#     col1
# 0     1
# 1     2
# 2     3
# DataFrame from a 1D ndarray:
#     col1
# 0     1
# 1     2
# 2     3
# Three column names are needed
col_name2 = ["col1", "col2", "col3"]

# Build a 2-row x 3-column list and ndarray, then convert each to a DataFrame
list2 = [[1, 2, 3], [11, 12, 13]]
array2 = np.array(list2)
print("array2 shape", array2.shape)

df_list2 = pd.DataFrame(list2, columns=col_name2)
print("DataFrame from a 2D list: \n", df_list2)

df_array2 = pd.DataFrame(array2, columns=col_name2)
print("DataFrame from a 2D ndarray: \n", df_array2)


# array2 shape (2, 3)
# DataFrame from a 2D list:
#     col1  col2  col3
# 0     1     2     3
# 1    11    12    13
# DataFrame from a 2D ndarray:
#     col1  col2  col3
# 0     1     2     3
# 1    11    12    13

dictionary -> DataFrame
- dict1 = {'col1' : [1,11], 'col2' : [2,22], 'col3' : [3,33]}
- df_dict = pd.DataFrame(dict1)
- The dictionary keys become the column names and the values are given as lists

# Keys map to column names; values are lists (or ndarrays)
dict1 = {"col1": [1, 11], "col2": [2, 22], "col3": [3, 33]}
df_dict = pd.DataFrame(dict1)
print("DataFrame from a dictionary: \n", df_dict)

# DataFrame from a dictionary: 
#     col1  col2  col3
# 0     1     2     3
# 1    11    22    33

DataFrame -> ndarray
- Use the values attribute of the DataFrame to convert it to an ndarray

# Convert the DataFrame to an ndarray
array3 = df_dict.values
print("df_dict.values type:", type(array3), "df_dict.values shape:", array3.shape)
print(array3)

# df_dict.values type: <class 'numpy.ndarray'> df_dict.values shape: (2, 3)
# [[ 1  2  3]
#  [11 22 33]]

DataFrame -> list
- Convert to an ndarray with the values attribute first, then call tolist()
DataFrame -> dictionary
- Use to_dict('list') on the DataFrame
- Passing 'list' to to_dict() makes each dictionary value a list [ ] instead of a nested dictionary { }

# Convert the DataFrame to a list
list3 = df_dict.values.tolist()
print('df_dict.values.tolist() type:', type(list3))
print(list3)

# Convert the DataFrame to a dictionary
dict3 = df_dict.to_dict('list')
print('\n df_dict.to_dict() type:', type(dict3))
print(dict3)

# df_dict.values.tolist() type: <class 'list'>
# [[1, 2, 3], [11, 22, 33]]

#  df_dict.to_dict() type: <class 'dict'>
# {'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}

Aggregation - running aggregate operations on a DataFrame

sum(), max(), min(), count(), mean() and similar methods perform aggregation on a Series, or on a DataFrame after selecting columns
Calling an aggregation directly on a DataFrame applies it to every column

Prefer the method form DataFrame.aggregation() over the function form np.aggregation(DataFrame)

# To get the number of rows in a DataFrame, prefer shape over count() (count() excludes Null values per column)
titanic_df.count()

# PassengerId    891
# Survived       891
# Pclass         891
# Name           891
# Sex            891
# Age            714
# SibSp          891
# Parch          891
# Ticket         891
# Fare           891
# Cabin          204
# Embarked       889
# dtype: int64

titanic_df.shape

# (891, 12)

titanic_df[['Age', 'Fare']].mean()

titanic_df[['Age', 'Fare']].sum()

titanic_df[['Age', 'Fare']].count()

Groupby

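DataFrame provides the groupby() method for group-by operations
groupby() takes the column name(s) to group by in the by argument and returns a DataFrameGroupBy object
Select columns on the returned DataFrameGroupBy object, then apply an aggregation function
groupby (DataFrameGroupBy) -> column selection (DataFrameGroupBy) -> aggregation (DataFrame)

Passing several columns to groupby() produces a hierarchical (multi-level) index, with the groups branching out like a tree

groupby().reset_index(): flattens that hierarchy into regular columns, filling the blank cells with the repeated key values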
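
The multi-column case and the effect of reset_index() can be seen in a short sketch. The five rows below are made up; only the column names follow the Titanic data.

```python
import pandas as pd

# Hypothetical rows, only for illustration.
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "male", "male", "female"],
    "Fare": [80.0, 50.0, 8.0, 10.0, 9.0],
})

grouped = df.groupby(["Pclass", "Sex"])["Fare"].mean()  # Series with a two-level index
flat = grouped.reset_index()                             # keys become regular columns

print(grouped)
print(flat)
```

In the grouped output the Pclass value is printed once per branch, while the flattened DataFrame repeats it on every row.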

titanic_groupby = titanic_df.groupby(by=["Pclass"])
print(type(titanic_groupby))
print(type(titanic_df))
print(type(titanic_groupby.head()))


# <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
# <class 'pandas.core.frame.DataFrame'>
# <class 'pandas.core.frame.DataFrame'>

# Passing a DataFrameGroupBy to display() prints only the object type and its memory address

display(titanic_groupby)
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa0b1fd1430>

display(titanic_df)

titanic_groupby[["Age", "Fare"]]

# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc768477970>

titanic_groupby[["Age", "Fare"]].count()

Applying different aggregations to the same column would otherwise require calling each aggregation method separately
That gets tedious as the number of aggregations grows, so use agg() on the DataFrameGroupBy

titanic_df.groupby(["Pclass"])["Age"].max(), titanic_df.groupby(["Pclass"])["Age"].min()

# Prefer the form below over the one above

titanic_df.groupby(["Pclass"])["Age"].agg(["max", "min"])

To apply different aggregations to different columns, skip the column filtering and pass agg() a dict mapping each column to its method
The aggregation method can be given as a 'string' in the dictionary, or as a NumPy function such as np.max

agg_format = {"Age": "max", "SibSp": "sum", "Fare": "mean"}
titanic_df.groupby(["Pclass"]).agg(agg_format)

If the dict passed to agg has the same key twice, the later value overwrites the earlier one
So a dict cannot express several different aggregations on the same column alongside other column aggregations

agg_format = {"Age": "max", "Age": "mean", "Fare": "mean"}
titanic_df.groupby(["Pclass"]).agg(agg_format)

Applying Named Aggregation
Instead of a dict, pass agg keyword arguments of the form new_column_name=('column', 'aggregation_method'), separated by ','
pd.NamedAgg(column, aggfunc) can be used in place of each tuple

titanic_df.groupby(["Pclass"]).agg(
    age_max=("Age", "max"), age_mean=("Age", "mean"), fare_mean=("Fare", "mean")
)

titanic_df.groupby(["Pclass"]).agg(
    age_max=("Age", np.max), age_mean=("Age", np.mean), fare_mean=("Fare", np.mean)
)

titanic_df.groupby("Pclass").agg(
    age_max=pd.NamedAgg(column="Age", aggfunc="max"),
    age_mean=pd.NamedAgg(column="Age", aggfunc="mean"),
    fare_mean=pd.NamedAgg(column="Fare", aggfunc="mean"),
)

5. lambda, map, apply

lambda function

A way to write an anonymous function in a single line
lambda argument: expression

def get_square(a):
    return a**2

print('3 squared is:',get_square(3))

lambda_square = lambda x : x ** 2
print('3 squared is:',lambda_square(3))

a=[1,2,3]
squares = map(lambda x : x**2, a)
list(squares)

map()

Takes a function and a sequence, applies the function to each element, and returns an iterator over the results (wrap it in list() to get a list)
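
Because map() is lazy in Python 3, its result can be consumed only once. A minimal sketch with illustrative variable names:

```python
numbers = [1, 2, 3]

squares_iter = map(lambda x: x ** 2, numbers)  # lazy iterator, nothing computed yet
squares = list(squares_iter)                   # materialize the results
leftover = list(squares_iter)                  # the iterator is already exhausted

print(squares)
print(leftover)
```

If the values are needed more than once, keep the list rather than the map object.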

apply(function)

Pandas combines apply() with lambda expressions to transform the data of a DataFrame or Series record by record
Column-wise (vectorized) operations are faster in Pandas, but when the transformation is too complex for them, apply with a lambda is the fallback

titanic_df["Child_Adult"] = titanic_df["Age"].apply(
    lambda x: "Child" if x <= 15 else "Adult"
)
titanic_df[["Child_Adult", "Age"]].head(10)

titanic_df["Age_cat"] = titanic_df["Age"].apply(
    lambda x: "Child" if x <= 15 else ("Adult" if x <= 60 else "Elderly")
)

titanic_df["Age_cat"].value_counts()

# Adult      786
# Child       83
# Elderly     22
# Name: Age_cat, dtype: int64

When the logic gets more complex
- Define a separate function and pass it directly as the argument to apply()
- Or call the function in the expression part of apply(lambda argument: expression)

# Function that classifies age into finer-grained categories.
def get_category(age):
    cat = ""
    if age <= 5:
        cat = "Baby"
    elif age <= 12:
        cat = "Child"
    elif age <= 18:
        cat = "Teenager"
    elif age <= 25:
        cat = "Student"
    elif age <= 35:
        cat = "Young Adult"
    elif age <= 60:
        cat = "Adult"
    else:
        cat = "Elderly"

    return cat


# The lambda returns the result of the get_category() function defined above.
# get_category(x) takes an 'Age' value and returns the matching category
titanic_df["Age_cat"] = titanic_df["Age"].apply(lambda x: get_category(x))
titanic_df[["Age", "Age_cat"]].head()

# 	Age	Age_cat
# 0	22.0	Student
# 1	38.0	Adult
# 2	26.0	Young Adult
# 3	35.0	Young Adult
# 4	35.0	Young Adult
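
For range-based categories like these, a vectorized alternative to apply() is pd.cut. This is a swapped-in technique, not the approach the lecture uses; the sketch below assumes the same age boundaries as get_category() and runs on made-up ages.

```python
import pandas as pd

# Hypothetical ages, only for illustration.
ages = pd.Series([3, 12, 18, 22, 35, 61], name="Age")

# Bin edges mirror the thresholds in get_category().
bins = [-float("inf"), 5, 12, 18, 25, 35, 60, float("inf")]
labels = ["Baby", "Child", "Teenager", "Student", "Young Adult", "Adult", "Elderly"]

# right=True (the default) makes each bin include its upper edge, matching "<=".
age_cat = pd.cut(ages, bins=bins, labels=labels)

print(age_cat.tolist())
```

One behavioral difference: pd.cut leaves a missing age as NaN, whereas get_category() sends NaN to "Elderly" because every comparison with NaN is False.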

applymap(function)

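Applies a function element by element to a DataFrame rather than one Series at a time
The effect is the same as calling apply on each Series (column) individually
In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map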

6. Pandas Built-in Functions

nunique()

Returns the number of distinct values in a column

titanic_df["Pclass"].value_counts()

# 3    491
# 1    216
# 2    184
# Name: Pclass, dtype: int64

print(titanic_df["Pclass"].nunique())
print(titanic_df["Survived"].nunique())
print(titanic_df["Name"].nunique())

# 3
# 2
# 891

replace()

Replaces existing values with specified values: with a dict, each key is the value before and each value is the value after
Applied to a specific column
- Column.replace(old value, column.aggregation_method)
Applied to the whole DataFrame: replace is applied to each column
- DataFrame[column_filtering].replace(old value, DataFrame[column_filtering].aggregation_method)
- Calling an aggregation directly on a DataFrame applies it to every column

# Replace 'male' in Sex with 'Man'

replace_test_df = pd.read_csv("titanic_train.csv")
replace_test_df['Sex'].replace('male', 'Man')

replace_test_df["Sex"] = replace_test_df["Sex"].replace(
    {"male": "Man", "female": "Woman"}
)

replace_test_df['Cabin'] = replace_test_df['Cabin'].replace(np.nan, 'COO1')

replace_test_df["Cabin"].value_counts(dropna=False)

# COO1           687
# C23 C25 C27      4
# G6               4
# B96 B98          4
# C22 C26          3
#               ...
# E34              1
# C7               1
# C54              1
# E36              1
# C148             1
# Name: Cabin, Length: 148, dtype: int64

replace_test_df["Sex"].value_counts()

# Man      577
# Woman    314
# Name: Sex, dtype: int64

7. Handling Missing Data

isna()

DataFrame.isna() returns True/False for every value, indicating whether it is NaN (True if NaN)

# Return True/False for every cell of the DataFrame
titanic_df.isna().head(3)

Counting NaN (Null) values: call sum() on the result of isna()

# Count NaN values per column:
titanic_df.isna().sum()

fillna()

Replaces missing data with the value given as the argument

titanic_df["Cabin"] = titanic_df["Cabin"].fillna("0000")
titanic_df.head(3)

The argument to fillna() does not have to be a plain 'string'; an aggregation result and other values work as well

titanic_df["Age"] = titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")
titanic_df.isna().sum()
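
The three fillna() calls above can also be written as one call by passing a dict keyed by column name. The sketch below shows this on a made-up three-row table; the fill values follow the ones used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing values in every column.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Cabin": ["C85", None, None],
    "Embarked": ["S", "C", None],
})

missing_before = df.isna().sum()

# One fill value per column; df itself is left unchanged.
filled = df.fillna({"Age": df["Age"].mean(), "Cabin": "0000", "Embarked": "S"})

print(missing_before)
print(filled)
```

fillna() returns a new DataFrame here, so the original still contains its missing values until the result is assigned back.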

8. Pandas Summary

summary

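Use Pandas for handling two-dimensional data
Pandas offers a very convenient and varied data-processing API (joins, pivot/unpivot, SQL-like APIs, and more), but learning all of it takes a great deal of time and effort
Focus on the core points covered so far, work through real data processing, and look up the relevant Pandas API whenever a problem comes up; that is how Pandas skills keep improving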
