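Predicting Diamond Prices

A few weeks ago I did an assignment...

It was mostly an EDA of the diamonds data plus a few things I wanted to try...

2023.03.05 - [Likelion AI School] - Assignment 3, Advanced?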

helpming.tistory.com

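Of course, that post only covers the things I wanted to try...

Anyway, since the EDA was done, I also wanted to try predicting prices...

so I just did it!

Let's predict diamond prices!

Load the libraries!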

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Load the dataset and take a quick look...

diamonds = sns.load_dataset('diamonds')
diamonds

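The data has these columns: carat (weight), cut (cut grade), color (color grade), clarity, depth (total depth as a percentage, 100 * 2z/(x+y)),

table (width of the flat top of the diamond relative to its widest point, as a percentage),

price, x (length), y (width), and z (depth)!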
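The depth formula above can be checked with a small sketch. The helper name `depth_percent` and the two rows of measurements are made up for illustration; they are not taken from the dataset.

```python
import pandas as pd

def depth_percent(x, y, z):
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)

# Made-up measurements in millimetres, only to exercise the formula.
sample = pd.DataFrame({"x": [4.0, 6.0], "y": [4.0, 6.2], "z": [2.4, 3.8]})
sample["depth_calc"] = depth_percent(sample["x"], sample["y"], sample["z"])
print(sample)
```

On the real data, comparing a column computed this way with the stored depth column should give close but not identical numbers, since the stored values are rounded.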

Looking at info...

diamonds.info()

Only price is an integer; cut, color, and clarity are category; the rest are floats!

First, the descriptive statistics for the categorical columns...

diamonds.describe(include='category')

The three columns have 5, 7, and 8 classes, so I decide to go with one-hot encoding...

Now the histograms of the numeric variables...

diamonds.hist(bins=100, figsize=(15, 10));

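The other columns look reasonable..?

The price we want to predict is piled up at the low end (right-skewed), so let's consider log-transforming the target...

Apply log1p to price and store the result in a new price_log1p column...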

diamonds['price_log1p'] = np.log1p(diamonds['price'])

Draw the histogram...

diamonds['price_log1p'].hist(bins=100);

It's at least somewhat closer to a normal distribution!
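Here is a rough sketch of what log1p does to a right-skewed column. It uses synthetic lognormal "prices" (an assumption for illustration, not the diamonds data) and compares the skewness before and after. It also shows that np.expm1 undoes np.log1p, which is how a prediction on the log scale can be turned back into a price.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed prices (NOT the diamonds data).
price = pd.Series(rng.lognormal(mean=7.8, sigma=1.0, size=10_000)).round()
price_log1p = np.log1p(price)

skew_raw = price.skew()        # strongly positive: long right tail
skew_log = price_log1p.skew()  # much closer to 0 after the transform

# expm1 is the inverse of log1p, so the original values come back.
price_back = np.expm1(price_log1p)

print(f"skew before: {skew_raw:.2f}, after: {skew_log:.2f}")
```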

Now let's split the target and the features...

The log-transformed target...

label_name = 'price_log1p'
label_name

Build the feature names!

feature_names = diamonds.columns.tolist()
feature_names.remove(label_name)
feature_names.remove('price')
feature_names

These are the columns used for prediction...

Build X_raw (X before encoding) and y!

X_raw = diamonds[feature_names]
y = diamonds[label_name]

X_raw.shape, y.shape

X_raw has 9 columns, and y is a Series!

Now hit it with train_test_split!

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_raw, y, test_size=0.2, random_state=42)

This is a regression model, so the target is continuous and can't be passed as stratify=y!
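If a stratified split is still wanted for a continuous target, one common workaround is to stratify on binned values instead. This is only a sketch of that idea on synthetic data, not something the post does; the ten quantile bins and the dummy feature are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic continuous target and a dummy feature (NOT the diamonds data).
y = pd.Series(np.log1p(rng.lognormal(mean=7.8, sigma=1.0, size=1000)))
X = pd.DataFrame({"feature": rng.normal(size=1000)})

# Cut the target into 10 quantile bins and stratify on the bin labels.
y_bins = pd.qcut(y, q=10, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y_bins
)

# Every bin should be represented in the test set.
print(y_bins.loc[y_test.index].value_counts().sort_index())
```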

Print the shapes too...

X_train.shape, X_test.shape, y_train.shape, y_test.shape

Split nicely!

Now for the encoding; pull out just the three category columns we saw earlier!

col_ohe = X_train.select_dtypes(include='category').columns
col_ohe

As we saw, they have 5, 7, and 8 classes respectively... so one-hot encode them!

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')

X_train_ohe = ohe.fit_transform(X_train[col_ohe])
X_test_ohe = ohe.transform(X_test[col_ohe])
X_train_ohe.shape, X_test_ohe.shape

5+7+8 = 20, so good!

Turn them into DataFrames!

df_train_ohe = pd.DataFrame(X_train_ohe.toarray(), columns=ohe.get_feature_names_out())
df_train_ohe.index = X_train.index

df_test_ohe = pd.DataFrame(X_test_ohe.toarray(), columns=ohe.get_feature_names_out())
df_test_ohe.index= X_test.index

display(df_train_ohe)
display(df_test_ohe)

Good!

Concatenate the DataFrame of the 6 numeric variables with the encoded DataFrame!

X_train_num = X_train.select_dtypes(exclude='category')
X_train_enc = pd.concat([X_train_num, df_train_ohe], axis=1)

X_test_num = X_test.select_dtypes(exclude='category')
X_test_enc = pd.concat([X_test_num, df_test_ohe], axis=1)

print(X_train_enc.shape, X_test_enc.shape)

6+20=26, so good!

This time let's use GradientBoostingRegressor!

Create the model

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(random_state=42)
model

Good!

We're going to use random search, so let's throw in plenty of parameter values!

parameters = {'n_estimators':(200, 300, 400, 500, 600),
              'learning_rate':(0.05, 0.15, 0.2, 0.25, 0.3),
              'subsample':(0.9, 0.95, 1.0)}
parameters

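Run the random search!

There are 5*5*3 = 75 possible combinations...

With n_iter=30, only 30 of them are sampled at random!

The target is already log-transformed, so...

I set scoring='neg_root_mean_squared_error', which amounts to RMSLE on the original price!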

from sklearn.model_selection import RandomizedSearchCV
reg = RandomizedSearchCV(model, parameters, n_jobs=-1, n_iter=30, cv=5, verbose=2, scoring='neg_root_mean_squared_error')
reg.fit(X_train_enc, y_train)

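30 randomly chosen parameter combinations times 5 cross-validation folds (cv=5)... that's 150 fits in total, plus one final refit of the best combination on the whole training set!

It takes forever...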
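The counting above, spelled out as a small sketch. The extra fit comes from refit=True, which is the default for RandomizedSearchCV.

```python
from math import prod

parameters = {'n_estimators': (200, 300, 400, 500, 600),
              'learning_rate': (0.05, 0.15, 0.2, 0.25, 0.3),
              'subsample': (0.9, 0.95, 1.0)}

# Size of the full grid: one choice per parameter.
n_combinations = prod(len(values) for values in parameters.values())

n_iter, cv = 30, 5
n_search_fits = n_iter * cv       # fits done during cross-validation
n_total_fits = n_search_fits + 1  # plus the final refit on all training rows

print(n_combinations, n_search_fits, n_total_fits)
```

Since no random_state is passed to the search itself, a rerun may sample a different set of 30 combinations.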

First, a look at the randomly sampled models...

pd.DataFrame(reg.cv_results_).sort_values('rank_test_score')

There are 30 of them...

So which one is the best model?

best_model = reg.best_estimator_
best_model

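The best model has learning_rate=0.3, n_estimators=500, and subsample=1 (left out of the output because it's the default)!

Use .predict with the best model on the training data to get y_train_predict...

y_train_predict = reg.predict(X_train_enc)

With RMSLE as the metric (RMSE computed on the log1p target)...

rmsle = ((y_train - y_train_predict) ** 2).mean() ** 0.5
rmsle

This is the error on the same data the model was trained on, so it's an optimistic guess at how accurate the model is... (the cross-validated score is in reg.best_score_, as a negated RMSE)

Now apply .predict to X_test_enc to produce the answer sheet!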

y_predict = reg.predict(X_test_enc)

That gives us y_predict...

Finally, compare it with y_test and look at the RMSLE...

result_rmsle = ((y_test- y_predict) ** 2).mean() ** 0.5
result_rmsle

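The test error comes out higher than the training error, meaning the model is less accurate on unseen data than the training number suggested!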
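Why the numbers above count as RMSLE: RMSE computed on log1p values is the same thing as RMSLE computed on the price scale. A small sketch with made-up prices and made-up prediction errors (the helper names rmse and rmsle are mine), which also shows np.expm1 turning log-scale predictions back into prices:

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

def rmsle(price_true, price_pred):
    """Root mean squared log error, using log1p on both sides."""
    return rmse(np.log1p(price_true), np.log1p(price_pred))

# Made-up prices and made-up errors on the log scale.
price_true = np.array([326.0, 2757.0, 18823.0])
log_pred = np.log1p(price_true) + np.array([0.05, -0.10, 0.02])

# Convert the log-scale predictions back to prices.
price_pred = np.expm1(log_pred)

score_log_scale = rmse(np.log1p(price_true), log_pred)
score_price_scale = rmsle(price_true, price_pred)
print(score_log_scale, score_price_scale)  # the two scores agree
```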

Checking the model's feature importances...

fi = pd.Series(best_model.feature_importances_)
fi.index = best_model.feature_names_in_
fi.sort_values().plot.barh(figsize=(15, 15));

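The y column (width) dominates, followed by carat...

diamonds.select_dtypes('number').corr()

In the correlations (numeric columns only, since cut, color, and clarity are categories), the column most correlated with price is carat...

and for the log-transformed price_log1p, it's x...

so it was surprising? to get feature importances like this!!!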
