kaggle - House Prices - Advanced Regression Techniques (Top 6%)


The House Prices - Advanced Regression Techniques competition on kaggle, like Bike Sharing Demand and the titanic data, is one of the first kaggle competitions that people new to machine learning start with. This time I want to practice regression with House Prices - Advanced Regression Techniques.

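Gathering information about the problem
1. Problem definition
2. Understanding the data
EDA with House Prices - Advanced Regression Techniques
1. Common code
2. Analysis
  1. Basic information about House Prices - Advanced Regression Techniques (structure)
  2. Visualization
  3. Data cleaning
  4. Feature Engineering
Model training
1. Lasso
2. Additional data cleaning
3. CatBoost
Conclusion
1. EDA and key insights
2. Model performance comparison
3. Performance gains from log transformation
4. Limitations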

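1. Gathering information about the problem

1) Problem definition

The House Prices - Advanced Regression Techniques dataset poses the problem of predicting the sale prices of residential homes in Ames, Iowa, USA. In South Korea, when people picture a home of their own, they tend to prefer a newly built house near the capital area, with convenient transportation, close to work, and with plenty of shops nearby. Do people in the United States share this Korean point of view? The dataset has 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa. Of course, so many factors influence house prices that prediction is hard even with far more complex data. Still, for anyone with some experience in R or Python and the basics of machine learning, this competition is good practice.

2) Understanding the data

The dataset provided by kaggle consists of four files: train.csv, test.csv, data_description.txt, and sample_submission.csv. The task is to carry out the analysis and model training with train.csv and make predictions on test.csv. There are 79 columns describing each house, and information on each column can be found in data_description.txt. Explaining all 79 columns is too much for this post, so I will only explain the columns I consider necessary.

As Figure 1 shows, the data has 1460 rows and 81 columns; excluding Id and the target SalePrice leaves 79 columns. Figure 1 also shows that the Alley column contains many NaN values. In other words, how to handle NaN values can be expected to be an important issue for this data.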

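2. EDA with House Prices - Advanced Regression Techniques

1) Common code

The evaluation metric for House Prices - Advanced Regression Techniques is RMSLE. RMSLE (Root Mean Squared Logarithmic Error) is RMSE computed on log-transformed values; because errors grow with the size of the target value, taking logs keeps a few large errors from dominating the overall error. RMSE is the square root of MSE: since squaring makes MSE larger than the actual average error, the root brings it back to the scale of the target. MSE (mean squared error) is the mean of the squared differences between the actual and predicted target values. Put simply, RMSLE measures the difference between the logs of the predicted and actual values, and it is used when the values span a wide range or when relative error matters more than absolute error. In particular, it keeps errors on very large values from being over-weighted and makes the metric more sensitive to small values. The evaluation function below therefore reports RMSE and MAE alongside RMSLE.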

import numpy as np
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error

def rmsle(y, pred):
    # log1p avoids log(0) when a value is 0
    log_y = np.log1p(y)
    log_pred = np.log1p(pred)
    squared_error = (log_y - log_pred) ** 2
    return np.sqrt(np.mean(squared_error))

def rmse(y, pred):
    return np.sqrt(mean_squared_error(y, pred))

def evaluate_regr(y, pred):
    rmsle_val = rmsle(y, pred)
    rmse_val = rmse(y, pred)
    mae_val = mean_absolute_error(y, pred)
    print('RMSLE: {0:.3f}, RMSE: {1:.3f}, MAE: {2:.3f}'.format(rmsle_val, rmse_val, mae_val))
    return rmsle_val

# Prints all three metrics for each fold and returns RMSLE as the score
scores = make_scorer(evaluate_regr)
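To make the "relative error" point concrete, here is a minimal sketch with made-up prices (the two houses and their predictions are assumptions for illustration only). Both houses are over-predicted by 10%; RMSE differs by a factor of ten, while RMSLE is almost identical.

```python
import numpy as np

def rmsle(y, pred):
    """Root mean squared logarithmic error."""
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(pred)) ** 2))

def rmse(y, pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(pred)) ** 2))

# Two hypothetical houses, each over-predicted by 10%.
cheap_true, cheap_pred = np.array([100_000.0]), np.array([110_000.0])
pricey_true, pricey_pred = np.array([1_000_000.0]), np.array([1_100_000.0])

rmse_cheap = rmse(cheap_true, cheap_pred)
rmse_pricey = rmse(pricey_true, pricey_pred)
rmsle_cheap = rmsle(cheap_true, cheap_pred)
rmsle_pricey = rmsle(pricey_true, pricey_pred)

# RMSE scales with the price level; RMSLE only sees the 10% ratio.
print(rmse_cheap, rmse_pricey)
print(rmsle_cheap, rmsle_pricey)
```

This is why a competition scored with RMSLE does not let a handful of very expensive houses dominate the result.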

2) Analysis

1. Basic information about House Prices - Advanced Regression Techniques (structure)

data.shape
# (1460, 81)

As confirmed earlier, the House Prices - Advanced Regression Techniques dataset has 1460 rows and 81 columns. Since this dataset is expected to contain many NaN values, let's check them.

data.isnull().sum().sort_values(ascending=False)[:21]

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtFinType2      38
BsmtExposure      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
Electrical         1
Id                 0
Functional         0

So many columns contain NaN values that only the top of the list is shown, using slicing; in total, 19 columns contain NaN values. The top seven in particular (PoolQC, MiscFeature, Alley, Fence, MasVnrType, FireplaceQu, LotFrontage) have so many NaN values that I need to decide whether to drop them or fill them in some other way. Visualizing the NaN values with missingno gives Figure 2. Black marks cells that have a value, and white marks NaN cells.

import missingno
missingno.matrix(data)
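Alongside the raw counts, the share of missing values per column can help with the drop-or-fill decision. The following is only a sketch on a made-up frame (the `toy` values are assumptions, not rows from train.csv).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing = toy.isnull().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "ratio": missing / len(toy),  # fraction of rows that are NaN
})

# Keep only columns that actually have missing values, worst first.
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```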

With this many columns, some may be unnecessary: they may hurt the model's performance or have no effect at all. I therefore check for columns whose values are all NaN (as seen above, there are none) and for columns with only one unique value, and drop any that exist. In addition, I check whether any rows are identical in every column except the target SalePrice, and drop them if they exist.

all_nan_columns = data.columns[data.isna().all()].tolist()
print(f"Number of columns where every value is NaN: {len(all_nan_columns)}")
data.drop(columns=all_nan_columns, inplace=True)
test.drop(columns=all_nan_columns, inplace=True)
# Number of columns where every value is NaN: 0

unique_one_columns = [col for col in data.columns if data[col].nunique() == 1]
print(f'Number of columns with a single unique value: {len(unique_one_columns)}')
data.drop(columns=unique_one_columns, inplace=True)
test.drop(columns=unique_one_columns, inplace=True)
# Number of columns with a single unique value: 0

duplicate_all = data[data.drop(columns=['SalePrice']).duplicated(keep=False)]
duplicate_all
# none

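Next, I look at the correlations between columns using a heatmap. Figure 3 can be read as follows.

Weak correlation: variables such as Id, MiscVal, and MoSold show almost no correlation with SalePrice. They contribute little to predicting the sale price, so they may not matter for modeling.
Multicollinearity: some variables are very highly correlated with each other, such as GarageCars and GarageArea, or 1stFlrSF (first-floor area) and TotalBsmtSF (total basement area). This can cause multicollinearity, so it may help to remove one feature from each pair to reduce redundancy when building the model.

Since the correlations of Id, MiscVal, and MoSold are close to 0, I plan to drop them. For GarageCars and GarageArea, and for 1stFlrSF and TotalBsmtSF, I need to check through model training whether removing one helps, makes no difference, or hurts. Of course, dropping Id, MiscVal, and MoSold needs the same check.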
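The two readings of the heatmap can also be pulled out numerically. This is a sketch on synthetic data, assuming a 0.8 cutoff for "highly correlated" (the cutoff and the generated values are my assumptions, not figures from the competition data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
garage_cars = rng.integers(0, 4, size=n)

# Synthetic stand-in: GarageArea tracks GarageCars, MoSold is pure noise.
toy = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 20, size=n),
    "MoSold": rng.integers(1, 13, size=n),
})
toy["SalePrice"] = 100_000 + 30_000 * toy["GarageCars"] + rng.normal(0, 5_000, size=n)

corr = toy.corr()

# 1) Correlation of each feature with the target, strongest first.
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(key=np.abs, ascending=False)

# 2) Feature pairs that are highly correlated with each other.
features = corr.drop(index="SalePrice", columns="SalePrice")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
pairs = upper.stack()
collinear = pairs[pairs.abs() > 0.8]  # 0.8 is an assumed threshold

print(target_corr)
print(collinear)
```

Whether to actually drop one column of a flagged pair is still best decided by comparing validation scores, as described above.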

2. Visualization

Now let's visualize the numeric columns. The code below plots every numeric column against SalePrice.

import seaborn as sns
import matplotlib.pyplot as plt

numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []

for i in data.columns:
    if data[i].dtype in numeric_dtypes:
        if i in ['TotalSF', 'Total_Bathrooms','Total_porch_sf','haspool','hasgarage','hasbsmt','hasfireplace']:
            pass
        else:
            numeric.append(i)     

fig, axs = plt.subplots(ncols=3, nrows=(len(numeric) // 3 + 1), figsize=(18, len(numeric) * 2))
plt.subplots_adjust(hspace=0.5, wspace=0.3)

sns.color_palette("husl", 8)
for i, feature in enumerate(numeric):
    row = i // 3
    col = i % 3
    
    sns.scatterplot(x=feature, y='SalePrice', data=data, ax=axs[row][col])
    
    axs[row][col].set_xlabel(feature, fontsize=12)
    axs[row][col].set_ylabel('SalePrice', fontsize=12)
    axs[row][col].tick_params(axis='x', labelsize=10)
    axs[row][col].tick_params(axis='y', labelsize=10)

plt.tight_layout()
plt.show()

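With 81 columns in total, there are too many numeric columns to show every plot without hurting readability, so only some are shown in Figures 4 and 5. Removing outliers does not always improve a model, but Figures 4 and 5 suggest that the outliers need to be dealt with.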

3. Data cleaning

1. Dropping unnecessary columns

As seen when examining the basic structure of the dataset, there are no columns whose values are all NaN and no columns with a single unique value. For the weakly correlated columns Id, MiscVal, and MoSold, and for the highly correlated pairs GarageCars and GarageArea, and 1stFlrSF (first-floor area) and TotalBsmtSF (total basement area), the results were as follows. Dropping Id, MiscVal, and MoSold gave a better score, and among GarageCars, GarageArea, 1stFlrSF, and TotalBsmtSF, dropping only GarageArea gave the best score.

2. Handling NaN values

As the output below shows, the dataset has a great many NaN values, so they need to be handled. Columns with very many NaN values are commonly dropped, but before dropping them it is worth checking what the data actually represents.

train.isnull().sum().sort_values(ascending=False)[:21]

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtFinType2      38
BsmtExposure      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
Electrical         1
Id                 0
Functional         0

Let's go through the NaN values in the train data one by one, starting with the six columns that have the most. The interpretation and handling below follow the data_description.txt provided by kaggle rather than my own reading of the columns.

PoolQC is the pool quality. A NaN value here means there is no pool, so NaN was replaced with 'No' to mark that there is no pool.
MiscFeature is a miscellaneous feature of the house that the other columns do not cover. NaN means the house has no such extra feature, so it was replaced with 'No'.
Alley is the type of alley access to the property. NaN means there is no alley access, so it was replaced with 'No'.
Fence is the fence quality. NaN means there is no fence, so it was replaced with 'No'.
MasVnrType is the type of masonry veneer (decorative exterior facing) used on the outer walls. NaN means the house has no masonry veneer, so it was replaced with 'No'.
FireplaceQu is the fireplace quality. NaN means the house has no fireplace, so it was replaced with 'No'.
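The six replacements described above all follow the same rule, so they can be written as one operation. A minimal sketch on a made-up two-row frame (the values are assumptions; only the column names and the 'No' label come from the text above):

```python
import numpy as np
import pandas as pd

# Columns the post treats as "feature absent" when the value is missing.
none_columns = ["PoolQC", "MiscFeature", "Alley", "Fence", "MasVnrType", "FireplaceQu"]

# Toy rows standing in for train.csv.
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Ex"],
    "MiscFeature": [np.nan, "Shed"],
    "Alley": ["Grvl", np.nan],
    "Fence": [np.nan, np.nan],
    "MasVnrType": ["BrkFace", np.nan],
    "FireplaceQu": [np.nan, "TA"],
})

# Fill every listed column in one step; existing values are left untouched.
toy[none_columns] = toy[none_columns].fillna("No")
print(toy)
```

Applying the same list to both train and test keeps the two frames consistent.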

The remaining columns are explained briefly alongside the code. First, the NaN values of the garage columns.

# GarageFinish: interior finish of the garage -> NaN = no garage
train['GarageFinish'] = train['GarageFinish'].fillna('No')
train['GarageFinish'].unique()

# GarageType: garage location -> NaN = no garage
train['GarageType'] = train['GarageType'].fillna('No')
train['GarageType'].unique()

# GarageQual: garage quality -> NaN = no garage
train['GarageQual'] = train['GarageQual'].fillna('No')
train['GarageQual'].unique()

# GarageCond: garage condition -> NaN = no garage
train['GarageCond'] = train['GarageCond'].fillna('No')
train['GarageCond'].unique()

# GarageYrBlt: year the garage was built -> NaN = no garage
# (filling with 'No' turns this numeric column into an object column)
train['GarageYrBlt'] = train['GarageYrBlt'].fillna('No')
train['GarageYrBlt'].unique()

# Given the NaN handling above, a row where all five columns are 'No' has no garage,
# so GarageArea and GarageCars are set to 0 for the rows that meet the condition.
condition = (
    (train['GarageFinish'] == 'No') &
    (train['GarageType'] == 'No') &
    (train['GarageQual'] == 'No') &
    (train['GarageCond'] == 'No') &
    (train['GarageYrBlt'] == 'No')
)

train.loc[condition, 'GarageArea'] = 0 # Size of garage in square feet
train.loc[condition, 'GarageCars'] = 0 # Size of garage in car capacity

With the garage NaN values handled, 'No' now means there is no garage, so let's check the data. Each garage-related column had 81 NaN values, and the code below confirms that they are all consistent.

garage_columns = ['GarageFinish', 'GarageType', 'GarageQual', 'GarageCond', 'GarageYrBlt']

# Rows where at least one column is 'No'
no_garage_condition = (train[garage_columns] == 'No').any(axis=1)
# Rows where every column is 'No'
all_no_condition = (train[garage_columns] == 'No').all(axis=1)
# Rows that contain at least one 'No' and are 'No' in every column
matching_rows = train[no_garage_condition & all_no_condition]
matching_rows[garage_columns] # when there is no garage, every garage column says so


garage_columns = ['GarageCars', 'GarageArea']
one_zero_condition = (train[garage_columns] == 0).any(axis=1)
both_zero_condition = (train[garage_columns] == 0).all(axis=1)
matching_rows = train[one_zero_condition & both_zero_condition]

matching_rows[garage_columns] # when there is no garage, the remaining garage columns are 0 too

Next, the NaN values of the basement columns. As with the garage, they are explained briefly alongside the code.

# BsmtFinType1: rating of the basement finished area -> NaN = no basement
train['BsmtFinType1'] = train['BsmtFinType1'].fillna('No')
train['BsmtFinType1'].unique()

# BsmtFinType2: rating of the second basement finished area -> NaN = no basement
train['BsmtFinType2'] = train['BsmtFinType2'].fillna('No')
train['BsmtFinType2'].unique()

# BsmtExposure: basement exposure -> NaN = no basement
train['BsmtExposure'] = train['BsmtExposure'].fillna('No')
train['BsmtExposure'].unique()

# BsmtCond: general condition of the basement -> NaN = no basement
train['BsmtCond'] = train['BsmtCond'].fillna('No')
train['BsmtCond'].unique()

# BsmtQual: rates the height of the basement -> NaN = no basement
train['BsmtQual'] = train['BsmtQual'].fillna('No')
train['BsmtQual'].unique()

# As with the garage, a row where all five columns are 'No' has no basement,
# so the basement-related numeric columns are set to 0 for those rows.
condition = (
    (train['BsmtFinType2'] == 'No') &
    (train['BsmtExposure'] == 'No') &
    (train['BsmtQual'] == 'No') &
    (train['BsmtCond'] == 'No') &
    (train['BsmtFinType1'] == 'No')
)

train.loc[condition, 'BsmtFullBath'] = 0
train.loc[condition, 'BsmtHalfBath'] = 0
train.loc[condition, 'BsmtFinSF1'] = 0
train.loc[condition, 'BsmtFinSF2'] = 0
train.loc[condition, 'BsmtUnfSF'] = 0
train.loc[condition, 'TotalBsmtSF'] = 0

One peculiarity is that BsmtFinType2 and BsmtExposure each have 38 rows with NaN, one more than the other basement columns. I decided to fill this case by prediction, with the code below. As seen above, when BsmtFinType1 is 'No' the other basement columns are 'No' as well, so I checked 'BsmtFinType1' and 'BsmtFinType2' together. As a result, I confirmed that row 332 is where BsmtFinType2 and BsmtExposure are 'No'.

# Exception
exception_Bs_columns = ['BsmtFinType1', 'BsmtFinType2']

exception_Bs_condition = (train[exception_Bs_columns] == 'No').any(axis=1)
print(len(train[exception_Bs_condition]))
train[exception_Bs_condition]
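A related check that may be easier to read is to look for rows where only some of the basement columns are missing, before any filling is done. The sketch below uses a made-up three-row frame (an assumption for illustration). One reason to work on the raw NaN mask: as far as data_description.txt goes, BsmtExposure already uses 'No' as a real category (no exposure), so after fillna('No') a missing value and a genuine 'No' can no longer be told apart in that column.

```python
import numpy as np
import pandas as pd

bsmt_columns = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]

# Toy rows standing in for train.csv (values are made up).
toy = pd.DataFrame(
    [
        ["Gd", "TA", "No", "GLQ", "Unf"],    # basement with a genuine 'No' exposure
        [np.nan] * 5,                        # no basement at all
        ["Gd", "TA", "Av", "GLQ", np.nan],   # basement exists, one value missing
    ],
    columns=bsmt_columns,
)

is_missing = toy[bsmt_columns].isna()

# Some, but not all, basement columns are missing: these need real imputation.
partial = is_missing.any(axis=1) & ~is_missing.all(axis=1)
partial_rows = toy.index[partial].tolist()
print(partial_rows)
```

Rows flagged this way are the candidates for prediction; rows where every column is missing can safely be labelled as having no basement.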

Rather than transforming the object and category columns further, using the functionality the model itself provides can work better, and in practice it did, so I predicted the BsmtFinType2 and BsmtExposure values with CatBoost. The Electrical column also has a NaN value, so it was predicted the same way.

from catboost import CatBoostClassifier

def convert_to_string(dataframe, category_columns):
    # cat_features columns are passed to CatBoost as strings
    for col in category_columns:
        dataframe[col] = dataframe[col].astype(str)
    return dataframe

pred = train.drop(['Electrical', 'LotFrontage', 'MasVnrArea', 'BsmtExposure'], axis=1)
test_pred = pred.iloc[[332]].copy()
test_pred.drop(['BsmtFinType2'], axis=1, inplace=True)

pred = pred.drop(index=332).copy()
pred_y = pred['BsmtFinType2']
pred_X = pred.drop(['BsmtFinType2'], axis=1)

category_df = pred_X.select_dtypes(include=['object', 'category']).columns
category_columns = category_df.tolist()

pred_X = convert_to_string(pred_X, category_columns)
test_pred[category_columns] = test_pred[category_columns].astype(str)

cat = CatBoostClassifier(cat_features=category_columns, verbose=False, random_state=156)
cat.fit(pred_X, pred_y)

predict_test = cat.predict(test_pred)
predict_test

pred = train.drop(['Electrical', 'LotFrontage', 'MasVnrArea'], axis=1)
test_pred = pred.iloc[[332]].copy()
test_pred.drop(['BsmtExposure'], axis=1, inplace=True)

pred = pred.drop(index=332).copy()
pred_y = pred['BsmtExposure']
pred_X = pred.drop(['BsmtExposure'], axis=1)

category_df = pred_X.select_dtypes(include=['object', 'category']).columns
category_columns = category_df.tolist()

pred_X = convert_to_string(pred_X, category_columns)
test_pred[category_columns] = test_pred[category_columns].astype(str)

cat = CatBoostClassifier(cat_features=category_columns, verbose=False, random_state=156)
cat.fit(pred_X, pred_y)

predict_test = cat.predict(test_pred)
predict_test

nan_rows = train[train['Electrical'].isna()]
pred = train.drop(['LotFrontage', 'MasVnrArea'], axis=1)

test_pred = pred.loc[nan_rows.index].copy()
test_pred.drop(['Electrical'], axis=1, inplace=True)
pred = pred.drop(index=nan_rows.index).copy()
pred_y = pred['Electrical']
pred_X = pred.drop(['Electrical'], axis=1)

category_df = pred_X.select_dtypes(include=['object', 'category']).columns
category_columns = category_df.tolist()

pred_X = convert_to_string(pred_X, category_columns)
test_pred[category_columns] = test_pred[category_columns].astype(str)

cat = CatBoostClassifier(cat_features=category_columns, verbose=False, random_state=156)
cat.fit(pred_X, pred_y)

predict_test = cat.predict(test_pred)
predict_test
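The three blocks above repeat one pattern: train on the rows where the target column is known, then predict the rows where it is missing. That pattern could be wrapped in a helper, sketched below. To keep the example self-contained it uses scikit-learn's DecisionTreeClassifier on a made-up numeric frame as a stand-in for CatBoostClassifier; the helper name `impute_with_model` and the toy values are hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def impute_with_model(df, target, model):
    """Fill missing values of `target` with predictions from `model`.

    Hypothetical helper: fits on rows where `target` is known and predicts
    the rows where it is missing. Features must already be numeric here.
    """
    known = df[df[target].notna()]
    unknown = df[df[target].isna()]
    if unknown.empty:
        return df.copy()
    feature_cols = [c for c in df.columns if c != target]
    model.fit(known[feature_cols], known[target].to_numpy())
    filled = df.copy()
    filled.loc[unknown.index, target] = model.predict(unknown[feature_cols])
    return filled

# Toy data: older houses use fuses, newer ones circuit breakers (made up).
toy = pd.DataFrame({
    "YearBuilt": [1950, 1955, 1960, 2000, 2005, 2010, 2008],
    "OverallQual": [4, 5, 4, 8, 7, 8, 8],
    "Electrical": ["FuseA", "FuseA", "FuseA", "SBrkr", "SBrkr", "SBrkr", np.nan],
})

filled = impute_with_model(toy, "Electrical", DecisionTreeClassifier(random_state=0))
print(filled)
```

Unlike the blocks above, the helper also writes the prediction back into the frame, which is the step that turns a prediction into an imputed value.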

In addition, the columns containing NaN values in the test dataset differ from those in the train dataset, so additional NaN handling was carried out.

# Exterior1st, Exterior2nd: exterior covering material of the house
train['Exterior1st'] = train['Exterior1st'].fillna('No') # other
train['Exterior1st'].unique()

train['Exterior2nd'] = train['Exterior2nd'].fillna('No') # other
train['Exterior2nd'].unique()

# KitchenQual: kitchen quality
train['KitchenQual'] = train['KitchenQual'].fillna('TA') # typical/average
train['KitchenQual'].unique()

# SaleType: type of sale or transaction used when the house was sold
train['SaleType'] = train['SaleType'].fillna('WD') # conventional (warranty deed)
train['SaleType'].unique()

# Utilities: type of utilities available to the house
train['Utilities'] = train['Utilities'].fillna(test['Utilities'].mode()[0]) # most frequent value
train['Utilities'].unique()

# Functional: home functionality, ranging from typical to major defects or damage
train['Functional'] = train['Functional'].fillna('Typ') # typical
train['Functional'].unique()

# MSZoning: zoning classification of the land the house sits on
train['MSZoning'] = train['MSZoning'].fillna(test['MSZoning'].mode()[0]) # most frequent value
train['MSZoning'].unique()

MasVnrArea and LotFrontage have NaN values in both train and test; MasVnrArea was filled with the mean, and LotFrontage was predicted with a regression model.

train["MasVnrArea"] = train["MasVnrArea"].fillna(train["MasVnrArea"].mean())
test["MasVnrArea"] = test["MasVnrArea"].fillna(test["MasVnrArea"].mean())

lot = train[train['LotFrontage'].isnull()]
lot.index

pred = train
test_pred = pred.iloc[lot.index].copy()
test_pred.drop(['LotFrontage'], axis=1, inplace=True)

pred = pred.drop(index=lot.index).copy()
pred_y = pred['LotFrontage']
pred_X = pred.drop(['LotFrontage'], axis=1)

category_df = pred_X.select_dtypes(include=['object', 'category']).columns
category_columns = category_df.tolist()

# Convert categorical columns to strings (they may contain float values)
pred_X = convert_to_string(pred_X, category_columns)
test_pred[category_columns] = test_pred[category_columns].astype(str)

from catboost import CatBoostRegressor

cat = CatBoostRegressor(cat_features=category_columns, verbose=False, random_state=156)
cat.fit(pred_X, pred_y)

predict_test = cat.predict(test_pred)
predict_test

train.loc[lot.index, 'LotFrontage'] = predict_test
train.loc[lot.index, 'LotFrontage']

In addition, MSSubClass, OverallCond, YrSold, and MoSold are better treated as categorical than numeric, so I converted them to str. The reasons are given in the comments.

# MSSubClass=The building class
train['MSSubClass'] = train['MSSubClass'].apply(str)

# Changing OverallCond into a categorical variable
train['OverallCond'] = train['OverallCond'].astype(str)

# Year and month sold are transformed into categorical features.
train['YrSold'] = train['YrSold'].astype(str)
train['MoSold'] = train['MoSold'].astype(str)


test['MSSubClass'] = test['MSSubClass'].apply(str)
test['OverallCond'] = test['OverallCond'].astype(str)
test['YrSold'] = test['YrSold'].astype(str)
test['MoSold'] = test['MoSold'].astype(str)

train.drop(['Id', 'MiscVal', 'MoSold'], axis=1, inplace=True)
test.drop(['Id', 'MiscVal', 'MoSold'], axis=1, inplace=True)

As a result, the data now has 1460 rows and 77 columns, as shown in Figure 6.

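3. Handling outliers

Let's look at outliers. An outlier is an abnormal value that deviates from the distribution of the data, and it can harm both analysis and modeling. Outliers can be removed with the IQR rule or with methods such as IsolationForest and DBSCAN. I remove them with IsolationForest.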

train_1 = train.copy()

y = train_1['SalePrice']
train_1.drop(['SalePrice'], axis=1, inplace=True)

from sklearn.preprocessing import LabelEncoder

label_encoders = {}

for col in train_1.select_dtypes(include=['object']).columns:
    train_1[col] = train_1[col].astype(str)
    
    le = LabelEncoder()
    train_1[col] = le.fit_transform(train_1[col])
    label_encoders[col] = le
    
# Outlier detection
from sklearn.ensemble import IsolationForest

clf = IsolationForest(
    n_estimators=50, 
    max_samples=50, 
    contamination=float(0.004), 
    max_features=1.0, 
    bootstrap=False, 
    n_jobs=-1, 
    verbose=0)

clf.fit(train_1)
pred = clf.predict(train_1)

train['label'] = pred

# Extract the outlier indices (-1 marks an outlier)
outlier_index = train[train['label'] == -1].index

# Drop the outlier rows from the original train data
train_cleaned = train.drop(outlier_index)
train_cleaned = train_cleaned.drop(columns=['label'])  # remove the 'label' column

# Drop the same rows from the target y
y_cleaned = y.drop(outlier_index)

# Check the result
print(train_cleaned.shape)
print(y_cleaned.shape)

# Restore the target column
train_cleaned['SalePrice'] = y_cleaned
train = train_cleaned.copy()
train

After removing the outliers, the number of rows drops from 1460 to 1454.
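For comparison, the IQR rule mentioned earlier as an alternative can be sketched in a few lines. This is not what was used in this post; the `iqr_outlier_mask` helper, the 1.5 multiplier, and the toy values are assumptions for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy living-area values with one clearly unusual house (made up).
toy = pd.Series([1200, 1350, 1500, 1600, 1700, 1850, 2000, 5600], name="GrLivArea")

mask = iqr_outlier_mask(toy)
print(toy[mask])
```

The IQR rule looks at one column at a time, whereas IsolationForest scores each row using all columns together, which is one reason the two can flag different rows.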

4. Feature Engineering

A machine learning model can be limited in how well it picks up complex patterns on its own. So here I build a few features based on intuition about the dataset to improve the model's performance.

train['HasWoodDeck'] = (train['WoodDeckSF'] == 0) * 1
train['HasOpenPorch'] = (train['OpenPorchSF'] == 0) * 1
train['HasEnclosedPorch'] = (train['EnclosedPorch'] == 0) * 1
train['Has3SsnPorch'] = (train['3SsnPorch'] == 0) * 1
train['HasScreenPorch'] = (train['ScreenPorch'] == 0) * 1

train['YrBltAndRemod'] = train['YearBuilt'] + train['YearRemodAdd']

train['Total_Bathrooms'] = (train['FullBath'] + (0.5 * train['HalfBath']) + train['BsmtFullBath'] + (0.5 * train['BsmtHalfBath']))

new_feature = ['HasWoodDeck', 'HasOpenPorch', 'HasEnclosedPorch', \
'Has3SsnPorch', 'HasScreenPorch', 'Total_Bathrooms', 'YrBltAndRemod']

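As in the code above, I created seven additional features. The flag columns are defined as follows; note that despite the Has prefix, a value of 1 means the feature is absent.

train['HasWoodDeck'] = (train['WoodDeckSF'] == 0) * 1 -> houses without a wood deck get 1 in the HasWoodDeck column, houses with one get 0
train['HasOpenPorch'] = (train['OpenPorchSF'] == 0) * 1 -> houses without an open porch get 1 in the HasOpenPorch column, houses with one get 0
train['HasEnclosedPorch'] = (train['EnclosedPorch'] == 0) * 1 -> houses without an enclosed porch get 1 in the HasEnclosedPorch column, houses with one get 0
train['Has3SsnPorch'] = (train['3SsnPorch'] == 0) * 1 -> houses without a three-season porch get 1 in the Has3SsnPorch column, houses with one get 0
train['HasScreenPorch'] = (train['ScreenPorch'] == 0) * 1 -> houses without a screen porch get 1 in the HasScreenPorch column, houses with one get 0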
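To see what these expressions produce, here is a small sketch on two made-up houses (the values are assumptions). It uses the same expressions as the post for one flag and for Total_Bathrooms.

```python
import pandas as pd

# Two hypothetical houses: the first has no wood deck, the second has one.
toy = pd.DataFrame({
    "WoodDeckSF": [0, 120],
    "FullBath": [1, 2],
    "HalfBath": [1, 0],
    "BsmtFullBath": [0, 1],
    "BsmtHalfBath": [0, 1],
})

# Same expression as in the post: 1 marks a house WITHOUT a wood deck.
toy["HasWoodDeck"] = (toy["WoodDeckSF"] == 0) * 1

# Half baths count as 0.5, above ground and in the basement alike.
toy["Total_Bathrooms"] = (
    toy["FullBath"] + 0.5 * toy["HalfBath"]
    + toy["BsmtFullBath"] + 0.5 * toy["BsmtHalfBath"]
)
print(toy[["HasWoodDeck", "Total_Bathrooms"]])
```

Since the flag is binary, flipping the condition to `> 0` would carry the same information for a model; only the interpretation of 0 and 1 changes.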

3. Model training

Six models were used for training. They are listed below; of these, I only walk through the two that performed best, Lasso and CatBoost.

RandomForest
Ridge
Lasso
XGBoost
LightGBM
CatBoost

1) Common code

numeric_df = train.select_dtypes(include=['int', 'float']).columns.drop('SalePrice')
numeric_df = numeric_df.drop(new_feature, errors='ignore')
numeric_data = train[numeric_df]

log_scaled_numeric_data = np.log1p(numeric_data)
log_scaled_numeric_df = pd.DataFrame(log_scaled_numeric_data, columns=numeric_df, index=train.index)

train[numeric_df] = log_scaled_numeric_df


test_numeric_df = test.select_dtypes(include=['int', 'float']).columns
test_numeric_df = test_numeric_df.drop(new_feature, errors='ignore')
test_numeric_data = test[test_numeric_df]

test_log_scaled_numeric_data = np.log1p(test_numeric_data)
test_log_scaled_numeric_df = pd.DataFrame(test_log_scaled_numeric_data, columns=test_numeric_df, index=test.index)

test[test_numeric_df] = test_log_scaled_numeric_df

from sklearn.preprocessing import OneHotEncoder

categorical_columns = train.select_dtypes(include=['object']).columns

train[categorical_columns] = train[categorical_columns].astype(str)
test[categorical_columns] = test[categorical_columns].astype(str)

encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')

train_encoded = encoder.fit_transform(train[categorical_columns])
test_encoded = encoder.transform(test[categorical_columns])

train_encoded_df = pd.DataFrame(train_encoded, columns=encoder.get_feature_names_out(categorical_columns), index=train.index)
test_encoded_df = pd.DataFrame(test_encoded, columns=encoder.get_feature_names_out(categorical_columns), index=test.index)

train_final = pd.concat([train.drop(columns=categorical_columns), train_encoded_df], axis=1)
test_final = pd.concat([test.drop(columns=categorical_columns), test_encoded_df], axis=1)

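Before training, log transformation and one-hot encoding were applied across the board. The log transformation was applied to every numeric column except the target SalePrice and the new features created during feature engineering. The target SalePrice was log-transformed as well. One-hot encoding was then applied to the object-type columns.

For SalePrice, the difference before and after the log transformation is shown in Figure 7. The top plots are the distribution and Q-Q plot of the original SalePrice, and the bottom plots are those of the log-transformed SalePrice. After the transformation the distribution looks closer to normal, and the points in the Q-Q plot follow the theoretical line more closely. This shows that the log transformation reduces the skewness of the data, which can help model performance.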
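The effect on skewness can be checked numerically as well as visually. The sketch below uses synthetic right-skewed "prices" drawn from a log-normal distribution (an assumption; it is not the competition data) and also shows `np.expm1`, the inverse of `np.log1p`, which is what brings predictions made on the log scale back to prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed prices (log-normal), standing in for SalePrice.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

log_prices = np.log1p(prices)

skew_before = prices.skew()      # clearly positive: long right tail
skew_after = log_prices.skew()   # close to 0 after the transform

# expm1 undoes log1p, so log-scale predictions can be mapped back to prices.
restored = np.expm1(log_prices)

print(skew_before, skew_after)
```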

2) Model training

1. Lasso

For Lasso, the score was better when the columns below were not converted to a categorical type.

#MSSubClass=The building class
train['MSSubClass'] = train['MSSubClass'].apply(str)

#Changing OverallCond into a categorical variable
train['OverallCond'] = train['OverallCond'].astype(str)

#Year and month sold are transformed into categorical features.
train['YrSold'] = train['YrSold'].astype(str)
train['MoSold'] = train['MoSold'].astype(str)


test['MSSubClass'] = test['MSSubClass'].apply(str)
test['OverallCond'] = test['OverallCond'].astype(str)
test['YrSold'] = test['YrSold'].astype(str)
test['MoSold'] = test['MoSold'].astype(str)

train.drop(['Id', 'MiscVal', 'MoSold'], axis=1, inplace=True)
test.drop(['Id', 'MiscVal', 'MoSold'], axis=1, inplace=True)

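Because the dataset is small, I did not hold out a separate validation split from train; instead, GridSearchCV trains and evaluates on the data several times through cross-validation, which is meant to guard against overfitting and get the best generalization performance. The result was not bad, but the score after submitting to kaggle was not good.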

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# X_features: encoded feature matrix, y_log: log-transformed SalePrice (prepared beforehand)
def print_best_params(model, params):
    grid_model = GridSearchCV(model, param_grid=params, scoring='neg_mean_squared_error', cv=5)
    grid_model.fit(X_features, y_log)
    rmse = np.sqrt(-1 * grid_model.best_score_)
    print('{0} best mean RMSE over 5-fold CV: {1}, best alpha: {2}'.format(model.__class__.__name__, np.round(rmse, 4), grid_model.best_params_))
    return grid_model.best_estimator_

lasso = Lasso()  # the original definition is not shown; default settings are assumed
lasso_params = { 'alpha':[0.001, 0.005, 0.008, 0.05, 0.03, 0.1, 0.5, 1, 5, 10] } # L1: penalizes the sum of the absolute values of the weights
best_lasso = print_best_params(lasso, lasso_params)

Lasso best mean RMSE over 5-fold CV: 0.1252, best alpha: {'alpha': 0.001}

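3) Additional data cleaning

Since the kaggle score was not high, I took another look at the outliers and found a column where removing more of them looked worthwhile.

In Figure 8, the left plot is before outlier removal and the right plot is after. GrLivArea is the above-ground living area, and in general the larger the house, the higher its price. As the right plot shows, however, IsolationForest did not remove the outliers completely, so I removed the remaining ones and trained the model again. As shown below, the model's performance improved further. This score is in the top 11%.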

Lasso best mean RMSE over 5-fold CV: 0.1182, best alpha: {'alpha': 0.001}
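The exact rule used for this extra cleaning is not shown, so the following is only a sketch of one way it could look: drop houses that are very large but sold cheaply. The thresholds `AREA_LIMIT` and `PRICE_LIMIT` and the toy rows are hypothetical values chosen for illustration, not the ones used for the score above.

```python
import pandas as pd

# Toy rows standing in for train (values are made up).
toy = pd.DataFrame({
    "GrLivArea": [1500, 2200, 4700, 5600, 4300],
    "SalePrice": [180_000, 260_000, 185_000, 160_000, 750_000],
})

AREA_LIMIT = 4000      # hypothetical cutoff for "very large"
PRICE_LIMIT = 300_000  # hypothetical cutoff for "unexpectedly cheap"

# Large living area combined with a low price breaks the usual trend.
is_outlier = (toy["GrLivArea"] > AREA_LIMIT) & (toy["SalePrice"] < PRICE_LIMIT)
cleaned = toy[~is_outlier].reset_index(drop=True)
print(cleaned)
```

Note that the large and expensive house is kept: it follows the trend, so it is not treated as an outlier by this rule.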

4) CatBoost

The score was best when the model was trained after converting the columns below to str categories, and being able to use CatBoost reduced the burden of hyperparameter tuning.

#MSSubClass=The building class
train['MSSubClass'] = train['MSSubClass'].apply(str)

#Changing OverallCond into a categorical variable
train['OverallCond'] = train['OverallCond'].astype(str)

#Year and month sold are transformed into categorical features.
train['YrSold'] = train['YrSold'].astype(str)
train['MoSold'] = train['MoSold'].astype(str)


test['MSSubClass'] = test['MSSubClass'].apply(str)
test['OverallCond'] = test['OverallCond'].astype(str)
test['YrSold'] = test['YrSold'].astype(str)
test['MoSold'] = test['MoSold'].astype(str)

train.drop(['Id', 'MiscVal', 'MoSold'], axis=1, inplace=True)
test.drop(['Id', 'MiscVal', 'MoSold'], axis=1, inplace=True)

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from catboost import CatBoostRegressor

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

category_df = train.select_dtypes(include=['object', 'category']).columns
category_columns = category_df.tolist()

cat = CatBoostRegressor(n_estimators=1000, random_state=0, verbose=False, cat_features=category_columns)

# X_features_cat, y_log_cat: features and log-transformed SalePrice prepared for CatBoost
score = cross_val_score(cat, X_features_cat, y_log_cat, cv=k_fold, scoring=scores)
score = score.mean()

# The closer to 0, the better
print("Score= {0:.5f}".format(score))

As in the code above, k-fold cross-validation was used to make the most of the small dataset. The final scores are shown below, and the kaggle submission ranked 275th, which is in the top 6%.

RMSLE: 0.011, RMSE: 0.140, MAE: 0.086
RMSLE: 0.009, RMSE: 0.110, MAE: 0.077
RMSLE: 0.011, RMSE: 0.134, MAE: 0.086
RMSLE: 0.009, RMSE: 0.114, MAE: 0.079
RMSLE: 0.009, RMSE: 0.115, MAE: 0.081
Score= 0.00952
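One thing worth keeping in mind when reading these numbers: the scorer applies `rmsle` to a target that has already been log-transformed, so the RMSLE column is a log of a log and comes out roughly ten times smaller than the RMSE column. The sketch below, with made-up prices and predictions, shows that RMSLE on raw prices matches RMSE on log prices, while RMSLE on log prices is on a much smaller scale.

```python
import numpy as np

def rmsle(y, pred):
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(pred)) ** 2))

def rmse(y, pred):
    return np.sqrt(np.mean((y - pred) ** 2))

# Hypothetical prices and predictions on the original scale.
price_true = np.array([150_000.0, 200_000.0, 320_000.0])
price_pred = np.array([165_000.0, 185_000.0, 300_000.0])

# The same values after the log transform used for training.
log_true, log_pred = np.log1p(price_true), np.log1p(price_pred)

rmsle_on_prices = rmsle(price_true, price_pred)  # the competition metric
rmse_on_logs = rmse(log_true, log_pred)          # identical by construction
rmsle_on_logs = rmsle(log_true, log_pred)        # log applied twice, much smaller

print(rmsle_on_prices, rmse_on_logs, rmsle_on_logs)
```

So for comparing against the leaderboard or against the Lasso result, the RMSE column of the output above is the figure on the competition's scale.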

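4. Conclusion

1) EDA and key insights

1. The importance of handling NaN values

The biggest issue with this dataset was the large number of NaN values across many columns. In PoolQC, MiscFeature, Alley, and Fence, for example, most values were NaN. I believe the choice of NaN handling strategy therefore had a large effect on model performance.
Because columns such as PoolQC, MiscFeature, and Alley were mostly NaN, I filled them with 'No' based on data_description.txt so that the missing value carries meaning. Numeric NaN values (LotFrontage and others), on the other hand, were filled with the mean or with predictions from a regression model.

2. Correlation analysis

The heatmap-based correlation analysis showed that several variables are highly correlated with SalePrice.
OverallQual, GrLivArea, TotalBsmtSF, GarageCars, and others were highly correlated with SalePrice, which marks them as variables with an important effect on model performance. Variables such as Id, MiscVal, and MoSold, by contrast, showed almost no correlation, and excluding them from training worked better.
I was also reminded that dropping a variable just because its correlation is weak, or strong, does not automatically improve a model.

3. Handling outliers

Identifying and handling outliers played an important role in improving model performance. In particular, finding and removing outliers in variables such as GrLivArea improved the Lasso model's cross-validated RMSE from 0.1252 to 0.1182. Outliers were detected with Isolation Forest and removed to improve generalization.

4. Feature Engineering

Because machine learning models are limited in how well they pick up complex patterns, Feature Engineering can improve performance further.
For example, variables such as HasWoodDeck, HasOpenPorch, and Total_Bathrooms were added to reflect what the house actually offers. A variable such as YrBltAndRemod, which combines the year built and the year remodeled, also had a positive effect on model performance.

2) Model performance comparison

Several machine learning algorithms were tried for comparison, and Lasso and CatBoost performed best.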

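1. Lasso

Lasso (L1 Regularization) performs variable selection and regularization at the same time, so it automatically drops unnecessary variables and reduces model complexity.
GridSearchCV was used to find the best hyperparameter, and the best performance came with alpha set to 0.001.
The final score was a cross-validated RMSE of 0.1182 on the log-transformed SalePrice (equivalent to RMSLE on the original prices), which placed Lasso in the top 11%.

2. CatBoost

CatBoost handles categorical variables automatically, which simplifies preprocessing.
The categorical variables were converted to strings and passed through CatBoost's cat_features option, and performance was evaluated with k-fold cross-validation.
The final cross-validation score was 0.00952, and the CatBoost model achieved the best leaderboard result, in the top 6%. Note that this score comes from applying the rmsle function to an already log-transformed target, so it is not on the same scale as Lasso's 0.1182; the per-fold RMSE values on the log scale (0.110 to 0.140) are the comparable figures.

Model performance summary:

Lasso: CV RMSE on log SalePrice 0.1182 (top 11%)
CatBoost: CV score 0.00952, RMSLE computed on log SalePrice (top 6%)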

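3) Performance gains from log transformation

SalePrice has a very wide range and an asymmetric distribution, so a log transformation was used to rescale the values and make the model more sensitive to small values.

1. Effect of the log transformation

The log transformation was applied not only to SalePrice but also to the numeric variables (GrLivArea, TotalBsmtSF, and so on). This brought their distributions closer to normal and kept the model from over-reacting to extreme values.

4) Limitations

Limited amount of data
- The dataset is relatively small, so there was a risk of overfitting when applying complex models. Cross-validation was used to guard against this, but training on more data would probably have improved performance further.
Difficulty of handling outliers
- Some outliers were not handled perfectly. Some outliers can hurt model performance, but I learned that removing or correcting them completely is not always the best choice.
Handling NaN values
- Some NaN values were handled manually, but more sophisticated ways of predicting or replacing them may have been possible.

The House Prices - Advanced Regression Techniques competition was a good opportunity to practice a range of machine learning techniques and apply them to a real problem. Going through the whole process, from EDA and data preprocessing to outlier handling, Feature Engineering, and model training and tuning, made the complexity of house price prediction tangible.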
