Data Analysis Comprehensive Course Week 5 [Sparta Coding] <Customer Behavior Prediction, Customer Behavior Prediction>


Jaegool 2022. 6. 28. 19:49

https://teamsparta.notion.site/5-e126d4fb86e34b47bc073f18ba79c293

[Sparta Coding Club] Data Analysis Comprehensive Course - Week 5

A PDF file is posted at the start of each week's lecture materials!

https://programmerpsy.tistory.com/17

[Pandas Basics] 4. Connecting Multiple DataFrames (Join)

Hello, this is PSYda. This post looks at Join techniques for connecting two DataFrames. Topics covered: column-based Join, index-based Join, row-based Join, Inner, Left, Right ...

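Merging data

pd.merge(left=DF1, right=DF2, how="left", on="이름")  # "이름" is a column whose name means "name"

After a join, it is a good idea to check whether anything went wrong with the data.

First, check that the number of rows has not changed.

After that, check for missing values.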
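The two checks above could be sketched as follows on small made-up tables (`customers` and `classes` are hypothetical names, not the course data). A left join keeps the row count of the left table as long as the key is unique on the right side, and keys without a match show up as missing values.

```python
import pandas as pd

# Hypothetical toy tables, for illustration only.
customers = pd.DataFrame({"customer_id": ["A", "B", "C"],
                          "class": ["C01", "C02", "C99"]})
classes = pd.DataFrame({"class": ["C01", "C02"],
                        "class_name": ["all-day", "daytime"]})

joined = pd.merge(customers, classes, on="class", how="left")

# 1) The row count should match the left table.
rows_before, rows_after = len(customers), len(joined)
print(rows_before, rows_after)

# 2) Count missing values per column; class "C99" has no match on the right.
missing = joined.isnull().sum()
print(missing)
```

If the row count grows after a left join, a likely cause is duplicated keys in the right table.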

<Thinking up hypotheses on my own while using groupby>

customer_join.groupby("class_name")["customer_id"].count()
customer_join.groupby("campaign_name")["customer_id"].count()
customer_join.groupby("gender")["customer_id"].count()
customer_join.groupby("is_deleted")["customer_id"].count()

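+ Campaign preference by gender, characteristics of the members who joined this year, etc.

Getting hypotheses like these from aggregation is important, but so are voices from the field. For this data, that means listening to the members directly, which often leads to better hypotheses and insights.

# Code I could not recall exactly during the exercise.
# apply(pd.to_datetime): converts the column to a datetime type so that date operations become possible.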

uselog['usedate'] = uselog['usedate'].apply(pd.to_datetime)

uselog['연월'] = uselog["usedate"].dt.strftime("%Y%m") # '연월' means year-month; the day of the month is '%d'
uselog
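As a small sketch of the conversion above on made-up data (the `log` table and the `year_month` and `day` column names are hypothetical): `pd.to_datetime` can also be called on the whole column directly, which should give the same result as `.apply(pd.to_datetime)` here, and `strftime` then formats each date as a string.

```python
import pandas as pd

# Hypothetical usage log, for illustration only.
log = pd.DataFrame({"customer_id": ["A", "A", "B"],
                    "usedate": ["2018-04-01", "2018-05-03", "2018-04-15"]})

# Convert the string column to datetime in one vectorized call.
log["usedate"] = pd.to_datetime(log["usedate"])

# "%Y%m" gives year and month; "%d" gives the day of the month.
log["year_month"] = log["usedate"].dt.strftime("%Y%m")
log["day"] = log["usedate"].dt.strftime("%d")
print(log)
```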

uselog_months = uselog.groupby(['연월', 'customer_id'], as_index=False).count()
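# Here, "as_index=False" keeps the group keys ('연월', 'customer_id') as regular columns instead of moving them into the index, so the default 0, 1, 2, ... index is kept.

# Note: this assumes the counted column was renamed to "count" beforehand; that step is not shown in these notes.
uselog_customer = uselog_months.groupby("customer_id")["count"].agg(["mean", "median","max", "min"])
uselog_customer = uselog_customer.reset_index(drop=False)
uselog_customer

# The second line above ("drop=False") does not discard the old index: it turns "customer_id"
# back into a regular column and assigns a fresh 0, 1, 2, ... index on the left, which is easier to work with.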
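The difference between `as_index=False` and `reset_index(drop=False)` may be easier to see on a tiny made-up table (`visits` and its values are hypothetical). Both routes end with `customer_id` as an ordinary column, which is what a later `pd.merge(..., on="customer_id")` needs.

```python
import pandas as pd

# Hypothetical monthly visit counts, for illustration only.
visits = pd.DataFrame({
    "year_month": ["201804", "201804", "201805", "201805"],
    "customer_id": ["A", "B", "A", "B"],
    "count": [4, 2, 6, 2],
})

# Default: the group key becomes the index of the result.
indexed = visits.groupby("customer_id")["count"].agg(["mean", "max"])

# reset_index(drop=False) moves that index back into a "customer_id" column.
flat = indexed.reset_index(drop=False)

# as_index=False keeps the key as a column from the start.
direct = visits.groupby("customer_id", as_index=False)["count"].mean()
print(flat)
print(direct)
```

With `drop=True` the old index would be thrown away instead, and the `customer_id` values would be lost.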

customer_join = pd.merge(customer_join, uselog_customer, on="customer_id", how="left")
customer_join = pd.merge(customer_join, uselog_weekday[['customer_id', 'routine_flg']], on="customer_id", how='left')

<What I have to remember after doing the homework (comprehensive data analysis using telecom customer data)>

Sparta_CodingClub_Telco_Customer.csv (0.93MB)

Sparta_CodingClub_5강_통신사_고객_데이터를_이용한_종합적_데이터_분석.ipynb (0.34MB)

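Churn : Whether the customer churned; the dependent variable.
customerID : The customer's unique ID.
gender : The customer's gender.
SeniorCitizen : Whether the customer is a senior citizen.
Partner : Whether the customer has a partner (marital status).
Dependents : Whether the customer has dependents.
tenure : Number of months the customer has stayed with the company.
PhoneService : Whether the customer has phone service.
MultipleLines : Whether the customer uses multiple lines.
InternetService : The customer's internet service provider.
OnlineSecurity : Whether the customer has online security.
OnlineBackup : Whether the customer has online backup.
DeviceProtection : Whether the customer has device protection.
TechSupport : Whether the customer has tech support.
StreamingTV : Whether the customer has streaming TV.
StreamingMovies : Whether the customer streams movies.
Contract : The customer's contract term.
PaperlessBilling : Whether the customer receives paperless billing (mobile bills).
PaymentMethod : The customer's payment method.
MonthlyCharges : The amount charged to the customer each month.
TotalCharges : The total amount charged to the customer.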

1. Import some packages that I need.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Load the data that I have to analyze.

df = pd.read_csv('Sparta_CodingClub_Telco_Customer.csv')

3. Explore and understand it with visualization.

ex)

- df.info()

- df['key'].nunique()

- df.describe() -> If a numeric column is stored as strings, convert it to a number.

- df['column'] = pd.to_numeric(df['column'])

If that raises an error because of ' ' (blank) values, I should run this first and then convert again.

- df['column'] = df['column'].replace({' ':0})
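A minimal sketch of the blank-value problem on a made-up column (`charges` is hypothetical). It shows the replace-then-convert route from the notes, plus `errors="coerce"` as an alternative that marks blanks as missing rather than treating them as 0. Which one is appropriate depends on what a blank means in the data.

```python
import pandas as pd

# Hypothetical column with one blank entry, for illustration only.
# dtype=object mimics how such a column is typically read from a CSV file.
charges = pd.Series(["29.85", " ", "108.15"], dtype=object)

# Option 1 (as in the notes): replace the blank with 0, then convert.
as_zero = pd.to_numeric(charges.replace({" ": 0}))

# Option 2: let pandas turn unparseable values into NaN instead.
as_nan = pd.to_numeric(charges, errors="coerce")
print(as_zero.tolist(), as_nan.isna().sum())
```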

1) matplotlib

- Comparing statistics (e.g., continuing members vs. members who left)

df.groupby("dependent_variable")["id"].count() OR df.dependent_variable.value_counts()

-> an example of visualizing with a pie plot

df.Churn.value_counts().plot(kind='pie', y='Churn', figsize = (5, 5), autopct='%1.0f%%')

-> ex) histogram

plt.hist(customer_end["tenure"])

2) Clustering

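https://mkjjo.github.io/python/2019/01/10/scaler.html

[Python] Which scaler should you use?

* This post is written as a record of personal research and study. By MK on January 10, 2019. Before modeling data, you must always go through a scaling step. Through scaling, values across multiple dimensions can be compared and ...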

from sklearn.cluster import KMeans 
from sklearn.preprocessing import MinMaxScaler
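A minimal sketch of how these two imports are commonly combined, on made-up numbers (the values and the choice of 2 clusters are assumptions for illustration, not the course's settings): scale each feature to the [0, 1] range so that no single column dominates the distance calculation, then cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on different scales, for illustration only.
data = pd.DataFrame({
    "tenure": [1, 2, 3, 60, 65, 70],
    "MonthlyCharges": [20.0, 25.0, 22.0, 100.0, 110.0, 105.0],
})

# Rescale every column to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(data)

# n_clusters=2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
data["cluster"] = kmeans.fit_predict(scaled)
print(data)
```

After this, something like `data.groupby("cluster").mean()` can be used to describe each group.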

3) Chart

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(ncols=2, nrows=2)
fig.set_size_inches(10, 10)

sns.countplot(x='gender', hue="Churn", data=df, ax=ax1).set_title('Demographic Variables: gender')
sns.countplot(x='SeniorCitizen', hue="Churn", data=df, ax=ax2).set_title('Demographic Variables: SeniorCitizen')
sns.countplot(x='Partner', hue="Churn", data=df, ax=ax3).set_title('Demographic Variables: Partner')
sns.countplot(x='Dependents', hue="Churn", data=df, ax=ax4).set_title('Demographic Variables: Dependents')

plt.tight_layout()

figure, (ax1, ax2) = plt.subplots(1, 2, figsize= (8, 6))
sns.boxplot(x = 'Churn',  y = 'TotalCharges', data = df, ax=ax1)
sns.boxplot(x = 'Churn',  y = 'MonthlyCharges', data = df, ax=ax2)

4. Machine Learning

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X = final_df.drop(['Churn'], axis=1)
y = final_df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
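These notes do not show how `final_df` was built. One possible preparation, sketched here on a made-up frame (`raw`, `prepared`, and the values are hypothetical), is to encode `Churn` as 0/1 and one-hot encode the remaining text columns. The models need numeric inputs, and `f1_score`, `precision_score`, and `recall_score` treat 1 as the positive label by default.

```python
import pandas as pd

# Hypothetical miniature of the Telco data, for illustration only.
raw = pd.DataFrame({
    "customerID": ["0001", "0002", "0003", "0004"],
    "gender": ["Female", "Male", "Male", "Female"],
    "tenure": [1, 34, 2, 45],
    "Churn": ["No", "No", "Yes", "No"],
})

# Drop the ID column and encode the target as 0/1.
prepared = raw.drop(columns=["customerID"])
prepared["Churn"] = prepared["Churn"].map({"No": 0, "Yes": 1})

# One-hot encode the remaining text column.
prepared = pd.get_dummies(prepared, columns=["gender"])
print(prepared)
```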

def get_metrics( model ):
    y_pred = model.predict(X_test)
    y_actual = y_test 
    print()
    print('-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*')
    print()
    print('Accuracy on test data :' , accuracy_score(y_actual, y_pred)*100 , '%' )
    print()
    f1score = f1_score(y_actual,y_pred)
    precision = precision_score(y_actual,y_pred)
    recall = recall_score(y_actual,y_pred)
    score_dict = { 'f1_score':[f1score], 'precision':[precision], 'recall':[recall]}
    score_frame = pd.DataFrame(score_dict)
    print(score_frame)

lr = LogisticRegression(C=10000, penalty='l2')
lr.fit(X_train, y_train) 
print('Accuracy on training data :',lr.score(X_train,y_train)*100,'%')
get_metrics(lr)

tree = DecisionTreeClassifier(max_depth=6, random_state=0)
tree.fit(X_train, y_train)
print('Accuracy on training data :',tree.score(X_train, y_train)*100,'%')
get_metrics(tree)

<Q. What is the meaning of C parameter in sklearn.linear_model.LogisticRegression? How does it affect the decision boundary? Do high values of C make the decision boundary non-linear? How does overfitting look like for logistic regression if we visualize the decision boundary?>

A high value of C tells the model to give high weight to the training data, and a lower weight to the complexity penalty. A low value tells the model to give more weight to this complexity penalty at the expense of fitting to the training data. Basically, a high C means "Trust this training data a lot", while a low value says "This data may not be fully representative of the real world data, so if it's telling you to make a parameter really large, don't listen to it".
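A rough way to see this effect, sketched on synthetic data (the dataset and the two C values are arbitrary choices for illustration): with a small C the penalty keeps the coefficients small, while a large C lets them grow to fit the training data more closely. The decision boundary stays linear in the input features in both cases, because logistic regression is a linear model. C only changes how large the weights are allowed to become.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

norms = {}
for C in (0.01, 10000):
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    # Size of the learned weight vector under this penalty strength.
    norms[C] = float(np.linalg.norm(model.coef_))

print(norms)
```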

Source: https://stackoverflow.com/questions/67513075/what-is-c-parameter-in-sklearn-logistic-regression

