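02. Data Analysis 1: Working with a Fictional Online Shopping Mall Dataset

1. Working with a fictional online shopping mall dataset

Confirm the file has been downloaded

Import pandas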

import pandas as pd

Load the file

retail = pd.read_csv('/content/drive/MyDrive/1. KDT/5. 데이터 분석/데이터/OnlineRetail.csv')

retail

View the dataset summary

retail.info()

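### Columns

* InvoiceNo : invoice (order) number

* StockCode : product code

* Description : product description

* Quantity : quantity ordered

* InvoiceDate : order date

* UnitPrice : price per unit

* CustomerID : customer ID

* Country : customer's country of residence

Check how many null values each column contains

# Count the null values in each column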

retail.isnull().sum()
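The null check can be tried on a small frame first. The sketch below uses a hypothetical `toy` DataFrame with made-up rows (not the real OnlineRetail.csv) and also shows the share of missing values per column, which is often easier to read than raw counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for OnlineRetail.csv (not real data).
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A2', 'A3', 'A4'],
    'Description': ['mug', None, 'lamp', 'bag'],
    'CustomerID': [17850.0, np.nan, np.nan, 13047.0],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
null_counts = toy.isnull().sum()
print(null_counts)

# mean() on the same boolean frame gives the fraction of missing values per column.
null_ratio = toy.isnull().mean()
print(null_ratio)
```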

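Exclude non-members, withdrawn members, and dormant members (rows with no CustomerID)

# Remove non-members, withdrawn members, and dormant members (rows whose CustomerID is null)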

retail = retail[pd.notnull(retail['CustomerID'])]

retail
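As a side note, the same filter can be written with `dropna`. This is a minimal sketch on a hypothetical `toy` frame, assuming the goal is only to drop rows whose CustomerID is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not real data); the second row has no CustomerID.
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A2', 'A3'],
    'CustomerID': [17850.0, np.nan, 13047.0],
})

# Boolean-mask version, as used in this post.
by_mask = toy[pd.notnull(toy['CustomerID'])]

# dropna(subset=...) removes rows with a null in the listed columns only.
by_dropna = toy.dropna(subset=['CustomerID'])

print(by_mask.equals(by_dropna))
```

Both forms keep the original row labels, so the index is no longer contiguous after filtering.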

Check the number of rows remaining after the exclusion

len(retail)

retail.describe()

Check for rows with a purchase quantity of 0 or less

# Check for rows with a purchase quantity of 0 or less

retail[retail['Quantity'] <= 0]

Keep only rows with a purchase quantity of 1 or more

# Keep only rows with a purchase quantity of 1 or more

retail = retail[retail['Quantity'] >= 1]

len(retail)

Check for rows with a unit price of 0 or less

# Check for rows with a unit price of 0 or less

retail[retail['UnitPrice']<=0]

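Keep only rows with a unit price greater than 0

# Keep only rows with a unit price greater than 0
# (.copy() makes the result an independent DataFrame, which avoids the SettingWithCopyWarning pandas may raise when a column is added later)

retail = retail[retail['UnitPrice']>0].copy()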

len(retail)
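The quantity and price filters can also be combined into a single boolean mask. The following is a sketch on a hypothetical `toy` frame; the names `valid` and `clean` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows (not real data): one return (negative quantity) and one free item.
toy = pd.DataFrame({
    'Quantity': [6, -1, 3, 2],
    'UnitPrice': [2.55, 3.39, 0.0, 1.25],
})

# Each comparison yields a boolean Series; & requires both conditions to hold.
valid = (toy['Quantity'] >= 1) & (toy['UnitPrice'] > 0)

# .copy() gives an independent frame that is safe to modify afterwards.
clean = toy[valid].copy()
print(clean)
```

The parentheses around each comparison are required because `&` binds more tightly than `>=` and `>`.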

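Create a derived variable for the amount spent per order line

# Create a derived variable for the amount spent per order line

# Amount spent (CheckoutPrice) = price (UnitPrice) * quantity (Quantity)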

retail['CheckoutPrice'] = retail['UnitPrice'] * retail['Quantity']

retail.head()
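A derived column can also be added with `assign`, which returns a new frame rather than modifying the existing one. This is a small sketch with hypothetical values chosen so the products are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
toy = pd.DataFrame({
    'UnitPrice': [2.5, 1.25, 4.0],
    'Quantity': [6, 8, 2],
})

# Element-wise multiplication: each row's price times that row's quantity.
with_total = toy.assign(CheckoutPrice=toy['UnitPrice'] * toy['Quantity'])
print(with_total)
```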

Convert the date column to a datetime type

retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'])

retail.info()
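When the date strings follow a known layout, passing an explicit `format` makes the parsing intent clear. The sketch below assumes a month/day/year layout purely for illustration; the actual layout in the CSV should be checked before reusing it.

```python
import pandas as pd

# Hypothetical raw strings; the '%m/%d/%Y %H:%M' layout is an assumption, not taken from the file.
raw = pd.Series(['12/01/2010 08:26', '12/09/2011 12:50', 'not a date'])

# errors='coerce' turns unparsable values into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M', errors='coerce')
print(parsed)

# Counting NaT values shows how many strings failed to parse.
print(parsed.isna().sum())
```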

Compute total revenue

# Total revenue

total_revenue = retail['CheckoutPrice'].sum()

total_revenue

Number of purchase records per country

# Number of purchase records (rows) per country

retail['Country'].value_counts()

Revenue by country

# Revenue by country

rev_by_countiries = retail.groupby('Country')['CheckoutPrice'].sum().sort_values()

rev_by_countiries

* Sorted in ascending order

Bar chart of revenue by country

# Bar chart ('bar'), figure size 20 wide by 10 high

plot = rev_by_countiries.plot(kind='bar', figsize=(20, 10))

# x-axis label: 'Country' (font size 12)

plot.set_xlabel('Country', fontsize=12)

# y-axis label: 'Revenue' (font size 12)

plot.set_ylabel('Revenue', fontsize=12)

# Title: 'Revenue By Country' (font size 15)

plot.set_title('Revenue By Country', fontsize=15)

# x-axis tick labels: country names, rotated 45 degrees

plot.set_xticklabels(labels=rev_by_countiries.index, rotation=45)

Express revenue as a share of the total (a fraction; multiply by 100 for percent)

rev_by_countiries / total_revenue
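The group-then-divide pattern can be checked on a few rows. This sketch uses a hypothetical `toy` frame whose totals add up to 100, so the percentages are easy to verify by eye.

```python
import pandas as pd

# Hypothetical toy rows (not real data); total revenue is 100.
toy = pd.DataFrame({
    'Country': ['United Kingdom', 'France', 'United Kingdom', 'Germany'],
    'CheckoutPrice': [60.0, 25.0, 10.0, 5.0],
})

# Sum revenue per country, sorted ascending as in this post.
rev = toy.groupby('Country')['CheckoutPrice'].sum().sort_values()

# Divide by the grand total and scale to percent, rounded to one decimal place.
share_pct = (rev / rev.sum() * 100).round(1)
print(share_pct)
```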

Check the date range (most recent invoices first)

retail['InvoiceDate'].sort_values(ascending=False)

Compute monthly revenue

# Compute monthly revenue

def extract_month(date):  # e.g. 2011-12-09 12:50:00
    month = str(date.month)  # 12
    if date.month < 10:
        month = '0' + month  # e.g. February becomes 02
    return str(date.year) + month  # 201112, 201101

rev_by_month = retail.set_index('InvoiceDate').groupby(extract_month)['CheckoutPrice'].sum()

rev_by_month
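The same year-month key can be produced without a helper function by formatting the datetime column directly. This is an alternative sketch on a hypothetical `toy` frame; `month_key` is a made-up name.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2011-01-04 10:00', '2011-01-20 09:30', '2011-12-09 12:50']),
    'CheckoutPrice': [10.0, 5.0, 20.0],
})

# strftime('%Y%m') yields zero-padded keys such as '201101', matching extract_month's output.
month_key = toy['InvoiceDate'].dt.strftime('%Y%m')

# Grouping by a Series aligned with the frame avoids the set_index step.
rev_by_month = toy.groupby(month_key)['CheckoutPrice'].sum()
print(rev_by_month)
```

Because the keys are zero-padded strings, sorting them alphabetically also sorts them chronologically.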

Draw monthly revenue as a bar chart

# Wrap the bar chart code in a reusable function

def plot_bar(df, xlabel, ylabel, title, titlesize=15, fontsize=12, rotation=45, figsize=(20, 10)):
    plot = df.plot(kind='bar', figsize=figsize)
    plot.set_xlabel(xlabel, fontsize=fontsize)
    plot.set_ylabel(ylabel, fontsize=fontsize)
    plot.set_title(title, fontsize=titlesize)
    plot.set_xticklabels(labels=df.index, rotation=rotation)

plot_bar(rev_by_month, 'Month','Revenue','Revenue By Month')

Compute revenue by day of the week

# Compute revenue by day of the week

def extract_dow(date):
    return date.dayofweek  # 0 = Monday, ..., 6 = Sunday

# rev_by_dow = retail.set_index('InvoiceDate').groupby(extract_dow)['CheckoutPrice'].sum()

# rev_by_dow

rev_by_dow = retail.set_index('InvoiceDate').groupby(lambda date: date.dayofweek)['CheckoutPrice'].sum()

rev_by_dow
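The `.dt` accessor offers another way to get the weekday number, and `day_name()` returns the full English name directly. A sketch on hypothetical dates (5 and 12 December 2011 fall on a Monday, 6 December on a Tuesday):

```python
import pandas as pd

# Hypothetical toy rows (not real data).
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2011-12-05 10:00', '2011-12-06 11:00', '2011-12-12 09:00']),
    'CheckoutPrice': [10.0, 20.0, 5.0],
})

# dt.dayofweek numbers the days 0 (Monday) through 6 (Sunday).
dow = toy['InvoiceDate'].dt.dayofweek
rev_by_dow = toy.groupby(dow)['CheckoutPrice'].sum()
print(rev_by_dow)

# dt.day_name() gives the weekday name for each row.
names = toy['InvoiceDate'].dt.day_name()
print(names.tolist())
```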

Define an array of day names and convert the index to those names

import numpy as np

DAY_OF_WEEK = np.array(['Mon','Tue','Wed','Thur','Fri','Sat','Sun'])

rev_by_dow.index = DAY_OF_WEEK[rev_by_dow.index]

rev_by_dow.index
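If a weekday has no rows at all, it is simply absent from the grouped result. The sketch below assumes Saturday (5) is missing and uses `reindex` to add it back with a zero so the chart shows all seven days; `DAY_NAMES` and the revenue figures are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup array, indexed by dayofweek (0 = Monday).
DAY_NAMES = np.array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Hypothetical revenue keyed by dayofweek; 5 (Saturday) is absent.
rev = pd.Series([10.0, 20.0, 30.0, 40.0, 15.0, 5.0], index=[0, 1, 2, 3, 4, 6])

# reindex adds the missing weekday and fills it with 0.0.
rev_full = rev.reindex(range(7), fill_value=0.0)

# Integer-array indexing into the NumPy array maps each number to its name.
rev_full.index = DAY_NAMES[rev_full.index.to_numpy()]
print(rev_full)
```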

Draw revenue by day of the week as a bar chart

plot_bar(rev_by_dow, 'DOW','Revenue','Revenue By DOW')

Compute revenue by hour of the day

# Compute revenue by hour of the day

rev_by_hour = retail.set_index('InvoiceDate').groupby(lambda date: date.hour)['CheckoutPrice'].sum()

rev_by_hour
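Once revenue is grouped by hour, the first and last active hours and the peak hour can be read off the result. This is a sketch with hypothetical totals; the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly totals keyed by hour of day (not real data).
rev_by_hour = pd.Series({6: 4.0, 9: 30.0, 12: 80.0, 15: 50.0, 20: 6.0})

# The smallest and largest index values are the earliest and latest hours with any sales.
first_hour = rev_by_hour.index.min()
last_hour = rev_by_hour.index.max()

# idxmax() returns the index label (the hour) with the highest revenue.
peak_hour = rev_by_hour.idxmax()
print(first_hour, last_hour, peak_hour)
```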

Draw revenue by hour as a bar chart

plot_bar(rev_by_hour, 'Hour','Revenue','Revenue By Hour')

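Insights from the data

* About 82% of total revenue comes from the UK
* Revenue appears to grow steadily (the December 2011 data only runs through the 9th)
* No sales are recorded on Saturdays, which suggests the shop does not operate that day
* The shop appears to open around 6 a.m. and close around 9 p.m.
* Within a week, revenue rises through Thursday and declines afterward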
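The growth observation above can be checked numerically with a month-over-month change. The sketch below uses hypothetical monthly totals and drops the final, partial month before comparing, since a month that stops on the 9th would otherwise look like a sharp decline.

```python
import pandas as pd

# Hypothetical monthly totals keyed by 'YYYYMM' (not real data); the last month is partial.
rev_by_month = pd.Series(
    [100.0, 110.0, 121.0, 40.0],
    index=['201109', '201110', '201111', '201112'],
)

# Exclude the partial final month so it does not distort the trend.
full_months = rev_by_month.iloc[:-1]

# pct_change() gives (current - previous) / previous; the first value has no predecessor.
growth = full_months.pct_change()
print(growth)
```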

Exercise

# Top 10 products sold (StockCode)

# Ranked by Quantity

'''

StockCode

23843 80995

23166 77916

84077 54415

22197 49183

85099B 46181

85123A 36782

84879 35362

21212 33693

23084 27202

22492 26076

'''

# Sum 'Quantity' for each 'StockCode', sort the totals in descending order, and take the top 10.

top_selling = retail.groupby('StockCode')['Quantity'].sum().sort_values(ascending=False)[:10]

# Display the top 10 products sold.

top_selling
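Sorting and slicing can be replaced with `nlargest`, which selects the top values in one step. A sketch on hypothetical rows, taking the top 2 instead of the top 10 to keep the example small:

```python
import pandas as pd

# Hypothetical toy rows (not real data).
toy = pd.DataFrame({
    'StockCode': ['85123A', '22197', '85123A', '84077', '22197'],
    'Quantity': [6, 10, 4, 3, 1],
})

# Total quantity per product.
qty = toy.groupby('StockCode')['Quantity'].sum()

# nlargest(n) returns the n largest values, already in descending order.
top2 = qty.nlargest(2)
print(top2)
```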

Exercise

# Top 10 customers (CustomerID)

# Ranked by CheckoutPrice

'''

CustomerID

14646.0 280206.02

18102.0 259657.30

17450.0 194550.79

16446.0 168472.50

14911.0 143825.06

12415.0 124914.53

14156.0 117379.63

17511.0 91062.38

16029.0 81024.84

12346.0 77183.60

'''

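# The line below raised an error, likely because [:10] is treated as a label-based slice on the float CustomerID index; head(10) selects by position instead.

# vvip = retail.groupby('CustomerID')['CheckoutPrice'].sum().sort_values(ascending=False)[:10]

vvip = retail.groupby('CustomerID')['CheckoutPrice'].sum().sort_values(ascending=False).head(10)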

vvip
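When the index holds float labels such as CustomerID, position-based selection is the safer choice. The sketch below uses a hypothetical `spend` Series and shows that `iloc` and `head` both select by position regardless of the index type.

```python
import pandas as pd

# Hypothetical spend totals with a float CustomerID index (not real data).
spend = pd.Series(
    [300.0, 120.5, 80.0],
    index=pd.Index([14646.0, 18102.0, 12346.0], name='CustomerID'),
)

ranked = spend.sort_values(ascending=False)

# iloc slices strictly by position, never by label.
top2_iloc = ranked.iloc[:2]

# head(n) is also positional and returns the first n rows.
top2_head = ranked.head(2)
print(top2_iloc.equals(top2_head))
```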
