Let's Put Pandas to Use

Category: Data Analysis | 2022. 3. 2. 15:30

Pandas Basics

1. Creating, Writing, and Reading

import pandas as pd

Creating data

pd.DataFrame({'Column1': [33, 40], 'Column2': [42, 21]})

   Column1  Column2
0       33       42
1       40       21

pd.DataFrame({'Cheolsu': ['male', 183], 'Younghee': ['female', 162]})

  Cheolsu Younghee
0    male   female
1     183      162

pd.DataFrame({'Cheolsu': ['male', 183], 'Younghee': ['female', 162]},
             index=['gender', 'height'])

       Cheolsu Younghee
gender    male   female
height     183      162

Series

pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product, dtype: int64

Reading data from a file

movies = pd.read_csv("movies.csv")
movies.head()

   Unnamed: 0  movieId             title     genres  userId  rating   timestamp
0           0        1  Toy Story (1995)  Adventure       2     3.5  1141415820
1           1        1  Toy Story (1995)  Adventure       3     4.0  1439472215
2           2        1  Toy Story (1995)  Adventure       4     3.0  1573944252
3           3        1  Toy Story (1995)  Adventure       5     4.0   858625949
4           4        1  Toy Story (1995)  Adventure       8     4.0   890492517

movies.shape

(2124594, 7)

movies = pd.read_csv("movies.csv", index_col=0)
movies.head()

   movieId             title     genres  userId  rating   timestamp
0        1  Toy Story (1995)  Adventure       2     3.5  1141415820
1        1  Toy Story (1995)  Adventure       3     4.0  1439472215
2        1  Toy Story (1995)  Adventure       4     3.0  1573944252
3        1  Toy Story (1995)  Adventure       5     4.0   858625949
4        1  Toy Story (1995)  Adventure       8     4.0   890492517

movies.shape

(2124594, 6)
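
The section title mentions writing, but only creation and reading appear above. Here is a minimal sketch of a write/read round trip. The toy frame and the temporary file name are hypothetical stand-ins, not the `movies.csv` used in this post. It also shows where an `Unnamed: 0` column typically comes from: `to_csv` writes the index as an unnamed first column by default.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data (not the movies.csv used in the post)
df = pd.DataFrame({"title": ["A", "B"], "rating": [3.5, 4.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")

    # By default to_csv also writes the index as an unnamed first column
    df.to_csv(path)

    # Reading it back naively turns that column into 'Unnamed: 0'
    naive = pd.read_csv(path)

    # index_col=0 tells pandas to use the first column as the index again
    restored = pd.read_csv(path, index_col=0)

print(naive.columns.tolist())
print(restored.columns.tolist())
```

If the saved index is not needed at all, passing `index=False` to `to_csv` avoids the extra column in the first place.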

2. Indexing, Selecting, and Assigning

movies.title

0          Toy Story (1995)
1          Toy Story (1995)
2          Toy Story (1995)
3          Toy Story (1995)
4          Toy Story (1995)
                 ...       
2124589    City Hall (1996)
2124590    City Hall (1996)
2124591    City Hall (1996)
2124592    City Hall (1996)
2124593    City Hall (1996)
Name: title, Length: 2124594, dtype: object

movies['title']

0          Toy Story (1995)
1          Toy Story (1995)
2          Toy Story (1995)
3          Toy Story (1995)
4          Toy Story (1995)
                 ...       
2124589    City Hall (1996)
2124590    City Hall (1996)
2124591    City Hall (1996)
2124592    City Hall (1996)
2124593    City Hall (1996)
Name: title, Length: 2124594, dtype: object

movies['title'][1]

'Toy Story (1995)'

Selecting by row position (iloc)

movies.iloc[0]

movieId                     1
title        Toy Story (1995)
genres              Adventure
userId                      2
rating                    3.5
timestamp          1141415820
Name: 0, dtype: object

# first argument: rows
# second argument: columns
movies.iloc[:,0]

0            1
1            1
2            1
3            1
4            1
          ... 
2124589    100
2124590    100
2124591    100
2124592    100
2124593    100
Name: movieId, Length: 2124594, dtype: int64

movies.iloc[:3,0]

0    1
1    1
2    1
Name: movieId, dtype: int64

movies.iloc[1:3,0]

1    1
2    1
Name: movieId, dtype: int64

movies.iloc[[0, 1, 2], 0]

0    1
1    1
2    1
Name: movieId, dtype: int64

movies.iloc[-5:]  # the last five rows

Selecting by index label (loc)

movies.loc[1, 'title']

'Toy Story (1995)'

movies.loc[:, ['title', 'genres']]

                    title     genres
0        Toy Story (1995)  Adventure
1        Toy Story (1995)  Adventure
2        Toy Story (1995)  Adventure
3        Toy Story (1995)  Adventure
4        Toy Story (1995)  Adventure
...                   ...        ...
2124589  City Hall (1996)   Thriller
2124590  City Hall (1996)   Thriller
2124591  City Hall (1996)   Thriller
2124592  City Hall (1996)   Thriller
2124593  City Hall (1996)   Thriller
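
One difference between `iloc` and `loc` that is easy to miss is how slices end. The sketch below uses a hypothetical toy frame with string labels so the two behaviours cannot be confused: `iloc` excludes the end position, while `loc` includes the end label.

```python
import pandas as pd

# Hypothetical toy frame with a non-default string index
df = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "rating": [3.5, 4.0, 3.0, 4.5]},
    index=["w", "x", "y", "z"],
)

# iloc is position-based: the end of the slice is excluded
by_position = df.iloc[0:2]

# loc is label-based: the end of the slice is included
by_label = df.loc["w":"y"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```

With the default integer index of `movies`, this means `movies.iloc[0:3]` returns three rows while `movies.loc[0:3]` returns four.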

Manipulating the index

movies.set_index("title")

                  movieId     genres  userId  rating   timestamp
title
Toy Story (1995)        1  Adventure       2     3.5  1141415820
Toy Story (1995)        1  Adventure       3     4.0  1439472215
Toy Story (1995)        1  Adventure       4     3.0  1573944252
Toy Story (1995)        1  Adventure       5     4.0   858625949
Toy Story (1995)        1  Adventure       8     4.0   890492517
...                   ...        ...     ...     ...         ...
City Hall (1996)      100   Thriller  162445     3.0   939556195
City Hall (1996)      100   Thriller  162454     3.0   838259221
City Hall (1996)      100   Thriller  162479     3.0   850136396
City Hall (1996)      100   Thriller  162504     3.0   848591738
City Hall (1996)      100   Thriller  162507     3.0   866722978
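
Note that `set_index` returns a new DataFrame and leaves the original untouched, which is why `movies` still has its integer index in the later examples. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"title": ["A", "B"], "rating": [3.5, 4.0]})

# set_index returns a new frame; df itself is not modified
indexed = df.set_index("title")
print(df.columns.tolist())   # 'title' is still a regular column in df
print(indexed.index.name)

# With title as the index, loc looks rows up by title
rating_b = indexed.loc["B", "rating"]

# reset_index moves the index back into a regular column
restored = indexed.reset_index()
```

To keep the result, assign it back, for example `df = df.set_index("title")`.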

Conditional selection

movies.genres == 'Drama'

0          False
1          False
2          False
3          False
4          False
           ...  
2124589    False
2124590    False
2124591    False
2124592    False
2124593    False
Name: genres, Length: 2124594, dtype: bool

movies.loc[movies.genres == 'Drama']

         movieId                     title  genres  userId  rating   timestamp
385360         4  Waiting to Exhale (1995)   Drama     141     3.0   838711786
385361         4  Waiting to Exhale (1995)   Drama     175     3.0   992403830
385362         4  Waiting to Exhale (1995)   Drama     230     3.0   862580281
385363         4  Waiting to Exhale (1995)   Drama     236     4.0   848680533
385364         4  Waiting to Exhale (1995)   Drama     484     4.0   857579144
...          ...                       ...     ...     ...     ...         ...
2120819      100          City Hall (1996)   Drama  162445     3.0   939556195
2120820      100          City Hall (1996)   Drama  162454     3.0   838259221
2120821      100          City Hall (1996)   Drama  162479     3.0   850136396
2120822      100          City Hall (1996)   Drama  162504     3.0   848591738
2120823      100          City Hall (1996)   Drama  162507     3.0   866722978

movies.loc[(movies.genres == 'Drama') & (movies.rating > 3)]

         movieId                     title  genres  userId  rating   timestamp
385363         4  Waiting to Exhale (1995)   Drama     236     4.0   848680533
385364         4  Waiting to Exhale (1995)   Drama     484     4.0   857579144
385365         4  Waiting to Exhale (1995)   Drama     528     4.0   844766853
385380         4  Waiting to Exhale (1995)   Drama    1906     4.0   836349383
385382         4  Waiting to Exhale (1995)   Drama    1979     4.0   840312031
...          ...                       ...     ...     ...     ...         ...
2120801      100          City Hall (1996)   Drama  161576     4.0   866474998
2120803      100          City Hall (1996)   Drama  161631     4.0   864258224
2120807      100          City Hall (1996)   Drama  161910     4.0   963683808
2120811      100          City Hall (1996)   Drama  162068     4.0   876851667
2120818      100          City Hall (1996)   Drama  162377     4.0   855400509

movies.loc[(movies.genres == 'Drama') | (movies.rating > 3)]

         movieId             title     genres  userId  rating   timestamp
0              1  Toy Story (1995)  Adventure       2     3.5  1141415820
1              1  Toy Story (1995)  Adventure       3     4.0  1439472215
3              1  Toy Story (1995)  Adventure       5     4.0   858625949
4              1  Toy Story (1995)  Adventure       8     4.0   890492517
5              1  Toy Story (1995)  Adventure      10     3.5  1227571347
...          ...               ...        ...     ...     ...         ...
2124571      100  City Hall (1996)   Thriller  161576     4.0   866474998
2124573      100  City Hall (1996)   Thriller  161631     4.0   864258224
2124577      100  City Hall (1996)   Thriller  161910     4.0   963683808
2124581      100  City Hall (1996)   Thriller  162068     4.0   876851667
2124588      100  City Hall (1996)   Thriller  162377     4.0   855400509

movies.loc[movies.genres.isin(['Drama', 'Comedy'])]

         movieId             title  genres  userId  rating   timestamp
171927         1  Toy Story (1995)  Comedy       2     3.5  1141415820
171928         1  Toy Story (1995)  Comedy       3     4.0  1439472215
171929         1  Toy Story (1995)  Comedy       4     3.0  1573944252
171930         1  Toy Story (1995)  Comedy       5     4.0   858625949
171931         1  Toy Story (1995)  Comedy       8     4.0   890492517
...          ...               ...     ...     ...     ...         ...
2120819      100  City Hall (1996)   Drama  162445     3.0   939556195
2120820      100  City Hall (1996)   Drama  162454     3.0   838259221
2120821      100  City Hall (1996)   Drama  162479     3.0   850136396
2120822      100  City Hall (1996)   Drama  162504     3.0   848591738
2120823      100  City Hall (1996)   Drama  162507     3.0   866722978

movies.loc[movies.title.notnull()]

         movieId             title     genres  userId  rating   timestamp
0              1  Toy Story (1995)  Adventure       2     3.5  1141415820
1              1  Toy Story (1995)  Adventure       3     4.0  1439472215
2              1  Toy Story (1995)  Adventure       4     3.0  1573944252
3              1  Toy Story (1995)  Adventure       5     4.0   858625949
4              1  Toy Story (1995)  Adventure       8     4.0   890492517
...          ...               ...        ...     ...     ...         ...
2124589      100  City Hall (1996)   Thriller  162445     3.0   939556195
2124590      100  City Hall (1996)   Thriller  162454     3.0   838259221
2124591      100  City Hall (1996)   Thriller  162479     3.0   850136396
2124592      100  City Hall (1996)   Thriller  162504     3.0   848591738
2124593      100  City Hall (1996)   Thriller  162507     3.0   866722978
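
The conditions above are ordinary boolean Series, so they can be stored in variables and combined. The sketch below, on a hypothetical four-row frame, names each mask and also shows negation with `~`. The parentheses in the examples above are required because `&` and `|` bind more tightly than `==` and `>` in Python.

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({
    "genres": ["Drama", "Comedy", "Drama", "Action"],
    "rating": [4.0, 3.0, 2.5, 5.0],
})

# Each comparison yields a boolean Series aligned with df's index
is_drama = df.genres == "Drama"
is_high = df.rating > 3

both = df.loc[is_drama & is_high]         # AND: Drama with rating > 3
either = df.loc[is_drama | is_high]       # OR: Drama, or rating > 3
neither = df.loc[~(is_drama | is_high)]   # NOT: everything else

print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```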

Assigning data

movies['country'] = 'korea'
movies['country']

0          korea
1          korea
2          korea
3          korea
4          korea
           ...  
2124589    korea
2124590    korea
2124591    korea
2124592    korea
2124593    korea
Name: country, Length: 2124594, dtype: object

movies['index_backwards'] = range(len(movies), 0, -1)
movies['index_backwards']

0          2124594
1          2124593
2          2124592
3          2124591
4          2124590
            ...   
2124589          5
2124590          4
2124591          3
2124592          2
2124593          1
Name: index_backwards, Length: 2124594, dtype: int64
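
Assignment can also be combined with the conditional selection shown earlier to update only some rows. A minimal sketch, assuming a hypothetical `label` column that is not part of the original data:

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"rating": [4.5, 2.0, 3.5]})

# A scalar is broadcast to every row
df["label"] = "low"

# loc with a boolean mask and a column name overwrites only matching rows
df.loc[df.rating >= 3.5, "label"] = "high"

print(df)
```

Using a single `df.loc[mask, column] = value` call, rather than chaining `df[mask][column] = value`, ensures the assignment is applied to the original frame.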

3. Summary Functions and Maps

Summary functions

movies.title.describe()

count              2124594
unique                  99
top       Toy Story (1995)
freq                286545
Name: title, dtype: object

movies.genres.describe()

count      2124594
unique          17
top       Thriller
freq        311176
Name: genres, dtype: object

movies.rating.mean()

3.6029992083193307

movies.genres.unique()

array(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Mystery', 'Sci-Fi', 'IMAX', 'Documentary', 'War', 'Musical'],
      dtype=object)

movies.genres.value_counts()

Thriller       311176
Drama          298704
Comedy         268791
Mystery        170739
Romance        170185
Crime          169761
Adventure      165345
Children       140118
Action         129869
Fantasy        106496
Animation       72485
Sci-Fi          68259
Horror          31146
Musical         13461
War              6965
Documentary       954
IMAX              140
Name: genres, dtype: int64
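
The section title also mentions maps, which the examples above do not cover. As a rough sketch on a hypothetical toy frame: `Series.map` transforms one value at a time, and `DataFrame.apply` with `axis='columns'` calls a function once per row.

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"title": ["A", "B", "C"], "rating": [3.0, 4.0, 5.0]})

mean_rating = df.rating.mean()

# map: apply a function to every value of a Series
centered = df.rating.map(lambda r: r - mean_rating)

# The same result with plain arithmetic, which is usually faster
centered_vec = df.rating - mean_rating

# apply with axis='columns': the function receives each row as a Series
def describe_row(row):
    return f"{row['title']}: {row['rating']}"

labels = df.apply(describe_row, axis="columns")

print(centered.tolist())
print(labels.tolist())
```

Both `map` and `apply` return new objects and do not modify `df`.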

4. Grouping and Sorting

Grouping

movies.groupby('genres').rating.count()

genres
Action         129869
Adventure      165345
Animation       72485
Children       140118
Comedy         268791
Crime          169761
Documentary       954
Drama          298704
Fantasy        106496
Horror          31146
IMAX              140
Musical         13461
Mystery        170739
Romance        170185
Sci-Fi          68259
Thriller       311176
War              6965
Name: rating, dtype: int64

movies.groupby('genres').rating.min()

genres
Action         0.5
Adventure      0.5
Animation      0.5
Children       0.5
Comedy         0.5
Crime          0.5
Documentary    0.5
Drama          0.5
Fantasy        0.5
Horror         0.5
IMAX           0.5
Musical        0.5
Mystery        0.5
Romance        0.5
Sci-Fi         0.5
Thriller       0.5
War            0.5
Name: rating, dtype: float64

movies.groupby('genres').apply(lambda df: df.title.iloc[0])

genres
Action                                Heat (1995)
Adventure                        Toy Story (1995)
Animation                        Toy Story (1995)
Children                         Toy Story (1995)
Comedy                           Toy Story (1995)
Crime                                 Heat (1995)
Documentary         Across the Sea of Time (1995)
Drama                    Waiting to Exhale (1995)
Fantasy                          Toy Story (1995)
Horror         Dracula: Dead and Loving It (1995)
IMAX                      Wings of Courage (1995)
Musical                         Pocahontas (1995)
Mystery                            Copycat (1995)
Romance                   Grumpier Old Men (1995)
Sci-Fi                              Powder (1995)
Thriller                              Heat (1995)
War                            Richard III (1995)
dtype: object
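
Instead of calling `count()` and `min()` separately, several summaries can be computed in one pass with `agg`. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({
    "genres": ["Drama", "Drama", "Comedy", "Comedy", "Comedy"],
    "rating": [3.0, 5.0, 2.0, 4.0, 3.0],
})

# agg with a list of function names returns one column per function
summary = df.groupby("genres").rating.agg(["count", "min", "max", "mean"])

print(summary)
```

The result is a DataFrame indexed by genre, so a single value can be read with, for example, `summary.loc["Drama", "mean"]`.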

Multi-index

title_genres = movies.groupby(['title', 'genres']).userId.agg([len])
title_genres

                                                     len
title                                 genres
Ace Ventura: When Nature Calls (1995) Comedy       21552
Across the Sea of Time (1995)         Documentary     75
                                      IMAX            75
American President, The (1995)        Comedy       17042
                                      Drama        17042
...                                                  ...
White Squall (1996)                   Adventure     3921
                                      Drama         3921
Wings of Courage (1995)               Adventure       65
                                      IMAX            65
                                      Romance         65

mi = title_genres.index
type(mi)

pandas.core.indexes.multi.MultiIndex

title_genres.reset_index()

                                     title       genres    len
0    Ace Ventura: When Nature Calls (1995)       Comedy  21552
1            Across the Sea of Time (1995)  Documentary     75
2            Across the Sea of Time (1995)         IMAX     75
3           American President, The (1995)       Comedy  17042
4           American President, The (1995)        Drama  17042
...                                    ...          ...    ...
221                    White Squall (1996)    Adventure   3921
222                    White Squall (1996)        Drama   3921
223                Wings of Courage (1995)    Adventure     65
224                Wings of Courage (1995)         IMAX     65
225                Wings of Courage (1995)      Romance     65
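
Rows of a multi-indexed frame can be selected with `loc` before flattening it with `reset_index`. The sketch below uses hypothetical toy data and the string `"count"` in place of the `len` function used above. A tuple supplies one label per index level, and a single label selects everything under the outer level.

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "genres": ["Comedy", "Comedy", "Drama", "Drama"],
    "userId": [1, 2, 3, 4],
})

counts = df.groupby(["title", "genres"]).userId.agg(["count"])

# A tuple picks one label per level: (title, genres)
a_comedy = counts.loc[("A", "Comedy"), "count"]

# A single label selects all rows under that outer-level value
a_rows = counts.loc["A"]

print(counts.index.nlevels)
print(a_comedy)
print(a_rows.index.tolist())
```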

Sorting

title_genres = title_genres.reset_index()
title_genres.sort_values(by='len')

                       title     genres    len
94     Guardian Angel (1994)      Drama     28
95     Guardian Angel (1994)   Thriller     28
93     Guardian Angel (1994)     Action     28
225  Wings of Courage (1995)    Romance     65
224  Wings of Courage (1995)       IMAX     65
...                      ...        ...    ...
200         Toy Story (1995)    Fantasy  57309
199         Toy Story (1995)     Comedy  57309
198         Toy Story (1995)   Children  57309
196         Toy Story (1995)  Adventure  57309
197         Toy Story (1995)  Animation  57309

title_genres.sort_values(by='len', ascending=False)

                              title     genres    len
199                Toy Story (1995)     Comedy  57309
200                Toy Story (1995)    Fantasy  57309
198                Toy Story (1995)   Children  57309
197                Toy Story (1995)  Animation  57309
196                Toy Story (1995)  Adventure  57309
...                             ...        ...    ...
119  Kids of the Round Table (1995)  Adventure     65
225         Wings of Courage (1995)    Romance     65
94            Guardian Angel (1994)      Drama     28
93            Guardian Angel (1994)     Action     28
95            Guardian Angel (1994)   Thriller     28

title_genres.sort_index()

                                     title       genres    len
0    Ace Ventura: When Nature Calls (1995)       Comedy  21552
1            Across the Sea of Time (1995)  Documentary     75
2            Across the Sea of Time (1995)         IMAX     75
3           American President, The (1995)       Comedy  17042
4           American President, The (1995)        Drama  17042
...                                    ...          ...    ...
221                    White Squall (1996)    Adventure   3921
222                    White Squall (1996)        Drama   3921
223                Wings of Courage (1995)    Adventure     65
224                Wings of Courage (1995)         IMAX     65
225                Wings of Courage (1995)      Romance     65

title_genres.sort_values(by=['genres', 'len'])

                                         title    genres    len
93                       Guardian Angel (1994)    Action     28
184                            Shopping (1994)    Action     83
52                  Crossing Guard, The (1995)    Action   1129
74                            Fair Game (1995)    Action   1202
127  Lawnmower Man 2: Beyond Cyberspace (1996)    Action   2215
...                                        ...       ...    ...
203  Twelve Monkeys (a.k.a. 12 Monkeys) (1995)  Thriller  47054
181                Seven (a.k.a. Se7en) (1995)  Thriller  50596
209                 Usual Suspects, The (1995)  Thriller  55366
139                     Misérables, Les (1995)       War   2699
172                         Richard III (1995)       War   4266
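
When sorting by several columns, each column can have its own direction by passing a list to `ascending`. A small sketch on hypothetical data, sorting genres A to Z and, within each genre, the largest `len` first:

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({
    "genres": ["Drama", "Action", "Drama", "Action"],
    "len": [10, 5, 30, 20],
})

# One ascending flag per sort key: genres ascending, len descending
ordered = df.sort_values(by=["genres", "len"], ascending=[True, False])

print(ordered)
```

The original index labels travel with their rows. Chain `.reset_index(drop=True)` if a fresh 0..n-1 index is wanted.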

5. Data Types and Values

movies.rating.dtype

dtype('float64')

movies.dtypes

movieId              int64
title               object
genres              object
userId               int64
rating             float64
timestamp            int64
country             object
index_backwards      int64
dtype: object

movies.index_backwards.astype('float64')

0          2124594.0
1          2124593.0
2          2124592.0
3          2124591.0
4          2124590.0
             ...    
2124589          5.0
2124590          4.0
2124591          3.0
2124592          2.0
2124593          1.0
Name: index_backwards, Length: 2124594, dtype: float64

movies.index_backwards.dtype

dtype('int64')
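
As the two outputs above suggest, `astype` returns a converted copy and does not change the column in place, so the dtype is still `int64` afterwards. A minimal sketch on a hypothetical one-column frame, showing how to keep the conversion:

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"n": [3, 2, 1]})

# astype returns a new Series; df is unchanged
as_float = df.n.astype("float64")
before = df.n.dtype
print(before)

# Assign the result back to make the change stick
df["n"] = df.n.astype("float64")
after = df.n.dtype
print(after)
```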

# rows whose title is missing
movies[pd.isnull(movies.title)]

# fill missing titles with a placeholder value
movies.title.fillna("Unknown")

0          Toy Story (1995)
1          Toy Story (1995)
2          Toy Story (1995)
3          Toy Story (1995)
4          Toy Story (1995)
                 ...       
2124589    City Hall (1996)
2124590    City Hall (1996)
2124591    City Hall (1996)
2124592    City Hall (1996)
2124593    City Hall (1996)
Name: title, Length: 2124594, dtype: object

movies.title.replace("Toy", "TEST")
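
One caveat about the call above: `Series.replace` with plain strings matches whole values only, so a title such as "Toy Story (1995)" is left unchanged by `replace("Toy", "TEST")`. To replace part of a string, `Series.str.replace` is the usual tool. A sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical toy Series with one missing value
titles = pd.Series(["Toy Story (1995)", "Toy", None])

# replace: only values that are exactly "Toy" are changed
whole = titles.replace("Toy", "TEST")

# str.replace: substrings inside each value are changed
partial = titles.str.replace("Toy", "TEST", regex=False)

# fillna: missing values are replaced by a placeholder
filled = titles.fillna("Unknown")

print(whole.tolist())
print(partial.tolist())
print(filled.tolist())
```

All three calls return new Series, so `titles` itself stays as it was.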

6. Renaming and Combining

Renaming

movies.rename(columns={'rating': 'score'})

         movieId             title     genres  userId   score   timestamp  country  index_backwards
0              1  Toy Story (1995)  Adventure       2     3.5  1141415820    korea          2124594
1              1  Toy Story (1995)  Adventure       3     4.0  1439472215    korea          2124593
2              1  Toy Story (1995)  Adventure       4     3.0  1573944252    korea          2124592
3              1  Toy Story (1995)  Adventure       5     4.0   858625949    korea          2124591
4              1  Toy Story (1995)  Adventure       8     4.0   890492517    korea          2124590
...          ...               ...        ...     ...     ...         ...      ...              ...
2124589      100  City Hall (1996)   Thriller  162445     3.0   939556195    korea                5
2124590      100  City Hall (1996)   Thriller  162454     3.0   838259221    korea                4
2124591      100  City Hall (1996)   Thriller  162479     3.0   850136396    korea                3
2124592      100  City Hall (1996)   Thriller  162504     3.0   848591738    korea                2
2124593      100  City Hall (1996)   Thriller  162507     3.0   866722978    korea                1

movies.rename(index={0: 'firstEntry', 1: 'secondEntry'})

             movieId             title     genres  userId  rating   timestamp  country  index_backwards
firstEntry         1  Toy Story (1995)  Adventure       2     3.5  1141415820    korea          2124594
secondEntry        1  Toy Story (1995)  Adventure       3     4.0  1439472215    korea          2124593
2                  1  Toy Story (1995)  Adventure       4     3.0  1573944252    korea          2124592
3                  1  Toy Story (1995)  Adventure       5     4.0   858625949    korea          2124591
4                  1  Toy Story (1995)  Adventure       8     4.0   890492517    korea          2124590
...              ...               ...        ...     ...     ...         ...      ...              ...
2124589          100  City Hall (1996)   Thriller  162445     3.0   939556195    korea                5
2124590          100  City Hall (1996)   Thriller  162454     3.0   838259221    korea                4
2124591          100  City Hall (1996)   Thriller  162479     3.0   850136396    korea                3
2124592          100  City Hall (1996)   Thriller  162504     3.0   848591738    korea                2
2124593          100  City Hall (1996)   Thriller  162507     3.0   866722978    korea                1

movies.rename_axis("country", axis='rows').rename_axis("fields", axis='columns')

fields   movieId             title     genres  userId  rating   timestamp  country  index_backwards
country
0              1  Toy Story (1995)  Adventure       2     3.5  1141415820    korea          2124594
1              1  Toy Story (1995)  Adventure       3     4.0  1439472215    korea          2124593
2              1  Toy Story (1995)  Adventure       4     3.0  1573944252    korea          2124592
3              1  Toy Story (1995)  Adventure       5     4.0   858625949    korea          2124591
4              1  Toy Story (1995)  Adventure       8     4.0   890492517    korea          2124590
...          ...               ...        ...     ...     ...         ...      ...              ...
2124589      100  City Hall (1996)   Thriller  162445     3.0   939556195    korea                5
2124590      100  City Hall (1996)   Thriller  162454     3.0   838259221    korea                4
2124591      100  City Hall (1996)   Thriller  162479     3.0   850136396    korea                3
2124592      100  City Hall (1996)   Thriller  162504     3.0   848591738    korea                2
2124593      100  City Hall (1996)   Thriller  162507     3.0   866722978    korea                1
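
The section title also mentions combining, which the examples above do not show. As a rough sketch with hypothetical toy frames (the original post does not specify which frames it would combine): `merge` joins two frames on a shared key column, and `pd.concat` stacks frames that have the same columns.

```python
import pandas as pd

# Hypothetical toy frames
ratings = pd.DataFrame({"movieId": [1, 1, 2], "rating": [3.5, 4.0, 3.0]})
titles = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})

# merge: attach the title to every rating row via the shared movieId key
merged = ratings.merge(titles, on="movieId", how="left")

# concat: stack frames with the same columns on top of each other
more = pd.DataFrame({"movieId": [3], "rating": [5.0]})
stacked = pd.concat([ratings, more], ignore_index=True)

print(merged)
print(stacked)
```

`how="left"` keeps every row of `ratings` even when no matching title exists, and `ignore_index=True` renumbers the stacked rows from 0.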