Jaegool_'s log

Data analysis 1st week [SpartaCoding] <Kaggle, Colab, BeautifulSoup4>


Jaegool 2022. 6. 1. 17:40

https://teamsparta.notion.site/1-24d64fcddf1b4754bf48344858aadcff

 

[Sparta Coding Club] Comprehensive Data Analysis Course - Week 1

A PDF file is posted at the start of each week's lecture materials!

teamsparta.notion.site

https://www.kaggle.com/dipam7/student-grade-prediction

 

Student Grade Prediction

Predict the final grade of Portuguese high school students

www.kaggle.com

 

<Pandas and DataFrame>

import pandas as pd

If you use PyCharm, install the 'pandas' package before writing any code that imports it.

 

# Columns: 'code', '과목' (subject), '수강생' (number of students), '선생님' (teacher)
items = {'code' : [101, 102, 103, 104, 105, 106, 107, 108],
         '과목': ['수학', '영어', '국어', '체육', '미술', '사회', '도덕', '과학'],
         '수강생':[15, 15, 10, 50, 20, 50, 70, 10],
         '선생님': ['김민수','김현정','강수정', '이나리', '도민성', '강수진', '김진성', '오상배']}
df = pd.DataFrame(items)

This example uses pandas to build a DataFrame from a dictionary, where each key becomes a column.
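To check what the dictionary-to-DataFrame step produces, here is a minimal sketch. The English column names and the three sample rows are hypothetical stand-ins for the course data, not part of the original lecture.

```python
import pandas as pd

# Hypothetical, shortened version of the course table with English column names
items = {'code': [101, 102, 103],
         'subject': ['math', 'english', 'art'],
         'students': [15, 15, 10]}
df = pd.DataFrame(items)

# shape is (number of rows, number of columns)
n_rows, n_cols = df.shape

# Each dictionary key became a column name, in insertion order
column_names = list(df.columns)

# A single column can be selected by name and aggregated
total_students = int(df['students'].sum())

print(n_rows, n_cols, column_names, total_students)
```

Each list in the dictionary must have the same length, otherwise pandas raises an error when building the DataFrame.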

 

items2 = {'code' : [109, 110],
         '과목': ['컴퓨터', '한자'],
         '수강생':[10, 12],
         '선생님': ['이철민', '김영우']}
df2 = pd.DataFrame(items2)


total_df = pd.concat([df, df2]) #concatenate

We can concatenate two DataFrames with the 'pd.concat([])' function.

Do not forget to pass the DataFrames as a list!
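One detail worth knowing: by default, concatenation keeps each DataFrame's original row labels, so the result can contain repeated index values. The sketch below uses two small hypothetical tables to show the difference that the 'ignore_index=True' option makes.

```python
import pandas as pd

# Two small hypothetical tables with the same columns
df_a = pd.DataFrame({'code': [101, 102], 'students': [15, 10]})
df_b = pd.DataFrame({'code': [109], 'students': [12]})

# Default behaviour: the original row labels are kept, so 0 appears twice
kept = pd.concat([df_a, df_b])

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Renumbering is usually the safer choice when the row labels carry no meaning of their own.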

 

<some basic functions>

df.head()

df.sample()

df.tail()

df.to_csv('dataName.csv', index=False)

The last one saves the DataFrame to a CSV file. With index=False, the row index is not written to the file.

new_df = pd.read_table('dataName.csv', sep=',') # sep: the separator between values

- This is how to read the saved DataFrame back in.
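The functions in this section can be tried together in a short sketch. It assumes a small hypothetical table and writes to an in-memory buffer instead of a real file so that it leaves nothing on disk. It also uses 'pd.read_csv', which uses a comma as its default separator.

```python
import io

import pandas as pd

# Small hypothetical table
df = pd.DataFrame({'code': [101, 102, 103], 'students': [15, 15, 10]})

first_two = df.head(2)      # first 2 rows (head() alone returns 5)
last_one = df.tail(1)       # last row
random_row = df.sample(1)   # 1 row picked at random

# Save to CSV without the row index, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing a file name such as 'dataName.csv' instead of the buffer works the same way, but creates the file in the working directory.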

 

<install beautifulsoup4>

!pip install beautifulsoup4
 
# import 'BeautifulSoup' from the package 'bs4'

 

from bs4 import BeautifulSoup
# Store the HTML document in the string html
html = '''
<html> 
    <head> 
    </head> 
    <body> 
        <h1> 장바구니
            <p id='clothes' class='name' title='라운드티'> 라운드티
                <span class = 'number'> 25 </span> 
                <span class = 'price'> 29000 </span> 
                <span class = 'menu'> 의류</span> 
                <a href='http://www.naver.com'> 바로가기 </a> 
            </p> 
            <p id='watch' class='name' title='시계'> 시계
                <span class = 'number'> 28 </span>
                <span class = 'price'> 32000 </span> 
                <span class = 'menu'> 액세서리 </span> 
                <a href='http://www.facebook.com'> 바로가기 </a> 
            </p> 
        </h1> 
    </body> 
</html>
'''

# Create a BeautifulSoup instance. The second argument is the type of parser to use.
soup = BeautifulSoup(html, 'html.parser')

- soup.select('tagName')

- soup.select('.className')

- soup.select('#idName')

 

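- soup.select('ancestorTag descendantTag') : descendant relationship (every tag nested anywhere inside a tag is called its descendant)

- soup.select('parentTag > childTag') : child relationship (a tag exactly one level below another tag is called its child)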
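The selector forms listed above can be compared side by side in a minimal sketch. The HTML here is a hypothetical, shortened cart with English text, chosen so that the descendant and child selectors give different results.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened cart markup
html = """
<div id='cart'>
  <p id='clothes' class='name'>T-shirt
    <span class='price'>29000</span>
  </p>
  <p id='watch' class='name'>Watch
    <span class='price'>32000</span>
  </p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('p')            # every <p> tag
by_class = soup.select('.price')     # every tag with class 'price'
by_id = soup.select('#watch')        # the tag with id 'watch'

# Descendant: spans anywhere inside the div, at any depth
descendants = soup.select('div span')
# Child: spans directly under the div (none here, they sit inside <p>)
children = soup.select('div > span')

prices = [int(tag.text) for tag in by_class]
print(len(by_tag), prices, by_id[0]['id'], len(descendants), len(children))
```

select() always returns a list, even for an id selector, so a single match still has to be taken out with an index or with select_one().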

 

 

 

<naver news crawling practice file>

[스파르타]_데이터분석_종합반_1주차_13강_네이버_뉴스_크롤링_실습의_사본.ipynb
0.29MB

 

< HW >

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=D&ymd=20211103&hh=13&rtm=N&pg=1',headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

songs = soup.select('#body-content > div.newest-list > div > table > tbody > tr')
for song in songs:
  # The cell text starts with the rank; keep the first two characters and trim whitespace
  rank = song.select_one('.number').text[:2].strip()
  title = song.select_one('a.title.ellipsis').text.strip()
  # Drop the "19금" (adults-only) label that precedes the title of explicit songs
  if title.startswith("19금"):
    title = title[len("19금"):].strip()
  singer = song.select_one('a.artist.ellipsis').text.strip()
  print(rank, title, singer)

This prints the rank, title, and singer of each song on the Genie chart, scraped with BeautifulSoup.
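The parsing logic can be tried without a network request. The sketch below runs the same kind of extraction on a hypothetical HTML snippet that only imitates the class names used in the homework; the real Genie page may be structured differently, and the helper name 'parse_chart' is made up for this example. Here the "19금" label is removed by slicing off the prefix, which leaves a title such as "1999" untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the class names used in the homework
sample = """
<table><tbody>
  <tr>
    <td class="number">1
      <span class="rank">up</span>
    </td>
    <td><a class="title ellipsis">
      19금
      Song A</a>
      <a class="artist ellipsis">Singer A</a></td>
  </tr>
  <tr>
    <td class="number">2
      <span class="rank">down</span>
    </td>
    <td><a class="title ellipsis">1999</a>
      <a class="artist ellipsis">Singer B</a></td>
  </tr>
</tbody></table>
"""

def parse_chart(html):
    """Return a list of (rank, title, singer) tuples, one per table row."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for song in soup.select('tr'):
        # The rank is at the start of the cell text (two characters at most here)
        rank = song.select_one('.number').text[:2].strip()
        title = song.select_one('a.title.ellipsis').text.strip()
        # Remove the adults-only label only when it is a prefix of the title
        if title.startswith('19금'):
            title = title[len('19금'):].strip()
        singer = song.select_one('a.artist.ellipsis').text.strip()
        rows.append((rank, title, singer))
    return rows

rows = parse_chart(sample)
for row in rows:
    print(*row)
```

Returning a list of tuples instead of printing inside the loop makes it easy to pass the result to pd.DataFrame later.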