[Python] Fetching Naver Movie Reviews & Classifying the Data

부단 2020. 2. 19. 21:48

In this post, we will crawl the 10 most recent Naver Movie reviews, split the data into fields, and print the result.

- The code is as follows.

# Fetch the 10 most recent Naver Movie reviews and print title, review, and rating.
import requests
from bs4 import BeautifulSoup

# Result lists, filled by the cleaning loop below (one entry per review).
data = []          # raw text of each <td class="title"> cell
star_point_F = []  # ratings
title_f = []       # movie titles
content_f = []     # review texts
MAX_REVIEWS = 10   # how many of the most recent reviews to print

# URL of the Naver Movie rating list page.
# The browser-like User-Agent header keeps the request from being
# identified as a bot and blocked.
url = 'https://movie.naver.com/movie/point/af/list.nhn'
hdr = {'Accept-Language': 'ko-KR,en;q=0.8', 'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36')}
req = requests.get(url, headers=hdr, timeout=10)
req.raise_for_status()  # stop here on an HTTP error status
soup = BeautifulSoup(req.text, 'html.parser')
for i in soup.select('td[class=title]'):
    data.append(i.text)

for a in range(min(MAX_REVIEWS, len(data))):  # do not index past the cells actually found
    data[a] = data[a].replace("\t", "")
    data[a] = data[a].replace("\n", "")
    star_point = data[a]
    star_point = star_point.replace("신고", "")  # drops every "신고" (report) occurrence
    star_point = star_point.split("별점 - 총", 1)  # -> [title, " 10점 중<rating><review>"]
    title = str(star_point[0])
    star_point_L = star_point[1]

    star_point_S = star_point_L[:8]  # " 10점 중" plus up to two rating characters
    if star_point_S[-1] != "0":  # single-digit rating: the 8th character belongs to the review
        star_point_S = star_point_S[:-1]
        content = star_point_L[7:]
    else:  # rating is 10
        content = star_point_L[8:]
    star_point_S = star_point_S.replace("10점 중", "").strip()
    content_f.append(content)
    star_point_F.append(star_point_S)
    title_f.append(title)
print("=======================================================================================================")
print("   ||                                 Movie Review Summary                                       ||    ")
print("=======================================================================================================")
for b in range(len(title_f)):
    print("Movie title : " + title_f[b] + "\nReview : " + content_f[b] + "\nRating : " + star_point_F[b])
    print("---------------------------------------------------------------------------------------------------")
    
 
Code walkthrough:

Line 15 - `url = 'https://movie.naver.com/movie/point/af/list.nhn'` is, as the name suggests, the URL of the page that lists Naver Movie ratings. Splitting the data on this page into fields is the goal of this post.

Line 16 - `hdr` (the request headers) carries information about the client making the request. Here the `User-Agent` header is set so that the request is not identified as a bot and blocked.
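
To see what the header changes, the sketch below builds the request without sending it and compares the `User-Agent` it would carry with the default one that `requests` uses. Nothing goes over the network; the `url` and `hdr` values mirror the ones in the code above.

```python
import requests

url = 'https://movie.naver.com/movie/point/af/list.nhn'
hdr = {
    'Accept-Language': 'ko-KR,en;q=0.8',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/80.0.3987.116 Safari/537.36'),
}

# Build the request locally; it is prepared but never sent.
prepared = requests.Request('GET', url, headers=hdr).prepare()

# Without a custom header, requests announces itself as "python-requests/<version>".
default_ua = requests.utils.default_user_agent()

print("default:", default_ua)
print("custom :", prepared.headers['User-Agent'])
```

A server that filters on the `User-Agent` string can tell the default value apart from a browser, which is why the post overrides it.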

Lines 17-19 - Request the page and parse the returned HTML. (You can inspect the same HTML on the site by pressing F12.)

Lines 20-21 - Select only the elements we want, `<td class="title">`, and store their text in the `data` list.
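
As a small offline illustration of the selector, the sketch below runs `select('td[class=title]')` against a hand-written HTML fragment. The markup is a made-up stand-in, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; the real page's markup may differ.
sample_html = """
<table>
  <tr><td class="num">1</td><td class="title">Movie A review text</td></tr>
  <tr><td class="num">2</td><td class="title">Movie B review text</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# td[class=title] picks the <td> cells whose class attribute is "title"
# and skips the class="num" cells.
data = [td.text for td in soup.select('td[class=title]')]
print(data)
```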

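Lines 23-37 - Clean the fetched data (remove tabs, newlines, and the "신고" report-link text) and split each entry into title, rating, and review text.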

Lines 38-41 - Strip the "10점 중" prefix from the rating, then append each piece to its own list. For example, the movie title goes into `title_f`.
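
The cleaning and splitting steps can also be sketched as one function that handles a single cell's text. This is an alternative that uses a regular expression, not the post's slicing code. `parse_review` and the sample string are hypothetical, and the pattern assumes the text has the form `<title>별점 - 총 10점 중<rating><review>신고`, with "신고" at the very end.

```python
import re

# Assumed layout: <title>별점 - 총 10점 중<rating 1-10><review text>
PATTERN = re.compile(
    r"^(?P<title>.*?)별점 - 총\s*10점 중\s*(?P<rating>10|[1-9])(?P<content>.*)$"
)

def parse_review(raw):
    """Split one cell's text into (title, rating, content), or return None."""
    text = raw.replace("\t", "").replace("\n", "")
    match = PATTERN.match(text)
    if match is None:
        return None
    content = match.group("content").strip()
    # Drop only a trailing "신고" (report) label, not occurrences inside the review.
    if content.endswith("신고"):
        content = content[:-len("신고")].strip()
    return match.group("title").strip(), int(match.group("rating")), content

# Hypothetical sample input, shaped like the text the post's code expects.
sample = "\n\t기생충\n별점 - 총 10점 중9\n\t다시 봐도 좋은 영화 \n신고\n"
print(parse_review(sample))
```

Like the slicing version, this cannot tell a rating of 1 followed by a review that starts with "0" from a rating of 10, because the rating and the review are already fused in the flattened text.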

Lines 45-47 - Loop over the filled lists and print the collected reviews (the 10 most recent) in the format given in the print statement.
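
The output step can also be sketched with `zip`, which walks the three lists together and stops at the shortest one, so it still works when fewer than 10 reviews were collected. The list contents below are placeholder values, and `format_review` is a hypothetical helper.

```python
# Placeholder data standing in for title_f, content_f and star_point_F.
title_f = ["Movie A", "Movie B"]
content_f = ["Great.", "Not bad."]
star_point_F = ["10", "7"]

def format_review(title, content, rating):
    """Return one review as a three-line block: title, review, rating."""
    return f"Movie title : {title}\nReview : {content}\nRating : {rating}"

blocks = [
    format_review(t, c, s)
    for t, c, s in zip(title_f, content_f, star_point_F)
]
for block in blocks:
    print(block)
    print("-" * 60)  # separator between reviews
```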

※ If you look up the site's robots.txt, it contains

User-agent: *
Disallow: /search

so I recommend using this post for study and reference purposes only! ※
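
For reference, rules like the ones quoted above can be checked with the standard library's `urllib.robotparser`. This sketch parses only the two quoted lines (the live robots.txt may contain more rules or may have changed) and asks whether two paths are allowed under them.

```python
from urllib.robotparser import RobotFileParser

# Only the two lines quoted in this post; not the site's full, current robots.txt.
rules = [
    "User-agent: *",
    "Disallow: /search",
]

parser = RobotFileParser()
parser.parse(rules)  # parse the rules from memory instead of downloading them

search_allowed = parser.can_fetch("*", "https://movie.naver.com/search")
list_allowed = parser.can_fetch("*", "https://movie.naver.com/movie/point/af/list.nhn")
print(search_allowed, list_allowed)
```

To check against the live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`.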
