youtubecrawling

Crawls a YouTube video's URL, title, full description, captions, and comments


Keywords
youtube, crawler, textdata
Install
pip install youtubecrawling==1.0.4

Documentation

There is currently a known issue with comment crawling

  • Comments cannot be crawled when a video is blocked; exception handling for this case will be added

youtube_crawler

Crawl and save a range of text data from YouTube with very little effort, using the YouTube API

Because it crawls through the official API, it is highly stable

Crawling stops once the API quota is used up, so it helps to have several API keys

One key can be issued per Google account, so with several Google accounts you can use several keys

How to get a key

https://velog.io/@yhe228/Youtube-API%EB%A5%BC-%EC%9D%B4%EC%9A%A9%ED%95%B4-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EA%B0%80%EC%A0%B8%EC%98%A4%EA%B8%B0

The crawler takes a list of API keys up front and automatically switches to another key when the current key's quota runs out.
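The key switching described above can be sketched roughly as follows. This is only an illustration of the idea, not the library's actual implementation: `QuotaExceeded`, `call_with_key_rotation`, and `fake_request` are hypothetical names, and a real crawl would detect quota errors from the YouTube API client instead.

```python
class QuotaExceeded(Exception):
    """Hypothetical error meaning the current key has no quota left."""


def call_with_key_rotation(keys, request, key_index=0):
    """Call request(key), starting at keys[key_index] and moving on when quota runs out.

    Returns a tuple (result, index_of_the_key_that_worked).
    """
    for index in range(key_index, len(keys)):
        try:
            return request(keys[index]), index
        except QuotaExceeded:
            continue  # this key is spent, try the next one
    raise RuntimeError("all API keys have exhausted their quota")


# Stand-in for a real API call: only the last key still has quota.
def fake_request(key):
    if key != "hgfd":
        raise QuotaExceeded(key)
    return f"response fetched with {key}"


result, used_index = call_with_key_rotation(["asdasd", "bddfg", "hgfd"], fake_request)
print(result, used_index)
```

Returning the index of the key that worked lets the caller resume from that key on the next request instead of retrying keys that are already spent.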

Crawled items

  • Video id (URL)
  • Title
  • Description
  • Comments
  • Captions (uses the pytube library)

pip install

pip install youtubecrawling

The code below runs the entire crawl

from youtubecrawling import youtubecrawling

key_list = ["asdasd", "bddfg", "hgfd"]

# Create the crawler
c = youtubecrawling.Crawler("D:/youtube_crawler", key_list)

# Search YouTube for "블랙핑크" (BLACKPINK) and crawl the video ids and titles
df = c.youtube_search("블랙핑크")

# Crawl the captions of the BLACKPINK videos
cap_no = c.make_captions(df)

# Crawl the video comments
com_no = c.get_comments(df)

# Crawl the descriptions
desc = c.get_descriptions(df)

Folder structure

--specified folder

 :--videoIds

 :--captions
 
 :--comments

 :--description
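
As a quick way to check what a crawl produced, the layout above can be walked with `pathlib`. This is a sketch under assumptions: `list_outputs` is a hypothetical helper that is not part of the library, and it assumes every output is a `.csv` file sitting directly inside one of the four subfolders.

```python
import tempfile
from pathlib import Path

# Subfolder names taken from the folder structure above.
SUBFOLDERS = ("videoIds", "captions", "comments", "description")


def list_outputs(base_dir):
    """Map each expected subfolder name to the sorted CSV file names found in it."""
    base = Path(base_dir)
    found = {}
    for name in SUBFOLDERS:
        folder = base / name
        # A subfolder that was never created simply yields an empty list.
        found[name] = sorted(p.name for p in folder.glob("*.csv")) if folder.is_dir() else []
    return found


# Demo on a throwaway directory that mimics part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "videoIds").mkdir()
    (Path(tmp) / "videoIds" / "example_videoIds.csv").write_text("id,title\n")
    outputs = list_outputs(tmp)

print(outputs)
```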

Motivation

When crawling YouTube, I had to write code from the official documentation, reshape the results into the form I wanted, swap API keys by hand whenever a quota ran out, and then work out where the crawl had stopped. That was tedious, so I wrote this library myself.

It handles everything from search to saving CSV files in one go, so the user needs no special configuration.

Please give it a try, and report any problems through Issues so that I can address them

Thank you

Libraries used

These are installed automatically when you install the library

  • google-api-python-client
  • oauth2client
  • pytube
  • tqdm
  • pandas

Feature details

Crawler

yc = Crawler(path, key_list, key_index=0)

# Example

key_list = ["asdasd", "bddfg", "hgfd"]
# Save files under "D:/youtube_crawler"
# Use the API keys in key_list
# Start with the key at index 2
yc = Crawler("D:/youtube_crawler", key_list, 2)

Creates a Crawler

parameter

  • path
    • Location where the data is saved
  • key_list
    • List of API keys to use
  • key_index
    • Index of the first key to use

youtube_search

df = c.youtube_search(query, topicId=None, videoCaption=None, regionCode="KR")

# Example

# Search Korean-region videos for "nlp 강의" (NLP lectures),
# with no topic set and regardless of whether captions exist
df = c.youtube_search("nlp 강의")

# Search Korean-region videos for "블랙핑크" (BLACKPINK),
# restricted to videos related to Music and Entertainment
df = c.youtube_search("블랙핑크", topicId="/m/04rlf, /m/02jjt")

Searches for query, saves the ids of the matching videos, and returns a DataFrame

parameter

  • query

    • Search term
  • topicId

    • Topic filter; several topics may be given

    • topicId values

      Music topics

      • /m/04rlf Music
      • /m/05fw6t Children's music
      • /m/02mscn Christian music
      • /m/0ggq0m Classical music
      • /m/01lyv Country
      • /m/02lkt Electronic music
      • /m/0glt670 Hip hop music
      • /m/05rwpb Independent music
      • /m/03_d0 Jazz
      • /m/028sqc Music of Asia
      • /m/0g293 Music of Latin America
      • /m/064t9 Pop music
      • /m/06cqb Reggae
      • /m/06j6l Rhythm and blues
      • /m/06by7 Rock music
      • /m/0gywn Soul music

      Gaming topics

      • /m/0bzvm2 Gaming
      • /m/025zzc Action game
      • /m/02ntfj Action-adventure game
      • /m/0b1vjn Casual game
      • /m/02hygl Music video game
      • /m/04q1x3q Puzzle video game
      • /m/01sjng Racing video game
      • /m/0403l3g Role-playing video game
      • /m/021bp2 Simulation video game
      • /m/022dc6 Sports game
      • /m/03hf_rm Strategy video game

      Sports topics

      • /m/06ntj Sports
      • /m/0jm_ American football
      • /m/018jz Baseball
      • /m/018w8 Basketball
      • /m/01cgz Boxing
      • /m/09xp_ Cricket
      • /m/02vx4 Football
      • /m/037hz Golf
      • /m/03tmr Ice hockey
      • /m/01h7lh Mixed martial arts
      • /m/0410tth Motorsport
      • /m/066wd Professional wrestling
      • /m/07bs0 Tennis
      • /m/07_53 Volleyball

      Entertainment topics

      • /m/02jjt Entertainment
      • /m/095bb Animated cartoon
      • /m/09kqc Humor
      • /m/02vxn Movies
      • /m/05qjc Performing arts

      Lifestyle topics

      • /m/019_rr Lifestyle
      • /m/032tl Fashion
      • /m/027x7n Fitness
      • /m/02wbm Food
      • /m/0kt51 Health
      • /m/03glg Hobby
      • /m/068hy Pets
      • /m/041xxh Physical attractiveness [Beauty]
      • /m/07c1v Technology
      • /m/07bxq Tourism
      • /m/07yv9 Vehicles

      Other topics

      • /m/01k8wb Knowledge
      • /m/098wr Society
  • videoCaption

    • None – does not filter results by caption availability.
    • closedCaption – includes only videos that have captions.
    • none – includes only videos that do not have captions.
  • regionCode

return

  • Saves to the videoIds folder under the specified path
    • File name: <run time>_videoIds.csv
  • Returns a pandas.DataFrame
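
Since topicId takes several ids in one string, a small lookup can make calls easier to read. The sketch below is illustrative only: `TOPIC_IDS` and `build_topic_id` are hypothetical names, the table holds just a few entries copied from the topic list above, and the ids are joined with a comma and a space to mirror the example call in this README.

```python
# A few entries copied from the topic list above; extend as needed.
TOPIC_IDS = {
    "Music": "/m/04rlf",
    "Gaming": "/m/0bzvm2",
    "Sports": "/m/06ntj",
    "Entertainment": "/m/02jjt",
}


def build_topic_id(*names):
    """Join the ids for the given topic names into a single topicId string."""
    return ", ".join(TOPIC_IDS[name] for name in names)


topic_id = build_topic_id("Music", "Entertainment")
print(topic_id)
```

The resulting string could then be passed as `topicId=topic_id` to `youtube_search`.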

make_captions

cap_no = c.make_captions(df)

# Also fetch English captions
cap_no = c.make_captions(df, True)

Fetches the captions of the given videos

parameter

  • df
    • DataFrame returned by youtube_search
    • columns = ['id', 'title']
  • if_En
    • Whether to also fetch English captions

return

  • Saves to the captions folder under the specified path
  • Returns a list of ids of videos that have no captions
  • columns = ['index', 'contents']
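
Because make_captions returns the ids of videos without captions, that list can be used to separate the searched videos into two groups. The following is a minimal sketch that uses plain lists in place of the DataFrame; `split_by_caption` is a hypothetical helper and the ids are made up.

```python
def split_by_caption(video_ids, ids_without_captions):
    """Split video ids into (with captions, without captions), keeping the original order."""
    missing = set(ids_without_captions)
    with_captions = [vid for vid in video_ids if vid not in missing]
    without_captions = [vid for vid in video_ids if vid in missing]
    return with_captions, without_captions


# Made-up ids standing in for df["id"].tolist() and the returned cap_no list.
all_ids = ["vid_a", "vid_b", "vid_c"]
cap_no = ["vid_b"]
captioned, uncaptioned = split_by_caption(all_ids, cap_no)
print(captioned, uncaptioned)
```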

get_comments

com_no = c.get_comments(df)

Fetches the comments of the given videos along with each comment's like count

parameter

  • df
    • DataFrame returned by youtube_search
    • columns = ['id', 'title']

return

  • Saves to the comments folder under the specified path
  • columns = ['author', 'comment', 'like']
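
Once comments are saved with these columns, they can be ranked by like count using only the standard library. This sketch assumes the saved file has a header row with the column names listed above and that `like` holds whole numbers; `top_comments` is a hypothetical helper, and the sample rows are invented.

```python
import csv
import io


def top_comments(csv_file, n=3):
    """Return the n rows with the highest like count from an open comments CSV."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row["like"]), reverse=True)
    return rows[:n]


# Invented sample data in place of a real file from the comments folder.
sample = io.StringIO(
    "author,comment,like\n"
    "kim,great video,12\n"
    "lee,thanks,3\n"
    "park,first,40\n"
)
best = top_comments(sample, n=2)
print([row["author"] for row in best])
```

For a real crawl, the same function could be given a file opened with `open(path, newline="", encoding="utf-8")`.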

get_descriptions

desc = c.get_descriptions(df)

Fetches the descriptions of the given videos.

parameter

  • df
    • DataFrame returned by youtube_search
    • columns = ['id', 'title']

return

  • Saves to the description folder under the specified path
  • columns = ['id', 'title', 'desc']

Progress log

0814

Split the code into functions

Could probably be optimized further, but stopping here for now

Entering regions separated by spaces finds all of them automatically

API keys are switched automatically

The index of the first API key to use must be supplied

Stops once all keys have been used

Saves all captions (Korean, English) for each video

0824

Code refactoring

Removed duplicates that could occur while reloading a key

Added comments

Added progress notifications

0825

Added title_description_parser

  • Fetches the "Show more" information

0926

Published to PyPI

Disclaimer

The developer accepts no responsibility for any loss or liability arising from a user's use of this program.