Scraping Amazon Prime Video With Python and Selenium

asked Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I have a list of Amazon Prime Video URLs, already cleaned so that each one points to a specific video's page, like these:

Movie: https://primevideo.com/detail/0MRW4VXSA05J3TO7GEIONBBZU7/ref=atv_hm_hom_1_c_r1Fy1u_4_3

Series with multiple episodes: https://www.primevideo.com/detail/0QSTI37OUTTRLSVF40P3REDM3K/ref=atv_hm_hom_1_c_8pZiqd_2_1

I would like to scrape the Title, Duration, Year, Synopsis, Directors, Genre, and Starring fields.

However, I realized that movies have a different HTML layout from series. When I scrape the entire list, the code sometimes cannot find a specific element, so it skips that URL and moves on to the next one.
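One way to cope with the two layouts without dropping the whole URL might be to look each field up with `find_elements`, which returns an empty list instead of raising when nothing matches, and to try one candidate XPath per layout. The following is only a sketch: `first_text` and `YEAR_XPATHS` are hypothetical names, the two XPaths are copied from the code in this question, and the string `"xpath"` is used as the locator strategy (it is the value of Selenium's `By.XPATH`).

```python
def first_text(driver, xpaths):
    """Return the text of the first element matched by any XPath, else None."""
    for xpath in xpaths:
        # find_elements returns [] when nothing matches, so no exception
        # is raised and the remaining fields of the page can still be read.
        matches = driver.find_elements("xpath", xpath)
        if matches:
            return matches[0].text.strip()
    return None


# Candidate XPaths for the Year field: movie layout first, then series layout.
# Both are taken from the question's code and may not match the live site.
YEAR_XPATHS = [
    '//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[3]/span',
    '//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/span[2]/span',
]
```

With a live driver this would be called as `first_text(driver, YEAR_XPATHS)`; a `None` result marks a field that neither layout provided, instead of aborting the whole page.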

I would like to output a DataFrame with all of the data listed above. I am also not sure I picked the correct XPaths, because a title can have 3, 4, 5, or more cast members, or more than one genre. I picked the parent tag that contains them all.
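For fields with a variable number of values, one option is to select every matching child element and join their text, so the count of cast members or genres does not matter. This is a sketch under assumptions: `joined_texts` and `GENRE_LINKS` are hypothetical names, and the XPath is the genre XPath from the movie branch of the code in this question, assuming each genre is its own `<a>` under the `<dd>`.

```python
def joined_texts(driver, xpath, sep=", "):
    """Join the text of every element matching xpath; None if nothing matches."""
    # "xpath" is the string value of Selenium's By.XPATH.
    texts = [el.text.strip() for el in driver.find_elements("xpath", xpath)]
    texts = [t for t in texts if t]  # drop empty strings
    return sep.join(texts) if texts else None


# Assumption: one <a> per genre, as in the question's movie-layout XPath.
GENRE_LINKS = '//*[@id="meta-info"]/div/dl[3]/dd/a'
```

Calling `joined_texts(driver, GENRE_LINKS)` would then give one comma-separated string per title, however many genres the page lists.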

What's wrong with my code? It returns a blank DataFrame.

import time

import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

AmazonPV = pd.DataFrame()

def ScrapeAmazonPV(urls, df):
    """
    Pass in the URLs to access and a blank DataFrame to write to.

    Known problem: some Amazon PV pages seem to be missing some data, so
    when the try/except runs, any film missing any of the fields is skipped.
    I need a solution that does not skip them.
    """
    TitleList = []
    YearList = []
    SynopsisList = []
    GenreList = []
    LengthList = []
    StarringList = []
    CreatorsList = []
    IMDBList = []
    with webdriver.Chrome(ChromeDriverManager().install()) as driver:
        for url in urls:
            try:
                driver.get("https://www." + url)
                # Pause for the website to load
                time.sleep(1)
                elements = '//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/div/label/span'
                if len(elements) == 0:
                    # Movies/Films
                    Title = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/h1').text
                    Genre = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[3]/dd/a').text
                    Year = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[3]/span').text
                    Length = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[2]/span').text
                    IMDBRating = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[1]/span[2]').text
                    Creators = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[1]/dd').text
                    Starring = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[2]/dd').text
                    Synopsis = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/div/div/div/div/div/text()').text

                    # Append the values to the lists
                    YearList.append(Year)
                    TitleList.append(Title)
                    SynopsisList.append(Synopsis)
                    GenreList.append(Genre)
                    LengthList.append(Length)
                    StarringList.append(Starring)
                    CreatorsList.append(Creators)
                    IMDBList.append(IMDBRating)
                else:
                    # Series
                    Title = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/h1').text
                    Genre = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[2]/dd').text
                    Year = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/span[2]/span').text
                    Length = driver.find_element_by_xpath('//*[@id="tab-content0"]/div/div/div/h1').text
                    IMDBRating = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/span[1]/span[2]').text
                    Creators = driver.find_element_by_xpath('//*[@id="btf-product-details"]/div/dl[1]/dd').text
                    Starring = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[1]/dd').text
                    Synopsis = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[3]/div/div/div/div/div/text()').text

                    # Append the values to the lists
                    YearList.append(Year)
                    TitleList.append(Title)
                    SynopsisList.append(Synopsis)
                    GenreList.append(Genre)
                    LengthList.append(Length)
                    StarringList.append(Starring)
                    CreatorsList.append(Creators)
                    IMDBList.append(IMDBRating)
            except:
                continue

    AmazonPV['Title'] = TitleList
    AmazonPV['Year'] = YearList
    AmazonPV['Genre'] = GenreList
    AmazonPV['Length'] = LengthList
    AmazonPV['Starring'] = StarringList
    AmazonPV['Creators'] = CreatorsList
    AmazonPV['Synopsis'] = SynopsisList
    AmazonPV['Rating'] = IMDBList
    return AmazonPV

ScrapeAmazonPV(amazonpv, AmazonPV)
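A few things in the code above could plausibly explain the blank DataFrame. `elements` is a plain string, so `len(elements)` is never 0 and the movie branch never runs. The Synopsis XPaths end in `/text()`, which selects a text node rather than an element, and Selenium's element lookups reject that. If the URLs in the list already start with `https://`, as the two examples do, prefixing `"https://www."` yields an invalid address. The bare `except: continue` then hides all of these errors, so every list stays empty. The sketch below is one possible restructuring, not a verified fix: `normalize_url`, `scrape_rows`, and `FIELD_XPATHS` are hypothetical names, the XPaths are copied from the question and may not match the live site, and `"xpath"` is the string value of Selenium's `By.XPATH`.

```python
def normalize_url(url):
    """Add a scheme only when the URL lacks one."""
    return url if url.startswith(("http://", "https://")) else "https://" + url


def scrape_rows(driver, urls, field_xpaths):
    """Return one dict per URL; a field that is not found is stored as None."""
    rows = []
    for url in urls:
        driver.get(normalize_url(url))
        row = {"URL": url}
        for field, xpaths in field_xpaths.items():
            row[field] = None
            for xpath in xpaths:
                # An empty list means "not in this layout": try the next XPath.
                matches = driver.find_elements("xpath", xpath)
                if matches:
                    row[field] = ", ".join(m.text.strip() for m in matches)
                    break
        rows.append(row)
    return rows


# Candidate XPaths per field (movie layout, then series layout where they differ).
FIELD_XPATHS = {
    "Title": ['//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/h1'],
    "Year": [
        '//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[3]/span',
        '//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/span[2]/span',
    ],
}
```

The returned list can be passed straight to `pd.DataFrame(rows)`, which gives one row per URL and leaves missing fields empty rather than dropping the title. A short wait after `driver.get` may still be needed if the page fills in its content late.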

Question from: https://stackoverflow.com/questions/65947940/scraping-amazon-prime-video-with-python-and-selenium
