Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Scraping Amazon Prime Video With Python and Selenium

I have a list of Amazon Prime Video URLs, already cleaned so that each one points to a specific video's page, like these:

Movie: https://primevideo.com/detail/0MRW4VXSA05J3TO7GEIONBBZU7/ref=atv_hm_hom_1_c_r1Fy1u_4_3

Series with multiple episodes: https://www.primevideo.com/detail/0QSTI37OUTTRLSVF40P3REDM3K/ref=atv_hm_hom_1_c_8pZiqd_2_1

I would like to scrape the Title, Duration, Year, Synopsis, Directors, Genre, and Starring fields.

However, I realized that movies have a different HTML layout than series. When I scrape the entire list, the code sometimes cannot find a specific element, so it skips that URL and moves on to the next one.
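
One way to avoid skipping a whole URL when a single field is missing is to look up each field separately and fall back to a default. The sketch below is not tested against the live site. `text_or_default` and `scrape_fields` are hypothetical helper names, and it assumes a Selenium driver whose `find_elements` accepts the locator strategy string `"xpath"` (the value of `By.XPATH`). This form should also work in Selenium 4, where the `find_element_by_*` helpers are no longer available.

```python
def text_or_default(driver, xpath, default=None):
    """Return the stripped text of the first element matching xpath, or default."""
    # find_elements (plural) returns an empty list when nothing matches,
    # instead of raising NoSuchElementException like find_element does.
    matches = driver.find_elements("xpath", xpath)  # "xpath" is the value of By.XPATH
    if not matches:
        return default
    return matches[0].text.strip()


def scrape_fields(driver, field_xpaths):
    """Build one row as a dict; fields that are absent on the page become None."""
    return {name: text_or_default(driver, xp) for name, xp in field_xpaths.items()}
```

With this shape, a page that lacks, say, a rating still yields a row, with `None` in the missing column.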

I would like to output a dataframe with all the data listed above. I am also not sure whether I picked the correct XPaths, since a title can have 3, 4, 5, or more cast members, or more than one genre. I picked the parent tag that contains all of them.
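
For fields with a variable number of entries, such as cast or genre, `find_elements` returns every match, so the count does not need to be known in advance. A minimal sketch, assuming each cast member or genre is its own `a` element under the `dd` (as in the Genre XPath used for movies in the code below). `joined_texts` is a hypothetical helper name.

```python
def joined_texts(driver, xpath, sep=", "):
    """Join the text of every element matching xpath; empty string if none match."""
    elements = driver.find_elements("xpath", xpath)  # "xpath" is the value of By.XPATH
    texts = [el.text.strip() for el in elements]
    # Drop empty strings so hidden or blank nodes do not leave stray separators.
    return sep.join(t for t in texts if t)


# Hypothetical usage with a live driver (XPath adapted from the question):
# starring = joined_texts(driver, '//*[@id="meta-info"]/div/dl[2]/dd/a')
```

If each page is collected into a dictionary of such values, the list of dictionaries can be passed to `pd.DataFrame(rows)`, which builds one column per key.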

What's wrong with my code? It returned a blank dataframe.

import time

import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

AmazonPV = pd.DataFrame()

def ScrapeAmazonPV(urls, df):
    """
    Pass in the URLs to access and also a blank dataframe to write onto.

    Known problem: some of the Amazon PV pages seem to be missing some data,
    so when the try/except runs, any film missing any of the fields is
    skipped. I need to find a solution that does not skip them.
    """
    TitleList = []
    YearList = []
    SynopsisList = []
    GenreList = []
    LengthList = []
    StarringList = []
    CreatorsList = []
    IMDBList = []
    with webdriver.Chrome(ChromeDriverManager().install()) as driver:
        for url in urls:
            try:
                driver.get("https://www." + url)
                # Pause for the website to load
                time.sleep(1)
                elements = '//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/div/label/span'
                if len(elements) == 0:
                    # Movies/Films
                    Title = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/h1').text
                    Genre = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[3]/dd/a').text
                    Year = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[3]/span').text
                    Length = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[2]/span').text
                    IMDBRating = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[1]/span[2]').text
                    Creators = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[1]/dd').text
                    Starring = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[2]/dd').text
                    Synopsis = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/div/div/div/div/div/text()').text

                    # Append values to the lists
                    YearList.append(Year)
                    TitleList.append(Title)
                    SynopsisList.append(Synopsis)
                    GenreList.append(Genre)
                    LengthList.append(Length)
                    StarringList.append(Starring)
                    CreatorsList.append(Creators)
                    IMDBList.append(IMDBRating)
                else:
                    # Series
                    Title = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/h1').text
                    Genre = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[2]/dd').text
                    Year = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/span[2]/span').text
                    Length = driver.find_element_by_xpath('//*[@id="tab-content0"]/div/div/div/h1').text
                    IMDBRating = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/span[1]/span[2]').text
                    Creators = driver.find_element_by_xpath('//*[@id="btf-product-details"]/div/dl[1]/dd').text
                    Starring = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[1]/dd').text
                    Synopsis = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[3]/div/div/div/div/div/text()').text

                    # Append values to the lists
                    YearList.append(Year)
                    TitleList.append(Title)
                    SynopsisList.append(Synopsis)
                    GenreList.append(Genre)
                    LengthList.append(Length)
                    StarringList.append(Starring)
                    CreatorsList.append(Creators)
                    IMDBList.append(IMDBRating)
            except:
                continue

        AmazonPV['Title'] = TitleList
        AmazonPV['Year'] = YearList
        AmazonPV['Genre'] = GenreList
        AmazonPV['Length'] = LengthList
        AmazonPV['Starring'] = StarringList
        AmazonPV['Creators'] = CreatorsList
        AmazonPV['Synopsis'] = SynopsisList
        AmazonPV['Rating'] = IMDBList
        return AmazonPV


ScrapeAmazonPV(amazonpv, AmazonPV)
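
A few things in the function above look like they could explain the blank dataframe, though this is a reading of the code rather than a tested diagnosis. `elements` is assigned an XPath string, not the result of a lookup, so `len(elements) == 0` is never true and the series branch always runs. Both Synopsis XPaths end in `/text()`, which selects a text node, while Selenium's element lookups expect an element. And the bare `except: continue` hides whichever exception is raised, so every URL may be skipped and the lists stay empty. A minimal sketch of a loop that records failures instead of hiding them follows. `scrape_all` and `scrape_one` are hypothetical names, with `scrape_one(url)` standing in for whatever per-URL scraping function is used.

```python
import logging

logger = logging.getLogger("amazonpv")


def scrape_all(urls, scrape_one):
    """Call scrape_one(url) for each URL; return (rows, failures)."""
    rows, failures = [], []
    for url in urls:
        try:
            rows.append(scrape_one(url))
        except Exception as exc:
            # Unlike a bare except, this lets KeyboardInterrupt and SystemExit
            # through, and it keeps the reason each URL was skipped.
            logger.warning("skipping %s: %r", url, exc)
            failures.append((url, repr(exc)))
    return rows, failures
```

Inspecting `failures` after a run shows which exception each skipped URL raised, which should make it clearer whether the XPaths or the page layout are at fault.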

Question from: https://stackoverflow.com/questions/65947940/scraping-amazon-prime-video-with-python-and-selenium


