So I have a list of Amazon Prime Video URLs that has already been cleaned down to just each video's detail page, like this:
Movie
https://primevideo.com/detail/0MRW4VXSA05J3TO7GEIONBBZU7/ref=atv_hm_hom_1_c_r1Fy1u_4_3
Series with multiple episodes
https://www.primevideo.com/detail/0QSTI37OUTTRLSVF40P3REDM3K/ref=atv_hm_hom_1_c_8pZiqd_2_1
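For context, amazonpv (the list passed into the function at the bottom) is a plain Python list of these cleaned URLs. Since the code builds the address as "https://www." + url, the entries would need to be stored without that prefix for the concatenation to work, roughly like this (illustrative only, not my actual list):

amazonpv = [
    "primevideo.com/detail/0MRW4VXSA05J3TO7GEIONBBZU7/ref=atv_hm_hom_1_c_r1Fy1u_4_3",
    "primevideo.com/detail/0QSTI37OUTTRLSVF40P3REDM3K/ref=atv_hm_hom_1_c_8pZiqd_2_1",
]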
I would like to scrape the Title, Duration, Year, Synopsis, Directors, Genre, and Starring fields from each page.
However, I realized that movies have a different HTML layout than series, and when I scraped the entire list, the script sometimes couldn't find a specific element, so it skipped that URL and moved on to the next one.
I would like to output a dataframe with all of the data listed above. Another question: I am not sure whether I picked the correct XPath, because a title can have 3, 4, 5, or more cast members, or more than one genre. I picked the upper tag that contains all of them.
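For example, for the cast on a movie page, I am wondering whether I should be collecting every link under the "Starring" dd and joining them, instead of taking the parent tag's .text, something along these lines (untested sketch that reuses the same driver and my own XPaths from the code below with /a appended, so the paths themselves may well be wrong):

# Untested idea: collect each <a> inside the "Starring" <dd> and join the names,
# so 3, 4, 5 or more cast members all end up in one string
cast_links = driver.find_elements_by_xpath('//*[@id="meta-info"]/div/dl[2]/dd/a')
Starring = ", ".join(link.text for link in cast_links)

# Same idea for pages with more than one genre
genre_links = driver.find_elements_by_xpath('//*[@id="meta-info"]/div/dl[3]/dd/a')
Genre = ", ".join(link.text for link in genre_links)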
What's wrong with my code? It returned a blank dataframe.
import time

import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

AmazonPV = pd.DataFrame()

def ScrapeAmazonPV(urls, df):
    """
    Pass in the URLs to access and also a blank dataframe to write onto.
    Known problem: some of the Amazon PV pages seem to be missing some data, so when the
    try/except runs, any film missing any of that data gets skipped. I need to find a
    solution so they are not skipped.
    """
    TitleList = []
    YearList = []
    SynopsisList = []
    GenreList = []
    LengthList = []
    StarringList = []
    CreatorsList = []
    IMDBList = []
    with webdriver.Chrome(ChromeDriverManager().install()) as driver:
        for url in urls:
            try:
                driver.get("https://www." + url)
                # Pause for the website to load
                time.sleep(1)
                # XPath used to tell movies and series apart (only expected to match on series pages)
                elements = '//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/div/label/span'
                if len(elements) == 0:
                    # Movies/Films
                    Title = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/h1').text
                    Genre = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[3]/dd/a').text
                    Year = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[3]/span').text
                    Length = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[2]/span').text
                    IMDBRating = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/span[1]/span[2]').text
                    Creators = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[1]/dd').text
                    Starring = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[2]/dd').text
                    Synopsis = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/div/div/div/div/div/text()').text
                    # Append the scraped values to the lists
                    YearList.append(Year)
                    TitleList.append(Title)
                    SynopsisList.append(Synopsis)
                    GenreList.append(Genre)
                    LengthList.append(Length)
                    StarringList.append(Starring)
                    CreatorsList.append(Creators)
                    IMDBList.append(IMDBRating)
                else:
                    # Series
                    Title = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/h1').text
                    Genre = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[2]/dd').text
                    Year = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/span[2]/span').text
                    Length = driver.find_element_by_xpath('//*[@id="tab-content0"]/div/div/div/h1').text
                    IMDBRating = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[2]/span[1]/span[2]').text
                    Creators = driver.find_element_by_xpath('//*[@id="btf-product-details"]/div/dl[1]/dd').text
                    Starring = driver.find_element_by_xpath('//*[@id="meta-info"]/div/dl[1]/dd').text
                    Synopsis = driver.find_element_by_xpath('//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[3]/div/div/div/div/div/text()').text
                    # Append the scraped values to the lists
                    YearList.append(Year)
                    TitleList.append(Title)
                    SynopsisList.append(Synopsis)
                    GenreList.append(Genre)
                    LengthList.append(Length)
                    StarringList.append(Starring)
                    CreatorsList.append(Creators)
                    IMDBList.append(IMDBRating)
            except:
                continue
    AmazonPV['Title'] = TitleList
    AmazonPV['Year'] = YearList
    AmazonPV['Genre'] = GenreList
    AmazonPV['Length'] = LengthList
    AmazonPV['Starring'] = StarringList
    AmazonPV['Creators'] = CreatorsList
    AmazonPV['Synopsis'] = SynopsisList
    AmazonPV['Rating'] = IMDBList
    return AmazonPV

ScrapeAmazonPV(amazonpv, AmazonPV)  # amazonpv is the cleaned URL list described above
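For what it's worth, the elements line was intended to be an actual lookup that tells movie pages apart from series pages, not just a string; something like the untested sketch below (inside the same for-url loop, with the same driver) is what I had in mind, since find_elements_by_xpath returns an empty list when nothing matches, so the length check would at least be meaningful. I'm not sure whether that alone explains the blank dataframe, though.

# Untested sketch of the intended movie/series check:
# find_elements_by_xpath returns a (possibly empty) list of matching elements
series_marker = driver.find_elements_by_xpath(
    '//*[@id="a-page"]/div[2]/div[2]/div/div/div[2]/div[2]/div/div[1]/div/label/span')
if len(series_marker) == 0:
    # no match for the series-only XPath -> treat the page as a movie
    pass
else:
    # the XPath matched -> treat the page as a series
    pass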