Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

selenium - Capturing info from console using Python

I'm writing a script to rip m4a files from a specific website. I'm currently using BS4 and Selenium for this.

I'm having some trouble getting the info. The file link is not located in the HTML source for the page. Instead, I can only find it in the console. The link I'm trying to get is here in this image (https://imgur.com/a/DLwcE0p) labeled "audio_url_m4a:".

Here's some sample code I'm using:

import time

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Ask Chrome to record every browser console message
d = DesiredCapabilities.CHROME
d['loggingPrefs'] = {'browser': 'ALL'}
driver = webdriver.Chrome(r'chromedriver path', desired_capabilities=d)

~~lots of code doing other things not relevant to the post~~

for URL in audm_URL:  # audm_URL is a list of URLs constructed earlier
    driver.get(URL)
    time.sleep(3)

    for entry in driver.get_log('browser'):
        print(entry)

Here is the output I get:


{'level': 'SEVERE', 'message': 'https://audm.herokuapp.com/favicon.ico - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1611291689357}
{'level': 'SEVERE', 'message': 'https://cdn.segment.com/analytics.js/v1/5DOhLj2nIgYtQeSfn9YF5gpAiPqRtWSc/analytics.min.js - Failed to load resource: net::ERR_NAME_NOT_RESOLVED', 'source': 'network', 'timestamp': 1611291689357}

Most questions about grabbing things from the console point me towards the logs, but nothing seems to show how to grab those other variables. Any ideas?

Here's a link to a random audio page that I want to grab the file from: https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c

Thanks everyone!

question from:https://stackoverflow.com/questions/65839595/capturing-info-from-console-using-python


1 Answer

driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")

# Click the first button on the page (presumably the play button) so the player loads the audio
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "button"))).click()

# Wait for the media element, then read its src attribute
src = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".react-player video"))).get_attribute("src")

print(src)

If you just want the src, you can use the code above.

You need these imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

If you want to get it through the console log instead, use the following. It seems to work only in headless mode; I am investigating:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True

# Ask Chrome to record every browser console message
capabilities = webdriver.DesiredCapabilities().CHROME.copy()
capabilities['loggingPrefs'] = {'browser': 'ALL'}
driver = webdriver.Chrome(options=options, desired_capabilities=capabilities)

driver.maximize_window()

driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")

# Give the page a few seconds to load and write to the console
time.sleep(3)

for entry in driver.get_log('browser'):
    print(entry)
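
Once the log entries come back, you still have to pick the link out of them. Here is a minimal sketch of that step, assuming the m4a URL appears as plain text inside each entry's 'message' field (if the page logs an object, Chrome may not serialize it, so that is not guaranteed). The helper name find_m4a_urls and the second sample entry are hypothetical; only the dict keys match the output shown in the question.

```python
import re

# Assumption: the link is written into the log message as text and ends in ".m4a",
# optionally followed by a query string.
M4A_URL = re.compile(r"https?://[^\s\"'\\]+\.m4a[^\s\"'\\]*")


def find_m4a_urls(entries):
    """Return each distinct .m4a URL found in the 'message' of browser log entries."""
    urls = []
    for entry in entries:
        for url in M4A_URL.findall(entry.get("message", "")):
            if url not in urls:  # keep first-seen order, drop repeats
                urls.append(url)
    return urls


# The first entry is copied from the question's output; the second is made up for illustration.
sample_entries = [
    {"level": "SEVERE",
     "message": "https://audm.herokuapp.com/favicon.ico - Failed to load resource: "
                "the server responded with a status of 404 (Not Found)",
     "source": "network", "timestamp": 1611291689357},
    {"level": "INFO",
     "message": 'https://example.com/app.js 10:12 "audio_url_m4a: https://example.com/audio/sample.m4a"',
     "source": "console-api", "timestamp": 1611291689400},
]

print(find_m4a_urls(sample_entries))
```

With a live session you would pass driver.get_log('browser') instead of sample_entries.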

Update

In headless mode w3c is false, which is why logging works there.

For non-headless mode you have to add:

options.add_experimental_option('w3c', False)
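
A possible alternative to switching w3c off, sketched here rather than tested against this site: when ChromeDriver runs in W3C mode it expects the vendor-prefixed capability name goog:loggingPrefs instead of loggingPrefs. The helper names logging_capability and make_driver are hypothetical, and make_driver assumes selenium and a matching chromedriver are installed.

```python
def logging_capability(w3c=True):
    """Return the (name, value) capability pair that requests all browser console logs."""
    # W3C-mode ChromeDriver wants the "goog:" prefix; the legacy protocol uses the bare name.
    name = "goog:loggingPrefs" if w3c else "loggingPrefs"
    return name, {"browser": "ALL"}


def make_driver(headless=False):
    """Start Chrome with console logging enabled (requires selenium and chromedriver)."""
    # Imported here so logging_capability() can be used without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        options.add_argument("--headless")
    name, prefs = logging_capability(w3c=True)
    options.set_capability(name, prefs)
    return webdriver.Chrome(options=options)
```

After driver = make_driver(), driver.get_log('browser') is read the same way as in the code above.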
