web scraping - How to get the right url after redirection (the one given by the browser) using python

Question

asked Feb 19, 2021 in Technique by 深蓝 (71.8m points)

I'm working on a project that aims to retrieve all the information from a news article (media website). For this I'm using the newspaper3k library, which works quite well.

However, I have a problem with some URLs (redirect links). According to my research, newspaper3k does not follow the redirect; it only processes the URL passed to it as a parameter.

Here is an example of a link I would like to deal with:

url = "wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"

So the goal with this URL is to get the right URL (after redirection) and then pass it to newspaper3k.

I have tried the following solutions, but neither works for me:

1 - using the requests library as follows: response = requests.get(url, verify=False, allow_redirects=True)

2 - using the mechanize library as follows:

br = mechanize.Browser()
resp = br.open(url)

I would like to get the same result as when I use webbrowser, but without opening a browser window:

import webbrowser
webbrowser.open_new(url)

and end up with the right URL:

https://www.20minutes.fr/monde/2943823-20210103-bahamas-disparition-bateau-20-personnes-bord?xtor=EREC-182-[actualite]

Thank you in advance for your reply :)

1 Answer

深蓝 · Answer 1 · 2021-02-19T03:43:16+0000

The redirect is not an HTTP redirect issued by the server (which requests would follow); it is triggered by the content of the HTML page itself. You can verify this by saving the text of the response with the following code.

# Save the page body so it can be opened and inspected locally
with open("actualite.html", "w", encoding="utf-8") as f:
    f.write(response.text)

If you open the saved file in a browser, it redirects. In other words, the redirect is performed by the browser, not by the server.
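One common way for a page to redirect from its own content is a meta refresh tag. Whether this particular page uses one is an assumption on my part (it may just as well redirect with JavaScript, in which case the check below finds nothing and a real browser is still needed). As a rough sketch, the saved HTML could be inspected like this; the names MetaRefreshFinder and find_meta_refresh are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute usually looks like "0; url=https://example.com/page"
        content = attrs.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html, base_url):
    """Return the absolute meta-refresh target found in html, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return urljoin(base_url, finder.target) if finder.target else None
```

With the response from the question, find_meta_refresh(response.text, response.url) would return the target if the page relies on a meta refresh, and None otherwise.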

To solve this, you could use a tool that drives a real browser, such as Selenium.
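For this particular link, a browser may not be strictly necessary. The destination appears to be carried, percent-encoded, in the u query parameter of the redirect URL itself, so it can be decoded with the standard library. This is only a sketch: the helper name target_from_query is made up for illustration, the dc parameter is left out of the example URL for brevity, and it assumes the tracking service keeps putting the destination in u, which other redirect links may not do.

```python
from urllib.parse import parse_qs, urlsplit


def target_from_query(redirect_url, param="u"):
    """Return the decoded value of `param` from the query string, or None."""
    query = urlsplit(redirect_url).query
    values = parse_qs(query).get(param)
    return values[0] if values else None


# Shortened version of the link from the question (the dc parameter is omitted here)
redirect_url = (
    "https://wtm.actualite.20minutes.fr/redirection.html"
    "?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr"
    "&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-"
    "disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D"
)

print(target_from_query(redirect_url))
```

This avoids any network request, but it only reads what the link claims as its destination; it does not confirm where the browser actually ends up.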

Edit: Here is how you could use selenium to do this:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
# Selenium 4 takes the driver path through a Service object; adjust the path to your own chromedriver
service = Service(executable_path=r"C:/Users/james/Documents/Selenium/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)
driver.get(url)
print(driver.current_url)  # URL after the browser has followed the redirect
driver.quit()
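The question asks for the browser's result without a visible window. One way to sketch that, assuming Selenium 4.6 or later (which can locate a matching chromedriver on its own) and a local Chrome install, is to run Chrome headless and wait briefly for the address to change, since a redirect triggered by the page content may fire slightly after the initial page load. The function name resolve_in_headless_chrome and the 10-second default timeout are illustrative choices, not part of the original answer.

```python
def resolve_in_headless_chrome(url, timeout=10):
    """Open url in headless Chrome and return the address it ends up on."""
    # Imported here so this module can still be loaded where Selenium is not installed
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no browser window is shown
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        try:
            # Wait until the address differs from the one we requested
            WebDriverWait(driver, timeout).until(EC.url_changes(url))
        except TimeoutException:
            pass  # the address did not change within the timeout
        return driver.current_url
    finally:
        driver.quit()  # always release the browser process
```

Calling resolve_in_headless_chrome(url) with the link from the question should then return the address the browser ends up on, which can be passed to newspaper3k.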
