I'm working on a web scraping project in Python. I get a daily email from a service that has links in it. A typical link looks like:
http://clicks.serviceprovider.com/track/click/12345/www.serviceprovider.com?p=eyJzI...<snip>...JdfSJ9
In a browser, I can see that the server redirects from http://clicks.serviceprovider.com to https://www.serviceprovider.com?pageId=12345. Naturally, I want to scrape pageId 12345 with my Python code.
If I just do a requests.get(url), the server never responds. I suspect, but don't know for sure, that this is because requests isn't including a Host header.
If I set headers={'Host':'clicks.serviceprovider.com'}, I end up getting an HTTP 403 error. What I think is happening, but cannot demonstrate, is that requests sends the original http GET request and gets the HTTP 301 redirect, but when it then does a GET for the redirected https:// page, it still uses the Host header for clicks.serviceprovider.com instead of www.serviceprovider.com from the redirected URL.
How can I tell requests to change the Host header with the redirect?
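To test the hypothesis above, here is a minimal sketch of what I could try: follow the redirect one hop at a time with allow_redirects=False and set the Host header from whatever URL is being requested at that hop. The helper names host_for and trace_redirects are my own (hypothetical, not part of requests), the session argument is assumed to be a requests.Session, and I have not confirmed that this actually avoids the 403.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def host_for(url):
    """Return the host[:port] part of url, i.e. the value for the Host header."""
    return urlsplit(url).netloc


def trace_redirects(session, url, max_hops=5, timeout=10):
    """Follow redirects manually and record (url, Host sent, status) per hop.

    session is assumed to be a requests.Session (hypothetical helper, not a
    requests API). The Host header is recomputed from the current URL on
    every hop instead of being carried over from the first request.
    """
    hops = []
    for _ in range(max_hops):
        host = host_for(url)
        # allow_redirects=False returns the 3xx response itself
        # instead of following it automatically.
        resp = session.get(
            url,
            headers={"Host": host},
            allow_redirects=False,
            timeout=timeout,
        )
        hops.append((url, host, resp.status_code))
        location = resp.headers.get("Location")
        if resp.status_code not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops


# Example live run (needs the third-party requests package; tracking_url
# stands for the full link from the email):
#   import requests
#   for hop in trace_redirects(requests.Session(), tracking_url):
#       print(hop)
```

If the second hop shows www.serviceprovider.com as the Host and still returns 403, then the Host header is probably not the cause and something else (for example a User-Agent check) would be worth looking at.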