The issues with your script as it stands are:
- The url has a trailing / which gives an invalid page when requested, not listing the files you want to download.
- The CSS selector in soup.select(...) selects div elements with the attribute webpartid, which does not exist anywhere in that linked document.
- You are joining the URL and quoting it, even though the links are given in the page as absolute URLs and they do not need quoting.
- The try:...except: block is stopping you from seeing the errors generated when trying to download the file. Using an except block without a specific exception is bad practice and should be avoided.
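
To illustrate the last point, here is a minimal sketch of catching only the exceptions that urllib is expected to raise, so a failure such as a 403 gets reported rather than silently swallowed. The helper name try_download and its retrieve parameter are hypothetical and not part of your script; the parameter is only there so the function can be exercised without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlretrieve


def try_download(href, filename, retrieve=urlretrieve):
    """Return None on success, or a short error message on failure.

    `retrieve` is a hypothetical hook (defaults to urlretrieve).
    """
    try:
        retrieve(href, filename)
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        return "HTTP error %d: %s" % (e.code, e.reason)
    except URLError as e:
        # Covers DNS failures, refused connections and similar problems
        return "URL error: %s" % e.reason
    return None
```

Any other exception (for example a typo in your own code) still propagates, which is usually what you want while debugging.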
A modified version of your code that will get the correct files and attempt to download them is as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'

u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
However, if you run this you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file is downloadable in the browser.
At first I thought this was a referrer check (to prevent hotlinking); however, if you look at the request in your browser (e.g. Chrome Developer Tools) you'll notice that the initial http:// request is blocked there also, and then Chrome attempts an https:// request for the same file.
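
If you would rather check this from Python than from the browser, a rough sketch along these lines can compare the status codes of the two schemes. The function name status_of and its opener parameter are hypothetical; the opener hook exists only so the function can be tried with a stand-in instead of a live request.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def status_of(url, opener=urlopen):
    """Return the HTTP status code for url, including error codes such as 403.

    `opener` is a hypothetical hook (defaults to urlopen).
    """
    try:
        with opener(url) as response:
            return response.status
    except HTTPError as e:
        # urlopen raises for 4xx/5xx responses; the code is on the exception
        return e.code
```

Calling it once with the http:// form of a link and once with the https:// form should show which of the two the server accepts.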
In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you will need to rewrite the http: to https: before using the URL for the request. The following code will correctly modify the URLs and download the files. I've also added a variable to specify the output folder, which is added to the filename using os.path.join:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
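
As a possible refinement, str.replace swaps every occurrence of http:// in the string, including one inside a query string. A minimal sketch that changes only the scheme, using urllib.parse, might look like this; the function name force_https is hypothetical and example.com is a placeholder host.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(href):
    """Return href with an http scheme upgraded to https; leave anything else alone."""
    parts = urlsplit(href)
    if parts.scheme == "http":
        # SplitResult is a namedtuple, so _replace returns a modified copy
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


print(force_https("http://example.com/docs/file.xls"))
```

For the links on this page the simple replace is enough, so this is only worth it if the URLs may carry other URLs as parameters.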