I’m very new to Scrapy, Python and coding in general. I have a project where I’d like to collect blog posts to do some content analysis on them in Atlas.ti 8. Atlas supports file types like .html, .txt, .docx and PDF.
I’ve built my crawler based on the scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
My main issue is that I’m unable to save the posts in their own files. I can download them as one batch with scrapy crawl <crawler> -o filename.csv, but then I have to use VBA to split the posts from the CSV into their own files, row by row. This is a step I’d like to avoid.
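For context, this is roughly the splitting step I do today and would like to get rid of (a Python sketch of my VBA macro, not the macro itself; the file name matches the export command above, and I'm assuming the 'text' column name here):

import csv

# Sketch of the CSV-splitting step I currently do in VBA, shown in Python.
# Assumes filename.csv came from: scrapy crawl <crawler> -o filename.csv
with open('filename.csv', newline='', encoding='utf-8') as f:
    for i, row in enumerate(csv.DictReader(f), start=1):
        with open(f'post-{i}.html', 'w', encoding='utf-8') as out:
            out.write(row['text'])  # 'text' is one of the fields the spider yields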
My current code can be seen below.
import scrapy

class BlogCrawler(scrapy.Spider):
    name = "crawler"
    start_urls = ['url']

    def parse(self, response):
        postnro = 0
        for post in response.css('div.post'):
            postnro += 1
            yield {
                'Post nro: ': postnro,
                'date': post.css('.meta-date::text').get().replace('\non', '').replace('\xa0', ''),
                'author': post.css('.meta-author i::text').get(),
                'headline': post.css('.post-title ::text').get(),
                'link': post.css('h1.post-title.single a').attrib['href'],
                'text': [item.strip() for item in post.css('div.entry ::text').getall()],
            }
            # This is where I'd like to write each post to its own file:
            filename = f'post-{postnro}.html'
            with open(filename, 'wb') as f:
                f.write(???)

        # Follow pagination; ::attr(href) with .get() returns None when absent.
        next_page = response.css('div.alignright a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
I’ve no idea how I should go about saving the results. I’ve tried passing response.body, response.text and TextResponse.text to f.write(), to no avail. I’ve also tried to collect the data in a for loop and save it like f.write(date + '\n', author + '\n', ...). Approaches like these produce empty, 0 KB files.
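For reference, here is a minimal sketch of the per-post write I'm aiming for, factored into a helper (save_post is just a name I made up, not a Scrapy API; the selectors are the ones from the spider above, and each f.write() call gets a single string):

def save_post(post, postnro):
    # Write one post's fields to its own file (a sketch, not code I have working).
    date = post.css('.meta-date::text').get() or ''
    author = post.css('.meta-author i::text').get() or ''
    text = '\n'.join(item.strip() for item in post.css('div.entry ::text').getall())

    filename = f'post-{postnro}.html'
    # Text mode ('w') with an explicit encoding; 'wb' would need bytes instead.
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(date + '\n')    # write() takes exactly one string per call
        f.write(author + '\n')
        f.write(text)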
The reason I’ve set the file type to .html is that Atlas can take it as it is, and the whitespace won’t be an issue. In principle the file type could also be .txt; however, if I manage to save the posts as HTML, I avoid a secondary issue in my project. getall() returns a list, which is why strip(), replace() and the w3lib methods are hard to apply when cleaning the data. The current code replaces the whitespace with commas, which is readable, but it could be better.
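For example, this is the kind of per-item cleanup I mean (a sketch using the same post selector as in the spider above; strip() has to run on each element because getall() returns a list of strings):

# getall() returns a list of strings, so cleanup happens per element.
parts = post.css('div.entry ::text').getall()
cleaned = [part.strip() for part in parts if part.strip()]  # drop whitespace-only nodes
text = '\n'.join(cleaned)  # join with newlines rather than commas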
If anyone has ideas on how to save each blog post in its own file, one post per file, I’d be happy to hear them.
Best regards,
Leeward