
python - Scrapy: how to save crawled blogs in their own files

I’m very new to Scrapy, Python and coding in general. I have a project where I’d like to collect blog posts to do some content analysis on them in Atlas.ti 8. Atlas supports files like .html, .txt, .docx and PDF.

I’ve built my crawler based on the scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html

My main issue is that I’m unable to save the posts in their own files. I can download them as one batch with scrapy crawl <crawler> -o filename.csv, but from the CSV I have to use VBA to put the posts in their own files row by row. This is a step I’d like to avoid.

My current code can be seen below.

import scrapy

class BlogCrawler(scrapy.Spider):
    name = "crawler"
    start_urls = ['url']

    def parse(self, response):
        postnro = 0
        for post in response.css('div.post'):
            postnro += 1
            yield {
                'Post nro: ': postnro,
                'date': post.css('.meta-date::text').get().replace('\non', '').replace('', ''),
                'author': post.css('.meta-author i::text').get(),
                'headline': post.css('.post-title ::text').get(),
                'link': post.css('h1.post-title.single a').attrib['href'],
                'text': [item.strip() for item in response.css('div.entry ::text').getall()],
            }

            filename = f'post-{postnro}.html'
            with open(filename, 'wb') as f:
                f.write(???)

        next_page = response.css('div.alignright a').attrib['href']
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

I’ve no idea how I should go about saving the results. I’ve tried passing response.body, response.text and TextResponse.text to f.write(), to no avail. I’ve also tried to collect the data in a for loop and save it like f.write(date + ' ', author + ' ' ...). Approaches like these produce empty, 0 KB files.

The reason I’ve set the file type to .html is that Atlas can take it as is and the whitespace won’t be an issue. In principle the file type could also be .txt. However, if I manage to save the posts as HTML, I evade the secondary issue in my project: getall() returns a list, which makes strip(), replace() and the w3lib methods hard to apply when cleaning the data. The current code replaces the whitespace with commas, which is readable, but it could be better.
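To illustrate the secondary issue, below is a rough sketch of the kind of cleaning I mean, using the same div.entry selector as above. The helper name clean_entry_text is made up, and w3lib's replace_escape_chars is only one of the options I have looked at; this is what I have in mind, not something I have working.

from w3lib.html import replace_escape_chars

def clean_entry_text(parts):
    # Flatten the list that getall() returns into one readable string,
    # dropping the items that are only whitespace.
    joined = ' '.join(p.strip() for p in parts if p.strip())
    # Collapse newlines and tabs into single spaces.
    return replace_escape_chars(joined, replace_by=' ')

# Inside parse():
# text = clean_entry_text(post.css('div.entry ::text').getall())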

If anyone has ideas on how to save each blog post in a separate file, one post per file, I'd be happy to hear them.

Best regards,

Leeward

question from: https://stackoverflow.com/questions/65849346/scrapy-how-to-save-crawled-blogs-in-their-own-files


1 Answer


Managed to crack this after a good night's sleep and some hours of keyboard (and head) bashing. It is not pretty or elegant and does not make use of Scrapy's advanced features, but it suffices for now. It does not solve my secondary issue either, but I can live with that, this being my first crawling project. There were multiple issues with my code:

  • "postnro" was not being updated so the code kept writing the same file over and over again. I was unable to make it work, so I used "date" instead. Could have used post's unique id as well, but those were so random, I would not haven known what file I was working with without opening the said file.

  • I could not figure out how to save the yielded items to a file, so I looped over the fields I wanted and saved the results one by one.

  • I switched the file type from .html to .txt, but it took me some time to figure out that I also had to switch 'wb' to plain 'w'.

For those interested, working code (so to speak) below:

def parse(self, response):
    for post in response.css('div.post'):
        date = post.css('.meta-date::text').get().replace('\non ', '').replace('', '')
        author = post.css('.meta-author i::text').get()
        headline = post.css('.post-title ::text').get()
        link = post.css('h1.post-title.single a').attrib['href']
        text = [item.strip() for item in post.css('div.entry ::text').getall()]
        filename = f'post-{date}.txt'
        with open(filename, 'w') as f:
            f.write(str(date) + '\n' + str(author) + '\n' + str(headline) + '\n'
                    + str(link) + '\n' + '\n' + str(text) + '\n')

        next_page = response.css('div.alignleft a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
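
In hindsight, the postnro counter in the original code probably kept producing the same filenames because it was reset to 0 at the top of every parse() call, so each new page started numbering from 1 again and overwrote the earlier files. A counter stored on the spider itself would survive across pages. A rough sketch below; the post_count attribute name is my own, and writing post.get(), the raw HTML of the matched post, is just one way to fill the file.

import scrapy

class BlogCrawler(scrapy.Spider):
    name = "crawler"
    start_urls = ['url']
    post_count = 0  # lives on the spider, so it is not reset on every parse() call

    def parse(self, response):
        for post in response.css('div.post'):
            self.post_count += 1  # keeps counting across pages
            filename = f'post-{self.post_count}.html'
            with open(filename, 'w') as f:
                f.write(post.get())  # post.get() returns the matched post's HTML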
