Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - normalize-space works with XPath but not with CSS selectors

I am extracting data using Scrapy and Python.

The data sometimes includes extra whitespace. I was using normalize-space with XPath to remove it, like this:

xpath('normalize-space(.//li[2]/strong/text())').extract()

It works very well. However, now I want to use normalize-space with a CSS selector.

I tried this:

car['Location'] = site.css('normalize-space(div[class=location]::text)').extract()

I got an empty result, though I get the correct result if I remove normalize-space.

How can I use it with a CSS selector?

I also tried this:

def normalize_whitespace(str):
    import re
    str = str.strip()
    str = re.sub(r'\s+', ' ', str)
    return str

and I called this function like this:

car['Location'] = normalize_whitespace(site.css('div[class=location]::text').extract())

but I got an empty result. Why?


1 Answer


Unfortunately, XPath functions are not available with CSS selectors in Scrapy.

You could first translate your div[class=location]::text CSS selector to the equivalent XPath expression and then wrap it in normalize-space() as input to .xpath().
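
As a rough sketch of that translation: div[class=location]::text corresponds to the XPath .//div[@class="location"]/text(). The example below uses lxml directly (the library Scrapy's selectors are built on) so that it runs outside a Scrapy project; the sample HTML is made up for illustration. In a spider, the same expression string would be passed to site.xpath(...).extract().

```python
from lxml import html

# Made-up sample markup with messy whitespace inside the target div.
SAMPLE_HTML = '<div><div class="location">\n   New   York,\n  NY  </div></div>'

doc = html.fromstring(SAMPLE_HTML)

# XPath equivalent of the CSS selector div[class=location]::text,
# wrapped in normalize-space(). The function trims the ends and
# collapses inner runs of whitespace into single spaces.
location = doc.xpath('normalize-space(.//div[@class="location"]/text())')

print(location)
```

Note that normalize-space() applied to a node-set only uses the first node, so this form suits elements with a single text node.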

Anyhow, as you are only interested in a final "whitespace-normalized" string, you could achieve the same with a Python function on the output of the CSS selector extract.

See for example http://snipplr.com/view/50410/normalize-whitespace/ :

import re

def normalize_whitespace(text):
    # Trim leading/trailing whitespace, then collapse inner runs to one space.
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)
    return text

If you include this function somewhere in your Scrapy project, you could use it like this:

    car['Location'] = normalize_whitespace(
        u''.join(site.css('div[class=location]::text').extract()))

or

    car['Location'] = normalize_whitespace(
        site.css('div[class=location]::text').extract()[0])
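
The attempt in the question most likely failed because .extract() returns a list of strings, while the function expects a single string. A minimal sketch of a wrapper that accepts the list directly is shown below; normalize_extracted is a hypothetical helper name, and the sample list stands in for what .extract() might return.

```python
import re

def normalize_whitespace(text):
    # Trim the ends, then collapse any run of whitespace to one space.
    return re.sub(r'\s+', ' ', text.strip())

def normalize_extracted(fragments):
    # Hypothetical helper: join the list of text fragments (as returned
    # by .extract()) into one string before normalizing it.
    return normalize_whitespace(u''.join(fragments))

# Stand-in for site.css('div[class=location]::text').extract()
extracted = [u'\n   New   York,', u'\n  NY  ']

print(normalize_extracted(extracted))
print(repr(normalize_extracted([])))  # empty selector result gives ''
```

Joining the fragments also handles the case where the selector matches nothing, whereas indexing with extract()[0] raises an IndexError on an empty list.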
