Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


html - How to avoid joining all text from Nodes when scraping

When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text strings.

For instance:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
    <p>baz</p>
  </body>
</html>
EOT

doc.search('p').text # => "foobarbaz"

But what I want is:

["foo", "bar", "baz"]

The same happens when scraping XML:

doc = Nokogiri::XML(<<EOT)
<root>
  <block>
    <entries>foo</entries>
    <entries>bar</entries>
    <entries>baz</entries>
  </block>
</root>
EOT

doc.search('entries').text # => "foobarbaz"

Why does this happen and how do I avoid it?


1 Answer


This is easily solved once you know that text behaves differently on a NodeSet than on a Node (or Element), which the documentation describes.

The NodeSet documentation says text will:

Get the inner text of all contained Node objects

Which is what we're seeing happen with:

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
    <p>baz</p>
  </body>
</html>
EOT

doc.search('p').text # => "foobarbaz"

because:

doc.search('p').class # => Nokogiri::XML::NodeSet

Instead, we want to get each Node and extract its text:

doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"

which can be done using map:

doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]

Ruby allows us to write that more concisely using:

doc.search('p').map(&:text) # => ["foo", "bar", "baz"]

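The same applies whether we're working with HTML or XML, because Nokogiri parses both into the same Node and NodeSet classes; note that the HTML example above returns Nokogiri::XML::NodeSet and Nokogiri::XML::Element.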

A Node has several aliased methods for getting at its embedded text. From the documentation:

#content ⇒ Object

Also known as: text, inner_text

Returns the contents for this Node.

