This is an easily solved problem that results from not reading the documentation about how text
behaves when used on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:
Get the inner text of all contained Node objects
Which is what we're seeing happen with:
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
which can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
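For anyone unfamiliar with the &:text shorthand, here is a minimal sketch of what it does. FakeNode is a hypothetical stand-in (not a Nokogiri class) that only responds to text, so the snippet runs without any gems:

```ruby
# FakeNode is a made-up stand-in for a node; only its #text method matters here.
FakeNode = Struct.new(:text)
nodes = [FakeNode.new('foo'), FakeNode.new('bar'), FakeNode.new('baz')]

# Explicit block: call #text on each element.
with_block = nodes.map { |node| node.text }

# &:text converts the symbol to a proc (Symbol#to_proc) and passes it as the block.
with_symbol = nodes.map(&:text)

# Roughly what happens behind the scenes.
text_proc = :text.to_proc
with_proc = nodes.map { |node| text_proc.call(node) }
```

All three forms produce the same array, so the choice is purely a matter of style.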
The same applies whether we're working with HTML or XML, because Nokogiri represents both with the same Node and NodeSet classes.
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the contents for this Node.