html - How to avoid joining all text from Nodes when scraping

Question

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text strings.

For instance:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
    <p>baz</p>
  </body>
</html>
EOT

doc.search('p').text # => "foobarbaz"

But what I want is:

["foo", "bar", "baz"]

The same happens when scraping XML:

doc = Nokogiri::XML(<<EOT)
<root>
  <block>
    <entries>foo</entries>
    <entries>bar</entries>
    <entries>baz</entries>
  </block>
</root>
EOT

doc.search('entries').text # => "foobarbaz"

Why does this happen and how do I avoid it?

1 Answer

深蓝 · Answer 1 · 2021-10-16T21:23:22+0000

This is an easily solved problem. It comes from how text behaves when called on a NodeSet versus a Node (or Element), a difference the documentation spells out.

The NodeSet documentation says text will:

Get the inner text of all contained Node objects

Which is what we're seeing happen with:

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
    <p>baz</p>
  </body>
</html>
EOT

doc.search('p').text # => "foobarbaz"

because:

doc.search('p').class # => Nokogiri::XML::NodeSet

Instead, we want to get each Node and extract its text:

doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"

which can be done using map:

doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]

Ruby allows us to write that more concisely using:

doc.search('p').map(&:text) # => ["foo", "bar", "baz"]

The same things apply whether we're working with HTML or XML, because Nokogiri returns the same NodeSet and Element classes for both, as the Nokogiri::XML class names in the HTML example above show.

A Node has several aliased methods for getting at its embedded text. From the documentation:

#content ⇒ Object

Also known as: text, inner_text

Returns the contents for this Node.
