FIX: server-side HtmlToMarkdown improvements (#9586)

FIX: server-side HtmlToMarkdown improvements (#9586)

TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown. It also adds support for converting HTML to markdown tables.

The previous ‘remove_whitespaces!’ method was traversing the whole HTML tree and used a heuristic to remove leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements)

It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example. One such example can be found here.

For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser. The rules that the browsers follow are the CSS’ White Space Processing Rules. They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following:

  • Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line)
  • Remove any leading/trailing whitespaces inside an inline context

One quick & dirty way of getting this 90% solved would be to do ‘HTML.gsub!(/[[:space:]]+/, " ")’. We would also need to hoist

 elements in order to not mess with their whitespaces.
Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more ‘.strip!’ calls than I can bear.

I decided to “emulate” the browser’s handling of whitespaces and came up with a solution in 4 parts

  1. remove_not_allowed!

The HtmlToMarkdown library is recursively “visiting” all the nodes in the HTML in order to convert them to Markdown. All the nodes that aren’t handled by the library (eg. , or any non-textual HTML tags) are “swallowed”. In order to reduce the number of nodes visited, the method ‘remove_not_allowed!’ will automatically delete all the nodes that have no “visitor” (eg. a ‘visit_’ method) defined.

  1. remove_hidden!

Similar purpose as the previous method (eg. reducing number of nodes visited), there’s no point trying to convert something that is hidden. The ‘remove_hidden!’ method removes any nodes that was hidden using the “hidden” HTML attribute, some CSS or with a width or height equal to 0.

  1. hoist_line_breaks!

The ‘hoist_line_breaks!’ method is there to handle
tags. I know those tiny
don’t do much but they can be quite annoying. The
tags are inline elements but they visually work like a block element (ie. they create a new line). If you have the following HTML “Foo
Bar
”, it ends up visually similar to “Foo
Bar”. The latter being much more easy to process than the former, so that’s what this method is doing. The “hoist_line_breaks” will hoist
tags out of inline tags until their parent is a block element.

  1. remove_whitespaces!

The “remove_whitespaces!” is where all the whitespace removal is happening. It’s broken down into 4 methods as well

  • remove_whitespaces!
  • is_inline?
  • collapse_spaces!
  • remove_trailing_space!

The ‘remove_whitespace!’ method is recursively walking the HTML tree (skipping

 tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunks of inline elements, it will call the “collapse_space!” and “remove_trailing_space!” methods.
For each chunks of block elements, it will call “remote_whitespace!” to keep walking the HTML tree recursively.

The “is_inline?” method determines whether a node is part of a inline context. A node is inline iif it’s a text node or it’s an inline tag, but not
, and all its children are also inline.

The “collapse_spaces!” method will collapse any kind of (white) space into a single space (" “) character, even accros tags. For example, if we have " Foo \n Bar \t42”, it will return “Foo Bar 42”.

Finally, the “remove_trailing_space!” method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof. It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown FIX: take into account the ‘allowed_href_schemes’ site setting when converting HTML to Markdown FIX: added support for ‘mailto:’ scheme when converting from HTML to Markdown FIX: added support for dimensions when converting from HTML to Markdown FIX: added support for

,
and
when converting from HTML to Markdown FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown FIX: added support for when converting from HTML to Markdown DEV: remove unused ‘sanitize’ gem

Wow, did you just read all that?! Congratz, here’s a cookie: :cookie:.

diff --git a/Gemfile b/Gemfile
index 39b714d..a0f43d3 100644
--- a/Gemfile
+++ b/Gemfile
@@ -118,7 +118,6 @@ gem 'rake'
 gem 'thor', require: false
 gem 'diffy', require: false
 gem 'rinku'
-gem 'sanitize'
 gem 'sidekiq'
 gem 'mini_scheduler'
 
diff --git a/Gemfile.lock b/Gemfile.lock
index b059691..b428497 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -533,7 +533,6 @@ DEPENDENCIES
   ruby-prof
   ruby-readability
   rubyzip
-  sanitize
   sassc (= 2.0.1)
   sassc-rails
   seed-fu
diff --git a/lib/html_to_markdown.rb b/lib/html_to_markdown.rb
index a3eb5e8..448a0cd 100644
--- a/lib/html_to_markdown.rb
+++ b/lib/html_to_markdown.rb
@@ -1,267 +1,365 @@
 # frozen_string_literal: true
 
-require "nokogiri"
+require "nokogumbo"
+require "securerandom"
 
 class HtmlToMarkdown
 
-  class Block < Struct.new(:name, :head, :body, :opened, :markdown)
-    def initialize(name, head = "", body = "", opened = false, markdown = +"")
-      super
-    end
-  end
-
   def initialize(html, opts = {})
-    @opts = opts || {}
-    @doc = fix_span_elements(Nokogiri::HTML(html))
+    @opts = opts
+
+    # we're only interested in <body>
+    @doc = Nokogiri::HTML5(html).at("body")
 
-    remove_whitespaces!
+    remove_not_allowed!(@doc)
+    remove_hidden!(@doc)
+    hoist_line_breaks!(@doc)
+    remove_whitespaces!(@doc)
+  end
+
+  def to_markdown
+    traverse(@doc)
+      .gsub(/\n{2,}/, "\n\n")
+      .strip
   end
 
-  # If a `<div>` is within a `<span>` that's invalid, so let's hoist the `<div>` up
-  INLINE_ELEMENTS ||= %w{span font}
-  BLOCK_ELEMENTS ||= %w{div p}
-  def fix_span_elements(node)
-    if (INLINE_ELEMENTS.include?(node.name) && BLOCK_ELEMENTS.any? { |e| node.at(e) })
-      node.swap(node.children)
+  private
+
+  def remove_not_allowed!(doc)
+    allowed = Set.new
+
+    HtmlToMarkdown.private_instance_methods.each do |m|
+      if tag = m.to_s[/^visit_(.+)/, 1]
+        allowed << tag
+      end
     end
 
-    node.children.each { |c| fix_span_elements(c) }
-    node
+    @doc.traverse { |node| node.remove if !allowed.include?(node.name) }
+  end
+
+  HIDDEN_STYLES ||= /(display\s*:\s*none)|(visibility\s*:\s*hidden)|(opacity\s*:\s*0)|(transform\s*:\s*scale\(0\))|((width|height)\s*:\s*0)/i
+
+  def remove_hidden!(doc)
+    @doc.css("[hidden]").remove
+    @doc.css("[style]").each { |n| n.remove if n["style"][HIDDEN_STYLES] }
+    @doc.css("img[width]").each { |n| n.remove if n["width"].to_i <= 0 }
+    @doc.css("img[height]").each { |n| n.remove if n["height"].to_i <= 0 }
   end
 
-  def remove_whitespaces!
-    @doc.traverse do |node|
-      if node.is_a?(Nokogiri::XML::Text) && node.parent.name != "pre"
-        node.content = node.content.gsub(/\A[[:space:]]+/, "") if node.previous_element&.description&.block?
-        node.content = node.content.gsub(/\A[[:space:]]+/, "") if node.previous_element.nil? && node.parent.description&.block?
-        node.content = node.content.gsub(/[[:space:]]+\z/, "") if node.next_element&.description&.block?
-        node.content = node.content.gsub(/[[:space:]]+\z/, "") if node.next_element.nil? && node.parent.description&.block?
-        node.content = node.content.gsub(/\r\n?/, "\n")
-        node.remove if node.content.empty?
+  # When there's a <br> inside an inline element, split the inline element around the <br>
+  def hoist_line_breaks!(doc)
+    klass = "_" + SecureRandom.hex
+    doc.css("br").each { |br| br.add_class(klass) }
+
+    loop do
+      changed = false
+
+      doc.css("br.#{klass}").each do |br|
+        parent = br.parent
+
+        if parent.description.block?
+          br.remove_class(klass)
+        else
+          before, after = parent.children.slice_when { |n| n == br }.to_a
+
+          if before.size > 1
+            b = Nokogiri::XML::Node.new(parent.name, doc)
+            before[0...-1].each { |c| b.add_child(c) }
+            parent.previous = b if b.inner_html.present?
+          end
+
+          if after.present?
+            a = Nokogiri::XML::Node.new(parent.name, doc)
+            after.each { |c| a.add_child(c) }
+            parent.next = a if a.inner_html.present?
+          end
+
+          parent.replace(br)
+
+          changed = true
+        end
       end
+
+      break if !changed
     end
   end
 
-  def to_markdown
-    @stack = [Block.new("root")]
-    @markdown = +""
-    traverse(@doc)
-    @markdown << format_block
-    @markdown.gsub!(/\n{3,}/, "\n\n")
-    @markdown.strip!
-    @markdown
+  # Removes most of the unnecessary white spaces for better markdown conversion
+  # Loosely based on the CSS' White Space Processing Rules (https://www.w3.org/TR/css-text-3/#white-space-rules)
+  def remove_whitespaces!(node)
+    return true if "pre" == node.name
+
+    node.children.chunk { |n| is_inline?(n) }.each do |inline, nodes|
+      if inline
+        collapse_spaces!(nodes) && remove_trailing_space!(nodes)
+      else
+        nodes.each { |n| remove_whitespaces!(n) }
+      end
+    end
   end
 
-  def traverse(node)
-    node.children.each { |n| visit(n) }
+  def is_inline?(node)
+    node.text? || ("br" != node.name && node.description.inline? && node.children.all? { |n| is_inline?(n) })
   end
 
-  def visit(node)
-    return if node["style"] && node["style"][/display\s*:\s*none/]
-
-    if node.description&.block? && node.parent&.description&.block? && @stack[-1].markdown.size > 0
-      block = @stack[-1].dup
-      @markdown << format_block
-      block.markdown = +""
-      block.opened = true
-      @stack << block
+  def collapse_spaces!(nodes, was_space = true)
+    nodes.each do |node|
+      if node.text?
+        text = String.new
+
+        node.text.chars.each do |c|
+          if c[/[[:space:]]/]
+            text << " " if !was_space
+            was_space = true
+          else
+            text << c
+            was_space = false
+          end
+        end
+
+        node.content = text
+      else
+        node.children.each { |n| was_space = collapse_spaces!([n], was_space) }
+      end
     end
 
-    visitor = "visit_#{node.name}"
-    respond_to?(visitor) ? send(visitor, node) : traverse(node)
+    was_space
   end
 
-  BLACKLISTED ||= %w{button datalist fieldset form input label legend meter optgroup option output progress select textarea style script}
-  BLACKLISTED.each do |tag|
-    class_eval <<-RUBY
-      def visit_#{tag}(node)
-        ""
-      end
-    RUBY
+  def remove_trailing_space!(nodes)
+    last = nodes[-1]
+
+    if last.text?
+      last.content = last.content[0...-1] if last.content[-1] == " "
+    elsif last.children.present?
+      remove_trailing_space!(last.children)
+    end
   end
 
-  def visit_pre(node)
-    code = node.children.find { |c| c.name == "code" }
-    code_class = code ? code["class"] : ""
-    lang = code_class ? code_class[/lang-(\w+)/, 1] : ""
-    pre = Block.new("pre")
-    pre.markdown = +"`‍``#{lang}\n"
-    @stack << pre
-    traverse(node)
-    pre.markdown << "\n`‍``\n"
-    @markdown << format_block
+  def traverse(node)
+    node.children.map { |n| visit(n) }.join
   end
 
-  def visit_blockquote(node)
-    @stack << Block.new("blockquote", "> ", "> ")
-    traverse(node)
-    @markdown << format_block
-  end
-
-  BLOCK_WITH_NEWLINE ||= %w{div p}
-  BLOCK_WITH_NEWLINE.each do |tag|
-    class_eval <<-RUBY
-      def visit_#{tag}(node)
-        @stack << Block.new("#{tag}")
-        traverse(node)
-        @markdown << format_block
-        @markdown << "\n"
-      end
-    RUBY
+  def visit(node)
+    visitor = "visit_#{node.name}"
+    send(visitor, node) if respond_to?(visitor, true)
   end
 
-  BLOCK_LIST ||= %w{menu ol ul}
-  BLOCK_LIST.each do |tag|
-    class_eval <<-RUBY
-      def visit_#{tag}(node)
-        @stack << Block.new("#{tag}")
-        traverse(node)
-        @markdown << format_block
-      end
-    RUBY
+  ALLOWED_IMG_SRCS ||= %w{http:// https:// www.}
+
+  def allowed_hrefs
+    @allowed_hrefs ||= begin

[... diff too long, it was truncated ...]

GitHub sha: 501b19b6

This commit appears in #9586 which was approved by eviltrout. It was merged by ZogStriP.