Sanitize invalid XML characters in text content

Strip invalid XML 1.0 control characters (0x00-0x08, 0x0B-0x0C, 0x0E-0x1F)
from text to prevent corrupted docx files that fail to open in LibreOffice.

Fixes SAXParseException 'PCData Invalid Char value' errors.
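
The character ranges listed above can be cross-checked against the XML 1.0 `Char` production. The following is a minimal sketch, not part of this commit: `XML_CHAR_RANGES` and `valid_xml_char?` are hypothetical names introduced only for illustration.

```ruby
# frozen_string_literal: true

# XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# (Hypothetical constant, not part of the Notare gem.)
XML_CHAR_RANGES = [
  0x9..0x9, 0xA..0xA, 0xD..0xD,
  0x20..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF
].freeze

# Hypothetical helper: true if the code point is allowed in XML 1.0 text.
def valid_xml_char?(codepoint)
  XML_CHAR_RANGES.any? { |range| range.cover?(codepoint) }
end

# Collect the C0 control code points that the production rejects.
invalid_controls = (0x00..0x1F).reject { |cp| valid_xml_char?(cp) }
```

Under these assumptions, `invalid_controls` holds 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F, which are the ranges this commit strips. The production also excludes surrogates and 0xFFFE/0xFFFF, which this commit does not address.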
2026-01-22 09:10:33 +01:00
parent 8b4f538cbb
commit 64c8679044
6 changed files with 108 additions and 2 deletions


@@ -8,7 +8,7 @@ module Notare
     def initialize(text, bold: false, italic: false, underline: false,
                    strike: false, highlight: nil, color: nil, style: nil)
       super()
-      @text = text
+      @text = XmlSanitizer.sanitize(text)
       @bold = bold
       @italic = italic
       @underline = underline


@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Notare
-  VERSION = "0.0.5"
+  VERSION = "0.0.6"
 end


@@ -0,0 +1,15 @@
+# frozen_string_literal: true
+module Notare
+  module XmlSanitizer
+    # Invalid XML 1.0 characters: 0x00, 0x01-0x08, 0x0B-0x0C, 0x0E-0x1F
+    # Valid whitespace preserved: 0x09 (tab), 0x0A (LF), 0x0D (CR)
+    INVALID_XML_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F]/
+    def self.sanitize(text)
+      return text unless text.is_a?(String)
+      text.gsub(INVALID_XML_CHARS, "")
+    end
+  end
+end
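
A minimal sketch of how the sanitizer added in this commit behaves. The module is reproduced standalone so the snippet runs outside the gem. The sample strings and variable names are illustrative assumptions, not taken from the project's test suite.

```ruby
# frozen_string_literal: true

# Standalone copy of the module from this commit, so the sketch runs on its own.
module Notare
  module XmlSanitizer
    INVALID_XML_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

    def self.sanitize(text)
      return text unless text.is_a?(String)

      text.gsub(INVALID_XML_CHARS, "")
    end
  end
end

# Illustrative input containing NUL, vertical tab and unit separator.
dirty = "Hello\x00 wor\x0Bld\x1F"
clean = Notare::XmlSanitizer.sanitize(dirty)

# Tab, LF and CR are valid XML 1.0 whitespace and pass through untouched.
whitespace = Notare::XmlSanitizer.sanitize("a\tb\nc\rd")

# Non-String values are returned unchanged.
passthrough = Notare::XmlSanitizer.sanitize(nil)
```

Because the text node calls `XmlSanitizer.sanitize` in its constructor, the stored `@text` is already clean by the time it is serialized, so no later step needs to repeat the check.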