Sanitize invalid XML characters in text content

Strip invalid XML 1.0 control characters (0x00-0x08, 0x0B-0x0C, 0x0E-0x1F)
from text to prevent corrupted docx files that fail to open in LibreOffice.

Fixes SAXParseException 'PCData Invalid Char value' errors.
2026-01-22 09:10:33 +01:00
parent 8b4f538cbb
commit 64c8679044
6 changed files with 108 additions and 2 deletions

@@ -111,4 +111,21 @@ class ParagraphTest < Minitest::Test
    # Newlines should be preserved in the text
    assert_includes xml, "Line 1\nLine 2\nLine 3"
  end

  def test_invalid_xml_characters_are_stripped
    xml = create_doc_and_read_xml do |doc|
      doc.p "infrastruktur\x02bidrag"
      doc.p "hello\x00world"
      doc.p "test\x01\x03\x04value"
    end

    # Invalid characters should be stripped
    assert_includes xml, "infrastrukturbidrag"
    assert_includes xml, "helloworld"
    assert_includes xml, "testvalue"

    # Verify the XML is valid by parsing it (will raise if invalid)
    doc = Nokogiri::XML(xml, &:strict)
    assert doc.errors.empty?, "XML should be valid: #{doc.errors}"
  end
end
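
The sanitizing code itself is not part of this hunk. As a rough sketch only, the character ranges named in the commit message (0x00-0x08, 0x0B-0x0C, 0x0E-0x1F) could be stripped with a single regex. The constant `INVALID_XML_CHARS` and the helper `strip_invalid_xml_chars` are hypothetical names chosen for illustration and are not taken from the commit.

```ruby
# Hypothetical helper; names are assumptions, not the commit's actual code.
# Matches the control characters that XML 1.0 forbids in character data.
# Tab (0x09), line feed (0x0A) and carriage return (0x0D) are left alone.
INVALID_XML_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

def strip_invalid_xml_chars(text)
  # to_s lets the helper accept nil or non-string values without raising
  text.to_s.gsub(INVALID_XML_CHARS, "")
end
```

Under these assumptions, `strip_invalid_xml_chars("infrastruktur\x02bidrag")` returns `"infrastrukturbidrag"`, while newlines and tabs pass through unchanged, which matches what the tests above expect.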