Sanitize invalid XML characters in text content
All checks were successful
CI Pipeline / build (push) Successful in 49s
Strip invalid XML 1.0 control characters (0x00-0x08, 0x0B-0x0C, 0x0E-0x1F) from text to prevent corrupted docx files that fail to open in LibreOffice. Fixes SAXParseException 'PCData Invalid Char value' errors.
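The stripping step described above can be sketched roughly as follows. This is a minimal illustration, not the library code changed by this commit (that code is not shown in this diff). The constant INVALID_XML_CHARS and the method sanitize_xml_text are hypothetical names chosen for the example. The character class mirrors the ranges listed in the commit message: every control character below 0x20 except tab (0x09), line feed (0x0A), and carriage return (0x0D), which XML 1.0 allows.

```ruby
# Hypothetical helper for illustration only; the names below are assumptions
# and do not come from the commit.
#
# Control characters that XML 1.0 does not allow in character data:
# 0x00-0x08, 0x0B-0x0C, 0x0E-0x1F. Tab, LF, and CR are deliberately excluded
# from the class so they survive sanitization.
INVALID_XML_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

# Returns a copy of +text+ with the disallowed control characters removed.
# Non-string input is converted with #to_s first.
def sanitize_xml_text(text)
  text.to_s.gsub(INVALID_XML_CHARS, "")
end

sanitize_xml_text("hello\x00world")          # => "helloworld"
sanitize_xml_text("Line 1\nLine 2\tindent")  # unchanged: \n and \t are valid
```

Removing the characters, rather than escaping them, matches what the test below expects: a numeric character reference such as &#2; is also rejected by XML 1.0 parsers, so escaping would not produce a valid document.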
@@ -111,4 +111,21 @@ class ParagraphTest < Minitest::Test
     # Newlines should be preserved in the text
     assert_includes xml, "Line 1\nLine 2\nLine 3"
   end
+
+  def test_invalid_xml_characters_are_stripped
+    xml = create_doc_and_read_xml do |doc|
+      doc.p "infrastruktur\x02bidrag"
+      doc.p "hello\x00world"
+      doc.p "test\x01\x03\x04value"
+    end
+
+    # Invalid characters should be stripped
+    assert_includes xml, "infrastrukturbidrag"
+    assert_includes xml, "helloworld"
+    assert_includes xml, "testvalue"
+
+    # Verify the XML is valid by parsing it (will raise if invalid)
+    doc = Nokogiri::XML(xml, &:strict)
+    assert doc.errors.empty?, "XML should be valid: #{doc.errors}"
+  end
 end