Error saving files with embedded XML text and CDATA marker

Description

This issue was reported by the Google team today. The Ocelot version is v3.0-rc7.
They got an error when trying to save some files. I've attached one of those files to the issue.

The Ocelot log reports the following error:
String ']]>' not allowed in textual content, except as the end marker of CDATA section

The attached file contains some XML text embedded in iws tags. In this text, the '<' and '>' characters are encoded as '&lt;' and '&gt;'.
When Okapi reads the file, it changes '&gt;' to '>'. The problem arises at the end of a CDATA marker: the string "&lt; ! [ CDATA[....]]&gt;" becomes "&lt; ! [ CDATA[...]]>", and the parser raises an error when Ocelot tries to save this content.
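
The well-formedness rule behind this error can be checked with any conforming XML parser, independently of Ocelot and Okapi. Below is a minimal sketch in Python; the sample strings are illustrative and are not taken from the attached file, and `is_well_formed` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# A bare '>' in text content is well-formed XML.
ok = ET.fromstring("<iws>a > b</iws>")

# With both '<' and '>' escaped, the embedded CDATA marker is plain text.
escaped = ET.fromstring("<iws>&lt;![CDATA[x]]&gt;</iws>")


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Once '&gt;' is written back as '>', the text contains a literal ']]>',
# which XML forbids outside a real CDATA section.
broken = "<iws>&lt;![CDATA[x]]></iws>"
print(is_well_formed(broken))
```

The third document has the same shape as the one described above: the opening '<' is still escaped, so there is no real CDATA section for the ']]>' to close.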

Environment

None

Activity

Chase Tingley
July 8, 2017, 7:56 AM

Hmm, maybe this is a bug in Okapi after all. Okapi usually writes out &gt; as > in PCDATA, because a bare > is valid XML there.

However, in this one case the escaping seems bad.

There's a way to specify a more conservative escaping mode in Okapi by setting the escapeGT flag on the encoder object. I will need to do some more digging to see if that's easy for Ocelot to get at, when it's not so late at night.
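
To illustrate the difference between the two escaping modes, here is a rough sketch in Python. It is not Okapi's encoder code: the function `escape_text` is hypothetical, and its `escape_gt` parameter is only named after the flag mentioned above. It assumes the minimal mode is meant to escape '>' only where it would complete a ']]>' sequence.

```python
import xml.etree.ElementTree as ET


def escape_text(text: str, escape_gt: bool = False) -> str:
    """Escape text for use as XML element content (hypothetical helper)."""
    out = text.replace("&", "&amp;").replace("<", "&lt;")
    if escape_gt:
        # Conservative mode: escape every '>'.
        return out.replace(">", "&gt;")
    # Minimal mode: escape '>' only where it would complete ']]>'.
    return out.replace("]]>", "]]&gt;")


sample = "<![CDATA[x > y]]>"
for mode in (False, True):
    encoded = escape_text(sample, escape_gt=mode)
    # Both modes must round-trip through a parser unchanged.
    assert ET.fromstring(f"<iws>{encoded}</iws>").text == sample
```

The conservative mode produces slightly noisier output, but it cannot miss the ']]>' case because it never needs to look at the surrounding characters.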

Marta Borriello
July 10, 2017, 8:33 AM

I've tried to set the escapeGT flag in the OkapiXLIFF12Parser.parse method and the file is properly saved.

That said, Yves Savourel thinks it could be a bug in Okapi, since the case without the flag should also work. Indeed, there are many CDATA markers in this file, and the Okapi filter fails to escape only the sequence at line 3922.

For now, though, I suggest setting the escapeGT flag, so that we can be sure all CDATA sequences are properly escaped.

Chase Tingley
July 10, 2017, 5:45 PM

Sounds good!

Chase Tingley
July 10, 2017, 5:47 PM

Oh, looking at the code in Okapi, I think I know what's wrong. The XMLEncoder class can be called in a bunch of different ways, and they do not all apply the same logic. Some of them check for the [> case, but it looks like not all of them do, so that may be where the bug is coming from.

I was thinking about refactoring that code a while back, I knew I should have done it!

Anyways escapeGT is the safe move here. I will file a bug on Okapi.

Assignee

Marta Borriello

Reporter

Marta Borriello

Labels

None

Priority

Major