Data files for training and testing risk prediction include at least 3 columns - the original, the machine translation and final human post-edited translation or label. Common parallel data file formats like XML and XLIFF do not support that.
The ModelFront console supports uploading and downloading the Linear TSV file format for training, testing and using risk prediction because it's simple, standardized and scalable.
The control characters
carriage return (
\r) can be included by escaping them with a preceding
The literal character
\ should also be escaped with
\. Therefore the literals
\r should be double escaped.
The segments can be plaintext or XML/HTML.
For a post-edited segment where the original is
Apple machines, the machine translation is
Äpfelmaschinen and the final human post-edited translation is
Apple-Geräte, the line will just be:
Apple machines Äpfelmaschinen Apple-Geräte
Note the 2 tabs separating the 3 columns!
For a segment that includes linebreaks when displayed, the newline character is escaped:
Machines:\nApple Maschinen:\nÄpfel Geräte:\nApple
For a segment with quotation marks, there is no escaping:
Clinton: "It depends on what the meaning of the word 'is' is." Clinton: "Depende del significado de la palabra 'es'". Clinton: "Depende del significado de la palabra 'es'".
If there is metadata, it can be provided in additional columns.
⚠️ Do not just use a CSV a library and replace commas with tabs! The quotations marks will be incorrectly escaped, and the control characters may not be escaped at all.
def escape(s): # Split by literal slash, to preserve them parts = s.split('\\') # Just a single slash, slash-escaped # Replace control characters with their slash-escaped versions esc = lambda s: s.replace('\t', '\\t').replace('\n', '\\n').replace('\r', '\\r')) parts = map(esc, parts) # Replace the literal slashes with their slash-escaped versions return '\\\\'.join(parts) # Just 2 slashes, slash-escaped
For natural language text data, especially parallel data, TSV has key advantages over CSV, XML and JSON.
CSV, XML and JSON use delimiters and control characters like commas, brackets and quotes, which occur often in natural language text, and therefore require quoting or encoding.
The tab delimiter is whitespace and does not occur frequently.
There are many conflicting dialects and specifications of CSV and XML.
The Linear TSV standard is the main TSV standard.
ModelFront is built to handle very large files. TSV can be read in line by line - without reading the whole file into memory. TSV is also more compact than CSV, XML or JSON
The built-in Unix command-line tools like
paste read and write TSV by default and fundamentally operate at the line level.
en.wikipedia.org › wiki › Tab-separated values