ModelFront docs

Linear TSV documentation

About the Linear TSV standard format

Data files for training and testing risk prediction include at least 3 columns: the original, the machine translation, and the final human post-edited translation or a label. Common parallel data file formats like XML and XLIFF do not support that.

The ModelFront console supports uploading and downloading the Linear TSV file format for training, testing and using risk prediction because it's simple, standardized and scalable.

Format specification

The control characters tab, newline and carriage return can be included in a segment by writing them as the escape sequences \t, \n and \r.

A literal \ must also be escaped, as \\. A literal \ followed by t, n or r is therefore written as \\t, \\n or \\r.

The segments can be plaintext or XML/HTML.
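The rules above can be read in reverse to decode a line. The following is a minimal sketch, not ModelFront's own implementation: the names unescape and parse_line are hypothetical, and leaving unknown escape sequences untouched is an assumption made here for illustration.

```python
import re

# Maps the character that follows a backslash to the character it stands for.
_DECODE = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}

def unescape(field):
    """Decode one escaped column back to the original text (hypothetical helper)."""
    # A single left-to-right pass, so that an escaped backslash followed by "n"
    # decodes to a backslash and an "n", not to a newline.
    # Assumption: unknown sequences are left exactly as they were found.
    return re.sub(r"\\(.)", lambda m: _DECODE.get(m.group(1), m.group(0)), field)

def parse_line(line):
    """Split one line of the file into its decoded columns (hypothetical helper)."""
    return [unescape(column) for column in line.rstrip("\r\n").split("\t")]
```

Decoding in one pass matters: chaining several replace calls can misread an escaped backslash followed by n as a newline.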

Examples

For a post-edited segment where the original is Apple machines, the machine translation is Äpfelmaschinen and the final human post-edited translation is Apple-Geräte, the line will just be:

Apple machines	Äpfelmaschinen	Apple-Geräte

Note the 2 tabs separating the 3 columns!

Control characters

For a segment that includes linebreaks when displayed, the newline character is escaped:

Machines:\nApple	Maschinen:\nÄpfel	Geräte:\nApple

Quotation marks

For a segment with quotation marks, there is no escaping:

Clinton: "It depends on what the meaning of the word 'is' is."	Clinton: "Depende del significado de la palabra 'es'".	Clinton: "Depende del significado de la palabra 'es'".

If there is metadata, it can be provided in additional columns.

Example code

⚠️ Do not just use a CSV library and replace commas with tabs! The quotation marks will be incorrectly escaped, and the control characters may not be escaped at all.
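To make the warning concrete, here is a small sketch of what Python's standard csv module does when it is only told to use a tab as the delimiter. The helper name csv_with_tabs and the sample segments are hypothetical.

```python
import csv
import io

def csv_with_tabs(row):
    """Write one row with the csv module, using a tab as the delimiter."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(row)
    return buf.getvalue()

# Quotation marks: the whole column is wrapped in quotes and the inner quotes are doubled.
quoted = csv_with_tabs(['He said "hi"', 'Er sagte "hallo"'])

# Newlines: the column is wrapped in quotes, but the newline itself is written raw,
# so a single row is spread over several physical lines.
multiline = csv_with_tabs(["Machines:\nApple", "Maschinen:\nÄpfel"])
```

Neither result is valid Linear TSV: the first adds quoting that the format does not use, and the second breaks the rule that each row occupies exactly one line.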

Python:

def escape(s):

  # Split on literal backslashes, to preserve them
  parts = s.split('\\') # Just a single backslash, backslash-escaped

  # Replace control characters with their backslash-escaped versions
  esc = lambda part: part.replace('\t', '\\t').replace('\n', '\\n').replace('\r', '\\r')
  parts = map(esc, parts)

  # Replace the literal backslashes with their backslash-escaped versions
  return '\\\\'.join(parts) # Just 2 backslashes, backslash-escaped
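To produce a complete line, each column is escaped and the columns are joined with tabs. The sketch below is one possible way to do that, using a single-pass regular expression as an alternative to the split-and-join approach. The names escape_field and format_row are hypothetical, and the fixed 3-column layout is an assumption taken from the examples on this page.

```python
import re

# Maps each special character to its escaped form.
_ENCODE = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}

def escape_field(s):
    """Escape one column in a single pass (hypothetical helper)."""
    # Each backslash, tab, newline or carriage return is replaced exactly once,
    # so a backslash that was added by escaping is never escaped again.
    return re.sub(r"[\\\t\n\r]", lambda m: _ENCODE[m.group(0)], s)

def format_row(original, machine_translation, post_edit):
    """Build one line: 3 escaped columns, 2 tabs, 1 trailing newline."""
    columns = (original, machine_translation, post_edit)
    return "\t".join(escape_field(c) for c in columns) + "\n"
```

Additional metadata columns could be appended in the same way, as long as every column goes through the same escaping.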

Why TSV

For natural language text data, especially parallel data, TSV has key advantages over CSV, XML and JSON.

Human-readability

CSV, XML and JSON use delimiters and control characters like commas, brackets and quotes, which occur often in natural language text, and therefore require quoting or encoding.

The tab delimiter is whitespace and does not occur frequently.

Standardization

There are many conflicting dialects and specifications of CSV and XML.

The Linear TSV standard is the main TSV standard.

Scalability

ModelFront is built to handle very large files. TSV can be read line by line, without reading the whole file into memory. TSV is also more compact than CSV, XML or JSON.
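Because every row is exactly one line, a file can be processed as a stream. The following is a minimal sketch under a few assumptions: the file is UTF-8 encoded, the names iter_rows and count_rows are hypothetical, and the columns are returned raw, without decoding the escape sequences.

```python
def iter_rows(path):
    """Yield each row as a list of raw, still-escaped columns, one line at a time."""
    # Assumption: the file is UTF-8. newline="\n" disables newline translation,
    # so any trailing carriage return is stripped explicitly below.
    with open(path, encoding="utf-8", newline="\n") as f:
        for line in f:
            yield line.rstrip("\r\n").split("\t")

def count_rows(path, min_columns=3):
    """Count the rows with at least min_columns columns without loading the file."""
    return sum(1 for row in iter_rows(path) if len(row) >= min_columns)
```

Only one line is held in memory at a time, so memory use stays roughly constant regardless of file size.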

Convenience

The built-in Unix command-line tools like cut and paste read and write TSV by default and fundamentally operate at the line level.

More reading

Conventions for lossless conversion to TSV

en.wikipedia.org › wiki › Tab-separated values

Linear TSV


© 2022 ModelFront Inc.