Pivot joining 19 million lines of CSV with unix tools

I’m participating in Kodoeba, an open hackathon around the Tatoeba example sentence project.

My plan is to incorporate Mandarin-English example sentence pairs from Tatoeba into the Chinese example sentence search on Chinese Boost, mainly to allow searching them via pinyin with or without tone-marks.

The first challenge was actually to extract the Mandarin-English example sentence pairs from Tatoeba’s data files.

The data is structured in three separate TSV files (tab-separated values, equivalent to CSV):

These have columns of data, most importantly an id column which identifies each sentence:

6539139	cmn	原來如此。
1284	eng	I will be back soon.

The links.tsv file is like a pivot table that links the pairs of sentences together:

995325	989154

This would be fairly easy to join together in a naïve way using some loops and hashes, but the total number of lines across the files is quite large:

wc -l cmn_sentences.tsv eng_sentences.tsv links.tsv
    62829 cmn_sentences.tsv
  1327755 eng_sentences.tsv
 17670835 links.tsv

 19061419 total

The join tool can handle this with a bit of fiddling around, though. The join tool seems to be using a binary search to do the joining, as it requires that the input is sorted. We have the sort tool for that.

Sorting one of the TSV files by the id column with sort looks like this:

sort -t $'\t' -k 1 cmn_sentences.tsv

The -t is indicating that the column separator is a tab character, and the -k is saying that we want to sort on the first column.

That gives output like:

989154	cmn	你能证明吗
989157	cmn	他的故事肯定是真的。

We can then piece that together with join to join the sorted Mandarin sentences with the sorted links like this:

join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv)

That gives us a list of rows with the Mandarin sentence and the English sentence id on the end:

995177	cmn	他有太多書。	302096

We can then use that to join with the English sentences, but we need to sort it by the English sentence id on the end first, like this:

join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4

Finally, we can put the whole thing together to join the English and Mandarin sentences via the links:

join -t $'\t' -1 1 -2 4 <(sort -t $'\t' -k 1 eng_sentences.tsv) <(join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4)

This joins across the 19 million rows pretty quickly, according to time:

real	0m14.032s
user	0m6.555s
sys	0m0.374s

And we get a final output of Mandarin-English sentence pairs like this:

995333	eng	I really like travelling.	984480	cmn	我很喜欢旅游。

Tech mentioned