I’m participating in
Kodoeba, an open
hackathon around the Tatoeba example sentence project.
My plan is to incorporate Mandarin-English example sentence pairs from Tatoeba
into the Chinese example sentence
search on Chinese
Boost, mainly to allow searching them via pinyin with or without tone-marks.
The first challenge was actually to extract the Mandarin-English example
sentence pairs from Tatoeba’s data files.
The data is structured in three separate TSV files (tab-separated values,
equivalent to CSV):
These have columns of data, most importantly an id column which identifies each
1284 eng I will be back soon.
links.tsv file is like a pivot table that links the pairs of sentences
This would be fairly easy to join together in a naïve way using some loops and
hashes, but the total number of lines across the files is quite large:
wc -l cmn_sentences.tsv eng_sentences.tsv links.tsv
join tool can handle this with a bit of fiddling around, though. The
join tool seems to be using a binary search to do the joining, as it requires
that the input is sorted. We have the
sort tool for that.
Sorting one of the TSV files by the id column with
sort looks like this:
sort -t $'\t' -k 1 cmn_sentences.tsv
-t is indicating that the column separator is a tab character, and the
-k is saying that we want to sort on the first column.
That gives output like:
989154 cmn 你能证明吗
989157 cmn 他的故事肯定是真的。
We can then piece that together with
join to join the sorted Mandarin
sentences with the sorted links like this:
join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv)
That gives us a list of rows with the Mandarin sentence and the English sentence
id on the end:
We can then use that to join with the English sentences, but we need to sort it
by the English sentence id on the end first, like this:
join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4
Finally, we can put the whole thing together to join the English and Mandarin
sentences via the links:
join -t $'\t' -1 1 -2 4 <(sort -t $'\t' -k 1 eng_sentences.tsv) <(join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4)
This joins across the 19 million rows pretty quickly, according to
And we get a final output of Mandarin-English sentence pairs like this:
995333 eng I really like travelling. 984480 cmn 我很喜欢旅游。
Pivot joining 19 million lines of CSV with unix tools