parallelized fixing alignments when language tags are used
onadegibert committed Jul 2, 2024
1 parent 0a571e5 commit d0dd5ab
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion pipeline/alignment/generate-alignment-and-shortlist.sh
@@ -82,8 +82,10 @@ rm -rf "${dir}"

 # If there are language tags, we need to modify the alignments by adding index 1 to every source token
 if [ $o2m_student == "True" ]; then
+
 echo "###### Correcting alignments taking into account language tags"
-pigz -dc "${output_dir}/corpus.aln.gz" | sed -E 's/([0-9]+)-([0-9]+)/echo $((\1+1))"-\2"/ge' | sed 's/echo //g' | gzip > "${output_dir}/corpus.aln.fixed.gz"
+pigz -dc "${output_dir}/corpus.aln.gz" | parallel --no-notice --pipe -k -j "${threads}" --block 50M \
+'sed -E "s/([0-9]+)-([0-9]+)/echo \$((\1+1))-\2/ge" | sed "s/echo //g"'| gzip > "${output_dir}/corpus.aln.fixed.gz"
 mv "${output_dir}/corpus.aln.fixed.gz" "${output_dir}/corpus.aln.gz"
 fi
 echo "###### Done: Generating alignments and shortlist"
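
Note on the transformation: the snippet below is a minimal sketch of what "adding index 1 to every source token" does to a single alignment line. It is not the code from this commit. The function name `shift_source_indices` is hypothetical, and it uses awk rather than the commit's `sed ... /ge` approach, assuming the usual whitespace-separated `source-target` pair format.

```shell
# Hypothetical helper, not part of the commit: shift the source index of
# every "src-tgt" pair by one, leaving the target index unchanged.
shift_source_indices() {
  awk '{
    for (i = 1; i <= NF; i++) {
      split($i, pair, "-")              # pair[1] = source index, pair[2] = target index
      $i = (pair[1] + 1) "-" pair[2]    # the language tag occupies source position 0
    }
    print
  }'
}

# Example: one alignment line with three pairs.
printf '0-0 1-2 2-1\n' | shift_source_indices
# prints: 1-0 2-2 3-1
```

Each line is processed independently, which is why the stream can be split into blocks and handled by several workers as long as the output order is preserved, as `parallel --pipe -k` does in the change above.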
