I recently finished collecting the decisions from several thousand Russian court cases; these are the statements that the judge reads from the bench (as in the image above) and can run to thousands of words. Altogether, the corpus I collected comes to roughly 85 million characters. My goal was to translate these documents into English so that undergraduate research assistants can help code the outcomes of these cases. Human translation of these texts would be prohibitively expensive in time and resources, so I turned to machine translation.
Translation services from Google, AWS, or Microsoft were priced at around $1,500 USD: not bad, given the scale of the project, but enough to convince me to spend some time looking into doing my own hand-crafted, artisanal machine translation. It turns out that doing so is relatively straightforward using pre-built models shared through Hugging Face. I consider myself a reasonably strong programmer in R and a complete novice in Python, yet I was still able to piece together a translation program that does the job. This guide is meant to help anyone else in a similar position (as well as serving as a reminder to my future self).
Pre-processing the data
A significant hurdle to this kind of homebrew machine translation is that most off-the-shelf models are limited in the amount of text they can translate at once: something like 500 tokens at a time, where tokens include words, numerals, and punctuation. To get around this problem, I first had to break my long texts up into machine-parseable chunks. I did this by calling spacyr to cut the text up into individual sentences. The sentences from each text are then saved as a .json file so they can be imported into Python later. To run this script, you'll need the tidyverse, jsonlite, and spacyr packages. Once you have run this chunk, you can comment out the spacy_download_langmodel command; it's only needed on the first run. I have my initial data saved as an RDS file containing the full Russian-language text for each document. In this case, it includes a text vector called decision.text and an ID variable called caseid.
library(jsonlite)
library(tidyverse)
library(spacyr)
library(reticulate) #If running python from RStudio
spacy_download_langmodel("ru_core_news_md") #Only needed once; downloads Russian language model
spacy_initialize(model = "ru_core_news_md") #Loads the Russian language model
data_full <- readRDS(here::here("Data", "combined_dataset.RDS"))
The next step is to run a loop that uses spacy_tokenize to break the text up into sentences and write the results as a list in .json format. The tokenizer basically looks for end-of-sentence punctuation followed by whitespace and then a capital letter. This leads to some false positives in Russian, especially with respect to the abbreviation 'г.', which can stand in for "город" (city) or "год" (year). For example, the phrase "г. Москва" (city of Moscow) will be split into two sentences by the tokenizer. The result is lots of little phrases wrongly listed as sentences.
To 'fix' this, the inner loop looks for short phrases and appends them to the sentence above. This is of course completely hackish, and you may be better served using regexes to actually find the offending phrases that cause splits. The end of the outer loop cleans up the resulting NAs, names a new file based on the text ID, and saves it in .json format.
for(i in 1:nrow(data_full)){
  if(is.na(data_full$decision.text[i])) next
  sentence_list <- spacy_tokenize(x = data_full$decision.text[i], what = "sentence", output = "list")
  list_max <- length(sentence_list$text1)
  if(list_max >= 2){ #Skip the merge step for single-sentence texts
    for(j in list_max:2){ #Work backwards so earlier indices are unaffected by merges
      if(nchar(sentence_list$text1[j]) <= 30) { #Short 'sentences' are appended to the previous one
        sentence_list$text1[j-1] <- paste(sentence_list$text1[j-1], sentence_list$text1[j])
        sentence_list$text1[j] <- NA
      }
    }
  }
  sentence_list$text1 <- sentence_list$text1[!is.na(sentence_list$text1)] #Drop the merged fragments
  filename <- paste0("caseid", "_", data_full$caseid[i], ".json")
  write_json(sentence_list, here::here("Data", "Text_lists", filename))
}
spacy_finalize() #Ends spacy process
Translation
Now we move over to Python. I have Anaconda set up on my machines, but I expect you could run the whole program from start to finish in R if you wanted to use the reticulate package. First, it's best to set up an environment for the process to run in; Anaconda makes this pretty straightforward. Then some set-up in the code:
## Working directory
import os
os.getcwd() #Check your current working directory
os.chdir('Target_Folder')
os.chdir('Target_Subfolder') #Replace with your file structure; bottom level is where saved texts go
## For counting tokens
import nltk
## Setting source files
path = 'C:/<yourfilepath>/Text_lists' #This is the source folder, where json files are stored
files = os.listdir(path)
files = files[0:1000] #Change the slice to reflect which texts you want to translate; this captures the first 1000
The nltk package will be used to count tokens in the sentences we attempt to translate. This is imperfect, because the translation program will sometimes return a very different number of tokens than nltk, but at least it gives us an idea. We then get the list of files in the source directory, where we've stored the pre-processed .json files.
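As a quick illustration of this token counting (the sentence below is my own made-up example, not one from the corpus), note that nltk's word_tokenize relies on the punkt data, which has to be downloaded once:
## Example: rough token counting with nltk
nltk.download('punkt') #One-time download of the tokenizer data used by word_tokenize
sample = "Суд рассмотрел дело в открытом судебном заседании."
nltk_tokens = nltk.word_tokenize(sample)
print(nltk_tokens) #A list of word and punctuation tokens
print(len(nltk_tokens)) #The rough count used later to flag very long sentences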
The bulk of the translation work is done through the following lines of code:
## Translation code
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
translation = pipeline("translation_ru_to_en", model=model, tokenizer=tokenizer)
The Hugging Face transformers package gives us the Russian-to-English tokenizer as well as the sequence-to-sequence translation model. We combine them here into a pipeline object called translation, which we can call like a function.
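Before pointing this at thousands of documents, it's worth a quick sanity check that the model loads and produces something sensible. This is just a sketch; the sample sentence is my own, not taken from the corpus:
## Quick sanity check
sample = "Суд рассмотрел дело в открытом судебном заседании."
print(translation(sample, max_length = 512)[0]['translation_text']) #Should print an English sentence
print(len(tokenizer(sample)['input_ids'])) #Token count as the model's own tokenizer sees it
And finally, the real work: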
## Loading a json file of sentences and translating in a loop
import pandas as pd
for i in range(len(files)): #Change if you are only translating a subset of files
    filepath_current = [path, files[i]]
    filepath_current = '/'.join(filepath_current)
    sentence_data = pd.read_json(filepath_current)
    ## Convert to list
    sentence_list = sentence_data.iloc[:, 0].tolist()
    ## Selecting the text in a for loop
    #Getting caseid from the file name
    filename_current = files[i]
    caseid = ''.join(filter(str.isdigit, filename_current))
    translated_text = list() #Empty list for translations
    for j in range(len(sentence_list)):
        text = sentence_list[j]
        text = " ".join(text.split()) #Remove excess whitespace (safe here, since each item is a single sentence with no newlines)
        text = text.replace('«', "'")
        text = text.replace('»', "'") #Replace Russian quote marks
        nltk_tokens = nltk.word_tokenize(text) #Gets approximate number of tokens
        if len(nltk_tokens) > 300 and ';' in text: #If a large sentence, follow this path to break it up at semi-colons
            print("Long string detected; splitting into clauses...")
            text_split = text.split(';')
            for q in range(len(text_split)):
                clause = text_split[q]
                translated_clause = translation(clause, max_length = 512)[0]['translation_text']
                translated_text.append(translated_clause)
        else:
            translated_sentence = translation(text, max_length = 512)[0]['translation_text']
            translated_text.append(translated_sentence)
        print("Current translation:", str(j + 1), "of", str(len(sentence_list)))
    full_text = ' '.join(translated_text) #Joins list items together with a space in between
    full_text = full_text.replace(" ' ", "' ") #This fixes an error with apostrophes
    print(full_text)
    ## Write to file
    filename_save = ["caseid", caseid] #The string "caseid" plus the actual id number
    filename_save = '_'.join(filename_save)
    filename_save = [filename_save, ".txt"]
    filename_save = ''.join(filename_save) #I assume there is a less hackish way to do this...
    with open(filename_save, "w", encoding="utf-8") as f:
        f.write(full_text)
    print("Document", str(i + 1), "of", str(len(files)), "completed.")
This long loop first reads in the .json file and converts it to a Python list. It then extracts the ID number from the .json file name and stores it as a string. Then, for each sentence in the list, it:
- Removes excess whitespace
- Replaces nuisance characters (you could add more here, according to your needs)
- Counts the approximate number of tokens in the sentence
- Passes the sentence to the translator if it is short enough
- Appends the translation to a list, which is combined into one text once the loop is finished
Finally, the loop saves the result as a .txt file with a name based on the ID number.
Sometimes a sentence is too long to be translated by the model. In my case, this has most often been caused by long sentences full of semi-colons. The inner loop therefore looks for semi-colons in long sentences and splits them up. Unfortunately, this is not the only way the loop can break on overly long sentences, and it is occasionally necessary to debug the loop by hand.
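If you wanted to cut down on the manual debugging, one option would be to wrap the translation call in a try/except block and, on failure, fall back to splitting the sentence by the model's own token count rather than by semi-colons. The sketch below is just the idea, not part of the script above; chunk_by_tokens and translate_safely are hypothetical helper names:
## Optional sketch: a more defensive translation call (not used in the script above)
def chunk_by_tokens(text, tokenizer, max_tokens = 400):
    #Split a long sentence into chunks of at most max_tokens tokens, using the model's own tokenizer
    ids = tokenizer(text)['input_ids']
    chunks = [ids[k:k + max_tokens] for k in range(0, len(ids), max_tokens)]
    return [tokenizer.decode(c, skip_special_tokens = True) for c in chunks]

def translate_safely(text):
    #Try a normal translation first; if the model chokes, translate token-sized chunks instead
    try:
        return translation(text, max_length = 512)[0]['translation_text']
    except Exception:
        pieces = chunk_by_tokens(text, tokenizer)
        return ' '.join(translation(p, max_length = 512)[0]['translation_text'] for p in pieces)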
Conclusion
At the end of my workday I was honestly kind of astonished that this worked. Machine translation on a personal device was not really a thing when I first started doing social science, and it seems like this technology opens up whole frontiers of study. It does so, frankly, in part by eroding the importance of language knowledge in social science training. Not entirely, however. The translations are often a little wonky, and a little language knowledge and case expertise goes a long way in correcting them. Nevertheless, it’s not hard to imagine this technology allowing for much more rapid and wide-ranging text analysis and mixed-methods work, which I look forward to.
If you are interested in where this particular project goes moving forward, my academic website can be found at colejharvey.com.