I must begin with a disclaimer: I am absolutely not a coder, or anything of the sort, just someone obsessed with language, armed with very rudimentary R skills.
The truth is that I’ve been certain for a long time that there’s some NLP magic out there that can be really helpful in language learning, especially in figuring out the most efficient way to approach a new text in a foreign language.
It’s a super practical question. In Korean, where I’m an absolute beginner, I’m most interested in finding the most frequent words in very short texts, or in kdrama episodes I’m about to watch. In Spanish, where I have a basic command of the language, I’m looking for the most topical words, or two-to-three word phrases, in a book like Harry Potter.
I recently discovered LingQ and Learning with Texts, both great tools that keep a library of the words and phrases you know, so that when you input a new text, you can immediately see what you don’t know.
But it quickly became apparent that those tools don’t help with a more basic problem: words take on a zillion forms. If you know the word “eat,” you know the words “eating,” “eats,” and “eaten.” Things get way crazier in a language like Korean, where particles attach directly to words, verbs have basically infinite forms thanks to verb endings and politeness levels, and there’s a wonderful tendency to bric-a-brac all of the above. It’s fabulous and fascinating, but it makes it impossible to use simple frequency to get a sense of the text.
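To see why counting raw tokens misleads, here’s a toy sketch in base R (the list of English forms is mine, purely for illustration):

```r
# Toy example: several surface forms of the same underlying word.
words <- c("eats", "eating", "eat", "eaten", "eat", "eating")

# Counting raw tokens scatters the frequency across the forms...
table(words)

# ...while counting a shared stem recovers the true count.
stems <- rep("eat", length(words))
table(stems)
```

In a language like Korean, with far more forms per word, the scattering is much worse, which is why stemming or lemmatizing first matters so much.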
So all that being said, my first goal was very simple: strip each word of a new text to its stem, and then use the frequency of the stems (or lemmas) to choose vocabulary words.
Luckily, I found the answer super easily using the udpipe package, which tokenizes a text into a table with a row for every word and columns of associated data, including its lemmatized form (plus way more columns).
Here’s the code I used to get from the raw text to a list of vocab words to accompany my study of the text:
install.packages("udpipe")
library(udpipe)
library(tidyverse)

# Load the text and remove punctuation
txt <- readLines("my_text.txt")
txt <- gsub('[[:punct:] ]+', ' ', txt)

# Apply udpipe to the text (downloads the Korean model on first run)
tokens <- udpipe(txt, object = "korean")

# Separate the lemma column on '+' to produce a new column of stems
tokens <- tokens %>%
  separate(lemma, into = c("lemma1", "lemma2"),
           sep = "\\+", extra = "merge", fill = "right")

# Find the top stems
top <- tokens %>%
  count(lemma1) %>%
  arrange(desc(n)) %>%
  top_n(10)

# Merge back into the token list to find vocab words
vocab <- top %>%
  left_join(tokens, by = "lemma1") %>%
  select(token, sentence, lemma1, lemma2, upos, xpos)
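The `separate()` call is the heart of the script: udpipe writes compound Korean lemmas joined by `+`, and splitting on the first `+` leaves the root in `lemma1` and everything else in `lemma2`. A standalone sketch with made-up lemma strings (illustrative, not real udpipe output):

```r
library(dplyr)
library(tidyr)

# Made-up lemma strings in the "root+ending" shape udpipe produces.
tokens <- tibble(lemma = c("먹+었+어요", "가+ㄴ다", "학교"))

tokens <- tokens %>%
  separate(lemma, into = c("lemma1", "lemma2"),
           sep = "\\+", extra = "merge", fill = "right")

# lemma1 now holds the root ("먹", "가", "학교");
# lemma2 holds the remaining morphemes, or NA when there were none.
```

The `extra = "merge"` argument keeps multi-part endings like 었+어요 together in `lemma2` instead of dropping them, and `fill = "right"` puts NA in `lemma2` for bare lemmas with no `+`.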
One thing to note is that you’ll need to remove ‘stopwords,’ aka very common words in your language/text, so that your top words/lemmas aren’t ones you already know. (For example, I ran this on the transcript of an episode of the kdrama Flower of Evil. Out of 3,119 words in the transcript, 369 were associated with the 10 most common roots, including 있다, 없다, 아니다, 하다, 나, 그, and 것.) I downloaded the stopwords list provided here and then filtered those out.
You can use the same script as above, just download the list and add this:
stopwords <- readLines("stopwords-ko.txt")

top <- tokens %>%
  filter(!token %in% stopwords) %>%  # remove stopwords
  count(lemma1) %>%
  arrange(desc(n)) %>%
  top_n(10)
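Here’s the same filtering step on a tiny made-up token table, just to show the mechanics (the Korean words and stopword list are my own toy data, not the real files):

```r
library(dplyr)

# Toy token table and stopword list, purely illustrative.
tokens <- tibble(token  = c("저는", "밥을", "밥", "먹었어요"),
                 lemma1 = c("저",   "밥",   "밥", "먹"))
stopwords <- c("저는", "저")

top <- tokens %>%
  filter(!token %in% stopwords) %>%  # drop tokens on the stopword list
  count(lemma1) %>%
  arrange(desc(n))
# 밥 survives with a count of 2; 저 is filtered out via its token 저는.
```

One thing to watch: the filter matches surface tokens, so a stopword root that appears in an inflected form your list doesn’t cover will slip through. Filtering on `lemma1` instead is an option if your stopword list contains roots rather than surface forms.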
And that’s it.
Of course, there’s way more to all of this, first and foremost using the tf-idf algorithm to find words that are more common in your specific text than in a comparison corpus, suggesting they’re the most topical/useful to be familiar with. To do that, though, you need something to compare your text against. But that’s for another time.
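For the curious, a minimal sketch of what that comparison could look like, using `bind_tf_idf()` from the tidytext package (my choice of tool here, not something this post commits to; the lemma counts are invented):

```r
library(dplyr)
library(tidytext)

# Toy lemma counts for two "documents": your text and a comparison corpus.
counts <- tibble(
  doc   = c("my_text", "my_text", "corpus", "corpus", "corpus"),
  lemma = c("마법",    "학교",    "학교",   "사람",   "것"),
  n     = c(10,        4,         3,        50,       80)
)

# tf-idf is high for lemmas frequent in your text but rare elsewhere:
# 마법 appears only in my_text, so it rises to the top; 학교 appears in
# both documents, so its idf (and hence tf-idf) is zero.
counts %>%
  bind_tf_idf(lemma, doc, n) %>%
  filter(doc == "my_text") %>%
  arrange(desc(tf_idf))
```

With a real comparison corpus (a big pile of subtitles, say), the same three lines would surface the words that make your particular text distinctive.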