
Learning Languages with the Help of Algorithms


Suppose you’re learning a new language and want to boost your vocabulary in a very time-efficient way.

People learn languages in many different ways. Suppose you want to improve your vocabulary by reading books in that language. To get the most impact, you’d like to pick books that cover as many common words in the language as possible.

Here is a formalization. Given a large set of m books with an average length of n words each, you want to pick the single book from the set with the highest vocabulary impact. The vocabulary impact of a book is a weighted sum over the distinct vocabulary words it contains, where each word is weighted by how common it is in the language, as measured by its frequency across all the books; that frequency essentially serves as a probability weight for each word in the language.
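One way to write that down (the notation here is mine, not the article’s): let p(w) be the empirical frequency of word w across the whole collection, and let V(b) be the set of distinct non-stop words in book b. Then

    score(b) = Σ_{w in V(b)} p(w),   where   p(w) = (occurrences of w across all books) / (total word count across all books)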

That’s an easy problem to solve. First, optionally filter out stop words, such as (in the case of English) “the” and “and,” which are considered in some sense to carry “not much meaning.” Second, build a list of the unique words across all books, along with a count of the occurrences of each word. Finally, evaluate each book’s coverage of those words, computing a score as described above.

Building the unique word list and its counts can be done with a hashing process that typically runs in time linear in the total text size, O(mn), with a worst case of O(mn log(mn)). Computing each book’s score also takes linear time on average, so the entire process can be done in roughly linear time.
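As a concrete illustration, here is a minimal Python sketch of this single-book computation, assuming the books have already been tokenized into lists of lowercase words (the function and variable names are my own, not from the article):

    from collections import Counter

    def word_probabilities(books, stop_words=frozenset()):
        # Empirical frequency of each non-stop word across the whole collection.
        counts = Counter()
        for book in books:
            counts.update(w for w in book if w not in stop_words)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def coverage_score(book, probs):
        # Weighted coverage: sum the probabilities of the distinct words in the book.
        return sum(probs.get(w, 0.0) for w in set(book))

    def best_single_book(books, stop_words=frozenset()):
        probs = word_probabilities(books, stop_words)
        return max(range(len(books)), key=lambda i: coverage_score(books[i], probs))

Each book’s token list is touched only a constant number of times, which is where the linear running time comes from.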

What if you want to find the best two books to read, the pair with the best joint vocabulary coverage? To find the optimal solution, the best known time is no longer linear: in the general case you essentially have to evaluate the joint coverage of every pair of books, which is quadratic.
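To see where the quadratic cost comes from, here is a brute-force sketch of the two-book search (again with names of my own choosing, reusing the word probabilities computed above); it scores the joint vocabulary of every pair of books:

    from itertools import combinations

    def best_two_books(books, probs):
        # Exhaustive search over all pairs of books: O(m^2) joint-coverage evaluations.
        vocabs = [set(book) for book in books]
        best_pair, best_score = None, float("-inf")
        for i, j in combinations(range(len(books)), 2):
            score = sum(probs.get(w, 0.0) for w in vocabs[i] | vocabs[j])
            if score > best_score:
                best_pair, best_score = (i, j), score
        return best_pair, best_score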

How about the best k books for arbitrary k > 0?

This is an NP-hard problem: no polynomial-time exact algorithm is known for the general case (and none is expected unless P = NP). The best known exact approaches essentially have to examine on the order of m^k candidate subsets, so the compute time grows exponentially as k, the size of your desired set of reading books, increases.

So you cannot expect to solve this problem exactly for large k. All hope is not lost, however. This is a maximum weighted coverage problem, which sits within the subset of NP-hard problems whose objectives are submodular. Because of this, approximation algorithms are known that are guaranteed to come within a known constant factor of the true best solution; for monotone submodular maximization under a cardinality constraint, the classic factor is 1 - 1/e ≈ 0.63 (see [1-5]).

These algorithms are described in [6], [7], [8]. The basic idea is greedy: add one high-impact book at a time to the running set, at each step choosing the book that most improves the joint coverage. The resulting list is not guaranteed to be the best possible, but it is provably close to optimal and reasonable in practice.
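Here is a compact sketch of that greedy selection, again in Python and again with my own names, building on the word probabilities computed earlier. At each step it adds the book whose words contribute the most not-yet-covered probability mass:

    def greedy_book_selection(books, probs, k):
        # Greedy maximum weighted coverage: repeatedly add the book that adds
        # the most probability mass not already covered by the chosen set.
        vocabs = [set(book) for book in books]
        chosen, covered = [], set()
        for _ in range(k):
            best_i, best_gain = None, 0.0
            for i, vocab in enumerate(vocabs):
                if i in chosen:
                    continue
                gain = sum(probs.get(w, 0.0) for w in vocab - covered)
                if gain > best_gain:
                    best_i, best_gain = i, gain
            if best_i is None:  # no remaining book adds any new coverage
                break
            chosen.append(best_i)
            covered |= vocabs[best_i]
        return chosen

For a monotone submodular objective such as this weighted coverage score, this greedy rule is the one behind the classic 1 - 1/e approximation guarantee.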
