I have started to collect my vocabulary lists in the form of CSV (Comma-Separated Values) files, the spreadsheet data interchange format. These can be read by my mulvoc (Multi-lingual Vocabulary) program for Emacs.

File format

The vocabulary files are spreadsheet files, with a row for each word and a column for each language. The top row gives the SIL codes for the languages. One column (typically the first) is labelled "#TYPE" instead of a language code, and its entries give the part of speech of the word on each row. (This, for example, distinguishes the noun "fear" from the verb "fear".)

The files are encoded in UTF-8 (Unicode).
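
As an illustration of the layout described above, here is a minimal Python sketch that reads such a file into a list of entries. It is not part of mulvoc. The function name read_vocabulary, the column labels ENG and DEU, and the part-of-speech labels "noun" and "verb" are assumptions made for the example; the real files may use different codes and labels.

```python
import csv
import io

TYPE_COLUMN = "#TYPE"  # the column label described in the text


def read_vocabulary(stream):
    """Return a list of (part_of_speech, {language_code: word}) entries.

    `stream` is any text stream of CSV data whose first row is the header.
    """
    reader = csv.reader(stream)
    header = [cell.strip() for cell in next(reader)]
    type_index = header.index(TYPE_COLUMN)  # raises ValueError if absent
    entries = []
    for row in reader:
        # Pad short rows so every header column has a (possibly empty) cell.
        row = row + [""] * (len(header) - len(row))
        words = {
            header[i]: cell.strip()
            for i, cell in enumerate(row[:len(header)])
            if i != type_index and cell.strip()
        }
        if words:  # skip rows that contain no words at all
            entries.append((row[type_index].strip(), words))
    return entries


# Hypothetical sample data; the language codes are placeholders.
sample = io.StringIO(
    "#TYPE,ENG,DEU\n"
    "noun,fear,Angst\n"
    "verb,fear,fürchten\n"
)
entries = read_vocabulary(sample)
for part_of_speech, words in entries:
    print(part_of_speech, words)
```

To read a file from disk instead of the in-memory sample, open it with open(path, encoding="utf-8", newline="") so that the UTF-8 encoding is respected and the csv module handles line endings itself.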

Files for download

Here are some sample vocabulary files:

Lots of useful vocabulary I've picked up at various times.
test1.csv, test2.csv
Torture test files for mulvoc. These contain words of the same spelling but different meaning in different languages, words of the same spelling but different parts of speech in the same language, and so forth. There are two separate files, to test mulvoc's ability to combine data from multiple files. These are mostly entered from dictionaries.
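
To show the kind of cases these test files are meant to exercise, here is a rough Python sketch that loads rows from two sources and indexes them by language and spelling. It is only an illustration under my own assumptions, not how mulvoc combines files: the helper names load_rows and build_index, the sample words, and the column labels ENG, FRA and DEU are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_rows(stream):
    """Yield (part_of_speech, {language_code: word}) for each data row."""
    for row in csv.DictReader(stream):
        part_of_speech = (row.pop("#TYPE", "") or "").strip()
        # Drop empty cells, and any cells beyond the header (key None).
        words = {
            lang: word.strip()
            for lang, word in row.items()
            if lang and word and word.strip()
        }
        if words:
            yield part_of_speech, words


def build_index(streams):
    """Map (language_code, word) to every row, from any file, containing it."""
    index = defaultdict(list)
    for stream in streams:
        for part_of_speech, words in load_rows(stream):
            for lang, word in words.items():
                index[(lang, word)].append((part_of_speech, words))
    return index


# Hypothetical stand-ins for two vocabulary files.
file_a = io.StringIO("#TYPE,ENG,FRA\nnoun,chat,conversation\nnoun,cat,chat\n")
file_b = io.StringIO("#TYPE,ENG,DEU\nnoun,fear,Angst\nverb,fear,fürchten\n")
index = build_index([file_a, file_b])

# Same spelling, different language: the two lookups give different rows.
print(index[("ENG", "chat")])
print(index[("FRA", "chat")])
# Same spelling, same language, different parts of speech: two rows.
print(index[("ENG", "fear")])
```

Keying the index on the language as well as the spelling is what keeps English "chat" apart from French "chat", and keeping the part of speech with each row is what keeps the noun "fear" apart from the verb.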


I'll add to the collection as I go along, trying to keep everything unencumbered. I am pondering the ethics, in terms of freedom from copyright, of entering words which I have looked up in dictionaries. I think that the small sample files I have here cannot be expected to alarm paper dictionary publishers, and I hope they come under "fair use". I would understand paper dictionary publishers being upset were I to enter large amounts of material looked up in their products (to the extent of largely reproducing that aspect of a work), although I'm not sure about the copyright status of translations of individual words. I'd be interested to find out what professional lexicographers think of looking words up in existing dictionaries when writing new dictionary entries.

However, there is also a set of unencumbered dictionaries, including translations, growing at Wiktionary. I have written a script to fetch data from there and convert it into my format.

Another systematic vocabulary collection is the Swadesh list. I have imported some of its languages into my format.

