A Corpus Factory for Many Languages

A. Kilgarriff, Siva Reddy, Jan Pomikálek, Avinesh P.V.S

LREC

Abstract

For many languages there are no large, general-language corpora available. Until the web, all but the institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a ‘corpus factory’ where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. We use the BootCaT method: we take a set of ‘seed words’ for the language from Wikipedia. Then, several hundred times over, we * randomly select three or four of the seed words * send as a query to Google or Yahoo or Bing, which returns a ‘search hits’ page * gather the pages that Google or Yahoo point to and save the text. This forms the corpus, which we then * ‘clean’ (to remove navigation bars, advertisements etc) * remove duplicates * tokenise and (if tools are available) lemmatise and part-of-speech tag * load into our corpus query tool, the Sketch Engine The corpora we have developed are available for use in the Sketch Engine corpus query tool.