Why you should start using corpora now: the benefits to consulting and/or compiling corpora in the translation of specialised texts (EN)

The corpus-based approach to translation enables researchers and translators to compare and analyse written or spoken texts in order to provide quantitative evidence of the existence of discourses. What do people really say? What do people really write? How do they say or write it? What is typical in language, and what is rare, unusual or emerging usage?

It has been employed in translation studies (TS) since the 1990s, but rarely to analyse specialised discourses. Šarčević (1997) was one of the first scholars to stress the importance of using corpora to identify the main discursive features of legal genres, which are a type of specialised discourse.

But first... what is a corpus?

The word stems from the Latin "corpus", which can be translated into English as "body". In a broad sense, it refers to a body of texts that collects linguistic and paralinguistic features (e.g., gestures in videos) and is (nowadays) stored in a computer database for language research. It provides "authentic" materials (Richards, 2010): naturally occurring language use taken from real-world contexts.

What are the types of corpora?

The corpus-based approach makes it possible to identify and analyse the main discursive features and metadata of genre and register (comparable corpora), as well as to find translation solutions (parallel corpora). Further distinctions depend on the number of languages involved (monolingual, bilingual, or multilingual), channel (written, spoken, or multimodal), time (synchronic or diachronic), field (general or specialised), and representativeness of sub-cultures (age group, sex, social class, and geographical region). Corpora can be extracted from various sources, such as books, emails, research articles, and so on.

Comparable corpora consist of extracted texts which are not translations of each other but display similar content. In particular, a comparable corpus is very helpful for identifying repetitive lexical patterns of concrete language use (i.e., collocation) and interrelated syntactic patterns in fixed phrases (i.e., colligation). In general, professional translators (above all scholars) make use of comparable corpora to produce an accurate translation product, since corpora provide words in context and help translators understand whether the terms they choose are viable equivalents. On the other hand, the most widely used corpora for translation purposes are parallel corpora, i.e. a text and its translation. Parallel corpora are indeed termed "translation corpora" by some scholars because they provide translators with translation solutions, i.e. equivalents. Translation memories (TMs) can also be used to build parallel corpora by aligning source-text (ST) segments with their corresponding translated segments.
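As a rough illustration of how an aligned parallel corpus yields equivalents, here is a minimal Python sketch. The segment pairs and the `find_equivalents` helper are invented for the example; real TMs use dedicated exchange formats such as TMX and far more sophisticated lookup.

```python
# A toy parallel corpus built from TM-style aligned segment pairs
# (source text, target text). The sentences are invented examples.
aligned_segments = [
    ("The contract shall enter into force on signature.",
     "Il contratto entra in vigore alla firma."),
    ("The parties may terminate the contract at any time.",
     "Le parti possono risolvere il contratto in qualsiasi momento."),
]

def find_equivalents(term, corpus):
    """Return target segments whose source segment contains `term`."""
    return [tgt for src, tgt in corpus if term.lower() in src.lower()]

# Looking up "contract" retrieves both Italian segments, letting the
# translator see how the term was rendered in context.
print(find_equivalents("contract", aligned_segments))
```

The point of the sketch is simply that alignment turns a text and its translation into a searchable bank of attested equivalents.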


NB: You can consult pre-existing corpora, some of which are provided involuntarily by institutional sites (e.g., EUR-Lex can be used as a corpus for linguistic or translational purposes), or compile ad hoc corpora by extracting texts from the web and uploading them to professional tools.


What about the corpus tools?

At the core of corpus linguistics are the corpus tools, which reveal such phenomena as concordances, collocations, colligations, keywords, and n-grams. Concordancers provide a list of textual units (e.g., words or phrases) together with their immediate contexts in the corpus under analysis, producing a concordance. A concordance can be bilingual or multilingual, and it can also serve as input to a TM system. The best-known concordance format is Key Word In Context (KWIC), in which every occurrence of the search word is displayed centred within its surrounding context, and the resulting lines can be sorted alphabetically. A helpful tool for analysing and mining information is Sketch Engine. It reveals phenomena such as collocation and colligation, i.e. the co-occurrence of lexical items (a node and its collocates) and the co-occurrence of grammatical items according to the way they function in a syntactic structure (Nordquist, 2020). Other phenomena that can be analysed through search queries are keywords, i.e. words appearing statistically more frequently than expected in comparison with a reference corpus, and n-grams, i.e. contiguous sequences of n items (typically words or characters) extracted from the text.
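To make these notions concrete, here is a small self-contained Python sketch of a KWIC concordance and of word n-gram counting. The toy sentences and function names are invented for illustration; tools like Sketch Engine do this at scale with proper statistical measures.

```python
import re
from collections import Counter

# Invented mini-corpus for the example.
text = ("The court held that the contract was void. "
        "The court dismissed the appeal. "
        "The contract was signed by both parties.")

# Simple tokenisation: lowercase word tokens.
tokens = re.findall(r"\w+", text.lower())

def kwic(tokens, node, window=3):
    """Return KWIC lines: each hit of `node` centred in its context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25} [{node}] {right}")
    return lines

def ngrams(tokens, n):
    """Frequencies of contiguous word n-grams."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

for line in kwic(tokens, "court"):
    print(line)
print(ngrams(tokens, 2).most_common(3))
```

Even on this tiny sample, the bigram counts already surface the recurrent patterns "the court" and "the contract", which is exactly the kind of repetitive lexical patterning (collocation) a comparable corpus is used to detect.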


7 steps to compile a corpus: a guide for beginners

1. set parameters: type of corpus (general or specialised), language/s, size (i.e., number of tokens, roughly comparable to words), text type or genre (e.g., Italian/English judgements), and time span (e.g., a specific period or the latest texts).

2. search for texts on the web, using Google Scholar or websites containing the texts you need.

2.1. in the case of parallel corpora, the documents must be translations of one another and not random texts about a similar topic.

3. upload the documents to a tool designed to build corpora, e.g. Sketch Engine or WordSmith Tools.

4. align the documents.

5. check that there are no errors in the alignment, then analyse the corpus for your purpose.

6. download the corpus in one of the available formats (e.g., XLSX).
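Steps 4 to 6 can be sketched in a few lines of Python. The segment lists are invented placeholders for what an alignment tool would produce, and the export uses TSV rather than XLSX simply to avoid non-standard-library dependencies.

```python
import csv

# Hypothetical output of step 4: segments aligned by position.
source_segments = ["The judgment is final.", "No appeal shall lie."]
target_segments = ["La sentenza è definitiva.", "Non è ammesso ricorso."]

# Step 5 (sanity check): every source segment needs a counterpart,
# otherwise the pairs after the gap are silently shifted.
assert len(source_segments) == len(target_segments), "Misaligned corpus"

# Step 6: export the aligned pairs. Tools such as Sketch Engine also
# offer formats like XLSX; TSV keeps this sketch dependency-free.
with open("parallel_corpus.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["source", "target"])
    writer.writerows(zip(source_segments, target_segments))
```

The length check is deliberately crude: real alignment errors (merged or split sentences) also need spot-checking by eye, which is why step 5 comes before any analysis.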



Extracts from the university research thesis of Dr. Désirée Russo (to learn more about it, or simply to discuss it with me, find me on LinkedIn: https://www.linkedin.com/in/d-russo31/)



References

Nordquist R., Colligations in English: Words That Keep Each Other Company, ThoughtCo, New York, 2020

Richards J. C., Series Editor's Preface, in Using Corpora in the Language Classroom, ed. by Reppen R., Cambridge University Press, 2010

Šarčević S., New Approach to Legal Translation, Kluwer Law International, 1997


Further readings

Gatto M., The Web as a Corpus: Theory and Practice, Bloomsbury USA Academic, 2014


Désirée Russo, who graduated first in Linguistic and Intercultural Mediation and then in Specialised Translation, translates from and into English, French, and Italian in the legal, economic-financial, medical, food-and-wine, and tourism fields.