Multi-Word Units in Imaginative and Informative Domains

Mustafa Aksan, Yeşim Aksan
Mersin University

1. Introduction

An important contribution of corpus analysis to the study of language is the identification of recurrent forms in language use. In other words, corpus analysis makes it possible to identify, from a collection of texts, textually significant structures that function in the overall organization of discourse. Additionally, corpus analysis provides quantitative data that help derive qualitative conclusions about the defining characteristics of the texts in question in a more concrete and reliable manner. Recent developments in corpus analysis tools have brought new options for identifying recurrent lexical structures and their distribution in a corpus. Further analysis of lexical structures and their role in textual organization has required more detailed study of the so-called multi-word units (MWUs). The issues discussed range from the identification and definition of multi-word units in discourse to their role in minimizing the cognitive processing of information and in defining register properties.

In this study, we present our initial observations on emerging MWUs in two specially designed corpora, which contain texts representing the imaginative and informative domains in Turkish. While the corpus representing the imaginative domain includes samples from fictional prose, the one built for the informative domain comprises samples from informative texts. Following a previous study on MWUs (Biber, Conrad and Cortes, 2004), we first describe the structural patterns found in both corpora and then present the functional types of multi-word expressions as they appear in the specialized corpora. Our analysis concentrates on quantitative aspects of the MWUs in two types of registers in Turkish and highlights their distributional differences.
A more detailed analysis of the MWUs in question and their specific discourse functions demands a different type of study.

2. What is a MWU?

In a very simple definition provided by Biber, Conrad and Cortes (2004:371), multi-word units are “the most frequent sequences of words in a register.” In most cases, a multi-word unit, which is also called a lexical bundle, a chunk or an n-gram, is not a complete grammatical unit. Usually, a MWU is part of a well-defined grammatical phrase or clause in which some constituent is missing. In other words, MWUs are formed by fragments of lexical sequences or by syntactically incomplete but meaningful strings (e.g. süre sonra ‘after time’, başta olmak üzere ‘being in the first’), though semantically full expressions are also automatically retrieved as lexical chunks (e.g. ne de olsa ‘after all’).

In the context of corpus analyses, the patterns of lexical structures are interpreted in a special manner. The choice of a lexical structure is not merely determined by grammatical patterns in which a syntactic position is filled by formal choices; rather, its occurrence is governed by systematic patterns of use. As Sinclair (1991:108) observes, “By far the majority of text is made of the occurrence of common words in common patterns, or in slight variants of those common patterns. Most everyday words do not have an independent meaning, or meanings, but are components of a rich repertoire of multi-word patterns that make up a text. This is totally obscured by the procedures of conventional grammar.”

Lacking a complete grammatical form, MWUs do not allow for a complete or compositional semantic interpretation. However, they have identifiable discourse functions. As argued in most recent research, they are an important part of the communicative repertoire of speakers and writers, even though they do not correspond to well-formed structures.
For example, it is possible to decide on the genre or register properties of a text or text excerpt simply by looking at its recurrent MWUs, as they are prefabricated expressions that are specialized or conventionalized for a particular text type and are used over and over again (Wray & Perkins, 2000; McCarthy & Carter, 2006; Hyland, 2008; O’Keeffe, McCarthy & Carter, 2007; Breeze, 2013). Thus, different registers tend to rely on different sets of lexical sequences.

This study addresses three basic questions:

1. What are the structural types of MWUs in Turkish?
2. What are the functional categories of MWUs?
3. Do MWUs distinguish one domain/register from another?

In our analysis we mainly follow the approach presented by Biber, Conrad & Cortes (2004). In simple terms, their analysis adopts a frequency perspective. Having identified MWUs and their relative distribution in different text types, they show how such recurrent lexical bundles, through their distinct uses, contribute to the study of particular registers.

3. Data and method

To investigate the domain/register-based use of MWUs, two equal-size sub-corpora covering a period of 20 years (1990-2009) were constructed from the databases of the Turkish National Corpus (TNC) (Aksan, Aksan, Koltuksuz et al., 2012). These are the Corpus of Contemporary Turkish Fiction (CCTF) and the Corpus of Contemporary Turkish Informative Prose (CCTIP). Including a wide range of texts through equally sized samples ensures the representativeness and balance of both corpora. CCTF is a 1 million-word corpus and consists of samples from the novels and short stories of contemporary Turkish authors. CCTIP, on the other hand, contains samples of informative texts compiled from social sciences, applied sciences, world affairs, commerce and finance, art, belief and thought, and leisure. Both corpora include samples taken from 200 different texts.
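
Comparing two corpora of this kind usually starts from n-gram counts normalized to a common base, such as occurrences per million words. The sketch below illustrates that step under simplifying assumptions: the text is already split into tokens on whitespace, and the function names `ngram_rates` and `frequent_ngrams` as well as the `min_per_million` parameter are hypothetical. It is not the procedure of any particular tool.

```python
from collections import Counter


def ngram_rates(tokens: list[str], n: int) -> dict[tuple[str, ...], float]:
    """Return each contiguous n-gram's frequency per million tokens."""
    if n < 1 or len(tokens) < n:
        return {}
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    per_million = 1_000_000 / len(tokens)
    return {gram: count * per_million for gram, count in counts.items()}


def frequent_ngrams(tokens, n, min_per_million):
    """Keep n-grams at or above the cut-off, most frequent first."""
    rates = ngram_rates(tokens, n)
    kept = [(gram, rate) for gram, rate in rates.items() if rate >= min_per_million]
    # Sort by descending rate; ties are broken alphabetically by n-gram.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy input only; a real corpus would supply about a million tokens per domain.
tokens = "ya da bir şey ya da başka bir şey".split()
for gram, rate in frequent_ngrams(tokens, 2, min_per_million=200_000):
    print(" ".join(gram), round(rate))
```

Normalizing to a per-million rate keeps the cut-off meaningful even when the corpora being compared are not exactly the same size.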

The Ngram Statistics Package (Banerjee & Pedersen, 2003) was used to generate rank-ordered frequency lists of MWUs, i.e., n-grams. The cut-off for including an n-gram in the frequency lists was set at 1 occurrence per million words. For the comparative and detailed analysis of n-grams across the registers, the cut-off was set at 100 for bi-grams and 14 for tri-grams; that is, the lists contain bi-grams used at least 100 times per million words and tri-grams used at least 14 times per million words. To identify and analyze the functions of n-grams, concordance lines extracted and sorted with AntConc 3.2.5 (Anthony, 2010) were examined. These concordance lines show the extended discourse context of the n-grams searched. N-grams that are part of larger n-grams were ignored. For instance, ya da çok ‘or more’ was ignored since it is part of the four-word expression az ya da çok ‘more or less’. A total of 240 multi-word units of meaning, consisting of 130 bi-grams and 110 tri-grams, were analyzed.

4. Quantitative findings

The following tables show the observed frequencies of bi-grams and tri-grams across the imaginative and informative domains. As seen in Table 1, the use of two-word n-grams in the two corpora is almost the same (11%).

Table 1. Frequency of bi-grams in imaginative and informative domains

| Domains | Frequency | % |
|---|---|---|
| Imaginative domain | 106,673 | 11 |
| Informative domain | 107,389 | 11 |

The use of three-word sequences in the informative domain is slightly higher (6%) than in the imaginative domain (4%).

Table 2. Frequency of tri-grams in imaginative and informative domains

| Domains | Frequency | % |
|---|---|---|
| Imaginative domain | 41,041 | 4 |
| Informative domain | 60,968 | 6 |

In terms of rank frequency, the 20 top-ranked bi-grams and tri-grams are listed on the basis of their observed frequencies in the imaginative and informative domains. Tables 3 and 4 below show similarities and differences in the occurrence of n-grams.

Table 3. The 20 top-ranked bi-grams in imaginative and informative domains

| Rank | Imaginative domain | Freq. | Informative domain | Freq. |
|---|---|---|---|---|
| 1 | bir şey ‘something’ | 1200 | ya da ‘or’ | 1712 |
| 2 | ya da ‘or’ | 1036 | bir şey ‘something’ | 710 |
| 3 | ben de ‘me too’ | 676 | böyle bir ‘such a’ | 522 |
| 4 | belki de ‘maybe’ | 622 | ne kadar ‘how much’ | 497 |
| 5 | ne kadar ‘how much’ | 608 | ve bu ‘and this’ | 467 |
| 6 | bir süre ‘for a while’ | 503 | büyük bir ‘something big’ | 451 |
| 7 | o da ‘s/he/it either’ | 484 | başka bir ‘something other’ | 432 |
| 8 | başka bir ‘something other than’ | 451 | bir de ‘one more’ | 422 |
| 9 | bir de ‘one more’ | 451 | bir başka ‘another’ | 389 |
| 10 | o kadar ‘that much’ | 451 | yeni bir ‘something new’ | 362 |
| 11 | bir an ‘for a moment’ | 426 | bir süre ‘for a while’ | 350 |
| 12 | bir gün ‘one day’ | 413 | ben de ‘me too’ | 344 |
| 13 | değil mi ‘isn’t it’ | 403 | daha çok ‘much more’ | 337 |
| 14 | bu kadar ‘this much’ | 400 | o kadar ‘that much’ | 321 |
| 15 | o zaman ‘then’ | 399 | belki de ‘maybe’ | 313 |
| 16 | büyük bir ‘something big’ | 372 | o zaman ‘then’ | 302 |
| 17 | böyle bir ‘such a’ | 363 | sonra da ‘and then’ | 300 |
| 18 | her şey ‘everything’ | 361 | gibi bir ‘something like’ | 291 |
| 19 | bir şeyler ‘something’ | 361 | bir şekilde ‘in a way’ | 285 |
| 20 | sonra da ‘and then’ | 353 | bu arada ‘meanwhile’ | 278 |

The top two bi-grams are the same in the two domains. While 13 bi-grams overlap in the two top-20 lists, 7 bi-grams in each list are specific to one domain and occur at various ranks. The bi-grams that differ between the domains are indicated in bold typeface.

Table 4. The 20 top-ranked tri-grams in imaginative and informative domains

| Rank | Imaginative domain | Freq. | Informative domain | Freq. |
|---|---|---|---|---|
| 1 | bir kez daha ‘once more’ | 134 | bir süre sonra ‘after a while’ | 134 |
| 2 | bir yandan da ‘and besides’ | 132 | ne var ki ‘however’ | 103 |
| 3 | başka bir şey ‘something else’ | 108 | bir kez daha ‘one more time’ | 88 |
| 4 | bir süre sonra ‘after a while’ | 108 | başka bir şey ‘something else’ | 86 |
| 5 | bir an önce ‘immediately’ | 93 | bir yandan da ‘and besides’ | 78 |
| 6-18 | [cell alignment not recoverable] | | | |
| 19 | öyle değil mi ‘isn’t it so’ | 35 | çok büyük bir ‘a very big’ | 38 |
| 20 | ne kadar çok ‘the more’ | 33 | daha sonra da ‘and then’ | 37 |

Entries surviving for ranks 6-18, without rank or column assignment:
N-grams: bir şey yok ‘nothing exists’ (twice), ya da bir ‘or a/one’ (twice), ne yazık ki ‘unfortunately’ (twice), ama yine de ‘but still/again’ (twice), kısa bir süre ‘for a moment’ (twice), ne olursa olsun ‘in any case’ (twice), her zamanki gibi ‘as usual’, ne de olsa ‘after all’, böyle bir şey ‘something like this’, ne var ki ‘however’, her ne kadar ‘no matter how’, bir başka deyişle ‘in other words’, belki de bu ‘maybe this’, çok önemli bir ‘a very important’, bu nedenle de ‘and therefore’, her şeyden önce ‘first and foremost’, bir an önce ‘immediately’, en ufak bir ‘a smallest’, işte o zaman ‘now at this time’, o kadar çok ‘that much’.
Frequencies: 73, 68, 63, 62, 61, 61, 60, 60, 59, 58, 51, 49, 48, 47, 45, 44, 43, 42, 41, 39, 39, 39, 38, 38, 38, 36.

Twelve tri-grams overlap in the two top-20 lists at various ranks, and 8 tri-grams in each list, displayed in bold typeface in Table 4, are specific to one domain.

5. Structural types of MWUs in Turkish

Our initial observations suggest that MWUs in Turkish are not very different from the MWUs identified in corpus studies of English (among many others, see Biber, Conrad & Cortes, 2004; Carter & McCarthy, 2006; Hyland, 2008). What emerges from the frequency-driven analysis are mostly noun phrases (NPs) or NP fragments, a situation similar to that in English (see Biber, 2009).
The types identified below are almost exclusively NPs, yet we have set up additional categories to underline their special role in the text, given their respective frequencies. For example, degree expressions and quantifiers, as well as demonstratives, are in fact NP elements. Similarly, those that combine with conjunctions are also part of the following NP or NP fragment. Furthermore, some n-grams appear with identical items in alternative orders. As noted in previous studies, some bi-grams appear in tri-grams, or some tri-grams are expansions of bi-grams (Cortes, 2004). The following classification shows the structural types of n-grams retrieved from the corpora.

Type 1 bi-grams: (Generic) NPs or NP fragments
1a. Indefinite article + head noun (full phrase): bir adam ‘a man’
1b. Quantifier + head noun (full phrase): her gün ‘everyday’
1c. Adjective/modifying expression + indefinite article (NP fragment): güzel bir ‘something good’, önemli bir ‘something important’
1d. Postpositional phrases: bir anda ‘in a minute’, bir yandan ‘on one hand’

Type 2 bi-grams: Conjunctions with fragments of conjuncts
2a. First conjunct fragment + conjunction: sonra da ‘and then’
2b. Conjunction + second conjunct fragment: ve bir ‘and a/one’
2c. Connectives: yazık ki ‘unfortunately’, biraz da ‘just a bit’

Type 3 bi-grams: Degree/quantification expressions
3a. Degree expression + adjective: çok önemli ‘very important’, daha iyi ‘better’, en az ‘the least’
3b. Quantifier + degree / degree + quantifier: biraz daha ‘some’

Type 4 bi-grams: Postpositional phrases/fragments
4a. Full postpositional phrases: benim için ‘for me’
4b. PP fragments: süre sonra ‘after time’

Type 1 tri-grams: NPs or NP fragments
1a. Full NPs: her geçen gün ‘day by day’
1b. NP fragments (quantifier + adjective + indefinite article): çok büyük bir ‘a very big’, en küçük bir ‘a smallest’

Type 2 tri-grams: Conjunctions
2a. Conjunctions followed by phrases or fragments: ve bu arada ‘and meanwhile’, ve sonra da ‘and then’
2b. Conjunctions preceded by phrases or fragments: daha önce de ‘as before’, diğer yandan da ‘and besides’
2c. ya da forms: ya da başka ‘or another’, ya da daha ‘or more’, ya da bir ‘or a/one’
2d. Ne-forms: ne de olsa ‘already known’, ne var ki ‘however’, ne yazık ki ‘unfortunately’, her ne kadar ‘no matter how’

Type 3 tri-grams: Postpositions (with complements or fragments)
3a. Full phrases: her şeyden önce ‘first and foremost’, bir süre sonra ‘after a while’, her zamanki gibi ‘as usual’
3b. Postposition + fragments: gibi bir şey ‘something like’
3c. Secondary (NPs with oblique cases): bir başka deyişle ‘in other words’

Type 4 tri-grams: Olarak nominalizations: bir bütün olarak ‘as a whole’, bunun sonucu olarak ‘as a result of this’

Type 5 tri-grams: Clauses: bir şey yok ‘nothing exists’, bir şey var ‘something exists’

Excluding light verb constructions, one can rarely find a MWU that contains a verbal element. This is probably related to the nature of function words in Turkish: those that would accompany a verb are generally bound affixes rather than free words in written form. All bi-grams and tri-grams are composed either entirely or partially of function words. Those that are not function words undergo semantic bleaching and form non-compositional formulaic expressions.

6. Functional categories of MWUs in Turkish

The functions of n-grams and their sub-categories observed in fictional prose and informative texts are determined on the basis of the classes proposed by Biber, Conrad & Cortes (2004), Cortes (2004), Carter & McCarthy (2006) and Hyland (2008). We employ the three primary function groups proposed by Biber, Conrad & Cortes (2004: 384) for MWUs in English and then extend the sub-categories in line with the above-mentioned studies. The three major functions are referential expressions, discourse organizers and stance expressions. Referential expressions make direct reference to physical and abstract entities, either to identify the entity or to single out some particular aspect of it as important. Discourse organizers show relationships between prior and coming discourse, and stance expressions convey the writer’s attitudes and evaluations. In addition to these categories, n-grams serve a set of functions, such as reporting and questioning, that are typically found in conversational interactions. We group these functions under the category of conversational features. As noted by Biber, Conrad & Cortes (2004), a single n-gram can have multiple functions, even in a single occurrence. The bi-grams and tri-grams obtained from imaginative and informative texts are then classified according to the functions they perform in their extended contexts. Table 5 presents samples of the MWUs with respect to their major functional categories along with their relevant sub-categories.

Table 5. MWUs classified according to their functions in context

| Category | Sub-category | N-gram |
|---|---|---|
| Referential expressions | Time reference | daha sonra ‘later’ |
| | Place reference | bir yer ‘a place’ |
| | Person reference | ben de ‘me too’ |
| | Vague expression | gibi bir şey ‘something like’ |
| | Quantification | çok daha fazla ‘a lot more’ |
| | Description | iyi bir ‘something good’ |
| Text organizers | Transitional signals | ne olursa olsun ‘no matter how’ |
| | Resultative signals | bunun sonucu olarak ‘as a result of this’ |
| | Focusing signals | bu da ‘and this’ |
| | Framing signals | söz konusu olan ‘the given’ |
| Stance expressions | Epistemic stance | belki de ‘maybe’, belki de en ‘maybe the most’ |
| Conversational features | Interactional markers | öyle değil mi ‘isn’t it so’ |
| | Reporting | dedi kendi kendine ‘said to himself/herself’ |
| | Questioning | var mı ‘is there’ |

From the frequency distribution of the functions of bi-grams and tri-grams, we observe that referential expressions are the most frequent discourse function, accounting for 75% of bi-grams and 67% of tri-grams, as seen in Tables 6 and 7. Tri-grams used as text organizers have a slightly higher share (23%) than bi-grams with the same function (19%).

Table 6. Functions of bi-grams

| Functions | Frequency | % |
|---|---|---|
| Referential Expressions | 135 | 75.41 |
| Text Organizers | 35 | 19.55 |
| Stance Expressions | 2 | 1.11 |
| Conversational Features | 7 | 3.91 |
| Total | 179 | 100 |

Table 7. Functions of tri-grams

| Functions | Frequency | % |
|---|---|---|
| Referential Expressions | 101 | 67.78 |
| Text Organizers | 35 | 23.48 |
| Stance Expressions | 4 | 2.68 |
| Conversational Features | 9 | 6.04 |
| Total | 149 | 100 |

7. Domain/register specific MWUs

The differences between the observed frequencies of bi-grams and tri-grams in the imaginative and informative domains were found to be statistically significant by a proportion test conducted in Minitab 16. According to the proportion analysis, a low ratio (between 0 and 30%) between the uses of an n-gram in the two domains signals a difference. In other words, it indicates a domain-specific preference for a multi-word unit.
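
The ratio used in this comparison can be illustrated with a short calculation. The sketch below rests on an assumption about how the figure is obtained: it divides the smaller of the two domain frequencies by the larger, which matches the values reported in the final column of this section's tables for the three sample bi-grams used here. It is not the Minitab proportion test itself. The function names and the "intermediate" label for ratios that fall between the stated bands are hypothetical.

```python
def usage_ratio(freq_a: int, freq_b: int) -> float:
    """Return the smaller frequency divided by the larger (assumed definition)."""
    larger = max(freq_a, freq_b)
    if larger == 0:
        return 0.0  # the n-gram occurs in neither domain
    return min(freq_a, freq_b) / larger


def interpret(ratio: float) -> str:
    """Map a ratio onto the bands described in the text."""
    if ratio <= 0.30:
        return "domain-specific"
    if ratio >= 0.80:
        return "similar"
    return "intermediate"  # hypothetical label; covers the 'half' cases near 0.50


# Frequencies (imaginative, informative) as reported in this section's tables.
samples = {
    "diye sordu": (234, 34),
    "belki de": (622, 313),
    "gibi bir": (301, 291),
}

for ngram, (imaginative, informative) in samples.items():
    ratio = usage_ratio(imaginative, informative)
    print(f"{ngram}: {ratio:.2f} ({interpret(ratio)})")
```

For these three bi-grams the sketch yields 0.15, 0.50 and 0.97, so they fall into the domain-specific, intermediate and similar bands respectively.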
When the ratio between the uses of an n-gram across the domains is 50%, the MWU occurs in one domain half as often as in the other. When the ratio is high (between 80 and 100%), it indicates a similarity in the use of the MWU across the domains.

In this respect, we find that diye düşündü ‘s/he thought like that’, diye sordu ‘s/he asked like that’, biliyor musun ‘do you know’ and sen de ‘you too’ are some of the bi-grams that seem to reflect the characteristic properties of fictional prose, which constitutes the imaginative domain. Fictional prose includes direct and indirect speech representation of characters; to that effect, a variety of bi-grams carry out the discourse function of reporting (example 1). In creating a fictional world, pronominal reference is an essential part of thematic development. While the person-reference bi-gram sen de ‘you too’ occurs 240 times in fictional prose, it is used only 66 times in the informative texts, as seen in Table 8.

Table 8. Bi-grams specific to imaginative domain

| N-gram | Imaginative Frequency | Informative Frequency | Ratio |
|---|---|---|---|
| diye düşündü ‘s/he thought’ | 137 | 0 | 0.00 |
| diye sordu ‘s/he asked’ | 234 | 34 | 0.15 |
| biliyor musun ‘do you know’ | 128 | 25 | 0.20 |
| sen de ‘you too’ | 240 | 66 | 0.28 |

(1) “Nişanlı değil miyiz biz? Zaten evlenecek değil miydik,” diye sordu. (Ayşe Kulin-Gece Sesleri)
“He asked, ‘Aren’t we engaged? Weren’t we going to marry anyway?’”

There are cases in which the frequency of a MWU in one domain is half of the figure in the other domain. The figures in the following table are interpreted in this way. For instance, while the bi-gram bir şekilde ‘in a way’ is used 285 times in the informative domain, it is used 142 times in the imaginative domain. Table 9 lists some of the relevant examples.

Table 9. Bi-grams whose frequency in one domain is half of that in the other

| N-gram | Imaginative Frequency | Informative Frequency | Ratio |
|---|---|---|---|
| belki de ‘maybe’ | 622 | 313 | 0.50 |
| yine de ‘even though’ | 324 | 163 | 0.50 |
| bir anda ‘in a moment’ | 155 | 78 | 0.50 |
| bir şekilde ‘in a way’ | 142 | 285 | 0.50 |

(2) Adam bu yazıyı belki de, intihara karar vermeden önce yazıhanesine bırakmıştı... (Erhan Bener-Gece Gelen Ölüm)
“It may also be that the man had left that piece of writing in his office before he decided to commit suicide.”

In some other cases, the imaginative and informative domains are similar in their use of bi-grams. Some of these bi-grams are listed in Table 10. Examples (3) and (4) are excerpts taken from the Corpus of Contemporary Turkish Fiction and the Corpus of Contemporary Turkish Informative Prose respectively.

Table 10. Bi-grams similar across the domains

| N-gram | Imaginative Frequency | Informative Frequency | Ratio |
|---|---|---|---|
| gibi bir ‘like a’ | 301 | 291 | 0.97 |
| ama bu ‘but this’ | 217 | 210 | 0.97 |
| ve o ‘and that’ | 133 | 131 | 0.98 |
| kısa bir ‘a short’ | 116 | 114 | 0.98 |

(3) Buna bile bir diyeceğim olmaz. Ama bu bina Hürriyet'te çalışanlara servet filan kazandırmamıştı. (Azize Bergin-Babalide Topuk Tıkırtıları)
“I would have nothing to say even to this. But this building had not brought the Hürriyet employees a fortune or the like.”

(4) Maliye Bakanı için Kordon'da ayrılan eve annesini yerleştirmiş ve o evi alacağını söylemişti. (Betül Uncular-Dünden Bugüne Lacililer)
“He had settled his mother in the house in Kordon that had been reserved for the Finance Minister and had declared that he would buy that house.”

In our corpus data, tri-grams are deployed to express referential links, and they achieve discourse organization via a multitude of sub-categories that are found to be specific to the informative domain. In this respect, lexical sequences functioning as transitional and framing signals, such as bir başka deyişle ‘in other words’ and başta olmak üzere ‘being in the first’, are likely to index an informative text.
Some of the sample tri-grams typical of the informative domain are given in Table 11.

Table 11. Tri-grams specific to informative domain

| N-gram | Informative Frequency | Imaginative Frequency | Ratio |
|---|---|---|---|
| bir başka deyişle ‘in other words’ | 59 | 1 | 0.02 |
| başta olmak üzere ‘being in the first’ | 25 | 2 | 0.07 |
| bunun sonucu olarak ‘as a result’ | 24 | 2 | 0.08 |
| bu tür bir ‘of this kind’ | 27 | 3 | 0.11 |
| bu nedenle de ‘because of this’ | 47 | 5 | 0.11 |
| bir bütün olarak ‘as a whole’ | 18 | 2 | 0.11 |
| kendine özgü bir ‘a unique’ | 17 | 4 | 0.24 |
| söz konusu olan ‘the given’ | 16 | 3 | 0.31 |

(5) Kanadalılar daha sonra misyonerlik de başta olmak üzere birçok şey yapacaktı altın işletmeciliği adı altında. (İbrahim Türkhan-Tanrı Dağlarının Yankısı)
“The Canadians were later to do lots of things, with missionary work being the first among them, under the name of gold mining.”

Tri-grams sensitive to the imaginative domain again reflect the register properties of fictional prose. Seven top-ranking lexical chunks, used respectively for reporting, for highlighting narrative time sequence and for emphasizing interactional links, are given in Table 12.

Table 12. Tri-grams specific to imaginative domain

| N-gram | Imaginative Frequency | Informative Frequency | Ratio |
|---|---|---|---|
| dedim kendi kendime ‘said to myself’ | 18 | 1 | 0.06 |
| dedi kendi kendine ‘said to her/himself’ | 17 | 1 | 0.06 |
| bir hali vardı ‘as if s/he was’ | 16 | 1 | 0.06 |
| tam o sırada ‘at that moment’ | 33 | 4 | 0.12 |
| bir an için ‘just for a moment’ | 30 | 5 | 0.17 |
| öyle değil mi ‘isn’t it so’ | 35 | 8 | 0.23 |
| her zamanki gibi ‘as always’ | 61 | 18 | 0.30 |

(6) Hepsi bir ağızdan konuşup durdular. Kelebeğime baktığımda, bir ölüye benzemiyor, dedim kendi kendime. (Şebnem İşigüzel-Öykümü Kim Anlatacak)
“They all kept talking at the same time. When I looked at my butterfly, it does not look like a dead one, I said to myself.”

Finally, on the basis of the proportion analysis, we find that informative and imaginative texts are similar in their use of some tri-grams, which are presented in Table 13.

Table 13. Tri-grams similar across domains

| N-gram | Imaginative Frequency | Informative Frequency | Ratio |
|---|---|---|---|
| bir o kadar ‘at least that much’ | 18 | 17 | 0.94 |
| her geçen gün ‘day by day’ | 19 | 18 | 0.95 |
| ne yazık ki ‘unfortunately’ | 61 | 58 | 0.95 |
| diye bir şey ‘something like that’ | 23 | 22 | 0.96 |

(7) Yüzü aşkın ölü, yaralı. (Uğur Kökden-Bin dokuz Yüze Veda)
“Over a hundred dead, and at least that many injured.”

(8) Her kırmızı ışıkta durulacak yok tabii de, ama ille durulması gereken bir durum olduğunda, n'apılacak? (Ferhan Şensoy-Rum Memet)
“Of course, there is no such thing as having to stop at every red light, but when there is a situation in which one must stop, what is to be done?”

8. Conclusion

In this short paper, we have presented the initial results of our analysis of MWUs based on data from two specially constructed corpora. We aimed (i) to provide a preliminary typology of the MWUs identified in the corpora, and (ii) to discuss their particular functions in different domains representing two registers, namely fictional prose and informative texts.

We observe that, compared to English, the structural types of multi-word units in Turkish are far fewer in number. The main reason for this difference may lie in the morphological properties of the two languages. Apart from quantitative differences, similar structural types of MWUs are found in both languages. As for their respective functions, the distribution of MWUs and their recurrent use help define domain/register differentiation. Lexical structures that appear more frequently than others contribute to the sense and distinctiveness of a register, as suggested by the proportion test implemented in this study.

References

Aksan, Y., Aksan, M., Koltuksuz, A. et al. 2012. Construction of the Turkish National Corpus (TNC). Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
Anthony, L. 2010. AntConc (Version 126.96.36.199w) [Computer Software]. Tokyo, Japan: Waseda University. http://www.antlab.sci.waseda.ac.jp/
Banerjee, S. & Pedersen, T. 2003. The design, implementation and use of the Ngram Statistics Package. Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, 370-381.
Biber, D., Conrad, S. & Cortes, V. 2004. If you look at ...: Lexical bundles in university teaching and textbooks. Applied Linguistics 25, 371–405.
Biber, D. 2009. A corpus-driven approach to formulaic language in English. International Journal of Corpus Linguistics 14, 275-311.
Breeze, R. 2013. Lexical bundles across four legal genres. International Journal of Corpus Linguistics 18, 229-253.
Carter, R.A. & McCarthy, M.J. 2006. Cambridge grammar of English. Cambridge: Cambridge University Press.
Cortes, V. 2004. Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific Purposes 23, 397–423.
Hyland, K. 2008. As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes 27, 4–21.
McCarthy, M.J. & Carter, R.A. 2006. This that and the other: Multi-word clusters in spoken English as visible patterns of interaction. In: McCarthy, M.J. (ed.) Explorations in corpus linguistics. Cambridge: Cambridge University Press, 7-26.
Minitab 16 Statistical Software. 2010. [Computer Software]. State College, PA: Minitab, Inc. (www.minitab.com).
O’Keeffe, A., McCarthy, M.J. & Carter, R.A. 2007. From corpus to classroom. Cambridge: Cambridge University Press.
Sinclair, J. 1991. Corpus, concordance, collocation. Oxford: Oxford University Press.
Wray, A. & Perkins, M.R. 2000. The functions of formulaic language: An integrated model. Language and Communication 20, 1-28.