Term
|
Quote
|
Corpus
|
Corpora are sources of quantitative information beyond compare.
|
Corpus
|
Leech (1992) argues that the corpus is a more powerful methodology from the point of view of the scientific method, as it is open to objective verification of results.
|
Corpus
|
Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.
|
Corpus
|
However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.
|
Corpus
|
In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus", when used in the context of modern linguistics, tends most frequently to have more specific connotations than this simple definition.
|
Corpus
|
We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with as accurate a picture as possible of the tendencies of that variety, as well as their proportions.
|
Corpus
|
Nowadays the term "corpus" nearly always implies the additional feature "machine-readable". This was not always the case, as in the past the word "corpus" was only used in reference to printed text.
|
Corpus
|
There is often a tacit understanding that a corpus constitutes a standard reference for the language variety that it represents. This presupposes that it will be widely available to other researchers, which is indeed the case with many corpora, e.g. the Brown Corpus, the LOB Corpus and the London-Lund Corpus.
|
Corpus
|
Part-of-speech annotation is useful because it increases the specificity of data retrieval from corpora, and also forms an essential foundation for further forms of analysis (such as syntactic parsing and semantic field annotation).
|
Corpus
|
Problem-oriented tagging (as described by de Haan (1984)) is the phenomenon whereby users will take a corpus, either already annotated, or unannotated, and add to it their own form of annotation, oriented particularly towards their own research goal.
|
Corpus
|
In this session we will examine a few of the roles which corpora may play in the study of language. The importance of corpora to language study is aligned to the importance of empirical data. Empirical data enable the linguist to make objective statements, rather than those which are subjective, or based upon the individual's own internalised cognitive perception of language.
|
Corpus
|
It is important to note that although many linguists may use the term "corpus" to refer to any collection of texts, when it is used here it refers to a body of text which is carefully sampled to be maximally representative of the language or language variety.
|
Corpus
|
A linguist who has access to a corpus, or other (non-representative) collection of machine-readable text, can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Also, definitions can be more complete and precise since a larger number of natural examples are examined.
|
Corpus
|
Grammatical (or syntactic) studies have, along with lexical studies, been the most frequent types of research which have used corpora.
|
Corpus
|
Because a corpus is sampled to maximally represent the population, any findings taken from the corpus can be generalised to the larger population. Hence quantification in corpus linguistics is more meaningful than other forms of linguistic quantification because it can tell us about a variety of language, not just that which is being analysed.
|
Corpus
|
Most European languages (not to mention Chinese, Japanese, Korean etc.) now have some sort of corpus already and there is a growing awareness that a good corpus can be put to many uses; hence their importance grows. Despite initial disapprovals voiced by some linguists, doubts are dispelled by obvious and indisputable facts: nobody has ever been able to manually collect and subsequently process so much data in his or her lifetime as the computer can in a very short time.
|
Corpus
|
It may still be premature to try to mark out exhaustively what corpora may do for language studies and linguists; undoubtedly, many new options are still to come while the appetite of linguists is gradually whetted and new ways of corpus exploitation are offered by corpus linguists. In fact, it is hard to see a linguistic discipline not being able to profit from a corpus one way or another, both written and oral. It is increasingly clear that new ways and methods for retrieving information from corpora will have to be given more thought.
|
Corpus
|
Since any language needs a consistent, perpetual and next-to-exhaustive coverage of its data, it should have a corpus of corresponding qualities, although in practice it is a gradual business of taking many minor decisions in the course of its construction and maintenance. This is particularly important in the case of small languages, which, unlike English and other languages, cannot afford the luxury of having a variety and multitude of corpora for specific purposes, at least not at the moment. What is really needed is a steady increase and perpetual growth of even, by present standards, very large corpora of billions of words, which should be as representative as possible.
|
Corpus
|
Although the degree of the coverage of language by a large corpus is considerable, it is by no means true that today's corpora reflect language as a whole. Moreover, some corpus linguists are becoming more and more susceptible to another challenge here, namely the degree of representativeness of this coverage, which is very much an open issue and a matter of much dispute.
|
Corpus
|
As information is to be found coming from all fields of human life and activity, it is hard to imagine that corpora can be based on a collection of, perhaps, newspapers only. On the other hand, this diversity of sources suggests that a mapping of proportions in which various kinds of information occur should take place and be reflected in the design and structure of the corpus, should this be a general type of corpus. This raises the problem of corpus representativeness, mentioned above.
|
Corpus
|
More generally, one may wonder where this trend actually fits in, in an attempt to pursue purely practical and utilitarian goals, or in one aiming at an exhaustive, systematic and non-eclectic description of one's language. Corpora definitely offer the latter possibility.
|
Corpus
|
Corpora are cross-sections of a discourse universe comprising all communication acts. The texts they monitor are principally transient communication acts.
|
Corpus
|
It is the task of the linguist to define and delimit the scope of the discourse universe she or he is interested in in such a way that it can be reduced to a corpus. Parameters can be language, time segment, region, situation, external and internal properties of texts, and many others.
|
Corpus collection
|
Corpus collection continued and diversified after the diary studies period: large sample studies covered the period roughly from 1927 to 1957; analysis was gathered from a large number of children with the express aim of establishing norms of development.
|
Early corpus linguistics
|
All the work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions:
The sentences of a natural language are finite.
The sentences of a natural language can be collected and enumerated.
|
1. Basic concepts
1.1. Introduction: corpora and corpus linguistics
1.2. Representativeness
1.3. Corpus size
1.4. Annotation
1.5. Corpus-building technology
1.6. Automatic annotation
1.7. Error correction and disambiguation
1.8. Data formats and standardization
1.9. Corpus managers
1.10. Users and uses of corpora
1.11. Types of corpora
1.12. Terminology
2. Syllabus for the course "Corpus Linguistics"
2.1. Organizational and methodological section
2.2. Course content
2.3. Part 1. Introduction to corpus linguistics
2.4. Part 2. Building corpora
2.5. Part 3. Using corpora
Appendix 1. Corpora on the Internet
Appendix 2. Text metadata in the Russian National Corpus (НКРЯ)
Appendix 3. Excerpt from a thesaurus dictionary of corpus linguistics
Appendix 4. Mini-corpus of corpus terminology