Higher revision available. You are viewing revision 6 of this document. A higher revision of this document has been published: Revision 10.

ELTeC

A multilingual European Literary Text Collection (ELTeC), ultimately containing around 2,500 full-text novels in at least 10 different languages, making it possible to test methods and compare results across national traditions.

The ELTeC is built in three iterations:

  1. First iteration: 6 subcollections (100 novels per language) for the period ca. 1850 to 1920, providing a starting point for research.
  2. Second iteration: at least an additional 4 subcollections (100 novels per language) for the same period, completing the “ELTeC core”.
  3. Third iteration: extensions to the “ELTeC core” with at least 6 additional subcollections (a) in additional languages, widening the range of ELTeC, (b) for languages already included, but covering the earlier period from ca. 1780 to 1850, enabling diachronic views on literary history, and (c) as additional, larger but less strictly structured subcollections for languages already included, providing a broader empirical base for specific analyses.

Project:

Browse

All Works

Search for Regular Metadata (Facets Possible)

1. language --> edition.language:"eng"

2. timeSlot --> work.temporal.id.value:"timeSlot" AND work.temporal.value:"T3" (in ONE tag!)

3. firstEdition --> work.dateOfCreation.value:"1840"

Search for Project Specific Metadata (Search only)

4. authorGender --> work.subject.id.value:"authorGender" AND work.subject.value:"male" (in ONE tag!)

5. size --> work.subject.id.value:"size" AND work.subject.value:"medium" (in ONE tag!)

6. reprintCount --> work.subject.id.value:"reprintCount" AND work.subject.value:"high" (in ONE tag!)
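
The query patterns listed above follow two shapes: a plain `field:"value"` clause, and an id/value pair joined with `AND`. The following is a minimal sketch that assembles those strings; the helper names `field_query` and `tagged_query` are hypothetical and not part of any TextGrid API, and the sketch only builds the query text without sending it anywhere.

```python
def field_query(field, value):
    """Return a single clause of the form field:"value"."""
    return f'{field}:"{value}"'


def tagged_query(section, tag_id, value):
    """Return the id/value pair pattern, joined with AND.

    `section` is the common prefix, e.g. "work.subject" or "work.temporal".
    """
    return (
        field_query(f"{section}.id.value", tag_id)
        + " AND "
        + field_query(f"{section}.value", value)
    )


# The six example queries from the lists above, rebuilt with the helpers.
queries = {
    "language": field_query("edition.language", "eng"),
    "timeSlot": tagged_query("work.temporal", "timeSlot", "T3"),
    "firstEdition": field_query("work.dateOfCreation.value", "1840"),
    "authorGender": tagged_query("work.subject", "authorGender", "male"),
    "size": tagged_query("work.subject", "size", "medium"),
    "reprintCount": tagged_query("work.subject", "reprintCount", "high"),
}

for name, query in queries.items():
    print(f"{name} --> {query}")
```

Keeping the two shapes in separate helpers makes it harder to forget the `id.value` half of a project-specific query.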

Statistics:

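| Language | Texts | Words | Author ♂/♀ | Length s/m/l | 1840-59 | 1860-79 | 1880-99 | 1900-20 | Frequent | Rare |
|---|---|---|---|---|---|---|---|---|---|---|
| cze | 16 | 366626 | 14/2 | 16/0/0 | 5 | 6 | 5 | 0 | 0 | 15 |
| deu | 98 | 12086096 | 65/33 | 20/37/41 | 24 | 24 | 25 | 25 | 46 | 46 |
| eng | 100 | 11794738 | 49/51 | 28/28/44 | 22 | 22 | 30 | 26 | 32 | 68 |
| fra | 100 | 7986274 | 64/36 | 30/43/27 | 25 | 25 | 25 | 25 | 44 | 56 |
| gre | 11 | 42524 | 10/1 | 11/0/0 | 0 | 1 | 6 | 4 | 3 | 4 |
| hun | 100 | 7591321 | 85/15 | 44/33/23 | 24 | 24 | 25 | 27 | 41 | 31 |
| ita | 34 | 3328244 | 32/2 | 13/10/11 | 5 | 12 | 10 | 7 | 12 | 0 |
| lit | 18 | 516555 | 11/7 | 14/3/1 | 5 | 3 | 2 | 8 | 5 | 13 |
| nor | 27 | 1114092 | 22/5 | 18/9/0 | 2 | 2 | 19 | 4 | 26 | 1 |
| por | 96 | 6313980 | 80/16 | 39/39/18 | 11 | 37 | 18 | 30 | 23 | 20 |
| rom | 70 | 4205653 | 58/8 | 32/26/12 | 3 | 14 | 22 | 31 | 23 | 47 |
| slv | 100 | 5682120 | 89/11 | 53/39/8 | 2 | 13 | 36 | 49 | 48 | 52 |
| spa | 46 | 3989071 | 34/12 | 14/21/11 | 10 | 14 | 15 | 7 | 29 | 17 |
| srp | 53 | 2253171 | 46/7 | 35/18/0 | 1 | 5 | 23 | 24 | 15 | 28 |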
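
To work with these figures programmatically, a row of the statistics table can be split into typed fields. The sketch below is one possible way to do that, not an official ELTeC tool: the function name `parse_row` and the dictionary keys are hypothetical, and reading `s/m/l` as short/medium/long is an assumption based on the column header.

```python
# Column labels for the four time periods, as given in the table header.
PERIODS = ("1840-59", "1860-79", "1880-99", "1900-20")


def parse_row(line):
    """Split one statistics row into typed fields.

    Accepts whitespace-separated rows and tolerates `|` column separators.
    """
    cells = line.replace("|", " ").split()
    if len(cells) != 11:
        raise ValueError(f"expected 11 columns, got {len(cells)}")
    male, female = (int(n) for n in cells[3].split("/"))
    # Assumption: s/m/l stands for short/medium/long.
    short, medium, long_ = (int(n) for n in cells[4].split("/"))
    return {
        "language": cells[0],
        "texts": int(cells[1]),
        "words": int(cells[2]),
        "authors": {"male": male, "female": female},
        "length": {"short": short, "medium": medium, "long": long_},
        "periods": dict(zip(PERIODS, (int(n) for n in cells[5:9]))),
        "frequent": int(cells[9]),
        "rare": int(cells[10]),
    }


# Example: the English row, copied from the table.
row = parse_row("eng 100 11794738 49/51 28/28/44 22 22 30 26 32 68")
print(row["language"], row["texts"], row["authors"], row["periods"])
```

A parsed row also makes it easy to check whether the breakdown columns add up to the `Texts` count for a given language.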

The ELTeC core contains at least 10 linguistically annotated subcollections, each consisting of 100 novels and comparable in internal structure, in at least 10 different European languages, totalling at least 1,000 full-text novels. The extended ELTeC takes the total number of full-text novels to at least 2,500. Novels have been chosen from among the major literary genres for reasons of availability and size. The chronological limits are due to constraints related to copyright and the availability of quality full texts.

Creating such a benchmark corpus required a corpus design that allows texts and individual sub-collections to be compared according to different sets of metadata. It should be possible to sample sub-collections from the ELTeC for specific tasks and research questions, and to reformat them in ways appropriate to one's own tools. The focus of the ELTeC encoding scheme is thus not to represent texts in all their original complexity of structure or appearance, but rather to facilitate a richer and better-informed distant reading than a transcription of their lexical content alone would permit.


Citation Suggestion for this Object
TextGrid Repository (2020). README.md. ELTeC Test. ELTeC conversion. https://hdl.handle.net/21.T11991/0000-001A-728A-8