Higher revision available You are viewing revision 5 of this document. A higher revision of this document has been published: Revision 10.

ELTeC

A multilingual European Literary Text Collection (ELTeC), ultimately containing around 2,500 full-text novels in at least 10 different languages, making it possible to test methods and compare results across national traditions.

The ELTeC is built in three iterations:

  1. First iteration: 6 subcollections (100 novels per language) for the period ca. 1850 to 1920, providing a starting point for research.
  2. Second iteration: at least an additional 4 subcollections (100 novels per language) for the same period, completing the “ELTeC core”.
  3. Third iteration: extensions to the “ELTeC core” with at least 6 additional subcollections (a) in additional languages, widening the range of ELTeC, (b) for languages already included, but covering the earlier period from ca. 1780 to 1850, enabling diachronic views on literary history, and (c) with additional, larger but less strictly structured subcollections for languages already included, providing a broader empirical base for specific analyses.

Project:

Browse

All Works

Search for Regular Metadata (Facets Possible)

1. language --> edition.language:"eng"

2. timeSlot --> work.temporal.id.value:"timeSlot" AND work.temporal.value:"T3" (in ONE tag!)

3. firstEdition --> work.dateOfCreation.value:"1840"

Search for Project Specific Metadata (Search only)

4. authorGender --> work.subject.id.value:"authorGender" AND work.subject.value:"male" (in ONE tag!)

5. size --> work.subject.id.value:"size" AND work.subject.value:"medium" (in ONE tag!)

6. reprintCount --> work.subject.id.value:"reprintCount" AND work.subject.value:"high" (in ONE tag!)
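
The six query patterns above follow two shapes: a single `field:"value"` clause, and an id/value pair joined with `AND`. A minimal sketch of composing such strings is given below. The helper names `field_query` and `keyed_query` are hypothetical and not part of any TextGrid tooling; the code only builds the query text and does not send it anywhere.

```python
def field_query(field: str, value: str) -> str:
    """Return one field:"value" clause in the syntax listed above."""
    return f'{field}:"{value}"'


def keyed_query(prefix: str, key: str, value: str) -> str:
    """Join the id clause and the value clause for metadata stored as id/value pairs.

    `prefix` is the common field prefix, e.g. "work.temporal" or "work.subject".
    """
    id_clause = field_query(f"{prefix}.id.value", key)
    value_clause = field_query(f"{prefix}.value", value)
    return f"{id_clause} AND {value_clause}"


# Regular metadata
language = field_query("edition.language", "eng")
time_slot = keyed_query("work.temporal", "timeSlot", "T3")
first_edition = field_query("work.dateOfCreation.value", "1840")

# Project-specific metadata
author_gender = keyed_query("work.subject", "authorGender", "male")
size = keyed_query("work.subject", "size", "medium")
reprint_count = keyed_query("work.subject", "reprintCount", "high")

for query in (language, time_slot, first_edition, author_gender, size, reprint_count):
    print(query)
```

Each printed line reproduces one of the numbered queries, so the resulting string can be pasted into the search field as a single entry.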

Statistics:

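| Language | Texts | Words | Male Author | Female Author | Short | Medium | Long | 1840-59 | 1860-79 | 1880-99 | 1900-20 | Frequent | Rare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cze | 16 | 366626 | 14 | 2 | 16 | 0 | 0 | 5 | 6 | 5 | 0 | 0 | 15 |
| deu | 98 | 12086096 | 65 | 33 | 20 | 37 | 41 | 24 | 24 | 25 | 25 | 46 | 46 |
| eng | 100 | 11794738 | 49 | 51 | 28 | 28 | 44 | 22 | 22 | 30 | 26 | 32 | 68 |
| fra | 100 | 7986274 | 64 | 36 | 30 | 43 | 27 | 25 | 25 | 25 | 25 | 44 | 56 |
| gre | 11 | 42524 | 10 | 1 | 11 | 0 | 0 | 0 | 1 | 6 | 4 | 3 | 4 |
| hun | 100 | 7591321 | 85 | 15 | 44 | 33 | 23 | 24 | 24 | 25 | 27 | 41 | 31 |
| ita | 34 | 3328244 | 32 | 2 | 13 | 10 | 11 | 5 | 12 | 10 | 7 | 12 | 0 |
| lit | 18 | 516555 | 11 | 7 | 14 | 3 | 1 | 5 | 3 | 2 | 8 | 5 | 13 |
| nor | 27 | 1114092 | 22 | 5 | 18 | 9 | 0 | 2 | 2 | 19 | 4 | 26 | 1 |
| por | 96 | 6313980 | 80 | 16 | 39 | 39 | 18 | 11 | 37 | 18 | 30 | 23 | 20 |
| rom | 70 | 4205653 | 58 | 8 | 32 | 26 | 12 | 3 | 14 | 22 | 31 | 23 | 47 |
| slv | 100 | 5682120 | 89 | 11 | 53 | 39 | 8 | 2 | 13 | 36 | 49 | 48 | 52 |
| spa | 46 | 3989071 | 34 | 12 | 14 | 21 | 11 | 10 | 14 | 15 | 7 | 29 | 17 |
| srp | 53 | 2253171 | 46 | 7 | 35 | 18 | 0 | 1 | 5 | 23 | 24 | 15 | 28 |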
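
The statistics above can be read into a small data structure to compute totals or to check the counts for internal consistency. The sketch below assumes the rows are available as whitespace-separated text, one language per line, with the figures copied from this revision; the names `COLUMNS`, `ROWS` and `parse` are illustrative only.

```python
# Column names, in the order used by the statistics above.
COLUMNS = ["language", "texts", "words", "male", "female",
           "short", "medium", "long",
           "1840-59", "1860-79", "1880-99", "1900-20",
           "frequent", "rare"]

# Figures copied from the statistics of this revision.
ROWS = """\
cze 16 366626 14 2 16 0 0 5 6 5 0 0 15
deu 98 12086096 65 33 20 37 41 24 24 25 25 46 46
eng 100 11794738 49 51 28 28 44 22 22 30 26 32 68
fra 100 7986274 64 36 30 43 27 25 25 25 25 44 56
gre 11 42524 10 1 11 0 0 0 1 6 4 3 4
hun 100 7591321 85 15 44 33 23 24 24 25 27 41 31
ita 34 3328244 32 2 13 10 11 5 12 10 7 12 0
lit 18 516555 11 7 14 3 1 5 3 2 8 5 13
nor 27 1114092 22 5 18 9 0 2 2 19 4 26 1
por 96 6313980 80 16 39 39 18 11 37 18 30 23 20
rom 70 4205653 58 8 32 26 12 3 14 22 31 23 47
slv 100 5682120 89 11 53 39 8 2 13 36 49 48 52
spa 46 3989071 34 12 14 21 11 10 14 15 7 29 17
srp 53 2253171 46 7 35 18 0 1 5 23 24 15 28
"""


def parse(rows: str) -> list[dict]:
    """Turn each line into a dict keyed by column name, with numeric cells as int."""
    records = []
    for line in rows.strip().splitlines():
        cells = line.split()
        record = {"language": cells[0]}
        record.update({name: int(cell) for name, cell in zip(COLUMNS[1:], cells[1:])})
        records.append(record)
    return records


records = parse(ROWS)

# Total number of texts across all subcollections in this revision.
total_texts = sum(r["texts"] for r in records)

# Languages whose author-gender counts do not add up to the number of texts.
gender_mismatch = [r["language"] for r in records
                   if r["male"] + r["female"] != r["texts"]]

print(total_texts, gender_mismatch)
```

With the figures of this revision, the subcollections add up to 869 texts, and the gender check flags only `rom` (58 + 8 = 66 against 70 texts). The same pattern can be applied to the size, time-slot and reprint columns.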

The ELTeC core contains at least 10 linguistically annotated subcollections of 100 novels comparable in their internal structure in at least 10 different European languages, totalling at least 1,000 full-text novels. The extended ELTeC takes the total number of full-text novels to at least 2,500. Novels have been chosen among major literary genres for availability and size. Chronological limits are due to constraints related to copyright and availability of quality full texts.

Creating such a benchmark corpus required a corpus design that makes texts and individual sub-collections comparable according to different metadata sets. It should be possible to sample sub-collections from the ELTeC for specific tasks and research questions, and to reformat them in ways appropriate to one's own tools. The focus of the ELTeC encoding scheme is thus not to represent texts in all their original complexity of structure or appearance, but rather to facilitate a richer and better-informed distant reading than a transcription of their lexical content alone would permit.


Citation Suggestion for this Object
TextGrid Repository (2020). README.md. ELTeC Test. ELTeC conversion. https://hdl.handle.net/21.T11991/0000-001A-728A-8