ELTeC
ELTeC is a multilingual European Literary Text Collection that will ultimately contain around 2,500 full-text novels in at least 10 different languages, making it possible to test methods and compare results across national traditions.
The ELTeC is built in three iterations:
- First iteration: 6 subcollections (100 novels per language) for the period ca. 1850 to 1920, providing a starting point for research.
- Second iteration: at least an additional 4 subcollections (100 novels per language) for the same period, completing the “ELTeC core”.
- Third iteration: extensions to the “ELTeC core” with at least 6 additional subcollections (a) in additional languages, widening the range of ELTeC; (b) for languages already included, but covering the earlier period from ca. 1780 to 1850, enabling diachronic views on literary history; and (c) with additional, larger but less strictly structured subcollections for languages already included, providing a broader empirical base for specific analyses.
Project:
Search for Regular Metadata (Facets Possible)
1. language --> edition.language:"eng"
2. timeSlot --> work.temporal.id.value:"timeSlot" AND work.temporal.value:"T3" (in ONE tag!)
3. firstEdition --> work.dateOfCreation.value:"1840"
Search for Project Specific Metadata (Search only)
4. authorGender --> work.subject.id.value:"authorGender" AND work.subject.value:"male" (in ONE tag!)
5. size --> work.subject.id.value:"size" AND work.subject.value:"medium" (in ONE tag!)
6. reprintCount --> work.subject.id.value:"reprintCount" AND work.subject.value:"high" (in ONE tag!)
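The query patterns listed above can be assembled programmatically. The following is a minimal sketch only: the helper names `field_query` and `tagged_query` are hypothetical and not part of any TextGrid API, and the sketch assumes values contain no double quotes (no escaping is done). It only builds the query strings in the notation shown above; it does not send them anywhere.

```python
def field_query(field: str, value: str) -> str:
    """Return a single field:"value" clause in the notation used above."""
    return f'{field}:"{value}"'


def tagged_query(prefix: str, tag_id: str, value: str) -> str:
    """Join the id clause and the value clause of one tag with AND.

    `prefix` is e.g. "work.temporal" or "work.subject"; both clauses
    are meant to match within ONE tag, as noted in the list above.
    """
    id_clause = field_query(f"{prefix}.id.value", tag_id)
    value_clause = field_query(f"{prefix}.value", value)
    return f"{id_clause} AND {value_clause}"


# Regular metadata (facets possible)
print(field_query("edition.language", "eng"))
print(tagged_query("work.temporal", "timeSlot", "T3"))

# Project-specific metadata (search only)
print(tagged_query("work.subject", "authorGender", "male"))
print(tagged_query("work.subject", "size", "medium"))
```

Keeping the tag prefix as a parameter means the same helper covers both the `work.temporal` and the `work.subject` patterns.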
Statistics:
Language | Texts | Words | Male Author | Female Author | Short | Medium | Long | 1840-59 | 1860-79 | 1880-99 | 1900-20 | Frequent | Rare |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cze | 16 | 366626 | 14 | 2 | 16 | 0 | 0 | 5 | 6 | 5 | 0 | 0 | 15 |
deu | 98 | 12086096 | 65 | 33 | 20 | 37 | 41 | 24 | 24 | 25 | 25 | 46 | 46 |
eng | 100 | 11794738 | 49 | 51 | 28 | 28 | 44 | 22 | 22 | 30 | 26 | 32 | 68 |
fra | 100 | 7986274 | 64 | 36 | 30 | 43 | 27 | 25 | 25 | 25 | 25 | 44 | 56 |
gre | 11 | 42524 | 10 | 1 | 11 | 0 | 0 | 0 | 1 | 6 | 4 | 3 | 4 |
hun | 100 | 7591321 | 85 | 15 | 44 | 33 | 23 | 24 | 24 | 25 | 27 | 41 | 31 |
ita | 34 | 3328244 | 32 | 2 | 13 | 10 | 11 | 5 | 12 | 10 | 7 | 12 | 0 |
lit | 18 | 516555 | 11 | 7 | 14 | 3 | 1 | 5 | 3 | 2 | 8 | 5 | 13 |
nor | 27 | 1114092 | 22 | 5 | 18 | 9 | 0 | 2 | 2 | 19 | 4 | 26 | 1 |
por | 96 | 6313980 | 80 | 16 | 39 | 39 | 18 | 11 | 37 | 18 | 30 | 23 | 20 |
rom | 70 | 4205653 | 58 | 8 | 32 | 26 | 12 | 3 | 14 | 22 | 31 | 23 | 47 |
slv | 100 | 5682120 | 89 | 11 | 53 | 39 | 8 | 2 | 13 | 36 | 49 | 48 | 52 |
spa | 46 | 3989071 | 34 | 12 | 14 | 21 | 11 | 10 | 14 | 15 | 7 | 29 | 17 |
srp | 53 | 2253171 | 46 | 7 | 35 | 18 | 0 | 1 | 5 | 23 | 24 | 15 | 28 |
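When working with these statistics, a quick consistency check can be useful. The sketch below copies the first five columns of three rows from the table above (eng, fra, rom) and reports languages where the male and female author counts do not add up to the number of texts, along with the mean number of words per text. The parsing helper `parse_rows` is a hypothetical name introduced for illustration, and the reading of the columns (Male Author + Female Author should equal Texts) is an assumption, since some texts may simply lack a gender value.

```python
# First five columns of three rows, copied from the statistics table:
# Language | Texts | Words | Male Author | Female Author
ROWS = """\
eng | 100 | 11794738 | 49 | 51
fra | 100 | 7986274 | 64 | 36
rom | 70 | 4205653 | 58 | 8
"""


def parse_rows(rows: str) -> dict:
    """Parse pipe-separated rows into a dict keyed by language code."""
    stats = {}
    for line in rows.strip().splitlines():
        lang, texts, words, male, female = [c.strip() for c in line.split("|")]
        stats[lang] = {
            "texts": int(texts),
            "words": int(words),
            "male": int(male),
            "female": int(female),
        }
    return stats


stats = parse_rows(ROWS)

# Languages whose gender counts do not sum to the number of texts.
mismatches = [
    lang for lang, s in stats.items()
    if s["male"] + s["female"] != s["texts"]
]

# Mean words per text, rounded down to a whole number.
mean_words = {lang: s["words"] // s["texts"] for lang, s in stats.items()}

print("gender counts differ from text count:", mismatches)
print("mean words per text:", mean_words)
```

A mismatch reported here is not necessarily an error in the collection; it only flags rows worth a closer look.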
The ELTeC core contains at least 10 linguistically annotated subcollections of 100 novels comparable in their internal structure in at least 10 different European languages, totalling at least 1,000 full-text novels. The extended ELTeC takes the total number of full-text novels to at least 2,500. Novels have been chosen among major literary genres for availability and size. Chronological limits are due to constraints related to copyright and availability of quality full texts.
- An overview of the current state in ELTeC corpus building can be found here: Distant Reading
- Work on the ELTeC collections is in progress here: COST-ELTeC on Github
- A collection of relevant documentation can be found here: Distant Reading on Github
Creating such a benchmark corpus required a corpus design that makes texts and individual sub-collections comparable across different metadata sets. It should be possible to sample sub-collections from the ELTeC for specific tasks and research questions, and to reformat them in ways appropriate to users' own tools. The focus of the ELTeC encoding scheme is thus not to represent texts in all their original complexity of structure or appearance, but rather to facilitate a richer and better-informed distant reading than a transcription of their lexical content alone would permit.
- Rights holder
- ELTeC conversion
- Suggested citation for this object
- TextGrid Repository (2020). README.md. ELTeC Test. ELTeC conversion. https://hdl.handle.net/21.T11991/0000-001A-728A-8