Higher revision available: you are viewing revision 6 of this document. A higher revision of this document has been published: Revision 10.

ELTeC

A multilingual European Literary Text Collection (ELTeC), ultimately containing around 2,500 full-text novels in at least 10 different languages, making it possible to test methods and compare results across national traditions.

The ELTeC is built in three iterations:

First iteration: 6 subcollections (100 novels per language) for the period ca. 1850 to 1920, providing a starting point for research.
Second iteration: at least an additional 4 subcollections (100 novels per language) for the same period, completing the “ELTeC core”.
Third iteration: extensions to the “ELTeC core” with at least 6 additional subcollections (a) in additional languages, widening the range of ELTeC, (b) for languages already included, but for the earlier period from ca. 1780 to 1850, enabling diachronic views on literary history, and (c) with additional, larger but less strictly structured subcollections for languages already included, providing a broader empirical base for specific analyses.

Project:


Search for Regular Metadata (Facets Possible)

1. language --> edition.language:"eng"

2. timeSlot --> work.temporal.id.value:"timeSlot" AND work.temporal.value:"T3" (in ONE tag!)

3. firstEdition --> work.dateOfCreation.value:"1840"

Search for Project Specific Metadata (Search only)

4. authorGender --> work.subject.id.value:"authorGender" AND work.subject.value:"male" (in ONE tag!)

5. size --> work.subject.id.value:"size" AND work.subject.value:"medium" (in ONE tag!)

6. reprintCount --> work.subject.id.value:"reprintCount" AND work.subject.value:"high" (in ONE tag!)
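
The six query patterns above can be assembled programmatically. The following is a minimal sketch in Python that only builds the query strings exactly as they are written in the list; the function names are hypothetical, no request is sent to the repository, and values are not escaped, so it assumes they contain no double quotes.

```python
def term(field: str, value: str) -> str:
    """Format a single field:"value" term."""
    return f'{field}:"{value}"'


def paired_term(prefix: str, key: str, value: str) -> str:
    """Join the id term and the value term with AND (fields 2 and 4 to 6)."""
    id_part = term(prefix + ".id.value", key)
    value_part = term(prefix + ".value", value)
    return f"{id_part} AND {value_part}"


def language_query(code: str) -> str:
    """Field 1: language of the edition, e.g. "eng"."""
    return term("edition.language", code)


def time_slot_query(slot: str) -> str:
    """Field 2: time slot, e.g. "T3"."""
    return paired_term("work.temporal", "timeSlot", slot)


def first_edition_query(year: int) -> str:
    """Field 3: year of the first edition, e.g. 1840."""
    return term("work.dateOfCreation.value", str(year))


def subject_query(key: str, value: str) -> str:
    """Fields 4 to 6: authorGender, size or reprintCount."""
    return paired_term("work.subject", key, value)


if __name__ == "__main__":
    print(language_query("eng"))
    print(time_slot_query("T3"))
    print(first_edition_query(1840))
    print(subject_query("size", "medium"))
```

Whether several of these queries can be combined in a single search is not stated here, so the sketch keeps each one separate.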

Statistics:

Language	Texts	Words	Author ♂/♀	Length s/m/l	1840-59	1860-79	1880-99	1900-20	Frequent	Rare
cze	16	366626	14/2	16/0/0	5	6	5	0	0	15
deu	98	12086096	65/33	20/37/41	24	24	25	25	46	46
eng	100	11794738	49/51	28/28/44	22	22	30	26	32	68
fra	100	7986274	64/36	30/43/27	25	25	25	25	44	56
gre	11	42524	10/1	11/0/0	0	1	6	4	3	4
hun	100	7591321	85/15	44/33/23	24	24	25	27	41	31
ita	34	3328244	32/2	13/10/11	5	12	10	7	12	0
lit	18	516555	11/7	14/3/1	5	3	2	8	5	13
nor	27	1114092	22/5	18/9/0	2	2	19	4	26	1
por	96	6313980	80/16	39/39/18	11	37	18	30	23	20
rom	70	4205653	58/8	32/26/12	3	14	22	31	23	47
slv	100	5682120	89/11	53/39/8	2	13	36	49	48	52
spa	46	3989071	34/12	14/21/11	10	14	15	7	29	17
srp	53	2253171	46/7	35/18/0	1	5	23	24	15	28
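
The table can be read as tab-separated values. The sketch below parses three rows copied from it (deu, eng, fra; only the first five columns are included for brevity), derives the average number of words per text, and checks that the author and length counts add up to the number of texts. The function names are hypothetical, and the reading of the slash-separated columns (male/female and short/medium/long, in header order) is an assumption.

```python
import csv
import io

# Three rows copied from the table above, tab-separated like the original.
TABLE = (
    "Language\tTexts\tWords\tAuthor ♂/♀\tLength s/m/l\n"
    "deu\t98\t12086096\t65/33\t20/37/41\n"
    "eng\t100\t11794738\t49/51\t28/28/44\n"
    "fra\t100\t7986274\t64/36\t30/43/27\n"
)


def split_counts(cell: str) -> list[int]:
    """Turn a cell such as "20/37/41" into [20, 37, 41]."""
    return [int(part) for part in cell.split("/")]


def parse_rows(text: str) -> list[dict]:
    """Parse the tab-separated table into one dictionary per language."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for record in reader:
        rows.append({
            "language": record["Language"],
            "texts": int(record["Texts"]),
            "words": int(record["Words"]),
            "authors": split_counts(record["Author ♂/♀"]),
            "lengths": split_counts(record["Length s/m/l"]),
        })
    return rows


def is_consistent(row: dict) -> bool:
    """True when author and length counts both sum to the text count."""
    return sum(row["authors"]) == row["texts"] == sum(row["lengths"])


if __name__ == "__main__":
    for row in parse_rows(TABLE):
        mean_words = row["words"] / row["texts"]
        print(row["language"], round(mean_words), is_consistent(row))
```

Running the same check over the full table would flag any rows whose breakdowns do not sum to the text count, which is to be expected while corpus building is still in progress.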

The ELTeC core contains at least 10 linguistically annotated subcollections of 100 novels comparable in their internal structure in at least 10 different European languages, totalling at least 1,000 full-text novels. The extended ELTeC takes the total number of full-text novels to at least 2,500. Novels have been chosen among major literary genres for availability and size. Chronological limits are due to constraints related to copyright and availability of quality full texts.

An overview of the current state in ELTeC corpus building can be found here: Distant Reading
Work on the ELTeC collections is in progress here: COST-ELTeC on Github
A collection of relevant documentation can be found here: Distant Reading on Github

Creating such a benchmark corpus required a corpus design that makes texts and individual sub-collections comparable according to different metadata sets. It should be possible to sample sub-collections from the ELTeC for specific tasks and research questions, and to reformat them in ways appropriate to one's own tools. The focus of the ELTeC encoding scheme is thus not to represent texts in all their original complexity of structure or appearance, but rather to facilitate a richer and better-informed distant reading than a transcription of their lexical content alone would permit.
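
Sampling a sub-collection by metadata could look like the following minimal sketch. Everything in it is illustrative: the `Novel` record, the `select` and `sample` helpers and the four demo entries are hypothetical, and the values "female", "short", "long", "low" and "T1" are assumed by analogy with "male", "medium", "high" and "T3", which are the only values this page shows.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Novel:
    """Hypothetical metadata record for one novel."""
    identifier: str
    language: str
    time_slot: str
    author_gender: str
    size: str
    reprint_count: str


# Invented demo records; these are not entries of the actual collection.
NOVELS = [
    Novel("demo-1", "eng", "T3", "male", "medium", "high"),
    Novel("demo-2", "eng", "T3", "female", "medium", "low"),
    Novel("demo-3", "fra", "T3", "male", "short", "high"),
    Novel("demo-4", "eng", "T1", "male", "long", "low"),
]


def select(novels, **criteria):
    """Keep the novels whose metadata match every given criterion."""
    return [
        novel for novel in novels
        if all(getattr(novel, key) == value for key, value in criteria.items())
    ]


def sample(novels, k, seed=0, **criteria):
    """Draw up to k matching novels, reproducibly for a given seed."""
    matches = select(novels, **criteria)
    rng = random.Random(seed)
    return rng.sample(matches, min(k, len(matches)))


if __name__ == "__main__":
    for novel in select(NOVELS, language="eng", time_slot="T3"):
        print(novel.identifier)
```

A fixed seed keeps the drawn sub-collection reproducible, which matters when results are to be compared across tools or languages.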

Holder of rights: ELTeC conversion

Citation Suggestion for this Object: TextGrid Repository (2020). README.md. ELTeC Test. ELTeC conversion. https://hdl.handle.net/21.T11991/0000-001A-728A-8
