Higher revision available: you are viewing revision 5 of this document. A higher revision of this document has been published: Revision 10.

ELTeC

The European Literary Text Collection (ELTeC) is a multilingual collection that will ultimately contain around 2,500 full-text novels in at least 10 different languages, making it possible to test methods and compare results across national traditions.

The ELTeC is built in three iterations:

First iteration: 6 subcollections (100 novels per language) for the period ca. 1850 to 1920, providing a starting point for research.
Second iteration: at least 4 additional subcollections (100 novels per language) for the same period, completing the “ELTeC core”.
Third iteration: extensions to the “ELTeC core” with at least 6 additional subcollections: (a) in additional languages, widening the range of ELTeC; (b) for languages already included, but covering the earlier period from ca. 1780 to 1850, enabling diachronic views on literary history; and (c) additional, larger but less strictly structured subcollections for languages already included, providing a broader empirical base for specific analyses.

Project:

Browse

All Works

Search for Regular Metadata (Facets Possible)

1. language --> edition.language:"eng"

2. timeSlot --> work.temporal.id.value:"timeSlot" AND work.temporal.value:"T3" (in ONE tag!)

3. firstEdition --> work.dateOfCreation.value:"1840"

Search for Project Specific Metadata (Search only)

4. authorGender --> work.subject.id.value:"authorGender" AND work.subject.value:"male" (in ONE tag!)

5. size --> work.subject.id.value:"size" AND work.subject.value:"medium" (in ONE tag!)

6. reprintCount --> work.subject.id.value:"reprintCount" AND work.subject.value:"high" (in ONE tag!)
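The six query patterns above follow two shapes: a single field with a quoted value, and an id/value pair joined with AND. The following minimal sketch rebuilds exactly those strings in Python. The helper names `simple_query` and `paired_query` are hypothetical and not part of any repository API; the sketch only assembles query text and does not send it anywhere.

```python
def simple_query(field, value):
    """Build a single-field query such as edition.language:"eng"."""
    return f'{field}:"{value}"'


def paired_query(prefix, key, value):
    """Build an id/value pair query, as used for timeSlot or authorGender.

    Both conditions are joined with AND, mirroring the patterns listed above.
    """
    return f'{prefix}.id.value:"{key}" AND {prefix}.value:"{value}"'


# Field names and example values are copied from the list above.
queries = {
    "language": simple_query("edition.language", "eng"),
    "timeSlot": paired_query("work.temporal", "timeSlot", "T3"),
    "firstEdition": simple_query("work.dateOfCreation.value", "1840"),
    "authorGender": paired_query("work.subject", "authorGender", "male"),
    "size": paired_query("work.subject", "size", "medium"),
    "reprintCount": paired_query("work.subject", "reprintCount", "high"),
}

for name, query in queries.items():
    print(f"{name}: {query}")
```

Keeping the two shapes in separate helpers makes it harder to forget the `id.value` half of a paired query.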

Statistics:

Language	Texts	Words	Male Author	Female Author	Short	Medium	Long	1840-59	1860-79	1880-99	1900-20	Frequent	Rare
cze	16	366626	14	2	16	0	0	5	6	5	0	0	15
deu	98	12086096	65	33	20	37	41	24	24	25	25	46	46
eng	100	11794738	49	51	28	28	44	22	22	30	26	32	68
fra	100	7986274	64	36	30	43	27	25	25	25	25	44	56
gre	11	42524	10	1	11	0	0	0	1	6	4	3	4
hun	100	7591321	85	15	44	33	23	24	24	25	27	41	31
ita	34	3328244	32	2	13	10	11	5	12	10	7	12	0
lit	18	516555	11	7	14	3	1	5	3	2	8	5	13
nor	27	1114092	22	5	18	9	0	2	2	19	4	26	1
por	96	6313980	80	16	39	39	18	11	37	18	30	23	20
rom	70	4205653	58	8	32	26	12	3	14	22	31	23	47
slv	100	5682120	89	11	53	39	8	2	13	36	49	48	52
spa	46	3989071	34	12	14	21	11	10	14	15	7	29	17
srp	53	2253171	46	7	35	18	0	1	5	23	24	15	28
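As a rough illustration of how the statistics table can be read, the sketch below parses three of its rows (cze, eng, fra, copied from the table), computes the mean number of words per text, and reports which breakdowns do not add up to the number of texts. The shortened column names and the helper names are assumptions made for this example. The source does not explain why a breakdown may fall short of the total, so the sketch only reports such gaps.

```python
# Shortened column names (an assumption for this sketch), in table order.
COLUMNS = ["language", "texts", "words", "male", "female",
           "short", "medium", "long",
           "1840-59", "1860-79", "1880-99", "1900-20",
           "frequent", "rare"]

# Three rows copied from the statistics table above.
ROWS = """
cze 16 366626 14 2 16 0 0 5 6 5 0 0 15
eng 100 11794738 49 51 28 28 44 22 22 30 26 32 68
fra 100 7986274 64 36 30 43 27 25 25 25 25 44 56
"""

# Columns that together should describe every text in a subcollection.
GROUPS = {
    "gender": ["male", "female"],
    "size": ["short", "medium", "long"],
    "period": ["1840-59", "1860-79", "1880-99", "1900-20"],
    "reprints": ["frequent", "rare"],
}


def parse(rows):
    """Turn whitespace-separated rows into dictionaries keyed by COLUMNS."""
    records = []
    for line in rows.strip().splitlines():
        cells = line.split()
        record = {"language": cells[0]}
        record.update({name: int(cell) for name, cell in zip(COLUMNS[1:], cells[1:])})
        records.append(record)
    return records


def incomplete_groups(record):
    """Return the breakdowns whose columns do not sum to the text count."""
    return [group for group, cols in GROUPS.items()
            if sum(record[c] for c in cols) != record["texts"]]


records = parse(ROWS)
for record in records:
    mean_words = record["words"] / record["texts"]
    print(record["language"], round(mean_words), incomplete_groups(record))
```

For the cze row, for example, the Frequent and Rare columns sum to 15 while the row lists 16 texts, so the sketch flags the reprint breakdown for that row.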

The ELTeC core contains at least 10 linguistically annotated subcollections, one for each of at least 10 different European languages. Each subcollection comprises 100 novels and is comparable to the others in its internal structure, for a total of at least 1,000 full-text novels. The extended ELTeC takes the total number of full-text novels to at least 2,500. The novel was chosen from among the major literary genres for its availability and size. The chronological limits are due to constraints related to copyright and the availability of quality full texts.

An overview of the current state of ELTeC corpus building can be found here: Distant Reading
Work on the ELTeC collections is in progress here: COST-ELTeC on Github
A collection of relevant documentation can be found here: Distant Reading on Github

Creating such a benchmark corpus required a corpus design that makes texts and individual subcollections comparable according to different sets of metadata. It should be possible to sample subcollections from the ELTeC for specific tasks and research questions, and to reformat them in ways appropriate to one's own tools. The focus of the ELTeC encoding scheme is thus not to represent texts in all their original complexity of structure or appearance, but rather to facilitate a richer and better-informed distant reading than a transcription of their lexical content alone would permit.
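Sampling a subcollection by metadata could look roughly like the following sketch. The records, identifiers, and field names are invented for illustration and do not come from the actual collection. The sketch assumes each novel is available as a flat dictionary of metadata values; `select` and `balanced_sample` are hypothetical helpers.

```python
import random
from collections import defaultdict

# Hypothetical metadata records; identifiers and values are invented.
NOVELS = [
    {"id": "ENG0001", "language": "eng", "authorGender": "female", "size": "long", "timeSlot": "T1"},
    {"id": "ENG0002", "language": "eng", "authorGender": "male", "size": "short", "timeSlot": "T3"},
    {"id": "ENG0003", "language": "eng", "authorGender": "female", "size": "short", "timeSlot": "T3"},
    {"id": "ENG0004", "language": "eng", "authorGender": "male", "size": "medium", "timeSlot": "T4"},
    {"id": "FRA0001", "language": "fra", "authorGender": "male", "size": "short", "timeSlot": "T2"},
    {"id": "FRA0002", "language": "fra", "authorGender": "female", "size": "medium", "timeSlot": "T3"},
]


def select(novels, **criteria):
    """Keep the novels whose metadata match every given key/value pair."""
    return [n for n in novels if all(n.get(k) == v for k, v in criteria.items())]


def balanced_sample(novels, key, per_group, seed=0):
    """Draw up to per_group novels for each distinct value of key.

    A fixed seed keeps the sample reproducible across runs.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for novel in novels:
        groups[novel[key]].append(novel)
    sample = []
    for value in sorted(groups):
        members = groups[value]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


english = select(NOVELS, language="eng")
sample = balanced_sample(english, key="authorGender", per_group=1)
print([novel["id"] for novel in sample])
```

Filtering first and balancing afterwards keeps the two concerns separate, so the same `balanced_sample` call can be reused for size or time slot.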

Citation Suggestion for this Object: TextGrid Repository (2020). README.md. ELTeC Test. ELTeC conversion. https://hdl.handle.net/21.T11991/0000-001A-728A-8
