European Literary Text Collection (ELTeC) in TextGrid Repository (XIII; 2022-08-29)
This is the project site of the European Literary Text Collection (ELTeC) in the TextGrid Repository. The goal of adding the ELTeC to TextGrid Repository is to publish and archive this valuable set of corpora in European languages and to combine them with the technical possibilities that TextGrid Repository offers. Below, we list some of the possibilities that TextGrid Repository offers to researchers and readers who are interested in the ELTeC. Currently, we have imported 15 subcorpora of the ELTeC.
Browsing the ELTeC in TextGrid Repository
Here are some ways to browse the ELTeC in TextGrid Repository:
- Navigate to the texts of one subcorpus of one specific language
- Navigate to all editions of the ELTeC
- Navigate to all objects of the ELTeC corpus (files and metadata of the editions, works and subcorpora)
In all these cases, you can add further filters with the facets on the left.
Subcorpora and Languages
Here are links to the subcorpus for each language:
- Czech Novel Corpus (ELTeC-cze), collected by the Institute of the Czech National Corpus
- German Novel Corpus (ELTeC-deu), collected by Fotis Jannidis, Leonard Konle and Carolin Odebrecht
- English Novel Corpus (ELTeC-eng), collected by Lou Burnard
- French Novel Corpus (ELTeC-fra), collected by Christof Schöch and Lou Burnard
- Hungarian Novel Corpus (ELTeC-hun), collected by Gábor Pálko
- Norwegian Novel Corpus (ELTeC-nor), collected by Michael Preminger and Christian Emil Smith Ore
- Polish Novel Corpus (ELTeC-pol), collected by Joanna Byszuk and Jan Rybicki
- Portuguese Novel Corpus (ELTeC-por), collected by Diana Santos
- Romanian Novel Corpus (ELTeC-rom), collected by Roxana Patras
- Slovenian Novel Corpus (ELTeC-slv), collected by Tomaž Erjavec, Miran Hladnik, Marko Juvan and Katja Mihurko Poniž
- Spanish Novel Corpus (ELTeC-spa), collected by Borja Navarro Colorado
- Serbian Novel Corpus (ELTeC-srp), collected by Cvetana Krstev
- Swedish Novel Corpus (ELTeC-swe), collected by Ljubica Miočević and Cai Alfredson
- Swiss German Novel Corpus (ELTeC-gsw), collected by Giulia Grisot and Berenike Herrmann
- Ukrainian Novel Corpus (ELTeC-ukr), collected by Dmytro Yesypenko and Mykhailo Nazarenko
Filtering through Specific Metadata of the ELTeC (Facets)
Because some specific metadata fields are relevant for the composition of the ELTeC, they have been added to TextGrid Repository as new searchable metadata (facets). For this purpose, the metadata in the TEI files has been incorporated into the TextGrid metadata fields. These facets can be used by selecting them in the menu on the left of the project's results page, or by using them in a query. Here we present some possible queries specific to the ELTeC:
- language
- timeSlot
- firstEdition
- Look for works created in 1840 (work.dateOfCreation.value:1840)
- Look for works created after 1840 (work.dateOfCreation.value:>1840)
- Look for works created before 1910 (work.dateOfCreation.value:<1910)
- Look for works created between 1840 and 1910 (work.dateOfCreation.value:>1840 work.dateOfCreation.value:<1910)
- authorGender
- size
- reprintCount
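The date queries in the list above follow a regular pattern, so they can be assembled programmatically. The following Python sketch is only an illustration: the helper name `date_query` and its parameters are hypothetical and not part of TextGrid Repository. Only the field name `work.dateOfCreation.value` and the `:`, `:>` and `:<` forms are taken from the examples above.

```python
# Hypothetical helper (not a TextGrid API): it only assembles query strings
# in the form shown in the list above.
DATE_FIELD = "work.dateOfCreation.value"


def date_query(exact=None, after=None, before=None, field=DATE_FIELD):
    """Return a query fragment for an exact year or an open/closed year range."""
    if exact is not None:
        return f"{field}:{exact}"
    parts = []
    if after is not None:
        parts.append(f"{field}:>{after}")
    if before is not None:
        parts.append(f"{field}:<{before}")
    if not parts:
        raise ValueError("give at least one of exact, after or before")
    # As in the example above, the two bounds are separated by a space.
    return " ".join(parts)


print(date_query(exact=1840))
print(date_query(after=1840, before=1910))
```

The resulting strings can be pasted into the search field of the repository.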
Of course, queries combining these facets are possible. The facet search can be combined with fulltext queries, such as:
- Look for texts in the French ELTeC corpus which contain the word Paris (Paris edition.language:fra)
- Look for texts in the English ELTeC corpus which contain the word London (London edition.language:eng)
For further information about querying TextGrid Repository, consider the documentation.
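A combined query like the ones above can also be turned into a link, for example to share a search. The Python sketch below makes two assumptions that are not stated in this text and should be checked against the documentation: that the search page lives at `https://textgridrep.org/search`, and that it accepts the query in a URL parameter named `query`. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: base URL and parameter name are not given in the text above.
SEARCH_URL = "https://textgridrep.org/search"
QUERY_PARAM = "query"


def fulltext_in_language(word, language):
    """Combine a full-text term with the edition.language facet."""
    return f"{word} edition.language:{language}"


def search_link(query, base=SEARCH_URL):
    """URL-encode the query string (spaces, colons) and append it to the base URL."""
    return f"{base}?{urlencode({QUERY_PARAM: query})}"


query = fulltext_in_language("Paris", "fra")
print(query)
print(search_link(query))
```

Encoding matters here because the query syntax uses spaces and colons, which are not safe to place in a URL unescaped.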
Basic Classification
The Basic Classification is a library classification system originally developed in the Netherlands and is similar to other systems such as the Dewey Decimal Classification (DDC) or the Regensburger Verbundklassifikation (RVK). It is used in several library networks in the German-speaking world, where it is one of the most widely used classification systems. In contrast to other library classification systems, it has a small number of classes (about 2,000), is freely available and is published as a LOD resource. For more information, see the Wiki of the K10plus or its BARTOC entry.
With the integration of the ELTeC corpora, TextGrid now supports Basic Classification in several ways. First, the classes are displayed in the left menu when selecting a project that has assigned them. Second, they can be used for queries, both simple and complex. For example, we can select a class such as 18.37 Portuguese Literature and use the following query: work.subject.id.value:18.37. Of course, that would be the same as using the metadata about the language. The benefit of the Basic Classification comes with its hierarchical structure, which makes it possible, for example, to query all corpora of a given language group. Here are some examples:
- ELTeC corpora of Germanic languages: work.subject.id.value:18.0* OR work.subject.id.value:18.1*
- ELTeC corpora of Romance languages: work.subject.id.value:18.2* OR work.subject.id.value:18.3*
- ELTeC corpora of Slavic languages: work.subject.id.value:18.5* OR work.subject.id.value:18.6*
- ELTeC corpora of Finno-Ugric languages: work.subject.id.value:18.8*
These queries can be combined with the other possibilities presented above.
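The hierarchical queries above all have the same shape: one or more notation prefixes followed by `*` and joined with `OR`. The Python sketch below reproduces them from a small lookup table. The prefixes are copied from the examples above; the names `LANGUAGE_GROUPS`, `group_query` and `group_of` are hypothetical and serve only as illustration.

```python
# Notation prefixes per language group, copied from the example queries above.
LANGUAGE_GROUPS = {
    "Germanic": ["18.0", "18.1"],
    "Romance": ["18.2", "18.3"],
    "Slavic": ["18.5", "18.6"],
    "Finno-Ugric": ["18.8"],
}

SUBJECT_FIELD = "work.subject.id.value"


def group_query(group, field=SUBJECT_FIELD):
    """Build the wildcard query for one language group."""
    return " OR ".join(f"{field}:{prefix}*" for prefix in LANGUAGE_GROUPS[group])


def group_of(notation):
    """Return the language group whose prefix matches a class notation, if any."""
    for group, prefixes in LANGUAGE_GROUPS.items():
        if notation.startswith(tuple(prefixes)):
            return group
    return None


print(group_query("Romance"))
print(group_of("18.37"))  # 18.37 is Portuguese Literature in the example above
```

The second function mirrors locally what the trailing `*` does in the query: a class such as 18.37 is matched because its notation starts with 18.3.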
Benefits of ELTeC in TextGrid Repository
The ELTeC corpora are already available as GitHub repositories and in Zenodo. So, what is the motivation for also publishing them in TextGrid Repository? In our opinion, TextGrid Repository can offer a series of advantages to the ELTeC and its community of users:
- Long-term archive: TextGrid Repository is a long-term repository that has been awarded the CoreTrustSeal
- Findability through Harvesting: By including the ELTeC editions in TextGrid Repository, these texts can be found on further platforms. Aggregators or registries like re3data, OpenAIRE, VLO (CLARIN Virtual Language Observatory) or DARIAH Collection Registry harvest the information of the TextGrid Repository. The corpora of ELTeC will become more visible and easier to find for interested scholars
- Identification: TextGrid Repository assigns persistent identifiers to all subcorpora, works and editions of the ELTeC
- Integration: in TextGrid Repository, the ELTeC is integrated into one of the largest openly available literary corpora
- Queries using TextGrid metadata: users can query the corpora using the metadata organized in TextGrid's work, edition, and text objects
- Queries using project-specific metadata: users can query the corpora using project-specific metadata; in the case of ELTeC, this could be the gender of the author, the period of publication, or the size of the text
- Queries using library classes: users can query the corpora using the library classification system Basic Classification, as they would do in a classical library catalog, e.g. query only specific language groups
- Full text queries: users can also search for words or phrases in the texts
- Combined queries: users can combine different types of queries into a single complex query
- Combination with other corpora: users can easily combine texts of the ELTeC with other corpora, for example by filtering the entire TextGrid Repository by language or year of publication
- Shelf function: TextGrid Repository offers the shelf function, with which any user can combine texts
- Publication in HTML: in contrast to other platforms, the TEI files are also published as HTML, enabling search engines to find them easily
- Transformation: Besides the HTML format, all texts in TextGrid Repository are automatically transformed into other formats (zip, ePUB, plaintext)
- Analysis: TextGrid allows sending single texts or entire subcorpora to Natural Language Processing tools (via Switchboard) and Digital Humanities tools (Voyant)
- Integration in the NFDI Consortium Text+ Portfolio: TextGrid Repository is one of the services of the Text+ consortium within the German National Research Data Infrastructure (NFDI)
- Integration in future services: TextGrid Repository is being further developed in association with several ongoing projects. Through its integration, the ELTeC will benefit from future features and developments, such as the Python library currently in development
TextGrid Metadata Files
The basic metadata is covered by the TextGrid metadata schema in the Edition and Work metadata; all additional project-specific metadata is covered by the metadata added to the works. Please see the following examples from the Digital Library:
Citation Suggestion
To cite an individual corpus, follow its link above; you will find a citation suggestion at the bottom of its page. To cite all ELTeC subcorpora in TextGrid Repository, we suggest the following reference:
- European Literary Text Collection (ELTeC) in TextGrid Repository (2023). Edited by Carolin Odebrecht, Lou Burnard and Christof Schöch. Version 1.0.0, based on ELTeC release 1.1.0 (April 2021). COST Action Distant Reading for European Literary History (CA16204) & TextGrid Repository. https://sandbox.staging.textgridrep.org/project/TGPR-5d9f2f27-7019-3901-1ab1-630dc237b4df?lang=en.
- Citation suggestion for this object
- TextGrid Repository (2023). README.md. European Literary Text Collection (ELTeC). https://hdl.handle.net/21.T11991/0000-001D-796F-E