GC4 Corpus

The German colossal, cleaned Common Crawl corpus.

This is a German text corpus based on Common Crawl. It has been cleaned and preprocessed and can be used for various NLP tasks, for example the self-supervised training of language models.

GC4 was created by Philipp Reißel from ambeRoad with support from Philip May from Deutsche Telekom. Many thanks to iisys (the Institute of Information Systems at Hof University) for hosting this dataset.

The download links are at the bottom of this page.

The corpus is split into three quality classes. In very simplified terms:

HEAD: High-quality text (e.g. newspapers, government websites)
MIDDLE: More colloquial language, such as forum entries and comment sections
TAIL: The dark side of the Internet

The classification is based on n-gram occurrences compared with the n-grams of the German Wikipedia. In our practical experience this worked quite well.

Necessary Steps before usage

Each tar.gz file contains only a single line (due to a weird MongoDB export format).

To split it up into one JSON document per line, you can run the simple code from this gist.
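The gist itself is not reproduced here. As a rough illustration of the idea only, the following sketch splits one long line into one JSON document per line. It assumes the line holds either a single JSON array or several JSON objects written back to back; the actual export format and the gist's implementation may differ. The function names are hypothetical.

```python
import json


def split_json_line(line: str):
    """Yield every JSON document found in one line of text.

    Assumption: the line is either one JSON array of documents or several
    JSON objects written back to back (optionally separated by whitespace
    or commas). This is not confirmed by the corpus documentation.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(line)
    while pos < end:
        # Skip separators between documents; raw_decode does not do this.
        while pos < end and line[pos] in " \t\r\n,":
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(line, pos)
        if isinstance(obj, list):
            yield from obj  # a JSON array: emit its elements one by one
        else:
            yield obj


def to_json_lines(line: str) -> str:
    """Return the documents of `line` as text with one JSON document per line."""
    return "\n".join(
        json.dumps(doc, ensure_ascii=False) for doc in split_json_line(line)
    )
```

This version keeps the whole line in memory, so for multi-gigabyte archives a streaming approach would be needed.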

We recommend filtering these JSON documents once more to keep only those with ['language_score'] > 0.98. This keeps only websites where German is really dominant, and it also helps to avoid websites with random enumerations or large disclaimers.

After running the gist on every part, you can simply filter the result for string occurrences, e.g. with zcat de_head_0000_2015-48.tar.gz | grep 'Aachen', or do more advanced filtering such as n-gram based approaches.
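A minimal sketch of this filter, assuming the files have already been split into one JSON document per line. The function name and the handling of documents without a score are assumptions made for illustration.

```python
import json

MIN_LANGUAGE_SCORE = 0.98  # threshold recommended above


def filter_by_language_score(lines, threshold=MIN_LANGUAGE_SCORE):
    """Yield the JSON lines whose 'language_score' is strictly above threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # Assumption: documents without a score are dropped.
        if float(doc.get("language_score", 0.0)) > threshold:
            yield line


# Tiny made-up sample; real input would be the lines of a split corpus file.
sample = [
    '{"url": "a", "language_score": 0.99}',
    '{"url": "b", "language_score": 0.90}',
]
kept = list(filter_by_language_score(sample))
```

Because the function accepts any iterable of strings, it can also be fed an open file handle, for example one returned by `gzip.open(path, "rt", encoding="utf-8")`.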

Preprocessing

Preprocessing was done with the cc_net library.

CC Dump	Total Size: German tail + middle + head (before dedup)	CC Original Size (month, TB)
2015-22 (2015-27 not working)	stopped
2015-48	8.9 GB	151
2016-18	17.7 GB
2016-44	119 GB
2017-13	84 GB	March 250
2017-30	~80 GB
2017-39	75.6 GB	September 250
2017-51	~120 GB	December 240
2018-09	128 GB	February 270
2018-17	~120-130 GB	April 230
2018-30	131 GB	July 255
2018-39	~120 GB	September 220
2018-51	~120 GB	December 250
2019-09	119.8 GB	February
2019-18	119 GB	April 198
2019-30	120 GB
2019-47	150.2 GB	November
2020-10	134 GB	February 240
2020-34 / 2020-29	not working (encoding issue, see issue)	August 235

This preprocessing removes duplicates only within the same dump. This step took approx. 50,000 CPU hours and 400 TB of network traffic to the Common Crawl S3 bucket.

Common Crawl News Dataset

You can convert the Common Crawl News shards with a WARC-to-WET library and then feed them to cc_net just like the normal Common Crawl dumps.

The average sizes of the hourly news shards are:

WARC: 1100 MB
WARC -> WET: 124 MB
Filtering WET for German only: 2.6 MB
Head only (highest quality): 0.8 MB

From this one can estimate that all news shards combined amount to roughly one monthly Common Crawl dump. Since large portions of the news articles are already included in the regular dumps, we decided not to include the news dataset because of its low overall quantity.
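The estimate can be checked with a little arithmetic. The sketch below uses the average shard sizes given above and assumes one shard per hour for a full year (24 * 365 shards), which is an assumption made for illustration and not a figure from the dataset authors.

```python
# Average sizes per hourly news shard in MB, taken from the list above.
WARC_MB = 1100
WET_MB = 124
GERMAN_MB = 2.6
HEAD_MB = 0.8

# Assumption: exactly one shard per hour for a full year.
SHARDS_PER_YEAR = 24 * 365


def yearly_gb(mb_per_shard, shards=SHARDS_PER_YEAR):
    """Convert an average per-shard size in MB into GB per year."""
    return mb_per_shard * shards / 1000


german_share = GERMAN_MB / WARC_MB  # fraction of the WARC data that is German text

print(f"German share of WARC data: {german_share:.2%}")
print(f"German text per year: {yearly_gb(GERMAN_MB):.1f} GB")
print(f"Head-quality text per year: {yearly_gb(HEAD_MB):.1f} GB")
```

Under this assumption, German news text comes to roughly 23 GB per year, so it takes several years of news shards to reach the size of a single monthly dump in the table above.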

Clean-Up

To also delete duplicates which occur in more than one Common Crawl dump, we ran a deduplication through MongoDB. The code is available here.

It requires more than 1 TB of RAM and at least 64 CPUs to complete the task in ~3 days.

The tail section is not processed any further, as it consists almost entirely of scam/porn websites.

The table below sums up the sizes of all preprocessed monthly dumps and of two different deduplication approaches:

Deduplication through similar Text:
- Every content field is hashed and entries with the same hash are removed.
- This approach filters out webpages which are copied/mirrored (e.g. there are a lot of Wikipedia mirrors out there).
- The problem is that a URL is still included multiple times when very little on the website has changed (e.g. just the year in the imprint).
Deduplication through similar Text and URLs:
- For stricter filtering, and to avoid having pages with the same URL from different monthly dumps, we additionally filtered out every URL that appeared twice.
- One edge case which we could not solve is URLs where a hashed parameter is appended, but this is pretty rare.
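The two approaches can be illustrated with a small in-memory sketch. This is not the MongoDB-based code used for the corpus. The field names `raw_content` and `url`, the use of SHA-1, and the choice to keep the first occurrence of each duplicate are assumptions made for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the text of a document (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def dedup_by_text(docs):
    """Keep the first document for every distinct content hash."""
    seen_hashes = set()
    for doc in docs:
        digest = content_hash(doc["raw_content"])
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            yield doc


def dedup_by_text_and_url(docs):
    """Deduplicate by text first, then keep only the first document per URL."""
    seen_urls = set()
    for doc in dedup_by_text(docs):
        if doc["url"] not in seen_urls:
            seen_urls.add(doc["url"])
            yield doc


# Made-up documents: a mirror with identical text, and the same URL
# crawled again after a small change to the page.
docs = [
    {"url": "https://example.org/a", "raw_content": "Impressum 2019"},
    {"url": "https://mirror.example.net/a", "raw_content": "Impressum 2019"},
    {"url": "https://example.org/a", "raw_content": "Impressum 2020"},
]
text_only = list(dedup_by_text(docs))
text_and_url = list(dedup_by_text_and_url(docs))
```

In this example, text deduplication removes the mirror but keeps both versions of the same URL, while the combined approach keeps only the first of them.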

Sizes include metadata and refer to the .gz compressed files, so the raw text is approx. 2x the given size.

Approach	Head pages/size	Middle pages/size	Tail pages/size
Original	263 million / 392 GB	332 million / 499 GB	- / 1 TB
Deduplication through similar Text (available on request)	181 million / 278 GB	251 million / not yet exported
Deduplication through similar Text and URLs (this is the one you can download here)	142 million / 181 GB	186 million / 273 GB

Comparison to other very large Datasets based on Common Crawl

C4/mC4 Dataset used by Google:
- The multilingual version is only available on request (see issue here)
- The English-only version is already hosted as requester pays, so downloading it requires approx. $100 of Google Cloud credits
- The way of filtering is different (and in our view inferior to the cc_net library)
  - It does not split by quality like cc_net
  - It filters out every article in which a single swearword occurs, so your models cannot learn at all how to deal with them
CC100 download here
- This data is also preprocessed with cc_net
- But only dumps from January to December 2018 are used, so it lacks size and temporal diversity
OSCAR dataset
- It only uses one monthly dump (November 2018) from Common Crawl
- The filtering for quality seems to be less precise than cc_net

Terms of Use

Since this dataset is based on Common Crawl, we simply refer to their terms of use. Nevertheless, we would like to ask you to publish any work based on it under an open source license.

Download

The corpus is split into multiple files. Below are links to each single archive.

Instead of downloading every archive individually, you can download two files that list all URLs and then use wget to fetch the archives:

wget https://german-nlp-group.github.io/_static/file/gc4_corpus_head_urls.txt
wget -i gc4_corpus_head_urls.txt
wget https://german-nlp-group.github.io/_static/file/gc4_corpus_middle_urls.txt
wget -i gc4_corpus_middle_urls.txt