Potsdam Commentary Corpus

The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus). It is annotated with several types of linguistic information.

The central subcorpus that we are making publicly available consists of 176 MAZ texts, which are annotated with the layers described in the sections below.

The corpus is released under a Creative Commons Attribution-NonCommercial-ShareAlike license and can be freely downloaded here. The publication to cite when using the data is Stede/Neumann 2014.

All the annotation guidelines (in German) have been published as an open access book, which can be found here.

A sample of two commentaries drawn from the corpus can be queried online as part of our ANNIS demo.

Fig. 1. Visualization of several annotation layers of the PCC in ANNIS.

Morphosyntactic and Syntactic Annotations

Fig. 2. Parts of speech and syntax annotations as visualized by TIGERSearch

The entire corpus was semi-manually annotated for constituent syntax in accordance with the specifications of the TIGER corpus using the @nnotate tool (Brants et al. 2004). [Guidelines]


Fig. 3. Coreference annotation with MMAX

The corpus is annotated for nominal and pronominal coreference according to guidelines that build upon the Potsdam Coreference Scheme (PoCoS core scheme, Krasavina & Chiarcos 2007) using the MMAX2 tool (Müller & Strube 2001). Currently, the annotations cover strict coreference (identity) only. Indirect anaphora (bridging) has not been annotated yet.

Discourse Structure and Connectives

The PCC is one of very few corpora with annotations for discourse structure, i.e., the hierarchical and relational structure of entire texts (or other discourse types). The MAZ subcorpus (176 texts) has been annotated in accordance with Rhetorical Structure Theory (RST, Mann & Thompson 1988) using the RSTTool (O'Donnell 2000, Version 3.1).

Connectives are the most important surface signals for RST annotations, but their behavior need not coincide completely with the overall rhetorical structure of a text. We therefore introduced an independent annotation layer for connectives and their scopes (quite similar to the approach of the Penn Discourse Treebank). For semi-automatic connective annotation, we developed ConAno (Stede & Heintze 2004), a tool that identifies potential German connectives in text and also suggests the two arguments of each connective (the suggestions can of course be overridden). The tool is available for download here.
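The first step described above, spotting potential connectives, can be sketched as a simple lexicon lookup. This is only an illustration and not ConAno's actual implementation: the five-word lexicon, the function name and the sample sentence are assumptions made up for this example, and no argument suggestions are attempted.

```python
import re

# Hypothetical mini-lexicon for illustration; a real connective
# lexicon is much larger and also covers multi-word connectives.
CONNECTIVES = {"aber", "weil", "denn", "obwohl", "deshalb"}


def find_connective_candidates(text):
    """Return (start, end, token) for every token found in the lexicon.

    The hits are only candidates: a word such as "denn" can also be
    used in a non-connective reading, so a human has to confirm them.
    """
    candidates = []
    for match in re.finditer(r"\w+", text):
        token = match.group(0)
        if token.lower() in CONNECTIVES:
            candidates.append((match.start(), match.end(), token))
    return candidates


# Invented sample text, not taken from the PCC.
sample = "Die Stadt spart, weil das Geld fehlt. Aber die Bürger protestieren."
candidates = find_connective_candidates(sample)
for start, end, token in candidates:
    print(start, end, token)
```

Returning character offsets rather than bare tokens keeps each candidate tied to its position in the text, which is what an annotator needs when confirming or rejecting a suggestion.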

Fig. 4. RST annotation with the RSTTool.


The annotations of the Potsdam Commentary Corpus are provided in their respective source formats:

Annotation Layer | Formats | Tool
Parts of Speech, Morphology, Syntax | TIGER XML, NEGRA export format | TIGERSearch
Coreference | MMAX2 | MMAX2
Connectives | inline XML | ConAno
Rhetorical Structure Theory | RS3 | RSTTool
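
To give an impression of what one of these source formats looks like to a program, the following minimal sketch reads a sentence in TIGER XML with Python's standard library. The embedded sentence is hand-written for this example and is not taken from the PCC. The element and attribute names follow the general TIGER XML layout (s, graph, terminals, t, nonterminals, nt, edge), but the exact set of token attributes in the PCC files is an assumption here.

```python
import xml.etree.ElementTree as ET

# Hand-written toy sentence in TIGER XML; not taken from the PCC.
SAMPLE = """<corpus id="toy">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Die" pos="ART"/>
          <t id="s1_2" word="Stadt" pos="NN"/>
          <t id="s1_3" word="spart" pos="VVFIN"/>
        </terminals>
        <nonterminals>
          <nt id="s1_501" cat="NP">
            <edge label="NK" idref="s1_1"/>
            <edge label="NK" idref="s1_2"/>
          </nt>
          <nt id="s1_500" cat="S">
            <edge label="SB" idref="s1_501"/>
            <edge label="HD" idref="s1_3"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def read_sentence(s_elem):
    """Extract root id, tokens, phrase categories and edges of one <s>."""
    graph = s_elem.find("graph")
    # Terminals are the tokens, with word form and part-of-speech tag.
    tokens = [(t.get("id"), t.get("word"), t.get("pos"))
              for t in graph.iter("t")]
    # Nonterminals are phrase nodes; each one lists its children as edges.
    cats = {nt.get("id"): nt.get("cat") for nt in graph.iter("nt")}
    edges = [(nt.get("id"), e.get("idref"), e.get("label"))
             for nt in graph.iter("nt") for e in nt.findall("edge")]
    return graph.get("root"), tokens, cats, edges


corpus = ET.fromstring(SAMPLE)
root_id, tokens, cats, edges = read_sentence(corpus.find("body/s"))
print(root_id, [word for _, word, _ in tokens])
```

The edge list (parent id, child id, function label) is already a graph in disguise, which is why the syntax layer combines easily with other layers that point at the same tokens.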

For programmatic access to the corpus, we developed discoursegraphs, a graph-based converter and merging library. It parses all the annotation formats used in the PCC and merges them into a single NetworkX-based graph representation. The graph can either be queried directly or exported to various generic graph formats (Neo4j, DOT, GEXF, GML, GraphML).
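
The idea of merging several annotation layers over shared token nodes can be sketched as follows. This is not the discoursegraphs API: the LayerGraph class, its methods and the toy node ids are all hypothetical, and plain dictionaries stand in for NetworkX so that the example runs with the standard library alone.

```python
class LayerGraph:
    """Tiny directed multigraph; every edge records its annotation layer."""

    def __init__(self):
        self.nodes = {}   # node id -> attribute dict
        self.edges = []   # (source, target, layer, label)

    def add_node(self, node_id, **attrs):
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_edge(self, source, target, layer, label=""):
        self.add_node(source)
        self.add_node(target)
        self.edges.append((source, target, layer, label))

    def merge(self, other):
        """Add another layer; nodes with the same id are unified."""
        for node_id, attrs in other.nodes.items():
            self.add_node(node_id, **attrs)
        self.edges.extend(other.edges)

    def to_dot(self):
        """Serialize to Graphviz DOT (toy version, no escaping of quotes)."""
        lines = ["digraph pcc {"]
        for node_id, attrs in sorted(self.nodes.items()):
            label = attrs.get("word", attrs.get("cat", node_id))
            lines.append('  "{}" [label="{}"];'.format(node_id, label))
        for source, target, layer, label in self.edges:
            lines.append('  "{}" -> "{}" [label="{}:{}"];'.format(
                source, target, layer, label))
        lines.append("}")
        return "\n".join(lines)


# Toy syntax layer: an NP dominating the first two tokens.
syntax = LayerGraph()
for tok_id, word in [("t1", "Die"), ("t2", "Stadt"), ("t3", "spart")]:
    syntax.add_node(tok_id, word=word)
syntax.add_node("np1", cat="NP")
syntax.add_edge("np1", "t1", "syntax", "NK")
syntax.add_edge("np1", "t2", "syntax", "NK")

# Toy coreference layer: a markable spanning the same two tokens.
coref = LayerGraph()
coref.add_node("m1", cat="markable")
coref.add_edge("m1", "t1", "coref", "spans")
coref.add_edge("m1", "t2", "coref", "spans")

merged = LayerGraph()
merged.merge(syntax)
merged.merge(coref)
dot = merged.to_dot()
print(dot)
```

Because both layers refer to tokens by the same ids, merging reduces to a union of nodes and edges. Tagging each edge with its layer keeps the merged graph queryable per layer, for example by filtering merged.edges on the third field.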


