ParlaCAP project image

Science cluster

SSHOC - Social Sciences and Humanities

Summary

The ParlaCAP project leverages advanced natural language processing to analyse political agendas and sentiments in debates from 27 European national parliaments. Automatically coding agendas across a dataset of more than 7 million speeches, given in more than 20 languages, has only recently become possible thanks to significant developments in natural language processing and artificial intelligence: multilingual transformer models can now provide codings that are both highly consistent and accurate. By integrating the ParlaMint dataset and the Comparative Agendas Project's coding scheme, the project will create a comprehensive, FAIR dataset for comparative political research, enhancing transparency and accountability in legislative discourse across Europe.

Research domains:
Social Science and Humanities
Partner(s):
Jožef Stefan Institute, Institute of Contemporary History, University of Zagreb, Bulgarian Academy of Sciences, Polish Academy of Sciences
Project team member(s):
Tomaž Erjavec, Taja Kuzman, Peter Rupnik, Katja Meden, Jure Skubic, Anna Kryvenko, Daniela Širinić, Petya Osenova, Maciej Ogrodniczuk, Łukasz Kobyliński.
External collaborators: Michal Mochtak (Radboud University), Matyáš Kopp (Institute of Formal and Applied Linguistics)

Challenge

Open Science project, Open Science Service, Cross-domain/Cross-RI

Parliaments are the cornerstone of democracy in Europe, ensuring the political representation of citizens. Despite the empirical relevance of parliaments, parliamentary studies have often limited their scope to a single parliamentary body or a small group of parliaments analysed in a comparative perspective.

The main challenge of the ParlaCAP project is to bridge the gap between existing parliamentary research data and their use in political science research by integrating two key international and cross-disciplinary initiatives: the CLARIN ERIC ParlaMint project, which provides texts of parliamentary debates from 27 European national parliaments, and the Comparative Agendas Project (CAP), which offers a coding schema of 21 topics for tracking political agendas in parliamentary proceedings.

Solution

The project will employ the Comparative Agendas Project's text-as-data methodology to analyse parliamentary debates from all 27 parliaments, comprising more than 7 million speeches given in more than 20 languages, by automatically coding the agenda of each speech and transforming the ParlaMint corpora into a structured, tabular dataset, available for complete download through CESSDA ERIC.

The project also aims to code each speech with the sentiment expressed, and to cross-reference the data with the PartyFacts database of political party metadata and the V-Dem surveys on the state of democracies. This enriched and fully FAIR dataset, suitable for quantitative research, will make it possible to gain a comprehensive understanding of how political attention is distributed across policy areas, by analysing topic and sentiment coding over an unprecedented number of parliaments. The dataset will be available through RIs such as CESSDA, CLARIN, and DARIAH, along with a graphical user interface and an API for broader accessibility.
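As an illustration of the kind of analysis a speech-level table enables, the following minimal sketch computes the share of speeches per topic within each parliament. The column names (`parliament`, `cap_topic`, `sentiment`) and the sample rows are assumptions made for this example, not the actual ParlaCAP schema.

```python
from collections import Counter, defaultdict

# Hypothetical speech-level rows; real column names and values may differ.
speeches = [
    {"parliament": "SI", "cap_topic": "Health", "sentiment": "Negative"},
    {"parliament": "SI", "cap_topic": "Health", "sentiment": "Neutral"},
    {"parliament": "SI", "cap_topic": "Education", "sentiment": "Positive"},
    {"parliament": "HR", "cap_topic": "Macroeconomics", "sentiment": "Negative"},
    {"parliament": "HR", "cap_topic": "Health", "sentiment": "Neutral"},
]


def topic_shares(rows):
    """Return {parliament: {topic: share of that parliament's speeches}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["parliament"]][row["cap_topic"]] += 1
    shares = {}
    for parliament, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[parliament] = {t: n / total for t, n in topic_counts.items()}
    return shares


shares = topic_shares(speeches)
for parliament, dist in sorted(shares.items()):
    print(parliament, {t: round(s, 2) for t, s in dist.items()})
```

Normalising within each parliament, rather than over the whole table, keeps parliaments with very different numbers of speeches comparable.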

Scientific Impact

ParlaCAP will revolutionise comparative parliamentary studies by providing a robust dataset for tracking political agenda-setting across European parliaments. Its open, FAIR data management approach will support a wide range of RIs and projects in social sciences, while promoting transparency in political discourse and accountability of legislative bodies. The findings will have societal relevance, fostering collaboration in political science and beyond.

Moreover, the engagement activities foreseen within the project will provide services and accompanying tutorials to raise awareness in the political studies community and ensure that the project's results are FAIR for further application and research by scientists across various Social Sciences and Humanities (SSH) domains.

Open science added value

The new FAIR dataset will feature speech-level metadata on democracies, parties, speakers, topics, and sentiment, accompanied by both original and translated text of the debates as supporting information. By providing structured data, it will be possible to better serve the needs of the CESSDA ERIC infrastructure, the CAP infrastructure on agenda setting in political discourse, the MEDEM infrastructure on monitoring electoral democracies, and all RIs and research agendas interested in parliamentary debates that rely primarily on structured data analysis.

Results

  • ParlaCAP dataset v1.0: The ParlaCAP dataset consists of 8 million speeches from 28 European national and regional parliaments. Each speech is coded with the sentiment expressed (ParlaSent coding, ranging from negative through neutral to positive) and the topic discussed (Comparative Agendas Project (CAP) coding with 21 topics), and comes with rich metadata on the speakers, parties and democracies. The ParlaCAP dataset extends the ParlaMint 5.0 dataset by automatically coding topics and sentiment for each speech and simplifying the data to a tabular form. Repository · More Information (Paper in Development).
  • ParlaCAP transformer model: The ParlaCAP model is a multilingual text classification model that assigns topic categories to parliamentary speeches according to the CAP (Comparative Agendas Project) schema. The model was built by fine-tuning the XLM-R-Parla model on GPT-4o–annotated debates from multiple European parliaments. It achieves macro-F1 around 0.65–0.72 across English and South Slavic test sets. Model on Hugging Face | Guide for Automatic CAP Annotation of ParlaMint Data
  • ParlaCAP training dataset: The training dataset for the ParlaCAP topic classifier. The dataset comprises approximately 36,000 parliamentary speeches in 29 languages from the ParlaMint 4.1 corpus collection, annotated with the CAP topic labels by the GPT-4o model following the LLM Teacher-Student Framework for development of training data and BERT-like classifiers without manually-annotated data. (TBA) Repository · More Information
  • ParlaMint corpus v5.0: ParlaMint 5.0 is a collection of comparable corpora of parliamentary debates from 29 European countries and regions. The corpora are richly annotated with metadata on speakers and parties, and automatically assigned CAP top-level topics and sentiment information. While ParlaMint 5.0 and ParlaCAP 1.0 provide the same underlying data, ParlaMint is distributed in formats primarily intended for corpus linguistics research, whereas ParlaCAP is provided in simplified, analysis-ready formats that are more accessible to social scientists and other digital humanities researchers. Repository · Concordancer · More Information
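For readers unfamiliar with the macro-F1 metric reported for the topic classifier, the following minimal sketch shows how it is computed: F1 is calculated separately for each label and then averaged without weighting, so rare topics count as much as frequent ones. The gold and predicted labels below are invented for illustration and are not project data.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over every label seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented example labels (not project data).
gold = ["Health", "Health", "Education", "Other", "Other", "Other"]
pred = ["Health", "Education", "Education", "Other", "Other", "Health"]

score = macro_f1(gold, pred)
print(round(score, 3))
```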

Data

  • ParlaCAP fine-tuning data - JSONL · 35,579 speeches - The training dataset for the ParlaCAP topic classifier. The dataset comprises approximately 36,000 parliamentary speeches in 29 languages from the ParlaMint 4.1 corpus collection, annotated with the CAP topic labels by the GPT-4o model following the LLM Teacher-Student Framework for development of training data and BERT-like classifiers without manually-annotated data.
    (TBA) Repository · More Information
  • ParlaCAP test data - JSONL · 3,443 speeches - The ParlaCAP test datasets comprise parliamentary speeches in Bosnian, Croatian, English, and Serbian, sourced from the ParlaMint 4.1 dataset and manually annotated by a single expert annotator using the 21 CAP categories from the official CAP schema, along with an additional Other label. The datasets are approximately balanced across labels and languages, with approximately 800 instances per language. To prevent large language models from incorporating the test datasets during their training phase, the test datasets are not publicly available. However, we are happy to share them with interested researchers - contact us to be granted access to the datasets.
    Evaluation Dashboard · More Information
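Since the fine-tuning and test data are distributed as JSONL (one JSON object per line), a minimal sketch of reading such a file and inspecting its label and language distribution might look as follows. The field names `lang`, `text` and `label` are hypothetical, and an in-memory string stands in for the actual file.

```python
import io
import json
from collections import Counter

# Stand-in for a JSONL file; the field names are hypothetical.
sample_jsonl = io.StringIO(
    '{"lang": "hr", "text": "...", "label": "Health"}\n'
    '{"lang": "en", "text": "...", "label": "Education"}\n'
    "\n"
    '{"lang": "sr", "text": "...", "label": "Health"}\n'
)


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_jsonl(sample_jsonl))
label_counts = Counter(r["label"] for r in records)
lang_counts = Counter(r["lang"] for r in records)
print(label_counts.most_common())
print(lang_counts.most_common())
```

With a real file, `sample_jsonl` would be replaced by a handle from `open(path, encoding="utf-8")`.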

Models

Multilingual models fine-tuned on the tasks of CAP topic schema classification and sentiment identification.

  • ParlaSent model - The ParlaSent model is a multilingual transformer model for sentiment analysis in parliamentary speeches. The model was developed by fine-tuning the XLM-R-Parla model on the ParlaSent dataset, a manually annotated selection of sentences from parliamentary proceedings from Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom. The model achieves high accuracy, with a mean absolute error of 0.68–0.71 in the regression scenario, and macro-F1 scores of 0.70–0.73 when its outputs are mapped to the three sentiment categories (Positive, Neutral, Negative). Model on Hugging Face | More Information
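The two evaluation settings mentioned above can be sketched as follows: a mean absolute error computed on continuous scores, and a mapping from continuous scores to the three sentiment categories. The score values and the cut-offs (`low=2.0`, `high=4.0`) are illustrative assumptions, not the scale or thresholds used by the ParlaSent authors.

```python
def to_category(score, low=2.0, high=4.0):
    """Map a continuous sentiment score to one of three categories.

    The cut-offs `low` and `high` are illustrative assumptions, not the
    thresholds used by the ParlaSent authors.
    """
    if score < low:
        return "Negative"
    if score > high:
        return "Positive"
    return "Neutral"


def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted scores."""
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


# Invented scores for illustration only.
gold_scores = [0.5, 3.0, 4.5, 2.0]
predicted_scores = [1.0, 2.5, 4.0, 3.5]

mae = mean_absolute_error(gold_scores, predicted_scores)
categories = [to_category(s) for s in predicted_scores]
print(mae, categories)
```

Once scores are mapped to categories, the classification setting can be evaluated with macro-F1 in the usual way.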

Tutorials

Step-by-step guides for using the ParlaCAP dataset.

  • Parliamentary Speech Analysis with ParlaCAP - Python · Jupyter notebooks - Multiple tutorials for analyzing parliamentary speeches across multiple European countries using the Python programming language. The five tutorial notebooks combine processing of ParlaMint data, sentiment analysis, party comparisons and cross-country analyses, enabling students and researchers to study the tone and content of parliamentary debates systematically. View tutorial
  • Parliamentary Speech Analysis with ParlaCAP - R · Jupyter notebook - Multiple tutorials for analyzing parliamentary speeches across multiple European countries using the R programming language. TBA

Publications

  • Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić - State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?, Conference Paper, November 2025, arXiv; submitted to LREC 2026
  • Michal Mochtak, Peter Rupnik, Taja Kuzman, Nikola Ljubešić, Parlasent: mapping sentiment in political discourse with large language models, Research Note, Political Research Exchange, 7(1), June 2025
  • Nikola Ljubešić, Taja Kuzman Pungeršek, Daniela Širinić, ParlaCAP: Comparing Agenda-setting across Parliaments via the ParlaMint dataset, Conference Paper, Annual Conference of the Comparative Agendas Project (CAP) 2025, June 2025

Events

  • 11 June 2025 | Konstanz, Germany - CAP 2025 conference. Presentation: Nikola Ljubešić, Daniela Širinić: ParlaCAP - Comparing Agenda-Setting across Parliaments via the ParlaMint Dataset
  • 14 July 2025 | Lisbon, Portugal - Digital Humanities Conference 2025. Workshop: Darja Fišer, Anna Kryvenko, Kristina Pahor de Maiti Tekavčič: From the Dispatch Box - Unlocking Topics and Sentiments in Multilingual ParlaMint Corpora | Recording
  • 01 October 2025 | Vienna, Austria - CLARIN ERIC 2025 annual conference. Poster: Nikola Ljubešić, Taja Kuzman Pungeršek: ParlaCAP - Mining the ParlaMint Treasures with Multilingual Topic and Sentiment Classification
  • 20 November 2025 | Utrecht, The Netherlands - CDH/CLARIN Workshop in the CDH Training Programme. Training: Anna Kryvenko, Kristina Pahor de Maiti Tekavčič: ParlaMint – An introduction to Multilingual Parliamentary Data | Materials
  • 20 November 2025 | Ljubljana, Slovenia - ARNES Open Science conference. Presentation: Taja Kuzman Pungeršek: ParlaCAP - Primerjava tematskih prioritet v parlamentih na podlagi korpusov ParlaMint | Recording

Promotional material

  • POSTER | ParlaCAP - Mining the ParlaMint Treasures with Multilingual Topic and Sentiment Classification

 

Browse the ParlaCAP project's website to find out more about the project's results, publications and events.

 


Keywords
natural language processing, parliamentary research data, ParlaMint, political science, AI, artificial intelligence
Project start date:
Project duration:
24 months

Principal investigator

Nikola Ljubešić
Jožef Stefan Institute
BIO

Nikola Ljubešić is a senior researcher at the Jožef Stefan Institute in Ljubljana. He is also affiliated with the Faculty of Computer and Information Science of the University of Ljubljana, and the Institute of Contemporary History in Ljubljana. His research interests lie in the areas of natural language processing, computational linguistics and computational social science, with a strong focus on the South Slavic linguistic and cultural area.

QUOTE
"Open science is a snowball effect in itself. Our ParlaCAP project would not have been possible without the upstream FAIR project ParlaMint, whose results this project will, inter alia, make significantly more useful for social science research. This snowball is picking up in both pace and size!"