A few months ago we started a project to convert speeches from the Macedonian Parliament (available as text transcripts) into a format that a general audience can understand. After months of hard work, on the 4th of November we reached the point where we could present the fruits of our labor on VoxPolitico.org.
The VoxPolitico Homepage
The website combines text-mining algorithms with visualization techniques in order to convert every speech in the Macedonian Parliament into a set of visual cues. The goal is for non-experts to be able to understand what the legislature was focused on in a given period of time and how individual politicians “changed their stripes” (or not) as the political winds in the country shifted. Users can now see what has been said in every session of the Macedonian Parliament going back nearly twenty years. They can also analyze trends and identify “hot” topics raised by lawmakers in speeches and debates. Fundamentally, what we’re trying to do is demystify the often-lofty political rhetoric that dominates the standard vocabulary of the Macedonian (and other) legislatures.
Methodology
VoxPolitico consists of two main parts: the web crawler and the website.
We have implemented a web crawler that runs in the background and continuously monitors the official website of the Macedonian Parliament. Whenever a new transcript of a parliamentary session is published, the crawler downloads the file and processes it. Meaningful information extracted from these transcripts is organized and stored in two different database management systems (a relational database and a NoSQL database).
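To make the crawler's monitoring loop concrete, here is a minimal sketch of a single polling pass. It is an illustration only, not our production code: the function names `find_new_transcripts` and `poll_once`, and the idea of passing in the listing and processing steps as callables, are assumptions made for this example.

```python
from typing import Callable, Iterable, List, Set


def find_new_transcripts(listed_urls: Iterable[str], seen: Set[str]) -> List[str]:
    """Return the URLs that have not been processed yet, in listing order."""
    new: List[str] = []
    for url in listed_urls:
        # Skip anything already processed, and ignore duplicates in the listing.
        if url not in seen and url not in new:
            new.append(url)
    return new


def poll_once(
    list_transcripts: Callable[[], Iterable[str]],
    process: Callable[[str], None],
    seen: Set[str],
) -> List[str]:
    """One crawler pass: list transcripts, process the new ones, remember them.

    `list_transcripts` and `process` are hypothetical stand-ins for the steps
    that read the parliament website and that download, parse, and store a file.
    """
    new_urls = find_new_transcripts(list_transcripts(), seen)
    for url in new_urls:
        process(url)
        seen.add(url)
    return new_urls
```

A background job would call `poll_once` on a schedule with a persistent `seen` set, so that each transcript is processed exactly once.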
The information that we are extracting is statistical in nature: descriptive statistics, relationships, rules, and time series data. This kind of information is presented to the general audience using visual cues in the form of graphs and shapes on the user-friendly front-end website. To build the front-end we implemented a feedback-driven development process. Prospective users evaluated all our visuals; their feedback was used to alter our presentation methods to better suit their needs. For this purpose we presented the early system to large audiences at workshops, seminars, and international conferences, all the while gathering feedback to improve the platform.
What Information The System Presents
Our aim was to develop a system that would promote transparency in the Macedonian political ecosystem. We envisioned VoxPolitico providing the following information to the general audience:
- General statistics that summarize each parliamentary session of the Macedonian parliament. This way we can present the most popular topic being discussed, the phrases used most often, as well as the most active and passive members of the legislature (according to volume of speech).
- Trends for every speech. We have collected information such as the exact date of the speech, the name of the representative giving the speech, word and phrase frequency, etc. This kind of information allows us to generate trends in the form of “hot” topics being invoked or debated in given periods of time in Parliament. Additionally, one can run comparative statistics across a set of topics (e.g. looking for trends involving more than one word).
- Similarities between members of the legislature. A very interesting feature of our system is the ability to evaluate the similarity between representatives according to their political speech in Parliament. By applying cosine similarity against document vectors, we can determine the similarity between two politicians based on the speeches they have delivered in Parliament during their career.
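The similarity measure mentioned in the last item can be sketched in a few lines. This is a minimal illustration, assuming each politician's speeches are reduced to a plain bag-of-words count vector; the helper names `term_vector` and `cosine_similarity` are made up for this example, and the vectors used on the site may be weighted differently.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Build a simple word-count vector from whitespace-separated text."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two politicians who use the same words in the same proportions score close to 1, while two who share no vocabulary score 0. Because the measure depends on the angle between the vectors and not their length, a representative who speaks often is not automatically "more similar" to everyone else.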
What We Do Not Show
When calculating the statistical properties of the downloaded documents, we realized that the statistics were skewed towards some frequently occurring words mentioned in Parliamentary debate. For example, the phrase “thank you” appears in every speech and thus has a very high frequency, but it is meaningless for our purposes. To overcome this issue we adopted a two-step approach:
- We filtered out low-importance words using the well-known Term Frequency – Inverse Document Frequency (tf-idf) weighting; and
- We created a blacklist of words that are simply not processed by VoxPolitico; this helps us eliminate words that we deem unimportant or that our users have marked as unimportant.
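The two steps above can be sketched together as follows. This is a minimal illustration using one common tf-idf variant (term frequency normalized by document length, and idf = log(N / df)); the function name `tf_idf`, the tokenized input format, and the sample blacklist are assumptions for the example, not a description of the exact formula VoxPolitico uses.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List


def tf_idf(
    documents: List[List[str]],
    blacklist: FrozenSet[str] = frozenset(),
) -> List[Dict[str, float]]:
    """Return one {term: score} dict per document, skipping blacklisted words."""
    # Step 2 from the list above: blacklisted words are never processed.
    docs = [[t for t in doc if t not in blacklist] for doc in documents]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Step 1: a term that appears in every document gets log(1) = 0.
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores
```

With this weighting, a word such as "thank" that occurs in every speech scores exactly 0 and can be dropped by a simple threshold, while the blacklist handles words that are frequent but not quite universal.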
As previously mentioned, we built the system to increase accountability and transparency in the Macedonian Parliament. During some of the early public presentations we organized to share the platform, it was no surprise that people wanted to know whether VoxPolitico would also process and visualize the voting records of representatives alongside their speeches. Processing voting records would indeed be an enhancement to our system, but it would require a separate major effort: parliamentary voting records in Macedonia are maintained only in hard copy, by both the Parliament’s archive and the national library.
How Much Data is There?
We have processed data going back to the first parliamentary session of the independent Macedonian state in 1991; the first document recorded in VoxPolitico dates to January 8th, 1991. To date, more than 20,000 speeches have been retrieved, containing more than 100,000 distinct words. Some of VoxPolitico’s tables containing the statistical properties of our data have up to 20 million records.
One mistake we made early on was to overlook the challenges of big data. We believed that a standard relational database would suffice to store all the information. We were quickly proven wrong; the website would take forever to display the information we had stored. Therefore, a whole redesign of the platform was needed, and we substituted non-relational databases for the standard relational one. A separate technical post will cover these issues and how we overcame them.
What’s Next?
One of our major goals was to create an architecture and a platform that could be used by anyone, anywhere, to generate similar insights for any legislature in the world. With that in mind, our next task is to publish VoxPolitico under a suitable open source license and to make it available to anyone. For this to work we will be writing technical documentation and eventually uploading the final software package to an open source software repository. In the meantime, we are excited that non-governmental organizations from neighboring countries have already reached out to us to express their interest in implementing VoxPolitico in their countries. We are already working with them to support the setup of the system, assisting with the implementation of custom parsers to scrape their legislatures’ websites, and offering limited hosting infrastructure. We’re also busy working to localize the system in other languages, starting with Albanian (the second official language in Macedonia).
— Visar Shehu, South East European University