Abgeordle is a sleek, straightforward quiz designed to help you learn the members of the German Bundestag and their party affiliations. For this project I used the awesome abgeordnetenwatch.de API and combined it with additional data scraped from the Bundestag website, all wrapped into a user-friendly web app.
Go try it now 🙂 https://abgeordle.jakobpara.com/
Technologies & Languages: web bot (Selenium), web scraping (BeautifulSoup4), web development, API handling, Python, HTML/CSS, JS
Interesting:
You can find the Bundestag members of the current election period at https://www.bundestag.de/abgeordnete. Unfortunately, scraping the portrait images is not trivial: the gallery view only shows 12 MPs at a time, so you would have to dynamically load the individual gallery pages one by one to scrape the images. Fortunately, there is also a list view. BeautifulSoup4 alone didn't help me here, though, because the list view is also loaded dynamically and is not present in the initial DOM. For this reason, I used Selenium to write a simple bot that opens the page like a human user and thus clears the way to the object of my desire: the HTML with the links to the individual politician profiles.
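The approach can be sketched roughly as follows. The CSS selector, the wait condition, and the helper names are my assumptions for illustration, not the exact code of the project; the Selenium import lives inside the fetch function so the link-extraction helper can be used without a browser installed:

```python
from bs4 import BeautifulSoup


def extract_profile_links(html: str) -> list[str]:
    """Pull the /abgeordnete/biografien/... profile URLs out of the rendered list view."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # assumed selector: any anchor pointing into the biography section
    for a in soup.select("a[href*='/abgeordnete/biografien/']"):
        href = a["href"]
        if href.startswith("/"):
            href = "https://www.bundestag.de" + href
        links.append(href)
    return links


def fetch_list_view_html(url: str = "https://www.bundestag.de/abgeordnete") -> str:
    """Open the page in a real (headless) browser so the dynamically loaded list appears."""
    # imported here so extract_profile_links stays testable without Selenium/Chrome
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # wait until at least one biography link has been injected into the DOM
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "a[href*='/abgeordnete/biografien/']")
            )
        )
        return driver.page_source
    finally:
        driver.quit()
```

Splitting fetching from parsing keeps the Selenium part minimal: the bot's only job is to render the page, after which plain BeautifulSoup does the rest.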
I had to take this laborious route in the first place because the URLs to the individual MP websites contain not just the name but also an ID, which I couldn’t find in any directory.
E.g. https://www.bundestag.de/abgeordnete/biografien/A/abel_valentin-860100
The data I got from abgeordnetenwatch did include an "external_id" value, which unfortunately didn't match the Bundestag member ID.
The actual scraping of the portraits was then straightforward. However, I had to send over 756 requests to the Bundestag servers. Apparently they don't think that's so cool when it happens in a short time (ok, I get that :D). After first experimenting with rotating proxies, I finally decided to just add an error-handling function that triggers a cool-down period as soon as the servers reject my requests.
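A minimal sketch of that cool-down logic might look like this. The function name, retry count, and cool-down length are hypothetical; the real values are whatever the server tolerates:

```python
import time

import requests


def download_with_cooldown(url: str, max_retries: int = 5, cooldown: float = 60.0) -> bytes:
    """Fetch one portrait, pausing whenever the server rejects the request.

    A non-200 response (e.g. 429 or 403) is treated as "slow down": sleep,
    then retry, with the pause growing on each consecutive failure.
    """
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code == 200:
            return resp.content
        # rejected: wait longer each time before trying again
        time.sleep(cooldown * (attempt + 1))
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Compared with rotating proxies, this stays polite to the server and needs no extra infrastructure; the scrape just takes longer.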
Since I had already spent a lot of time merging various JSON files and didn't want to store even more data locally, I decided to query the individual politician data from abgeordnetenwatch in real time. This also means the app always reflects the current state of the data.
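A live lookup against the public abgeordnetenwatch v2 API can be sketched like this. The endpoint path and the "data" wrapper follow my reading of the API; the function name is mine:

```python
import requests

# public abgeordnetenwatch API (v2)
API_BASE = "https://www.abgeordnetenwatch.de/api/v2"


def get_politician(politician_id: int) -> dict:
    """Fetch one politician record live instead of keeping a local copy."""
    resp = requests.get(f"{API_BASE}/politicians/{politician_id}", timeout=10)
    resp.raise_for_status()
    # the v2 API wraps the actual payload in a "data" field
    return resp.json()["data"]
```

The trade-off: each quiz round costs an API round trip, but there is no stale local mirror to maintain.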