Houston, we have a problem. Every time I see the ticket price to go from Madrid to Sevilla or Ponferrada or… my god, any f***ing part of Spain, I cannot say anything but WTF?! Eating pasta for half of the month then becomes a reality if you really want to travel… And when I say travel, I mean travel fast.
The alternatives to the train are not very encouraging. For example, if you want to go from Madrid to Sevilla you can take a 6-hour bus or share a car (5 hours). This situation led us to ask: “How can we travel fast and cheap?”
We knew from previous experience that train prices are not fixed: they change depending on demand and the departure date. So it sounded good to us to start scraping renfe.com (the website of Renfe, the Spanish rail operator) and see how the prices actually change and, of course, how we could benefit from it.
So, let’s dive into this (at least technically speaking). We built a scraper with Selenium and Firefox running in headless mode. For those who are not very into web scraping, it is, to summarize it in one sentence, a kind of web browser automation. To parse the prices from the HTML we used BeautifulSoup. We save all this data into a Postgres database, and the orchestration of every scraping process is managed by Apache Airflow. Needless to say, everything runs on our self-hosted environment which, as any good Guru hosting should, runs on Ubuntu.
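To give a feeling for what the parsing step looks like, here is a minimal sketch. It is not our actual scraper: we used BeautifulSoup, while this example sticks to Python's built-in `html.parser` so that it runs without installing anything. The markup, the `departure` and `price` class names, and the helper names `parse_price` and `extract_prices` are all made up for illustration, and the real Renfe pages look different.

```python
from decimal import Decimal
from html.parser import HTMLParser

# Invented sample markup; the real results page is structured differently.
SAMPLE_HTML = """
<table>
  <tr><td class="departure">07:00</td><td class="price">45,60 €</td></tr>
  <tr><td class="departure">09:30</td><td class="price">72,10 €</td></tr>
</table>
"""


def parse_price(text):
    """Turn a Spanish-formatted price such as '1.045,60 €' into a Decimal."""
    cleaned = text.replace("€", "").replace("\xa0", "").strip()
    # Drop thousands separators, then switch the decimal comma to a dot.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


class PriceParser(HTMLParser):
    """Collect (departure, price) pairs from the hypothetical table above."""

    def __init__(self):
        super().__init__()
        self._field = None  # class of the <td> we are currently inside
        self._row = {}      # fields collected for the current <tr>
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            cls = dict(attrs).get("class")
            if cls in ("departure", "price"):
                self._field = cls

    def handle_data(self, data):
        if self._field:
            self._row[self._field] = self._row.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr":
            # Keep the row only if both fields were found.
            if {"departure", "price"} <= self._row.keys():
                self.rows.append(
                    (self._row["departure"], parse_price(self._row["price"]))
                )
            self._row = {}


def extract_prices(html):
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for departure, price in extract_prices(SAMPLE_HTML):
        print(departure, price)
```

In a setup like ours, the HTML string would come from the page that Selenium has loaded in the headless browser, and each extracted row would end up as a record in the database.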
We are also glad to announce that the scraping has been running successfully for over two months and the data looks very promising! :) We have also uploaded the dataset to Kaggle so that you can play with the data and check whether what we are telling you is actually true. It is publicly accessible and you can find it at the following link:
We’ll let you know more about our next steps in upcoming posts ;)