My First Hackathon as a Data Scientist
Since the beginning of my data science career I had only taken part in small Kaggle competitions, and I was curious about what it would be like to compete in a bigger live event where I could benchmark my skills against other local professionals. Hence, I decided to participate in the BBVA Hackathon 2019 in Mexico City. https://www.bbva.com/es/mx/hackathon-bbva-2019-pone-a-prueba-a-jovenes-que-quieran-pensar-fuera-de-la-caja/
I invited two great friends and professionals, Carlos Alberto Haro and David Rivera, to participate in the competition with me as a team.
The hackathon lasted a whole weekend (September 6–8) and was hosted at BBVA Mexico’s main headquarters on Paseo de la Reforma avenue in Mexico City. Each team had to sign up for a specific challenge; this year there were 5 different categories requiring a range of skills across programming, math, sensors and big data. As you can imagine, we chose the big data category, the “GlobalHITSS challenge”, sponsored by the Mexican company of the same name, whose main focus is delivering end-to-end B2B big data projects.
As for the prizes, the overall winning team of the entire hackathon was awarded 80 thousand pesos (really good money), and smaller prizes were given to the winner of each challenge. For GlobalHITSS, the first-place prize was a trip to SparkCognition’s timemachine.ai event in Austin, Texas.
The challenge consisted of developing a big data algorithm that could update a large historical dataset (~50 GB) of credit card user complaints with status changes and new incidents that occurred in later periods (also big datasets). The algorithm could be developed in any programming language as long as it could be executed in an Apache Spark cloud environment provided by the challenge organizers. Besides the functionality, the algorithm had to meet a time benchmark to count as successful: a run time under 10 minutes, based on the actual time the company takes to accomplish this task.
After the challenge was presented we were very enthusiastic that we could deliver a good project, given that it was well within our area of expertise. The possibility of attending the timemachine.ai congress, along with spending a couple of days in Austin, was also very appealing to us (those Texas barbecue ribs are awesome, haha).
It was only after the challenge was thoroughly explained that the competition really began. The organizers gave each team a sample of the complete datasets so we could start developing our algorithms; however, we weren’t provided with the full version, since that was the input reserved for the real test.
Before we started coding, we made sure we had an in-depth understanding of the project, both to accomplish the task in the best possible way and to figure out whether we could provide additional value beyond the challenge requirements.
First we analyzed the data thoroughly, which gave us a good idea of its schema, data types and missing values. After this, we decided to structure our project in two parts: first, develop the data algorithm (the challenge’s main requirement); then, produce a deep descriptive analysis of the data and train an unsupervised machine learning algorithm (k-means), providing additional value to the challenge. So after the initial planning we started to do what we like most: coding!
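As a taste of that first exploration step, here is a minimal PySpark sketch of the kind of inspection we ran; the file path and loading options are hypothetical, but the calls are standard:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("eda-sketch").getOrCreate()

# Hypothetical path; the organizers provided the actual sample files.
complaints = spark.read.csv("data/complaints_sample.csv", header=True, inferSchema=True)

# Schema and data types at a glance.
complaints.printSchema()

# Basic descriptive statistics for every column.
complaints.describe().show()

# Missing values per column.
complaints.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in complaints.columns]
).show()
```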
We started developing the algorithm on a local laptop with PySpark installed, went through several iterations and made a few mistakes along the way, as is normal, but after a few hours of thought and coding we managed to complete our first successful version.
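The code of that first version isn’t reproduced here (the full repo is linked at the end of the post), but the heart of the task is an upsert: drop historical rows that were superseded by an update, then append the updated rows together with the brand-new incidents. A minimal sketch of that general approach, assuming a hypothetical key column complaint_id:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()

# Hypothetical inputs: the historical dataset plus a batch of
# status updates and new incidents sharing the same schema.
history = spark.read.parquet("data/history")
updates = spark.read.parquet("data/updates")

# Keep only historical rows that have NOT been superseded by an update...
unchanged = history.join(updates, on="complaint_id", how="left_anti")

# ...then append the updated rows and the new incidents in one union.
refreshed = unchanged.unionByName(updates)

refreshed.write.mode("overwrite").parquet("data/history_refreshed")
```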
After the algorithm was developed, we launched a cloud environment on Amazon Web Services to test it in an environment similar to the one the organizers would later use. We started a very basic Elastic MapReduce (EMR) cluster with 3 nodes (1 master and 2 workers) so as to invest only the funds necessary to test our hypothesis.
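For reference, a throwaway cluster like that can be launched programmatically with boto3; the cluster name, release label and instance types below are illustrative, not necessarily what we used:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="hackathon-test-cluster",          # illustrative name
    ReleaseLabel="emr-5.27.0",              # illustrative EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",  # 1 master node
        "SlaveInstanceType": "m5.xlarge",   # 2 worker nodes
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])
```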
We started testing our algorithm in our cloud environment and it ran smoothly, successfully completing the task. This gave us plenty of confidence, so we then approached the challenge organizers to test it on their cloud environment with the complete datasets.
In the first execution on the complete cloud environment our algorithm ran in 11:36, which was not enough to beat the benchmark established for the challenge. Also, two other teams were already below that time (9 and 8 minutes), so clearly we had to make adjustments.
After several iterations over the next hours, a lot of coffee and pizza, we managed to decrease the execution time of our algorithm to 6:36, actually beating the challenge benchmark and putting us in first place among the teams. Given this accomplishment, we then focused on the analysis and machine learning algorithm we had planned since the beginning of the challenge.
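This post doesn’t walk through the exact changes we made (the final code is in the repo linked below), but the kinds of Spark optimizations that commonly shave minutes off a job like this are broadcasting the small side of a join, avoiding unnecessary shuffles, and keeping output partition counts sane. A sketch under those assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

history = spark.read.parquet("data/history")   # large (~50 GB)
updates = spark.read.parquet("data/updates")   # much smaller

# Broadcasting the small side turns a shuffle-heavy sort-merge join
# into a map-side join, so the big table is never shuffled.
unchanged = history.join(broadcast(updates), "complaint_id", "left_anti")
refreshed = unchanged.unionByName(updates)

# Writing thousands of tiny files is slow; coalesce lowers the
# partition count without triggering a full shuffle.
refreshed.coalesce(64).write.mode("overwrite").parquet("data/out")
```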
The analysis and machine learning were the icing on the cake for our project, since they showcased both sides of the data science process: our data engineering skills as well as our statistics and machine learning skills.
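The clustering itself is in the R Markdown linked at the end of the post; purely as an illustration, a comparable k-means pipeline in PySpark’s MLlib looks like this (the feature columns and the choice of k are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()
complaints = spark.read.parquet("data/history_refreshed")

# Assemble hypothetical numeric columns into one feature vector.
assembler = VectorAssembler(
    inputCols=["days_open", "amount_disputed", "num_contacts"],
    outputCol="features",
)
features = assembler.transform(complaints)

# Fit k-means with an illustrative k; in practice k is picked with
# an elbow plot or silhouette scores.
model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)  # adds a "prediction" column
clustered.groupBy("prediction").count().show()
```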
By now it was Sunday at 2:00 pm; we had reached the hackathon’s deadline and projects were to be reviewed by the judges. So we made our submission and hoped for the best.
After presenting our project to the judges we were nervous, excited and anxious all at once, yet there was nothing left to do but wait for the final verdict. After some minutes, the judges first announced the winners of each challenge, and to our surprise, we won first place in GlobalHITSS. This immediately made us eligible to win the entire hackathon; that major prize, however, was awarded to another team. Nevertheless, we were extremely happy with our results and excited about our upcoming trip to Austin, TX.
Final Remarks
The experience of participating in this hackathon was very pleasant and rewarding for us as a team. We proved to ourselves that we can deliver data solutions that provide true value to today’s enterprises facing challenges in the data science and engineering fields. We were also able to learn from and interact with professionals in other technology fields such as mobile application development, agrotech and cybersecurity, among others. This let us catch up with current trends in those fields, giving us a more complete vision of today’s technological landscape.
We were very pleased with the event’s organization, the challenges given, and our interactions with other participants. We are most definitely looking forward to participating again in the next BBVA hackathon!
Our algorithm can be found on my GitHub account at the following link: https://github.com/al102964/bbva
Our R Markdown can be found on Charlie’s Google Drive account at the following link: https://drive.google.com/open?id=1i7GbTIneG2ixP4FQYlaIs6BWHUlmvDGk
The event highlights, sponsored by @BBVA and Global Hitss, can be found below (we appear at 2:31).
The next post will be about our experience at SparkCognition’s timemachine.ai event in Austin, TX.