Data Sharks
This report picks up at the point where we had finished watching all the tutorials and acquired the knowledge needed to start the task we had been assigned.
First of all, we needed data to check the validity of our hypothesis. After searching for some time, we found several trustworthy database organizations, but many of them offered an overwhelming amount of information. Due to hardware limitations, we concluded that not every indicator could be selected, so only around 30 of them would make it into the final analysis. Choosing the indicators was also difficult, because many of them had a lot of missing information, which forced us to inspect each table manually.
Once the selection of databases was complete, we tried importing each one from its URL through Python code; however, not all databases had a direct link. For this reason, downloading the .csv files and bundling them into a single .zip was a simpler approach. This first battle with the data was a really difficult process: each file had a different column layout, many columns came in different dtypes, and some rows had missing or blank values. Each data provider required its own treatment, so all files were named following the same convention: 'NameOfTheProvider - Indicator Name'. A long time went into writing two large methods, 'preprocess' and 'rename_value_column', which read the name of each .csv and apply a specific treatment depending on where the file was downloaded from.
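As an illustration of the idea behind 'preprocess', the sketch below infers the provider from the file name and dispatches to a provider-specific treatment; the provider names, column names and layouts are placeholder assumptions, not the actual ones used in the project.

```python
import pandas as pd
from pathlib import Path

def preprocess(csv_path: str) -> pd.DataFrame:
    """Read one indicator file and normalize it to a common Country/Year layout.

    The provider is inferred from the file name, which follows the convention
    'NameOfTheProvider - Indicator Name.csv'.
    """
    provider, indicator = [p.strip() for p in Path(csv_path).stem.split(" - ", 1)]
    df = pd.read_csv(csv_path)

    # Provider-specific cleanup; the providers and column names are placeholders.
    if provider == "ProviderA":
        # Wide layout: one column per year, so melt it into Country/Year/value.
        df = df.melt(id_vars=["Country"], var_name="Year", value_name=indicator)
    elif provider == "ProviderB":
        # Long layout: just keep and rename the relevant columns.
        df = df.rename(columns={"Entity": "Country", "Value": indicator})
        df = df[["Country", "Year", indicator]]
    else:
        raise ValueError(f"No treatment defined for provider '{provider}'")

    # Enforce consistent dtypes so the later merges do not fail.
    df["Year"] = pd.to_numeric(df["Year"], errors="coerce").astype("Int64")
    df[indicator] = pd.to_numeric(df[indicator], errors="coerce")
    return df.dropna(subset=["Year"])
```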
After a few issues, we managed to make the merge method work properly, resulting in a comprehensive database of around 70,000 rows and 34 columns, including those for the country and year.
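The report does not show the merge itself, but one way to build such a table is to outer-merge every preprocessed indicator frame on the country and year columns, so countries or years missing from one source are kept as NaN rather than dropped. A minimal sketch:

```python
from functools import reduce
import pandas as pd

def merge_indicators(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Outer-merge every preprocessed indicator table on Country and Year."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=["Country", "Year"], how="outer"),
        frames,
    )

# Usage: one frame per indicator file, e.g. produced by preprocess() above.
# merged = merge_indicators([preprocess(path) for path in csv_paths])
```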
Once all of the data was placed into one dataframe, we could refine the integration, identify the total number of missing values for each country or period of years, and narrow our study to a sample with enough data.
As a first approach, we decided to ignore entries older than 1990 and to drop the countries that had no more than 15 usable indicators. This way, we reduced the size of our dataframe without compromising the quality of the data. This trimming process will be refined in the near future, once we finish the main part of the code and can focus more on the subject of study.
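A minimal sketch of this audit-and-trim step, assuming the merged dataframe has 'Country' and 'Year' columns plus one column per indicator (the cut-off logic below is our interpretation of the criteria described above):

```python
import pandas as pd

def trim_sample(merged: pd.DataFrame,
                first_year: int = 1990,
                min_indicators: int = 15) -> pd.DataFrame:
    """Keep only recent entries and countries with enough populated indicators."""
    indicator_cols = [c for c in merged.columns if c not in ("Country", "Year")]

    # Audit: total missing values per country, useful for deciding the cut-offs.
    missing_by_country = merged.groupby("Country")[indicator_cols].apply(
        lambda g: g.isna().sum().sum()
    )
    print(missing_by_country.sort_values(ascending=False).head())

    # Drop entries older than the chosen year.
    recent = merged[merged["Year"] >= first_year]

    # Count how many indicators have at least one value for each country,
    # and keep only the countries above the threshold.
    indicators_per_country = recent.groupby("Country")[indicator_cols].apply(
        lambda g: g.notna().any().sum()
    )
    keep = indicators_per_country[indicators_per_country > min_indicators].index
    return recent[recent["Country"].isin(keep)]
```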
At the same time, a model definition diagram was created; it still needs some improvements because the Python code is still a work in progress. The structure and size of the code made Jupyter Notebook no longer suitable for development, so we migrated the code to Visual Studio and split methods, functions and classes into separate modules.
This change allowed us to create a Poetry file and share the project more easily via GitHub, but it required considerable effort to adapt the code to the new environment. We noticed that the code could be further restructured to make it more readable and tidy in this new environment, but that had to be postponed; instead, we focused on making the code work properly again.
Finally, we have spent some time planning the next steps in the treatment of the data. The null values are the first sensible target, and we decided to fill the blanks with values estimated through interpolation between the closest non-null values.
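A minimal sketch of this gap-filling step with pandas, interpolating each country's time series separately so blanks are only estimated between that country's own closest non-null values (column names are the same assumptions as above):

```python
import pandas as pd

def fill_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Interpolate missing indicator values within each country's time series."""
    indicator_cols = [c for c in df.columns if c not in ("Country", "Year")]

    def interpolate_country(group: pd.DataFrame) -> pd.DataFrame:
        group = group.sort_values("Year").copy()
        # limit_area="inside" avoids extrapolating before the first or
        # after the last known value for that country.
        group[indicator_cols] = group[indicator_cols].interpolate(
            method="linear", limit_area="inside"
        )
        return group

    return df.groupby("Country", group_keys=False).apply(interpolate_country)
```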
The second stage involved dealing with outliers. First of all, we had to identify which values were just anomalies due to explainable circumstances, and which ones were true outliers. This process was done manually at first, and later with the help of a primitive algorithm that roughly estimates whether a value is part of a tendency. The values that were not had to be treated, and we decided to apply the same criterion as with the null values and use interpolation methods.
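The report does not specify the algorithm, but one simple tendency check of this kind is a rolling-median rule like the sketch below; the 'GDP per capita' column in the usage comment is purely hypothetical.

```python
import pandas as pd

def flag_outliers(series: pd.Series, window: int = 5, threshold: float = 3.0) -> pd.Series:
    """Roughly flag values that break away from the local tendency.

    A point is flagged when it deviates from the rolling median by more than
    `threshold` times the rolling standard deviation of its neighbourhood.
    """
    rolling_median = series.rolling(window, center=True, min_periods=2).median()
    rolling_std = series.rolling(window, center=True, min_periods=2).std()
    deviation = (series - rolling_median).abs()
    return deviation > threshold * rolling_std

# Flagged points can then be blanked out and re-filled with the same
# interpolation used for the null values:
# outliers = flag_outliers(country_df["GDP per capita"])
# country_df.loc[outliers, "GDP per capita"] = pd.NA
# country_df["GDP per capita"] = country_df["GDP per capita"].interpolate(limit_area="inside")
```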
This second week has been a really big challenge in terms of normalizing the data and structuring the code properly; however, we think we have made huge progress, learned a lot about coding, and gained a much better sense of how to work as a team.