progress#1
Conversation
stoufa
left a comment
Good job overall. I will merge this PR as soon as you update this notebook as explained in this review's comments.
  }
 ],
 "source": [
  "df1 = pd.read_csv(\"data.csv\",low_memory = False,skiprows=4,parse_dates=True,index_col=[\"Country\",\"Date\"])\n",
Which period does "data.csv" cover (2020Q1, 2020Q2, 2020Q3, 2020Q4, 2019Q1, 2019Q2, ...)? Please specify this in a comment above this line or in a markdown cell above this one.
  }
 ],
 "source": [
  "df_ug = pd.read_csv(\"Ugandacovid.csv\",low_memory = False,parse_dates=True)\n",
Did you extract Uganda's data manually or with code? If you used code, please add it to the notebook to better document the process.
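A minimal sketch of what that documented filtering step could look like. It assumes the raw file has a `Country` column (it is used as an index column in the `read_csv` call reviewed earlier) and that Uganda is encoded as `"UG"`; that code value and the helper name `filter_country` are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd


def filter_country(df, country_code, output_path=None):
    """Keep only the rows of one country and optionally save them as CSV.

    Assumes a "Country" column; the code value (e.g. "UG") is an assumption.
    """
    subset = df[df["Country"] == country_code].reset_index(drop=True)
    if output_path is not None:
        # Saving from code makes the derived file reproducible.
        subset.to_csv(output_path, index=False)
    return subset


# Toy data standing in for the raw file (values are made up).
raw = pd.DataFrame({
    "Country": ["UG", "FR", "UG"],
    "Specie": ["temperature", "pm10", "humidity"],
    "median": [24.0, 18.0, 70.0],
})
uganda = filter_country(raw, "UG")
```

Passing `output_path="Ugandacovid.csv"` would regenerate the file that the notebook reads, so the manual step disappears.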
 ],
 "source": [
  "plt.figure(figsize=[12,8])\n",
  "plt.plot(df_ug[df_ug['Specie']=='temperature'].loc[:30,'Date'], df_ug[df_ug['Specie']=='temperature'].loc[:30,'median'])\n",
You have already started visualizing the data, great job. Starting with the median is a good choice. Next, I would recommend visualizing more statistical measures on the same plot. For example, you can use a box plot to show the min, max, median, and variance values together. Here's an example:
https://stackoverflow.com/questions/33328774/box-plot-with-min-max-average-and-standard-deviation/33330997
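A rough sketch of that idea, using `errorbar` rather than a true box plot because the dataset only provides summary values (no quartiles). It assumes columns named `Date`, `min`, `max`, `median`, and `variance`; the helper name `plot_summary` and the toy values are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_summary(df):
    """Plot median +/- one standard deviation (thick) and min..max (thin)."""
    x = np.arange(len(df))
    med = df["median"].to_numpy(dtype=float)
    std = np.sqrt(df["variance"].to_numpy(dtype=float))
    # Asymmetric error bars: distance from the median down to min and up to max.
    lower = med - df["min"].to_numpy(dtype=float)
    upper = df["max"].to_numpy(dtype=float) - med

    fig, ax = plt.subplots(figsize=(12, 8))
    ax.errorbar(x, med, yerr=[lower, upper], fmt="none", ecolor="gray", lw=1)
    ax.errorbar(x, med, yerr=std, fmt="ok", lw=3)
    ax.set_xticks(x)
    ax.set_xticklabels(df["Date"].astype(str).tolist(), rotation=45)
    ax.set_xlabel("Date")
    ax.set_ylabel("value")
    return ax


# Toy rows (made-up numbers) with the assumed column names.
toy = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "min": [18.0, 19.0, 17.5],
    "max": [29.0, 30.0, 28.0],
    "median": [24.0, 25.0, 23.0],
    "variance": [9.0, 4.0, 6.25],
})
```

Calling `plot_summary(toy)` returns the axes, so the plot can be customized further in the notebook.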
stoufa
left a comment
EDA.ipynb
I believe you forgot to change column_name to a more meaningful name in the following code:
missing_value_df = pd.DataFrame({'column_name': ind1.columns,
                                 'percent_missing': percent_missing})
You can, for example, change it to numerical measure:
https://www.britannica.com/science/statistics/Numerical-measures
In this cell:
import seaborn as sns
correlation_mat = france[["min","max","median","variance","type"]].corr()
sns.heatmap(correlation_mat, annot = True)
plt.show()
I can see that you are computing the correlation between the air quality metric distributions and the lockdown type. However, the code as it stands treats all the species as a whole, which is not what we are looking for here. We should separate the species (pm10, humidity, o3, co, no2, so2, wind-speed, wind-gust, dew, ...) to find out which ones to include in the training process and which ones to ignore. By the way, this is what you tried to do by plotting the data, but you didn't compute the separate correlations.
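One possible way to compute the separate correlations is sketched below. It assumes the frame has a `Specie` column (as in the other notebook) and that `type` is already numerically encoded, which the original `.corr()` call suggests; the helper name `correlation_with_type` and the toy values are illustrative.

```python
import pandas as pd


def correlation_with_type(df, measures=("min", "max", "median", "variance"),
                          target="type"):
    """Return one row per species: correlation of each measure with `target`."""
    rows = {}
    for specie, group in df.groupby("Specie"):
        # corrwith computes the correlation of every column with the target.
        rows[specie] = group[list(measures)].corrwith(group[target])
    return pd.DataFrame(rows).T


# Toy data (made-up): pm10 rises with the lockdown type, o3 falls.
toy = pd.DataFrame({
    "Specie": ["pm10"] * 4 + ["o3"] * 4,
    "type": [0, 1, 0, 1, 0, 1, 0, 1],
    "min": [1, 2, 1, 2, 5, 4, 5, 4],
    "max": [3, 5, 3, 5, 9, 7, 9, 7],
    "median": [2, 3, 2, 3, 7, 6, 7, 6],
    "variance": [1, 2, 1, 2, 3, 1, 3, 1],
})
per_species = correlation_with_type(toy)
```

The resulting table can be passed to `sns.heatmap` directly, giving one heatmap row per species instead of a single pooled matrix.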
This remark applies to other plots besides this:
# Plot time series dataset
ax = humidity.plot(linewidth=2, fontsize=12);
# Additional customizations
ax.set_xlabel('Date');
ax.legend(fontsize=12);
The labels of the x-axis are hard to read. Find a way to format the Date column so that it is displayed as a date. Also, I know that this notebook was pushed before our discussion about the distribution, but I would prefer that you plot the Probability Density Function (PDF) of the distribution instead of superimposing the statistical measures. That way, the plot will be cleaner and easier to read.
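A small sketch of one way to get readable date labels, assuming the dates live in a column named `Date` stored as strings. The helper name `plot_with_dates`, the monthly tick spacing, and the label format are illustrative choices.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd


def plot_with_dates(df, column):
    """Plot `column` against a parsed Date axis with month-level tick labels."""
    dates = pd.to_datetime(df["Date"])  # strings -> real datetimes
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.plot(dates, df[column], linewidth=2, label=column)
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
    fig.autofmt_xdate()  # rotate the labels so they do not overlap
    ax.set_xlabel("Date")
    ax.legend(fontsize=12)
    return ax


# Toy series (made-up values).
toy = pd.DataFrame({
    "Date": ["2020-01-15", "2020-02-15", "2020-03-15"],
    "humidity": [70.0, 65.0, 60.0],
})
```

Plotting through `ax.plot` instead of `DataFrame.plot` keeps the x-axis in matplotlib's own date units, which is what `DateFormatter` expects.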
I'm not sure why you wanted to read the pm10 and no2 median values side by side. It would be great if you documented your reasoning a bit to explain your thought process to the readers of your work (and to yourself in the future, in case you need to look back at this notebook).
pm10_1.merge(no2_1[['no2', 'Date']], on = 'Date', how = 'left')
Same for the pivot table: why did you choose those particular columns, and what are you trying to do here? Even a one-line comment can be enough to describe your idea(s).
france_clean = france.pivot_table(index=['Date', 'Country', 'City','type'],columns='Specie',values='variance').reset_index().sort_values(['Country','City'])
Here, the dates on the x-axis look a lot better than earlier. However, I recommend switching the y-axis to a logarithmic scale (instead of the default linear scale) so that both the big and the small values are visible.
france_clean[france_clean.columns.drop(['Country', 'City', 'type'])].plot(figsize=(15, 6))
plt.show()
I agree, a scatter plot might be a better alternative to the line plots used earlier.
px.scatter(data_frame=france_clean[france_clean['City']=='Paris'],x=france_clean[france_clean['City']=='Paris'].index,y='pm25',color='type')
And you already tried plotting a Kernel Density Estimate (KDE)? Nice to see that.
df_ozone.plot(kind='kde',figsize=[14,12]);
And then you trained an fb_prophet model to forecast the evolution of o3 in the French air. Now, focus on the dates and train two versions, one on the data before a lockdown and another on the data after it, and see if there are any differences between the two.
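A sketch of how the before/after split could be prepared, under a few assumptions: the frame has a `Date` column, "after" means from the lockdown start date onward, and the cutoff date is passed in by the caller. The helper name `split_for_prophet` and the toy values are illustrative. The output uses the `ds`/`y` column names that Prophet expects; the fitting step is left as a comment so the sketch runs without Prophet installed.

```python
import pandas as pd


def split_for_prophet(df, value_col, lockdown_start):
    """Split a series at `lockdown_start` into two Prophet-style frames."""
    data = df[["Date", value_col]].rename(columns={"Date": "ds", value_col: "y"})
    data["ds"] = pd.to_datetime(data["ds"])
    data = data.dropna(subset=["y"]).sort_values("ds")
    cutoff = pd.Timestamp(lockdown_start)
    before = data[data["ds"] < cutoff].reset_index(drop=True)
    after = data[data["ds"] >= cutoff].reset_index(drop=True)
    return before, after


# Toy series (made-up values); the cutoff below is only illustrative and
# should come from the scraped lockdown table.
toy = pd.DataFrame({
    "Date": ["2020-03-15", "2020-03-16", "2020-03-17", "2020-03-18", "2020-03-19"],
    "o3": [30.0, 31.0, 28.0, 27.5, 26.0],
})
before, after = split_for_prophet(toy, "o3", "2020-03-17")

# With Prophet installed, each part could then be fitted separately, e.g.:
#   from prophet import Prophet
#   model_before = Prophet().fit(before)
#   model_after = Prophet().fit(after)
```

Comparing the forecasts (or the fitted trend components) of the two models is then a direct way to look for a lockdown effect.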
Duplicate air_lockdown Notebooks
It looks like air lockdown.ipynb, air_lockdown.ipynb, and notebooks/air lockdown.ipynb are duplicates of the same notebook, which I have already reviewed. Keep only one of them and remove the other copies.
air_lockdown_filtering.ipynb
The air_lockdown_filtering.ipynb notebook downloads data and filters out all countries except France, but I don't see where in the code you exported/saved the france_lockdown.csv file. Documenting the processing steps is more important than the final result: when we have the processing script, we can run it again and regenerate the result (if it's not there already), but going the other way is hard (given the data, what processing has it been through?).
web_scraping.ipynb
The web_scraping.ipynb notebook aims to scrape data from the COVID-19_lockdowns Wikipedia article. I saw that you started with BeautifulSoup and then switched to pd.read_html. I agree: in this particular case, using pandas is the best (and easiest) option.
I liked how you used a regular expression (regex) to remove the reference brackets in the scraped data:
df_nat_clean3 = df_nat_clean3.replace(to_replace ='\[.*', value = '', regex = True)
df_nat_clean3
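For reference, a small sketch of how that pattern behaves on made-up cells, next to a narrower alternative. The pattern `\[.*` removes everything from the first bracket to the end of the cell, while `\[[^\]]*\]` removes only the bracketed references. The second pattern is a suggestion, not something the notebook uses, and the toy values are invented.

```python
import pandas as pd

# Made-up cells imitating scraped table text with reference markers.
toy = pd.DataFrame({
    "Place": ["Wuhan[5] (Hubei)", "Paris[12]"],
    "Start": ["2020-01-23[a][b]", "2020-03-17"],
})

# Pattern from the notebook (written as a raw string): drops the first
# bracket and everything after it.
greedy = toy.replace(to_replace=r"\[.*", value="", regex=True)

# Narrower alternative: drops only the "[...]" markers themselves.
targeted = toy.replace(to_replace=r"\[[^\]]*\]", value="", regex=True)
```

Both give the same result when the reference markers sit at the end of a cell; they differ only when text follows a marker, as in the first `Place` value.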
stoufa
left a comment
Regarding data/processed/french_data.csv: avoid uploading it with the code by adding its path to the .gitignore file, so that you can keep working on it locally without removing it manually before each commit. A better approach is to document the process of acquiring and processing that file and, instead of uploading it, to share it on Dropbox or Google Drive and include the download link.
this is our progress so far