For this problem set, we will be using real data. We will analyze the height and weight of the athletes at the 2012 London Olympics. You can find the CSV for this dataset here.
- Use `DictReader` from the `csv` module (`import csv`) to read the CSV data file into a list of dictionaries named `athletes`, where each row is a dictionary.
- Create a list named `ages` that is a simple list of integers of all the ages in our file.
- Create two lists named `ages_female` and `ages_male` that are simple lists of integers of the ages of female and male athletes.
- Create three lists `weights`, `weights_female`, and `weights_male`, much like parts 2 and 3, that are simple lists of integer values of the weights from `athletes`.
- Create three lists `heights`, `heights_female`, and `heights_male`, much like parts 2 and 3, that are simple lists of integer values of the heights from `athletes`.
- Create a list called `bmi`, which is a list of the body mass index (BMI) values for each athlete in our list. (HINT: BMI = weight {kg} / (height {meters} * height {meters}).)
- Much like part 5, create two lists `bmi_female` and `bmi_male`, which include just the BMI values for the female and male athletes respectively.
NOTE: This problem set deals with the BMI because it is easy to calculate for this particular data set. However, the BMI has many limitations, and it does not fully represent the health of the human body.
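The list-building steps above can be sketched roughly as follows. This is a minimal sketch, not a reference solution: the inline `SAMPLE_CSV` text and its three athletes are made up so the example runs on its own, the column names are assumed to match the example rows shown elsewhere in this problem set, and the helper `bmi_of` is a hypothetical name.

```python
import csv
import io

# Made-up stand-in for the real CSV file. The column names are assumed to
# match the example rows in this problem set; the values are invented.
SAMPLE_CSV = """Name,Age,Sex,Weight (kg),Sport,Height (cm)
Athlete A,23,F,60,Rowing,170
Athlete B,30,M,80,Judo,180
Athlete C,27,F,55,Fencing,165
"""

# With the real file, open it with open(path, newline="") instead of StringIO.
with io.StringIO(SAMPLE_CSV) as csv_file:
    athletes = list(csv.DictReader(csv_file))  # one dictionary per row

# DictReader returns every field as a string, so convert explicitly.
ages = [int(row["Age"]) for row in athletes]
ages_female = [int(row["Age"]) for row in athletes if row["Sex"] == "F"]
ages_male = [int(row["Age"]) for row in athletes if row["Sex"] == "M"]


def bmi_of(row):
    """Hypothetical helper: BMI = weight in kg / (height in meters squared)."""
    weight_kg = float(row["Weight (kg)"])
    height_m = float(row["Height (cm)"]) / 100  # centimeters to meters
    return weight_kg / (height_m * height_m)


bmi = [bmi_of(row) for row in athletes]
bmi_female = [bmi_of(row) for row in athletes if row["Sex"] == "F"]
bmi_male = [bmi_of(row) for row in athletes if row["Sex"] == "M"]
```

The weights and heights lists follow the same pattern as the ages lists. The real file may contain blank or malformed fields, in which case the conversions would need a guard; the sketch assumes clean rows.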
- Find the mean and standard deviation of `ages`, `ages_female`, and `ages_male`. What do you now know about the age of Olympic athletes? Is this what you expected?
- Find the mean and standard deviation of `heights`, `heights_female`, and `heights_male`. We probably expect the average man to be somewhat taller than the average woman. Is that true for Olympic athletes?
- Find the mean and standard deviation of `weights`, `weights_female`, and `weights_male`. We probably expect the average man to be somewhat heavier than the average woman. Is that true for Olympic athletes?
- Find the mean and standard deviation of `bmi`, `bmi_female`, and `bmi_male`. What is a typical BMI for an Olympic athlete?
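One possible way to compute a mean and standard deviation uses only the standard library's `statistics` module. In this sketch the list `ages_sample` holds made-up values and `describe` is a hypothetical helper name; the choice of the population standard deviation (`pstdev`) over the sample standard deviation (`stdev`) is an assumption, not something the problem set specifies.

```python
import statistics

# Made-up ages standing in for one of the real lists.
ages_sample = [23, 30, 27, 21, 34]


def describe(values):
    """Hypothetical helper: return (mean, population standard deviation)."""
    return statistics.mean(values), statistics.pstdev(values)


mean_age, std_age = describe(ages_sample)
print(f"mean = {mean_age:.2f}, standard deviation = {std_age:.2f}")
```

`numpy.mean` and `numpy.std` would work as well; note that `numpy.std` also defaults to the population form, while `statistics.stdev` divides by n - 1.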
- How do the geometric mean and harmonic mean compare for `heights_female`?
- How do the geometric mean and harmonic mean compare for `weights_male`?
- Build a 10-bin histogram from the `bmi` list.
- Build a histogram for the `heights_female` and `heights_male` lists, starting at 120 cm and going up to 220 cm in 10 cm increments.
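The mean comparisons and the two histogram styles above might look something like this. It is a sketch on made-up heights (`heights_sample`), using `statistics.geometric_mean` (Python 3.8 or later) and `numpy.histogram`; the problem set itself does not say which functions to use, so treat these as one reasonable choice.

```python
import statistics

import numpy as np

# Made-up heights in centimeters standing in for a real list.
heights_sample = [160, 170, 170, 180, 190]

arithmetic = statistics.mean(heights_sample)
geometric = statistics.geometric_mean(heights_sample)
harmonic = statistics.harmonic_mean(heights_sample)
# For positive values that are not all equal:
# harmonic mean < geometric mean < arithmetic mean.

# Histogram with 10 equal-width bins spanning the data's own min and max.
counts_auto, edges_auto = np.histogram(heights_sample, bins=10)

# Histogram with explicit edges: 120, 130, ..., 220 (10 cm increments).
bin_edges = np.arange(120, 221, 10)
counts, edges = np.histogram(heights_sample, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right} cm: {count}")
```

`numpy.histogram` only counts; to draw the bars, the same `bins` argument can be passed to a plotting function such as `matplotlib.pyplot.hist`.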
If Angelina Jolie and Brad Pitt were in the athletes list above, here is what their lines would look like:
{'Name': 'Angelina Jolie', 'Age': '40', 'Sex': 'F', 'Weight (kg)': '56.5', 'Sport': 'Acting', 'Height (cm)': '173'}
{'Name': 'Brad Pitt', 'Age': '52', 'Sex': 'M', 'Weight (kg)': '78', 'Sport': 'Acting', 'Height (cm)': '180'}
- What percentile is Angelina Jolie's weight, compared to the `weights_female` list?
- What percentile is Brad Pitt's height, compared to the `heights_male` list?
- What percentile would Angelina and Brad fall into in `bmi_female` and `bmi_male` respectively?
- What percentile would YOU fall into for height, weight, and BMI, compared to the athletes of your sex? (No judgements!)
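A percentile rank can be defined in more than one way. The sketch below assumes the "at or below" definition (the percentage of list entries less than or equal to the score); `percentile_of_score` is a hypothetical helper and both sample lists are made up. Only Angelina's weight and Brad's height are taken from the example rows above.

```python
def percentile_of_score(values, score):
    """Hypothetical helper: percentage of entries in values that are <= score."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for value in values if value <= score)
    return 100.0 * at_or_below / len(values)


# Made-up lists standing in for weights_female and heights_male.
weights_female_sample = [50, 55, 60, 65, 70]
heights_male_sample = [170, 175, 180, 185, 190]

# Values copied from the example rows; they are strings there, so convert.
angelina_weight = float("56.5")
brad_height = float("180")

angelina_pct = percentile_of_score(weights_female_sample, angelina_weight)
brad_pct = percentile_of_score(heights_male_sample, brad_height)
print(f"Angelina's weight percentile: {angelina_pct:.1f}")
print(f"Brad's height percentile: {brad_pct:.1f}")
```

SciPy's `scipy.stats.percentileofscore` covers the same idea and offers several `kind` options that differ in how ties are counted.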
Let's try to fit our data. First, we will try to interpolate between the age and the BMI of our Olympic athletes. Interpolation is meant for situations where each X value has exactly one Y value. Since many of our athletes share the same age, our full data set is not a good fit for it, so we will work with a small sample. While taking a small sample of the data is fine for education, it is probably not what we would do with this data in real life.
- Use `dict` and `zip` to make a dictionary of the first 25 athletes in your `ages` and `bmi` lists. Name your dictionary `bmi_by_age`.
- Create an ordered list, named `age_keys`, of the ages in `bmi_by_age`. (Use `sorted` and `.keys()`.)
- Create a list, named `bmi_values`, of the BMI values associated with each age in `age_keys`. (Use a `for` loop and your `age_keys` along with `bmi_by_age`.)
- Create a function `f_linear` that is an interpolation of `age_keys` and `bmi_values`. (Use `interp1d`.)
- Create a function `f_cubic` that is a cubic interpolation of `age_keys` and `bmi_values`. (Use `interp1d` along with `kind='cubic'`.)
- Try different ages in your `f_linear` and `f_cubic` functions. How well do they match each other? How well do they match the data? Do they make sense?
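The interpolation steps above could be sketched as follows, assuming `interp1d` refers to `scipy.interpolate.interp1d`. The two sample lists are made up and deliberately contain a duplicate age to show what the dictionary does with it.

```python
from scipy.interpolate import interp1d

# Made-up stand-ins for the ages and bmi lists; note that age 27 appears twice.
ages_sample = [23, 30, 27, 21, 34, 27]
bmi_sample = [21.5, 24.0, 22.8, 20.9, 25.1, 23.4]

# The slice mirrors "the first 25 athletes"; it is harmless on a short list.
# A dict keeps one value per key, so only the last BMI seen for age 27 survives.
bmi_by_age = dict(zip(ages_sample[:25], bmi_sample[:25]))

age_keys = sorted(bmi_by_age.keys())
bmi_values = []
for age in age_keys:
    bmi_values.append(bmi_by_age[age])

f_linear = interp1d(age_keys, bmi_values)                # piecewise linear
f_cubic = interp1d(age_keys, bmi_values, kind="cubic")   # cubic spline

query_age = 25
linear_estimate = float(f_linear(query_age))
cubic_estimate = float(f_cubic(query_age))
print(f"age {query_age}: linear {linear_estimate:.2f}, cubic {cubic_estimate:.2f}")
```

Both functions pass through every data point and differ only between them. By default `interp1d` raises a `ValueError` for ages outside the range of `age_keys`, which is worth keeping in mind when trying different ages.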
Let's try to analyze all of our data points (athletes) in a slightly more realistic way. A good start would be to use a more general curve-fitting approach.
Just to help you through the process, here is the data you're trying to fit:
- Convert the following from lists to `numpy.array`: `ages_female`, `ages_male`, `bmi_female`, and `bmi_male`.
- Create a function named `linear` that takes `x`, `a`, and `b` and returns `a * x + b`.
- Use `curve_fit` and your `linear` function to fit the data where female athletes' ages are the x-values and female athletes' BMIs are the y-values. Do you think your fitted function matches the plot above?
- Use `curve_fit` and your `linear` function to fit the data where male athletes' ages are the x-values and male athletes' BMIs are the y-values. Do you think your fitted function seems reasonable? How could you test that?
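As a rough illustration of the curve-fitting steps, the sketch below assumes `curve_fit` is `scipy.optimize.curve_fit` and runs it on made-up arrays (`ages_arr`, `bmi_arr`) rather than the real athlete data. Comparing against `numpy.polyfit` and inspecting the residuals are two possible sanity checks, not the only ones.

```python
import numpy as np
from scipy.optimize import curve_fit


def linear(x, a, b):
    """Straight line a * x + b; works on scalars and NumPy arrays."""
    return a * x + b


# Made-up data: a known line (slope 0.1, intercept 20) plus small fixed offsets.
ages_arr = np.array([20, 22, 25, 28, 31, 35], dtype=float)
offsets = np.array([0.2, -0.1, 0.1, -0.2, 0.1, -0.1])
bmi_arr = 0.1 * ages_arr + 20.0 + offsets

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(linear, ages_arr, bmi_arr)
a_fit, b_fit = params

# Sanity check 1: an ordinary least-squares line should give the same answer.
slope_check, intercept_check = np.polyfit(ages_arr, bmi_arr, 1)

# Sanity check 2: residuals should be small and show no obvious trend.
residuals = bmi_arr - linear(ages_arr, a_fit, b_fit)
print(f"a = {a_fit:.4f}, b = {b_fit:.4f}")
print(f"largest residual = {np.abs(residuals).max():.4f}")
```

For a model that is linear in its parameters, `curve_fit` and `np.polyfit` solve the same least-squares problem, so a disagreement between them would point to a setup mistake.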
