In this post I'm going to explain what the GoldenCheetah OpenData project is and how you can work with the data it has collected using Jupyter notebooks.

Large collections of sports workout data are generally not open to the general public. Popular sites like Strava, TodaysPlan and TrainingPeaks collect large volumes of athlete data, but quite rightly do not publish this data publicly. But there is a growing appetite for such data, to inform development of new tools and to feed into models and machine learning algorithms. So I started a project to do it, the GoldenCheetah OpenData project.

My first priority was to make sure we did the right thing, in the right way, to protect user privacy and comply with GDPR regulations. As a result, we anonymise all the data before sending it out of GoldenCheetah and remove personally identifiable information and personal metadata. Crucially, we get the user's explicit consent to share anything (and offer options to revoke that consent too).

So, in April 2018, the 3.5 development release of GoldenCheetah started to ask users if they would share their data publicly. So far, as of November 2018, over 1300 users have said 'Yes' and shared over 700,000 workouts. The shared data is posted publicly both on an S3 bucket you can explore and download via a browser, and via a project on the Open Science Framework (OSF).

A library with all the book titles erased

In early May 2018 I posted a tweet to announce the availability of the data. I was expecting lots of folks to clamour to get hold of it and trigger a flurry of startling new insights and analysis from this treasure trove of information.

OK, right now it is like going in a library where all the book titles have been erased; you know something interesting is there, but are at a loss to find it :-)

The problem, of course, is that all the data is hidden away in a gazillion zip files. We were providing a huge collection of raw data that was almost impossible to navigate. We needed to provide tools and extracts of the data to get folks started.

To get things started I developed some Python programs that read through all the raw data and generate comma-separated values (CSV) files folks can work with. Those scripts run on the same server that receives the raw data and posts it to the OSF and S3 buckets.

There are 3 main CSV files so far, all focused primarily on power data:

- athletes.csv - one line per athlete (1300 or more), providing athlete bio like gender and age, along with career PBs for the most popular power metrics.
- activities.csv - one line per activity (700k or more), providing the same metrics as above, but for each workout.
- activities_mmp.csv - one line per activity (700k or more), listing peak power bests for durations from 1s to 36000 seconds.

As part of validating the datasets I started to plot the data and explore the values. It became clear, really quickly, that some of the data was poor quality. Not everyone is as particular about their data as I am. Who knew?

Clearly I needed to do some data profiling to understand the data better. This would then help to generate rules for data editing and cleansing to get rid of some of the dirt.

I spent a good few weeks playing with the data and ended up creating two spreadsheets that summarise the distributions of power values for different durations. These power profiles are also published online:

- Power profile table - percentile values for specific durations and parameters, rather similar to the Coggan Power Profile, but empirically derived.
- Power duration - percentile values for all power durations from 1 second out to 10 hours.

Jupyter Notebook and 3000 odd athlete season MMP curves

I started to get a much better feel for the quality of the data, and some of the tweets I posted generated a lot of discussion. Some of these discussions got quite heated. My standard refrain to such criticism was "hey, it's public data, the notebooks are online, go look for yourself".

Which leads me to this post: getting started with GoldenCheetah and OpenData.

Installing Python and getting a Jupyter Notebook Setup

Step one: install Python 3

You will need to install Python 3, which is the language all the code is written in. This can be done by downloading the installer from the download page on the website. Instructions for each platform are described on the website page.

Step two: launch python and install key python packages

You need to make sure Python is on your PATH. Once that is done, open a 'CMD' prompt and run Python to install the key dependencies via pip.
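
With Python installed, a first look at an MMP-style extract might go something like the sketch below. It is a minimal illustration of the percentile profiling this post describes, not the scripts or notebooks actually used for the published power profiles. It uses only the standard library, and it reads a tiny made-up sample instead of the real activities_mmp.csv. The column names ("id", "1", "60", "300"), the sample values and the choice of 25th/50th/75th percentiles are all assumptions for illustration; check the header of the real file before adapting it.

```python
import csv
import io
import statistics

# Made-up sample in the shape described for activities_mmp.csv: one row per
# activity, one column per duration in seconds. The column names and values
# are assumptions for illustration, not the real file's contents.
SAMPLE_CSV = """id,1,60,300
a1,900,450,320
a2,1100,500,350
a3,700,380,280
a4,1300,560,400
a5,2500,410,300
"""


def load_peaks(handle, durations):
    """Return {duration: [peak watts, ...]} from a CSV file-like object."""
    peaks = {d: [] for d in durations}
    for row in csv.DictReader(handle):
        for d in durations:
            value = row.get(d, "")
            if value:  # skip blank or missing cells
                peaks[d].append(float(value))
    return peaks


def percentile_table(peaks, percentiles=(25, 50, 75)):
    """Return {duration: {percentile: watts}} for each duration column."""
    table = {}
    for duration, values in peaks.items():
        # quantiles(n=100) returns 99 cut points, for percentiles 1..99
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        table[duration] = {p: cuts[p - 1] for p in percentiles}
    return table


peaks = load_peaks(io.StringIO(SAMPLE_CSV), ["1", "60", "300"])
table = percentile_table(peaks)
for duration, row in table.items():
    print(f"{duration}s:", row)
```

To try it on a downloaded extract, swap `io.StringIO(SAMPLE_CSV)` for `open("activities_mmp.csv", newline="")` and pass the duration columns that file really contains. Note how the single 2500 W value in the 1-second column stretches the top of that distribution; spotting values like that is the point of profiling before writing any cleansing rules.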