The inaugural Big Data Utah and Boulder/Denver Big Data Users Group (BDBDUG) Global Data Competition 2015: Collaborate to Change Climate Change officially kicks off this weekend (June 6). The competition focuses on climate analysis in 22 regions around the globe, with the mandate to facilitate global collaboration and to promote data-driven decision-making that allows others to improve research and decisions on investments, adaptive approaches, policymaking, and more. This regional approach emphasizes that the impacts of climate change differ from place to place, while giving people from all walks of life a chance to contribute directly and effect change through access to big data computation resources and data science.
The competition has two phases. The first phase is a predictive benchmark competition that provides competitors with an incomplete climate dataset and teaches them about the data they will encounter and the science behind it. Competitors will be expected to complete the dataset at this stage, and competition points will be allocated using a data-science-oriented, Kaggle-style scoring format based on the accuracy and efficiency of their methods.
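To make the first phase concrete, here is a minimal sketch of completing a series with gaps and checking how well the completion matches held-back values. The linear interpolation method, the toy temperature values, and the use of root-mean-square error (RMSE) as the accuracy measure are all assumptions for illustration; the competition's actual data and scoring formula are not described here.

```python
import numpy as np


def fill_gaps_linear(series):
    """Fill NaN gaps by linear interpolation between the nearest known values."""
    series = np.asarray(series, dtype=float)
    idx = np.arange(series.size)
    known = ~np.isnan(series)
    return np.interp(idx, idx[known], series[known])


def rmse(predicted, truth):
    """Root-mean-square error between two equally sized arrays."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy "complete" series (hypothetical values), with two entries hidden
# to mimic the incomplete dataset handed to competitors.
truth = np.array([10.0, 12.0, 14.0, 13.0, 12.0, 15.0])
hidden = [2, 4]
incomplete = truth.copy()
incomplete[hidden] = np.nan

completed = fill_gaps_linear(incomplete)
score = rmse(completed[hidden], truth[hidden])
print("filled values:", completed[hidden])
print("RMSE on hidden values:", score)
```

Hiding values you already know and scoring your own fill-in method against them is one way to compare candidate methods before submitting.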
The second phase of the competition addresses individually identified problems associated with climate change, and is the core stage for the big climate data/data science mashup. For that part of the competition, the challenge from the social studies teacher, Mr. Simonet, to his students in the Catherine Ryan Hyde 1999 novel “Pay it Forward” (made into a movie in 2000) can be given to each competitor: “Think of an idea for world change, put it into action.” Put more verbosely for the purpose of this competition: think of a solution or a mechanism to facilitate a solution to a climate change issue in your region, and put it into action while collaborating with like-minded colleagues.
This post offers some tenets to consider, given the interdisciplinary and collaborative nature of this competition.
1. Rigorously define the problem
Among the hundreds of variations of quotes from Albert Einstein available on the web is the following: “If I had an hour to solve a problem I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”
Defining the core problem is critical before devising solutions, and has been shown to save time and effort in the long run. But this can be a daunting task. It is nonetheless an imperative one: knowledge of programming languages and statistics without a clearly defined science problem often leads into the danger zone, where correlations or statistically significant results are found that are not necessarily scientifically accurate. Thus, defining the problem and the proposed solution with sufficient science background is paramount. This step usually takes quite some time and can be frustrating, mostly because it looks like inaction, but do not skip it.
The advantages of due diligence here include clearer guidance on which datasets to use, which dataset elements are needed, and which analysis methods are required. Even in the first phase of this competition, where the data is already provided, it may be worth taking the time to understand the dataset and the existing methods used in the relevant science field for filling in missing variables. A few thought-provoking resources pertinent to the second part of the competition include NASA's Global Climate Change website, which showcases innovation in the Earth Blog series and provides discussions on climate headlines across the globe; the Adapting to Climate Change Facebook page, where discussions, dissemination and sharing of information about climate change adaptation from all parts of the globe can be found; and the Climate and Development Knowledge Network (CDKN), which supports decision-makers in designing and delivering climate products to minimize the impact of climate change while maximizing economic and social growth.
2. Communication is paramount
Teams in this competition can consist of 2-10 people, or of 10+ people, in which case they are considered organization teams for the purposes of this competition. Competitors may also choose to compete individually. Nonetheless, collaborating requires a holistic effort. Of course, we have all heard the tired mantra, “There is no ‘I’ in team.” Likewise, surely we have all observed the golden conversational rule within a team, which is to use the pronoun “we” and seldom “I”.
But there is a lot more to it than that! It is important to take reasonable steps to ensure communication works effectively. Although people often assume effective communication will occur organically, it really requires effort, and this is especially the case for online collaboration and competitions. Respectful communication is paramount, and respect and trust in a team should start at a mutual non-zero level and continue to be earned. A strong collaboration will include people with varying skillsets, cultural backgrounds, and social perspectives, so it’s imperative that at the mutual non-zero respectful level, everyone understands, acknowledges and respects what the others bring to the table. A respectful and trusting team ensures that everyone is involved, and its members recognize that an individual’s input will vary over the duration of the project.
It is important to leverage tools to strengthen collaboration and team communication. Tools like mailing lists, team wiki pages, and GitHub repos are freely available and worth considering. Tom Wujec's TED talk, “Build a tower, build a team,” may be useful for thinking about your team interaction.
Another aspect of communication involves explaining your results during the competition, and noting how you came to them. These aspects are vital for credibility with peers, with judges, and in general.
3. Data, data, data
Chances are that even if the datasets required to solve a problem are easily identified and the data within them is understood, the data will not be in a preferred format, so all sorts of cleaning and tidying up will be necessary before analysis. Noting the exact steps needed to produce the cleaned dataset ahead of analysis is vital to the credibility of the final results. It is useful to create a list of instructions for this step (a cookbook), accompanied by an explanation of the cleaned dataset (a codebook). During the competition, the BDBDUG will provide some data and training on these data, including what the variables are and how they are commonly used.
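As a rough sketch of what a cookbook and codebook can look like when kept next to the cleaning code, consider the example below. The raw column names, the -9999 missing-value flag, and the station values are hypothetical and are not taken from the competition data.

```python
import csv
import io

# Hypothetical raw extract; the layout and the -9999 flag are assumptions.
RAW_CSV = """Station ID,Date,TempF
A1,2015-01-01,50.0
A1,2015-01-02,-9999
B2,2015-01-01,41.0
B2,2015-01-02,59.0
"""

MISSING_FLAG = -9999.0

# Cookbook: the exact steps, in order, that turn the raw file into the clean one.
COOKBOOK = [
    "1. Rename columns to station_id, date, temp_c.",
    "2. Replace the missing-value flag (-9999) with None.",
    "3. Convert temperatures from Fahrenheit to Celsius, rounded to 2 decimals.",
]

# Codebook: what each variable in the cleaned dataset means.
CODEBOOK = {
    "station_id": "Station identifier (text).",
    "date": "Observation date as YYYY-MM-DD (text).",
    "temp_c": "Air temperature in degrees Celsius; None if missing.",
}


def clean_records(raw_text):
    """Apply the cookbook steps to raw CSV text and return a list of dicts."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        temp_f = float(row["TempF"])
        if temp_f == MISSING_FLAG:
            temp_c = None
        else:
            temp_c = round((temp_f - 32.0) * 5.0 / 9.0, 2)
        cleaned.append(
            {"station_id": row["Station ID"], "date": row["Date"], "temp_c": temp_c}
        )
    return cleaned


records = clean_records(RAW_CSV)
for record in records:
    print(record)
```

Keeping the cookbook and codebook in the same file as the cleaning function means they are updated together whenever a step changes.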
If you find you require more data to address your problem, some climate datasets that may be of interest for the 22 regions outlined in this competition can be found on the Coordinated Regional Climate Downscaling Experiment (CORDEX) website. CORDEX is a World Climate Research Programme (WCRP) initiative that organizes regions to conduct regional climate change projections at similar temporal and spatial resolutions in an effort to facilitate impact and adaptation studies. The World Meteorological Organization database also provides station data for all countries. On that note, another useful resource when dealing with geospatial data is a regridding tool, which facilitates the comparison of geospatial datasets by placing them on common spatial and temporal grids, e.g., the griddata function in SciPy's interpolate module.
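The regridding idea can be sketched with scipy.interpolate.griddata as follows. The station locations and temperatures are synthetic, the helper name regrid_to_common_grid is hypothetical, and longitude and latitude are treated as flat planar coordinates, which is a simplification that becomes less reasonable over large regions or near the poles.

```python
import numpy as np
from scipy.interpolate import griddata


def regrid_to_common_grid(lons, lats, values, target_lons, target_lats, method="linear"):
    """Interpolate scattered (lon, lat) observations onto a regular grid.

    Returns an array of shape (len(target_lats), len(target_lons)).
    Grid points outside the convex hull of the observations come back as NaN.
    """
    points = np.column_stack([lons, lats])
    grid_lon, grid_lat = np.meshgrid(target_lons, target_lats)
    return griddata(points, values, (grid_lon, grid_lat), method=method)


# Synthetic "station" data: a temperature field that varies linearly in space.
rng = np.random.default_rng(0)
station_lons = rng.uniform(0.0, 10.0, 200)
station_lats = rng.uniform(0.0, 10.0, 200)
station_temps = 20.0 + 0.5 * station_lons - 0.3 * station_lats

# Common target grid shared by every dataset you want to compare.
target_lons = np.linspace(2.0, 8.0, 7)
target_lats = np.linspace(2.0, 8.0, 5)

regridded = regrid_to_common_grid(
    station_lons, station_lats, station_temps, target_lons, target_lats
)
print(regridded.shape)
```

Once two datasets have been placed on the same target grid, they can be compared cell by cell, for example by subtracting one array from the other.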
In the late 2000s, reproducibility became a major concern in academia as published results proved hard to reproduce and analyses hard to replicate, which damaged the credibility of those results. Many resources are now available to help ensure that results are reproducible, such as writing codebooks, or using systems like IPython Notebook or RWordPress together with markup languages such as Markdown to document and complete an end-to-end analysis. Sometimes an analysis contains many moving parts and requires build tools (for example, Apache Ant or Apache Maven for Java applications) to ensure similar environments and so facilitate reproducibility.
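One small habit that supports reproducibility is to fix any random seed and to store a description of the environment alongside each result. The sketch below assumes a toy analysis (the mean of simulated temperatures); the function names and the seed value are hypothetical.

```python
import json
import platform
import random


def run_analysis(seed):
    """Toy analysis: mean of 100 simulated temperatures, driven by a fixed seed."""
    rng = random.Random(seed)
    sample = [rng.gauss(15.0, 2.0) for _ in range(100)]
    return sum(sample) / len(sample)


def provenance(seed):
    """Collect the details someone else would need to rerun the analysis."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "seed": seed,
    }


SEED = 2015
record = {"result": run_analysis(SEED), "provenance": provenance(SEED)}
print(json.dumps(record, indent=2))
```

Rerunning with the same seed on the same Python version should give the same result, and the saved record tells collaborators which settings produced it.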
“Science, my boy, is made up of mistakes, but they are mistakes which it is useful to make, because they lead little by little to the truth.” – Jules Verne
Multiple submissions are allowed in the first phase of the competition; however, only the last submission will be used for point-allocation purposes. The final allocation of points for the overall competition will be the sum of the points from the two phases, giving a cumulative score that competitors can use to compare their skills with those of other people and teams in their region and worldwide. The final rankings will be made public and can be used on websites such as LinkedIn to demonstrate the competitors’ performance!
For more information on the Global Data Competition 2015, visit www.Global-Data-Competition.com or email Contact@Global-Data-Competition.com. You can also join the conversation on Twitter! Send a Tweet to @GlobalDataComp or @BigDataUtah, and please include the hashtags #BDBDUG, #UtahGeekEvents and #BigDataUtah.