I am not sure about other data scientists’ experiences in trying to explain Big Data and Data Science to family members, but I find that there is a lot of interest coupled with a lot of confusion. That’s not too surprising since there are many experts in the field who are also confounded by the mixed messages, the hype, the uncertain meanings of terminology, and where all of this is going. Nevertheless, we try our best to inform, inspire, and include everyone in this adventure. Consequently, I have learned to find a connection between my big data science vocation and the life experiences of my listeners. Families are good listeners, but they also know you better than anyone else knows you, so they can see when you are exaggerating something for effect. So, in the interest of data scientists’ families everywhere, here is my serving of Big Data, Family Style, without the hype (…okay, with minimal hype). (Note: the names of the following family members and the listed work areas are real, but their relationship with the author has not been identified, just in case any of them actually read this, and each one’s associated work area indicates either a past job, or their current job, or their academic major in college.)
The rich applications of big data in genomic science are very visible in the treatment of genetic diseases that afflict the aging population. Specifically, searches for causes, treatments, and (hopefully) cures for Alzheimer’s disease and dementia have become major areas of data-intensive genomic sciences research. Personal genetics and the corresponding personal gene sequence data are creating a vast treasure trove for discovery science. An especially interesting characteristic of genomic “big data” is not so much the volume as the variety (i.e., the complexity and high dimensionality of the data). More so than data in most other big data sciences, genomic data have an enormously high knowledge-to-bytes ratio. Both the ordering of the genes and the local co-occurrence of genes in the sequence (which is a combinatorial explosion like no other) encode the story of a person’s past, present, and future: their heritage and ancestry, their current health conditions and medical responsiveness to different drug therapies, and their potential susceptibility to future diseases. In the field of gerontology, practitioners and researchers aim to improve the quality of life (not just the quantity of life) for aging persons. Genomic sequencing for individuals is becoming more and more affordable, and it is reasonable to assume that such data scans will become commonplace in the diagnosis, treatment, and care of all of us.
The explosion of interest in healthcare analytics demonstrates one of the greatest opportunities for big data to affect people’s lives through all of life’s stages. The big data and analytics use cases include curating and mining EHRs and EMRs (electronic health and medical records), medical taxonomies and ontologies (e.g., the ICD-10 medical coding standard), patient modeling (for predictive diagnoses and prescriptive treatments), predictive models to determine the necessity of hospitalization, insurance fraud detection and “do not pay” programs (including Medicare & Medicaid), and much more. In many patient encounters, it is the first responder and/or nurse who obtains the first data points about a patient, who correlates that information with past medical history and medications, who assigns the initial medical coding for the event, and who then decides on the triage and correct treatment at that point. Consequently, nursing informatics is becoming one of the hottest health IT specialties, and to fill that need, graduate programs are now appearing across the country.
Ever since there have been classrooms with teachers and students, data have been collected on student enrollment, family information, absences, homework scores, exam grades, standardized testing results, etc. With daily (if not minute-by-minute) technology-based assessments in standard classrooms (as well as MOOCs and other online courses), it is now possible to track everything that Johnny and Jane are doing (clicking on, reading, re-reading, pausing on, or moving quickly past), how they are performing, where their weaknesses and strengths lie, what they are struggling with, where they are bored, where they are engaged, and which recommended interventions are most likely to enhance their learning experience. Schools are using such data to inform decisions at the district, school, classroom, instructor, and student level. The ability to intercept a student who is on a poor learning trajectory, and then redirect that trajectory onto a positive track, is an essential skill for teachers and school administrators (and even for parents). Big data can significantly and positively inform that task for all stakeholders in a student’s learning environment. Not only can teachers intervene with remedial students to avoid poor outcomes, but they can also intervene with G&T (gifted and talented) students to avoid boredom, disengagement, and poor outcomes! Institutions of higher education are applying student data from first contact (recruitment) through the rest of the student’s life (alumni engagement). A collection of data analysis tools and resources for learning analytics is available at the Pittsburgh Science of Learning Center DataShop, and additional information can be obtained from the Brookings Institution.
So many occupations, so much big data, and so little time
As in all families, there is a wonderful diversity of career paths and occupations. We cannot do justice to so many. Consequently, so that other family members don’t feel left out, we include a few more here before we wrap up, but we only provide links to relevant articles (without discussion) to help initiate and inform a discussion of big data analytics in those professions: