How to Get Real Value from AI and Machine Learning


Speakers:

Ellen Friedman, PhD

Principal Technologist, MapR Technologies


Just because there's a lot of hype around AI and machine learning doesn't mean there isn't also a huge potential value. But can you really put these approaches to work to your advantage?

Getting real value from AI and machine learning takes more than just using or building a powerful algorithm. In fact, success with these systems goes beyond the actions of data scientists, although they are crucial. Data engineers, system architects, business leads, and executives such as CDOs and CIOs - even CEOs - all make decisions that determine whether or not you'll get real benefits from AI and machine learning.

In this webinar we'll explore what makes AI and machine learning successful in real world settings, including:

  • How to recognize where AI and machine learning solutions can pay off
  • What having the right data (and making it available) really means
  • How to connect the insights and automated decisions of these machine-based systems to real business goals

Transcript

Ellen Friedman: 00:01 Good morning. Do you see the slides all right, and is the audio okay for everyone?

David: 00:12 Yeah.

Ellen Friedman: 00:13 Cool. Thanks very much. Thank you all for coming today. I'm going to talk to you about how to get real value from machine learning, from AI. I'm sure you're hearing a lot about these systems, but the question is, can they really pay off? And if so, how do you make that happen? This is my contact information, and I'll repeat that at the end. I'm here at MapR as a principal technologist. I'm also a committer of a couple of open source Apache projects and I've written a number of short books, mostly for O'Reilly Media. You can follow me on Twitter, and we do periodically have announcements of other books and events and that's a good place to hear about them.

Ellen Friedman: 01:03 Now, I think I'm going to start with the first question of AI and machine learning; Is it really just hype? And we certainly see examples of machine learning in so many different situations from really sophisticated deep learning for image analysis, manufacturing optimization, steps that are automating business processes, even simple things that are very powerful in business such as retail recommendations, we certainly see it everywhere. The question is, is it just hype or is it really paying off? And so as we see AI all over the place, we address the question of, is it just hype?

Ellen Friedman: 01:43 And the answer is, yes, there is a lot of hype. The hype is real, but don't let that fool you, the value is also very real, especially if you know how to approach these problems. And I think one thing that surprises people is that some of the most important issues that are key to getting real value from building these systems are not about the mathematics, the sophistication of the model or the algorithm itself, although those things are very important and absolutely essential. But if you look at the whole picture of what results in success, the machine learning piece itself, the learning piece, the model, the algorithm, is actually a very small part of the picture.

Ellen Friedman: 02:25 There are a number of other things that are absolutely essential that are technical, that have to do with design, that have to do with human decision and input, that really determine whether these systems are just an interesting experiment or if they really bring value. And so it's those things that you can address that are outside of the model and the algorithm that I want to talk to you about today. Now, to get real value from machine learning and AI systems, there are two broad areas to pay attention to. First of all, there has to be a really good fit between the system you're building and practical business goals.

Ellen Friedman: 03:04 Now, that sounds simple, but it involves a lot of different things from how you frame the question, what you're going to do about it, whether the systems you build are realistic in terms of, do their SLAs actually fit the business situation? Is the system, the technical system you're working with and the design that you have sufficient to give you the performance you need and scale to actually meet those SLAs? So there's a great deal about that that has to be a good fit. Otherwise, again, these systems can be built but they're not really practical.

Ellen Friedman: 03:39 Another big area that needs to be met has to do with flexibility, and this happens at so many different levels, but flexibility's important, not only to get performance, but especially because AI and machine learning systems tend to be iterative, and even after they're in production, they need to be responsive to real world situations. Even when you build a very successful system and have it in production, I guarantee the world will change. Things don't stay static. These are systems that need to be able to respond in an agile and flexible way to the world as it changes, to changing data, to changes in, say, customer behavior, to changes in the equipment behavior.

Ellen Friedman: 04:28 They need to be flexible and responsive, and to do that in a timely manner. And all of this has to be able to be done with a reasonable human effort, not only in terms of the data scientists, their level of sophistication, the number of data scientists that you have. But what's around that? The data engineers who handle this, the people who are administering the system, all of it has to be done at a level that is practical and cost effective, otherwise, you won't be getting real value from these systems. So with those two big areas in mind, let's dive into a few of the specifics.

Ellen Friedman: 05:04 We have time to just touch on a few today to say, what are some practical things that you can do that will help ensure that the systems you are building do actually bring value? And the first one I want to look at is, are you asking the right question? So we'll start with a really simple example. This isn't even machine learning, but I think it makes the point. You may think that you're asking the correct question, but it's not going to give you the value you want. So just to illustrate the point, consider, say, an email campaign for a marketing campaign. You've produced three different emails, they're shown here as A, B, C and represented by different colors, and you're testing them out.

Ellen Friedman: 05:46 Now you want to ask the question, which of these campaigns is the most effective? Which is the most successful? You maybe do that by looking at the number of clicks you have. So if you assess it at a particular time, at this moment in time, you look to see which of these three campaigns has more clicks. It looks pretty easily like C or the one in blue, is definitely the leading one, the better choice. It turns out this is actually very misleading, if you really think about the value of what you're trying to get out of this campaign. So let's look at this same question slightly differently, looking at the data slightly differently.

Ellen Friedman: 06:23 If you think of each campaign and its performance over time, maybe over a week or two of time rather than at just time T now, you see a different view of the behavior, and you see that each of these has a spike at certain periods of time after it's launched. And maybe what you're looking for is what's the best performer at its spike, not the best one that's behaving right now. They may have been launched at different times. And so when you have data where you can actually look at this as a trend over a short period of time, in the lower part of that graph, I've aligned those just to make it more clear. Suddenly, the blue one no longer looks like the best performer.

Ellen Friedman: 07:10 You can see that the green one is really a better performer. All we've done here is slightly changed the nature of the question, how the question is formed, and so measuring performance at a constant time after launch, in this case, gives a more consistent comparison of the behavior of these three campaigns. But also, keep in mind you wouldn't be able to do that if you had overwritten old data with current data. You can only do it if you actually save data for that period of time. And so it's very important to ask the right question, but it's also important to save data that may not be initially what you think is the data that you're going to want to address. Sometimes you have to adjust your question later.
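
The difference between the two questions can be sketched in a few lines of Python. This is only an illustration: the click log, the campaign launch days, and the function names `clicks_on_day` and `clicks_at_offset` are all hypothetical, and the numbers are made up rather than taken from the slides.

```python
from collections import defaultdict

# Hypothetical click log: (campaign, launch_day, click_day). Illustrative only.
CLICKS = [
    ("A", 0, 1), ("A", 0, 1), ("A", 0, 2),
    ("B", 3, 4), ("B", 3, 4), ("B", 3, 4), ("B", 3, 4), ("B", 3, 5),
    ("C", 6, 7), ("C", 6, 7), ("C", 6, 7),
]


def clicks_on_day(clicks, day):
    """Snapshot view: clicks each campaign received on one calendar day."""
    counts = defaultdict(int)
    for campaign, _launch, click_day in clicks:
        if click_day == day:
            counts[campaign] += 1
    return dict(counts)


def clicks_at_offset(clicks, offset):
    """Aligned view: clicks each campaign received `offset` days after its own launch."""
    counts = defaultdict(int)
    for campaign, launch, click_day in clicks:
        if click_day - launch == offset:
            counts[campaign] += 1
    return dict(counts)


# "Which campaign has the most clicks right now?"
snapshot = clicks_on_day(CLICKS, day=7)
# "Which campaign did best one day after its own launch?"
aligned = clicks_at_offset(CLICKS, offset=1)

print("snapshot winner:", max(snapshot, key=snapshot.get))
print("aligned winner:", max(aligned, key=aligned.get))
```

The aligned comparison is only possible because every click keeps its own timestamp; a table holding nothing but the current totals could not answer the second question.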

Ellen Friedman: 07:54 So using that real but very simple example, let's move on to some actual machine learning situations and see how the same kinds of ideas apply. Now, one thing that really makes a big difference if you're asking the right question is domain knowledge. It's not just the person who builds the models, who knows about the algorithm, and not even just the data engineers who deliver the models and the data to these systems, but it's also people who actually understand the situation, and those may be different people. So understanding the business, understanding the research situation makes a big difference for addressing the right question. This is a really simple example, but this is a real machine learning example that made a lot of difference for a company.

Ellen Friedman: 08:42 This is a company that was doing, many years ago, doing online video streaming. And initially, to build a powerful recommendation system for customers using that service, a recommender was built, and the recommender actually worked well in that it was doing what it was built to do, but it gave very poor performance in terms of the results with customers. And what happened is, in this recommender they were trying to understand what customer preferences were for videos. That's a good approach. They didn't look at ratings, they actually looked at what customers watched. That's also a very good approach for building a recommender.

Ellen Friedman: 09:25 But in this case, how do you tell which videos the customer watched? Well, they used the titles. What they found is that when you build a recommender based on the titles of the videos and using clicks as input data, the model was working well, but it's actually testing the wrong performance. It's showing what titles people are attracted to, not necessarily what videos they wanted to watch, and so by something as simple as adjusting this system to represent preference now by the first few seconds of viewing a video, as opposed to just the title of the video, suddenly the recommender, that same model, its performance became very good and returned real value for the business.
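
A minimal sketch of that change in the input signal, assuming a hypothetical event log of `(user, video, seconds_watched)` tuples. The 30-second cutoff in `MIN_SECONDS` is an assumed value for illustration, not one given in the talk, and the recommender model itself is left out because only the preference signal changes.

```python
# Hypothetical viewing events: (user, video, seconds_watched). Illustrative only.
EVENTS = [
    ("u1", "v1", 2),    # clicked an attractive title, left almost immediately
    ("u1", "v2", 240),
    ("u2", "v1", 1),
    ("u2", "v3", 95),
]

MIN_SECONDS = 30  # assumed cutoff for "kept watching past the opening"


def clicked(events):
    """Original signal: every click on a title counts as a preference."""
    return {(user, video) for user, video, _seconds in events}


def actually_watched(events, min_seconds=MIN_SECONDS):
    """Adjusted signal: only count videos the user kept watching."""
    return {(user, video) for user, video, seconds in events if seconds >= min_seconds}


print(sorted(clicked(EVENTS)))
print(sorted(actually_watched(EVENTS)))
```

Either set can be fed to the same recommender; what changes is which question the training data answers.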

Ellen Friedman: 10:10 And so this is a simple question of just adjusting the question that you're asking and the importance of actually having the data to be able to address that question, that can really pay off. Let's look at another simple example from machine learning that has to do with asking the right question. In this example, the issue is not how sophisticated the model is that you built, but it's the sophistication of the person. In this case, it was a data scientist who looked at the business and recognized the right place to actually use machine learning to build an automated or a machine driven decision step in this business. So he looked for a bottleneck in the business and found one where they felt that building a model could really pay off in practical terms.

Ellen Friedman: 11:00 In this particular case, this is a large industrial company, they used machine learning, deep learning, AI in a number of different ways. They have a ton of IoT data that's streaming in at very, very high rates. They have some very sophisticated models going on, analyzing the process, what's going on out in the field and so forth. But in this particular example, a data scientist, this is actually a MapR customer and the data scientist works here with MapR's professional services, he did a very clever thing and he talked to them and he found a point in the business process that had to do with accounting where some items, parts or services, had been mislabeled in accounting terms, and this was really costing the company money.

Ellen Friedman: 11:51 In some cases this was revenue that was to be collected, where it had a different status in terms of being an expense or whether it was taxable. Anyway, there were a number of different examples and there were enough that it added up to a considerable loss, a considerable cost, but for a human to go in and find each of those examples, there were just too many bits of data. There are too many examples, most of which were correctly labeled, so it's like finding a needle in a haystack. So the data scientist recognized that this is actually a very simple step to automate with machine learning, and it's not to build a system where a machine learning system must label these correctly, that would be difficult and even might be hard to do in terms of regulation.

Ellen Friedman: 12:38 And it wasn't even to build a machine learning system where it corrected mistakes. Again, that requires a level of sophistication and it gets into some regulatory issues. But it was simply to go in and make a target rich collection of examples that were likely to have been mislabeled, and then humans could go in and actually correct those. And so it made it feasible and cost effective by enriching a group of examples for humans to look at, the model itself was an incredibly simple classification model, and the result was really tens of millions of dollars for the customer. This system was very quick to build.
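
The "target rich collection" idea can be sketched as follows. This is not the customer's model, which the talk describes only as a very simple classifier; here a per-item-type majority vote stands in for it, and the ledger rows, labels, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical ledger rows: (record_id, item_type, accounting_label). Illustrative only.
RECORDS = [
    (1, "valve", "taxable"), (2, "valve", "taxable"), (3, "valve", "taxable"),
    (4, "valve", "exempt"),
    (5, "field_service", "exempt"), (6, "field_service", "exempt"),
    (7, "field_service", "taxable"), (8, "field_service", "exempt"),
]


def fit_majority_labels(records):
    """'Train' a stand-in classifier: the most common label for each item type."""
    votes = defaultdict(Counter)
    for _record_id, item_type, label in records:
        votes[item_type][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def flag_for_review(records, model):
    """Return ids whose label disagrees with the prediction.

    Nothing is corrected automatically; the output is a short queue for humans.
    """
    return [
        record_id
        for record_id, item_type, label in records
        if model.get(item_type) != label
    ]


model = fit_majority_labels(RECORDS)
review_queue = flag_for_review(RECORDS, model)
print(review_queue)
```

Because most rows are labeled correctly, the existing ledger serves as training data, and rows that reviewers confirm as mislabeled can be fed back in to refine the model later.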

Ellen Friedman: 13:22 Another nice aspect of this system in terms of machine learning is that machine learning model training is an iterative process. You don't just build the model, test it, it's good enough, you deploy it. You'll be building many, many models. Models are tested, they're adjusted, they're evaluated, they're retrained, they're retested, and as times change, they may be, again, tested or trained again. In this particular situation, because most of the examples were actually correctly labeled, it meant that this customer had the data that they needed for training. They had a huge dataset. And often, the amount of data you need for training is much bigger than the amount of data that you'll need later as input just to successfully run the model once it's in production.

Ellen Friedman: 14:15 So most of their examples were correct. That's what had caused the problem, but it turns out with machine learning, that's a plus, it meant they had a lot of correct training data, and as the system was built and continued to run and they started to find more and more of these that were mislabeled, they could go back and use that as additional training data to further refine the model, make it even more accurate. So this is a simple system, the value and the cleverness came in recognizing where machine learning could be applied to make a big difference with the business.

Ellen Friedman: 14:51 Real value from AI depends not only on asking a practical question and a question with a good fit, but having the right data to answer it. And we see that in a number of different situations. For example, if you're doing fraud detection, and maybe it could be in various kinds of transactions or on a user's site, a website. The fraud may occur from different users. The fraud will occur at different times. And you have an issue when the fraud is detected, the faster you act after it's detected, obviously, the better it is or the less risk or loss. So that's a good thing in itself. You may build a system that can detect behavior that looks potentially fraudulent and be able to take an action.

Ellen Friedman: 15:36 But it's also possible in some systems to actually look at an earlier point in time where there's some signal behavior that would tell you that fraud might be about to occur, when it could have been prevented. And so you may then build a different system to recognize those signatures. And in that situation, that only works if you have actually saved that record of behavior over time, rather than just the single point in time when fraud was detected. So once again, it really matters to have the right data that hasn't been overwritten or corrected or pulled out for certain feature extraction for particular projects.

Ellen Friedman: 16:15 Often, you want to save raw data if possible, because the features that you need for one project may be quite different than the features you need for another. Predictive maintenance in large industrial settings is another example that works much in the way of the simpler example I just showed you. You'll be watching a stream of measurements, data coming in from various IoT sensors. Maybe you see some indicator, it hits an alert just before there's going to be some sort of catastrophic failure in a part or a system. That's important, but what is more valuable is if you can look back at the history of what was happening for those various parameters, data from those various sensors, as a long-term replayable log.

Ellen Friedman: 17:03 When there has been a failure or a disorder, if you can look earlier, days earlier or weeks earlier, you may be able, using machine learning modeling, to recognize some failure signature that suggests that there's going to be a breakdown and to detect that early enough that you actually have time to go in, locate those parts, and do maintenance ahead of a failure. And so for that, you need a kind of static history of parts and maintenance, maybe even in a database, and you need these long-term replayable logs from IoT data. And so it's this combination of asking the right question and making sure you still have the right data. Now, data is only valuable if you have convenient access.
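
As a rough sketch of "looking earlier", the snippet below joins a failure record against a retained sensor log and pulls out the readings from the days before the failure. The part ids, the vibration values, and the seven-day lookback are all hypothetical; a real system would replay a stream or log rather than scan an in-memory list.

```python
# Hypothetical retained sensor log: (day, part_id, vibration). Illustrative only.
READINGS = [
    (1, "pump-7", 0.9), (4, "pump-7", 1.0), (7, "pump-7", 1.6),
    (9, "pump-7", 2.4), (10, "pump-7", 5.0),
    (9, "pump-2", 1.0),
]

# Hypothetical maintenance history: part_id -> day the part failed.
FAILURES = {"pump-7": 10}


def window_before_failure(readings, part_id, failure_day, lookback_days):
    """Readings for one part in the `lookback_days` days leading up to its failure."""
    start = failure_day - lookback_days
    return [
        (day, value)
        for day, part, value in readings
        if part == part_id and start <= day < failure_day
    ]


# Each window is a candidate training example of a "failure signature".
signatures = {
    part: window_before_failure(READINGS, part, day, lookback_days=7)
    for part, day in FAILURES.items()
}
print(signatures)
```

If only the latest reading per part had been kept, the window would be empty and there would be nothing to learn the signature from.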

Ellen Friedman: 17:47 We need it so that you can do this in a way that's not cumbersome, you can have good performance and make these systems practical. This is perhaps one of the most important issues for people building machine learning and AI systems. They want to be able to use a variety of different modeling tools, some of them are drag and drop, many others involve a lot more customization and sophistication. This might be done on premises, it might be done in cloud, but the principle is the same; people should be able to use a variety of different machine learning tools. Those machine learning tools often read data, say from a file in POSIX, but that data, especially at large scale, is coming in from many different data sources.

Ellen Friedman: 18:35 There will have had to be data prep, ETL, in various ways as the data's brought in and stored. Often, that's done with tools such as Apache Spark, Apache Hive, and they're writing output to HDFS. Those machine learning tools don't tend to read from HDFS as an API, and so in many systems, you actually have to have, in some large data science product or workbench, a big part of it is they have the capability to keep copying data, large scale data, out of an HDFS system into some other system where machine learning tools can actually access it.

Ellen Friedman: 19:17 One of the great advantages, one of the strengths of working with the MapR data platform is that MapR, which is dataware, addresses that directly. MapR is a system where these machine learning and AI tools can read data directly from the storage, even from this output stored in the form of HDFS. This is a fully read/write POSIX compliant file system, and it makes it possible to use these tools directly without having to copy out first. That's not only a performance advantage, it reduces time, it reduces cost. It also gives people much more flexibility, reduces the chance for error, and makes these systems, again, much more likely to be successful. So this is a huge difference.

Ellen Friedman: 20:09 By the way, if you're not familiar with the term dataware, it's kind of a new term, a new concept. Dataware refers to data storage, but it goes far beyond just storage. It's a system that can orchestrate what happens with data in terms of access, control, management, replication, data movement. And to do this for a variety of different data structures, files, tables, streams, to do this all within a single system rather than separate components that are trying to communicate or work together. MapR is dataware, it's a great example of dataware, and you'll see how this can pay off in other ways for a number of systems, but especially in building large AI and machine learning systems.

Ellen Friedman: 20:56 Now, I think it was last year, I wrote a blog looking at which machine learning tool is best. There are a number of popular ones, and the fact is we looked at what a number of different customers were doing, large customers who are very sophisticated and using machine learning effectively in their systems. We found that they do tend to keep a kind of toolkit of their favorite tools, but all of them had multiple tools that they like to use, usually five, six, seven, in some cases as many as 10 or 12 different machine learning tools. Things like TensorFlow, H2O is a great one, some people like MXNet.

Ellen Friedman: 21:35 The point is they need that flexibility to use different tools, and it isn't just different personal preferences among the data scientists, but it's that no single good tool fits every situation, and a lot of valuable machine learning is done by trial and error. So in different situations, you try modeling with different tools and you want to be able to use the one that fits in that situation. So again, having something like the MapR dataware that gives you direct data access lets you use all of these tools directly, lets you use tools that haven't been invented yet, the next one coming down the pipe. This is really a huge advantage in building these systems.

Ellen Friedman: 22:18 Now, another issue I've alluded to earlier is the logistics of dealing with huge amounts of data, especially the training steps, building many, many models, being able to roll those out into production, easily roll them back, roll out new models as needed so you have that kind of responsiveness to the environment. This is another place that dataware can make a big difference. It's another place where your design, the framework that you use to do this, to manage models, can make a tremendous difference in the success of these systems. This example is looking at a framework of design called Rendezvous Architecture that I've written about with MapR's chief application architect Ted Dunning.
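
The roll-out and roll-back requirement can be illustrated independently of any particular framework. The sketch below is not the Rendezvous Architecture; it is a deliberately tiny, hypothetical `ModelRegistry` that only shows why keeping earlier versions around makes rolling back cheap.

```python
class ModelRegistry:
    """Hypothetical in-memory registry of deployed model versions."""

    def __init__(self):
        self._history = []  # deployed versions, oldest first

    def deploy(self, version):
        """Roll a new model version into production."""
        self._history.append(version)

    def rollback(self):
        """Retire the live version and return it; the previous one becomes live."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        return self._history.pop()

    @property
    def live(self):
        """The version currently serving, or None if nothing is deployed."""
        return self._history[-1] if self._history else None


registry = ModelRegistry()
registry.deploy("fraud-model-v1")
registry.deploy("fraud-model-v2")
retired = registry.rollback()  # v2 misbehaves in production, so step back
print(registry.live, retired)
```

A production framework would also keep the model artifacts, their training data references, and evaluation results, which is where the data platform comes in.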

Ellen Friedman: 23:04 This particular framework is actually built on a streaming microservices approach, but you don't have to do it in this way. There are many different ways to do this. Other people are building different frameworks. Another really popular one that has a lot of potential comes from Google, it's called Kubeflow and it is a microservices approach, a more traditional microservices approach. And like this Rendezvous Architecture, it takes advantage of containerization of applications for added flexibility. And so that's another key part of keeping these systems flexible and convenient and responsive. We like particularly the software, the framework called Kubernetes to orchestrate the containerized applications.

Ellen Friedman: 24:01 And you need, it's something in parallel to that, you basically need dataware to orchestrate the data. In this slide, you can see, think about this as a series of applications that are interacting with each other. You don't want to store state, you don't want to persist data in containers, it'll defeat the purpose of some of the flexibility of containers. You need a system, you need a platform for data persistence. MapR is designed to do this, and not just for data files, it also can handle streams and database log, however you want to store data from these containerized applications to keep the containers wide and true, so that you can run whatever kind of application you want into container, not just stateless applications.

Ellen Friedman: 24:51 This particular example is just a reminder that a first application might be using input from a file stored in the MapR data platform, its output might go to a stream. That data might then be used by a different consumer or different applications. These things all interconnect, so you need both orchestration of the containerized applications. In this case, we think Kubernetes is a great choice and a lot of people are using it, but you also need orchestration of the data in parallel to that and that's where the MapR Data comes in.

Ellen Friedman: 25:29 MapR Data is really like having Kubernetes, but for data. These combinations, these systems make a huge difference for getting large scale applications into production, not just for machine learning and AI, but any kind of system, but they're particularly important in the machine learning area. The right dataware such as MapR Data Platform can make a big difference because there are a number of data management issues that should not be handled, should not have to be handled at the application level.

Ellen Friedman: 26:07 These things can be pushed down to the data platform, it makes them more efficient. It separates the concerns of the data scientists, the developers from the concerns of IT and system administration, and it makes them both much more efficient. So having the right data, the right dataware sets you up not only for the current project to work efficiently, but what's really important, especially for AI and machine learning systems is, it sets you up for that next project so that you have a lower entry cost. You already have sunk cost, you've built your system, you have the data, you have the systems, the data available.

Ellen Friedman: 26:48 This is important, particularly for AI and machine learning because these projects are really speculative, they often are speculative and that means that there's a higher risk in terms of until you try it, you don't know what's going to happen. You don't know if it will give you value, but where they work the potential value is tremendous. And so if you can lower that entry cost to be able to try new things, to be able to take advantage of new data and new situations as they arise, it makes it much more feasible to put these AI machine learning systems into play in real settings and have them pay off.

Ellen Friedman: 27:28 The last thing to think about is even if your system is working right, you have the right question, you have the right data, you built a great model or many models, you have it in production, everything is working, it still doesn't really give you value for your business unless you have a way to take action based on the output of these systems. And I remind you that building a report is not in itself an action, that's not what I mean by an action. You need a way to actually, either automated or through human action, take the output, the insights that you draw from these machine based decisions, these machine based systems, and put that into effect toward real business goals in a practical way, and a timely way, and a cost effective way.

Ellen Friedman: 28:17 Here's a real world example of that. There's a large service provider, they use machine learning again in a number of different places, but in this particular case, they use machine learning to improve their estimations of ad volume. They have to hopefully fairly correctly estimate the amount of ad volume they're going to have. If they miss it, the errors cost this company real money, and so the tighter and more accurate those estimations are, the better their business is. When they switched to doing this through machine learning models and predictive analytics, they were able to get much more accurate estimations of ad volume, and ad volume changes over time.

Ellen Friedman: 29:03 That meant that they could take an action, which is to adjust their contracts based on these accurate predictions and to do that in a timely manner. The business outcome is that this greatly improved their revenue, and often improved their customer satisfaction. So the key thing here is not just that they built the system and the model, but it was tied directly into an impact and effect, an action that can be taken at the business level. Remember, as cool as AI is, it isn't magic, and so if you're going to get real value, you have to ask the right questions, you have to have the right data, you should have the right dataware to handle that and you need to be able to take action.

Ellen Friedman: 29:50 Thank you very much for joining us today. Some of this content is discussed in the recent book that I did with Ted Dunning, it's called AI and Analytics in Production. MapR makes this available as a free download of a PDF, and so you'll have the link to be able to do that in these slides. Another new book that you may find useful that MapR's making available is called Getting Started With Apache Spark 2.x. This was written by Carol McDonald. Many of you will know her from her blogs and presentations. There is also a contribution from Ian Downard, both Carol and Ian are engineers here with MapR. And again, here is a link to be able to download a free PDF for that new book.

Ellen Friedman: 30:39 Please continue to help support women in all diversity and technology. This isn't just a good for London, but it's good for society and from one woman in technology, I say thank you very much for joining us today. Now, there may be some questions that we can go to and I'll repeat the contact information right at the end. Thank you.

David: 31:02 Just a reminder, to ask a question, you can insert it into the chat box in the lower left hand corner of your browser. Ellen, I think I'll let you take a look; there are a couple of questions that have come in.

Ellen Friedman: 31:21 Okay. One question I see is a question of how large of a data science team is needed. What's a general guideline? And it's a very good question. The answer is that it will vary a lot, sometimes literally one data scientist may be all that you need, but remember, that's not your whole team that's building the ML or AI system, because you really do need them working in close contact and coordination with data engineers, with people who have domain knowledge, both to understand the data and to be able to address those business-based actions at the end.

Ellen Friedman: 32:05 We have talked in other settings about using a DataOps design or DataOps approach to your human resources, and this is an approach that is basically an extension of DevOps that has the flexibility and focus of that approach, but it extends it to people with data skills, data scientists, data engineers. The idea is that they're all focused on a single target, they're all part of the same project, so it cuts across basically skill guilds. It improves communication between groups and it actually makes a big difference in the outcome. More often, and it is useful because some systems are very complicated and very sophisticated, larger companies especially may have teams of data scientists working on different projects.

Ellen Friedman: 32:55 Even in smaller situations, it can be useful to have more than one person, partly because it's good to have that interaction between the data scientists thinking about the modeling step itself and thinking about where those can be applied to the business. But it's also useful because as a person is working on building one system, often new opportunities show up, sometimes as a kind of secondary effect of building the first system, and it's good to have additional resources, additional data scientists who can start on that.

Ellen Friedman: 33:32 There's no set amount or size of a team. Small teams can be very, very powerful and very agile. Once again, the importance is to make sure that the team or the skillsets that you have are a good fit for your particular situation. And do keep in mind that people with the data engineering skills, and often there are more of them than the data scientists, are an absolutely critical part of a machine learning or AI system. It isn't just the data scientist or the modeler who's key. David, did you have other questions there? Okay, I see a question here, it's saying, can I comment on how AI and machine learning can be useful for materials scientists, especially for predictive modeling?

Ellen Friedman: 34:41 I can't go into detail about that system, but if you would email me, I think that we could take that up offline and go into more detail on that specific use case and I'll bring in expertise beyond myself, but thank you very much for that question.

David: 34:59 Ellen, there are a lot more questions. Everybody, I know we're a bit over the time that the meeting was scheduled to end, but if you did ask a question and you would like to stay on, we will do our best to get to them. There's a question in here about, is there a white paper or a book on the nuts and bolts of a production implementation?

Ellen Friedman: 35:27 Really, I think this AI and machine learning in production book that I put a link to is a good one. We did an earlier book called Machine Learning Logistics, which goes into a bit more detail about model management and how you can easily roll a model into production and back out. And so we can make those links available. If we have the email of the person, I can send the links directly, or those can be found on the MapR website.

David: 36:04 Great. Another one: how does ETL and data warehousing (DWH) knowledge help in becoming better at developing a new model?

Ellen Friedman: 36:17 That's a very good question. I think that it's important to know that these ETL steps, take that as the example, may be done specifically for a machine learning or AI system that you're building, that you already have in mind, but more often, they are done because people are increasingly recognizing that data has value, especially large scale data in aggregate. And so, obviously, there's some ETL being done just to prepare the data and to get it ingested, but you may not want to do all of that refining for a specific system all in one step.

Ellen Friedman: 37:00 Keeping more data that is closer to being raw, data that's cleaned up but still retains a lot of different features, is very important because different systems go in and use different features. And the nice thing about a large scale system and dataware such as MapR is that it really is practical, it's really feasible to do that at scale, at reasonably low cost and very conveniently. So you can have ETL that you've done to prepare, say, data for widespread use, sort of the data lake or data hub, and then you can do additional ETL to prepare data or begin to extract data to be used by specific machine learning systems.

Ellen Friedman: 37:48 And as those get more and more specialized, the data engineers and data scientists on a particular team will start to refine that training data for their specific purposes. But they don't run into the problem where they go back and find that features have been thrown away. I think to do that effectively at all of those different levels, obviously, you can't know all the different ways data's going to be used now or in the future, and you can't save all data that comes in. Especially in some of these systems, we have some large scale data, actually car manufacturing systems that have IoT data coming from individual cars at a rate of two gigabytes per second for all of these different cars.

Ellen Friedman: 38:32 In some cases, they need to move all of that data to the center for training, especially in early training stages. Later, they want to do some data prep closer to the source and just use the outcome of the partially processed data. Data can hit a scale where you really aren't going to save everything, but I think recognizing that ETL should not wash away all features just for a single project is an important and kind of fundamental shift in how people work with data.
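
The layered ETL idea described in this answer can be illustrated with a minimal Python sketch. It is not taken from the webinar: the records, the field names, and the select_features helper are all hypothetical. It assumes a shared, cleaned-but-broad dataset (the "hub") from which each project extracts only the features it needs, without modifying the hub.

```python
# Hypothetical cleaned-but-broad records, as they might sit in a shared data hub.
# All field names and values are invented for illustration.
HUB_RECORDS = [
    {"car_id": "a1", "speed_kmh": 82.0, "engine_temp_c": 91.5,
     "brake_pressure": 0.2, "gps_zone": "north"},
    {"car_id": "a2", "speed_kmh": None, "engine_temp_c": 104.0,
     "brake_pressure": 0.7, "gps_zone": "south"},
    {"car_id": "a3", "speed_kmh": 60.5, "engine_temp_c": 88.0,
     "brake_pressure": 0.4, "gps_zone": "north"},
]


def select_features(records, features):
    """Project-specific ETL step: keep only the requested features and
    drop rows where any of those features is missing.

    The input records are never modified, so other projects can still
    extract different features from the same hub data later.
    """
    rows = []
    for record in records:
        row = {name: record[name] for name in features}
        if all(value is not None for value in row.values()):
            rows.append(row)
    return rows


# Two hypothetical projects refine the same hub data in different ways.
overheating_training = select_features(
    HUB_RECORDS, ["car_id", "engine_temp_c", "speed_kmh"])
braking_training = select_features(
    HUB_RECORDS, ["car_id", "brake_pressure", "gps_zone"])
```

Because the second project never needs speed_kmh, the record with a missing speed is still usable for it. That is the practical benefit of not refining the shared data for a single project.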

David: 39:05 Okay. So let's see. Got a few more here. Do you see all the questions though?

Ellen Friedman: 39:12 I've seen some. They pop by and then I lose them. I saw one earlier asking, "Do you have to have a PhD to be a data scientist?" And the answer is, like most things, yes and no. People who are actually working directly with data, designing algorithms, doing kind of customized modeling, they don't have to have a PhD, although many of them do have a PhD. They do need sophisticated knowledge, often with mathematics, certainly with training, with working with statistics. They do need some very specialized knowledge and that's a very valuable thing, but there is also an increasingly large number of systems where there are some pre-built models.

Ellen Friedman: 40:01 A great example is TensorFlow, which is a very sophisticated deep learning system that comes originally from Google. It takes advantage of deep learning, that is, using layers and layers of decisions rather than all being modeled at one time. It's very good for things like image recognition, but they've done a lot of the heavy lifting already. And there are some kind of template pre-built models, such as one called Inception-v3, where you can retrain it on your own images. And so you need to know a lot about how machine learning systems work. You need the domain knowledge around your system and a little understanding of this.

Ellen Friedman: 40:43 But these kinds of models can be used, trained, and retrained for specific purposes by people who are not trained as data scientists, and we see that being done more and more effectively. So a PhD is not an absolute requirement even for the data science piece in the team, certainly not for the data engineering critical parts of the team, but it can be very helpful, especially for some new or cutting edge system.

David: 41:11 All right. I'm going to skip it. So Ellen, if you click on the questions tab, you'll be able to see all the questions.

Ellen Friedman: 41:22 I see. I could just ... Thank you.

David: 41:27 I'm going to skip down to this one: two common languages for data scientists are Python and R. What is your opinion on these?

Ellen Friedman: 41:34 Well, my opinion is that they're both good choices, and we see them very widely used. I think especially there's increasing use of R. It's important to have the flexibility to use these, as I said, directly on data. That's one of the advantages that MapR offers. We see people using other systems as well; to a lesser extent, people are using Scala. And you really may find that over time, what you're most familiar with, you'll keep using, but you'll start adding to your tool set. But I think both Python and R are very good choices for a person who's going into data science and machine learning.

David: 42:23 Okay. There's another one here. Can you comment on how AI and ML can be useful for material scientists, especially for predictive modeling?

Ellen Friedman: 42:31 Well, I don't have sufficient background for that, but if we can follow up via email, I will track down some information on that. Here, I see this next one. Let's see. In a use case where a large amount of prediction are expected even larger dataset ... I'll see if I understand that. Ah, okay. I think they're talking about scale here. They're asking about where you have a very large dataset and you're expecting a lot of predictions. If I understand the question correctly, here's my answer to that. They're asking, is it feasible? Yes, and some good news is, even in situations like that, you may not always have to work at the same scale, and so we do see systems that are being built that way. Again, I'm most familiar with MapR customers.

Ellen Friedman: 43:25 MapR provides a system that lets you orchestrate the data at scale with superior performance, and to do that on premises and in the cloud. You have a lot of flexibility in how you do that and you can do it in a cost effective way. In early steps of data exploration and early steps of building models, people tend to use very large datasets, especially for training, but also because they are literally trying to discover things. Machine learning and AI is a voyage of discovery, it's not just starting with a known answer. And so as they begin to get insights and a large number of predictions from data, they may find that they now go back and refine their systems, refine their models, so that it's not just that the model works better, but they're breaking apart the initial question into sub questions.

Ellen Friedman: 44:22 They understand better what's going on. A little bit of an example, it's kind of general, but I think it fits the situation, is that people build anomaly detection systems, and when they build those well, they're building a system that isn't just designed to go out and say a particular behavior is an outlier. They don't want to just define what normal is, but they actually model normal behavior so that they really are following behaviors as systems change. And that can be a fairly sophisticated thing to do. And it really is a discovery. You don't know what kind of outlier or anomalous behavior is going to show up.

Ellen Friedman: 45:05 But in some systems, as you begin to see some of the typical outliers, maybe that's in an industrial setting, maybe it's in a financial situation where you're looking for a behavior that looks potentially fraudulent, you begin to build secondary systems such as a fraud detector, where it's much narrower in that it's looking for something much more well defined. And the models that are involved sometimes can be much simpler. So even in systems that start off with a huge number of predictions from a very, very large dataset, often, as they advance the system, it's one where you're using smaller amounts even of training data and certainly smaller amounts of data as you are running the model in production.
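
One very simple way to picture "modeling normal behavior" rather than hard-coding a definition of normal is a rolling baseline that is re-estimated from recent data. The Python sketch below is an illustration only, not the approach used in any system mentioned in the webinar. The class name, window size, warm-up length, and threshold are all assumptions chosen for the example, and production anomaly detectors are usually far more sophisticated.

```python
from collections import deque
from statistics import mean, stdev


class RollingNormalModel:
    """Toy model of 'normal': the mean and standard deviation of a
    sliding window of recent readings that were not flagged.

    Because the window keeps moving, the notion of normal follows
    gradual changes in the system instead of being fixed up front.
    """

    def __init__(self, window=20, threshold=3.0, warmup=5):
        self.history = deque(maxlen=window)  # recent normal readings
        self.threshold = threshold           # z-score cutoff (assumed value)
        self.warmup = warmup                 # readings needed before flagging

    def is_anomaly(self, value):
        flagged = False
        if len(self.history) >= self.warmup:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # Flag readings that are far from the current baseline.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                flagged = True
        if not flagged:
            # Only unflagged readings update the model of normal.
            self.history.append(value)
        return flagged


readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 50.0, 10.1]
model = RollingNormalModel()
flags = [model.is_anomaly(r) for r in readings]
```

In this run only the reading of 50.0 is flagged. Keeping flagged readings out of the window stops a single spike from distorting the baseline, at the cost that an abrupt but legitimate level shift would keep being flagged; which behavior is preferable depends on the system.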

Ellen Friedman: 45:54 Let's see, do we have time for one more?

David: 45:58 Why don't we do two more. Can you start with that? Is that a good AI/ML tool?

Ellen Friedman: 46:07 I am less familiar with that ... Yes, I think so. We certainly see people using this. I'm more familiar with people who are using Python and R, but again, there's no single good tool. I think people have to find one that fits their style of work. It's good to try to have experience with multiple tools, I really do encourage that.

David: 46:30 Why don't you pick one more?

Ellen Friedman: 46:45 I see a person asking one that's similar to the question about whether you need a PhD. They're asking, do you need a master's at a bare minimum, or will a data science bootcamp or self-learning suffice? And it says they come from a software engineering background and are comfortable with Python. The question is whether a person with those levels of background, say with a master's degree in data science, can build these systems and do it effectively. Absolutely. But it does depend on the system. Again, what most matters is that you have a good fit between the skills, the requirements, the system you're building, the situation, and the issues that it's addressing.

Ellen Friedman: 47:28 What I mean by that is, some systems require a very high level of sophistication, new and customized modeling, very complicated algorithms to build. And there are people who are self-taught who do that beautifully. What matters is that you attain that level of knowledge, whether you do it yourself or through a bootcamp or you do it through a more formal academic education. But I think in those situations, you probably do need something equivalent to certainly a master's and maybe often a PhD level of knowledge, whether you get the certificate for that or not.

Ellen Friedman: 48:08 But keep in mind, one of the huge changes happening in machine learning and AI is that it's getting much broader pickup across different industries and different businesses, and it's being applied much more broadly in many different situations. And there are many situations in which, to build effective models, you do not need that PhD level or even the master's level of knowledge about data science. There are some very simple approaches, and recommendation is a great example. We've written about this, even written an earlier book about this. You can build a very simple recommender that does a simple statistical analysis of the interactions between users and, say, objects or behaviors, looking for interesting patterns of use.

Ellen Friedman: 49:02 And that very, very simple approach to recommendation can be applied in a lot of different situations. We know of a very large financial institution that applied this to various customer transactions, and it resulted in an expanded program, a new service for them, that within about four months was in production. And this is working with some top notch engineers, but people who were not data scientists, and they had not previously done machine learning. So again, I want to emphasize, building these systems effectively, being the person who has the skills to build them, needs to be a good fit with the particular system.
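
To give a feel for how little machinery a simple recommender can need, here is a minimal Python sketch based on plain co-occurrence counting: items that often appear together in user histories are recommended to users who have one but not the other. This is an illustrative stand-in, not the method from the speaker's book or from the financial institution mentioned above; real systems typically add a statistical test to keep only the interesting co-occurrences. The user histories, item names, and function names are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical interaction histories: which products each user has used.
histories = {
    "u1": {"savings", "credit_card", "mortgage"},
    "u2": {"savings", "credit_card"},
    "u3": {"credit_card", "travel_insurance"},
    "u4": {"savings", "mortgage"},
}


def build_cooccurrence(histories):
    """Count, for every pair of items, how many users interacted with both."""
    co = defaultdict(Counter)
    for items in histories.values():
        for a, b in combinations(sorted(items), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(user_items, co, top_n=2):
    """Score unseen items by how often they co-occur with the user's items."""
    scores = Counter()
    for item in user_items:
        for other, count in co.get(item, {}).items():
            if other not in user_items:
                scores[other] += count
    return [item for item, _ in scores.most_common(top_n)]


co = build_cooccurrence(histories)
print(recommend({"mortgage"}, co, top_n=1))  # prints ['savings']
```

In this toy data, two users hold both a mortgage and a savings product but only one holds a mortgage and a credit card, so savings ranks first for a user who has only a mortgage.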

Ellen Friedman: 49:41 It doesn't mean that you can build every system, but it means you are highly qualified to build systems that are even more commonly found than the very sophisticated ones. That's on the data science side, by that I mean the person who's actually doing the modeling, working with the statistics and so forth. But the data engineering part is tremendously important and becoming much more tightly integrated in working directly with the data scientists. It is very much an integral part of building these systems, and so, having a software engineering background is a great approach to that.

Ellen Friedman: 50:18 You just add some data skills. Being comfortable with Python is a wonderful foundation for doing this. And so I think there's a lot of opportunity out there even for people who are self-taught, people who are getting some data science or data skill exposure through courses. MapR offers some online courses for free as well. But there are a lot of different approaches and fortunately, I think people are more and more recognizing the skills that are needed rather than looking for just a single label or a title or academic degree. Shall we close on that one?

David: 50:58 Yeah. And there are a few other questions on here, Ellen, that I think if you're okay, you can follow up with personally afterwards. I think they're right up your-

Ellen Friedman: 51:11 That would be awesome.

David: 51:11 ... right up your alley.

Ellen Friedman: 51:13 Thanks to so many people for asking questions. I really appreciate that, and thank you for attending this session. And we'll look for you on the next one. I'll advance one more slide here so you can see contact ...