Industry Solutions Architect, MapR
Join us for a complimentary webinar to learn more about Apache Spark's GraphFrames, which makes graph analysis scalable and easier with Spark DataFrames.
In this webinar, we will go over an example from the eBook Getting Started with Apache Spark 2.x: Using Apache Spark GraphFrames to Analyze Flight Delays and Distances.
Graphs provide a powerful way to analyze the connections in a data set; examples include search results based on website links, recommendations based on user and product connections, and fraud detection algorithms in banking, healthcare, and network security.
In this tutorial we will begin with an overview of Graph and GraphFrames concepts, then we will analyze a public flight data set for flight distances and delays.
Carol McDonald: 00:00 Thanks. I'll go ahead and share my desktop.
Carol McDonald: 00:15 And let's talk about what we're gonna go over. First, an introduction to graphs. Then we're gonna go over an introduction to GraphFrames with a very simple flight dataset for easy understanding. And finally, we're gonna use GraphFrames with a real flight dataset for 2018.
Carol McDonald: 00:30 So, first just a simple introduction to graphs. A graph is a mathematical structure used to model relations between objects. It's made up of vertices and edges that connect them. The vertices are the objects, and the edges are the relationships between the objects. So, here we have an example: Facebook friends. Ted is a friend of Carol's. So, the vertices are Ted and Carol, and the relationship is friends.
Carol McDonald: 01:10 A regular graph is a graph where each vertex has the same number of edges. So, in the example that we went over, if Ted is a friend of Carol's, then Carol is also a friend of Ted's. A directed graph is a graph where the edges have a direction associated with them. An example of a directed graph is a Twitter follower: Carol can follow Oprah without implying that Oprah follows Carol.
Carol McDonald: 01:43 Spark GraphX supports graph computation with a distributed property graph. A property graph is a directed multigraph, which can have multiple parallel edges. Every edge and vertex has properties associated with it. The parallel edges allow multiple relationships between the same vertices. In this example, the edges are the flights, and some edge properties we have are the flight number, the distance, and the delay. And on the vertices we have properties for city and state.
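The property graph described above can be sketched in plain Python (the webinar's examples are Scala on Spark GraphX/GraphFrames; the names, airports, and numbers here are illustrative, not the Spark API):

```python
# A minimal property-graph sketch: vertices carry properties (city, state);
# edges carry src, dst, and properties (flight number, distance, delay).
# Storing edges in a list, rather than keyed by (src, dst), is what allows
# parallel edges: multiple flights between the same pair of airports.
vertices = {
    "SFO": {"city": "San Francisco", "state": "CA"},
    "ORD": {"city": "Chicago", "state": "IL"},
}
edges = [
    {"src": "SFO", "dst": "ORD", "flight": "F1", "dist": 1800, "delay": 10},
    {"src": "SFO", "dst": "ORD", "flight": "F2", "dist": 1800, "delay": 0},  # parallel edge
]

def edges_between(src, dst):
    """All parallel edges from src to dst."""
    return [e for e in edges if e["src"] == src and e["dst"] == dst]

print(len(edges_between("SFO", "ORD")))  # 2 — two parallel flights on the same route
```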
Carol McDonald: 02:19 Spark GraphX is a distributed graph processing framework that sits on top of the Spark Core. GraphX brings the speed and scalability of parallel, iterative graph processing for big data sets by partitioning graphs across [inaudible 00:02:34].
Carol McDonald: 02:38 Graph analysis comes in two forms: graph algorithms and graph queries. GraphX made it possible to run graph algorithms within Spark, but until recently pattern queries required moving data manually to a specialized graph database.
Carol McDonald: 02:55 An example of a graph algorithm is PageRank. The creators of the Google search engine took search to a new level with the creation of the PageRank graph algorithm, which measures the importance of a webpage. PageRank represents webpages as the vertices, and links to pages as edges. And it measures the importance of a page by the number and rank of the pages linking to it. An example of using the PageRank algorithm for another use case would be for Twitter, to see who has the most Twitter followers. And for an airport use case, you can use it to find the most important airport.
Carol McDonald: 03:41 You can visualize PageRank as each page sending a message with its rank of importance to each page it points to. In the beginning, each page has the same rank, equal to one divided by the total number of pages. The messages are aggregated at each vertex, with the sum of all the incoming messages becoming the new PageRank. This is calculated iteratively, so that links from more important pages count for more.
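The iterative scheme just described can be sketched in plain Python (GraphFrames has a built-in pageRank method; this is only the concept, on three made-up pages):

```python
# Iterative PageRank sketch: start every page at 1/N, then repeatedly send
# each page's rank along its outgoing links and sum the incoming messages.
def pagerank(vertices, edges, damping=0.85, iters=20):
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}                    # equal starting ranks
    out = {v: [d for s, d in edges if s == v] for v in vertices}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in vertices}       # teleport term
        for v in vertices:
            if out[v]:
                share = damping * rank[v] / len(out[v])      # message sent per link
                for t in out[v]:
                    new[t] += share                          # aggregate at destination
        rank = new
    return rank

verts = ["A", "B", "C"]
links = [("A", "B"), ("C", "B"), ("B", "A")]  # two pages link to B
ranks = pagerank(verts, links)
# B receives the most incoming rank, so it ends up the most important page
```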
Carol McDonald: 04:13 Many iterative graph algorithms, such as PageRank, Shortest Path, and Connected Components, repeatedly aggregate properties of neighboring vertices. These algorithms can be implemented as a sequence of steps, where vertices send messages to their neighboring vertices, and then an aggregate of the messages is calculated at the destination vertex.
Carol McDonald: 04:38 Now looking at graph queries. Graph motifs are recurrent patterns in a graph, and graph queries search a graph for all occurrences of a given motif or pattern. As an example, if you wanna recommend who to follow on Twitter, you could search for patterns where A follows B, and B follows C, but A does not follow C. Then you could recommend that A follows C.
Carol McDonald: 05:07 Here's this example as a graph motif query, so we have a find for this pattern. The pattern is specified like this for the motif find: the vertices are in between the parentheses, and the edges are in the square brackets. So, this pattern searches for A follows B, and B follows C, but A doesn't follow C; here we have a not before that last part of the pattern. Later we're gonna use this motif find in some more examples.
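The same motif search can be sketched in plain Python by scanning an edge set directly (in GraphFrames this would be a find over a pattern string like the one on the slide; the user names here are made up):

```python
# Motif sketch: find triples (a, b, c) where a follows b and b follows c,
# but a does not follow c — then recommend that a follow c.
follows = {("Alice", "Bob"), ("Bob", "Carol")}  # hypothetical follow edges

def recommend(edges):
    hits = []
    for a, b in edges:
        for b2, c in edges:
            # join the two edges on the shared middle vertex b,
            # and apply the negated part of the pattern: !(a)-[]->(c)
            if b == b2 and a != c and (a, c) not in edges:
                hits.append((a, b, c))  # recommend that a follow c
    return hits

print(recommend(follows))  # [('Alice', 'Bob', 'Carol')]
```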
Carol McDonald: 05:44 GraphX made it possible to run graph algorithms within Spark. But before GraphFrames, graph queries required moving data manually to specialized graph databases, such as Neo4j or Titan. GraphFrames integrates the graph queries and the graph algorithms with Spark SQL, simplifying the graph analytics pipeline and enabling optimizations across graph and SQL queries.
Carol McDonald: 06:14 Next we're gonna look at some real-world graph examples. Graph analysis is important in domains including commerce, banking, social networks, and medicine. Recommendation algorithms can use graphs where the nodes are the users and products, and the edges are the ratings or purchases of the products by users. Graph algorithms can calculate recommendations based on how similar users rated or purchased similar products, for example.
Carol McDonald: 06:44 Graphs are useful for fraud detection algorithms in banking, healthcare, and network security. In healthcare, graph algorithms can explore the connections between patients, doctors, and pharmacy prescriptions. With a graph you can explore relationships between entities with queries that are difficult to do with SQL. So, take patient matching as an example. With a graph you can determine things like two patients with similar names went to the same doctor and got the same prescription, and be 95% sure they're the same person. And this would be really hard to do with a relational query.
Carol McDonald: 07:21 In banking, graph algorithms can explore the relationship between credit card applicants and identifying information such as phone numbers. In this graph, which is actually from a Capital One presentation, we see that these two fraudsters are sharing the same phone number. And of course, in a bigger graph, you could see rings of shared information, which can be very important for credit card applications. Another credit card example would be exploring the relationship between credit card customers and merchant transactions. Just to note, Capital One is using Spark, GraphX in this case.
Carol McDonald: 08:04 Next we'll go over a simple flight example with GraphFrames. As a starting example, we're gonna analyze three flights. And for each flight we have the following information: the originating airport, the destination airport, the distance between the airports, and the departure delay. So, this is what our graph looks like. We have San Francisco, Dallas Fort Worth, and Chicago; those are our vertices. And then our edges are the flights, with the distance and the delay as properties.
Carol McDonald: 08:36 So, we'll look at how we can build this graph. First, we have the vertex table. For our graph we're gonna have three vertices, shown here. A vertex table has to have an id column; for the id column we have the airport codes, and as the property we have the cities. Looking at our edge table, it has our three flight routes. An edge table has to have src and dst columns; those are the source and destination vertices. For our source and destination vertices, we're using the airport codes. So, for example here we have SFO to ORD. And then the edge property in this case is the distance.
Carol McDonald: 09:28 So, what we do is we put these into a DataFrame. First we're specifying a Scala case class with the schema for our airport. So, here we have the required id, which is gonna be the airport code, and then the property of city. Next we're creating an array of these airport objects, and then we're turning this array into a DataFrame. And then we're showing the rows of this DataFrame with the DataFrame show method.
Carol McDonald: 10:05 Next, we're gonna create the edges DataFrame. And the edges DataFrame has to have the source and destination columns, which in this case are going to be our airport codes; so, here we have San Francisco and Chicago. Again we have a Scala case class to define the schema. We create an array of these flight objects, and then we turn that into a DataFrame. And then we're just showing the rows of this DataFrame.
Carol McDonald: 10:28 You create a GraphFrame by supplying the vertex and edge DataFrames. You can also just supply the edges DataFrame, and then the vertices will be inferred from the edges. Here we're showing that the vertices and the edges are available as attributes of the graph, with graph.vertices.show.
Carol McDonald: 11:00 So, the vertices and edges of the GraphFrame are DataFrames, and they can be queried like a DataFrame. Here we're showing the first few rows of the edges with the DataFrame show method. Next we can ask questions like: how many airports are there? That's a count on the rows of the vertices DataFrame, and we have three airports in our GraphFrame. Then how many flights? We have three flights; that's a count on the edges.
Carol McDonald: 11:33 Here we have a filter on the edges for the routes that are greater than 800 miles. And the results show that San Francisco to Chicago and Dallas to San Francisco are greater than 800 miles. So, that's a simple example just to get started. Next we'll look at a more complicated one with a real dataset from 2018.
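The whole simple example above can be sketched in plain Python (the webinar builds a GraphFrame from Scala case classes; these rows mirror the slides, but the distances and delays are illustrative values, not the exact figures shown):

```python
# Vertex table: id column (airport code) plus a city property.
airports = [
    {"id": "SFO", "city": "San Francisco"},
    {"id": "ORD", "city": "Chicago"},
    {"id": "DFW", "city": "Dallas Fort Worth"},
]
# Edge table: src/dst columns (airport codes) plus distance and delay.
flights = [
    {"src": "SFO", "dst": "ORD", "dist": 1800, "delay": 40},
    {"src": "ORD", "dst": "DFW", "dist": 800,  "delay": 0},
    {"src": "DFW", "dst": "SFO", "dist": 1400, "delay": 10},
]

print(len(airports))                                   # how many airports? 3
print(len(flights))                                    # how many flights? 3
long_routes = [f for f in flights if f["dist"] > 800]  # like edges.filter("dist > 800")
print([(f["src"], f["dst"]) for f in long_routes])     # SFO->ORD and DFW->SFO
```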
Carol McDonald: 12:00 First we're gonna look at exploring with just the DataFrames. And we're gonna load this data from a MapR Database table. MapR Database is a JSON document database. Here's a quick visual just to show how a Spark application runs: it runs as parallel executor processes, coordinated by the Spark session object in your Spark application.
Carol McDonald: 12:30 A Spark dataset is a distributed collection of typed objects, like a list, except that it's spread out across multiple nodes in the cluster. And a DataFrame is a dataset of row objects. And you can think of it like a table partitioned across a cluster. The Dataset and DataFrame APIs provide ease of use, space efficiency, and performance gains with the Spark SQL execution engine.
Carol McDonald: 13:00 We're gonna be using Spark SQL with MapR Database JSON. And there's a connector, the Spark MapR Database connector which enables users to perform SQL queries and updates on top of MapR Database using a Spark dataset, while applying critical techniques such as projection and filter pushdown, partitioning and data locality.
Carol McDonald: 13:22 With MapR Database, a table is automatically partitioned across a cluster by key range, and each server serves a subset of the table. Grouping the data by key range provides for really fast reads and writes by row key. And it's also useful to have your data organized this way for really fast graph queries, as we'll see too. The Spark MapR Database connector architecture has a connection object in every Spark executor, which allows for distributed parallel writes and reads with MapR Database.
Carol McDonald: 13:58 This is the schema and some data for a flight dataset, which is in JSON format. What we're gonna focus on for this GraphFrame is the id; that's the unique ID, and it's actually the row key for MapR Database, which is gonna be automatically partitioned and sorted by this row key. And our row key is made up of the source airport, the destination airport, the date, the carrier airline, and the flight number. Then, the other important attributes are the source and destination; those are required for the edges DataFrame. And then as properties we're gonna focus on the departure delay and the distance.
Carol McDonald: 14:54 So, again this data's gonna be automatically partitioned by the key range, and sorted. So, our data's gonna be grouped by the source airport and the destination airport. So, that means that these partitions are gonna be grouped that way, which will make it useful, because that's also what our query's gonna be looking at. We're looking at the source and destination airports in our queries. We have a subset of airports in America for this dataset. And these are the airports in the dataset, and the dataset is also for the year 2018.
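The composite-row-key idea above can be sketched in a few lines of Python. The field order matches what the webinar describes (source, destination, date, carrier, flight number), but the separator and exact key format here are assumptions for illustration:

```python
# Build a composite row key; because MapR Database sorts and partitions rows
# by key range, keys built this way group flights by source, then destination.
def row_key(src, dst, date, carrier, flight):
    return "_".join([src, dst, date, carrier, str(flight)])

# hypothetical flights, not real rows from the dataset
keys = [
    row_key("SFO", "ORD", "2018-01-15", "AA", 1290),
    row_key("ATL", "BOS", "2018-01-02", "DL", 104),
    row_key("ATL", "LGA", "2018-01-02", "DL", 1098),
]
# Lexicographic sort groups rows by source airport, then destination —
# the same grouping the route queries later in the talk rely on.
print(sorted(keys))
```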
Carol McDonald: 15:30 So, the first thing that we wanna do is define the schema. This is the schema for the JSON rows in our database that I showed earlier. And we're defining this with a Scala case class, and also with a Spark StructType.
Carol McDonald: 15:56 Here what we're doing is we're loading the flights from the MapR Database table into a DataFrame with the loadFromMapRDB method. And we're specifying the table name, that schema, and also the type of the objects that are gonna be returned here, which we just specified: the Flight type object, and the schema right here.
Carol McDonald: 16:18 Then we have a show of the first five rows of this DataFrame, and we see that it's composed of rows and columns. And here again we see in the results that the rows are automatically sorted and partitioned: first come the airports starting with A, like Atlanta, then B.
Carol McDonald: 16:46 So now we can explore the table with some DataFrame queries. We'll do a few just as an example. Here's a DataFrame query to answer the question: what are the originating airports with the most delays, where a delay is greater than 40 minutes? So, we're filtering on departure delay greater than 40 minutes, then we're grouping by the source airport, and then we have a count, and we're ordering in descending order. So, here we see that Chicago and Atlanta have the most departure delays.
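The filter, group-by, count, and descending sort just described can be sketched in plain Python (the webinar does this as a Spark DataFrame query; the toy rows below are made up):

```python
# Which origin airports have the most departure delays over 40 minutes?
from collections import Counter

flights = [
    {"src": "ORD", "depdelay": 55}, {"src": "ORD", "depdelay": 45},
    {"src": "ATL", "depdelay": 50}, {"src": "SFO", "depdelay": 10},
]
# filter(depdelay > 40) . groupBy(src) . count()
delayed = Counter(f["src"] for f in flights if f["depdelay"] > 40)
for src, count in delayed.most_common():   # like orderBy(desc("count"))
    print(src, count)
# ORD 2, then ATL 1 — ORD has the most big delays in this toy data
```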
Carol McDonald: 17:19 We can see the physical plan for this query using the explain method. And here in red I've highlighted projection and filter pushdown. What this means is that the scanning of the source and departure delay columns, and the filter on the departure delay column, are being pushed down into MapR Database. So, this scanning and filtering is taking place in MapR Database before returning the data to Spark.
Carol McDonald: 17:52 Projection pushdown minimizes the data transfer between MapR Database and Spark by omitting unnecessary fields from the table scans. So, this can be particularly beneficial when you have a lot of columns that you're not interested in, because you can narrow down the columns that you want returned. And filter pushdown improves performance by reducing the amount of data passed between MapR Database and Spark when filtering data. So, basically if you're selecting and filtering, that's gonna reduce the amount of data that's returned to the Spark engine, and that's gonna give you better performance, less memory usage, and less data transfer.
Carol McDonald: 18:30 Next, we're caching this data, then we're creating a temp view. That allows us to do SQL queries on the data, in addition to the DataFrame queries. And then from the count we see we have 282,000 rows. Some examples of SQL queries: here we're getting what we did before, but using SQL. We're getting a count of the departure delays by origin: we're selecting the source and counting the departure delays where the departure delay is greater than 40, and then we're grouping this by the source airport. And then we're displaying this in a Zeppelin bar graph, and we see that Atlanta and Chicago have the highest departure delays in this example dataset.
Carol McDonald: 19:30 Here's a bar graph for the departure delays by origin and destination. So here we're grouping by the source and destination, and in the bar graph on the bottom we have the source airports, with a grouping of colors by the destination airports. So here we see Atlanta to LaGuardia has a lot of departure delays, and so does Chicago to LaGuardia.
Carol McDonald: 20:00 Next, we're gonna look at exploring this table with GraphFrames. Here are examples of questions that we can answer, such as: how many flight routes are there, what are the longest distance routes, and what are the top ten flight routes?
Carol McDonald: 20:18 We have our vertices information in a JSON file, so here we're reading in the vertices from a JSON file, and this shows all the airports and the data that we have in this file. So again, the ID column is the airport codes, and then we have the properties of city, country, and state.
Carol McDonald: 20:37 Next, we'll create the GraphFrame from the airports and flights DataFrames. And then we're going to show the graph edges; this is just showing the first few rows of this DataFrame. I'm gonna pause here and show this in a notebook. But before we go on to the graph part, I wanna show reading the DataFrame part first.
Carol McDonald: 21:15 So here I have a Zeppelin notebook, and this is running on a [inaudible 00:21:19] cluster with the Data Science Refinery, which runs the Zeppelin notebook with Spark. First, I have the dependency for GraphFrames, because it's actually an extra package; it's not in core Spark. Then I'm importing the needed packages, including GraphFrames.
Carol McDonald: 21:39 Then here I define the schema for our JSON data that's in MapR Database. Then I'm reading this data from MapR Database into a Dataset of type Flight. So this is returning a DataFrame of flight objects. This shows the schema for the DataFrame. Here I'm displaying the first five rows of our Dataset, or DataFrame. And again, here we see that the id is sorted, with Atlanta to Boston first; that's the first source airport alphabetically.
Carol McDonald: 22:20 Now some SQL queries, where the query is: what are the originating airports with the most delays greater than 40 minutes? And so here we see, again, Chicago and Atlanta with the most delays. Here's an example of getting the query plan using explain, and here we see the filter pushdown. Here are the pushed filters, where the departure delay is greater than 40.
Carol McDonald: 22:54 Next we're looking at some queries with some bar charts for the average departure delay by carrier. So here we're getting the average departure delay by carrier, and we see that Southwest, in this example, actually has the highest average departure delays. Here's the average departure delay by origin: grouped by origin, Newark and Miami have the highest averages. And by destination, Newark and San Francisco have the highest averages.
Carol McDonald: 23:42 Here's the average departure delay by day of the week, where Monday is number one. And here we have the count of departure delays where the departure delay is greater than 40. So this is a count of departure delays by the day of the week, and we see that Monday is one, and four is Thursday. So the highest one here in this example was Thursday.
Carol McDonald: 24:07 Then the count of departure delays by hour of the day, with the highest count being at 6:00. And we have the count of departure delays by origin, again. So here we have Atlanta and Chicago have the highest count of departure delays. And so this is by destination, and origin and destination.
Carol McDonald: 24:30 So now we'll go back to GraphFrames. So again here we created the GraphFrame from our airports and flights DataFrames. And now the airports and flights DataFrames are available as the graph vertices and edges. This just shows some DataFrame queries and methods that you can use to query the graph. We'll look at the filter one.
Carol McDonald: 25:06 So here we're using a filter on the vertices to find the airports in the state of Texas. So in the state of Texas, we have the airports for [inaudible 00:25:23], Dallas Fort Worth, and Houston.
Carol McDonald: 25:30 Count returns the number of rows in the DataFrame, so here we have 13 airports and 282,628 flights in our table. Here's the query for the four longest distance flights: on the edges [inaudible 00:25:53] DataFrame, we're grouping by the source and destination, then we have a [inaudible 00:25:58], and we're sorting in descending order, which shows that Miami and Seattle are the longest distance, followed by Boston and San Francisco.
Carol McDonald: 26:10 Here's a query for the average delay for delayed flights from Atlanta. So here on the graph edges, we're filtering for the source [inaudible 00:26:17] Atlanta, and departure delays greater than one. Then we're grouping it by the source and destination, and then we're getting the average and sorting it in descending order. The results are that Atlanta to Newark has the highest average departure delay followed by Atlanta to Chicago, and so on.
Carol McDonald: 26:41 This again shows that we have projection and filter pushdown on this one. So we have the explain, and we see the pushed filters and the selection for departure delay greater than one.
Carol McDonald: 26:59 This is a query for the average departure delay for delayed flights from Atlanta by hour. So here we're filtering on Atlanta, and then we're grouping by the departure hour and getting the average departure delay, which shows the highest averages are at rush hour, which would be expected.
Carol McDonald: 27:20 Next we're gonna look at the GraphFrames degree methods. inDegrees determines the number of incoming edges, outDegrees determines the number of outgoing edges, and degrees is the number of incoming and outgoing edges.
Carol McDonald: 27:40 Here's an example of using the degrees for our graph. This is gonna give the vertices with the highest number of incoming and outgoing edges; it's gonna get the total, and then we're sorting this in descending order. So for us, that gives the importance of the airports. We can see that Chicago and Atlanta have the highest number of incoming and outgoing flights in our dataset.
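The degree calculation above can be sketched in plain Python (GraphFrames computes inDegrees/outDegrees/degrees for you; the edge pairs here are a made-up sample):

```python
# Count incoming and outgoing edges per vertex, then sort descending —
# the "most important airports by degree" query from the webinar.
from collections import Counter

edges = [("SFO", "ORD"), ("ATL", "ORD"), ("ORD", "ATL"), ("ORD", "SFO")]
in_deg  = Counter(dst for _, dst in edges)   # incoming edges per vertex
out_deg = Counter(src for src, _ in edges)   # outgoing edges per vertex
degree  = in_deg + out_deg                   # total degree per vertex

for airport, d in degree.most_common():      # sorted in descending order
    print(airport, d)
# ORD comes out on top: it has the most incoming + outgoing flights here
```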
Carol McDonald: 28:08 Next we'll look at an example using PageRank. As discussed earlier, PageRank measures the importance of each vertex in a graph by iteratively calculating which vertices have the most connections to other vertices. So here we're using PageRank to determine which airports are the most important.
Carol McDonald: 28:29 PageRank returns a GraphFrame, shown here; this is the ranked GraphFrame, and the PageRank is a column in it. So what we're doing here with the ranked vertices is ordering by the PageRank in descending order. And here we see that Chicago and Atlanta have the highest PageRanks.
Carol McDonald: 28:55 So this corresponds to what we saw with the degrees, but it was calculated with a slightly different algorithm. So you can use both of those to determine the measure of importance of the vertices.
Carol McDonald: 29:07 Next we're gonna look at the GraphFrames API for aggregating messages. As discussed earlier, many graph algorithms use iterative message passing between vertices. GraphFrames provides an aggregateMessages API, which sends messages between vertices and then aggregates the message values at the destination vertex. So here what we're doing is setting the send message to be the departure delay; we're calling aggregateMessages and sending the departure delay, and then the aggregation that's gonna happen at all the destination vertices is an average.
Carol McDonald: 29:58 So all the destination vertices are gonna calculate the average, and the result, shown here, is a DataFrame with the vertex and then the average delay. So that's the calculation that we got for each vertex. We see that the highest average delay was for Newark, then Miami, and then Chicago.
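The aggregate-messages pattern just described can be sketched in plain Python: each edge "sends" its departure delay to its destination vertex, and each vertex averages its inbox (in the webinar this is the GraphFrames aggregateMessages call; the delays below are made up):

```python
# Message passing sketch: send each edge's delay to its destination vertex,
# then aggregate the incoming messages at each vertex with an average.
from collections import defaultdict

edges = [
    {"src": "ATL", "dst": "EWR", "delay": 60},
    {"src": "ORD", "dst": "EWR", "delay": 30},
    {"src": "ATL", "dst": "ORD", "delay": 15},
]

inbox = defaultdict(list)
for e in edges:
    inbox[e["dst"]].append(e["delay"])   # send: delay goes to the dst vertex

# aggregate: average the messages received at each destination vertex
avg_delay = {v: sum(msgs) / len(msgs) for v, msgs in inbox.items()}
print(avg_delay)   # EWR averages (60 + 30) / 2 = 45.0
```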
Carol McDonald: 30:30 Here what we wanna do is find the most frequent flight routes. So here we're using our graph edges: we're grouping by the source and destination, then we're counting, and we're ordering by the count in descending order. So here we get a count for each flight route, and we determine that the highest ones are LaGuardia to Chicago and Chicago to LaGuardia, then Los Angeles to San Francisco and San Francisco to Los Angeles.
Carol McDonald: 31:03 So that's showing the top five. Then we can also count how many possible routes there are in our graph. That would be the count of this DataFrame, and we get 148. We're gonna use this flight route count DataFrame later in a motif query.
Carol McDonald: 31:21 Here we have a [inaudible 00:31:24]; you can use that with the Zeppelin notebook to display your DataFrame with a graphical visualization. So here we're displaying the flight route count that we just showed, with a bar graph. On the bottom we have all the source airports, and then up here, in the colors, we have it grouped by the destination airports. The highest counts we see are LaGuardia and Chicago; again, those were some of the highest ones.
Carol McDonald: 32:00 Next we're gonna look at triplets in the graph topology, and then graph queries with the motif find. Graph triplets are DataFrames composed of two vertices and an edge that connects them. So the results of triplets are the two vertices, the source and destination, and the connecting edge. Here we're just showing three as an example.
Carol McDonald: 32:30 An example query that you can do on the triplets DataFrame: here we're filtering for the source state equal to Texas. So here we're showing all the edges starting from Texas, with the destination vertices and the edges in between.
Carol McDonald: 32:55 The implementation of triplets is actually a motif find. And remember, motif finding searches for structural patterns in a graph. So what the triplets query is actually searching for is this pattern: source vertices with the edge in between, and the destination. That's the pattern it searches for, and that returns the triplet.
Carol McDonald: 33:26 Next we're going to look at a different pattern, and we're gonna use the DataFrame that we built earlier for the flight route count in our motif query, to have a smaller subset. So again, this flight route count shows all the possible flight routes in our graph.
Carol McDonald: 33:50 So what we're doing first here is creating a subgraph from our GraphFrame, using the graph vertices as the vertices, and the flight route count DataFrame that we looked at earlier as the edges. This returns a subgraph which only has the possible routes, rather than every flight, so that's gonna make it easier to search for a certain pattern.
Carol McDonald: 34:12 The pattern that we wanna search for here is flights with no direct connection. We can find this with this pattern: we're searching for A flies to B, and B flies to C, but A does not fly to C. So that's gonna show flights with no direct connection. Then we're gonna filter the results using DataFrames to remove duplicates.
Carol McDonald: 34:42 This shows the results of this query. So here we see that the flights with no direct connection are Newark to LaGuardia and LaGuardia to Newark, Los Angeles to LaGuardia, LaGuardia to San Francisco, and LaGuardia to Seattle and Seattle to LaGuardia. So those are the flights with no direct connections. Next we're gonna try to find connecting airports between some of these flights.
Carol McDonald: 35:12 So to find some connecting airports, we're gonna first use shortest path. Shortest path computes the shortest path from each vertex to the given sequence of landmark vertices. So here we're finding the shortest path from each vertex to LaGuardia in this example. What this returns is all the vertices, and the shortest path to LaGuardia.
Carol McDonald: 35:41 So here we see that when the path length is two, there's no direct connection. So again, this shows that we have no direct connection between Los Angeles and LaGuardia, between San Francisco and LaGuardia, Seattle and LaGuardia, and Newark and LaGuardia.
Carol McDonald: 36:00 Next we'll look at a breadth first search example. Breadth first search finds the shortest path from beginning vertices to end vertices. The beginning and end vertices are specified as DataFrame expressions. In this example, we're doing a breadth first search from Los Angeles to LaGuardia, and the maxPathLength specifies how many connections. Here we want direct connections, so we specify one. And then running this shows that there are no direct flights between Los Angeles and LaGuardia.
Carol McDonald: 36:44 Here we set the maxPathLength to two, so that'll find connections between LAX and LGA with a path of two. This shows that there are some flight connections from LAX to LGA through Houston. It shows the edges that we have: the first connecting edge, and the connecting edge for the next one. Here we're just showing five; there are more.
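The breadth first search just described can be sketched in plain Python: grow paths from the start airport one edge at a time, up to a maximum path length, like the GraphFrames bfs with maxPathLength. The route set below is a made-up subset, not the real 2018 data:

```python
# BFS over flight routes with a cap on path length (number of edges).
routes = {("LAX", "IAH"), ("IAH", "LGA"), ("LAX", "SFO"), ("SFO", "ORD")}

def bfs_paths(start, end, max_len):
    paths, frontier = [], [[start]]
    for _ in range(max_len):                      # extend paths one edge at a time
        next_frontier = []
        for path in frontier:
            for src, dst in routes:
                if src == path[-1] and dst not in path:   # no revisits
                    if dst == end:
                        paths.append(path + [dst])        # reached the end vertex
                    else:
                        next_frontier.append(path + [dst])
        frontier = next_frontier
    return paths

print(bfs_paths("LAX", "LGA", 1))  # [] — no direct flight in this route set
print(bfs_paths("LAX", "LGA", 2))  # a one-stop connection through Houston (IAH)
```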
Carol McDonald: 37:19 Then we can also narrow down these results, using DataFrame operations to refine them. So here we're combining the motif searching with DataFrame operations, and we wanna find connecting flights between LAX and LGA. So here we have the pattern from A to B, and B to C, and then we're filtering on A equal to LAX and C equal to LGA. So this is gonna search for flights between LAX and LGA, and then we're showing four. And again, here we see that we have connections through Houston, and we can also see the details of these connections.
Carol McDonald: 38:05 Here we're using breadth-first search from LAX to LGA, and we specified a maxPathLength of three, so that would be with two connections. We're also applying an edge filter for a carrier of American Airlines. We run this, which returns a DataFrame, and then on this DataFrame we're filtering for results where the arrival time of the first flight is at least an hour before the departure time of the connecting flight, on the same date. Then we're showing some examples of this.
Carol McDonald: 38:41 Here we have some flights from LAX through Boston, and then Boston to LGA. There's also LAX to Charlotte, and Charlotte to LGA with American Airlines. Next I'll switch over to the Zeppelin notebook. Again, what we did was we read the airports file for the graph vertices and created the vertices DataFrame, and then the edges DataFrame, which is sorted in order, starting with Atlanta and Boston. This is just showing the vertices DataFrame and the edges DataFrame. Then we're getting how many flights are in the dataset and how many airports, which are the counts of each DataFrame.
Carol McDonald: 39:49 Here's the query for the longest delays for flights that are greater than 1,500 miles in distance. We're filtering the edges for distance greater than 1,500, and then ordering by the departure delay. We see the highest departure delays and their distances; for example, Boston to Dallas-Fort Worth. This shows the query plan with filter pushdown, which we looked at earlier. Next is the longest and shortest distance routes. We grouped the edges by the source and destination to get the max distance and the min distance. The minimum distance in this example is Boston to LaGuardia.
Carol McDonald: 40:38 Here's a query to find the flight routes that have the highest average departure delays. We're grouping by the source and destination to get the average departure delay, and then sorting that. The highest average is Atlanta to Newark. Here's a query for the count of departure delays by origin and destination, where the delay is greater than 40 minutes. Then we have what are the longest delays for flights. We're filtering on departure delay greater than one, so the flight is delayed, then sorting by the departure delay, which shows that the longest departure delay was this one from Houston to Miami.
Carol McDonald: 41:30 Here's the longest delay for flights that are greater than 1,500 miles. First we're filtering on a distance greater than 1,500, then ordering by the departure delay. This one is Boston to Dallas. Here's a query for the average delay for delayed flights departing from Atlanta. We're filtering on Atlanta, then grouping by the source and destination and getting the average departure delay, which shows that Atlanta to New York has the highest average departure delay.
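The filter-then-groupBy-then-average pattern behind that last query can be sketched in plain Python over an edge list. The delay values here are invented sample data, and this is the logic, not the Spark DataFrame API.

```python
from collections import defaultdict

# (src, dst, departure_delay) records, made-up sample values.
flights = [("ATL", "EWR", 60.0), ("ATL", "EWR", 40.0),
           ("ATL", "MIA", 10.0), ("BOS", "DFW", 90.0)]

def avg_delay_by_route(flights, origin):
    """Average departure delay per route for delayed flights from `origin`,
    the plain-Python analogue of filter + groupBy + avg on the edges."""
    buckets = defaultdict(list)
    for src, dst, delay in flights:
        if src == origin and delay > 0:   # filter: delayed flights from origin
            buckets[(src, dst)].append(delay)
    return {route: sum(d) / len(d) for route, d in buckets.items()}

print(avg_delay_by_route(flights, "ATL"))
```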
Carol McDonald: 42:07 This one is what are the worst hours for delayed flights departing from Atlanta. We're filtering on Atlanta, then grouping by the departure hour and displaying this in a bar chart. This one shows the top delays from Atlanta. We're filtering on Atlanta with delays greater than 40 minutes, and then sorting by the departure delay. The highest one was Atlanta to Miami.
Carol McDonald: 42:40 Here we're using inDegrees to show which airports have the most incoming flights, which shows Atlanta and Chicago. outDegrees shows the most outgoing flights, which is also Chicago and Atlanta. Then degrees shows the most incoming and outgoing flights combined, which is also Chicago and Atlanta. Again here, we're using z.show to display the DataFrames in a bar graph.
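What inDegrees, outDegrees, and degrees compute can be sketched in plain Python as counts over the edge list. The routes are made-up sample data.

```python
from collections import Counter

# Toy (origin, destination) routes, made-up sample data.
edges = [("ATL", "ORD"), ("ORD", "ATL"), ("ATL", "LGA"),
         ("LAX", "ORD"), ("ORD", "LGA")]

def degrees(edges):
    """In-degree, out-degree, and total degree per airport, mirroring
    what GraphFrames' inDegrees / outDegrees / degrees return."""
    out_deg = Counter(src for src, _ in edges)   # outgoing flights
    in_deg = Counter(dst for _, dst in edges)    # incoming flights
    total = in_deg + out_deg                     # incoming + outgoing
    return in_deg, out_deg, total

in_deg, out_deg, total = degrees(edges)
print(total.most_common(2))  # busiest airports by total flights
```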
Carol McDonald: 43:14 Here we're combining the degrees with a filter, to filter for airports with more than 50 thousand incoming and outgoing flights. Atlanta, Chicago, and LAX have greater than 50 thousand incoming and outgoing flights. Here we're showing what are the 10 most frequent flight routes in the dataset. That's the flight route count that we'll use again later. This is the flight route count displayed in a bar graph.
Carol McDonald: 43:57 Then there's PageRank to find the most important airport. Here we used PageRank with the reset probability and the max iterations, and the result shows that Chicago and Atlanta have the highest PageRank. Then we use aggregateMessages to calculate the average departure delay. This is going to send each departure delay as a message to the destination vertex, and then the delays are aggregated into an average at each vertex. It shows that New York has the highest average departure delay.
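PageRank itself is a simple power iteration, which can be sketched in plain Python. This is a conceptual sketch of what GraphFrames computes with a reset probability and max iterations, not the GraphFrames API, and the hub-and-spoke routes here are made-up sample data.

```python
def pagerank(edges, reset_prob=0.15, max_iter=20):
    """Power-iteration PageRank sketch: each iteration, every airport
    splits its rank evenly among its outgoing routes, and each rank is
    reset_prob plus the damped sum of incoming contributions."""
    nodes = sorted({v for e in edges for v in e})
    out = {v: [d for s, d in edges if s == v] for v in nodes}
    rank = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in nodes}
        for v in nodes:
            if out[v]:
                share = rank[v] / len(out[v])
                for d in out[v]:
                    incoming[d] += share
        rank = {v: reset_prob + (1 - reset_prob) * incoming[v] for v in nodes}
    return rank

# Made-up hub-and-spoke data: three airports feed ATL, so ATL ranks highest.
edges = [("LAX", "ATL"), ("SEA", "ATL"), ("BOS", "ATL"), ("ATL", "LAX")]
rank = pagerank(edges)
print(max(rank, key=rank.get))
```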
Carol McDonald: 44:38 Then we're using that flight route count with a motif to create a subgraph. Then we're using this pattern: A flies to B, and B flies to C, but C does not fly directly back to A. Or we could also do A does not fly directly to C; it gives the same results. Then we're filtering to remove the duplicates, and that shows these airports with no direct connections. Then let's first search for direct flights between LAX and LGA, with a maxPathLength of one, which shows that there are no direct flights. Then a motif search for connections between LAX and LGA.
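The negated-edge motif can be sketched in plain Python: join the edges on the connecting airport, then keep only pairs with no direct edge between the endpoints. I'm using the "A does not fly directly to C" form of the negation here, and the routes are made-up sample data.

```python
# Toy (origin, destination) routes, made-up sample data.
edges = [("LAX", "IAH"), ("IAH", "LGA"), ("ATL", "LGA"),
         ("LAX", "ORD"), ("ORD", "LGA")]

def no_direct(edges):
    """Pairs (a, c) connected through some airport b but with no direct
    a -> c flight: a sketch of the negated motif
    "(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)" with duplicates removed."""
    direct = set(edges)
    pairs = set()
    for a, b in edges:
        for b2, c in edges:
            if b == b2 and a != c and (a, c) not in direct:
                pairs.add((a, c))
    return pairs

print(no_direct(edges))  # airports reachable only via a connection
```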
Carol McDonald: 45:24 Here we have A flies to B, and B flies to C, with A equal to LAX and C equal to LGA. That shows the connections between LAX and LGA; we see that there are flights through Houston. Next is a breadth-first search for flights between LAX and LGA with one connection. There we set the maxPathLength to two instead of one, so that it will also include connecting airports, and we also see some connections through Houston. Then, this is the shortest path from each airport to LGA. We see that there are no direct flights from LAX to LGA, for example.
Carol McDonald: 46:17 The breadth-first search from LGA direct to SFO shows that there's no direct flight from LGA to SFO. Then here we can raise the maxPathLength to get results from LGA to SFO. Here we're also specifying carrier equal to UA, and we see that from LGA to SFO you can connect through Houston. Then this is direct Seattle to LGA, and if we set the maxPathLength to two, we can find connections between Seattle and LGA. This is a more complex motif search, where we're searching for flights from LAX through Houston to LGA on the same day, just to show that you can combine these motif queries with DataFrames to narrow them down. Here we have the motif part, and then we're filtering it using DataFrames.
Carol McDonald: 47:30 Next, looking at some resources. You can get the code for this example, and more examples, in the new Spark ebook, which you can download at this URL. There are also other ebooks that you can download at mapr.com/ebooks; there's one on streaming architecture, one on machine learning logistics, and others. Also, you can learn more about Spark 2.0 at learn.mapr.com, and you can get certified with a Spark certification from MapR. There are more resources on the MapR blog, and you can run all the code in this example on the MapR data platform on the same cluster. Again, you can download the code here, and then you can run it on your MapR cluster in a cloud, on your laptop, or in your data center.
Carol McDonald: 48:45 Next I'll see if there are any questions.
David: 48:58 Thanks Carol. Why don't you take a look through the questions that have come in; there's a few good ones. Just a reminder everybody, that you can submit a question through the chat box in the lower left hand corner of your browser. Additionally, there have been a number of questions about whether we will be sending out the slides and the recording. Yes, we did record the webinar, and we will be sending out a link to the recording as well as the slides shortly after the event.
Carol McDonald: 49:30 Okay, one question is, can you use GraphFrames to create graph algorithms, or do you still need to use GraphX directly? Most of the GraphX algorithms we saw, like PageRank and aggregateMessages, are available in GraphFrames. If you need to, you can also convert from GraphFrames to GraphX, but most of it's available in GraphFrames.
Carol McDonald: 51:04 Then there was a question, could you answer some of these with standard DataFrame commands? Yes, some of these queries were possible on the DataFrame alone, but that's what's nice about combining DataFrames with GraphFrames: you can combine the graph queries with the DataFrames queries, even in the same query. That makes it very useful, and that's something you wouldn't have if you were just using a graph database, or if you were just using graph algorithms separately. You wouldn't be able to combine those together.
Carol McDonald: 51:45 Then there's a question, how do we add custom algorithms to GraphFrames? Well, you can customize somewhat using aggregateMessages, since a lot of graph algorithms are built on aggregating messages, and you can also fall back to the RDDs if you need to.
Carol McDonald: 52:05 Then there's a question about how messages are sent across vertices and edges. I'm not sure I fully understand the question, but with the message sending, for example, the messages aren't being sent using a network protocol; they're sent within the graph. You're not sending them over the network; you have everything in the graph right there. The way it works in more detail is that it uses the vertices and edges DataFrames to process that. It's kind of complicated to go into how it narrows that down, but there's more information on the GraphFrames website if you want to see how that's calculated. It's not actually sending messages over the network; it's processing them using your graph.
Carol McDonald: 53:09 Then there's a question about the X in Spark 2.x. The X means that it's version 2.0 and later. GraphFrames is available on top of Spark DataFrames, which are available in Spark 2.0 and later. It's still not part of the Spark core distribution; it's a separate package.
Carol McDonald: 53:35 Then there's a question about whether we could share the resources slide again. You're going to get all these slides sent to you, so they'll be available, and if you want to get the books, it's quite easy: it's mapr.com/ebooks.
Carol McDonald: 53:57 The total size of the graph in this example: it was 280 thousand rows in the database. That was for the year 2018, but I didn't include all the airports for 2018.
Carol McDonald: 54:16 Then there's a question about the largest graph and the cluster configuration. I don't have data for that, but basically with MapR-DB and Spark, you saw how it's going to be distributed across the cluster. With MapR-DB, there's really no limit on how much data you can put into it; you just add nodes to your cluster. But I don't have the exact numbers for that. If you want to know more about MapR-DB, there's a lot of information on the website, and you can also send me emails and ask about that.
Carol McDonald: 54:58 Okay, looks like that's the end of the questions. Thanks for attending.
David: 55:03 Great. Thank you, Carol, and thank you everyone for joining us. That is all the time we have for today. For more information on this topic and others, please visit mapr.com/resources. Thank you again, and have a great rest of your day.