Spark GraphX graph计算

1. graph计算overview

graph计算 is aprocessingstructure化data 计算model, 用于analysis实体之间 relationships. in 现实世界in, 许 many data都可以表示 for graph, 例such as社交network, 推荐system, 交通network and knowledgegraph谱etc..

1.1 graph basicconcepts

graph is 由顶点 (Vertices) and edge (Edges) 组成 datastructure, 用于表示实体之间 relationships.

顶点 (Vertex) : 表示实体, such as人, 物品, 地点etc.
edge (Edge) : 表示实体之间 relationships, such as good 友relationships, 购买relationships, 连接relationshipsetc.
property (Attribute) : 顶点 or edge on 附加information, such asuser 年龄, edge 权重etc.
has 向graph (Directed Graph) : edge has 方向 graph, such as网页链接relationships
无向graph (Undirected Graph) : edge没 has 方向 graph, such as社交networkin good 友relationships

1.2 GraphX Introduction

GraphX is Spark ecosystemin graph计算library, providing了统一 graph计算abstraction, 可以 high 效processinglarge-scalegraphdata. GraphX 结合了graphparallel计算 and dataparallel计算优势, supportgraph creation, 转换, query and algorithms执行.

2. GraphX core concepts

GraphX 引入了几个core concepts来表示 and processinggraphdata:

2.1 Graph class

GraphX in coreabstraction is Graph class, 它表示一个 has 向how heavygraph, 其in每个顶点 and edge都 has property. Graph class 定义such as under :

class Graph[VD, ED] {
  // graph 顶点collection, class型 for  RDD[(VertexId, VD)]
  val vertices: VertexRDD[VD]
  
  // graph edgecollection, class型 for  RDD[Edge[ED]]
  val edges: EdgeRDD[ED]
  
  // graph 三元组collection, class型 for  RDD[EdgeTriplet[VD, ED]]
  val triplets: RDD[EdgeTriplet[VD, ED]]
  
  // othergraphoperationmethod...
}

2.2 VertexRDD and EdgeRDD

VertexRDD[VD]: 表示graph 顶点collection, is RDD[(VertexId, VD)] optimizationversion, 其in VertexId is 顶点唯一标识符, VD is 顶点 propertyclass型
EdgeRDD[ED]: 表示graph edgecollection, is RDD[Edge[ED]] optimizationversion, 其in Edge is edge class型, ED is edge propertyclass型

2.3 EdgeTriplet

EdgeTriplet is Edge scale, 表示一个三元组 (src, dst, attr), package含sources顶点, 目标顶点 and edge property. EdgeTriplet providing了便捷访问顶点property method:

class EdgeTriplet[VD, ED] extends Edge[ED] {
  // sources顶点 property
  def srcAttr: VD
  
  // 目标顶点 property
  def dstAttr: VD
  
  // edge property
  def attr: ED
  
  // sources顶点 ID
  def srcId: VertexId
  
  // 目标顶点 ID
  def dstId: VertexId
}

3. graph creation

GraphX providing了 many 种creationgraph method, including from 顶点 and edge RDD creation, from edgelistcreationetc..

3.1 from 顶点 and edge RDD creation

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, Edge, VertexId}

// creation SparkContext
val conf = new SparkConf().setAppName("GraphXExample").setMaster("local")
val sc = new SparkContext(conf)

// creation顶点 RDD
val vertices: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 30)),
  (3L, ("Charlie", 25)),
  (4L, ("David", 35)),
  (5L, ("Eve", 27))
))

// creationedge RDD
val edges: RDD[Edge[String]] = sc.parallelize(Array(
  Edge(1L, 2L, "friend"),
  Edge(2L, 3L, "friend"),
  Edge(3L, 4L, "colleague"),
  Edge(4L, 5L, "friend"),
  Edge(5L, 1L, "colleague"),
  Edge(2L, 4L, "friend"),
  Edge(3L, 1L, "friend")
))

// creationgraph
val graph: Graph[(String, Int), String] = Graph(vertices, edges)

// 打印graph basicinformation
println(s"顶点数量: ${graph.numVertices}")
println(s"edge数量: ${graph.numEdges}")

3.2 from edgelistcreation

such as果只 has edgelist, 可以using Graph.fromEdges methodcreationgraph, 顶点property默认 for 1.0:

//  from edgelistcreationgraph
val edges: RDD[Edge[Double]] = sc.parallelize(Array(
  Edge(1L, 2L, 0.5),
  Edge(2L, 3L, 0.8),
  Edge(3L, 4L, 1.0),
  Edge(4L, 1L, 0.3)
))

val graph: Graph[Double, Double] = Graph.fromEdges(edges, defaultValue = 1.0)

4. graph basicoperation

GraphX providing了丰富 graphoperationmethod, including转换, filter, query and 计算etc..

4.1 顶点 and edge operation

// 1. 顶点operation
// 筛选年龄 big 于 28  顶点
val olderVertices = graph.vertices.filter { case (id, (name, age)) => age > 28 }
println("年龄 big 于 28  顶点:")
olderVertices.collect().foreach { case (id, (name, age)) => 
  println(s"$id: $name ($age)")
}

// 2. edgeoperation
// 筛选edgeclass型 for  "friend"  edge
val friendEdges = graph.edges.filter(e => e.attr == "friend")
println("\n朋友relationshipsedge:")
friendEdges.collect().foreach { e => 
  println(s"${e.srcId} -> ${e.dstId}: ${e.attr}")
}

// 3. 三元组operation
// 打印所 has 三元组
println("\n所 has 三元组:")
graph.triplets.collect().foreach { triplet => 
  println(s"${triplet.srcAttr._1} (${triplet.srcId})  is  ${triplet.dstAttr._1} (${triplet.dstId})   ${triplet.attr}")
}

4.2 graph转换operation

GraphX providing了 many 种graph转换operation, 可以creation new graph:

// 1. mapVertices: 转换顶点property
val ageGraph = graph.mapVertices { case (id, (name, age)) => age }
println("\n转换 for 年龄graph:")
ageGraph.vertices.collect().foreach { case (id, age) => 
  println(s"$id: $age")
}

// 2. mapEdges: 转换edgeproperty
val inverseEdgesGraph = graph.mapEdges(e => s"inverse_${e.attr}")
println("\n转换edgeproperty:")
inverseEdgesGraph.edges.collect().foreach { e => 
  println(s"${e.srcId} -> ${e.dstId}: ${e.attr}")
}

// 3. mapTriplets: 转换三元组
val weightedGraph = graph.mapTriplets(triplet => 
  if (triplet.attr == "friend") 1.0 else 0.5
)
println("\n转换 for 加权graph:")
weightedGraph.triplets.collect().foreach { triplet => 
  println(s"${triplet.srcAttr._1} -> ${triplet.dstAttr._1}: ${triplet.attr}")
}

5. 常用graphalgorithms

GraphX in 置了 many 种常用graphalgorithms, 可以直接application于graphdata:

5.1 PageRank algorithms

PageRank is a用于assessment网页 important 性 algorithms, 也可用于社交networkanalysis. GraphX providing了 PageRank algorithmsimplementation:

import org.apache.spark.graphx.lib.PageRank

// using PageRank algorithms
val ranks = graph.pageRank(tol = 0.0001).vertices

//  or 者using in 置algorithmsobject
val ranks2 = PageRank.run(graph, numIter = 10).vertices

// 打印 PageRank 结果
println("\nPageRank 结果:")
ranks.join(graph.vertices).collect().foreach { case (id, (rank, (name, age))) => 
  println(s"$name: $rank")
}

5.2 连通分量algorithms

连通分量algorithms用于找出graphin 连通子graph. GraphX providing了 ConnectedComponents algorithms:

import org.apache.spark.graphx.lib.ConnectedComponents

// 计算连通分量
val cc = ConnectedComponents.run(graph)

// 打印连通分量结果
println("\n连通分量结果:")
cc.vertices.join(graph.vertices).collect().foreach { case (id, (componentId, (name, age))) => 
  println(s"$name 属于连通分量 $componentId")
}

5.3 三角形计数algorithms

三角形计数algorithms用于计算graphin每个顶点参 and 三角形数量, 可用于社交networkanalysis:

import org.apache.spark.graphx.lib.TriangleCount

// 计算三角形数量
val triangles = TriangleCount.run(graph)

// 打印三角形计数结果
println("\n三角形计数结果:")
triangles.vertices.join(graph.vertices).collect().foreach { case (id, (count, (name, age))) => 
  println(s"$name 参 and 了 $count 个三角形")
}

5.4 最 short pathalgorithms

最 short pathalgorithms用于计算graphin顶点之间最 short path. GraphX providing了 ShortestPaths algorithms:

import org.apache.spark.graphx.lib.ShortestPaths

// 计算 from 顶点 1  to other顶点 最 short path
val shortestPaths = ShortestPaths.run(graph, Seq(1L))

// 打印最 short path结果
println("\n from 顶点 1  to other顶点 最 short path:")
shortestPaths.vertices.collect().foreach { case (id, distances) => 
  distances.get(1L) match {
    case Some(distance) => println(s"顶点 1  to 顶点 $id  最 short path long 度: $distance")
    case None => println(s"顶点 1  to 顶点 $id 没 has path")
  }
}

5. graph 持久化 and cache

for 于 many 次using graph, 可以将其持久化 to memory or diskin, 以improvingperformance:

// 持久化graph to memory
val cachedGraph = graph.cache()

//  many 次usingcache graph
val ranks = cachedGraph.pageRank(tol = 0.0001).vertices
val cc = ConnectedComponents.run(cachedGraph)
val triangles = TriangleCount.run(cachedGraph)

// 不再using时释放cache
cachedGraph.unpersist()

6. practicalapplicationcase

GraphX 可以application于 many 种practical场景, 以 under is 一些典型case:

6.1 社交networkanalysis

using GraphX 可以analysis社交networkin relationships, such as:

识别社交networkin 关键人物 (PageRank)
发现communitystructure (连通分量, tag传播)
推荐朋友 (基于共同 good 友)
analysisinformation propagationpath

6.2 推荐system

GraphX 可以用于构建基于graph 推荐system, such as:

user-物品二分graph建模
基于graph 协同filter
物品相似度计算

6.3 交通networkanalysis

GraphX 可以用于交通network analysis and optimization, such as:

最 short path计算
交通traffic预测
路线optimization

6.4 knowledgegraph谱

GraphX 可以用于构建 and queryknowledgegraph谱, such as:

实体relationships建模
knowledge推理
语义搜索

实践练习

练习1: creation社交networkgraph

creation一个package含 10 个顶点 and 20 条edge 社交networkgraph, 顶点propertyincluding姓名 and 年龄, edgeproperty表示relationshipsclass型 (friend, colleague, family) .

练习2: PageRank analysis

for creation 社交networkgraph执行 PageRank algorithms, 找出networkin 关键人物.

练习3: 连通分量analysis

计算社交networkgraph 连通分量, analysisnetwork structure.

练习4: 自定义graphalgorithms

implementation一个 simple 自定义graphalgorithms, such as计算每个顶点入度 and 出度.

练习5: practicaldata集analysis

using公开 graphdata集 (such as社交networkdata or 网页链接data) , application GraphX algorithmsforanalysis.

7. summarized

本tutorial介绍了 Spark GraphX basicconcepts, corecomponent and usingmethod, including:

graph计算 basicconcepts and application场景
GraphX core concepts (Graph, VertexRDD, EdgeRDD, EdgeTriplet)
graph creationmethod
graph basicoperation (顶点, edge, 三元组operation)
graph转换operation (mapVertices, mapEdges, mapTriplets)
常用graphalgorithms (PageRank, 连通分量, 三角形计数, 最 short path)
graph 持久化 and cache
GraphX practicalapplicationcase

GraphX providing了强 big graph计算capacity, 可以 high 效processinglarge-scalegraphdata, 适用于社交networkanalysis, 推荐system, 交通networkoptimization and knowledgegraph谱etc. many 个领域. throughLearning GraphX, 您可以Mastergraph计算 basic原理 and 实践method, for processing complex relationshipsdata打 under Basics.

on 一课: Spark MLlib advanced features 返回tutoriallist under 一课: Spark performanceoptimization and 调优