Spark GraphX graph计算
1. graph计算overview
graph计算 is aprocessingstructure化data 计算model, 用于analysis实体之间 relationships. in 现实世界in, 许 many data都可以表示 for graph, 例such as社交network, 推荐system, 交通network and knowledgegraph谱etc..
1.1 graph basicconcepts
graph is 由顶点 (Vertices) and edge (Edges) 组成 datastructure, 用于表示实体之间 relationships.
- 顶点 (Vertex) : 表示实体, such as人, 物品, 地点etc.
- edge (Edge) : 表示实体之间 relationships, such as good 友relationships, 购买relationships, 连接relationshipsetc.
- property (Attribute) : 顶点 or edge on 附加information, such asuser 年龄, edge 权重etc.
- has 向graph (Directed Graph) : edge has 方向 graph, such as网页链接relationships
- 无向graph (Undirected Graph) : edge没 has 方向 graph, such as社交networkin good 友relationships
1.2 GraphX Introduction
GraphX is Spark ecosystemin graph计算library, providing了统一 graph计算abstraction, 可以 high 效processinglarge-scalegraphdata. GraphX 结合了graphparallel计算 and dataparallel计算 优势, supportgraph creation, 转换, query and algorithms执行.
2. GraphX core concepts
GraphX 引入了几个core concepts来表示 and processinggraphdata:
2.1 Graph class
GraphX in coreabstraction is Graph class, 它表示一个 has 向how heavygraph, 其in每个顶点 and edge都 has property. Graph class 定义such as under :
class Graph[VD, ED] {
// graph 顶点collection, class型 for RDD[(VertexId, VD)]
val vertices: VertexRDD[VD]
// graph edgecollection, class型 for RDD[Edge[ED]]
val edges: EdgeRDD[ED]
// graph 三元组collection, class型 for RDD[EdgeTriplet[VD, ED]]
val triplets: RDD[EdgeTriplet[VD, ED]]
// othergraphoperationmethod...
}
2.2 VertexRDD and EdgeRDD
- VertexRDD[VD]: 表示graph 顶点collection, is RDD[(VertexId, VD)] optimizationversion, 其in VertexId is 顶点 唯一标识符, VD is 顶点 propertyclass型
- EdgeRDD[ED]: 表示graph edgecollection, is RDD[Edge[ED]] optimizationversion, 其in Edge is edge class型, ED is edge propertyclass型
2.3 EdgeTriplet
EdgeTriplet is Edge scale, 表示一个三元组 (src, dst, attr), package含sources顶点, 目标顶点 and edge property. EdgeTriplet providing了便捷 访问顶点property method:
class EdgeTriplet[VD, ED] extends Edge[ED] {
// sources顶点 property
def srcAttr: VD
// 目标顶点 property
def dstAttr: VD
// edge property
def attr: ED
// sources顶点 ID
def srcId: VertexId
// 目标顶点 ID
def dstId: VertexId
}
3. graph creation
GraphX providing了 many 种creationgraph method, including from 顶点 and edge RDD creation, from edgelistcreationetc..
3.1 from 顶点 and edge RDD creation
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, Edge, VertexId}
// creation SparkContext
val conf = new SparkConf().setAppName("GraphXExample").setMaster("local")
val sc = new SparkContext(conf)
// creation顶点 RDD
val vertices: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("Alice", 28)),
(2L, ("Bob", 30)),
(3L, ("Charlie", 25)),
(4L, ("David", 35)),
(5L, ("Eve", 27))
))
// creationedge RDD
val edges: RDD[Edge[String]] = sc.parallelize(Array(
Edge(1L, 2L, "friend"),
Edge(2L, 3L, "friend"),
Edge(3L, 4L, "colleague"),
Edge(4L, 5L, "friend"),
Edge(5L, 1L, "colleague"),
Edge(2L, 4L, "friend"),
Edge(3L, 1L, "friend")
))
// creationgraph
val graph: Graph[(String, Int), String] = Graph(vertices, edges)
// 打印graph basicinformation
println(s"顶点数量: ${graph.numVertices}")
println(s"edge数量: ${graph.numEdges}")
3.2 from edgelistcreation
such as果只 has edgelist, 可以using Graph.fromEdges methodcreationgraph, 顶点property默认 for 1.0:
// from edgelistcreationgraph val edges: RDD[Edge[Double]] = sc.parallelize(Array( Edge(1L, 2L, 0.5), Edge(2L, 3L, 0.8), Edge(3L, 4L, 1.0), Edge(4L, 1L, 0.3) )) val graph: Graph[Double, Double] = Graph.fromEdges(edges, defaultValue = 1.0)
4. graph basicoperation
GraphX providing了丰富 graphoperationmethod, including转换, filter, query and 计算etc..
4.1 顶点 and edge operation
// 1. 顶点operation
// 筛选年龄 big 于 28 顶点
val olderVertices = graph.vertices.filter { case (id, (name, age)) => age > 28 }
println("年龄 big 于 28 顶点:")
olderVertices.collect().foreach { case (id, (name, age)) =>
println(s"$id: $name ($age)")
}
// 2. edgeoperation
// 筛选edgeclass型 for "friend" edge
val friendEdges = graph.edges.filter(e => e.attr == "friend")
println("\n朋友relationshipsedge:")
friendEdges.collect().foreach { e =>
println(s"${e.srcId} -> ${e.dstId}: ${e.attr}")
}
// 3. 三元组operation
// 打印所 has 三元组
println("\n所 has 三元组:")
graph.triplets.collect().foreach { triplet =>
println(s"${triplet.srcAttr._1} (${triplet.srcId}) is ${triplet.dstAttr._1} (${triplet.dstId}) ${triplet.attr}")
}
4.2 graph转换operation
GraphX providing了 many 种graph转换operation, 可以creation new graph:
// 1. mapVertices: 转换顶点property
val ageGraph = graph.mapVertices { case (id, (name, age)) => age }
println("\n转换 for 年龄graph:")
ageGraph.vertices.collect().foreach { case (id, age) =>
println(s"$id: $age")
}
// 2. mapEdges: 转换edgeproperty
val inverseEdgesGraph = graph.mapEdges(e => s"inverse_${e.attr}")
println("\n转换edgeproperty:")
inverseEdgesGraph.edges.collect().foreach { e =>
println(s"${e.srcId} -> ${e.dstId}: ${e.attr}")
}
// 3. mapTriplets: 转换三元组
val weightedGraph = graph.mapTriplets(triplet =>
if (triplet.attr == "friend") 1.0 else 0.5
)
println("\n转换 for 加权graph:")
weightedGraph.triplets.collect().foreach { triplet =>
println(s"${triplet.srcAttr._1} -> ${triplet.dstAttr._1}: ${triplet.attr}")
}
5. 常用graphalgorithms
GraphX in 置了 many 种常用graphalgorithms, 可以直接application于graphdata:
5.1 PageRank algorithms
PageRank is a用于assessment网页 important 性 algorithms, 也可用于社交networkanalysis. GraphX providing了 PageRank algorithmsimplementation:
import org.apache.spark.graphx.lib.PageRank
// using PageRank algorithms
val ranks = graph.pageRank(tol = 0.0001).vertices
// or 者using in 置algorithmsobject
val ranks2 = PageRank.run(graph, numIter = 10).vertices
// 打印 PageRank 结果
println("\nPageRank 结果:")
ranks.join(graph.vertices).collect().foreach { case (id, (rank, (name, age))) =>
println(s"$name: $rank")
}
5.2 连通分量algorithms
连通分量algorithms用于找出graphin 连通子graph. GraphX providing了 ConnectedComponents algorithms:
import org.apache.spark.graphx.lib.ConnectedComponents
// 计算连通分量
val cc = ConnectedComponents.run(graph)
// 打印连通分量结果
println("\n连通分量结果:")
cc.vertices.join(graph.vertices).collect().foreach { case (id, (componentId, (name, age))) =>
println(s"$name 属于连通分量 $componentId")
}
5.3 三角形计数algorithms
三角形计数algorithms用于计算graphin每个顶点参 and 三角形数量, 可用于社交networkanalysis:
import org.apache.spark.graphx.lib.TriangleCount
// 计算三角形数量
val triangles = TriangleCount.run(graph)
// 打印三角形计数结果
println("\n三角形计数结果:")
triangles.vertices.join(graph.vertices).collect().foreach { case (id, (count, (name, age))) =>
println(s"$name 参 and 了 $count 个三角形")
}
5.4 最 short pathalgorithms
最 short pathalgorithms用于计算graphin顶点之间 最 short path. GraphX providing了 ShortestPaths algorithms:
import org.apache.spark.graphx.lib.ShortestPaths
// 计算 from 顶点 1 to other顶点 最 short path
val shortestPaths = ShortestPaths.run(graph, Seq(1L))
// 打印最 short path结果
println("\n from 顶点 1 to other顶点 最 short path:")
shortestPaths.vertices.collect().foreach { case (id, distances) =>
distances.get(1L) match {
case Some(distance) => println(s"顶点 1 to 顶点 $id 最 short path long 度: $distance")
case None => println(s"顶点 1 to 顶点 $id 没 has path")
}
}
5. graph 持久化 and cache
for 于 many 次using graph, 可以将其持久化 to memory or diskin, 以improvingperformance:
// 持久化graph to memory val cachedGraph = graph.cache() // many 次usingcache graph val ranks = cachedGraph.pageRank(tol = 0.0001).vertices val cc = ConnectedComponents.run(cachedGraph) val triangles = TriangleCount.run(cachedGraph) // 不再using时释放cache cachedGraph.unpersist()
6. practicalapplicationcase
GraphX 可以application于 many 种practical场景, 以 under is 一些典型case:
6.1 社交networkanalysis
using GraphX 可以analysis社交networkin relationships, such as:
- 识别社交networkin 关键人物 (PageRank)
- 发现communitystructure (连通分量, tag传播)
- 推荐朋友 (基于共同 good 友)
- analysisinformation propagationpath
6.2 推荐system
GraphX 可以用于构建基于graph 推荐system, such as:
- user-物品二分graph建模
- 基于graph 协同filter
- 物品相似度计算
6.3 交通networkanalysis
GraphX 可以用于交通network analysis and optimization, such as:
- 最 short path计算
- 交通traffic预测
- 路线optimization
6.4 knowledgegraph谱
GraphX 可以用于构建 and queryknowledgegraph谱, such as:
- 实体relationships建模
- knowledge推理
- 语义搜索
实践练习
练习1: creation社交networkgraph
creation一个package含 10 个顶点 and 20 条edge 社交networkgraph, 顶点propertyincluding姓名 and 年龄, edgeproperty表示relationshipsclass型 (friend, colleague, family) .
练习2: PageRank analysis
for creation 社交networkgraph执行 PageRank algorithms, 找出networkin 关键人物.
练习3: 连通分量analysis
计算社交networkgraph 连通分量, analysisnetwork structure.
练习4: 自定义graphalgorithms
implementation一个 simple 自定义graphalgorithms, such as计算每个顶点 入度 and 出度.
练习5: practicaldata集analysis
using公开 graphdata集 (such as社交networkdata or 网页链接data) , application GraphX algorithmsforanalysis.
7. summarized
本tutorial介绍了 Spark GraphX basicconcepts, corecomponent and usingmethod, including:
- graph计算 basicconcepts and application场景
- GraphX core concepts (Graph, VertexRDD, EdgeRDD, EdgeTriplet)
- graph creationmethod
- graph basicoperation (顶点, edge, 三元组operation)
- graph转换operation (mapVertices, mapEdges, mapTriplets)
- 常用graphalgorithms (PageRank, 连通分量, 三角形计数, 最 short path)
- graph 持久化 and cache
- GraphX practicalapplicationcase
GraphX providing了强 big graph计算capacity, 可以 high 效processinglarge-scalegraphdata, 适用于社交networkanalysis, 推荐system, 交通networkoptimization and knowledgegraph谱etc. many 个领域. throughLearning GraphX, 您可以Mastergraph计算 basic原理 and 实践method, for processing complex relationshipsdata打 under Basics.