SSRW: A Scalable Algorithm for Estimating Graphlet Statistics Based on Random Walk

Paper ID: 99

Abstract. Mining graphlet statistics is very meaningful due to its wide applications in social networks, bioinformatics, information security, etc. However, it is a big challenge to count graphlet statistics exactly, as the number of subgraphs increases exponentially with the graph size, so sampling algorithms are widely used to estimate graphlet statistics within reasonable time. However, existing sampling algorithms are not scalable for large graphlets; e.g., they may get stuck when estimating graphlets with more than five nodes. To address this issue, we propose a highly scalable algorithm, Scalable subgraph Sampling via Random Walk (SSRW), for estimating graphlet counts and concentrations. SSRW samples graphlets by generating new nodes from the neighbors of previously visited nodes instead of from a fixed one. Thanks to this flexibility, we can generate any k-graphlet in a unified way and estimate k-graphlet statistics efficiently even for large k. Our extensive experiments on estimating counts and concentrations of {4,5,6,7}-graphlets show that SSRW is scalable, accurate and fast.

1 Introduction

Graphlets are small connected induced subgraphs of large graphs, and they are widely studied in many applications such as social networks [10,16], bioinformatics [15,19], and information security [17]. For example, triangle counting is a classic problem in the study of social networks; it reflects the closeness between members and can be used for friend recommendation.

Graphlet concentration is the relative frequency of a specific graphlet within a set of graphlets, such as the clustering coefficient. Graphlet concentrations can be used for large-graph comparison [24]. Counting graphlets is a more general analysis task with wide applications [13], including deducing graphlet concentrations.
A naive approach to obtaining graphlet statistics is exact counting, which enumerates all connected subgraphs of a given size and then derives the statistics by counting the number of subgraphs isomorphic to each graphlet. However, exact counting is time-consuming and inefficient even for graphs of moderate size. For example, it may take more than ten days to obtain the exact number of 5-graphlets of the Flickr graph with 240K nodes using ESCAPE [18], a state-of-the-art exact counting algorithm. To the best of our knowledge, there is still a lack of efficient tools for obtaining the statistics of graphlets with more than five nodes in large graphs, due to the extremely large number of subgraphs and the complicated structures of graphlets.
Sampling is a common approach to reduce the time complexity of estimating graphlet concentrations [9,23]. Recent works are mainly based on random walks because of their simplicity and stability; they sample k-graphlets in k consecutive steps of a random walk. However, some k-graphlets cannot be sampled in such a way, and these graphlets must be sampled with other sampling methods or by utilizing the relationships between graphlets [5,8]. There are also other sampling algorithms that first sample a specific substructure in a graph, such as a path or a star, and then estimate the statistics of unsampled graphlets using the combinatorial relationships between different graphlets [11,20,26]. Thus, existing sampling approaches do not scale well to large graphlets in very large graphs.

To address the scalability issue when sampling large graphlets, some works run a random walk on a super graph to obtain samples [4,5,23,25], where each super node in the super graph corresponds to a k-connected subgraph. Sampling on the super graph makes it easier to sample a k-subgraph than on the original graph, but the size of a k-super graph is much larger than the original graph, so populating neighbors is still inefficient when k ≥ 3, making random walks on a super graph hard to mix quickly. Therefore, sampling with random walks on super graphs is still inefficient, especially for large graphlets.

In this paper, we propose a new algorithm, Scalable Subgraph Sampling via Random Walk (SSRW), to efficiently estimate the concentrations of large graphlets. We develop a new sampling strategy for SSRW by introducing randomness into the selection of new nodes when generating a sample, say a k-subgraph sample, to ensure that the sample set contains all desired graphlets. The sampling strategy is general and thus easily scales to sampling large subgraphs.
We summarize our contributions as follows.

– A highly scalable subgraph sampling strategy. We develop a novel strategy to quickly sample large subgraphs based on random walk. With our strategy, all graphlets can be sampled in a unified way. To get a subgraph sample, we generate new nodes from the neighbors of a set of previously generated nodes instead of a fixed one, which makes it possible to generate all types of subgraphs. Our strategy is simple, easily scales to graphlets with more than 5 nodes in large graphs, and is also easy to parallelize to further improve its efficiency.

– A framework to estimate large graphlet counts and concentrations. Based on our subgraph sampling strategy, we design a new sampling algorithm, SSRW, to estimate graphlet counts and concentrations in large graphs accurately and efficiently. SSRW provides an efficient alternative for large graphs whose exact graphlet statistics cannot be derived in reasonable time.

– Extensive experiments. We implement our sampling algorithm for estimating the concentrations and counts of {4,5,6,7}-graphlets. Extensive experiments show that our algorithm is accurate, scalable and fast. For example, when estimating 4-graphlet concentrations, the Normalized Root Mean Square Errors (NRMSEs) of the 3-star and 4-clique are smaller than 0.01 and 0.02, respectively, on com-Amazon with 334K nodes. Besides, our algorithm
can also estimate 6- and 7-graphlet counts efficiently, which are rarely addressed before; the NRMSEs of the prescribed 6-graphlets are smaller than 0.1 on Gnutella04 with 10.8K nodes, and smaller than 0.15 for the prescribed 7-graphlets on ca-GrQc with 4.1K nodes. We get a 50× speedup when we run our algorithm in parallel.

2 Preliminaries

2.1 Notations and Definitions

We denote a graph as G = (V, E), where V and E are the sets of nodes and edges, respectively. For a node v ∈ V, N(v) is the set of v's neighbors, and d(v) = |N(v)| is the degree of v. Given two nodes v1 and v2, we use (v1, v2) to denote the undirected edge between v1 and v2. A graph G1 = (V1, E1) is a k-subgraph of G if V1 ⊆ V, E1 ⊆ E and |V1| = k. If every edge whose both ends are in V1 is also in E1, then we say G1 is an induced subgraph of G.

Two graphs G = (V, E) and G' = (V', E') are isomorphic, denoted as G ≃ G', if there is a bijection φ: V → V' such that for any two nodes v1, v2 ∈ V, there is an edge (v1, v2) ∈ E if and only if there is an edge (φ(v1), φ(v2)) ∈ E'.

[Fig. 1: All {3,4,5}-graphlets, where g^k_i denotes the k-graphlet of type i.]

Graphlets are small, non-isomorphic and connected induced subgraphs of G. As shown in Fig. 1, there are two kinds of 3-graphlets, called the 2-path (g^3_1) and the triangle (g^3_2), as well as 6 different 4-graphlets and 21 different 5-graphlets. We note that the number of distinct graphlets grows rapidly with the graphlet size; in particular, the numbers of 6-graphlets and 7-graphlets increase to 112 and 853, respectively. In this paper, we focus on {4,5,6,7}-graphlets.

Definition of graphlet concentration: Given a graph G and a family of k-graphlets G_k = {g^k_1, g^k_2, ..., g^k_{T_k}}, let C^k_i be the number of subgraphs in G which are isomorphic to g^k_i, i.e., the count of g^k_i. The graphlet concentration c^k_i of g^k_i is defined as

c^k_i = C^k_i / Σ_{t=1}^{T_k} C^k_t,  for i = 1, 2, ..., T_k.
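As a toy illustration of this definition, the following sketch (plain Python, with hypothetical counts, not tied to any real graph) derives concentrations from a vector of counts C^k_1, ..., C^k_{T_k}:

```python
def concentrations(counts):
    """Compute c_i = C_i / sum_t C_t for one family of k-graphlets."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical 3-graphlet counts: 8 2-paths (g_1) and 2 triangles (g_2).
print(concentrations([8, 2]))  # [0.8, 0.2]
```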
2.2 Random Walk

A simple random walk on a graph runs as follows. It starts from an initial node, chooses the next node uniformly from the current node's neighbors and moves to it, and repeats this process until some stopping criterion is reached. The transition probability matrix is defined as P = (p(i, j))_{n×n}, with

p(i, j) = 1/d(v_i) if (v_i, v_j) ∈ E, and 0 otherwise,

where n is the number of nodes in the graph and p(i, j) is the probability of transiting from v_i to v_j. The simple random walk has a unique stationary distribution π with π(v) ∝ d(v) [3], denoting the probability of node v ∈ V being visited.

3 The Scalable Subgraph Sampling Algorithm

Our algorithm is based on a random walk running on the original graph to generate k-subgraphs as samples. Specifically, once the random walk mixes, every node in the walk serves as a starting node and induces a k-subgraph sample under our sampling strategy. During the sampling process, new nodes are chosen from the neighbors of multiple previously visited nodes instead of only one fixed node. In this way, the selection of a new node is more flexible, which enables us to obtain all k-graphlets in a unified way. Finally, we correct the bias according to the probabilities of the samples and their repeat coefficients.

In the following, we first illustrate the core idea of our algorithm, i.e., the sampling strategy, then describe the method of calculating the repeat coefficient α^k_i needed for an unbiased estimator, and finally summarize the framework of our algorithm, called SSRW.

3.1 Sampling Strategy for k-Subgraphs

Note that all k-graphlets contain a 2-path, which implies that we can obtain samples of every graphlet by starting with a 2-path. Specifically, from a starting node, we first form a 2-path by choosing the second node uniformly from the neighbors of the starting node. For 3 ≤ t ≤ k, the t-th node v_t is selected from the neighbors of the previous t−2 nodes, i.e., all previously chosen nodes except the first one.
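This selection rule can be sketched as follows; a minimal illustration assuming the graph is stored as an adjacency-list dict (the names here are ours, not the paper's implementation). The candidate pool is the concatenation of neighbor lists, i.e., a multiset, so a sampled tuple may contain repeated nodes; such tuples are simply discarded later.

```python
import random

def sample_k_subgraph(adj, v1, k):
    """Sample a k-tuple: v2 uniformly from N(v1); each later node uniformly
    from the concatenation (multiset) of N(v2), ..., N(v_{t-1})."""
    nodes = [v1, random.choice(adj[v1])]
    for _ in range(3, k + 1):
        candidates = []
        for v in nodes[1:]:            # all chosen nodes except the first
            candidates.extend(adj[v])  # concatenation keeps duplicates
        nodes.append(random.choice(candidates))
    return tuple(nodes)

# Toy graph: a triangle {0,1,2} with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(sample_k_subgraph(adj, 0, 4))  # e.g. (0, 1, 2, 3); nodes may repeat
```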
Basically, such a way of generating k-subgraphs, by sampling a 2-path and choosing the remaining nodes uniformly at random from the neighbors of multiple nodes, helps our sampling method expand easily and quickly in the graph G, and it is beneficial for sampling all k-graphlets in a unified way and for scaling the algorithm to higher-order graphlet sampling.

Take 4-graphlets as an example to further explain the sampling process. We first select a starting node, such as a in Fig. 2(a), from a random walk of the graph at the first step, then choose the second node uniformly from a's neighbors, say b, and then continue to choose the third node uniformly from b's neighbors,
say c. Now we have chosen a, b, c in the first three steps and obtained a 2-path a−b−c, as shown in Fig. 2(a). If we insist on selecting the fourth node from the neighbors of the third node (i.e., c), we can never get a 3-star, i.e., g^4_2. However, if we choose the fourth node from the neighbors of both b and c, then we can get a 3-star by choosing e from b's neighbors, and we can also get a 3-path a−b−c−u by choosing u from c's neighbors. So all 4-graphlets in G can be generated.

As for 5-graphlets, the first four nodes are generated as above, and the fifth node can be chosen from the neighbors of all previously visited nodes except the starting node, so all 5-graphlets can be generated. As shown in Fig. 2(b), suppose that we have generated four nodes a, b, c, u; we choose the fifth node from the neighbors of b, c and u. If, for example, we choose e from b's neighbors, then we get a subgraph isomorphic to g^5_8 with node set {a, b, c, u, e}.

[Fig. 2: Examples of generating 4- and 5-graphlets. (a) 4-graphlet; (b) 5-graphlet.]

Algorithm 1: Sampling a k-Subgraph
Input: starting node v_1
Output: k-tuple (v_1, v_2, ..., v_k)
1: v_2 ← random node in N(v_1);
2: for i in [3, ..., k] do
3:   v_i ← random node in N(v_2), N(v_3), ..., or N(v_{i−1});
4: end for
5: return the k-tuple (v_1, v_2, v_3, ..., v_k).

Algorithm 1 shows our sampling method for k-subgraphs. To select v_t, we build a multiset of nodes by concatenating the neighbors of v_2, ..., v_{t−1} and uniformly select a node from this multiset. In the example of generating 4-graphlets in Fig. 2(a), in the final step we choose the fourth node from the neighbors of b and c, which form a multiset {a, c, f, e, b, f, g, u} of size d(b) + d(c); notice that node f appears twice. If we choose e from b's neighbors, we get a 3-star, and we get a 3-path (a, b, c, u) if we choose u from c's neighbors. In this way, all 4-graphlets in G can be generated in a unified way. To identify from whose neighbors the fourth node is selected, we use pairs to denote all possible selections. For example, in Fig.
2(a), we use {⟨b, a⟩, ⟨b, c⟩, ⟨b, e⟩, ⟨b, f⟩} ∪ {⟨c, b⟩, ⟨c, f⟩, ⟨c, g⟩, ⟨c, u⟩} to denote all choices for selecting the fourth node, where ⟨v_i, v_j⟩ means that node v_j is generated from the neighbors of v_i. We select a
node v_t from the concatenation of N(v_2), ..., N(v_{t−1}) rather than directly from N(v_2) ∪ N(v_3) ∪ ... ∪ N(v_{t−1}), because the former allows unbiased estimation of graphlet concentrations and counts based on the distribution of node degrees, which will be explained later in detail.

Theorem 1. Algorithm 1 can generate all k-graphlets in graph G for k ≥ 2.

Proof. Given a graph G and an integer k ≥ 2, we show that every k-graphlet g^k_i can be sampled by Algorithm 1 as long as it exists in G.

Given a graphlet g^k_i, since it is connected, it has a spanning tree, say T. Take a leaf node v_1 of T as the starting node of sampling. We then choose the unique neighbor of v_1 in T as v_2. Note that T is a connected graph with k nodes, so some of the remaining k−2 nodes are connected to {v_1, v_2}; more specifically, they are neighbors of v_2, since v_1 is a leaf. We choose one of them as v_3. Similarly, some of the remaining k−3 nodes are connected to v_2 or v_3, so we can continue to choose a node from the neighbors of v_2 or v_3 as the fourth node by our sampling strategy, and so on. At each step of the sampling process, we add a new node and an edge to the sampled subgraph. Thus, throughout the sampling process, the sampled subgraph stays connected, and the number of its nodes equals the number of its edges plus one. After k steps, the sampled subgraph is exactly T. Because T is a spanning tree of g^k_i, g^k_i is a subgraph of the subgraph induced by V(T) (the node set of T). So g^k_i can be sampled via T, which is a possible sampling of Algorithm 1. Hence Algorithm 1 can generate g^k_i.

3.2 Computation of the Repeat Coefficient

There may be multiple ways to sample a graphlet, and the numbers of ways for sampling different graphlets are usually different, resulting in biases among the samples of different graphlets.
To achieve unbiased estimation of graphlet concentrations and counts, we need to calculate the repeat coefficient α^k_i of each graphlet g^k_i, i.e., the number of ways it can be sampled.

We denote a sampling way of a k-graphlet as a tuple of k−1 pairs, such as ⟨⟨v_1, v_2⟩, ⟨v_2, v_3⟩, ..., ⟨v_{k−1}, v_k⟩⟩, which means that v_1 is the starting node and v_i is selected from the neighbors of v_{i−1} for 2 ≤ i ≤ k. The 4-graphlet in Fig. 3 can be sampled in multiple ways, such as ⟨⟨v_1, v_2⟩, ⟨v_2, v_3⟩, ⟨v_3, v_4⟩⟩ or ⟨⟨v_1, v_2⟩, ⟨v_2, v_3⟩, ⟨v_2, v_4⟩⟩, etc.

[Fig. 3: Tailed triangle.]

We can compute α^k_i by first enumerating all sequences of k nodes, then calculating the number of ways of sampling g^k_i corresponding to each sequence, and at last
aggregating the ways over all sequences. For example, the tailed triangle g^4_4 (Fig. 3) has 4 nodes; there are 24 sequences of 4 nodes, and there are two ways to sample g^4_4 corresponding to the sequence ⟨v_1, v_2, v_3, v_4⟩. Since the t-th (4 ≤ t ≤ k) node v_t comes from the neighbors of any one of the second, third, ..., (t−1)-th nodes, let v_{s_t} be the node from whose neighbors v_t is chosen. We can then denote a sampling way of a k-graphlet as a tuple of k−1 pairs ⟨⟨v_1, v_2⟩, ⟨v_2, v_3⟩, ⟨v_{s_4}, v_4⟩, ..., ⟨v_{s_k}, v_k⟩⟩, which means that v_1 is the starting node and v_i is selected from the neighbors of v_{s_i} for 4 ≤ i ≤ k. The two ways corresponding to the sequence ⟨v_1, v_2, v_3, v_4⟩ are then ⟨⟨v_1, v_2⟩, ⟨v_2, v_3⟩, ⟨v_3, v_4⟩⟩ and ⟨⟨v_1, v_2⟩, ⟨v_2, v_3⟩, ⟨v_2, v_4⟩⟩, while there is no sampling way corresponding to the sequence ⟨v_3, v_1, v_2, v_4⟩. In total, there are 10 different ways to generate g^4_4. We define

A(u, v) = 1 if (u, v) ∈ E, and 0 otherwise.

In general, given a sequence (w_1, w_2, ..., w_k) of k nodes of a graphlet, when a new node w_z (z ≥ 3) is generated, it can be derived from the neighbors of Σ_{u=2}^{z−1} A(w_z, w_u) different generated nodes in our strategy. So the number of ways of generating a graphlet corresponding to the sequence (w_1, w_2, ..., w_k) is

f(w_1, w_2, ..., w_k) = A(w_1, w_2) · Π_{z=3}^{k} ( Σ_{u=2}^{z−1} A(w_z, w_u) ).

For example, there are 1 · 1 · (1 + 1) = 2 ways to generate g^4_4 corresponding to the sequence (v_1, v_2, v_3, v_4), as shown in Fig. 3.

3.3 Unbiased Estimator

We use a simple random walk to generate starting nodes for sampling, because it needs only a little information about the graph, and in real-world networks most nodes usually lie in the largest connected component. Once the random walk has mixed, the probability of visiting node v is P(v) = d(v)/D, where D is the sum of the degrees of all nodes in G.

Let R^(k) be the set of all sequences of k nodes and r^(k) ∈ R^(k). The sequence r^(k) = (v_1, v_2, ...
, v_k) is generated with probability

P(r^(k)) = (1/D) · (1/d(v_2)) · (1/[d(v_2) + d(v_3)]) · ... · (1/[d(v_2) + ... + d(v_{k−1})]).

Now we define L(r^(k), g^k_i) as follows:

L(r^(k), g^k_i) = 1 if s(r^(k)) ≃ g^k_i, and 0 otherwise,

where s(r^(k)) represents the subgraph induced by r^(k).

Theorem 2. The unbiased estimator for C^k_i, the count of graphlet g^k_i, is

Ĉ^k_i = (1/(n α^k_i)) Σ_{t=1}^{n} L(r^(k)_t, g^k_i) / P(r^(k)_t),

where r^(k)_t is the t-th sampled sequence.
Proof. Every subgraph of G isomorphic to graphlet g^k_i can be found in α^k_i ways by our strategy, so Σ_{r^(k)∈R^(k)} L(r^(k), g^k_i) = α^k_i C^k_i. By the linearity of expectation, we have

E[Ĉ^k_i] = E[(1/(n α^k_i)) Σ_{t=1}^{n} L(r^(k)_t, g^k_i) / P(r^(k)_t)]
 = (1/(n α^k_i)) Σ_{t=1}^{n} E[L(r^(k)_t, g^k_i) / P(r^(k)_t)]
 = (1/(n α^k_i)) Σ_{t=1}^{n} Σ_{r^(k)∈R^(k)} (L(r^(k), g^k_i) / P(r^(k))) · P(r^(k))
 = (1/(n α^k_i)) Σ_{t=1}^{n} α^k_i C^k_i = C^k_i.

Therefore, by the Law of Large Numbers, we can obtain an unbiased estimator of the graphlet concentrations without any global information about the network, such as D. Let

F(r^(k)) = d(v_2) · [d(v_2) + d(v_3)] · ... · [d(v_2) + ... + d(v_{k−1})],

so that P(r^(k)) = (1/D) · (1/F(r^(k))). By the definition of concentration, we have

( Σ_{t=1}^{n} (1/α^k_i) F(r^(k)_t) L(r^(k)_t, g^k_i) ) / ( Σ_{j=1}^{T_k} Σ_{t=1}^{n} (1/α^k_j) F(r^(k)_t) L(r^(k)_t, g^k_j) ) → c^k_i,  as n → ∞.

Algorithm 2: SSRW for estimating graphlet counts
Input: graph G = (V, E), sample number n, graphlet size k, sum of degrees D
Output: estimated k-graphlet counts of G
1: pick s as a starting node once the random walk has mixed;
2: for i = 1 to n do
3:   v_1 ← s;
4:   generate a k-tuple (v_1, v_2, ..., v_k) starting from v_1 using Algorithm 1;
5:   if v_1, v_2, ..., v_k are all distinct then
6:     id ← id of the graphlet induced by (v_1, v_2, ..., v_k);
7:     Ĉ^k_id ← Ĉ^k_id + (d(v_2)[d(v_2) + d(v_3)] ... [d(v_2) + ... + d(v_{k−1})] / α^k_id) · (D/n);
8:   end if
9:   s ← random node in N(s);
10: end for
11: return (Ĉ^k_1, Ĉ^k_2, ..., Ĉ^k_{T_k}).

3.4 Framework to Estimate Graphlet Counts and Bound Analysis

Based on the methods and theoretical analysis in the previous sections, we show our framework for estimating graphlet counts in Algorithm 2. Our algorithm starts generating samples from the starting nodes in a random walk once the random walk has mixed. The k-subgraph samples for estimating k-graphlet counts are generated by Algorithm 1. Every time we get a valid k-graphlet sample of type g^k_i, we add the corresponding reweighting coefficient (F(r^(k))/α^k_i) · (D/n) to Ĉ^k_i, and finally obtain the estimated graphlet counts.

To speed up our algorithm, we consider running it in parallel. Running many random walks at the same time helps us get more samples and thus obtain a more accurate estimation. Every processing unit samples independently.
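As a concrete check of the repeat-coefficient computation in Section 3.2, the following sketch (our own illustration, not the paper's code) evaluates f over node sequences and aggregates α for the tailed triangle of Fig. 3, with v_1 the tail attached to the triangle {v_2, v_3, v_4}:

```python
from itertools import permutations

def f(seq, edges):
    """Number of sampling ways for node sequence seq = (w1, ..., wk):
    A(w1, w2) * prod_{z=3..k} sum_{u=2..z-1} A(wz, wu)."""
    A = lambda u, v: 1 if (u, v) in edges or (v, u) in edges else 0
    ways = A(seq[0], seq[1])
    for z in range(2, len(seq)):
        ways *= sum(A(seq[z], seq[u]) for u in range(1, z))
    return ways

def repeat_coefficient(nodes, edges):
    """alpha: total sampling ways, aggregated over all k! node sequences."""
    return sum(f(p, edges) for p in permutations(nodes))

# Tailed triangle of Fig. 3: pendant edge 1-2 plus triangle {2, 3, 4}.
edges = {(1, 2), (2, 3), (2, 4), (3, 4)}
print(f((1, 2, 3, 4), edges))                   # 2 ways for this sequence
print(repeat_coefficient([1, 2, 3, 4], edges))  # 10 ways in total
```

This reproduces the worked numbers above: 2 ways for the sequence ⟨v_1, v_2, v_3, v_4⟩, none for ⟨v_3, v_1, v_2, v_4⟩, and 10 in total.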
We derive the unbiased estimation of the target graphlet statistics by averaging the unbiased estimations from all processing units.

Now we discuss the relation between the number of samples and the accuracy. Define T = τ(ξ) as the ξ-mixing time of the walk for ξ ≤ 1/8. Let v_1, v_2, ..., v_n denote an n-step random walk starting from an initial distribution φ, let ‖φ‖_π = Σ_{v∈V} φ²(v)/π(v), and let M^k_i = max_{r^(k)∈R^(k)} D · L(r^(k), g^k_i) / (P(r^(k)) α^k_i). We have the following theorem.

Theorem 3. There exists a bound B = 72 T M^k_i log(c‖φ‖_π/δ) / (ε C^k_i)² such that when n > B we have P(|Ĉ^k_i − C^k_i| ≤ ε C^k_i) ≥ δ for a given accuracy parameter ε and confidence level δ.

The detailed proof is given in [1]. So the accuracy of our algorithm is related to the properties of the graph and the initial state of the random walk.

4 Experiments

Our algorithm makes a big advance in estimating the statistics of graphlets with more than 5 nodes, so we show the experimental results for these higher-order graphlets first. With the help of parallelization, we can get massive numbers of samples in a short time, leading to higher accuracy; we show these results in Section 4.3. Finally, we compare the accuracy and running time with the state-of-the-art algorithms.

4.1 Experiment Setup

The datasets used in our experiments come from the Stanford Network Analysis Project (SNAP) [12] and Network Repository [22], and are listed in Table 1. We implemented our algorithm in C++ for graphlets with fewer than 6 nodes. For testing 6- and 7-graphlets, we implemented our algorithm in Python with the iGraph library.
Our algorithm is based on a simple random walk, so we take the largest connected component (LCC) of each graph and remove edge directions and self-loops.

Table 1: Datasets used in the experiments
Graph             | |V|       | |E|
LiveJournal [12]  | 4,843,953 | 42,851,237
com-Amazon [12]   | 334,862   | 925,870
socfb-Penn94 [22] | 41,536    | 1,362,180
Slashdot [12]     | 77,360    | 417,759
Gnutella04 [12]   | 10,876    | 39,994
Gnutella06 [12]   | 8,717     | 31,525
ca-GrQc [12]      | 4,158     | 13,422

We use the normalized root mean square error (NRMSE) to evaluate the accuracy of SSRW and other algorithms, defined as

NRMSE(Ĉ^k_i) = sqrt(E[(Ĉ^k_i − C^k_i)²]) / C^k_i = sqrt(Var[Ĉ^k_i] + (E[Ĉ^k_i − C^k_i])²) / C^k_i,

where Ĉ^k_i is the estimated value and C^k_i is the ground truth.
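Given repeated independent runs, the NRMSE metric can be computed as in the following sketch (`estimates` is a list of independent Ĉ values from simulated runs and `truth` the exact count; the numbers are illustrative only):

```python
def nrmse(estimates, truth):
    """sqrt(E[(C_hat - C)^2]) / C, estimated over independent runs."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return mse ** 0.5 / truth

# Unbiased but noisy estimates around a hypothetical true count of 100.
print(round(nrmse([90, 110, 100, 100], 100), 3))  # 0.071
```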
4.2 Estimation of 6- and 7-Graphlet Statistics

To show the scalability of our algorithm, we use SSRW to estimate 6- and 7-graphlet counts. In this experiment, we choose medium-sized graphs, as there is no efficient tool for calculating the exact counts of 6- and 7-graphlets in large graphs.

Table 2: Performance on 6- and 7-graphlets (NRMSE)
Graphlet id | Gnutella04 (C^6_i) | Gnutella06 (C^6_i) | ca-GrQc (C^7_i)
1 | 0.0511 | 0.0604 | 0.0492
2 | 0.0452 | 0.0512 | 0.0520
3 | 0.0856 | 0.0306 | 0.0893
4 | 0.0428 | 0.0429 | 0.1400
5 | 0.0410 | 0.0367 | 0.1103
6 | 0.0233 | 0.0278 | 0.0815
7 | 0.0326 | 0.0340 | 0.0701

Table 2 illustrates the accuracy of our algorithm for 6- and 7-graphlet count estimation. We draw 500K samples, repeat 1000 times to compute the NRMSE, and choose seven graphlets to demonstrate. All NRMSEs are less than 0.1 for 6-graphlets and less than 0.15 for 7-graphlets. Getting 500K 6-graphlet samples takes our algorithm 44.9 seconds on Gnutella04 and 44.3 seconds on Gnutella06, while getting 500K 7-graphlet samples takes 136 seconds on ca-GrQc. Parallelization shortens the running time greatly: if we draw the 500K samples on Gnutella04 with 5 processing units, each collecting 100K samples, it takes only 9.8 seconds. We show the performance of parallelization in more detail in the next subsection.

To show that our algorithm also performs well on large graphs, we report the count estimations for Slashdot, which has 77K nodes and 417K edges. As far as we know, there is no practical tool for counting all 6- and 7-graphlets in graphs as big as Slashdot. We draw 500K samples, which takes 104.8 seconds for 6-graphlet estimation and 165.6 seconds for 7-graphlet estimation. Even though we have no exact values for comparison, the unbiasedness of our algorithm and the previous experiments make us believe that these estimates are very close to the ground truth.

4.3 Parallelization

Parallelization helps us obtain a massive number of samples in a short time, which greatly improves accuracy.
We parallelize our algorithm on an Intel Xeon system with 256 CPUs at 2.20 GHz and 512 GB of memory.

We run parallelized SSRW to estimate the counts of 6-graphlets in Gnutella04, varying the number of processes; each process collects 10K samples. Fig. 5(a) shows the relation between the time consumed and the number of processes used. The average NRMSE over the 7 chosen graphlets is 0.077 for 100K samples. When we run 10 processes in parallel, which gives 10K × 10 = 100K samples, the time overhead increases by only 41.7% while the average NRMSE drops to 0.042.
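The scheme above, independent walkers whose unbiased estimates are averaged, can be sketched as follows. This is a toy stand-in: `walker_estimate` is a hypothetical placeholder for one full SSRW run, and a thread pool stands in for the processing units; the point is only that the average of independent unbiased estimates is itself unbiased.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def walker_estimate(seed):
    """Hypothetical stand-in for one independent SSRW run: each worker
    would run its own random walk and return an unbiased count estimate."""
    rng = random.Random(seed)
    return 1000 + rng.uniform(-50, 50)  # placeholder for a real estimate

def parallel_estimate(num_workers):
    """Run the workers concurrently and average their estimates."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        estimates = list(pool.map(walker_estimate, range(num_workers)))
    return sum(estimates) / len(estimates)

print(parallel_estimate(4))
```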
[Fig. 4: Estimation of 6- and 7-graphlet counts on Slashdot. (a) 6-graphlets; (b) 7-graphlets.]

When we run 100 processes in parallel, the time only doubles, but the average NRMSE drops to 0.035, a reduction of 50%, as shown in Fig. 5(b).

[Fig. 5: Runtime and estimation accuracy for 6-graphlet estimation on Gnutella06. (a) Runtime and avg. NRMSE; (b) NRMSE vs. number of samples.]

We also evaluate our parallelized algorithm on a larger graph, LiveJournal, for estimating 5-graphlet counts. Fig. 6 shows that the accuracy increases with the number of samples. When we run 100 processes in parallel and every thread collects 500K samples, we get 5M samples in total. Except for g^5_11, all NRMSEs are smaller than 0.1, and the running time is 43.9 seconds, while the state-of-the-art algorithm ESCAPE costs 28,385 seconds to obtain the exact 5-graphlet counts.

4.4 Comparison with Previous Work

To further demonstrate the performance of SSRW, we compare it with the state-of-the-art algorithms Subgraph Random Walk 2 with Corresponding States for a Subgraph (SRW2CSS) [5] and Waddling Random Walk (WRW) [8] on the accuracy of estimating graphlet concentrations. These algorithms are mainly designed to estimate the concentrations of graphlets with fewer than 6 nodes. Count estimation can derive concentrations directly, so we compare the results of 4- and 5-graphlet concentrations with the above two algorithms. SRW2CSS runs on the super graph G^(2), where each node corresponds to an edge in G. WRW runs on
the original graph and designs a specific waddling protocol for those graphlets which cannot be sampled in k consecutive steps of the simple random walk.

[Fig. 6: Number of samples vs. NRMSE on LiveJournal.]

[Fig. 7: NRMSE of concentration estimations for 4- and 5-graphlets on com-Amazon. (a) 4-graphlets; (b) 5-graphlets.]

Fig. 7 shows the comparison of the average NRMSEs of all 4-graphlets with 20K samples and all 5-graphlets with 30K samples on com-Amazon and Email-Enron. The NRMSE is estimated over 1000 independent simulations. From Fig. 7, we find that SSRW and WRW are significantly more accurate than SRW2CSS for all graphlets. For 4-graphlets, SSRW outperforms SRW2CSS by at least 6× and up to 18× (e.g., g^4_5 on com-Amazon), and outperforms WRW by up to 1.8× (e.g., g^4_6 on com-Amazon). For 5-graphlets, SSRW outperforms SRW2CSS by 10× and WRW by 1.5× in general. We think this is because SRW2CSS samples on the super graph, which induces a much larger sampling state space, while SSRW and WRW sample on the original graph, so they can reach different parts of the graph quickly and obtain samples more evenly.

To further explore the performance of the sampling algorithms on each individual graphlet, we also compare the accuracies of SSRW, WRW and SRW2CSS with different numbers of samples for every graphlet. Due to the page limit, we only present the experimental results on com-Amazon; similar conclusions are obtained on the other datasets. In Fig. 8, we compare the three algorithms on two 4-graphlets and two 5-graphlets. It shows that SSRW outperforms the others in accuracy, converges fastest (in about 5K samples), and is stable during the whole sampling process.