Data Science: Graphical Analysis of data using Neo4j and Gephi Tool

What is graph data?

Graph data, also called connected data, describes data points together with the relationships and connections between them. Analyzing those connections can make predictions more efficient and accurate.

How is graph data stored and represented?

Graph databases such as Neo4j generally keep graph data in store files, each of which holds one part of the graph: nodes, relationships, labels, or properties. When the data is drawn as a graph, relationships and patterns become visible at a glance.

How do we visualize graph data and draw conclusions?

By studying the visualized graph carefully, one can spot the patterns and relationships in the data, which helps the user understand it and make predictions.

Neo4j Tool

Neo4j is a leading open-source graph database management system developed by Neo4j, Inc. (formerly Neo Technology, Inc.). It is a highly scalable, native graph database, designed to store, manage, and traverse nodes and relationships quickly. Its consistent real-time performance lets enterprises build applications that meet today’s evolving data challenges.

Features of Neo4j:

  1. Flexible schema
  2. Scaling and Performance
  3. Drivers for popular languages and frameworks
  4. Cloud-ready
  5. Powerful Cypher Query language
  6. Data Import
  7. Hot Backups

Three main primitives in Neo4j:

  1. Nodes: the entities in which data is stored. A node is roughly comparable to a row in a relational table, and its labels (such as Person) play a role similar to the table name.
  2. Relationships: directed, typed connections between two nodes.
  3. Properties: key-value pairs that hold the actual data and can be attached to both nodes and relationships, e.g. a Person node can have properties such as Name and Age.
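
To make the three primitives concrete, here is a minimal Python sketch of a property graph. It is only an illustration of the data model, not how Neo4j stores data internally; the class names, the KNOWS relationship, and the sample people are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entity, with a set of labels and key-value properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection between two nodes, with its own properties."""
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


def describe(rel):
    """Render a relationship in a Cypher-like arrow notation."""
    return f'({rel.start.properties["name"]})-[:{rel.type}]->({rel.end.properties["name"]})'


# Hypothetical sample data: two Person nodes joined by a KNOWS relationship.
alice = Node({"Person"}, {"name": "Alice", "age": 34})
bob = Node({"Person"}, {"name": "Bob", "age": 29})
knows = Relationship(alice, "KNOWS", bob, {"since": 2015})

print(describe(knows))  # (Alice)-[:KNOWS]->(Bob)
```

Note that both the nodes and the relationship carry properties, which is the main difference from a plain adjacency list.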

First, I will run a Hello World query that creates two nodes, named Neo4j and Hello World!, and one relationship of type SAYS between them.

CREATE (database:Database {name:"Neo4j"})-[r:SAYS]->(message:Message {name:"Hello World!"})
RETURN database, message, r
The relationship created just by a simple query
Table view of nodes & relations

Steps:

  1. Select the database to work on (here, the movies dataset).
  2. Click Start and select Open browser from the graph apps menu.
  3. Run a query in the editor (the $ prompt).
  4. Note and analyze patterns amongst the nodes.
  5. Check whether any relationships exist; they are useful for the analysis.
  6. Repeat steps 3, 4, and 5 for other movies and actors.
:play movie-graph
It is a mini graph application containing actors and directors that are related through the movies
MATCH (tom {name: "Tom Hanks"}) RETURN tom
Find the actor named “Tom Hanks”…
MATCH (cloudAtlas {title: "Cloud Atlas"}) RETURN cloudAtlas
Find the movie with the title “Cloud Atlas”…
MATCH (people:Person) RETURN people.name LIMIT 10
Find 10 people…
MATCH (nineties:Movie) WHERE nineties.released >= 1990 AND nineties.released < 2000 RETURN nineties.title
Find movies released in the 1990s…
MATCH (tom:Person {name: "Tom Hanks"})-[:ACTED_IN]->(tomHanksMovies) RETURN tom,tomHanksMovies
List all Tom Hanks movies…
MATCH (cloudAtlas {title: "Cloud Atlas"})<-[:DIRECTED]-(directors) RETURN directors.name
Who directed “Cloud Atlas”?
MATCH (people:Person)-[relatedTo]-(:Movie {title: "Cloud Atlas"}) RETURN people.name, type(relatedTo), relatedTo
How people are related to “Cloud Atlas”…
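
To show what these queries compute, the following Python sketch applies the same filters to a tiny in-memory dataset. It assumes a hand-picked subset of the movie graph (three movies, three people) and uses hypothetical helper names; the real queries run inside Neo4j and are not implemented this way.

```python
# Toy subset of the movie graph: title -> release year.
movies = {
    "Apollo 13": 1995,
    "You've Got Mail": 1998,
    "Cloud Atlas": 2012,
}

# (person, relationship type, movie) triples.
relationships = [
    ("Tom Hanks", "ACTED_IN", "Apollo 13"),
    ("Tom Hanks", "ACTED_IN", "You've Got Mail"),
    ("Tom Hanks", "ACTED_IN", "Cloud Atlas"),
    ("Meg Ryan", "ACTED_IN", "You've Got Mail"),
    ("Tom Tykwer", "DIRECTED", "Cloud Atlas"),
]


def released_between(start, end):
    """Mirrors WHERE m.released >= start AND m.released < end."""
    return sorted(title for title, year in movies.items() if start <= year < end)


def related(movie, rel_type=None):
    """Mirrors MATCH (p)-[r]-(:Movie {title: movie}), optionally filtered by type(r)."""
    return [
        (person, rel)
        for person, rel, title in relationships
        if title == movie and (rel_type is None or rel == rel_type)
    ]


print(released_between(1990, 2000))
print(related("Cloud Atlas", "DIRECTED"))
print(related("Cloud Atlas"))
```

The half-open range in released_between matches the query's >= 1990 AND < 2000 condition, so a film released in 2000 would be excluded.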

Task:

Find a simple shortest path for a person in the graph, such as Joel Silver, using the following methods (the example queries below start from Kevin Bacon):

  1. Variable-length patterns
  2. Built-in shortestPath() function
MATCH (bacon:Person {name:"Kevin Bacon"})-[*1..4]-(hollywood)
RETURN DISTINCT hollywood
Movies and actors up to 4 “hops” away from Kevin Bacon
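
The idea behind the variable-length pattern [*1..4] can be sketched as a breadth-first search that stops after a fixed number of hops. This is an illustrative approximation on a toy adjacency list, not Neo4j's traversal engine; unlike the Cypher query, it never returns the start node itself, and the function name is hypothetical.

```python
from collections import deque

# Toy undirected graph mixing people and movies, as in the movie graph.
graph = {
    "Kevin Bacon": ["Apollo 13"],
    "Apollo 13": ["Kevin Bacon", "Tom Hanks"],
    "Tom Hanks": ["Apollo 13", "You've Got Mail"],
    "You've Got Mail": ["Tom Hanks", "Meg Ryan"],
    "Meg Ryan": ["You've Got Mail", "Top Gun"],
    "Top Gun": ["Meg Ryan", "Tom Cruise"],
    "Tom Cruise": ["Top Gun"],
}


def within_hops(graph, start, max_hops):
    """Return {node: hop count} for every node reachable in 1..max_hops steps."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]  # the start node is not part of the result
    return hops


print(within_hops(graph, "Kevin Bacon", 4))
```

In this toy graph Meg Ryan is exactly four hops from Kevin Bacon, so she is included, while Top Gun (five hops) is not.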
MATCH p=shortestPath(
(bacon:Person {name:"Kevin Bacon"})-[*]-(meg:Person {name:"Meg Ryan"})
)
RETURN p
Bacon path, the shortest path of any relationships to Meg Ryan
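
For unweighted graphs, a shortest path can be found with breadth-first search, which is the idea sketched below in Python. It is a simplified stand-in on a toy adjacency list; Neo4j's shortestPath() may be implemented differently, and the function name here is hypothetical.

```python
from collections import deque

# Toy undirected graph mixing people and movies, as in the movie graph.
graph = {
    "Kevin Bacon": ["Apollo 13"],
    "Apollo 13": ["Kevin Bacon", "Tom Hanks"],
    "Tom Hanks": ["Apollo 13", "You've Got Mail"],
    "You've Got Mail": ["Tom Hanks", "Meg Ryan"],
    "Meg Ryan": ["You've Got Mail"],
    "Cloud Atlas": [],  # isolated here, to show the "no path" case
}


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


print(shortest_path(graph, "Kevin Bacon", "Meg Ryan"))
```

Because breadth-first search visits nodes in order of distance, the first time the goal is dequeued the recorded path is guaranteed to be a shortest one.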
MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors), (coActors)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cruise:Person {name:"Tom Cruise"})
RETURN tom, m, coActors, m2, cruise
Find someone to introduce Tom Hanks to Tom Cruise
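
The introduction query amounts to intersecting two sets of co-actors. Here is a minimal Python sketch of that reasoning, assuming a small hand-picked set of cast lists rather than the full movie graph; the helper names are hypothetical.

```python
# Toy cast lists: movie -> set of actors (a small subset for illustration).
cast = {
    "You've Got Mail": {"Tom Hanks", "Meg Ryan"},
    "Apollo 13": {"Tom Hanks", "Kevin Bacon"},
    "Top Gun": {"Tom Cruise", "Meg Ryan"},
    "A Few Good Men": {"Tom Cruise", "Kevin Bacon"},
    "Cloud Atlas": {"Tom Hanks", "Halle Berry"},
}


def co_actors(person):
    """Everyone who shares at least one movie with the given person."""
    shared = {actor for actors in cast.values() if person in actors for actor in actors}
    return shared - {person}


def introducers(a, b):
    """People who have acted with both a and b, sorted by name."""
    return sorted(co_actors(a) & co_actors(b))


print(introducers("Tom Hanks", "Tom Cruise"))  # ['Kevin Bacon', 'Meg Ryan']
```

Halle Berry is a co-actor of Tom Hanks in this data but has no movie with Tom Cruise, so she is not returned.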

Gephi Tool

Gephi is open-source software for visualizing and analyzing large network graphs. It uses a 3D render engine to display graphs in real time and speed up exploration. You can use it to explore, analyze, spatialize, filter, cluster, manipulate, and export all types of graphs.

Features:

  • Real-time Visualization
  • Built-in Rendering Engine
  • Native File Formats Support
  • Layout Algorithm
  • Metrics and Statistics
  • Data Laboratory
  • Dynamic Filtering

Open Gephi and click New Project. Then choose File->Open and load the dataset of your choice, as shown below. On loading, Gephi reports the number of nodes and edges in the dataset as well as the type of the graph. Here, I have chosen the karate.gml dataset.
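
The summary Gephi shows on import (node count, edge count, graph type) can be approximated outside Gephi. The sketch below counts node and edge blocks in a GML string with regular expressions. It assumes a hypothetical miniature file and is a naive pattern match, not a full GML parser and not Gephi's importer.

```python
import re

# Hypothetical miniature GML document, used only for illustration.
GML_TEXT = """
graph [
  directed 0
  node [ id 1 ]
  node [ id 2 ]
  node [ id 3 ]
  edge [ source 1 target 2 ]
  edge [ source 2 target 3 ]
]
"""


def summarize_gml(text):
    """Count node/edge blocks and read the 'directed' flag from GML text."""
    nodes = re.findall(r"\bnode\s*\[", text)
    edges = re.findall(r"\bedge\s*\[", text)
    directed = re.search(r"\bdirected\s+1\b", text) is not None
    return {"nodes": len(nodes), "edges": len(edges), "directed": directed}


print(summarize_gml(GML_TEXT))
```

To try it on a real file, read the file contents into a string first and pass that to summarize_gml.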

After clicking OK, all the nodes and edges are displayed in their initial arrangement.

Now we can arrange the data in various layouts. In the left pane, open the Layout option, choose a layout, and click Run. In the image below I have chosen the Fruchterman Reingold layout, which displays the data in the following form.
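
For intuition about what a force-directed layout does, here is a simplified Python sketch in the spirit of Fruchterman and Reingold: all node pairs repel, connected nodes attract, and a decreasing "temperature" limits movement. It is an assumption-laden illustration, not Gephi's implementation; the parameter names, the cooling factor, and the sample graph are hypothetical.

```python
import math
import random


def force_layout(nodes, edges, iterations=50, size=1.0, seed=0):
    """Simplified force-directed layout; returns {node: (x, y)} in a size x size box."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = size / math.sqrt(len(nodes))  # ideal distance between nodes
    temperature = size / 10           # caps how far a node may move per step

    def push(u, v, strength_of, sign):
        """Apply a force between u and v along the line joining them."""
        dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
        dist = math.hypot(dx, dy) or 1e-9
        force = strength_of(dist)
        disp[u][0] += sign * dx / dist * force
        disp[u][1] += sign * dy / dist * force
        disp[v][0] -= sign * dx / dist * force
        disp[v][1] -= sign * dy / dist * force

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes: magnitude k^2 / distance.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                push(u, v, lambda d: k * k / d, +1)
        # Attraction along each edge: magnitude distance^2 / k.
        for u, v in edges:
            push(u, v, lambda d: d * d / k, -1)
        # Move each node, limited by the temperature, and clamp it to the box.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] = min(size, max(0.0, pos[n][0] + disp[n][0] / length * step))
            pos[n][1] = min(size, max(0.0, pos[n][1] + disp[n][1] / length * step))
        temperature *= 0.95  # cool down so the layout settles

    return {n: (p[0], p[1]) for n, p in pos.items()}


# Hypothetical sample graph: a square with one diagonal.
layout = force_layout(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")])
for name, (x, y) in layout.items():
    print(name, round(x, 3), round(y, 3))
```

The fixed seed makes the result repeatable; Gephi itself keeps running the layout until you stop it and exposes its own parameters in the Layout panel.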

Nodes can be shown in different colors and sizes based on degree, in-degree, or out-degree. To do this, go to the top of the left panel, choose Nodes->Ranking, and select a color for Degree/In-degree/Out-degree. Here, red nodes have a lower degree than white nodes, and the dark grey node has the highest degree.

To display nodes in different sizes, select the Size option in the Appearance section of the left pane and enter the minimum and maximum node sizes. I have set Min size to 10 and Max size to 30. In the image below, nodes with a higher degree are larger than nodes with a lower degree, i.e. the dark grey nodes have a higher degree than the white and red nodes.
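
One plausible way to map degree onto a size range is linear interpolation between the minimum and maximum size, sketched below in Python. This assumes a plain linear mapping, which may differ from the interpolation Gephi applies; the function name and sample degrees are hypothetical.

```python
def scale_sizes(degrees, min_size=10, max_size=30):
    """Map each node's degree linearly onto the range [min_size, max_size]."""
    lowest, highest = min(degrees.values()), max(degrees.values())
    if highest == lowest:
        # All degrees are equal: avoid dividing by zero, use the minimum size.
        return {node: float(min_size) for node in degrees}
    span = max_size - min_size
    return {
        node: min_size + span * (degree - lowest) / (highest - lowest)
        for node, degree in degrees.items()
    }


# Hypothetical degrees for three nodes.
print(scale_sizes({"a": 1, "b": 3, "c": 5}))  # {'a': 10.0, 'b': 20.0, 'c': 30.0}
```

The node with the lowest degree gets the minimum size, the node with the highest degree gets the maximum size, and everything else falls proportionally in between.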

Next, we generate a degree distribution graph for Degree, In-Degree, and Out-Degree and obtain the Average Degree value across all nodes. To generate it, open the Statistics tab in the right pane and run Average Degree in the Network Overview section.

A report is generated, and columns for Degree, In-degree, and Out-degree are added to the dataset table.
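
The numbers in such a report can be reproduced by hand for an undirected graph: each edge adds one to the degree of both endpoints, so the average degree is twice the edge count divided by the node count. The Python sketch below computes this on a hypothetical four-node graph; it only illustrates the arithmetic and is not Gephi's code. For the commonly distributed karate club network, with 34 nodes and 78 edges, the same formula would give 2 × 78 / 34 ≈ 4.59.

```python
from collections import Counter


def degree_stats(edges):
    """Return (degree per node, average degree, degree distribution).

    Nodes that appear in no edge are not counted in this simple version.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    average = sum(degree.values()) / len(degree)
    distribution = Counter(degree.values())  # degree value -> number of nodes
    return dict(degree), average, dict(distribution)


# Hypothetical undirected graph with four nodes and four edges.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
degree, average, distribution = degree_stats(edges)
print(degree)        # {'a': 3, 'b': 2, 'c': 2, 'd': 1}
print(average)       # 2.0
print(distribution)  # {3: 1, 2: 2, 1: 1}
```

The distribution dictionary is what a degree distribution chart plots: the degree value on one axis and the number of nodes with that degree on the other.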

To see the data table, select Window->Data Table in the top menu bar. As in the image above, after running Average Degree the table contains Degree, In-Degree, and Out-Degree columns for every node.

Conclusion

We have walked through an example of graphical data analysis using Neo4j and Gephi: querying nodes and relationships with Cypher, and laying out, ranking, and measuring a network in Gephi.