Extracting Nodes and Edges for Social Network Analysis using Twitter Data

Fatah Kasmaran
3 min read · Oct 20, 2020

As the name suggests, a ‘network’ is a set of connections between points. In Social Network Analysis (SNA), the network is built from the Nodes (the points) and Edges (the lines) extracted from the data.

An Edge is drawn as a line connecting two vertices, called its endpoints or end vertices. There are two kinds of edges: Directed (represented as an arrow between the nodes) and Undirected (which treats both nodes interchangeably).

A Node is the dot/point that represents an entity in the data; the nodes are what the edges connect.
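For a concrete picture, here is a minimal sketch (toy usernames, not taken from the dataset) of how a directed retweet network can be written down as plain Python data before any tooling is involved:

# toy example: each user is a node, each retweet is a directed edge
nodes = ["alice", "bob", "carol"]

# (source, target) means "source retweeted target"
edges = [
    ("alice", "bob"),
    ("alice", "bob"),   # repeated pair -> later counted as Weight = 2
    ("carol", "bob"),
]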

The full code is available here: https://github.com/maskasmaran/maskasmaran.github.io/blob/master/snaprep.py

Now, let’s get our hands dirty:

First, get the data. In this example, I am using Twitter data. If you want to know how to collect it, refer to my earlier post about Mining Twitter Data. What we need is the raw Twitter data in JSON format, which looks like this:

{"results": [{"created_at": "Thu Oct 08 22:30:19 +0000 2020", "id": 1314332323886690304, "id_str": "1314332323886690304", "text": "RT @nepxtn: GUYS. TOLONG BANTU RETWEET. INI MEREKA DARI KAMPUSKU HILANG SAAT MELIPUT. TOLONG DIBANTU\ud83d\ude2d\ud83d\ude4f\ud83c\udffb\n#DPRPengkhianatRakyat #TolakOmnibus\u2026", "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>", "truncated": false, "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 393239060, "id_str": "393239060", "name": "M. Yanuar", "screen_name": "myanuarary", "location": "semarang — jakarta", "url": "http://technaturology.blogspot.com", "description": "M.Yanuar | Network Engineer | Love Books", "translator_type": "none", "protected": false, "verified": false, "followers_count": 178, "friends_count": 206, "listed_count": 4, "favourites_count": 0, "statuses_count": 17246, "created_at": "Tue Oct 18 07:14:07 +0000 2011", "utc_offset": null, "time_zone": null, "geo_enabled": false, "lang": null, "contributors_enabled": false, "is_translator": false, "profile_background_color": "000000", "profile_background_image_url": "http://abs.twimg.c and so on..
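Before writing the pre-processing, it can help to peek at the structure of the file. A quick sketch, assuming the file is named file.json (as in the rest of the post) and that the tweets sit under a top-level "results" key, as in the sample above:

import json

# load the raw export and look at its top-level structure
with open("file.json", "r") as data_read:
    raw = json.load(data_read)

print(type(raw), list(raw.keys()))        # e.g. <class 'dict'> ['results', ...]
print(len(raw["results"]), "tweets")      # number of tweet objects
print(sorted(raw["results"][0].keys()))   # fields available on a single tweet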

Second, prepare the code. The first step is importing all the modules we need.

import sys
import json
import re
import numpy as np
import pandas as pd
from datetime import datetime

Third, using the pandas library, create DataFrames as value holders for our pre-processing.

userdata = pd.DataFrame(columns=('Id', 'Label', 'user_created_at', 'profile_image', 'followers_count', 'friends_count'))
edges = pd.DataFrame(columns=('Source', 'Target', 'Weight'))

Above, we made two variables: userdata, which holds the node values, and edges, which holds the edge values needed for the network.
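Since the rest of the code fills these DataFrames row by row, here is the pattern in isolation: a minimal sketch that appends a single row, using values taken from the sample JSON above (the profile-image URL is a hypothetical placeholder):

import pandas as pd

userdata = pd.DataFrame(columns=('Id', 'Label', 'user_created_at',
                                 'profile_image', 'followers_count', 'friends_count'))

# one toy row; in the real loop these values come from the tweet object
row = pd.DataFrame([['393239060', 'myanuarary', 'Tue Oct 18 07:14:07 +0000 2011',
                     'https://example.com/avatar.jpg', 178, 206]],
                   columns=userdata.columns)

# DataFrame.append works in the pandas versions current at the time of writing;
# on pandas 2.x you would use pd.concat([userdata, row], ignore_index=True) instead
userdata = userdata.append(row, ignore_index=True)
print(userdata)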

Fourth, load the JSON file.

with open("file.json", "r") as data_read:
    # the sample export wraps the tweets in a "results" key, so keep only that list
    json_data = json.load(data_read)["results"]

Fifth, the main code.

for tweet in json_data:
    if 'retweeted_status' in tweet:
        # node for the retweeter
        userdata = userdata.append(pd.DataFrame(
            [[tweet['user']['id_str'],
              tweet['user']['screen_name'],
              tweet['user']['created_at'],
              tweet['user']['profile_image_url_https'],
              tweet['user']['followers_count'],
              tweet['user']['friends_count']]],
            columns=('Id', 'Label', 'user_created_at', 'profile_image',
                     'followers_count', 'friends_count')), ignore_index=True)
        # node for the author of the original (retweeted) tweet
        userdata = userdata.append(pd.DataFrame(
            [[tweet['retweeted_status']['user']['id_str'],
              tweet['retweeted_status']['user']['screen_name'],
              tweet['retweeted_status']['user']['created_at'],
              tweet['retweeted_status']['user']['profile_image_url_https'],
              tweet['retweeted_status']['user']['followers_count'],
              tweet['retweeted_status']['user']['friends_count']]],
            columns=('Id', 'Label', 'user_created_at', 'profile_image',
                     'followers_count', 'friends_count')), ignore_index=True)
        # edge from the retweeter (Source) to the original author (Target);
        # Weight temporarily holds the tweet timestamp and is replaced by a count later
        edges = edges.append(pd.DataFrame(
            [[tweet['user']['id_str'],
              tweet['retweeted_status']['user']['id_str'],
              str(datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y'))]],
            columns=('Source', 'Target', 'Weight')), ignore_index=True)

In short, the code above works like this: the Twitter data is filtered to keep only retweets, using the if statement, because the network I want to build is based on retweets.

The relevant values from the Twitter data are temporarily stored and modified in the pandas DataFrame held by the variable userdata.

The first userdata append stores the retweeter’s data, while the second append stores the author of the original tweet that was retweeted.

The edges variable holds the edge values, which consist of Source, Target, and Weight (named to match Gephi’s import format). If A retweets B’s tweet, then A is connected to B, with A as the Source and B as the Target, matching the code above. Weight represents how many times A has retweeted B.
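To make the Weight logic concrete, here is a minimal sketch with toy IDs (not real Twitter data) showing how repeated Source/Target pairs become a count:

import pandas as pd

# toy edge list: "111" retweeted "222" twice and "333" once
edges = pd.DataFrame({
    'Source': ['111', '111', '111'],
    'Target': ['222', '222', '333'],
    'Weight': ['2020-10-08 22:30:19', '2020-10-09 10:00:00', '2020-10-10 08:15:00'],
})

# counting rows per (Source, Target) pair turns the timestamps into edge weights
weights = edges.groupby(['Source', 'Target'])['Weight'].count().reset_index()
print(weights)
#   Source Target  Weight
# 0    111    222       2
# 1    111    333       1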

Sixth, set the Weight level.

weightlevel = 1

The weightlevel variable holds the minimum Weight we want to keep for later analysis. A weightlevel of 1 means we keep every pair where a person has retweeted another person at least once.

Seventh, create the Weight column value.

# create the Weight column value by counting the same combination of Source & Target
edges2 = edges.groupby(['Source', 'Target'])['Weight'].count()
# reset the index/level of the DataFrame
edges2 = edges2.reset_index()
# filter the weight by our weightlevel
edges2 = edges2[edges2['Weight'] >= weightlevel]
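As an optional sanity check (assuming edges2 has been built as above), you can inspect the strongest connections and see how many edges a stricter threshold would keep:

# strongest retweet relationships first
print(edges2.sort_values('Weight', ascending=False).head())

# how many edges would survive a stricter threshold, e.g. at least 2 retweets
print((edges2['Weight'] >= 2).sum(), "edges with Weight >= 2")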

Eighth, extract the nodes from the edges and add node attributes for both Sources and Targets.

userdata = userdata.sort_values(['Id', 'followers_count'], ascending=[True, False])
# delete duplicate user rows, if any, keeping the one with the highest followers_count
userdata = userdata.drop_duplicates(['Id'], keep='first')
# extract the nodes from the edges
ids = edges2['Source'].append(edges2['Target']).to_frame()
ids.columns = ['Id']
ids = ids.drop_duplicates()
# combine the ids with the user attributes we collected above
nodes = pd.merge(ids, userdata, on='Id', how='left')

Ninth, export the results to CSV files.

nodes.to_csv('nodes.csv', encoding='utf-8', index=False)
edges2.to_csv('edges.csv', encoding='utf-8', index=False)
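If you want to double-check the files before importing them into Gephi, here is a quick read-back sketch (assuming the two CSVs were written as above):

import pandas as pd

nodes_check = pd.read_csv('nodes.csv', dtype={'Id': str})
edges_check = pd.read_csv('edges.csv', dtype={'Source': str, 'Target': str})

print(nodes_check.shape, "nodes;", edges_check.shape, "edges")
# every Source/Target in the edge list should have a matching node Id
missing = set(edges_check['Source']).union(edges_check['Target']) - set(nodes_check['Id'])
print("ids in edges but not in nodes:", len(missing))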

I hope this is useful for you.

Thank you.
