This is a running post, tracking my attempts to analyze the subgraph induced by the accounts @nikhilarundesai follows on Twitter.

Goals:

Generate a static set of data about the accounts I follow for easy analysis.
Do a basic node-wise analysis to search for stale, dead, or spam accounts to unfollow.
Cluster the undirected graph of accounts by domain, using a simple methodology like the Louvain method.
(Stretch) Use triangle-closure or other approaches to discover new candidate accounts to follow.
(Stretch) Use an overlapping community detection algorithm for better resolution.
(Stretch) Use text-processing on node bios to generate additional features.
(Stretch) Identify and use a feature-assisted clustering algorithm.
(Stretch) Identify and use an algorithm suited to directed graphs.
Potentially more things.

Non-goals:

Analyze the subgraph induced by accounts following @nikhilarundesai - not interesting.
Analyze the subgraph induced by accounts whose tweets have been liked by @nikhilarundesai
- Better suited for a followon post.
- Requires more work to analyze different “types” of likes - paper/video links, memes, bookmarking an account, etc.
- Fetching may be slow until https://github.com/eleurent/twitter-graph/issues/25 is resolved.

Getting the data

eleurent has already had the same idea, and has solved this problem end-to-end at https://github.com/eleurent/twitter-graph/.

I made some small tweaks to fit my use case. In particular, the script by default filtered “excluded” accounts from the friendships graph. However, most “excluded” accounts are simply accounts following zero other people (see https://github.com/eleurent/twitter-graph/issues/27). While this is surprisingly common across multiple node profiles, it is particularly common for accounts that are “organizational” in nature - for example @ANI, @TheNVIndy, @OvercastFM. These are still interesting accounts to include in the analysis, as they may be followed by many other accounts even if they are not following anyone.

After many false starts, the data dump successfully completed on 2022-02-12.

Basic analysis

As dumped, the graph seems to have 5002 nodes and 657856 directed edges, though there is potential for duplicates in the edges.

When loading the edge list into NetworkX as an undirected graph, we find one giant connected component of 4988 participating nodes and 529606 undirected edges. There are 14 additional isolated notes with no connections to each other or the large component. https://github.com/nadesai/twitter-graph/blob/6c364bc08975f53124dd4ab051fc6a22087b1411/notebooks/nodes_missing.ipynb is a quick analysis of these 14 nodes.

Sadly attempting to view this graph in Gephi caused it to crash.

TODO: check the directed connectivity of the graph.

Node analysis

The most urgent reason for doing this analysis is that I’ve hit the follow limit on Twitter and can no longer follow new accounts. This is aggravating, and is likely due to chaff in my follows that can be reviewed.

The JSON dump of the data (and possibly also the final CSV) from eleurent’s downloader contains information on the last time the account tweeted. This is probably the most useful metric for account utility - anybody who hasn’t tweeted in the past year deserves a critical look.

Using https://github.com/nadesai/twitter-graph/blob/6c364bc08975f53124dd4ab051fc6a22087b1411/notebooks/nodes_exploration.ipynb, I was able to look through the 300 most “stale” accounts, ordered by date of last tweet. This had a high hit rate for chaff - out of the 300 or so accounts in the list, I decided to unfollow about 100 of them as no longer relevant.

Graph clustering

Selecting an algorithm

To do this properly, first I need to figure out a good algorithm to use. It’s been a long time since I read any literature on community detection. The most recent review I recall was this 2009 review by Santo Fortunato.

In his README.md, eleurent refers to the Louvain paper and this other paper that I am not familiar with, but appears to be the paper that introduced the “resolution” parameter that is standard in Louvain implementations. I thought this might be a good avenue to discover a more up-to-date community detection review paper.

I generated the Connected Papers graph for the Louvain paper. The paper’s derivative works include two potentially promising results: this updated review paper from Fortunato and Hric, and this “atlas” e-book from Michele Coscia. Once I’ve done a quick skim of at least one of these, I will update this post.

TODO: read through the review and determine if any more recent algorithms deserve a look.

Louvain method

I used the Louvain method as implemented at https://github.com/taynaud/python-louvain to generate a coarse clustering. The notebook is at https://github.com/nadesai/twitter-graph/blob/e8e05190924dfe004070aed7d2b3a44354ea5961/notebooks/louvain_exploratory.ipynb.

There are seven distinct clusters which we can number 0 through 6. A brief inspection suggests the following cluster assignments:

Cluster 0: California, Bay Area, and housing (322 members)
Cluster 1: Left/activist politics (671 members)
Cluster 2: South Asia, non-US, and financial news (876 members)
Cluster 3: Right-wing politics (342 members)
Cluster 4: Unclear (see below) (1180 members)
Cluster 5: Climate, energy, earth sciences, outdoors (577 members)
Cluster 6: US politics and policy (1020 members)

The largest cluster (4) clearly contains multiple types of accounts. An immediately obvious throughline is that they are the accounts I find the most interesting and important to follow. It includes technologists, scientists, personal acquaintances, and novelty accounts.

The second-largest cluster (6) is also clearly the one most in need of pruning.

TODO: tweak the Louvain resolution parameter to see if this provides more useful or fine-grained clusters. TODO: generate a Louvain dendrogram rather than a partition and see if this yields additional insight.