This blog post is a continuation of Part I.
The sample data has now grown to 150K Pakistani tweeps.
Follower count is no longer a good measure of influence. On average, each
Pakistani tweep is followed by 129 users. The majority of Pakistanis
(about three quarters) have fewer than 50 followers. Half of Pakistani Twitter
users have fewer than 10 followers. There are about 10,000 tweeps with
no followers and about 12,000 tweeps with a single follower. This is a
very strange trend. If you look closely at these accounts, you’ll notice
that most of them have the default DP and default background. These appear to
be fake accounts, created by the social media cells of different political
parties to increase the follower counts of their leaders on Twitter.
On the other side, there are just 24 Pakistanis with more than 50,000
followers. Most of them are politicians and TV anchors. Just 2,331
tweeps have more than 1,000 followers.
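The gap between the average and the typical account is easier to see with a small calculation. The sketch below is illustrative only: `follower_summary` is a hypothetical helper and the sample list is made-up data, not the crawled dataset. It shows how a few large accounts pull the mean far above the median.

```python
from statistics import mean, median


def follower_summary(counts):
    """Summarise a list of follower counts (one integer per account)."""
    total = len(counts)
    return {
        "mean": mean(counts),
        "median": median(counts),
        # Fraction of accounts with fewer than 50 followers.
        "share_under_50": sum(c < 50 for c in counts) / total,
        # Number of accounts with no followers at all.
        "zero_followers": sum(c == 0 for c in counts),
    }


# Hypothetical sample: one popular account among mostly tiny ones.
sample = [0, 1, 3, 8, 9, 40, 75, 903]
summary = follower_summary(sample)
print(summary)
```

With this made-up sample a single large account is enough to push the mean past 100 while the median stays in single digits, which is the same shape the real numbers above describe.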
Klout is a more reliable measure of social media influence. Out of 150,000
Pakistani tweeps, about 40,000 do not have any Klout score. About 70,000 have
a Klout score between 11 and 20. The average Klout score is 16.72. About
12,000 have the minimum possible score of 10.
Only 22 users have scored above 70.
Here is the list of the most influential Pakistanis (Klout: 70+):
Note: These scores may have changed by the time you read this article.
For this analysis, the profile descriptions of about 150,000 Pakistani tweeps
were used. Out of 150K, only about 77K (about 51%) users have set the
description field in their Twitter profiles.
Excluding punctuation and stopwords, the following is the list of the words
most commonly used by Pakistani tweeps in their profiles.
The technology used was FreqDist and the stopwords list from nltk.
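The counting step can be sketched as follows. This is not the original script: it uses `collections.Counter` from the standard library as a stand-in for nltk's `FreqDist` (which offers the same `most_common` interface in recent NLTK releases), and the `STOPWORDS` set and sample descriptions are small hypothetical placeholders for nltk's stopwords list and the real profile data.

```python
import string
from collections import Counter

# Hypothetical, tiny stand-in for nltk's English stopwords list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "i", "am", "is", "to", "my"}


def top_words(descriptions, n=10):
    """Return the n most common words, ignoring punctuation and stopwords."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in descriptions:
        # Lowercase, drop punctuation, then split on whitespace.
        for word in text.lower().translate(strip_punct).split():
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)


# Made-up profile descriptions, for illustration only.
sample = [
    "Student of engineering, love cricket!",
    "Proud Pakistani. Love cricket and music.",
    "I am a student in Lahore",
]
print(top_words(sample, 3))
```

Swapping `Counter` for `nltk.FreqDist` and the placeholder set for nltk's stopwords list should give the setup described above, assuming the stopwords corpus has been downloaded.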
I do not want to start this blog post by bashing Posterous.
Posterous is a great blogging tool for quickly making blog posts.
A couple of years back, some of its unique features convinced me to move
my blog from WordPress to Posterous. Posterous offered a custom domain name
for free, whereas WordPress was charging for it. I really liked the
email-to-blog-post feature, although I never used it beyond testing it a
couple of times. Another amazing feature of Posterous was detecting external
objects like YouTube videos and GitHub gists and making beautiful widgets for them.
Posterous provides some nice templates, but I wanted to have more control over
presentation. A few days back, YouTube was blocked in Pakistan. Some
misconfiguration caused problems in loading other Google sites. This affected
Google Maps etc. The same thing happened with my site. The template I was using
was loading some resources from Google. I don’t know why, but it was there
and there was no way to remove it. So the end result was a slowly loading page
for the Pakistani audience.
Another problem was how Posterous modifies the HTML of the blog post. Again, I
wanted to have more control over my blog post presentation. Inserting a table in
a blog post was not a trivial task. The WYSIWYG editor cannot handle a table,
even if it is copy-pasted. I had to manually draft the HTML and paste it into the
HTML view of the WYSIWYG editor. And it gets modified when rendered :-(
The idea of a static site generator (SSG) is amazing. Why do I need a dynamic
setup for content which is hardly going to be modified in a month? I tried
Jekyll and Pelican and decided to use Pelican. Why Pelican? Because
I am more biased towards Python. Jekyll is an equally good or maybe better SSG.
Being a geek, I like writing in plain text editors more than WYSIWYG editors.
Writing in Markdown and reStructuredText is fun. I can keep my energy
focused on writing rather than on formatting the content. My content is saved
as content, not as HTML markup. It gets better revision management using git or
any other version control system. It can easily be imported into any other
application. The content is saved in files, not in a DB. I can write offline
and publish when I am online.
I have full control over the page rendered. I can design and optimize it
as I want. I do not have to worry about security or scaling as all the content
is purely static.
User Experience & Minimalism
I am not a UX expert but I do not want a lot of distractions in my content.
Here is what I did to improve UX:
- Removed comments; users can tweet their feedback.
- No Facebook like or Twitter tweet button.
- No tags, category or author name with each post.
- Using grey instead of pure black for text.
- Worked on typography
- Using typogrify
Jekyll provides a Posterous importer but Pelican does not. Currently Pelican
provides only the following importers:
- RSS/Atom feed
For Posterous I had to write my own importer, which consumes the Posterous API.
Here is the code:
def posterous2fields(api_token, email, password):
    """Imports posterous posts"""
    import base64
    import json
    import urllib.request
    from datetime import datetime, timedelta
    # slugify is Pelican's helper for turning a title into a URL slug
    from pelican.utils import slugify

    def get_posterous_posts(api_token, email, password, page=1):
        # Fetch one page of posts using HTTP Basic authentication
        credentials = ('%s:%s' % (email, password)).encode('utf-8')
        base64string = base64.b64encode(credentials).decode('ascii')
        url = ("http://posterous.com/api/v2/users/me/sites/primary/posts"
               "?api_token=%s&page=%d" % (api_token, page))
        request = urllib.request.Request(url)
        request.add_header("Authorization", "Basic %s" % base64string)
        handle = urllib.request.urlopen(request)
        return json.loads(handle.read().decode('utf-8'))

    page = 1
    posts = get_posterous_posts(api_token, email, password, page)
    # Keep requesting pages until the API returns an empty list
    while len(posts) > 0:
        for post in posts:
            slug = post.get('slug')
            if not slug:
                slug = slugify(post.get('title'))
            tags = [tag.get('name') for tag in post.get('tags')]
            # The slicing below expects a trailing " +HHMM" style UTC offset
            raw_date = post.get('display_date')
            date_object = datetime.strptime(raw_date[:-6], "%Y/%m/%d %H:%M:%S")
            offset = int(raw_date[-5:])
            sign = -1 if offset < 0 else 1
            hours, minutes = divmod(abs(offset), 100)
            # Subtract the offset to convert the local time to UTC
            date_object -= sign * timedelta(hours=hours, minutes=minutes)
            date = date_object.strftime("%Y-%m-%d %H:%M")
            # The sixth field was blank in the source; an empty list is assumed
            yield (post.get('title'), post.get('body_cleaned'), slug, date,
                   post.get('user').get('display_name'), [], tags, "html")
        page += 1
        posts = get_posterous_posts(api_token, email, password, page)
The above code produces Pelican fields, which can later be passed to
fields2pelican, which uses pandoc to transform the HTML content
to Markdown or reStructuredText.
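The date handling in the importer is the part most likely to go wrong, so here is a small standalone sketch of the same conversion. It is an alternative, not the code used above: it relies on the `%z` directive of Python 3's `strptime` instead of slicing the string by hand. The helper name `to_utc_string` and the sample timestamps are hypothetical, and the input format is assumed from the way the importer slices `display_date`.

```python
from datetime import datetime, timezone


def to_utc_string(raw_date):
    """Convert 'YYYY/MM/DD HH:MM:SS +HHMM' into a UTC 'YYYY-MM-DD HH:MM' string."""
    # %z parses the trailing +HHMM / -HHMM offset into tzinfo.
    local = datetime.strptime(raw_date, "%Y/%m/%d %H:%M:%S %z")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


# Made-up timestamps covering positive, negative and half-hour offsets.
print(to_utc_string("2012/10/01 10:30:00 +0500"))  # 2012-10-01 05:30
print(to_utc_string("2012/10/01 22:15:00 -0700"))  # 2012-10-02 05:15
print(to_utc_string("2012/10/01 10:30:00 +0530"))  # 2012-10-01 05:00
```

The half-hour case is worth checking in any version of this logic, because dividing the offset by 100 and treating the result as hours silently mishandles the minutes.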
The site is deployed on the Heroku Cedar stack, which supports Python
applications. It is served by a great WSGI app called ‘static’, running under gunicorn.
Update: I am now using my own fork of static for performance tweaks.
Here is the list of domains that are managed by MarkMonitor and have their
nameservers pointing to dns2.freehostia.com and dns1.freehostia.com.
According to some reports, there are about 1.9M Twitter users in
Pakistan. This was mentioned by someone at #SOCMM12, but there doesn’t
seem to be any source for this information.
I have been collecting Twitter data for quite some time. The sample data
contains more than 100K Pakistani Twitter users crawled using the Twitter
API. Only public profiles that mention Pakistan or a Pakistani
city name were considered for this analysis. This data
contains almost all active tweeple of Pakistan.
Here are the results of data analysis:
About half of Pakistani tweeps live in major cities like Karachi and Lahore:
- 24.5% in Punjab (just 64.4% of them in Lahore, the rest in other cities)
- 21.6% in Sindh (with 92.6% of them in Karachi)
- 10.0% in Islamabad
- 2.6% in KPK (with 58.7% of them in Peshawar)
- 0.96% in Balochistan
- 0.45% in Azad Kashmir
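A breakdown like the one above can be produced by matching the free-text location field against known city names. The sketch below is only a guess at the approach, not the script that was used: the `CITY_TO_REGION` table is a hypothetical and deliberately tiny lookup, and plain substring matching is a simplification.

```python
from collections import Counter

# Hypothetical, deliberately small city-to-region lookup.
CITY_TO_REGION = {
    "lahore": "Punjab",
    "faisalabad": "Punjab",
    "karachi": "Sindh",
    "islamabad": "Islamabad",
    "peshawar": "KPK",
    "quetta": "Balochistan",
    "muzaffarabad": "Azad Kashmir",
}


def region_of(location):
    """Return the region for a profile location string, or None if unknown."""
    text = location.lower()
    for city, region in CITY_TO_REGION.items():
        if city in text:
            return region
    return None


def region_shares(locations):
    """Percentage of all profiles per region; unmatched profiles are left out."""
    counts = Counter(region_of(loc) for loc in locations)
    total = len(locations)
    return {
        region: 100.0 * count / total
        for region, count in counts.items()
        if region is not None
    }


# Made-up location strings, for illustration only.
sample = ["Lahore, Pakistan", "karachi", "Karachi, PK", "Pakistan"]
print(region_shares(sample))
```

Profiles that only say "Pakistan" match no city and are excluded from every region, which is one reason the percentages in such a breakdown need not add up to 100.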
gender.c, with a custom names database of more than 5,000 names, was
used on the first names in Twitter profiles:
Names (first name):
Most common male names:
Most common female names:
According to www.stopbadware.org and ESET, the following government
websites contain malware and are NOT safe for your computer.
- phc.gos.pk (Shaheed Benazir Bhutto Housing Cell)
- gulbergtownlahore.gov.pk (Gulberg Town Lahore)
- wasafaisalabad.gop.pk (WASA Faisalabad)
- moip.gov.pk (Ministry of Industries)
- khushabpolice.gov.pk (Khushab Police)
- sped.gos.pk (Special Education Department, Government of Sindh)
- fatada.gov.pk (FATA Development Authority)
- bisp.gov.pk (Benazir Income Support Programme)