Monday, September 19, 2011

Content Choreography | Trent Walton

Great thinking points for layout-scaping your webpage.

posted via email from posterous

Wednesday, August 31, 2011

Horoscoped

Horoscoped - Do horoscopes really just all say the same thing?


Do horoscopes really all just say the same thing? We scraped & analysed 22,000 to see.

See our completed meta-horoscope chart and make up your own mind.

We’ve also created a single meta-prediction out of the most common words.

How we did it

Horoscoped - Scraping 22,000 horoscopes
How do you gather 22,000 horoscopes? Obviously you could manually cut and paste them from one of the many online Zodiac pages. But that, we calculated, would take about a week of solid work (84.44 hours). So we engaged the services of arch-coder Thomas Winningham to do a bit of hacking.

Yahoo Shine kindly archive their daily predictions in a simple and very hackable format (example). Thank you! So Thomas wrote a Python script to screen-scrape 22,186 horoscopes into a single massive spreadsheet. Screen-scraping is pulling the text off a website after it’s displayed. Python is a programming language. You can use it to write scripts that only gather the specific text you want. Then you run it multiple times so it mines an entire website.

Well, it’s not quite that easy. Big sites like Yahoo have ‘rate-limiting’ on their servers. That means if you access a page too many times too quickly, it thinks you’re a hacker and deploys all kinds of anti-hacking counter-measures. Initially, Thomas set his scraping speed too high (once every 10th of a second) and his IP got instantly banned from Yahoo for 24 hours. After some experimenting (and more bans), he found that a two second delay between scrapes prevented the defense mechanisms from kicking in. The script was set to run in the background (while we smoked cigars and discussed the empire). 12 hours later, we had our 22,000 horoscopes in a single file!
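
The exact scripts are posted below, but the throttling idea boils down to something like this sketch (written in the same Python 2 style as those scripts; the file names are illustrative):

import time
from urllib import urlopen

urls = open('urls.txt').read().split()      # one horoscope URL per line (illustrative)

for i, url in enumerate(urls):
    html = urlopen(url).read()
    open('output/page_%05d.html' % i, 'w').write(html)
    time.sleep(2)                           # the two-second pause that keeps the rate-limiter happy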

We can’t share the 9.5MB spreadsheet with you because it’s Yahoo’s copyright. But here are the Python scripts should you feel like recreating the experiment.

"""
Creates a list of URLs to stdout based on repeating patterns found in the site, suitable for use with WGET or CURL.

"""

import datetime

scopes=[
"aries",
"taurus",
"gemini",
"cancer",
"leo",
"virgo",
"libra",
"scorpio",
"sagittarius",
"capricorn",
"aquarius",
"pisces"
]



daily_urlbases=[
("overview","http://shine.yahoo.com/astrology/%s/daily-overview/%i/"),
("extended","http://shine.yahoo.com/astrology/%s/daily-extended/%i/"),
("daily-teen","http://shine.yahoo.com/astrology/%s/daily-teen/%i/"),
("daily-love","http://shine.yahoo.com/astrology/%s/daily-love/%i/"),
("daily-career","http://shine.yahoo.com/astrology/%s/daily-career/%i/"),
]

yearly_urlbases=[
("yearly","http://shine.yahoo.com/astrology/%s/yearly-overview/"),
("yearly-love","http://shine.yahoo.com/astrology/%s/yearly-love/"),
("yearly-career","http://shine.yahoo.com/astrology/%s/yearly-career/")
]

weekly_urlbases=[
("weekly","http://shine.yahoo.com/astrology/%s/weekly-overview/%i/"),
("weekly-love","http://shine.yahoo.com/astrology/%s/weekly-love/%i/"),
("weekly-career","http://shine.yahoo.com/astrology/%s/weekly-career/%i/")
]

monthly_urlbases=[
("monthly","http://shine.yahoo.com/astrology/%s/monthly-overview/%i/"),
("monthly-love","http://shine.yahoo.com/astrology/%s/monthly-love/%i/"),
("monthly-career","http://shine.yahoo.com/astrology/%s/monthly-career/%i/")
]

class dateincrement:
    def __init__(self,initialdate=datetime.datetime(2009,12,31),scale=datetime.timedelta(days=1)):
        self.thisdate=initialdate
        self.scale=scale
    def next(self):
        if self.thisdate + self.scale < datetime.datetime(2011,1,1):
            self.thisdate += self.scale
            return self.thisdate
        else:
            return False

class monthincrement:
    def __init__(self,initialmonth=0):
        self.month=initialmonth
    def next(self):
        if self.month < 12:
            self.month += 1
            return datetime.datetime(2010,self.month,1)
        else:
            return False

dayobj=None
weekobj=None
monthobj=None

def makeObjects():
    # test helper: set up the module-level date iterators
    global dayobj, weekobj, monthobj
    dayobj = dateincrement()
    weekobj = dateincrement(datetime.datetime(2009,12,28),datetime.timedelta(days=7))
    monthobj = monthincrement()

def testobj(obj):
    while True:
        result=obj.next()
        if result:
            print result
        else:
            break

def testallobjs():
    testobj(dayobj)
    testobj(weekobj)
    testobj(monthobj)

pad = lambda x: str(x).rjust(2,"0")

def generate_urls(thelist,theobj):
    # yields (label_sign, datestr, url) tuples for every sign and date
    results=[]
    while True:
        d=theobj.next()
        print d
        if d:
            for url in thelist:
                for sign in scopes:
                    datestr=int(str(d.year)+pad(d.month)+pad(d.day))
                    print repr(sign)
                    print repr(datestr)
                    aresult=url[1] % (sign,datestr)
                    print aresult
                    results.append((url[0]+"_"+sign,datestr,aresult))
        else:
            break
    return results

def generate_yearly_urls(thelist):
    results=[]
    for url in thelist:
        for sign in scopes:
            print repr(sign)
            aresult=url[1] % sign
            print aresult
            results.append((url[0]+"_"+sign,"2010",aresult))
    return results


import sys
from urllib import urlopen
#from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()
queue = Queue()

def get_parser():

    # returns a worker that pulls (label, datestr, url) tuples off the queue
    # and saves each fetched page into the output/ directory
    def parse():
        try:
            while True:
                url = queue.get_nowait()
                print "GRABBING: " + repr(url)
                try:
                    content = urlopen(url[2]).read()
                    f=open("output/"+url[0]+"_"+str(url[1])+".html",'w')
                    f.write(content)
                    f.close()
                    if len(content) > 10000:
                        queue.task_done()
                except:
                    print "PASS: " + repr(url)
        except Empty:
            pass

    return parse


if __name__ == "__main__":
    daily=generate_urls(daily_urlbases,dateincrement())
    weekly=generate_urls(weekly_urlbases,dateincrement(datetime.datetime(2009,12,28),datetime.timedelta(days=7)))
    monthly=generate_urls(monthly_urlbases,monthincrement())
    yearly=generate_yearly_urls(yearly_urlbases)
    combined = daily+weekly+monthly+yearly
    parser = get_parser()
    # as posted, the script just prints every URL; the threaded downloader
    # below stays commented out
    for x in combined:
        #queue.put(x)
        print x[2]
    workers=[]
    #for i in range(5):
    #    worker = Thread(target=parser)
    #    worker.start()
    #    workers.append(worker)
    #for worker in workers:
    #    worker.join()
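
As posted, the script’s __main__ block only prints the generated URL list; the threaded download workers are left commented out. In keeping with the docstring’s wget/curl suggestion and the two-second delay described above, the output could be filtered down to the URLs and handed to wget with something like this (script and file names are illustrative):

python generate_urls.py | grep '^http://' | sort -u > urls.txt
wget --wait=2 --input-file=urls.txt --directory-prefix=output/

The second script, below, then extracts the horoscope text from the downloaded HTML files into a CSV.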

"""
Creates a list of URLs to stdout based on repeating patterns found in the site, suitable for use with WGET or CURL.

"""

import datetime

scopes=[
"aries",
"taurus",
"gemini",
"cancer",
"leo",
"virgo",
"libra",
"scorpio",
"sagittarius",
"capricorn",
"aquarius",
"pisces"
]



daily_urlbases=[
("overview","http://shine.yahoo.com/astrology/%s/daily-overview/%i/"),
("extended","http://shine.yahoo.com/astrology/%s/daily-extended/%i/"),
("daily-teen","http://shine.yahoo.com/astrology/%s/daily-teen/%i/"),
("daily-love","http://shine.yahoo.com/astrology/%s/daily-love/%i/"),
("daily-career","http://shine.yahoo.com/astrology/%s/daily-career/%i/"),
]

yearly_urlbases=[
("yearly","http://shine.yahoo.com/astrology/%s/yearly-overview/"),
("yearly-love","http://shine.yahoo.com/astrology/%s/yearly-love/"),
("yearly-career","http://shine.yahoo.com/astrology/%s/yearly-career/")
]

weekly_urlbases=[
("weekly","http://shine.yahoo.com/astrology/%s/weekly-overview/%i/"),
("weekly-love","http://shine.yahoo.com/astrology/%s/weekly-love/%i/"),
("weekly-career","http://shine.yahoo.com/astrology/%s/weekly-career/%i/")
]

monthly_urlbases=[
("monthly","http://shine.yahoo.com/astrology/%s/monthly-overview/%i/"),
("monthly-love","http://shine.yahoo.com/astrology/%s/monthly-love/%i/"),
("monthly-career","http://shine.yahoo.com/astrology/%s/monthly-career/%i/")
]

class dateincrement:
def __init__(self,initialdate=datetime.datetime(2009,12,31),scale=datetime.timedelta(days=1)):
self.thisdate=initialdate
self.scale=scale
def next(self):
if self.thisdate + self.scale < datetime.datetime(2011,1,1):
self.thisdate += self.scale
return self.thisdate
else:
return False

class monthincrement:
def __init__(self,initialmonth=0):
self.month=initialmonth
def next(self):
if self.month < 12:
self.month += 1
return datetime.datetime(2010,self.month,1)
else:
return False

dayobj=None
weekobj=None
monthObj=None

def makeObjects():
dayobj = dateincrement()
weekobj = dateincrement(datetime.datetime(2009,12,28),datetime.timedelta(days=7))
monthobj = monthincrement()

def testobj(obj):
while True:
result=obj.next()
if result:
print result
else:
break

def testallobjs():
testobj(dayobj)
testobj(weekobj)
testobj(monthobj)

pad = lambda x: str(x).rjust(2,"0")

def generate_urls(thelist,theobj):
results=[]
while True:
d=theobj.next()
print d
if d:
for url in thelist:
for month in scopes:
datestr=int(str(d.year)+pad(d.month)+pad(d.day))
print repr(month)
print repr(datestr)
aresult=url[1] % (month,datestr)
print aresult
results.append((url[0]+"_"+month,datestr,aresult))
else:
break
return results

def generate_yearly_urls(thelist):
results=[]
for url in thelist:
for month in scopes:
print repr(month)
aresult=url[1] % month
print aresult
results.append((url[0]+"_"+month,"2010",aresult))
return results


import sys
from urllib import urlopen
#from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()
queue = Queue()

def get_parser():

def parse():
try:
while True:
url = queue.get_nowait()
print "GRABBING: " + repr(url)
try:
content = urlopen(url[2]).read()
f=open("output/"+url[0]+"_"+str(url[1])+".html",'w')
f.write(content)
f.close()
if len(content) > 10000:
url.task_done()
except:
print "PASS: " + repr(url)
pass
except Empty:
pass

return parse


if __name__ == "__main__":
daily=generate_urls(daily_urlbases,dateincrement())
weekly=generate_urls(weekly_urlbases,dateincrement(datetime.datetime(2009,12,28),datetime.timedelta(days=7)))
monthly=generate_urls(monthly_urlbases,monthincrement())
yearly=generate_yearly_urls(yearly_urlbases)
combined = daily+weekly+monthly+yearly
parser = get_parser()
for x in combined:
#queue.put(x)
print x[2]
workers=[]
# for i in range(5):
#worker = Thread(target=parser)
#worker.start()
#works.append(worker)
#for worker in workers:
#worker.join()


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 
"""

Example of using the old BeautifulSoup API to extract content from downloaded html files into CSV... if you're doing this sort of thing today, I recommend using the newer lxml interface directly, but lxml also has a BeautifulSoup compatibility layer.

"""


import os

from BeautifulSoup import BeautifulSoup as bs

# pulls the main content block out of a downloaded page (presumably this is
# what produced the pre-cleaned files in ./cleaned/ used below)
def get_horo(source):
    g=bs(source)
    mainblock=g.find('div',{'class':'content'})
    return mainblock

filelist=[]


for dirname, dirnames, filenames in os.walk('./cleaned/'):
    for filename in filenames:
        filelist.append(os.path.join(dirname, filename))

bses=[]
for f in filelist:
    source=open(f).read()
    bses.append((f,source,bs(source)))

# files whose cleaned content is literally the string 'None' (presumably pages
# where the content block couldn't be found); collect them so the original
# downloads can be revisited
noneones=[]
for f in bses:
    if f[1]=='None':
        noneones.append(f)

sourcecomp=[]
for f in noneones:
    sourcecomp.append((f[0],open(f[0].replace('./cleaned/','./output/')).read()))


urls=[x[0].replace('./cleaned/','').replace('-','/') for x in sourcecomp]

open('html_urls.txt','w').writelines([u + '\n' for u in urls])

newurls=[]
for u in urls:
    a=u.split('/')
    newurls.append(a[0] +'//' + a[2] + '/' + a[3] + '/' + a[4] + '/' + a[5] + '-' + a[6] + '/' + a[7] + '/')

open('html_urls.txt','w').writelines([u + '\n' for u in newurls])

newbses=[]
noneurls=[x[0] for x in noneones]
for b in range(len(bses)):
    if bses[b][0] not in noneurls:
        newbses.append(bses[b])

b=newbses

parsed=[]
for u in b:
    record=[]
    h3s=u[2].findAll('h3')
    section_count=0
    # each page carries 0-3 <h3> section headings; grab each heading plus the
    # text that follows it
    if len(h3s) == 2:
        record=[h3s[0],h3s[0].next.next.next,h3s[1],h3s[1].next.next.next]
        record=[x.contents[0].strip() for x in record]
        section_count=2
    if len(h3s) == 0:
        record=[u[2].contents[0].next.next]
        record=[x.contents[0].strip() for x in record]
        section_count=0
    if len(h3s) == 1:
        record=[h3s[0],h3s[0].next.next]
        record=[x.contents[0].strip() for x in record]
        section_count=1
    if len(h3s) == 3:
        record=[h3s[0],h3s[0].next.next.next,h3s[1],h3s[1].next.next.next,h3s[2],h3s[2].next.next.next]
        record=[x.contents[0].strip() for x in record]
        section_count=3
    # prepend the section count to the record
    record.reverse()
    record.append(section_count)
    record.reverse()
    parsed.append((u[0].split('-')[3:-1],u[1],u[2],record))

noneones=[x for x in parsed if len(x[3])==0]

import pickle
# reload the parsed records (presumably pickled to a file named 'parsed'
# earlier in the session)
parsed=pickle.load(open('parsed','rb'))

newparsed=[x[0]+x[3] for x in parsed]

import csv
w=csv.writer(open('horoscopes.csv','wb'),delimiter=',',quoting=csv.QUOTE_NONNUMERIC)
for x in newparsed:
    w.writerow(x[1:])

"""

Example of using the old BeautifulSoup API to extract content from downloaded html files into CSV... if you're doing this sort of thing today, I recommend using the newer lxml interface directly, but lxml also has a BeautifulSoup compatibility layer.

"""

import os

from BeautifulSoup import BeautifulSoup as bs

def get_horo(source):
g=bs(source)
mainblock=g.find('div',{'class':'content'})
return mainblock

import os

filelist=[]

for dirname, dirnames, filenames in os.walk('./cleaned/'):
    for filename in filenames:
        filelist.append(os.path.join(dirname, filename))

bses=[]
for f in filelist:
source=open(f).read()
bses.append((f,source,bs(source)))

noneones=[]
for f in bses:
if f[1]=='None':
noneones.append(f)

sourcecomp=[]
for f in noneones:
sourcecomp.append((f[0],open(f[0].replace('./cleaned/','./output/')).read()))

urls=[x[0].replace('./cleaned/','').replace('-','/') for x in sourcecomp]

open('html_urls.txt','w').writelines([u + '\n' for u in urls])

newurls=[]
for u in urls:
a=u.split('/')
newurls.append(a[0] +'//' + a[2] + '/' + a[3] + '/' + a[4] + '/' + a[5] + '-' + a[6] + '/' + a[7] + '/')

open('html_urls.txt','w').writelines([u + '\n' for u in newurls])

newbses=[]
noneurls=[x[0] for x in noneones]
for b in range(len(bses)):
if not bses[b][0] in noneurls:
newbses.append(bses[b])

b=newbses

parsed=[]
for u in b:
record=[]
h3s=u[2].findAll('h3')
section_count=0
if len(h3s) == 2:
record=[h3s[0],h3s[0].next.next.next,h3s[1],h3s[1].next.next.next]
record=[x.contents[0].strip() for x in record]
section_count=2
if len(h3s) == 0:
record=[u[2].contents[0].next.next]
record=[x.contents[0].strip() for x in record]
section_count=0
if len(h3s) == 1:
record=[h3s[0],h3s[0].next.next]
record=[x.contents[0].strip() for x in record]
section_count=1
if len(h3s) == 3:
record=[h3s[0],h3s[0].next.next.next,h3s[1],h3s[1].next.next.next,h3s[2],h3s[2].next.next.next]
record=[x.contents[0].strip() for x in record]
section_count=3
record.reverse()
record.append(section_count)
record.reverse()
parsed.append((u[0].split('-')[3:-1],u[1],u[2],record))

noneones=[x for x in parsed if len(x[3])==0]

import pickle
parsed=pickle.load(open('parsed','b'))

newparsed=[x[0]+x[3] for x in parsed]

import csv
w=csv.writer(open('horoscopes.csv','wb'),delimiter=',',quoting=csv.QUOTE_NONNUMERIC)
for x in newparsed:
w.writerow(x[1:])

Filtering it down

Horoscoped - Filtering 22,000 horoscopes
So every different type of horoscope got sucked up – career, teen, love, daily overview. Who knew there were so many? It was felt, though, that the career & love predictions would have their own internal biases, i.e. lots of mentions of work, career, love, marriage etc. So we opted to analyse just the generic daily horoscopes for each sign: 4,380 in total (365 per star sign).
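
The article doesn’t show this filtering step, but given the file-naming scheme in the scraping script above (daily overviews are saved as overview_<sign>_<date>.html in output/), it could be as simple as:

import os

daily_files = [f for f in os.listdir('output') if f.startswith('overview_')]
print(len(daily_files))      # 12 signs x 365 days = 4,380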

Word Analysis Version 1

We used an online tool called TagCrowd to find the most common words. I prefer it to Wordle. You’ve got better control over any ‘noise’ in the signal, because you can filter out not only common words (“and”, “for”, “is” etc) but also a special ‘stoplist’ of words you’ve chosen.
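
TagCrowd does the counting for you, but the same idea is easy to sketch in Python (the word lists here are placeholders, not the actual stoplists used):

import re
from collections import Counter

common_words = set(['and', 'the', 'a', 'to', 'of', 'you', 'your', 'is', 'for', 'it'])
stoplist = set(['someone', 'really', 'quite'])      # extra words chosen to ignore

def top_words(texts, n=50):
    # texts is a list of horoscope strings for one sign
    words = re.findall(r"[a-z']+", ' '.join(texts).lower())
    words = [w for w in words if w not in common_words and w not in stoplist]
    return Counter(words).most_common(n)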

So we broke down the 50 most common words to see if there were any patterns of unique words. This is what was revealed:

Horoscoped - Unique words in top 50 words in predictions of each star sign

You can see the full data in a Google spreadsheet here.

Word Analysis Version 2

It struck me that several words in the top 50 – like “someone”, “really”, “quite” – were just qualifiers and not really that revealing. You’d find them in any English word analysis.

So we stripped those kinds of words out (see our stoplist). And lo! A fresh set of unique, revealing and more accurate words appeared in the top words per sign.
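
The “unique words” comparison can be sketched the same way. Assuming top50 maps each sign to the list of its 50 most common words (e.g. the words from top_words() above), a word counts as unique if it appears in no other sign’s list:

def unique_top_words(top50):
    uniques = {}
    for sign in top50:
        others = set()
        for other_sign in top50:
            if other_sign != sign:
                others |= set(top50[other_sign])
        uniques[sign] = set(top50[sign]) - others
    return uniques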

Horoscoped - Unique words in top 50 words in predictions of each star sign

Can I just say that I have no personal interest in horoscopes. I don’t know what the various characteristics of each star sign are meant to be. So you’ll have to tell me if any of this corresponds to folklore.

This was the data we used to create our meta-chart. Check out the final image. Or see all the data in this Google spreadsheet.

Meta-Prediction

One more thing, though. This analysis appears to reveal something: the bulk of the words used in horoscopes (at least 90%) are the same. That’s not a full, proper statistical analysis, admittedly. (If you are a statistician and you want to do a proper analysis, please get in touch.)
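
For what it’s worth, one simple way to put a number on the “mostly the same” impression (a sketch, not the analysis used here) is to compare word-frequency counts between signs, e.g. with cosine similarity; values near 1 for every pair of signs would back up the reading that the predictions are interchangeable:

import math

def cosine_similarity(counts_a, counts_b):
    # counts_a and counts_b are word -> frequency dicts (e.g. collections.Counter objects)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a if w in counts_b)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)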

The cool thing is, once you’ve isolated the most common words, you can actually write a generic meta-prediction that would apply to all star signs, every day of the year. Here it is.

Horoscoped - Meta-prediction made from most common words in 4,000 star sign predictions

The Future

As ever, I’ve laid out my whole process and all the data here: http://bit.ly/horoscoped.
That way it’s all balanced and you can make up your own mind. Typical Libran!

Concept & research & design: David McCandless
Additional design: Matt Hancock
Additional research: Miriam Quick
Hacking: Thomas Winningham
Source: Yahoo Shine Horoscopes
Code & Scripts: Here and here
Data & workings: bit.ly/horoscoped

Talking about horoscopes usually induces scoffing from me, but this was a genuinely interesting, fun read. :)

posted via email from posterous