Wednesday, August 31, 2011
Horoscoped
Do horoscopes really all just say the same thing? We scraped & analysed 22,000 to find out. See our completed meta-horoscope chart and make up your own mind.
We’ve also created a single meta-prediction out of the most common words.
How we did it
How do you gather 22,000 horoscopes? Obviously you could manually cut and paste them from one of the many online Zodiac pages. But that, we calculated, would take about a week of solid work (84.44 hours). So we engaged the services of arch-coder Thomas Winningham to do a bit of hacking. Yahoo Shine kindly archive their daily predictions in a simple and very hackable format (example). Thank you! So Thomas wrote a Python script to screen-scrape 22,186 horoscopes into a single massive spreadsheet. Screen-scraping is pulling the text off a website after it’s been rendered. Python is a programming language; you can use it to write scripts that gather only the specific text you want, then run them repeatedly to mine an entire website.
Well, it’s not quite that easy. Big sites like Yahoo have ‘rate-limiting’ on their servers. That means if you access a page too many times too quickly, it thinks you’re a hacker and deploys all kinds of anti-hacking counter-measures. Initially, Thomas set his scraping speed too high (once every 10th of a second) and his IP got instantly banned from Yahoo for 24 hours. After some experimenting (and more bans), he found that a two-second delay between scrapes prevented the defence mechanisms from kicking in. The script was set to run in the background (while we smoked cigars and discussed the empire). 12 hours later, we had our 22,000 horoscopes in a single file!
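The pattern is just “fetch, wait, repeat”. Here’s a minimal sketch of that polite, rate-limit-friendly loop – not Thomas’s actual script; the function names and the pluggable `fetch` argument are our own illustration:

```python
import time

def fetch_all(urls, fetch, delay=2.0):
    """Fetch each URL in turn with fetch(url) -> bytes, sleeping `delay`
    seconds between requests so the server's rate-limiting never trips."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:
            # Skip failures rather than abort an overnight run
            print("PASS:", url, exc)
        time.sleep(delay)
    return pages

# In real use, fetch would be something like:
#   import urllib.request
#   fetch = lambda u: urllib.request.urlopen(u).read()
```

Making the fetcher an argument keeps the pacing logic separate from the network code, which also makes it easy to dry-run against fake pages.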
We can’t share the 9.5MB spreadsheet with you because it’s Yahoo’s copyright. But here are the Python scripts should you feel like recreating the experiment.
```python
"""Creates a list of URLs to stdout based on repeating patterns found in
the site, suitable for use with WGET or CURL."""
import datetime

scopes = ["aries", "taurus", "gemini", "cancer", "leo", "virgo", "libra",
          "scorpio", "sagittarius", "capricorn", "aquarius", "pisces"]
daily_urlbases = [
    ("overview", "http://shine.yahoo.com/astrology/%s/daily-overview/%i/"),
    ("extended", "http://shine.yahoo.com/astrology/%s/daily-extended/%i/"),
    ("daily-teen", "http://shine.yahoo.com/astrology/%s/daily-teen/%i/"),
    ("daily-love", "http://shine.yahoo.com/astrology/%s/daily-love/%i/"),
    ("daily-career", "http://shine.yahoo.com/astrology/%s/daily-career/%i/"),
]
yearly_urlbases = [
    ("yearly", "http://shine.yahoo.com/astrology/%s/yearly-overview/"),
    ("yearly-love", "http://shine.yahoo.com/astrology/%s/yearly-love/"),
    ("yearly-career", "http://shine.yahoo.com/astrology/%s/yearly-career/"),
]
weekly_urlbases = [
    ("weekly", "http://shine.yahoo.com/astrology/%s/weekly-overview/%i/"),
    ("weekly-love", "http://shine.yahoo.com/astrology/%s/weekly-love/%i/"),
    ("weekly-career", "http://shine.yahoo.com/astrology/%s/weekly-career/%i/"),
]
monthly_urlbases = [
    ("monthly", "http://shine.yahoo.com/astrology/%s/monthly-overview/%i/"),
    ("monthly-love", "http://shine.yahoo.com/astrology/%s/monthly-love/%i/"),
    ("monthly-career", "http://shine.yahoo.com/astrology/%s/monthly-career/%i/"),
]

class dateincrement:
    def __init__(self, initialdate=datetime.datetime(2009, 12, 31),
                 scale=datetime.timedelta(days=1)):
        self.thisdate = initialdate
        self.scale = scale

    def next(self):
        if self.thisdate + self.scale < datetime.datetime(2011, 1, 1):
            self.thisdate += self.scale
            return self.thisdate
        else:
            return False

class monthincrement:
    def __init__(self, initialmonth=0):
        self.month = initialmonth

    def next(self):
        if self.month < 12:
            self.month += 1
            return datetime.datetime(2010, self.month, 1)
        else:
            return False

dayobj = None
weekobj = None
monthobj = None

def makeObjects():
    dayobj = dateincrement()
    weekobj = dateincrement(datetime.datetime(2009, 12, 28),
                            datetime.timedelta(days=7))
    monthobj = monthincrement()

def testobj(obj):
    while True:
        result = obj.next()
        if result:
            print result
        else:
            break

def testallobjs():
    testobj(dayobj)
    testobj(weekobj)
    testobj(monthobj)

pad = lambda x: str(x).rjust(2, "0")

def generate_urls(thelist, theobj):
    results = []
    while True:
        d = theobj.next()
        print d
        if d:
            for url in thelist:
                for month in scopes:
                    datestr = int(str(d.year) + pad(d.month) + pad(d.day))
                    print repr(month)
                    print repr(datestr)
                    aresult = url[1] % (month, datestr)
                    print aresult
                    results.append((url[0] + "_" + month, datestr, aresult))
        else:
            break
    return results

def generate_yearly_urls(thelist):
    results = []
    for url in thelist:
        for month in scopes:
            print repr(month)
            aresult = url[1] % month
            print aresult
            results.append((url[0] + "_" + month, "2010", aresult))
    return results

import sys
from urllib import urlopen
#from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()
queue = Queue()

def get_parser():
    def parse():
        try:
            while True:
                url = queue.get_nowait()
                print "GRABBING: " + repr(url)
                try:
                    content = urlopen(url[2]).read()
                    f = open("output/" + url[0] + "_" + str(url[1]) + ".html", 'w')
                    f.write(content)
                    f.close()
                    if len(content) > 10000:
                        url.task_done()
                except:
                    print "PASS: " + repr(url)
                    pass
        except Empty:
            pass
    return parse

if __name__ == "__main__":
    daily = generate_urls(daily_urlbases, dateincrement())
    weekly = generate_urls(weekly_urlbases,
                           dateincrement(datetime.datetime(2009, 12, 28),
                                         datetime.timedelta(days=7)))
    monthly = generate_urls(monthly_urlbases, monthincrement())
    yearly = generate_yearly_urls(yearly_urlbases)
    combined = daily + weekly + monthly + yearly
    parser = get_parser()
    for x in combined:
        #queue.put(x)
        print x[2]
    workers = []
    # for i in range(5):
    #     worker = Thread(target=parser)
    #     worker.start()
    #     works.append(worker)
    # for worker in workers:
    #     worker.join()
```
```python
"""Example of using the old BeautifulSoup API to extract content from
downloaded html files into CSV... if you're doing this sort of thing today,
I recommend using the newer lxml interface directly, but lxml also has a
BeautifulSoup compatibility layer."""
import os
from BeautifulSoup import BeautifulSoup as bs

def get_horo(source):
    g = bs(source)
    mainblock = g.find('div', {'class': 'content'})
    return mainblock

filelist = []
for dirname, dirnames, filenames in os.walk('./cleaned/'):
    for filename in filenames:
        filelist.append(os.path.join(dirname, filename))

bses = []
for f in filelist:
    source = open(f).read()
    bses.append((f, source, bs(source)))

noneones = []
for f in bses:
    if f[1] == 'None':
        noneones.append(f)

sourcecomp = []
for f in noneones:
    sourcecomp.append((f[0], open(f[0].replace('./cleaned/', './output/')).read()))

urls = [x[0].replace('./cleaned/', '').replace('-', '/') for x in sourcecomp]
open('html_urls.txt', 'w').writelines([u + '\n' for u in urls])

newurls = []
for u in urls:
    a = u.split('/')
    newurls.append(a[0] + '//' + a[2] + '/' + a[3] + '/' + a[4] + '/'
                   + a[5] + '-' + a[6] + '/' + a[7] + '/')
open('html_urls.txt', 'w').writelines([u + '\n' for u in newurls])

newbses = []
noneurls = [x[0] for x in noneones]
for b in range(len(bses)):
    if not bses[b][0] in noneurls:
        newbses.append(bses[b])
b = newbses

parsed = []
for u in b:
    record = []
    h3s = u[2].findAll('h3')
    section_count = 0
    if len(h3s) == 2:
        record = [h3s[0], h3s[0].next.next.next, h3s[1], h3s[1].next.next.next]
        record = [x.contents[0].strip() for x in record]
        section_count = 2
    if len(h3s) == 0:
        record = [u[2].contents[0].next.next]
        record = [x.contents[0].strip() for x in record]
        section_count = 0
    if len(h3s) == 1:
        record = [h3s[0], h3s[0].next.next]
        record = [x.contents[0].strip() for x in record]
        section_count = 1
    if len(h3s) == 3:
        record = [h3s[0], h3s[0].next.next.next, h3s[1], h3s[1].next.next.next,
                  h3s[2], h3s[2].next.next.next]
        record = [x.contents[0].strip() for x in record]
        section_count = 3
    record.reverse()
    record.append(section_count)
    record.reverse()
    parsed.append((u[0].split('-')[3:-1], u[1], u[2], record))

noneones = [x for x in parsed if len(x[3]) == 0]

import pickle
parsed = pickle.load(open('parsed', 'b'))
newparsed = [x[0] + x[3] for x in parsed]

import csv
w = csv.writer(open('horoscopes.csv', 'wb'), delimiter=',',
               quoting=csv.QUOTE_NONNUMERIC)
for x in newparsed:
    w.writerow(x[1:])
```

Filtering it down
So every different type of horoscope got sucked up – career, teen, love, daily overview. Who knew there were so many? We felt, though, that the career & love predictions would have their own internal biases, i.e. lots of mentions of work, career, love, marriage etc. So we opted to analyse just the generic daily horoscopes for each sign – a total of 4,380 (365 per star sign).

Word Analysis Version 1
We used an online tool called TagCrowd to find the most common words. I prefer it to Wordle: you get better control over any ‘noise’ in the signal, because you can not only filter out common words (“and”, “for”, “is” etc.) but also apply a special ‘stoplist’ of words you’ve chosen yourself.
So we broke down the 50 most common words to see if any patterns of unique words emerged. This is what was revealed:
You can see the full data in a Google spreadsheet here.
Word Analysis 2
It struck me that several words in the top 50 – like “someone”, “really”, “quite” – were just qualifiers and not really that revealing. You’d find them in any English word analysis.
So we stripped those kinds of words out (see our stoplist). And lo! A fresh set of unique, revealing and more accurate words appeared in the top words per sign.
Can I just say that I have no personal interest in horoscopes. I don’t know what the various characteristics of each star sign are meant to be. So you’ll have to tell me if any of this corresponds to folklore.
This was the data we used to create our meta-chart. Check out the final image. Or see all the data in this Google spreadsheet.
Meta-Prediction
One more thing though. This analysis appears to reveal something: the bulk of the words in horoscopes (at least 90%) are the same. That’s not a full, proper statistical analysis. (If you are a statistician and want to do one, please get in touch.)
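For anyone who does want to firm up that “90% the same” figure, one crude starting point – our sketch, not a proper statistical test – is the overlap between two signs’ vocabularies:

```python
def shared_fraction(words_a, words_b):
    """Fraction of vocabulary A that also appears in vocabulary B."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a) if a else 0.0

def mean_overlap(vocab_by_sign):
    """Average pairwise overlap across all ordered pairs of signs."""
    signs = list(vocab_by_sign)
    pairs = [(s, t) for s in signs for t in signs if s != t]
    return sum(shared_fraction(vocab_by_sign[s], vocab_by_sign[t])
               for s, t in pairs) / len(pairs)
```

A real analysis would want to weight by word frequency rather than treat vocabularies as flat sets, which is exactly the kind of thing a statistician could improve on.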
The cool thing is, once you’ve isolated the most common words, you can actually write a generic meta-prediction that would apply to all star signs, every day of the year. Here it is.
The Future
As ever, I’ve laid out my whole process and all the data here: http://bit.ly/horoscoped.
That way it’s all balanced and you can make up your own mind. Typical Libran!

Concept & research & design: David McCandless
Additional design: Matt Hancock
Additional research: Miriam Quick
Hacking: Thomas Winningham
Source: Yahoo Shine Horoscopes
Code & Scripts: Here and here
Data & workings: bit.ly/horoscoped