Ideas and Execution Magic Chart

Ideas vs Execution

There is an endless discussion in the startup community about the value of ideas versus the importance of execution. Here is a timeline showing Hacker News community submissions with the idea(s) keyword in the title:

I am no prophet, but I believe the future will most likely lean towards ideas, because the cost of creating and operating a web company has been dramatically reduced. Soon marketing and sales services will also become more affordable, making it easier to solve the business puzzle. Following Joseph Schumpeter’s thinking, big companies have an advantage because they have more resources. In practice, however, they often prefer to acquire the survivors of the market’s natural selection instead of building risky projects from scratch. As a result, entrepreneurs benefit from reduced competition in the initial phase of product development.

Magic Chart

This is an exercise: you must be objective when filling in your chart, and you will have to dabble in the black art of time estimation. The idea of the magic chart is to plot your ideas on a scatter plot. The x axis shows the time you expect it to take to execute the idea (you can limit it to development time at first), and the y axis shows the potential of the idea. You can easily add other dimensions, such as cost, by using the size or color of each plotted point. Add a vertical line to the chart at the longest execution time that is feasible for you.
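The data behind such a chart can be sketched in a few lines of Python. This is only an illustration: the `Idea` class, the field names, the sample ideas and the 12-week limit are all made up for the example, and the resulting rows can be handed to whatever plotting tool you prefer.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    weeks: float        # x axis: estimated execution time
    potential: float    # y axis: subjective potential, e.g. on a 0-10 scale
    cost: float = 0.0   # optional extra dimension, e.g. drawn as point size


def split_by_time_limit(ideas, limit_weeks):
    """Split ideas into those left of the vertical time limit and those beyond it."""
    feasible = [i for i in ideas if i.weeks <= limit_weeks]
    out_of_reach = [i for i in ideas if i.weeks > limit_weeks]
    return feasible, out_of_reach


def chart_points(ideas):
    """Return (x, y, size) rows ready for any scatter plot tool."""
    return [(i.weeks, i.potential, i.cost) for i in ideas]


# Hypothetical ideas, purely for illustration.
ideas = [
    Idea("browser plugin", weeks=3, potential=4, cost=1),
    Idea("niche search engine", weeks=30, potential=8, cost=5),
    Idea("CSV export tool", weeks=2, potential=3, cost=1),
]

feasible, out_of_reach = split_by_time_limit(ideas, limit_weeks=12)
print([i.name for i in feasible])
print(chart_points(ideas))
```

Keeping the time limit as a parameter makes it easy to see how the set of feasible ideas changes when your estimate of available time changes.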

Here is my magic chart:

 

As you can see, it is difficult to come up with ideas that can be executed in a short time, and many ideas fall into a zone of uncertainty beyond a certain point in time. If you think that having a minimum viable product is key, then you must think very hard about how to reduce your product execution time, and this is an art more than a science. The need to generate profit is a serious constraint. Your idea may be excellent and your software may be used by millions of people, but you may still lack a business model.

What does your ideas and execution magic chart look like?

HNSearch Script

Here is the Python script for retrieving Hacker News posts with the words idea and ideas in the title. It includes a legal hack (what else?) to bypass the 1000-item limit imposed by ThriftDB’s HNSearch API.

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Done under Visual Studio 2010 using the excellent Python Tools for Visual Studio http://pytools.codeplex.com/

import urllib2
import json
from datetime import datetime
from time import mktime
import csv
import codecs
import cStringIO

class CSVUnicodeWriter: # http://docs.python.org/library/csv.html
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

def get_hackernews_articles_with_idea_in_the_title():
    endpoint = 'http://api.thriftdb.com/api.hnsearch.com/items/_search?filter[fields][title]=idea&start={0}&limit={1}&sortby=map(ms(create_ts),{2},{3},4294967295000)%20asc'

    incomplete_iso_8601_format = '%Y-%m-%dT%H:%M:%SZ'

    items = {}
    start = 0
    limit = 100
    begin_range = 0
    end_range = 0

    url = endpoint.format(start, limit, begin_range, str(int(end_range)))
    response = urllib2.urlopen(url).read()
    data = json.loads(response)

    prev_timestamp = datetime.fromtimestamp(0)

    results = data['results']

    while results:
        for e in data['results']:
            _id = e['item']['id']
            title = e['item']['title']
            points = e['item']['points']
            num_comments = e['item']['num_comments']
            timestamp = datetime.strptime(e['item']['create_ts'], incomplete_iso_8601_format)

            #if timestamp < prev_timestamp: # The results are not correctly sorted. We can't rely on this one.
            if _id in items: # If the circle is complete.
                return items

            prev_timestamp = timestamp

            items[_id] = {'id':_id, 'title':title, 'points':points, 'num_comments':num_comments, 'timestamp':timestamp}
            title_utf8 = title.encode('utf-8')
            print title_utf8, timestamp, _id, points, num_comments

        start += len(results)

        if start + limit > 1000:
            start = 0
            end_range = mktime(timestamp.timetuple())*1000

        url = endpoint.format(start, limit, begin_range, str(int(end_range))) # Convert to int first: formatting the float directly gives scientific notation such as '1.24267528e+12'
        response = urllib2.urlopen(url).read()
        data = json.loads(response)
        results = data['results']

    return items

if __name__ == '__main__':
    items = get_hackernews_articles_with_idea_in_the_title()

    with open('hn-articles.csv', 'wb') as f:
        hn_articles = CSVUnicodeWriter(f)

        hn_articles.writerow(['ID', 'Timestamp', 'Title', 'Points', '# Comments'])

        for k,e in items.items():
            hn_articles.writerow([str(e['id']), str(e['timestamp']), e['title'], str(e['points']), str(e['num_comments'])])

# The script returns 3706 articles, whereas the query reports 3711. Find the bug...
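The idea behind the hack can be shown without the real API. The sketch below is written in Python 3 and makes several assumptions: `search` is a hypothetical callable standing in for the HNSearch endpoint, `make_fake_search` simulates the offset limit in memory, and a plain lower-bound filter on the timestamp replaces the Solr `map()` sort trick used in the script above. It is an illustration of the windowing technique, not a drop-in replacement.

```python
def make_fake_search(data, max_offset=1000):
    """Build an in-memory stand-in for a search API that caps start + limit."""
    def search(min_ts, start, limit):
        if start + limit > max_offset:
            raise ValueError("offset limit exceeded")
        # Items at or after min_ts, in ascending timestamp order.
        window = [d for d in data if d["create_ts"] >= min_ts]
        return window[start:start + limit]
    return search


def fetch_all(search, page_size=100, max_offset=1000):
    """Page through every result by restarting the window at the last timestamp seen.

    Assumes fewer than max_offset items share a single timestamp; otherwise
    the window could never advance.
    """
    items = {}
    min_ts = 0
    start = 0
    while True:
        page = search(min_ts, start, page_size)
        if not page:
            return items
        for item in page:
            # Keying by id removes the overlap created at each window restart.
            items[item["id"]] = item
        start += len(page)
        if start + page_size > max_offset:
            # The next page would cross the limit: move the window forward.
            min_ts = page[-1]["create_ts"]
            start = 0


# Made-up data: 2500 items, two items per timestamp.
data = [{"id": i, "create_ts": i // 2} for i in range(2500)]
result = fetch_all(make_fake_search(data))
print(len(result))
```

Restarting at the last timestamp with an inclusive bound fetches a few items twice, which is why the results are stored in a dictionary keyed by id. With an exclusive bound, items sharing the boundary timestamp would be skipped.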

 

Resources

  1. Are Ideas Getting Harder to Find? (2016)
  2. Science as Art
  3. Thinking Skills Instruction: Concepts and Techniques (Anthology)
  4. De Bono’s Lateral Thinking
  5. TRIZ
  6. Schumpeter’s Creative Destruction: A Review of the Evidence
  7. Google Query: “ideas vs execution” OR “execution vs ideas”
  8. Google Query: site:news.ycombinator.com AND (intitle:idea OR intitle:ideas)
  9. Startup Ideas We’d Like to Fund
  10. My list of ideas, if you’re looking for inspiration by Jacques Mattheij
  11. Startup Ideas We’d Like to Fund by Paul Graham.
  12. Ideas don’t make you rich. The correct execution of ideas does (excerpt from Felix Dennis’s book).
  13. Ideas suck by Chris Prescott.
  14. Execution Matters, Ideas Don’t by Fred Wilson.
  15. What Is Twitter’s Problem? No, It’s Not the Product
  16. 1000 results limit? (HNSearch NoAPI limits, bonus hack included in this article).
  17. Year 2038 problem
  18. How to use time > year 2038 on official Windows Python 2.5
  19. Solr FunctionQuery
  20. HackerNews Ideas Articles
  21. Execution Is An Order Of Magnitude Easier Than Opportunity

Articles Summary

This is a summary of all the Data Big Bang blog articles by subject.

IR

A summary of information retrieval stages and current data science articles.

Fetching

  1. Distributed Scraping With Multiple Tor Circuits
  2. Running Your Own Anonymous Rotating Proxies

Cleaning/Tidying

  1. HTML Cleaners and Tidiers

Parsing

Handling of Active Content

  1. Web Scraping Ajax and Javascript Sites

Main Content Extraction

  1. Extraction of Main Text Content Using the Google Reader NoAPI
  2. Voice Recognition + Content Extraction + TTS = Innovative Web Browsing

Language Identification

  1. Language Identification for Text Mining and NLP

Security

  1. Automated Browserless OAuth Authentication for Twitter
  2. The Python POPO’s Way to Integrate PayPal Instant Payment Notification

APIs and NoAPIs

  1. Google Search NoAPI
  2. Exporting StackOverflow users blogs to Excel Hyperlinks
  3. Extraction of Main Text Content Using the Google Reader NoAPI
  4. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on

Policies and Data Issues

  1. Scraping vs Antiscraping
  2. The Data Portability Fact Sheet

Entrepreneurship

  1. Ideas and Execution Magic Chart
  2. Ideas: Egont, A Web Orchestration Language
  3. Egont Part II

Marketing and Sales

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website

Plugins

  1. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on

Big Data Stack

  1. Using Queues in Web Crawling and Analysis Infrastructure
  2. Persisting Native Python Queues
  3. Adding Acknowledgement Semantics to a Persistent Queue
  4. Esoteric Queue Scheduling Disciplines

Tools

  1. Running Microsoft Windows Console Applications with Invisible Windows

Announcements

Resources

  1. Data Science Resources
Digital Art by Don Relyea

HTML Cleaners and Tidiers

Tag Soup

When you are crawling a website you will come across a lot of malformed web pages. Some typical problems are unclosed tags and mishandled comments or CSS styles. Modern browsers have to do a good job of cleaning HTML in order to build the correct DOM without ambiguities. For performance and scalability reasons, it is more efficient to process HTML with a parser than with a browser or a headless browser such as HTMLUnit or PhantomJS. If your HTML parser does not incorporate the cleaning or fixing process, you will have to use an HTML cleaner or tidier.

As in other processing pipelines, if you fail to clean up malformed HTML, all subsequent processes will be stalled. It is therefore important to choose a good HTML cleaner, and many cleaners fail to do their jobs.

HTML Cleaner List

The list of HTML cleaners is long, but the list of good ones is pretty short. In our experience the best choice is lxml.html. Other cleaners often have trouble.
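To make "cleaning" concrete, here is a minimal sketch of one such repair, tag balancing, written with Python's standard `html.parser` module. It is not how lxml.html works internally and it is no substitute for a real cleaner: the `TagBalancer` class and `balance_tags` function are names invented for this example, and dropping comments is a choice made here for brevity.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalancer(HTMLParser):
    """Re-emit HTML, closing tags left open and dropping stray end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with no matching start tag: drop it
        # Close everything opened after the matching start tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        pass  # comments are dropped in this sketch


def balance_tags(html):
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end of the input, innermost first.
    while parser.open_tags:
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)


print(balance_tags("<div><p>Hello <b>world</p><!-- note --></span>"))
```

A real cleaner also has to deal with implied end tags, misnested tables, broken attributes, encodings and much more, which is why a tested library is the better choice for production crawling.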

Comprehensive Resources

  1. lxml.html
  2. Beautiful Soup
  3. lxml.html vs Beautiful Soup
  4. Cleaning Word’s Nasty HTML
  5. HTML Cleaners query
  6. Tag soup