Precise Scraping with Google Chrome

Developers often comb the vast corpus of scraping tools for one capable of simulating a full browser. The search is unnecessary: a full browser with extension support is itself a great scraping tool. Among extensions, Google Chrome’s are by far the easiest to develop, while Mozilla Firefox’s APIs are less restrictive. Google also offers a second way to control Chrome, the remote debugging protocol, but it is unfortunately quite slow.

The Google Chrome extension API is an excellent choice for writing an up-to-date scraper that runs in a full browser with the latest HTML5 features and performance improvements. In a previous article, we described how to scrape the Microsoft TechNet App-V forum. Now we will focus on VMware ThinApp; this time we develop a Google Chrome extension instead of a Python script.

Procedure

  1. You will need Google Chrome, Python 2.7, and lxml.html
  2. Download the code from github
  3. Install the Google Chrome extension
  4. Enter the VMware ThinApp: Discussion Forum
  5. The scraper starts automatically
  6. Once it stops, open the Google Chrome console and copy and paste the JSON results into the thinapp.json file
  7. Run thinapp_parser.py to generate the thinapp.csv file with the results
  8. Open the thinapp.csv file with a spreadsheet
  9. To rank the results, add a column which divides the number of views by the number of days.
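Step 9 can also be scripted instead of being done in the spreadsheet. Here is a minimal sketch of the views-per-day ranking; the `views` and `age_days` column names are assumptions, so match them to the actual thinapp.csv header:

```python
import csv

def rank_by_views_per_day(rows, views_col='views', days_col='age_days'):
    """Sort threads by views per day, highest first (column names are assumptions)."""
    ranked = []
    for row in rows:
        days = max(float(row[days_col]), 1.0)  # avoid division by zero for brand-new threads
        row['views_per_day'] = float(row[views_col]) / days
        ranked.append(row)
    return sorted(ranked, key=lambda r: r['views_per_day'], reverse=True)

# Example with in-memory rows; in practice read them with csv.DictReader from thinapp.csv.
rows = [{'title': 'A', 'views': '900', 'age_days': '30'},
        {'title': 'B', 'views': '100', 'age_days': '2'}]
for r in rank_by_views_per_day(rows):
    print('{0}: {1} views/day'.format(r['title'], round(r['views_per_day'], 1)))
```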

Our Results: Top Twenty Threads

  1. Registry Isolation…
  2. Thinapp Internet Explorer 10
  3. Process (ifrun60.exe) remains active (Taskmanager) after closing thinapp under windows7 (xp works)
  4. Google Chrome browser
  5. File association not passing file to thinapp package
  6. Adobe CS3 Design Premium and FlexNET woes…
  7. How to thinapp Office 2010?
  8. Size limit of .dat file?
  9. ThinApp Citrix Receiver 3.2
  10. Visio 2010 Thinapp – Licensing issue
  11. Thinapp Google Chrome
  12. Thinapp IE7 running on Windows 7
  13. Adobe CS 6
  14. Failed to open, find, or create Sandbox directory
  15. Microsoft Project and Office issues
  16. No thinapp in thinapp factory + unable to create workpool
  17. IE8 Thinapp crashing with IE 10 installed natively
  18. ThinApp MS project and MS Visio 2010
  19. Difference between ESXi and vSphere and VMware view ??
  20. ThinAPP with AppSense

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

  • This approach can be successfully used to scrape JavaScript-heavy and AJAX sites
  • Instead of copying the JSON data from the Chrome console, you can use the FileSystem API to write the results to a file
  • You can also write the CSV directly from Chrome instead of using an extra script

If you like this article, you might also be interested in

  1. Scraping for Semi-automatic Market Research
  2. Application Virtualization Benchmarking: Microsoft App-V Vs. Symantec
  3. Web Scraping Ajax and Javascript Sites [using HTMLUnit]
  4. Distributed Scraping With Multiple Tor Circuits
  5. VMWare ThinApp vs. Symantec Workspace

Resources

  1. Application Virtualization Smackdown
  2. Application Virtualization Market Report

Web Scraping for Semi-automatic Market Research

It is easy to web scrape the Microsoft TechNet forums (look at the XML output here: http://social.technet.microsoft.com/Forums/en-US/mdopappv/threads?outputAs=xml) and normalize the resulting information to get a better idea of each thread’s rank based on views and initial publication date. Knowing how issues rank can help a company choose what to focus on.
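Because of the `outputAs=xml` trick, there is no HTML to parse at all. Below is a sketch of extracting thread titles and view counts from that kind of feed, using the standard library’s ElementTree (the article’s parser uses lxml.html). The element names and the inline sample are assumptions; inspect the real XML, and note the endpoint may have changed since publication:

```python
import xml.etree.ElementTree as ET

# A trimmed, invented sample shaped like the outputAs=xml feed;
# the real element names may differ -- inspect the actual output.
sample = """
<threads>
  <thread><title>Office 2010 KMS activation</title><views>5400</views></thread>
  <thread><title>App-V 5 Hotfix 1</title><views>3100</views></thread>
</threads>
"""

root = ET.fromstring(sample)
threads = [(t.findtext('title'), int(t.findtext('views')))
           for t in root.findall('thread')]
print(threads)
```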

This code was used to scrape Microsoft TechNet’s forums. In the example below we scraped the App-V forum, since App-V is one of the application virtualization market’s leaders, along with VMware ThinApp and Symantec Workspace Virtualization.

These are the top ten threads for the App-V forum:

  1. “Exception has been thrown by the target of an invocation”
  2. Office 2010 KMS activation Error: 0xC004F074
  3. App-V 5 Hotfix 1
  4. Outlook 2010 Search Not Working
  5. Java 1.6 update 17 for Kronos (webapp)
  6. Word 2010 There was a problem sending the command to the program
  7. Utility to quickly install/remove App-V packages
  8. SAP GUI 7.1
  9. The dreaded: “The Application Virtualization Client could not launch the application”
  10. Sequencing Chrome with 4.6 SP1 on Windows 7 x64

The results show how frequently customers have issues virtualizing Microsoft Office, Key Management Services (KMS), SAP, and Java. App-V competitors like Symantec Workspace Virtualization and VMware ThinApp have similar problems. Researching markets this way gives you a good idea of the areas where you can contribute solutions.

The scraper stores all the information in a SQLite database, which can be exported to a UTF-8 CSV file using the csv_App-V.py script. We imported the file into Microsoft Excel and then normalized the thread rankings: we divided the number of views by the age of the thread, so threads with more views per day rank higher. Again, the scraper can be used on any Microsoft forum on Social TechNet; try it out on your favorite one.
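The normalization step can also be pushed into SQL before exporting. Here is a sketch against a hypothetical `threads` table with invented sample rows; the real schema lives in the scraper’s source:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the scraper's database file
conn.execute('CREATE TABLE threads (title TEXT, views INTEGER, age_days INTEGER)')
conn.executemany('INSERT INTO threads VALUES (?, ?, ?)',
                 [('Outlook 2010 Search Not Working', 4000, 200),
                  ('App-V 5 Hotfix 1', 900, 10)])

# Views per day, highest first; MAX(age_days, 1) guards against day-old threads.
rows = conn.execute(
    'SELECT title, CAST(views AS REAL) / MAX(age_days, 1) AS views_per_day '
    'FROM threads ORDER BY views_per_day DESC').fetchall()
print(rows)
```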

Code

Prerequisites: lxml.html

The code is available at microsoft-technet-forums-scraping [github]. It was written by Matias Palomera from Nektra Advanced Computing, who received valuable support from Victor Gonzalez.

Usage

  1. Run scrapper-App-V.py
  2. Then run csv_App-V.py
  3. The results are available in the App-V.csv file

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

  1. This code is single-threaded. Take a look at our discovering web resources code to optimize it with multithreading.
  2. Microsoft has given scrapers a special gift: it is possible to use the outputAs variable in the URL to get the structured information as XML instead of parsing HTML web pages.
  3. Our articles Distributed Scraping With Multiple Tor Circuits and Running Your Own Anonymous Rotating Proxies show how to implement your own rotating proxy infrastructure with Tor.
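The multithreading suggestion in note 1 can be sketched with a thread pool. `concurrent.futures` ships with Python 3 (and as the `futures` backport for the Python 2.7 used here), and `fetch_page` below is a hypothetical stand-in for the scraper’s per-page HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    # Hypothetical stand-in for an HTTP request to one forum page;
    # the real scraper would call urllib/lxml here.
    return 'results for page {0}'.format(page)

# Fetch several forum pages concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch_page, range(1, 6)))
print(len(pages))
```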

If you liked this article, you might also like:

  1. Nektra and VMware are Collaborating to Simplify Application Virtualization Packaging
  2. Automated Discovery of Social Media Identities

The Python POPO’s Way to Integrate PayPal Instant Payment Notification

Pompeo Massani: The Money Counter

Python PayPal IPN

PayPal is the fastest, though not the best, way to incorporate payments on your web site and reach a worldwide audience. But if you are looking for a Plain Old Python Object (POPO) way to integrate it with Python, you are on your own: the Instant Payment Notification (IPN) page only includes ASP, .NET, ColdFusion, Java, Perl, and PHP samples. A web search turns up a ton of Python code, but most of it targets frameworks such as Django, and the rest is not specifically about connecting Python with IPN, so it carries a lot of extra code you do not need. Here is a translation of the PHP sample code into Python.

Code

The code is also available on GitHub.

#!/usr/bin/python

# PHP to Python translation from: https://cms.paypal.com/cms_content/US/en_US/files/developer/IPN_PHP_41.txt

import urllib
import cgi
import cgitb
import socket
import ssl
import sys

cgitb.enable(logdir='../logs/')

form = cgi.FieldStorage()

req = 'cmd=_notify-validate'
for k in form.keys():
	v = form[k]
	value = urllib.quote(v.value.decode('string_escape')) # http://stackoverflow.com/questions/13454/python-version-of-phps-stripslashes
	req = req + '&{0}={1}'.format(k, value)

header = 'POST /cgi-bin/webscr HTTP/1.0\r\n'
header += 'Content-Type: application/x-www-form-urlencoded\r\n'
header += 'Content-Length: ' + str(len(req)) + '\r\n\r\n'

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ssl_sock = ssl.wrap_socket(s)
ssl_sock.connect(('www.sandbox.paypal.com', 443)) # Use this for sandbox testing
# ssl_sock.connect(('www.paypal.com', 443)) # Use this for production

ssl_sock.write(header + req)

data = ssl_sock.read()
VERIFIED = False
while len(data) > 0:
	if 'VERIFIED' in data:
		VERIFIED = True
		break
	elif 'INVALID' in data:
		VERIFIED = False
		break

	data = ssl_sock.read()

ssl_sock.close()

if not VERIFIED:
	print "Content-type: text/plain"
	print
	print "Not Verified"
	sys.exit(1)

fields = {	'item_name': None,
		'item_number': None,
		'payment_status': None,
		'mc_gross': None,
		'mc_currency': None,
		'txn_id': None,
		'receiver_email': None,
		'payer_email': None,
		'custom': None,
	}

for k in fields.keys():
	if k in form:
		fields[k] = form[k].value

item_name = fields['item_name']
item_number = fields['item_number']
payment_status = fields['payment_status']
payment_amount = fields['mc_gross']
payment_currency = fields['mc_currency']
txn_id = fields['txn_id']
receiver_email = fields['receiver_email']
payer_email = fields['payer_email']

# check the payment_status is Completed
# check that txn_id has not been previously processed
#  check that receiver_email is your Primary PayPal email
# check that payment_amount/payment_currency are correct
# process payment

print "Content-type: text/plain"
print
print "Verified"

Resources

  1. PayPal Developer Network
  2. GitHub projects related to PayPal written in Python

Ideas: Egont, A Web Orchestration Language

Inspiration

Human curiosity goes beyond limited web applications, recommendation systems, and search engines. People collect lists of things on the web. Things like music playlists, movie rankings, and visited places populate our web culture, but this information is spread across different places, and we need search engines, social networks, and recommendation systems to leverage it. The real-time web also offers transformation opportunities that are limited only by the imagination.

How can we adjust all this information to our personal or organizational needs? The semantic web could play an important role here, but the web is not yet organized semantically. However, it is possible today to give people tools to manipulate information at a personal and social level. Spreadsheets have hundreds of functions which are used by people with limited computer and mathematical skills. What if we could transform web information in a similar way? What if a new stimulus, like a new tweet or a newly ranked movie, could trigger a cascade of processes?

People and organizations are sharing a record amount of data, but current web platforms tightly dictate the limits of its use. For example, Twitter’s API imposes very low call rates on the general public; most Twitter applications cannot retrieve more than one or two degrees of a user’s social network without working around these limitations. Examples of API limitations abound, undermining opportunities to leverage the data’s potential.

The inspiration for Egont came from the idea of a social operating system. People do not only share data; they also share data transformations. Egont is a platform for writing simple code snippets while allowing others to reuse them to extract new information. It is a shared pipeline focused on connecting people’s data and processes. It can be thought of as a living operating system: when a piece of state changes, the dependent processes are recalculated. Although Egont has clear security controls, it is primarily oriented toward data that can be shared, and it even provides tools for exporting information to be analyzed offline. The shift is from a perspective where users accept a platform’s applications to one where users generate not only data but also processes. Users and third parties will be free to write new functions to extend Egont’s capabilities.

(continued in part II)

Ideas and Execution Magic Chart

Ideas vs Execution

There is an endless discussion in the startup community about the value of ideas versus the importance of execution. Here is a timeline showing Hacker News community submissions with the idea(s) keyword in the title:

I am no prophet, but I believe the future will most likely lean toward ideas, because the cost of creating and operating a web company has been dramatically reduced, and marketing and sales services will soon be more affordable, making the business puzzle easier to solve. On the other hand, following Joseph Schumpeter’s thinking, big companies have an advantage because they have more resources, yet they often prefer the acquisition route, after natural market selection, to building risky projects from scratch. Entrepreneurs therefore benefit from reduced competition in the initial phase of product development.

Magic Chart

This is an exercise in being objective about your own ideas and in dabbling in the black art of time estimation. The idea of the magic chart is to fill in a scatter plot: the x axis shows the time you expect the idea to take to execute (you can limit this to development time at first), and the y axis shows the idea’s potential. You can easily add other dimensions, such as cost, by varying the size or color of the plotted points. Finally, add a vertical asymptote at the longest execution time that is feasible for you.
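Under those definitions, the chart data is just (estimated time, potential) pairs plus the feasibility cutoff. Here is a minimal sketch that filters and ranks ideas instead of plotting them; the sample ideas and numbers are invented, and the actual chart would be a single matplotlib scatter call over the same tuples:

```python
# Each idea: (name, estimated_weeks, potential on a 1-10 scale); values are invented.
ideas = [('scraper SaaS', 6, 5), ('egont', 80, 9), ('ipn library', 2, 3)]
TIME_LIMIT_WEEKS = 12  # the vertical asymptote: the longest feasible execution time

feasible = [i for i in ideas if i[1] <= TIME_LIMIT_WEEKS]
# Rank feasible ideas by potential per week of execution time.
feasible.sort(key=lambda i: float(i[2]) / i[1], reverse=True)
for name, weeks, potential in feasible:
    print('{0}: {1:.2f} potential/week'.format(name, float(potential) / weeks))
```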

Here is my magic chart:

As you can see, it is difficult to come up with ideas that can be executed in a short time, and many ideas fall into uncertainty beyond some time point. If you think having a minimum viable product is key, then you must think very hard about how to reduce your product’s execution time, and this is more art than science. The need to generate profit is a serious constraint: your idea may be excellent and your software may be used by millions of people, yet you may still lack a business model.

What does your ideas and execution magic chart landscape look like?

HNSearch Script

Here is the Python script for retrieving Hacker News posts with the words idea or ideas in the title. It includes a legal hack (what else?) to bypass the 1000-item limit imposed by ThriftDB’s HNSearch API.

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Done under Visual Studio 2010 using the excellent Python Tools for Visual Studio http://pytools.codeplex.com/

import urllib2
import json
from datetime import datetime
from time import mktime
import csv
import codecs
import cStringIO

class CSVUnicodeWriter: # http://docs.python.org/library/csv.html
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

def get_hackernews_articles_with_idea_in_the_title():
    endpoint = 'http://api.thriftdb.com/api.hnsearch.com/items/_search?filter[fields][title]=idea&start={0}&limit={1}&sortby=map(ms(create_ts),{2},{3},4294967295000)%20asc'

    incomplete_iso_8601_format = '%Y-%m-%dT%H:%M:%SZ'

    items = {}
    start = 0
    limit = 100
    begin_range = 0
    end_range = 0

    url = endpoint.format(start, limit, begin_range, str(int(end_range)))
    response = urllib2.urlopen(url).read()
    data = json.loads(response)

    prev_timestamp = datetime.fromtimestamp(0)

    results = data['results']

    while results:
        for e in data['results']:
            _id = e['item']['id']
            title = e['item']['title']
            points = e['item']['points']
            num_comments = e['item']['num_comments']
            timestamp = datetime.strptime(e['item']['create_ts'], incomplete_iso_8601_format)

            #if timestamp < prev_timestamp: # The results are not correctly sorted. We can't rely on this one.
            if _id in items: # If the circle is complete.
                return items
            prev_timestamp = timestamp

            items[_id] = {'id':_id, 'title':title, 'points':points, 'num_comments':num_comments, 'timestamp':timestamp}
            title_utf8 = title.encode('utf-8')
            print title_utf8, timestamp, _id, points, num_comments

        start += len(results)
        if start + limit > 1000:
            start = 0
            end_range = mktime(timestamp.timetuple())*1000

        url = endpoint.format(start, limit, begin_range, str(int(end_range))) # if not str(int(x)) then a float gives in the sci math form: '1.24267528e+12'
        response = urllib2.urlopen(url).read()
        data = json.loads(response)
        results = data['results']

    return items

if __name__ == '__main__':
    items = get_hackernews_articles_with_idea_in_the_title()

    with open('hn-articles.csv', 'wb') as f:
        hn_articles = CSVUnicodeWriter(f)

        hn_articles.writerow(['ID', 'Timestamp', 'Title', 'Points', '# Comments'])

        for k,e in items.items():
            hn_articles.writerow([str(e['id']), str(e['timestamp']), e['title'], str(e['points']), str(e['num_comments'])])

# It returns 3706 articles where the query says that they are 3711... find the bug...
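The 1000-item workaround in the script generalizes: when an API caps `start + limit`, reset `start` to zero and move the sort window’s lower bound past the last item seen. Here is a self-contained sketch of that technique against a fake in-memory dataset; no real API is involved:

```python
def fetch(dataset, after_ts, start, limit, cap=10):
    """Fake API: items sorted by timestamp, with the offset capped at `cap`."""
    if start + limit > cap:
        raise ValueError('offset beyond API cap')
    window = [d for d in dataset if d['ts'] > after_ts]
    return window[start:start + limit]

def fetch_all(dataset, limit=3, cap=10):
    items, after_ts, start = [], -1, 0
    while True:
        page = fetch(dataset, after_ts, start, limit, cap)
        if not page:
            return items
        items.extend(page)
        start += limit
        if start + limit > cap:         # about to hit the cap:
            after_ts = items[-1]['ts']  # ...slide the time window forward
            start = 0                   # ...and reset the offset

data = [{'ts': i} for i in range(25)]
print(len(fetch_all(data)))  # all 25 items despite the offset cap of 10
```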

Resources

  1. Are Ideas Getting Harder to Find? (2016)
  2. Science as Art
  3. Thinking Skills Instruction: Concepts and Techniques (Anthology)
  4. De Bono’s Lateral Thinking
  5. TRIZ
  6. Schumpeter’s Creative Destruction: A Review of the Evidence
  7. Google Query: “ideas vs execution” OR “execution vs ideas”
  8. Google Query: site:news.ycombinator.com AND (intitle:idea OR intitle:ideas)
  9. Startup Ideas We’d Like to Fund
  10. My list of ideas, if you’re looking for inspiration by Jacques Mattheij
  11. Startup Ideas We’d Like to Fund by Paul Graham.
  12. Ideas don’t make you rich. The correct execution of ideas does excerpt from Felix Dennis book.
  13. Ideas suck by Chris Prescott.
  14. Execution Matters, Ideas Don’t by Fred Wilson.
  15. What Is Twitter’s Problem? No, It’s Not the Product
  16. 1000 results limit? (HNSearch NoAPI limits, bonus hack included in this article).
  17. Year 2038 problem
  18. How to use time > year 2038 on official Windows Python 2.5
  19. Solr FunctionQuery
  20. HackerNews Ideas Articles
  21. Execution Is An Order Of Magnitude Easier Than Opportunity