Sunday, 30 November 2014

Why scraping and why TheWebMiner?

If you read this blog you are one of two things: you are either interested in web scraping and you have studied this domain for quite a while, or you are just curious about this relatively new field of interest and want to know what it is, how it’s done and especially why. Either way it’s fine!

In case you haven’t googled already this I can tell you that data extraction (or scraping) is a technique in which a computer program extracts data from human-readable output coming from another program (wikipedia). Basically it can collect all the information on a certain subject from certain places. It’s sort of the equivalent of ctrl+f, at the scale of the whole internet. It’s nothing like the search engines that we currently use because it can extract the data in a certain file, as excel, csv (coma separated values) or any other that the buyer wants, and also extracts only the relevant data, only the values that you are interested in.

I hope now that you understand the concept and you are wondering just why would you need such data. Well let’s take the example of an online store, pretty common nowadays, and of course the manager just like any manager wants his business to thrive, so, for that he has to keep up with the other online stores. Now the web scraping takes place: it is very useful for him to have, saved as excels all the competitor’s prices of certain products if not all of them. By this he can maintain a fair pricing policy and always be ahead of his competitors by knowing all of their prices and fluctuations.  Of course the data collecting can also be done manually but this is not effective because we are talking of thousand of products each one having its own page and so on. This is only one example of situation in which scrapping is useful but there are hundreds and each one of them it’s profitable for the company.

By now I’ve talked about what it is and why you should be interested in it, from now on I’m going to explain why you should use thewebminer.com. First of all, it’s easy: you only have to specify what type of data you want and from where and we’ll manage the rest. Throughout the project you will receive first of all an approximation of price, followed by a time approximation. All the time you will be in contact with us so you can find out at any point what is the state of your project. The pricing policy is reasonable and depends on factors like the project size or complexity. For very big projects a discount may be applicable so the total cost be within reason.

Now I believe that thewebminer.com is able to manage with any kind of situation or requirement from users all over the world and to convince you, free samples are available at any project you may have or any uncertainty or doubt.

Source:http://thewebminer.com/blog/2013/07/

Thursday, 27 November 2014

Webscraping using readLines and RCurl

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets which are easy to access. But the majority of online data exists as web content such as blogs, news stories and cooking recipes. With precompiled files, accessing the data is fairly straightforward; just download the file, unzip if necessary, and import into R. For “wild” data however, getting the data into an analyzeable format is more difficult. Accessing online data of this sort is sometimes reffered to as “webscraping”. Two R facilities, readLines() from the base package and getURL() from the RCurl package make this task possible.

readLines

For basic webscraping tasks the readLines() function will usually suffice. readLines() allows simple access to webpage source data on non-secure servers. In its simplest form, readLines() takes a single argument – the URL of the web page to be read:

web_page <- readLines("http://www.interestingwebsite.com")

As an example of a (somewhat) practical use of webscraping, imagine a scenario in which we wanted to know the 10 most frequent posters to the R-help listserve for January 2009. Because the listserve is on a secure site (e.g. it has https:// rather than http:// in the URL) we can't easily access the live version with readLines(). So for this example, I've posted a local copy of the list archives on the this site.

One note, by itself readLines() can only acquire the data. You'll need to use grep(), gsub() or equivalents to parse the data and keep what you need.

# Get the page's source
web_page <- readLines("http://www.programmingr.com/jan09rlist.html")
# Pull out the appropriate line
author_lines <- web_page[grep("<I>", web_page)]
# Delete unwanted characters in the lines we pulled out
authors <- gsub("<I>", "", author_lines, fixed = TRUE)
# Present only the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]
[webscrape results]


We can see that Gabor Grothendieck was the most frequent poster to R-help in January 2009.

The RCurl package

To get more advanced http features such as POST capabilities and https access, you'll need to use the RCurl package. To do webscraping tasks with the RCurl package use the getURL() function. After the data has been acquired via getURL(), it needs to be restructured and parsed. The htmlTreeParse() function from the XML package is tailored for just this task. Using getURL() we can access a secure site so we can use the live site as an example this time.

# Install the RCurl package if necessary
install.packages("RCurl", dependencies = TRUE)
library("RCurl")
# Install the XML package if necessary
install.packages("XML", dependencies = TRUE)
library("XML")
# Get first quarter archives
jan09 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)
jan09_parsed <- htmlTreeParse(jan09)
# Continue on similar to above
...

For basic webscraping tasks readLines() will be enough and avoids over complicating the task. For more difficult procedures or for tasks requiring other http features getURL() or other functions from the RCurl package may be required. For more information on cURL visit the project page here.

Source: http://www.r-bloggers.com/webscraping-using-readlines-and-rcurl-2/

Wednesday, 26 November 2014

Screen scrapers: To program or to purchase?

Companies today use screen scraping tools for a variety of purposes, including collecting competitive information, capturing product specs, moving data between legacy and new systems, and keeping inventory or price lists accurate.

Because of their popularity and reputation as being extremely efficient tools for quickly gathering applicable display data, screen scraping tools or browser add-ons are a dime a dozen: some free, some low cost, and some part of a larger solution. Alternatively, you can build your own if you are (or know) a programming whiz. Each tool has its potential pros and cons, however, to keep in mind as you determine which type of tool would best fit your business need.

Program-your-own screen scraper

Pros:

    Using in-house resources doesn't require additional budget

Cons:

    Properly creating scripts to automate screen scraping can take a significant amount of time initially, and continues to take time in order to maintain the process. If, for instance, objects from which you're gathering data move on a web page, the entire process will either need to be re-automated, or someone with programming acumen will have to edit the script every time there is a change.

    It's questionable whether or not this method actually saves time and resources

Free or cheap scrapers

Pros:

    Here again, budget doesn't ever enter the picture, and you can drive the process yourself.

    Some tools take care of at least some of the programming heavy lifting required to screen scrape effectively

Cons:

    Many inexpensive screen scrapers require that you get up to speed on their programming language—a time-consuming process that negates the idea of efficiency that prompted the purchase.

Screen scraping as part of a full automation solution

Pros:

    In the amount of time it takes to perform one data extraction task, you have a completely composed script that the system writes for you

    It's the easiest to use out of all of the options

    Screen scraping is only part of the package; you can leverage automation software to automate nearly any task or process including tasks in Windows, Excel automation, IT processes like uploads, backups, and integrations, and business processes like invoice processing.

    You're likely to get buy-in for other automation projects (and visibility for the efficiency you're introducing to the organization) if you pick a solution with a clear and scalable business purpose, not simply a tool to accomplish a single task.

Cons:

    This option has the highest price tag because of its comprehensive capabilities.

Looking for more information?

Here are some options to dig deeper into screen scraping, and deciding on the right tool for you:

 Watch a couple demos of what screen scraping looks like with an automation solution driving the process.

 Read our web data extraction guide for a complete overview.

 Try screen scraping today by downloading a free trial.

Source: https://www.automationanywhere.com/screen-scrapers

Sunday, 23 November 2014

Data Mining Outsourcing in a Better and Unique Approach

Data mining outsourcing services are ideal for clarity in various decision making processes.  It is the ultimate goal of any organization and business to increase on its profits as well as strengthen the bond with its customers. Equipping the business in such a way that it’s very easy to detect frauds and manage risks in a convenient manner is equally important. Volumes of data that are irrelevant or cannot be used when raw needs to be converted to a more useful form.  The data mining outsourcing services can greatly help you to analyze and interpret data in a more diligent way.

This service to reliable, experienced and qualified hands is very important. Your research project or engineering project can be easily and conveniently handled by experienced staff who guarantees you an accuracy level of about 98% and a massive reduction in operating costs. The quality of work is unsurpassed and the presentation is done in a format that is easy and simple for you. The project is done in a very short time alleviating you delays as well as ensuring on-time completion of your projects. To enjoy a successful outsourcing experience, you need to bank on a famous and reliable expertise.

The only time to rely with data mining outsourcing services is when you do not have a reliable, experienced expertise in your business.  Statistics indicate that it’s very easy to lose business intelligence or expose the privacy of the customers through this process. However companies which offer secure outsourcing process are on the increase as a result of massive competition. It’s an opportunity to develop your potential of sourced data and improve your business in all fields. 

Data mining potential applications are infinite. However major applications are in the marketing research and scientific projects. It’s done both on large and small quantities of data by experienced staff well known for their best analytical procedures to guarantee you accurate and easy to use information. Data mining outsourcing services are the only perfect way to profitability.

Source:http://www.e-edge.biz/Data_Mining_Outsourcing_in_a_Better_and_Unique_Approach.html

Wednesday, 19 November 2014

Online Data Entry & Web Scraping Services

To operate any type of organization smoothly, it is essential to have precise data that is accurate and reliable. When your business expands, data entry on an ongoing basis is a tedious job. It’s a very time consuming task that can often distract employees focusing on core business areas.

Webpop offers all forms of online data entry services that are quick and accurate. We provide data entry services across all verticals that can be completely customized to your business requirements.

Database Population Services

Database population involves content collection from various database sources. This requires a lot of attention to detail, dedication and awareness and can prove a formidable task, especially for websites that largeley depend on it.

Webpop offer a quick and efficient database population service that helps relieve the stress from an extremely laborius task and leaves you more time to focus on more important aspects of your business. By investing just a fraction of the cost, you can outsource your database population tasks to us.

Web Scraping Services

Webpop have been assisting clients in searching, extracting and collecting data from the web for the past 5 years using the latest techniques in web scraping techology. We can scrape all types of information from a variety of sources such as websites, blogs, online directories, e-commerce websites and podcasts to name a few. We use a varied selection of automated and manual web scraping technologies to extract, gather and collect all of the required data you require from any chosen website(s) on the World Wide Web.

We can simplify the whole process from collection to population, converting your scraped data in to structured formats that are applicable to your website. This can be offered as a one time service or an ongoing basis that will assist you in constantly keeping your website’s content fresh and up to date. We can crawl competitors websites, gather sales leads, product details, pricing methodologies and also creat custom campaigns to suit your project’s requirements.

Over the years Webpop has grown from strength-to-strength by providing all types of data entry, database population and web scraping services. All of our data entry services are performed with care, due dilligence and attention to detail. We enjoy a challenge and pride ourselves on delivering results whilst working on precarious projects that require precision and total commitment.

Source:http://www.webpopdesign.com/services/data-entry/

Monday, 17 November 2014

Kimono Is A Smarter Web Scraper That Lets You “API-ify” The Web, No Code Required

A new Y Combinator-backed startup called Kimono wants to make it easier to access data from the unstructured web with a point-and-click tool that can extract information from webpages that don’t have an API available. And for non-developers, Kimono plans to eventually allow anyone track data without needing to understand APIs at all.

This sort of smarter “web scraper” idea has been tried before, and has always struggled to find more than a niche audience. Previous attempts with similar services like Dapper or Needlebase, for example, folded. Yahoo Pipes still chugs along, but it’s fair to say that the service has long since been a priority for its parent company.

But Kimono’s founders believe that the issue at hand is largely timing.

“Companies more and more are realizing there’s a lot of value in opening up some of their data sets via APIs to allow developers to build these ecosystems of interesting apps and visualizations that people will share and drive up awareness of the company,” says Kimono co-founder Pratap Ranade. (He also delves into this subject deeper in a Forbes piece here). But often, companies don’t know how to begin in terms of what data to open up, or how. Kimono could inform them.

Plus, adds Ranade, Kimono is materially different from earlier efforts like Dapper or Needlebase, because it’s outputting to APIs and is starting off by focusing on the developer user base, with an expansion to non-technical users planned for the future. (Meanwhile, older competitors were often the other way around).

The company itself is only a month old, and was built by former Columbia grad school companions Ranade and Ryan Rowe. Both left grad school to work elsewhere, with Rowe off to Frog Design and Ranade at McKinsey. But over the nearly half-dozen or so years they continued their careers paths separately, the two stayed in touch and worked on various small projects together.

One of those was Airpapa.com, a website that told you which movies were showing on your flights. This ended up giving them the idea for Kimono, as it turned out. To get the data they needed for the site, they had to scrape data from several publicly available websites.

“The whole process of cleaning that [data] up, extracting it on a schedule…it was kind of a painful process,” explains Rowe. “We spent most of our time doing that, and very little time building the website itself,” he says. At the same time, while Rowe was at Frog, he realized that the company had a lot of non-technical designers who needed access to data to make interesting design decisions, but who weren’t equipped to go out and get the data for themselves.

With Kimono, the end goal is to simplify data extraction so that anyone can manage it. After signing up, you install a bookmarklet in your browser, which, when clicked, puts the website into a special state that allows you to point to the items you want to track. For example, if you were trying to track movie times, you might click on the movie titles and showtimes. Then Kimono’s learning algorithm will build a data model involving the items you’ve selected.

Screen Shot 2014-02-18 at 4.29.05 PM

Screen Shot 2014-02-18 at 4.29.27 PM

That data can be tracked in real time and extracted in a variety of ways, including to Excel as a .CSV file, to RSS in the form of email alerts, or for developers as a RESTful API that returns JSON. Kimono also offers “Kimonoblocks,” which lets you drop the data as an embed on a webpage, and it offers a simple mobile app builder, which lets you turn the data into a mobile web application.

Screen Shot 2014-02-18 at 4.29.50 PM

For developer users, the company is currently working on an API editor, which would allow you to combine multiple APIs into one.

So far, the team says, they’ve been “very pleasantly surprised” by the number of sign-ups, which have reached ten thousand*. And even though only a month old, they’ve seen active users in the thousands.

Initially, they’ve found traction with hardware hackers who have done fun things like making an airhorn blow every time someone funds their Kickstarter campaign, for instance, as well as with those who have used Kimono for visualization purposes, or monitoring the exchange rates of various cryptocurrencies like Bitcoin and dogecoin. Others still are monitoring data that’s later spit back out as a Twitter bot.

Kimono APIs are now making over 100,000 calls every week, and usage is growing by over 50 percent per week. The company also put out an unofficial “Sochi Olympics API” to showcase what the platform can do.

The current business model is freemium based, with pricing that kicks in for higher-frequency usage at scale.

The Mountain View-based company is a team of just the two founders for now, and has initial investment from YC, YC VC and SV Angel.

Source:http://techcrunch.com/2014/02/18/kimono-is-a-smarter-web-scraper-that-lets-you-api-ify-the-web-no-code-required/

Thursday, 13 November 2014

Future of Web Scraping

The Internet is large, complex and ever-evolving. Nearly 90% of all the data in the world has been generated over the last two years. In this vast ocean of data, how does one get to the relevant piece of information? This is where web scraping takes over.

Web scrapers attach themselves, like a leech, to this beast and ride the waves by extracting information form websites at will. Granted “scraping” doesn’t have a lot of positive connotations, yet it happens to be the only way to access data or content from a web site without RSS or an open API.

Future of Web Scraping

Web scraping faces testing times ahead. We outline why there may be some serious challenges to its future.

With rise in data, redundancies in web scraping are rising. No more is web scraping a domain of the coders; in fact, companies now offer customized scraping tools to clients which they can use to get the data they want. The outcome of everyone equipped to crawl, scrape, and extract, is unnecessary waste of precious man-power. Collaborative scraping could well heal this hurt. Here, where one web crawler does a broad scraping, the others scrape data off an API. An extension of the problem is that text retrieval attracts more attention than multimedia; and with websites becoming more complex, this enforces limited scraping capacity.

Easily, the biggest challenge to web scraping technology is Privacy concerns. With data freely available (most of it voluntary, much of it involuntary), the call for stricter legislation rings loudest. Unintended users can easily target a company and take advantage of the business using web scraping. The disdain with which “do not scrape” policies are treated and terms of usage violated, tells us that even legal restrictions are not enough. This begs to ask an age-old question: is scraping legal?

Is Crawling Legal? from PromptCloud

The flipside to this argument is that if technological barriers replace legal clauses, then web scraping will see a steady, and sure, decline. This is a distinct possibility since the only way scraping activity thrives is on the grid, and if the very means are taken away and programs no longer have access to website information, then web scraping by itself will be wiped out.

Building the Future

On the same thought is the growing trend of accepting “open data”. The open data policy, while long mused hasn’t been used at the scale it should be. The old way was to believe that closed data is the edge over competitors. But that mindset is changing. Increasingly, websites are beginning to offer APIs and embracing open data. But what’s the advantage of doing so?

Selling APIs not only brings in the money, but also is useful in driving back traffic to the sites! APIs are also a more controlled, cleaner way of turning sites into services. Steadily many successful sites like Twitter, LinkedIn etc. are offering access to their APIs with paid services and actively blocking scraper and bots.

Yet, beyond these obvious challenges, there’s a glimmer of hope for web scraping. And this is based on a singular factor: the growing need for data!

With Internet & web technology spreading, massive amounts of data will be accessible on the web. Particularly with increased adoption of mobile internet. According to one report, by 2020, the number of mobile internet users will hit 3.8 billion, or around half of the world’s population!

Since ‘big data’ can be both, structured & unstructured; web scraping tools will only get sharper and incisive. There is fierce competition between those who provide web scraping solutions. With the rise of open source languages like Python, R & Ruby, Customized scraping tools will only flourish bringing in a new wave of data collection and aggregation methods.

Source: https://www.promptcloud.com/blog/Future-of-Web-Scraping

Wednesday, 12 November 2014

3 Reasons to Up Your Web Scraping Game

If you aren’t using a machine-learning-driven intelligent Web scraping solution yet, here are three reasons why you might want to abandon that entry-level Web-scraping software or cut your high-cost script-writing approach.

    You need to keep an eye on a large number of web sources that get updated frequently.
    Understanding what’s changed is at least as critical as the data itself.
    You don’t want maintenance and scheduling to drag you down.

Here’s what an intelligent Web-scraping solution can deliver – and why:

1. Better data monitoring of an ever-shifting Web

If you need to keep a watch over hundreds, thousands or even tens of thousands of sites, an intelligent Web scraper is a must, because:

    It can scale – easily adding new websites, coordinating extraction routines, and automating the normalization of data across different websites.

    It can navigate and extract data from websites efficiently. Script-based approaches typically only can view a Web page in isolation, making it difficult to optimize navigation across unique pages of a targeted site. More intelligent approaches can be trained to bypass unnecessary links and leave a lighter footprint on the sites you need to access. And, they can monitor millions of precise Web data points quickly. This means you can monitor more pages on more sites with more frequent updates.

2. Critical alerts to Web data changes

A key sales executive suddenly drops off of the management page of your main competitor. That can mean big shakeup in the entire organization, which your sales team can jump on.

An intelligent Web scraper can alert you to this personnel shift because it can be set to monitor for just the changes; less powerful technologies or script-based approaches can’t. Whether you’re tracking price shifts, people moves, or product changes (or more) intelligent Web scraping delivers more profound insights.

3. Maintenance may become your biggest nightmare

You’ve purchased an entry-level tool and built out scrapers for a few hundred sites.  At first, everything seems fine. But, within weeks you begin to notice that your data is incomplete and not being updated as you’d expected. Why did your data deliveries disappear?

Reality is that these low-cost tools are simply not designed for mission-critical business applications – on the surface they look helpful and easy to use, but underneath the surface they are script-based and highly dependent upon the HTML of a website. But websites change, and entry-level web scraping tools are simply not engineered to adapt to those changes.

And, most of these tools are simply not designed for enterprise use. They have limited reporting, if any, so the only way to know whether they’re successfully completing their tasks is by finding gaps in the data – often when it’s too late.

An intelligent web scraping approach doesn’t rely upon the HTML of a web page. It uses machine learning algorithms which view the web the same way a user might. A typical reader doesn’t get confused when a font or color is changed on a website, and neither do these algorithms. But simple approaches to web scraping are highly dependent on the specific HTML to help it understand the content of a page. So, when websites have design changes (on average once every 18 months), the software fails.

While entry-level web scraping software can be an easy solution for simple, one-time web scraping projects, the scripts they generate are fragile and the resources required for tracking and maintenance can become overwhelming when you need to regularly extract data from multiple sites.

Case in point: Shopzilla assimilates data five times faster than outsourced Web scrapers

To demonstrate the power of intelligent Web scraping, here’s a real-life example from Shopzilla.  Shopzilla manages a premier portfolio of online shopping brands in the United States and Europe, connecting more than 40 million shoppers each month with millions of products from retailers worldwide. With the explosive growth of retail data on the Web, Shopzilla’s outsourced, custom-built approach, based on scripting, could not add the product lines of new retailers to its site in a timely fashion. It was taking up to two weeks to write the scripts needed to make a single site accessible.

By deploying Connotate’s intelligent web scraping platform on site, Shopzilla gained the ability to harness Web data’s rapid growth and keep up to date. Today, new sources are added in days, not weeks.  The platform continually monitors Web content from thousands of sites, delivering high volumes of data every day in a structured format. The result: 500 percent more data from new retailers. An added bonus: the company has reduced IT maintenance costs and its dependence on outsourced development timetables. Case in point: Deep competitor intelligence in two languages

A global manufacturer needed to monitor competitors’ technology improvements in a field where market leadership hinges on an ability to quickly leverage these advances. That meant accessing scholarly journals and niche sites in multiple languages. Using the Connotate solution, it was able to access highly-targeted, keyword-driven university and industry research journals and blogs in German and English that are hard to reach because they do not support RSS feeds. Our solution also incorporated semantic analysis to tag and categorize data and help identify new technologies and products not currently in the keyword list. The firm enhanced its competitive edge with the up-to-the-minute, precise data it needed.

Is your Web scraping intelligent enough?

See what intelligent agents through an automated Web data extraction and monitoring solution can bring to your business. Contact us and speak with one of experts.

Source:http://www.connotate.com/3-reasons-web-scraping-game-6579#.VGMjH2f4EuQ

Friday, 7 November 2014

Web Scraping the Solution to Data Harvesting

The internet is the number one information provider in the world and it is of course the largest in the same course. Web scraping is meant to extract and harvest useful information from the internet. It can be regarded as a multidisciplinary process that involves statistics, databases, data harvesting and data retrieval.

There has been noted a rapid expansion of the web and therefore causing an enormous growth of information. This has led to increased difficulty in the extraction of useful and potential information. Web scraping therefore confronts this problem by harvesting explicit information from a number of websites for knowledge discovery and easy access. It is important to realize that query interfaces of web databases are prone to sharing of same building blocks. It is therefore important to realize that the web offers unprecedented challenge and opportunity to data harvesting.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-solution-data-harvesting/

Wednesday, 5 November 2014

Application of Web Data Mining in CRM

The process of improvising the customer relations and interactions and making them more amicable may be termed as Customer relationship management (CRM). Since web data mining is used in the utilization of the various modeling and data analysis methods in detecting given patterns and relationships in the data, it can be used as an effective tool in CRM. By the effectively using web data mining you are able to understand what your customers what.

It is important to note that web data mining can be used effectively in searching for the right and potential customers to be offered the right products at the right time. The result of this in any business is the increase in the revenue generated. This is made possible as you are able to respond to each customer in an effective and efficient way. The method further utilizes very few resources and can be therefore termed as an economical method.

In the next paragraphs we discuss the basic process of customer relationship management and its integration with web data mining service. The following are the basic process that should be used in understanding what your customers need, sending them the right offers and products, and reducing the resources used in managing your customers.

Defining the business objective. Web data mining can be used to define and inform your customers your business objective. By doing research you can be able to determine whether your business objective is communicated well to your customers and clients. Does your business objective take interest in the customers? Your business goal must be clearly outlined in your business CRM. By having a more precise and defined goal is the possible way of ensuring success in the customer relationship management.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/application-web-data-mining-crm/