A Data Science Central Community
I want to share an interesting article about data scraping that you might need in your business. The article below is mainly reprinted from here.
Text in an HTML document is the content placed between HTML tags such as <a> </a> and <title> </title>. Sometimes we want to extract that text, and there are two methods that can help us fetch it from HTML files.
1. Programming language
For simple HTML documents, people who have basic coding knowledge can write a program that removes all HTML tags and keeps only the text, using regular expressions or XPath. Many widely used programming languages are suitable, including C#, Java, Python, JavaScript (in the browser or on Node.js), PHP and Go, so you can pick whichever fits your project. Several of these languages have free HTML parsers available online; for a comparison of these parsers, see https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.
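As a rough illustration of the "remove the tags, keep the text" idea, here is a minimal Python sketch that uses only the standard library. The names TextExtractor, extract_text, strip_tags_regex and SAMPLE are made up for this example and do not come from the original article. It shows a parser-based version next to a cruder regular-expression version, assuming the page is small enough to hold in memory as one string.

```python
import re
from html import unescape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Parser-based extraction: returns the visible text as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def strip_tags_regex(html):
    """Cruder alternative: delete anything that looks like a tag."""
    no_tags = re.sub(r"<[^>]+>", " ", html)
    return " ".join(unescape(no_tags).split())


# Hypothetical sample page used only for this demonstration.
SAMPLE = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><p>Hello &amp; <a href='#'>welcome</a></p></body></html>"
)

if __name__ == "__main__":
    print(extract_text(SAMPLE))      # parser version drops the CSS
    print(strip_tags_regex(SAMPLE))  # regex version keeps the CSS text
```

The regular-expression version is shorter, but it cannot tell page text apart from the contents of a style or script block, which is one reason a real HTML parser is usually the safer choice.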
It is worth mentioning that the code you write usually works for only one type of web page, which means different types of web pages need different code. Besides, you need to test your program after you have written it, and both writing and testing take longer for people who have no coding experience.
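To see why extraction code tends to be tied to one page layout, consider this small Python sketch. The two pages, the extract_price function and the XPath expression are all invented for the example. It uses the limited XPath support in the standard library's xml.etree.ElementTree, which assumes the markup is well-formed XML; real-world HTML often is not, and would need a dedicated HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages that show the same value with different markup.
PAGE_A = "<html><body><div class='price'>19.99</div></body></html>"
PAGE_B = "<html><body><span id='cost'>19.99</span></body></html>"


def extract_price(xhtml):
    """Return the text of the first <div class='price'>, or None if absent."""
    root = ET.fromstring(xhtml)
    node = root.find(".//div[@class='price']")  # rule written for PAGE_A only
    return node.text if node is not None else None


if __name__ == "__main__":
    print(extract_price(PAGE_A))  # the rule matches this layout
    print(extract_price(PAGE_B))  # same data, different layout: no match
```

The rule that works on the first page finds nothing on the second, even though both show the same number, so a second rule would have to be written and tested for the second layout.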
2. Web data extraction tools
There are many powerful web extraction tools, such as import.io, Mozenda and Octoparse, that can harvest almost everything on a web page, including text, links and images. You can then convert what you collect into a structured data format.
You don’t need to write any code, so these tools are especially good for those who have no coding experience. In most cases, you don’t need to write regular expressions or XPath either. The visual interface lets users interact with the web page more directly, and it’s easy to check and export the data without any IDE.
As for web data extraction tools, I'd like to share with you the answers from Carl Wang on Quora.com. Here is the link: What are some good free web scrapers.
For non-programmers, no programming skills are required. Some tools need you to have some basic knowledge of HTML and XPath, but only the very basics. I share some tools down below. They are nice. I think these are very popular tools right now, 'cuz when I google things about web scrapers, these tools come up near the top of the results.
I've tried some of these, but I prefer Octoparse. Personally I think it's pretty easy to use and its interface is clear. After failing to build a crawler by coding, I chose to use web scrapers that already exist. Octoparse automatically generates XPath so that I don’t have to write it myself. It can scrape data from websites that have a structured layout, and it has cloud servers as well.
Import.io is a very useful tool and very easy to use. Import.io can operate in “Magic” mode, where you point it at a URL and it slices and dices the content to produce a table automatically. The "Magic API" page also provides options for re-running the query and downloading the results in JSON or tab-separated values format.
Octoparse can extract both structured and unstructured text data from web pages, and it works on any website that it can access. You configure rules to tell Octoparse what to extract and how, both in depth and in breadth. Octoparse can grab any data that is made up of text strings; images, text files, video and audio are not supported.
With UiPath you can automate form filling, button clicks, navigation and so on. The UiPath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity. Cons: Premium software runs at a premium price. UiPath is an affordable professional solution, but may be a bit too pricey for personal use.
Kimono works pretty fast and is great for scraping newsfeeds and prices. The data is rather accurate. However, no page navigation is available, and you need to spend quite a lot of time training Kimono before it pulls out multi-item data accurately enough. In general, I’d say Kimono is more of an app mash-up creator than a full-scale web scraper. It’s going to be a little bit harder to master this tool. The Kimono team is joining Palantir.
5. Screen scraper
It is pretty neat and tackles a lot of difficult tasks, including navigation and precise data extraction; however, it requires a bit of programming/tokenization skill if you’d like it to run super smoothly. The tool is pricey, and you’ll have to go through the documentation and have basic coding skills to use it.
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
I hope that you’ll find something worth your time, but I also encourage you to share the tools that you use for web scraping; I’d love to try them out myself.