
Which Language is Better For Writing a Web Crawler? PHP, Python or Node.js?

I want to share a good article that might help you extract web data for your business more effectively.

Yesterday, I saw someone asking, “Which programming language is better for writing a web crawler: PHP, Python, or Node.js?” and listing the requirements below.

  1. Ability to parse and analyze web pages
  2. Ability to operate a database (MySQL; see the storage sketch below)
  3. Crawling efficiency
  4. Amount of code required
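
On requirement 2, here is a minimal sketch of how a Python crawler might write records into MySQL. It assumes a local MySQL server and the third-party pymysql driver; the database name, table, and credentials are hypothetical placeholders, not details from the article.

    # Minimal sketch: writing crawled records into MySQL.
    # Assumes a local MySQL server and the third-party pymysql driver
    # (pip install pymysql); the database, table, and credentials are
    # hypothetical placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="crawler",
                           password="secret", database="crawl_db")
    try:
        with conn.cursor() as cur:
            cur.execute(
                """CREATE TABLE IF NOT EXISTS pages (
                       url VARCHAR(255) PRIMARY KEY,
                       title VARCHAR(255),
                       fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                   )"""
            )
            # A parameterized query keeps page content from injecting SQL.
            cur.execute(
                "REPLACE INTO pages (url, title) VALUES (%s, %s)",
                ("https://example.com/", "Example Domain"),
            )
        conn.commit()
    finally:
        conn.close()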

 

Someone replied to this question.

 

“When you are going to crawl large-scale websites, efficiency, scalability, and maintainability are factors you must consider.

 

Crawling large-scale websites involves many problems: multi-threading, the I/O mechanism, distributed crawling, communication, duplication checking, task scheduling, and so on. The language used and the framework selected play a significant role here (a short sketch of two of these concerns follows below).
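
To make the multi-threading and duplication-checking points concrete, here is a hedged Python sketch using the standard library plus the third-party requests package. The URLs are placeholders, and a real crawler would replace the in-memory set with a bloom filter or a shared store.

    # Sketch: multi-threaded fetching with a naive duplication check.
    # Requires the third-party requests package (pip install requests).
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import requests

    urls = ["https://example.com/", "https://example.org/",
            "https://example.com/"]  # note the duplicate

    # Duplication check: skip URLs we have already scheduled.
    seen, todo = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            todo.append(u)

    def fetch(url):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return url, len(resp.text)

    # Thread pool: overlap the I/O waits of many slow HTTP requests.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch, u) for u in todo]
        for fut in as_completed(futures):
            url, size = fut.result()
            print("fetched %s: %d characters" % (url, size))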

 

PHP: Its support for multithreading and async is quite weak, so it is not recommended.

 

Node.js: It can crawl some vertical websites, but its support for distributed crawling and communication is weaker than the other two, so you need to make a judgment call.

Python: It's strongly recommended, with better support for the requirements mentioned above, especially through the Scrapy framework. Scrapy has many advantages (a minimal spider sketch follows this list):

  1. Supports XPath
  2. Good performance, based on Twisted
  3. Built-in debugging tools
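
As a minimal illustration of points 1 and 2, here is a sketch of a Scrapy spider that extracts fields with XPath while Twisted drives the requests asynchronously underneath. The target, quotes.toscrape.com, is a public scraping sandbox chosen for the example; it is not mentioned in the article.

    # Minimal Scrapy spider sketch (pip install scrapy).
    # Run with: scrapy runspider quotes_spider.py -o quotes.json
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # XPath selectors pull structured fields out of the page.
            for quote in response.xpath("//div[@class='quote']"):
                yield {
                    "text": quote.xpath(".//span[@class='text']/text()").get(),
                    "author": quote.xpath(".//small[@class='author']/text()").get(),
                }
            # Follow pagination; Scrapy schedules this asynchronously.
            next_page = response.xpath("//li[@class='next']/a/@href").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)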

If you want to perform dynamic analysis of JavaScript, CasperJS is not well suited to use under the Scrapy framework; it's better to build your own JavaScript engine based on the Chrome V8 engine.
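
The reply names no concrete library for this. As one hedged illustration, the third-party py_mini_racer package embeds the Chrome V8 engine in a Python process, which lets a crawler evaluate JavaScript it has scraped from a page:

    # Sketch: evaluating scraped JavaScript via an embedded V8 engine.
    # Uses the third-party py_mini_racer binding (pip install py-mini-racer);
    # the price() function stands in for logic pulled from a page's <script>.
    from py_mini_racer import MiniRacer

    ctx = MiniRacer()  # starts a V8 context
    ctx.eval("function price(base, tax) { return (base * (1 + tax)).toFixed(2); }")
    print(ctx.call("price", 100, 0.08))  # -> "108.00"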

C & C++: Not recommended. Although they have good performance, we still have to consider many factors such as cost. For most companies it is advisable to write the crawler on top of an open-source framework and make the best use of the excellent programs already available. It's easy to make a simple crawler, but hard to make an excellent one.

Truly, it's hard to make a perfect crawler. There are many web data extractors available, such as Mozenda and import.io. But if there were a free program that could meet your various needs, I think you'd be willing to give it a try. It might take you a whole day or even a week to become familiar and proficient with such software, but once you do, you no longer need to worry about website revisions or your IP being blocked, and you can use cloud servers to enjoy multi-node extraction.”

The article above is mainly reprinted from http://www.octoparse.com/blog/which-language-is-better-for-writing-...


Comment by Paul Black on June 7, 2016 at 12:45am

Cool.

Comment by jon allie on June 18, 2016 at 4:38am

Great info

Comment by Emily Houston on July 31, 2017 at 10:50am

Hi. My name is Emily and I represent Mozenda. Thank you, Nora, for suggesting our tool. We offer two services: do-it-yourself software, or professional services where we can do it all for you. Please feel free to reach out to me directly if you have any questions!

https://meetingbird.com/meet/emilyhouston
