Question

I need to extract some text patterns from the huge HTML.

I can definitely write a precise regular expression with the group that matches exactly what I need.

What tool would be appropriate to extract these groups and get me the list of them back?

So basically I'm looking for a powerful tool that allows you to play with the text/search/find/replace/extract using regular expressions. I assume this tool could also have different helpful functionality like sort/unique, etc.

I understand there are different Linux tools that could be used for that purpose, but I don't know of any combination of them that would let me easily do what I need.

Here is the example of the problem:

I have an HTML source with lots of different links in it. I need to extract the URLs from some of these links, but not all the URLs on the page. I then need to parse these URLs and process them further.

I would appreciate any suggestions on the best tool or tools to do this.

Explanation / Answer

I would suggest looking at Python and Scrapy.

Python's standard library includes many very powerful text-processing tools, including regular expressions, but Scrapy takes it a lot further.
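As a rough illustration of the standard-library route, here is a minimal sketch that pulls one capture group out of every match and returns a sorted, de-duplicated list. The sample HTML, the `class="product"` filter, and the names `LINK_RE` and `extract_urls` are all made up for the example; your own pattern would go in their place.

```python
import re

# Hypothetical sample page; in practice the HTML would come from a file or a download.
html = """
<a class="product" href="http://example.com/item/2">Two</a>
<a class="nav" href="http://example.com/about">About</a>
<a class="product" href="http://example.com/item/1">One</a>
<a class="product" href="http://example.com/item/2">Two again</a>
"""

# Group 1 captures the URL; only links written as class="product" href="..." match.
LINK_RE = re.compile(r'<a\s+class="product"\s+href="([^"]+)"')

def extract_urls(text):
    """Return the sorted, de-duplicated list of URLs captured by group 1."""
    # With exactly one group in the pattern, findall returns the group strings.
    return sorted(set(LINK_RE.findall(text)))

urls = extract_urls(html)
print(urls)  # ['http://example.com/item/1', 'http://example.com/item/2']
```

A pattern like this depends on the exact attribute order and quoting in the markup, which is one reason a real HTML parser tends to be more robust on large or messy pages.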

From the web site:

Built-in support for selecting and extracting data from HTML and XML sources
Built-in support for cleaning and sanitizing the scraped data using a collection of reusable filters (called Item Loaders) shared between all the spiders.
Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
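To give a feel for parser-based extraction without installing anything, here is a small sketch that uses Python's standard `html.parser` and `urllib.parse` modules instead of Scrapy's own selectors. It is not Scrapy's API, only the same idea in miniature; the class filter, the sample markup, and the names `LinkCollector` and `collect_links` are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that carry a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # An attribute without a value arrives as None, hence the "or".
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.wanted_class in classes:
            self.urls.append(href)

def collect_links(html_text, wanted_class):
    """Parse html_text and return the hrefs of matching links, in document order."""
    parser = LinkCollector(wanted_class)
    parser.feed(html_text)
    parser.close()
    return parser.urls

# Hypothetical markup: attribute order and quote style differ between the two links.
sample = (
    '<a href="/about" class="nav">About</a>'
    "<a class='product featured' href='http://example.com/item/1?ref=home'>One</a>"
)
links = collect_links(sample, "product")
hosts = [urlparse(u).netloc for u in links]  # further processing of each URL
print(links, hosts)
```

Unlike a regular expression, this does not care about attribute order or quote style, and the collected URLs can be handed straight to `urlparse` for further processing.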

Both are free and cross-platform, and you can test the code interactively, which saves a lot of time.