I was wondering if some of you could give me suggestions on how to extract the hyperlinks from the file most efficiently. I was thinking of using regular expressions. Are there any better ways of doing it?
One thing to note: if the HTML contains a script or a comment, a simple regex may give you messed-up results, because it will also match tags that are not real links.
E.g.:
...
<script>
/*
for some reason there's <a> thing in here
*/
</script>
...
<a href="..." > the <a> in the script will end here: </a>
...
<!--
this won't be rendered but you'll get it anyway:
<a>blah blah</a>
-->
...
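
One way around this is to use a real HTML parser instead of a regex. Below is a minimal sketch in Python using the standard library's `html.parser`. The names `LinkExtractor`, `extract_links` and `SAMPLE`, as well as the `href` values in the sample, are made up for illustration. It relies on `HTMLParser` passing the contents of `<script>` to `handle_data` as plain text and comments to `handle_comment`, so neither ever reaches `handle_starttag`.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> start tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href of every real <a> tag in html_text, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample modelled on the example above.
SAMPLE = """
<script>
/*
for some reason there's <a href="in-script.html"> thing in here
*/
</script>
<a href="real.html">a real link</a>
<!--
this won't be rendered:
<a href="in-comment.html">blah blah</a>
-->
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

With this sample, only the link outside the script and the comment is collected. Badly broken markup may still need a more tolerant parser, so treat this as a starting point rather than a complete solution.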