IT Anawer: Get html file Java

Duplicate:

How do you Programmatically Download a Webpage in Java?

How to fetch html in Java

I'm developing an application in which the user enters the URL of a website, and the application then has to analyze that page.

How can I get access to the HTML file using Java? Do I need to use HttpRequest? How does that work?

Thanks.

From stackoverflow

You could just use a URLConnection. See this Java Tutorial from Sun.
You can use java.net.URL and then open an input stream to read the HTML from the server. See the example here.
URLConnection is fine for simple cases. When things like redirects are involved, you are better off using Apache's HttpClient.
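
As a rough illustration of the java.net.URL approach mentioned above, the sketch below opens a stream on a URL and reads the whole response into a String. The class name PageReader, the method readAll, and the choice of UTF-8 in main are assumptions made for this example; a real page may declare a different encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
final class PageReader {

  // Reads everything the URL serves and decodes it with the given charset.
  static String readAll(URL url, Charset charset) throws IOException {
    StringBuilder html = new StringBuilder();
    // try-with-resources closes the reader (and the underlying stream).
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(url.openStream(), charset))) {
      char[] buffer = new char[4096];
      int count;
      while ((count = reader.read(buffer)) != -1) {
        html.append(buffer, 0, count);
      }
    }
    return html.toString();
  }

  public static void main(String[] args) throws IOException {
    // UTF-8 is an assumption here; the server may use another encoding.
    URL url = new URL(args.length > 0 ? args[0] : "http://stackoverflow.com");
    System.out.println(readAll(url, StandardCharsets.UTF_8));
  }
}
```

Because readAll takes a URL, it also works with file: URLs, which is handy for trying it out without a network connection.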

This code downloads data from a URL, treating it as binary content:

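import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class Download {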

  private static void download(URL input, File output)
      throws IOException {
    InputStream in = input.openStream();
    try {
      OutputStream out = new FileOutputStream(output);
      try {
        copy(in, out);
      } finally {
        out.close();
      }
    } finally {
      in.close();
    }
  }

  private static void copy(InputStream in, OutputStream out)
      throws IOException {
    byte[] buffer = new byte[1024];
    while (true) {
      int readCount = in.read(buffer);
      if (readCount == -1) {
        break;
      }
      out.write(buffer, 0, readCount);
    }
  }

  public static void main(String[] args) {
    try {
      URL url = new URL("http://stackoverflow.com");
      File file = new File("data");
      download(url, file);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

}

The downside of this approach is that it ignores any metadata, such as the Content-Type header, which you would get by using HttpURLConnection (or a more sophisticated API, like the Apache one).
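
To make the point about metadata concrete, here is a minimal sketch that uses HttpURLConnection to look at the status code and Content-Type, and derives a charset from that header. The class name ContentTypeInfo and the helper charsetOf are hypothetical names for this example, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
final class ContentTypeInfo {

  // Extracts the charset parameter from a Content-Type value such as
  // "text/html; charset=ISO-8859-1". Returns the fallback when the header
  // is missing, has no charset parameter, or names an unknown charset.
  static Charset charsetOf(String contentType, Charset fallback) {
    if (contentType == null) {
      return fallback;
    }
    for (String part : contentType.split(";")) {
      String p = part.trim();
      if (p.regionMatches(true, 0, "charset=", 0, 8)) {
        String name = p.substring(8).trim().replace("\"", "");
        try {
          return Charset.forName(name);
        } catch (IllegalArgumentException e) {
          // Covers both illegal and unsupported charset names.
          return fallback;
        }
      }
    }
    return fallback;
  }

  public static void main(String[] args) throws IOException {
    URL url = new URL("http://stackoverflow.com");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
      System.out.println("Status: " + conn.getResponseCode());
      System.out.println("Content-Type: " + conn.getContentType());
      // Falling back to UTF-8 is an assumption, not something the server promised.
      System.out.println("Charset: "
          + charsetOf(conn.getContentType(), StandardCharsets.UTF_8));
    } finally {
      conn.disconnect();
    }
  }
}
```

The charset found this way could then be used to decode the response body as text instead of treating it as raw bytes.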

To parse the HTML data, you'll need either a specialized HTML parser that can handle poorly formed markup, or a tool to tidy the markup first so that it can be parsed with an XML parser.
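
As one possible illustration of the "specialized HTML parser" route, the sketch below uses the lenient HTML parser that ships with the JDK's Swing packages to collect link targets from a page. The class name LinkExtractor and the method extractLinks are hypothetical. The bundled parser is old, so this is only a starting point under that assumption; a dedicated third-party HTML parser may cope better with modern markup.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for this example only.
final class LinkExtractor {

  // Returns the href value of every <a> start tag, in document order.
  static List<String> extractLinks(Reader html) throws IOException {
    final List<String> links = new ArrayList<>();
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      @Override
      public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
          Object href = attrs.getAttribute(HTML.Attribute.HREF);
          if (href != null) {
            links.add(href.toString());
          }
        }
      }
    };
    // true = ignore charset declarations in the markup; the Reader has
    // already decoded the bytes into characters.
    new ParserDelegator().parse(html, callback, true);
    return links;
  }

  public static void main(String[] args) throws IOException {
    // The <p> elements are intentionally left unclosed.
    String page = "<html><body><p>Intro <a href=\"/first\">one</a>"
        + "<p>More <a href=\"/second\">two</a></body></html>";
    System.out.println(extractLinks(new StringReader(page)));
  }
}
```

The callback style means the page never has to be a well-formed tree; the parser reports tags as it encounters them.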

Funnily enough, I wrote a utility method that does just that the other week:

/**
 * Retrieves the file specified by <code>fileURL</code> and writes it to
 * <code>out</code>.
 * <p>
 * Does not close <code>out</code>, but does flush.
 * @param fileURL The URL of the file.
 * @param out An output stream to capture the contents of the file
 * @param batchWriteSize The number of bytes to write to <code>out</code>
 *                       at once (larger files than this will be written
 *                       in several batches)
 * @throws IOException If call to web server fails
 * @throws FileNotFoundException If the call to the web server does not
 *                               return status code 200. 
 */
public static void getFileStream(String fileURL, OutputStream out, int batchWriteSize)
       throws IOException{
 GetMethod get = new GetMethod(fileURL);
 HttpClient client = new HttpClient();
 HttpClientParams params = client.getParams();
 params.setSoTimeout(2000);
 client.setParams(params);
 try {
  try {
   client.executeMethod(get);
  } catch(ConnectException e){
   // Add some context to the exception and rethrow
   throw new IOException("ConnectException trying to GET " +
     fileURL,e);
  }

  if(get.getStatusCode()!=200){
   throw new FileNotFoundException(
     "Server returned " + get.getStatusCode());
  }

  // Get the input stream
  BufferedInputStream bis =
   new BufferedInputStream(get.getResponseBodyAsStream());

  // Read the file and stream it out
  byte[] b = new byte[batchWriteSize];
  int bytesRead = bis.read(b,0,batchWriteSize);
  while(bytesRead!=-1) {
   out.write(b, 0, bytesRead);
   bytesRead = bis.read(b,0,batchWriteSize);
  }
  bis.close(); // Release the input stream.
  out.flush();
 } finally {
  // Always hand the connection back, even if an exception was thrown
  get.releaseConnection();
 }
}

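This uses the Apache Commons HttpClient library, along with a few standard Java classes, i.e.

import java.io.BufferedInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;
import java.net.ConnectException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpClientParams;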

Sunday, April 17, 2011
