Sunday, April 17, 2011

Get html file Java

Duplicate:

How do you Programmatically Download a Webpage in Java?

How to fetch html in Java

I'm developing an application in which the user enters the URL of a website, and the application then has to analyze the page at that URL.

How can I get access to the HTML file using Java? Do I need to use HttpRequest? How does that work?

Thanks.

From stackoverflow
  • You could just use a URLConnection. See this Java Tutorial from Sun

  • You can use java.net.URL and then open an input stream to read the HTML from the server. See the example here.

  • URLConnection is fine for simple cases. When things like redirects are involved, you are better off using Apache's HttpClient.

  • This code downloads data from a URL, treating it as binary content:

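    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;
    
    public class Download {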
    
      private static void download(URL input, File output)
          throws IOException {
        InputStream in = input.openStream();
        try {
          OutputStream out = new FileOutputStream(output);
          try {
            copy(in, out);
          } finally {
            out.close();
          }
        } finally {
          in.close();
        }
      }
    
      private static void copy(InputStream in, OutputStream out)
          throws IOException {
        byte[] buffer = new byte[1024];
        while (true) {
          int readCount = in.read(buffer);
          if (readCount == -1) {
            break;
          }
          out.write(buffer, 0, readCount);
        }
      }
    
      public static void main(String[] args) {
        try {
          URL url = new URL("http://stackoverflow.com");
          File file = new File("data");
          download(url, file);
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    
    }
    

    The downside of this approach is that it ignores any metadata, such as the Content-Type, which you would get from using HttpURLConnection (or a more sophisticated API, like the Apache one).

    To parse the HTML data, you'll need either a specialized HTML parser that can handle poorly formed markup, or to tidy the markup first and then parse it with an XML parser.

  • Funnily enough, I wrote a utility method that does just that the other week:

    /**
     * Retrieves the file specified by <code>fileURL</code> and writes it to
     * <code>out</code>.
     * <p>
     * Does not close <code>out</code>, but does flush.
     * @param fileURL The URL of the file.
     * @param out An output stream to capture the contents of the file
     * @param batchWriteSize The number of bytes to write to <code>out</code>
     *                       at once (larger files than this will be written
     *                       in several batches)
     * @throws IOException If call to web server fails
     * @throws FileNotFoundException If the call to the web server does not
     *                               return status code 200. 
     */
    public static void getFileStream(String fileURL, OutputStream out, int batchWriteSize)
           throws IOException {
     GetMethod get = new GetMethod(fileURL);
     HttpClient client = new HttpClient();
     HttpClientParams params = client.getParams();
     params.setSoTimeout(2000); // Socket read timeout, in milliseconds
     client.setParams(params);
     try {
      try {
       client.executeMethod(get);
      } catch (ConnectException e) {
       // Add some context to the exception and rethrow
       throw new IOException("ConnectException trying to GET " + fileURL, e);
      }
    
      if (get.getStatusCode() != 200) {
       throw new FileNotFoundException(
         "Server returned " + get.getStatusCode());
      }
    
      // Get the input stream
      BufferedInputStream bis =
        new BufferedInputStream(get.getResponseBodyAsStream());
      try {
       // Read the file and stream it out in batches
       byte[] b = new byte[batchWriteSize];
       int bytesRead;
       while ((bytesRead = bis.read(b, 0, batchWriteSize)) != -1) {
        out.write(b, 0, bytesRead);
       }
      } finally {
       bis.close(); // Release the input stream, even if a read or write fails.
      }
      out.flush();
     } finally {
      get.releaseConnection(); // Hand the connection back to the client.
     }
    }
    

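    This uses the Apache Commons HttpClient library, along with a few standard java.io and java.net classes, i.e.

    import java.io.BufferedInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.ConnectException;
    
    import org.apache.commons.httpclient.HttpClient;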
    import org.apache.commons.httpclient.methods.GetMethod;
    import org.apache.commons.httpclient.params.HttpClientParams;
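
Several of the answers above point at URLConnection / HttpURLConnection and note that the Content-Type is lost when the page is copied as raw bytes. A minimal sketch of how the two could fit together, using only the JDK, might look like the following. The class name HtmlFetcher, its method names, the 5-second timeouts, and the UTF-8 fallback are all assumptions made for illustration; none of them come from the answers. It also assumes an http:// or https:// URL and Java 7 or later (for try-with-resources).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the class and method names are illustrative only.
final class HtmlFetcher {

  // Extracts the charset from a Content-Type value such as
  // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
  // when the header is missing, has no charset, or names an unknown one.
  static Charset charsetFrom(String contentType) {
    if (contentType != null) {
      for (String part : contentType.split(";")) {
        String p = part.trim();
        if (p.regionMatches(true, 0, "charset=", 0, 8)) {
          String name = p.substring(8).trim().replace("\"", "");
          try {
            return Charset.forName(name);
          } catch (IllegalArgumentException e) {
            break; // malformed or unsupported name: use the fallback
          }
        }
      }
    }
    return StandardCharsets.UTF_8;
  }

  // Reads the stream to the end and decodes the bytes with the given charset.
  static String readAll(InputStream in, Charset charset) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buffer.write(chunk, 0, n);
    }
    return new String(buffer.toByteArray(), charset);
  }

  // Fetches the page at the given http(s) address and returns it as text.
  static String fetch(String address) throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(address).openConnection();
    conn.setConnectTimeout(5000); // milliseconds; arbitrary choice
    conn.setReadTimeout(5000);
    try {
      int status = conn.getResponseCode();
      if (status != HttpURLConnection.HTTP_OK) {
        throw new IOException("Server returned " + status + " for " + address);
      }
      Charset charset = charsetFrom(conn.getContentType());
      try (InputStream in = conn.getInputStream()) {
        return readAll(in, charset);
      }
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws IOException {
    String address = args.length > 0 ? args[0] : "http://stackoverflow.com";
    System.out.println(fetch(address));
  }
}
```

The returned String can then be handed to whichever HTML parser you choose. Note that this sketch only looks at the HTTP header; pages that declare their encoding solely in a meta tag would still be decoded with the fallback.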
    
