“How to read PDF content using .NET?” is one of the most common questions you will find on almost every Microsoft forum. Since I have answered this question with sample code many times, I thought I would write a short article with a detailed explanation.
Here I am going to use iTextSharp.dll to read the PDF file. iTextSharp is a C# port of iText, an open source Java library for PDF generation and manipulation. You can download the DLL from sourceforge.net using the iTextSharp download link.
Now we will start the .NET coding part and put iTextSharp to use.
As this is a sample program, I am going to add only three controls: a FileUpload control to browse for the PDF file, a button to show the content, and a label to display the PDF content.
First we will see the PDF file and the content we are going to read.
Now we will design our .ASPX page. As I mentioned above, we have only three controls.
<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="WebForm1.aspx.cs" Inherits="Sample_2012_Web_App.WebForm1" %>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
        <div>
        </div>
        <asp:Label ID="Label1" runat="server" Text="Please select the PDF File"></asp:Label>
        <asp:FileUpload ID="PDFFileUpload" runat="server" />
        <br />
        <br />
        <asp:Button ID="btnShowContent" runat="server" OnClick="btnShowContent_Click" Text="Show PDF Content" />
        <br />
        <br />
        <asp:Label ID="lblPdfContent" runat="server"></asp:Label>
    </form>
</body>
</html>
The image below shows the interface we have created.
Now we will see the C# code to read the PDF content. Before we start writing the code, we need to add a reference to iTextSharp.dll. In Solution Explorer, right-click References, choose to add a reference, and click the Browse button to locate the DLL file you saved from the download.
Once the reference is added, we have to add the namespaces as below:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
Now we will see the complete source code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace Sample_2012_Web_App
{
    public partial class WebForm1 : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
        }

        protected void btnShowContent_Click(object sender, EventArgs e)
        {
            if (PDFFileUpload.HasFile)
            {
                // Save the uploaded file on the server so PdfReader can open it by path.
                string strPDFFile = PDFFileUpload.FileName;
                string strPDFPath = Server.MapPath(strPDFFile);
                PDFFileUpload.SaveAs(strPDFPath);

                StringBuilder strPdfContent = new StringBuilder();
                PdfReader reader = new PdfReader(strPDFPath);
                try
                {
                    // iTextSharp page numbers start at 1.
                    for (int i = 1; i <= reader.NumberOfPages; i++)
                    {
                        // Create a new extraction strategy for each page.
                        ITextExtractionStrategy objExtractStrategy = new SimpleTextExtractionStrategy();
                        string strPageText = PdfTextExtractor.GetTextFromPage(reader, i, objExtractStrategy);
                        // GetTextFromPage already returns a .NET string, so no encoding conversion is needed.
                        // HTML-encode the text because it is rendered inside a Label.
                        strPdfContent.Append(Server.HtmlEncode(strPageText));
                        strPdfContent.Append("<br/>");
                    }
                }
                finally
                {
                    // Close the reader only after every page has been read,
                    // not inside the loop.
                    reader.Close();
                }
                lblPdfContent.Text = strPdfContent.ToString();
            }
        }
    }
}
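If you need the same extraction logic outside a button click handler, it can be pulled into a small helper class. The following is only a minimal sketch: the class name PdfTextHelper and the method ExtractPages are hypothetical names chosen for illustration and are not part of iTextSharp or of the sample project. It assumes the same iTextSharp reference and namespaces used above.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class PdfTextHelper
{
    // Returns the extracted text of each page, in page order.
    public static List<string> ExtractPages(string pdfPath)
    {
        List<string> pages = new List<string>();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Create a new extraction strategy for each page.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
            }
        }
        finally
        {
            // Close the reader once, after all pages have been read.
            reader.Close();
        }
        return pages;
    }
}
```

In the click handler you could then call PdfTextHelper.ExtractPages(Server.MapPath(strPDFFile)), HTML-encode each returned page, and join the pages with "<br/>" before assigning the result to lblPdfContent.Text.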
Finally, we will see the output.
As usual, you are always welcome to post your comments below.