Update README.md
README.md
@@ -37,15 +37,19 @@ Most LLM applications only convert your PDF simple to txt, nothing more, its lik
|
|
| 37 |
Therefore it's better to convert it with the help of a <b>"Parser"</b>, so the embedder can find better context.<br>
|
| 38 |
I work with "<b>pdfplumber/pdfminer</b>" and no OCR, so it's very fast!<br>
|
| 39 |
<ul style="line-height: 1.05;">
|
| 40 |
-
<li>Works with single and multi
|
| 41 |
-
<li>Intelligent multiprocessing</li>
|
| 42 |
<li>Error tolerant: if a PDF is not convertible, it is skipped with no special handling</li>
|
| 43 |
<li>Instant view of the result, hit one pdf on top of the list</li>
|
| 44 |
-
<li>
|
| 45 |
-
<li>
|
| 46 |
-
<li>
|
| 47 |
-
<li>
|
| 48 |
-
<li>All
|
| 49 |
<li>Approx. 5 to 20 pages/sec, depending on PDF complexity and system power</li>
|
| 50 |
<li>Tested on 300 PDF files (~30,000 pages)</li>
|
| 51 |
</ul>
|
|
|
|
| 37 |
Therefore it's better to convert it with the help of a <b>"Parser"</b>, so the embedder can find better context.<br>
|
| 38 |
I work with "<b>pdfplumber/pdfminer</b>" and no OCR, so it's very fast!<br>
|
| 39 |
<ul style="line-height: 1.05;">
|
| 40 |
+
<li>Works with a single PDF or a list of multiple PDFs, and with folders</li>
|
| 41 |
+
<li>Intelligent multiprocessing ~10-20 pages per second</li>
|
| 42 |
<li>Error tolerant: if a PDF is not convertible, it is skipped with no special handling</li>
|
| 43 |
<li>Instant view of the result, hit one pdf on top of the list</li>
|
| 44 |
+
<li>Removes about 5% of the margins around the page</li>
|
| 45 |
+
<li>Converts some common tables to JSON inside the txt file</li>
|
| 46 |
+
<li>Adds the absolute page number to each page</li>
|
| 47 |
+
<li>Adds the tag “chapter” or “important” to text set in a large and/or bold font</li>
|
| 48 |
+
<li>All txt files are created in the PDF's original folder, with the same name and a .txt extension</li>
|
| 49 |
+
<li>Existing txt files are overwritten if you convert the same PDF again</li>
|
| 50 |
+
<li>If a page contains many text blocks, blocks that you would read first may appear further down the page (a compromise between many layout options)</li>
|
| 51 |
+
<li>Small blocks of text (such as units or individual numbers), usually near diagrams and sketches, appear at the end of each page</li>
|
| 52 |
+
<li>I advise against using a PDF file directly for RAG formatting (embedding), as you never know how it will look, and incorrect input can lead to poor results</li>
|
| 53 |
<li>Approx. 5 to 20 pages/sec, depending on PDF complexity and system power</li>
|
| 54 |
<li>Tested on 300 PDF files (~30,000 pages)</li>
|
| 55 |
</ul>
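To make the per-page steps in the list above more concrete, here is a minimal sketch of how margin trimming, page numbering, font tagging and table-to-JSON conversion could fit together. It is not the project's actual code: the function names, the `[PAGE n]` and `[chapter]`/`[important]` tag formats, the 1.3x font-size threshold, the mapping of large fonts to “chapter” and bold fonts to “important”, and the reading of "about 5%" as 5% per side are all assumptions made for illustration. It uses only the standard library; with pdfplumber, a box like the one returned by `crop_box` could be passed to `page.crop(...)`.

```python
import json


def crop_box(width, height, margin=0.05):
    """Return (x0, top, x1, bottom) with `margin` trimmed from each side.

    Assumption: "about 5% of the margins" means 5% of the page size per side.
    """
    dx, dy = width * margin, height * margin
    return (dx, dy, width - dx, height - dy)


def table_to_json(rows):
    """Turn a table (first row is the header) into a JSON list of row objects."""
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body], ensure_ascii=False)


def tag_line(text, size, bold, body_size=10.0):
    """Prefix a line with a tag if its font is clearly larger than body text or bold.

    The tag spelling and the 1.3x threshold are hypothetical.
    """
    if size >= body_size * 1.3:
        return f"[chapter] {text}"
    if bold:
        return f"[important] {text}"
    return text


def format_page(page_number, lines, tables=()):
    """Assemble one page of txt output: page marker, tagged lines, JSON tables.

    `lines` is an iterable of (text, font_size, is_bold) tuples.
    """
    out = [f"[PAGE {page_number}]"]
    out += [tag_line(text, size, bold) for text, size, bold in lines]
    out += [table_to_json(table) for table in tables]
    return "\n".join(out)
```

Keeping each step as a small pure function makes it easy to check the output format on sample data before running it over a large batch of PDFs.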
|
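The file-handling rules from the list (txt file next to the PDF with the same name, overwrite on re-run, skip PDFs that cannot be converted) can be sketched as follows. This is an illustrative outline, not the project's implementation: `txt_path_for`, `convert_all` and the `extract_text` callable are hypothetical names, and `extract_text` stands in for whatever parser is actually used.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the output path: same folder and name as the PDF, with .txt."""
    return Path(pdf_path).with_suffix(".txt")


def convert_all(pdf_paths, extract_text):
    """Convert each PDF to txt and skip any file whose extraction fails.

    `extract_text` is a hypothetical callable taking a Path and returning str.
    Returns (written_txt_paths, skipped_pdf_paths).
    """
    written, skipped = [], []
    for pdf in map(Path, pdf_paths):
        try:
            text = extract_text(pdf)
        except Exception:
            # Error tolerant: an unconvertible PDF is skipped, nothing more.
            skipped.append(pdf)
            continue
        target = txt_path_for(pdf)
        # write_text replaces any existing file, so re-running overwrites.
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written, skipped
```

Returning the skipped paths lets the caller report which PDFs failed without interrupting the rest of the batch.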