Solr (on Windows): how to build a full-text index over Chinese-language docx/doc and PDF files in a directory? Getting an error

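My goal is to build a full-text index, in real time, over Chinese-language docx/doc and PDF files that are generated at any time under a folder (every two files share a subdirectory), so that they can be searched later. I don't want to extract plain text; I want to index the whole Word document directly. Is that done with dynamic fields? If so, how do I configure it, including each of the XML files?
I am running Solr 6.0 on Windows 2008 with Solr's own bundled web server, not Tomcat. I created the core with solr create -c core1, which generated the core structure under server/solr/core1. core1 contains two directories, conf and data. conf contains solrconfig.xml and managed-schema, but no schema.xml was generated. I added a field through the Schema page of the admin console, and there is still no schema.xml.

I used the post command to index the two docx documents in the dataword directory. The tool finds the files but returns a 400 error; the details follow. If the problem is with Solr 6.0, I can use another version. Please give concrete steps, the more detailed the better. Thanks.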

The post error is as follows:
  Command: java -Dc=core1 -jar bin/post.jar dataword/
  Output:
    SimplePostTool version 5.0.0
    Posting files to [base] url http://localhost:8983/solr/core1/update using content-type application/xml... (Question: why is the type xml? Posting the xml files under example indexes them without problems.)
    Indexing directory server\solr\core1\data (2 files, depth=0)
    POSTing file 820-文档一.docx to [base]
    SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/core1/update
    SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">400</int>
      <int name="QTime">2</int>
    </lst>
    <lst name="error">
      <lst name="metadata"><str name="error-class">org.apache.solr.common.SolrException</str>
      <str name="root-error-class">java.io.CharConversionException</str></lst>
      <str name="msg">Invalid UTF-8 start byte 0x9b (at char #16, byte #-1)</str><int name="code">400</int></lst>
    </response>

    SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/core1/update
    POSTing file 820-文档一.docx to [base]

    The next document produces the same error.
    2 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/core1/update...

    Time spent: 0:00:00.114

The HTML files under docs fail with an error as well.
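
A possible reading of the 400 response: the tool output shows the files being sent to /update with content-type application/xml, so Solr's XML parser tries to read the .docx bytes as UTF-8 text. A .docx file is a ZIP container, so it holds bytes that cannot begin a UTF-8 character, and 0x9b is one of them. The Python sketch below illustrates only that byte-level point; the helper names and the sample bytes are made up for illustration and are not taken from the actual file.

```python
def is_utf8_start_byte(b: int) -> bool:
    """Return True if byte value b may begin a UTF-8 encoded character."""
    # 0x00-0x7F: single-byte ASCII.
    # 0xC2-0xF4: lead byte of a multi-byte sequence.
    # 0x80-0xBF are continuation bytes and can never start a character.
    return b <= 0x7F or 0xC2 <= b <= 0xF4


def first_undecodable_index(data: bytes) -> int:
    """Index of the first byte strict UTF-8 decoding rejects, or -1 if none."""
    try:
        data.decode("utf-8")
        return -1
    except UnicodeDecodeError as err:
        return err.start


# Hypothetical stand-in for the start of a binary file: the ZIP signature
# "PK\x03\x04" followed by an arbitrary non-text byte 0x9b.
sample = b"PK\x03\x04\x9b"

print(is_utf8_start_byte(0x9B))         # False
print(first_undecodable_index(sample))  # 4
```

If this reading is right, the failure comes from how the file was sent rather than from the Chinese content or the schema.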

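"I don't want to extract plain text; I want to index the whole Word document directly": I don't follow what you mean. In the end, the text still has to be extracted from the file first.

\solr-5.5.0\example\files

The core under this directory is able to index files.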

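I mean what the official Solr documentation describes: the Post Tool indexing docx and pdf files directly. See https://cwiki.apache.org/confluence/display/solr/Post+Tool#PostTool-Windows

I don't know how to configure schema.xml or the other configuration files, or what else needs to be done. The goal is to index the entire document. A Word document doesn't seem to have fields; when you open it there is just the Word content, so how do I define a field?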
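
On the "a Word document has no fields" point: as far as I understand it, Solr's extracting handler (Solr Cell, normally mapped to /update/extract) runs the file through Tika on the server, puts the body text into a content field and the document metadata into other fields, and lets the caller supply values such as the id through literal.<field> parameters. A minimal sketch of building such a request URL, assuming the core's solrconfig.xml defines /update/extract and the unique key field is id; the function name, host, and document id are placeholders:

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, core: str, doc_id: str, commit: bool = True) -> str:
    """Build a Solr Cell (/update/extract) URL for one rich document."""
    # literal.<field>=<value> sets a field value directly on the indexed document.
    params = {"literal.id": doc_id}
    if commit:
        # Ask Solr to commit after this request so the document becomes searchable.
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/solr/{core}/update/extract?{urlencode(params)}"


# Placeholder host, core name, and id for illustration only.
url = build_extract_url("http://localhost:8983", "core1", "doc1")
print(url)
```

The file bytes would then be sent as the body of an HTTP POST to this URL. Which field the extracted text ends up in depends on the handler's fmap.content setting in solrconfig.xml, so check that file before querying.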

Indexing rich documents (PDF, Word, HTML, etc)

Index a PDF file into gettingstarted.

bin/post -c gettingstarted a.pdf
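
bin/post is a Unix shell script, so on Windows the equivalent is to call post.jar with Java, as in the question. Below is a sketch of assembling that command with -Dauto=yes, which to my knowledge makes SimplePostTool pick the content type from the file extension and send rich documents to /update/extract instead of posting everything as application/xml. The helper name is made up, the post.jar path is the one used in the question, and the behaviour should be checked against the usage text printed by java -jar post.jar -h:

```python
from typing import List


def post_jar_command(core: str, target: str,
                     post_jar: str = "bin/post.jar",
                     recursive: bool = False) -> List[str]:
    """Assemble a post.jar command line that uses auto content-type detection."""
    args = ["java", f"-Dc={core}", "-Dauto=yes"]
    if recursive:
        # Descend into subdirectories of the target directory.
        args.append("-Drecursive=yes")
    args += ["-jar", post_jar, target]
    return args


# Core name and directory taken from the question.
print(" ".join(post_jar_command("core1", "dataword/", recursive=True)))
```

-Drecursive=yes is included as an option because the question mentions files stored in subdirectories.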
