Issue Description
I am trying to build and run Sparkler from source, following the example given in the README. I have injected a URL and it is visible in Solr.
I run into a problem while crawling and see the error below:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (ip-172-31-39-218.ap-southeast-1.compute.internal executor driver): org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
How to reproduce it
- Clone the main branch.
- Build sparkler-core.
- Modify /home/ubuntu/sparkler/sparkler-core/build/conf/sparkler-default.yaml:
crawldb.backend: solr # "solr" is default until "elasticsearch" becomes usable.
solr.uri: http://localhost:8983/solr/crawldb
- Run the following command to inject:
java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main inject -id sjob-1 -su https://news.bbc.co.uk
- Run the following command to crawl:
java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1
Additional changes: I have modified Crawler.scala and added the code below at line 171:
conf.set("spark.io.compression.codec", "snappy")
Please let me know how to pass Spark configuration at runtime so that I can avoid this code change.
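For what it's worth, SparkConf built with its no-arg constructor (loadDefaults = true) automatically copies in every JVM system property whose key starts with "spark.". So, assuming Sparkler constructs its SparkConf that way (I have not verified this in Crawler.scala), adding -Dspark.io.compression.codec=snappy to the java command above might make the code change unnecessary. A minimal sketch of that SparkConf behavior (the object name is mine):

import org.apache.spark.SparkConf

object SparkConfFromSysProps extends App {
  // Simulate launching the JVM with -Dspark.io.compression.codec=snappy
  System.setProperty("spark.io.compression.codec", "snappy")

  // The no-arg constructor uses loadDefaults = true, which copies every JVM
  // system property whose key starts with "spark." into the conf.
  val conf = new SparkConf()

  println(conf.get("spark.io.compression.codec")) // prints "snappy"
}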
Environment and Version Information
- Java Version:
  openjdk version "1.8.0_312"
  OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
  OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
- Spark Version: 3.0.3, Scala version 2.12.10
- Operating System name and version: AWS instance based on Ubuntu 20.04.1
- Solr: 8.5.0 (in local mode)
I see the ContentHash object in the sparkler-core code, but I do not see it being injected into Solr, so why is it expected while fetching? I see the same error in solr.log:
2022-01-31 04:44:54.871 ERROR (qtp1984990929-17) [ x:crawldb] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:226)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:109)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:977)
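A possible workaround I am considering, assuming the crawldb core uses Solr's managed schema and its schema simply predates the ContentHash feature, is to register the missing field via Solr's Schema API. A SolrJ sketch (the field type and flags are my guesses, not Sparkler's official schema definition):

import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.client.solrj.request.schema.SchemaRequest
import scala.collection.JavaConverters._

object AddContentHashField extends App {
  // Hypothetical workaround: add the 'contenthash' field to the crawldb core
  // via the Schema API. The field name comes from the error message; the
  // type/flags below are assumptions.
  val client = new HttpSolrClient.Builder("http://localhost:8983/solr/crawldb").build()
  try {
    val attrs: java.util.Map[String, AnyRef] = Map[String, AnyRef](
      "name"    -> "contenthash",
      "type"    -> "string",
      "indexed" -> java.lang.Boolean.TRUE,
      "stored"  -> java.lang.Boolean.TRUE
    ).asJava
    new SchemaRequest.AddField(attrs).process(client)
  } finally {
    client.close()
  }
}

Alternatively, recreating the crawldb core from the Solr configuration shipped with the current source tree, if one is provided, should pick up the full schema.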
Stack trace from the Sparkler crawl:
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManagerMaster - Updated info of block rdd_7_0
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManager - Told master about block rdd_7_0
04:44:54.880 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:665)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:156)
at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResource(SolrProxy.scala:121)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:158)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:37)
at scala.collection.Iterator.toStream(Iterator.scala:1417)
at scala.collection.Iterator.toStream$(Iterator.scala:1416)
at edu.usc.irds.sparkler.pipeline.FairFetcher.toStream(FairFetcher.scala:37)
at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:336)
at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:336)
at edu.usc.irds.sparkler.pipeline.FairFetcher.toSeq(FairFetcher.scala:37)
at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$3(Crawler.scala:258)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1418)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1345)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1409)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1230)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
04:44:54.908 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.executor.ExecutorMetricsPoller - removing (3, 0) from stageTCMP