Monday, December 30, 2019

Reading a file from a Windows directory:

For example, reading D:/Software\coding\data\file1 produces the error below.

>>> file1=sc.textFile('file\\\Software\coding\data\file1')
>>> file1.collect90
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'RDD' object has no attribute 'collect90'
>>> file1.collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Software\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\rdd.py", line 816, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "D:\Software\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "D:\Software\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "D:\Software\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/D:/file/Software/coding/data ile1
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
        at scala.Option.getOrElse(Option.scala:121)

Solution: Use forward slashes '/' instead of backslashes '\', and include the file extension, e.g. filename.txt or file_name.csv.
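The underlying cause is that Python interprets backslash escape sequences in ordinary string literals: `\f` becomes a form-feed character, which is why the traceback shows the mangled path `data ile1`. A minimal sketch in plain Python (no Spark needed) showing the corruption and the safe alternatives:

```python
# In a normal string literal, backslash escapes are interpreted:
# '\f' is a form feed, so the path is silently corrupted.
broken = 'D:\Software\coding\data\file1'
assert '\f' in broken          # a form-feed character sits where '\f' was typed
assert 'file1' not in broken   # the literal text 'file1' is no longer in the string

# Safe alternatives: a raw string, doubled backslashes, or forward slashes.
raw = r'D:\Software\coding\data\file1.txt'
escaped = 'D:\\Software\\coding\\data\\file1.txt'
forward = 'D:/Software/coding/data/file1.txt'
assert raw.endswith('file1.txt')
```

Any of the three safe forms can be passed to sc.textFile or spark.read.text.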

>>> textFile = spark.read.text("/Software/coding/data/file1.txt")
>>> textFile.count()
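Another way to avoid separator problems is to build the path with pathlib, which accepts either slash and can emit the forward-slash form that Spark accepts on Windows. A small illustration, using the same example path from above:

```python
from pathlib import PureWindowsPath

# PureWindowsPath parses a Windows path without touching the filesystem,
# so either separator is handled safely.
p = PureWindowsPath('D:/Software/coding/data/file1.txt')
assert p.name == 'file1.txt'
assert p.drive == 'D:'

# as_posix() yields the forward-slash form suitable for spark.read.text
assert p.as_posix() == 'D:/Software/coding/data/file1.txt'
```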
