[HADOOP] pyspark.sql.utils.AnalysisException: u'Path does not exist
I am running a Spark job on Amazon EMR, using standard HDFS rather than S3 to store my files. I have a Hive table at hdfs://user/hive/warehouse/, but it cannot be found when the Spark job runs. I set the Spark property spark.sql.warehouse.dir to point to my HDFS directory, and the YARN log confirms it:
17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'.
Later, the log shows the following (the full log is at the end of the page):
LogType:stdout
Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
LogLength:854
Log Contents:
Traceback (most recent call last):
File "test.py", line 25, in <module>
parquet_example(spark)
File "test.py", line 9, in parquet_example
tests = spark.read.parquet("test.parquet")
File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in parquet
File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/test.parquet;'
End of LogType:stdout
So there seems to be a mismatch between the paths. What am I doing wrong?
My HDFS directory for hive/warehouse looks like this:
hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt - hadoop hadoop 0 2017-03-28 18:50 /user/hive/warehouse/test
And this is what /user/hadoop/ contains:
hdfs dfs -ls /user/hadoop/
Found 2 items
drwxr-xr-x - hadoop hadoop 0 2017-03-28 16:53 /user/hadoop/.hiveJars
drwxr-xr-x - hadoop hadoop 0 2017-03-28 19:54 /user/hadoop/.sparkStaging
And here is my Spark job in Python:
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql import Row

def parquet_example(spark):
    tests = spark.read.parquet("test.parquet")
    tests.createOrReplaceTempView("tests")
    tests_result = spark.sql("SELECT * FROM test")
    tests_result.show()

if __name__ == "__main__":
    warehouseLocation = "hdfs://user/hive/warehouse/"
    spark = SparkSession.builder.appName("example").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
    parquet_example(spark)
    spark.stop()
Full YARN log:
Container: container_1490717578939_0012_01_000001 on ip-xxx-xx-xx-xxx.ec2.internal_8041
=========================================================================================
LogType:stderr
Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
LogLength:14054
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/131/__spark_libs__713193244228500015.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for TERM
17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for HUP
17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for INT
17/03/28 19:54:02 INFO ApplicationMaster: Preparing Local resources
17/03/28 19:54:03 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1490717578939_0012_000001
17/03/28 19:54:03 INFO SecurityManager: Changing view acls to: yarn,hadoop
17/03/28 19:54:03 INFO SecurityManager: Changing modify acls to: yarn,hadoop
17/03/28 19:54:03 INFO SecurityManager: Changing view acls groups to:
17/03/28 19:54:03 INFO SecurityManager: Changing modify acls groups to:
17/03/28 19:54:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
17/03/28 19:54:03 INFO ApplicationMaster: Starting the user application in a separate Thread
17/03/28 19:54:03 INFO ApplicationMaster: Waiting for spark context initialization...
17/03/28 19:54:03 INFO SparkContext: Running Spark version 2.1.0
17/03/28 19:54:03 INFO SecurityManager: Changing view acls to: yarn,hadoop
17/03/28 19:54:03 INFO SecurityManager: Changing modify acls to: yarn,hadoop
17/03/28 19:54:03 INFO SecurityManager: Changing view acls groups to:
17/03/28 19:54:03 INFO SecurityManager: Changing modify acls groups to:
17/03/28 19:54:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
17/03/28 19:54:03 INFO Utils: Successfully started service 'sparkDriver' on port 33579.
17/03/28 19:54:04 INFO SparkEnv: Registering MapOutputTracker
17/03/28 19:54:04 INFO SparkEnv: Registering BlockManagerMaster
17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-f3713d64-91da-4cb5-9b55-d4a18c607a74
17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt1/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-634c7d4b-026c-4df7-abf4-7846bd7fc958
17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt2/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-19f0a265-755a-42f0-9282-1e3d98a57ab1
17/03/28 19:54:04 INFO MemoryStore: MemoryStore started with capacity 414.4 MB
17/03/28 19:54:04 INFO SparkEnv: Registering OutputCommitCoordinator
17/03/28 19:54:04 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
17/03/28 19:54:04 INFO Utils: Successfully started service 'SparkUI' on port 37056.
17/03/28 19:54:04 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://xxx.xx.xx.xxx:37056
17/03/28 19:54:04 INFO YarnClusterScheduler: Created YarnClusterScheduler
17/03/28 19:54:04 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1490717578939_0012 and attemptId Some(appattempt_1490717578939_0012_000001)
17/03/28 19:54:04 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/03/28 19:54:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34414.
17/03/28 19:54:04 INFO NettyBlockTransferService: Server created on xxx.xx.xx.xxx:34414
17/03/28 19:54:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/03/28 19:54:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xx.xx.xxx:34414 with 414.4 MB RAM, BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
17/03/28 19:54:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
17/03/28 19:54:04 INFO BlockManager: external shuffle service port = 7337
17/03/28 19:54:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
17/03/28 19:54:05 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1490717578939_0012_1
17/03/28 19:54:05 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/03/28 19:54:05 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/03/28 19:54:05 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
17/03/28 19:54:05 INFO YarnClusterScheduler: YarnClusterScheduler.postStartHook done
17/03/28 19:54:05 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM@xxx.xx.xx.xxx:33579)
17/03/28 19:54:05 INFO ApplicationMaster:
===============================================================================
YARN executor launch context:
env:
CLASSPATH -> /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>/usr/lib/hadoop-lzo/lib/*<CPS>/usr/share/aws/emr/emrfs/conf<CPS>/usr/share/aws/emr/emrfs/lib/*<CPS>/usr/share/aws/emr/emrfs/auxlib/*<CPS>/usr/share/aws/emr/lib/*<CPS>/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar<CPS>/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar<CPS>/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-api-jdo.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-core.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-rdbms.jar<CPS>/usr/share/aws/emr/cloudwatch-sink/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>/usr/lib/hadoop-lzo/lib/*<CPS>/usr/share/aws/emr/emrfs/conf<CPS>/usr/share/aws/emr/emrfs/lib/*<CPS>/usr/share/aws/emr/emrfs/auxlib/*<CPS>/usr/share/aws/emr/lib/*<CPS>/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar<CPS>/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar<CPS>/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar<CPS>/usr/share/aws/emr/cloudwatch-sink/lib/*
SPARK_YARN_STAGING_DIR -> hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1490717578939_0012
SPARK_USER -> hadoop
SPARK_YARN_MODE -> true
PYTHONPATH -> {{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip
command:
LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH" \
{{JAVA_HOME}}/bin/java \
-server \
-Xmx5120m \
'-verbose:gc' \
'-XX:+PrintGCDetails' \
'-XX:+PrintGCDateStamps' \
'-XX:+UseConcMarkSweepGC' \
'-XX:CMSInitiatingOccupancyFraction=70' \
'-XX:MaxHeapFreeRatio=70' \
'-XX:+CMSClassUnloadingEnabled' \
'-XX:OnOutOfMemoryError=kill -9 %p' \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.history.ui.port=18080' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler@xxx.xx.xx.xxx:33579 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
2 \
--app-id \
application_1490717578939_0012 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
py4j-0.10.4-src.zip -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/py4j-0.10.4-src.zip" } size: 74096 timestamp: 1490730839170 type: FILE visibility: PRIVATE
__spark_conf__ -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/__spark_conf__.zip" } size: 75741 timestamp: 1490730839402 type: ARCHIVE visibility: PRIVATE
pyspark.zip -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/pyspark.zip" } size: 452353 timestamp: 1490730838849 type: FILE visibility: PRIVATE
__spark_libs__ -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/__spark_libs__713193244228500015.zip" } size: 196686961 timestamp: 1490730836856 type: ARCHIVE visibility: PRIVATE
hive-site.xml -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/hive-site.xml" } size: 2375 timestamp: 1490730837023 type: FILE visibility: PRIVATE
===============================================================================
17/03/28 19:54:05 INFO RMProxy: Connecting to ResourceManager at ip-xxx-xx-xx-xxx.ec2.internal/xxx-xx-xx-xxx:8030
17/03/28 19:54:05 INFO YarnRMClient: Registering the ApplicationMaster
17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'.
17/03/28 19:54:05 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/03/28 19:54:05 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
17/03/28 19:54:05 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/03/28 19:54:06 INFO metastore: Trying to connect to metastore with URI thrift://ip-xxx-xx-xx-xxx.ec2.internal:9083
17/03/28 19:54:06 INFO metastore: Connected to metastore.
17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/yarn
17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/5f653144-e990-45b0-ba73-cdb4d10e9f7a_resources
17/03/28 19:54:06 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/5f653144-e990-45b0-ba73-cdb4d10e9f7a
17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/yarn/5f653144-e990-45b0-ba73-cdb4d10e9f7a
17/03/28 19:54:06 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/5f653144-e990-45b0-ba73-cdb4d10e9f7a/_tmp_space.db
17/03/28 19:54:06 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs://user/hive/warehouse/
17/03/28 19:54:06 ERROR ApplicationMaster: User application exited with status 1
17/03/28 19:54:06 INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
17/03/28 19:54:06 INFO SparkContext: Invoking stop() from shutdown hook
17/03/28 19:54:06 INFO SparkUI: Stopped Spark web UI at http://xxx.xx.xx.xxx:37056
17/03/28 19:54:06 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/03/28 19:54:06 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
17/03/28 19:54:06 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
17/03/28 19:54:06 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/28 19:54:06 INFO MemoryStore: MemoryStore cleared
17/03/28 19:54:06 INFO BlockManager: BlockManager stopped
17/03/28 19:54:06 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/28 19:54:06 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/28 19:54:06 INFO SparkContext: Successfully stopped SparkContext
17/03/28 19:54:06 INFO ShutdownHookManager: Shutdown hook called
17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt1/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-3a6db594-2b44-47fe-8e48-4220b93e789a
17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt2/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-a54516f0-48be-4fdb-899b-bbee998468b1
17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-552e3cae-c119-47a5-9c63-34d4df59d072
17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-552e3cae-c119-47a5-9c63-34d4df59d072/pyspark-a0240093-16c6-43e4-8f2c-dcef309afe97
End of LogType:stderr
LogType:stdout
Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
LogLength:854
Log Contents:
Traceback (most recent call last):
File "test.py", line 25, in <module>
parquet_example(spark)
File "test.py", line 9, in parquet_example
tests = spark.read.parquet("test.parquet")
File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in parquet
File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/test.parquet;'
End of LogType:stdout
Solution
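The parquet_example function in the question creates a DataFrame from the Parquet file test.parquet, registers it as a temporary view, and queries that view. Because test.parquet is a relative path, it is resolved against the job user's HDFS home directory, which is why the error refers to /user/hadoop/test.parquet. No such file exists there.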
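The traceback in the question shows the relative path test.parquet expanded to hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/test.parquet. The following is a minimal sketch of that resolution rule, not Hadoop's actual code. The helper name resolve_hdfs_path and the host namenode.example:8020 are hypothetical, and it assumes the default HDFS home directory layout /user/<username>.

```python
import posixpath


def resolve_hdfs_path(path, user, namenode):
    """Illustrate how a scheme-less path could be qualified.

    Hypothetical helper: relative paths are joined onto the user's
    home directory (assumed to be /user/<user>); absolute paths are
    kept as they are. The result is prefixed with the namenode address.
    """
    if not posixpath.isabs(path):
        path = posixpath.join("/user", user, path)
    return "hdfs://" + namenode + posixpath.normpath(path)


# Relative path: lands in the home directory of the submitting user.
print(resolve_hdfs_path("test.parquet", "hadoop", "namenode.example:8020"))

# Absolute path: points at the warehouse directory listed in the question.
print(resolve_hdfs_path("/user/hive/warehouse/test", "hadoop", "namenode.example:8020"))
```

Under these assumptions, only an absolute path (or a fully qualified URI) would reach /user/hive/warehouse/test. A bare file name always ends up under /user/hadoop/.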
From the comments: a Hive table named test already exists, so use the created SparkSession to query the table directly:
from pyspark.sql import SparkSession

warehouseLocation = "hdfs://user/hive/warehouse/"
spark = SparkSession \
    .builder \
    .appName("example") \
    .config("spark.sql.warehouse.dir", warehouseLocation) \
    .enableHiveSupport() \
    .getOrCreate()
# Query the existing Hive table instead of reading a Parquet file by path.
spark.sql("SELECT * FROM test").show()
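A side note on the warehouseLocation value. In generic URI syntax, the component immediately after hdfs:// is the authority (the host), not the first directory. The sketch below uses Python's standard urllib.parse purely to illustrate that rule; it is not Hadoop's own path handling, and the helper name describe is hypothetical. Whether this affects the job in the question is an assumption, since a table queried through the metastore takes its location from the metastore.

```python
from urllib.parse import urlparse


def describe(uri):
    """Return the (authority, path) pair that a generic URI parser sees."""
    parts = urlparse(uri)
    return parts.netloc, parts.path


# Two slashes: "user" is read as the host, and the path loses its first component.
print(describe("hdfs://user/hive/warehouse/"))   # ('user', '/hive/warehouse/')

# Three slashes: empty authority, so the whole /user/hive/warehouse/ path is kept.
print(describe("hdfs:///user/hive/warehouse/"))  # ('', '/user/hive/warehouse/')
```

If the warehouse directory needs to be spelled out, the three-slash form (or one that names the namenode host and port explicitly) matches the /user/hive/warehouse listing shown in the question.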
from https://stackoverflow.com/questions/43100458/pyspark-sql-utils-analysisexception-upath-does-not-exist by cc-by-sa and MIT license