10 Minutes to Koalas

2020. 10. 28. 22:03

Original: http://nbviewer.jupyter.org/github/SDRLurker/TIL/blob/master/python/ipynb/10%EB%B6%84%20%EC%BD%94%EC%95%8C%EB%9D%BC.ipynb

Running PySpark in Colab

Reference: https://colab.research.google.com/github/asifahmed90/pyspark-ML-in-Colab/blob/master/PySpark_Regression_Analysis.ipynb#scrollTo=sq8U3BtmhtRx

To run Spark in Colab, we first need to install all the dependencies in the Colab environment: Apache Spark built for Hadoop 2.7, Java 8, and findspark, which locates Spark on the system. The installation can be carried out inside the Colab notebook itself. The reference notebook uses Spark 2.3.2 and advises newcomers to avoid Spark 2.4.0, because some users have reported compatibility issues with Python; the cells below install Spark 2.4.7 instead. Follow these steps to install the dependencies:

In [2]:

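# Install Java 8, download and unpack Spark 2.4.7 (built for Hadoop 2.7), then install findspark.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Older Spark releases are kept on archive.apache.org.
!wget -q https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark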

Now that Spark and Java are installed in Colab, it is time to set the environment paths that let PySpark run in your Colab environment. Set the locations of Java and Spark by running the following code:

In [5]:

import os 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" 
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
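A mistyped path usually only shows up later, when the session is created, so it can help to check both locations right after setting them. The following is a minimal sketch and not part of the original notebook; the helper name `missing_dirs` is hypothetical, and the variable names are the ones set in the cell above.

```python
import os

def missing_dirs(env, names):
    """Return the names whose value in `env` is unset or not an existing directory."""
    missing = []
    for name in names:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# In Colab, os.environ already holds the values assigned in the previous cell.
problems = missing_dirs(os.environ, ["JAVA_HOME", "SPARK_HOME"])
print("Missing or invalid:", problems or "none")
```

An empty result suggests both directories exist; it does not prove that they contain working Java and Spark installations.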

Run a local Spark session to test your installation:

In [6]:

import findspark 
findspark.init() 
from pyspark.sql import SparkSession 
spark = SparkSession.builder.master("local[*]").getOrCreate()

10 Minutes to Koalas

Reference: https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb

This is a short introduction to Koalas, geared mainly toward new users. This notebook shows you some key differences between pandas and Koalas. You can run these examples yourself in a live notebook at the reference address above. Databricks users can import the current .ipynb file and run it after installing Koalas.

Customarily, we import Koalas as follows:

In [8]:

!pip install koalas
Collecting koalas Downloading 
...

In [9]:

import pandas as pd 
import numpy as np 
import databricks.koalas as ks 
from pyspark.sql import SparkSession

Object Creation

Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:

In [10]:

s = ks.Series([1, 3, 5, np.nan, 6, 8])

In [11]:

s

Out[11]:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a Koalas DataFrame by passing a dict of objects that can be converted to series-like:

In [12]:

kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

In [13]:

kdf

Out[13]:

	a	b	c
10	1	100	one
20	2	200	two
30	3	300	three
40	4	400	four
50	5	500	five
60	6	600	six

Creating a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [14]:

dates = pd.date_range('20130101', periods=6)

In [15]:

dates

Out[15]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')

In [16]:

pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [17]:

pdf

Out[17]:

	A	B	C	D
2013-01-01	0.246792	0.536389	0.292430	-0.593033
2013-01-02	-0.134876	1.100264	-0.311183	0.923779
2013-01-03	0.137727	0.105094	-0.970088	0.584534
2013-01-04	-0.245857	2.213910	1.932763	0.803901
2013-01-05	-0.497545	0.541320	-0.323730	-0.454794
2013-01-06	0.357657	-0.778258	-0.135661	0.905264

Now, this pandas DataFrame can be converted to a Koalas DataFrame:

In [18]:

kdf = ks.from_pandas(pdf)

In [19]:

type(kdf)

Out[19]:

databricks.koalas.frame.DataFrame

It looks and behaves the same as a pandas DataFrame, though:

In [20]:

kdf

Out[20]:

	A	B	C	D
2013-01-01	0.246792	0.536389	0.292430	-0.593033
2013-01-02	-0.134876	1.100264	-0.311183	0.923779
2013-01-03	0.137727	0.105094	-0.970088	0.584534
2013-01-04	-0.245857	2.213910	1.932763	0.803901
2013-01-05	-0.497545	0.541320	-0.323730	-0.454794
2013-01-06	0.357657	-0.778258	-0.135661	0.905264

It is also possible to create a Koalas DataFrame from a Spark DataFrame.

First, create a Spark DataFrame from the pandas DataFrame:

In [21]:

#spark = SparkSession.builder.getOrCreate()

In [22]:

sdf = spark.createDataFrame(pdf)

In [23]:

sdf.show()

+--------------------+-------------------+--------------------+--------------------+
|                   A|                  B|                   C|                   D|
+--------------------+-------------------+--------------------+--------------------+
|  0.2467916344312529| 0.5363885661296115| 0.29242981074832786| -0.5930334293597112|
|-0.13487637556398294| 1.1002643172222797|-0.31118252856050166|  0.9237787493823764|
| 0.13772736631889093|  0.105094112056177| -0.9700876227314351|  0.5845338086842855|
|-0.24585721059025922|  2.213909904836645|  1.9327634581838828|  0.8039009110324693|
| -0.4975445167193649| 0.5413197244143908| -0.3237299566752663|-0.45479420585587926|
| 0.35765732299914443|-0.7782577978361066| -0.1356607177712088|  0.9052638419278891|
+--------------------+-------------------+--------------------+--------------------+

Creating a Koalas DataFrame from the Spark DataFrame. to_koalas() is automatically attached to Spark DataFrames and becomes available as an API once Koalas is imported.

In [24]:

kdf = sdf.to_koalas()

In [25]:

kdf

Out[25]:

	A	B	C	D
0	0.246792	0.536389	0.292430	-0.593033
1	-0.134876	1.100264	-0.311183	0.923779
2	0.137727	0.105094	-0.970088	0.584534
3	-0.245857	2.213910	1.932763	0.803901
4	-0.497545	0.541320	-0.323730	-0.454794
5	0.357657	-0.778258	-0.135661	0.905264

The columns have specific dtypes. Types that are common to both Spark and pandas are currently supported.

In [26]:

kdf.dtypes

Out[26]:

A    float64
B    float64
C    float64
D    float64
dtype: object

Viewing Data

See the API Reference.

See the top rows of the frame. The results may not be the same as in pandas, though: unlike pandas, the data in a Spark dataframe is not ordered and has no intrinsic notion of index. When asked for the head of a dataframe, Spark simply takes the requested number of rows from a partition. Do not rely on it to return specific rows; use .loc or .iloc instead.

In [27]:

kdf.head()

Out[27]:

	A	B	C	D
0	0.246792	0.536389	0.292430	-0.593033
1	-0.134876	1.100264	-0.311183	0.923779
2	0.137727	0.105094	-0.970088	0.584534
3	-0.245857	2.213910	1.932763	0.803901
4	-0.497545	0.541320	-0.323730	-0.454794

Display the index, columns, and the underlying numpy data.

You can also retrieve the index; the index column can be ascribed to a DataFrame (see later).

In [28]:

kdf.index

Out[28]:

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [29]:

kdf.columns

Out[29]:

Index(['A', 'B', 'C', 'D'], dtype='object')

In [30]:

kdf.to_numpy()

Out[30]:

array([[ 0.24679163,  0.53638857,  0.29242981, -0.59303343],
       [-0.13487638,  1.10026432, -0.31118253,  0.92377875],
       [ 0.13772737,  0.10509411, -0.97008762,  0.58453381],
       [-0.24585721,  2.2139099 ,  1.93276346,  0.80390091],
       [-0.49754452,  0.54131972, -0.32372996, -0.45479421],
       [ 0.35765732, -0.7782578 , -0.13566072,  0.90526384]])

describe() shows a quick statistical summary of your data:

In [31]:

kdf.describe()

Out[31]:

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	-0.022684	0.619786	0.080755	0.361608
std	0.325851	1.000464	0.994291	0.697821
min	-0.497545	-0.778258	-0.970088	-0.593033
25%	-0.245857	0.105094	-0.323730	-0.454794
50%	-0.134876	0.536389	-0.311183	0.584534
75%	0.246792	1.100264	0.292430	0.905264
max	0.357657	2.213910	1.932763	0.923779
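As a sanity check on what describe() reports, the count, mean, and standard deviation for column A can be reproduced with the standard library alone. This sketch copies the six rounded values of column A from the output above; it assumes the reported std is the sample standard deviation (n - 1 denominator), which is consistent with the figure shown.

```python
import statistics

# Column A, copied from the rounded output shown above.
col_a = [0.246792, -0.134876, 0.137727, -0.245857, -0.497545, 0.357657]

summary = {
    "count": len(col_a),
    "mean": statistics.mean(col_a),
    "std": statistics.stdev(col_a),  # sample standard deviation (n - 1)
    "min": min(col_a),
    "max": max(col_a),
}

for name, value in summary.items():
    print(f"{name:>5} {value: .6f}")
```

The percentile rows are left out on purpose, since this sketch makes no assumption about how they are computed.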

Transposing your data:

In [32]:

kdf.T

Out[32]:

	0	1	2	3	4	5
A	0.246792	-0.134876	0.137727	-0.245857	-0.497545	0.357657
B	0.536389	1.100264	0.105094	2.213910	0.541320	-0.778258
C	0.292430	-0.311183	-0.970088	1.932763	-0.323730	-0.135661
D	-0.593033	0.923779	0.584534	0.803901	-0.454794	0.905264

Sorting by its index:

In [33]:

kdf.sort_index(ascending=False)

Out[33]:

	A	B	C	D
5	0.357657	-0.778258	-0.135661	0.905264
4	-0.497545	0.541320	-0.323730	-0.454794
3	-0.245857	2.213910	1.932763	0.803901
2	0.137727	0.105094	-0.970088	0.584534
1	-0.134876	1.100264	-0.311183	0.923779
0	0.246792	0.536389	0.292430	-0.593033

Sorting by value:

In [34]:

kdf.sort_values(by='B')

Out[34]:

	A	B	C	D
5	0.357657	-0.778258	-0.135661	0.905264
4	0.137727	0.105094	-0.970088	0.584534
3	0.246792	0.536389	0.292430	-0.593033
2	-0.497545	0.541320	-0.323730	-0.454794
1	-0.134876	1.100264	-0.311183	0.923779
0	-0.245857	2.213910	1.932763	0.803901

Missing Data

Koalas primarily uses the value np.nan to represent missing data. By default, it is not included in computations.

In [35]:

pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])

In [36]:

pdf1.loc[dates[0]:dates[1], 'E'] = 1

In [37]:

kdf1 = ks.from_pandas(pdf1)

In [38]:

kdf1

Out[38]:

	A	B	C	D	E
2013-01-01	0.246792	0.536389	0.292430	-0.593033	1.0
2013-01-02	-0.134876	1.100264	-0.311183	0.923779	1.0
2013-01-03	0.137727	0.105094	-0.970088	0.584534	NaN
2013-01-04	-0.245857	2.213910	1.932763	0.803901	NaN
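To make "not included in computations" concrete, here is a plain-Python sketch of a mean that skips NaN, applied to the values of column E shown above. The function name `nan_skipping_mean` is hypothetical; it only illustrates the semantics and says nothing about how Koalas implements them.

```python
import math

def nan_skipping_mean(values):
    """Average the values that are not NaN; return NaN if none remain."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        return math.nan
    return sum(present) / len(present)

# Column E of kdf1 as displayed above: two observed values and two missing ones.
col_e = [1.0, 1.0, math.nan, math.nan]

skipped = nan_skipping_mean(col_e)   # the two NaN entries are ignored
naive = sum(col_e) / len(col_e)      # a naive mean propagates NaN instead
print(skipped, naive)
```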

To drop any rows that have missing data:

In [39]:

kdf1.dropna(how='any')

Out[39]:

	A	B	C	D	E
2013-01-01	0.246792	0.536389	0.292430	-0.593033	1.0
2013-01-02	-0.134876	1.100264	-0.311183	0.923779	1.0

Filling missing data with a given value:

In [40]:

kdf1.fillna(value=5)

Out[40]:

	A	B	C	D	E
2013-01-01	0.246792	0.536389	0.292430	-0.593033	1.0
2013-01-02	-0.134876	1.100264	-0.311183	0.923779	1.0
2013-01-03	0.137727	0.105094	-0.970088	0.584534	5.0
2013-01-04	-0.245857	2.213910	1.932763	0.803901	5.0

Operations

Stats

Operations in general exclude missing data.

Performing a descriptive statistic:

In [41]:

kdf.mean()

Out[41]:

A   -0.022684
B    0.619786
C    0.080755
D    0.361608
dtype: float64

Spark Configurations

Various PySpark configurations can be applied internally in Koalas. For example, you can enable Arrow optimization to greatly speed up internal pandas conversion. See the PySpark Usage Guide for Pandas with Apache Arrow.

In [42]:

prev = spark.conf.get("spark.sql.execution.arrow.enabled")  # Keep its default value.
ks.set_option("compute.default_index_type", "distributed")  # Use the distributed default index to prevent overhead.
import warnings
warnings.filterwarnings("ignore")  # Ignore warnings coming from Arrow optimizations.

In [43]:

spark.conf.set("spark.sql.execution.arrow.enabled", True) 
%timeit ks.range(300000).to_pandas()

The slowest run took 4.29 times longer than the fastest. This could mean that an intermediate result is being cached. 1 loop, best of 3: 286 ms per loop

In [44]:

spark.conf.set("spark.sql.execution.arrow.enabled", False) 
%timeit ks.range(300000).to_pandas()

1 loop, best of 3: 1.24 s per loop

In [45]:

ks.reset_option("compute.default_index_type")
spark.conf.set("spark.sql.execution.arrow.enabled", prev)  # Set its default value back.
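The save, change, and restore pattern used in the cells above can be wrapped in a context manager so that the previous value is put back even if the timed code raises. This is a sketch under assumptions: `temporary_conf` is a hypothetical helper, and it only requires an object with get(key) and set(key, value) methods, which is how spark.conf is used above. A small stand-in class is used here so the example runs without Spark.

```python
from contextlib import contextmanager

@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` inside the block, then restore the previous value."""
    previous = conf.get(key)
    conf.set(key, value)
    try:
        yield conf
    finally:
        conf.set(key, previous)

class FakeConf:
    """Stand-in for spark.conf, used only to make this sketch runnable."""
    def __init__(self, settings):
        self._settings = dict(settings)

    def get(self, key):
        return self._settings[key]

    def set(self, key, value):
        self._settings[key] = value

conf = FakeConf({"spark.sql.execution.arrow.enabled": "false"})
with temporary_conf(conf, "spark.sql.execution.arrow.enabled", "true"):
    inside = conf.get("spark.sql.execution.arrow.enabled")
after = conf.get("spark.sql.execution.arrow.enabled")
print(inside, after)
```

With a real session, the same helper would presumably be called with spark.conf in place of the stand-in object.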

Grouping

By "group by" we are referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

In [46]:

kdf = ks.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'], 
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'], 
                    'C': np.random.randn(8), 
                    'D': np.random.randn(8)})

In [47]:

kdf

Out[47]:

	A	B	C	D
0	foo	one	-0.049080	1.047839
1	bar	one	-0.047054	-0.349258
2	foo	two	-1.595671	1.756440
3	bar	three	2.167124	0.335527
4	foo	two	-0.939517	0.613638
5	bar	two	-0.257032	-1.379603
6	foo	one	-0.446948	1.938402
7	foo	three	-0.089810	2.017092

Grouping and then applying the sum() function to the resulting groups:

In [48]:

kdf.groupby('A').sum()

Out[48]:

A	C	D
bar	1.863037	-1.393334
foo	-3.121026	7.373411
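The split, apply, and combine steps can be spelled out in plain Python to show what groupby('A').sum() computes. This sketch copies the rounded rows of kdf from the output above, so its sums agree with the Koalas result only up to rounding of the displayed values; it illustrates the idea and is not how Koalas executes the query.

```python
from collections import defaultdict

# Rows (A, B, C, D) copied from the rounded kdf output above.
rows = [
    ("foo", "one",   -0.049080,  1.047839),
    ("bar", "one",   -0.047054, -0.349258),
    ("foo", "two",   -1.595671,  1.756440),
    ("bar", "three",  2.167124,  0.335527),
    ("foo", "two",   -0.939517,  0.613638),
    ("bar", "two",   -0.257032, -1.379603),
    ("foo", "one",   -0.446948,  1.938402),
    ("foo", "three", -0.089810,  2.017092),
]

# Split: bucket the numeric columns by the value of column A.
groups = defaultdict(list)
for a, _b, c, d in rows:
    groups[a].append((c, d))

# Apply and combine: sum C and D within each group and collect the results.
sums = {
    key: (sum(c for c, _ in members), sum(d for _, d in members))
    for key, members in groups.items()
}

for key in sorted(sums):
    print(key, round(sums[key][0], 6), round(sums[key][1], 6))
```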

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function:

In [49]:

kdf.groupby(['A', 'B']).sum()

Out[49]:

A	B	C	D
foo	one	-0.496027	2.986241
	two	-2.535188	2.370078
bar	three	2.167124	0.335527
foo	three	-0.089810	2.017092
bar	two	-0.257032	-1.379603
	one	-0.047054	-0.349258

Plotting

See the Plotting docs.

In [50]:

%matplotlib inline 
from matplotlib import pyplot as plt

In [51]:

pser = pd.Series(np.random.randn(1000), 
                 index=pd.date_range('1/1/2000', periods=1000))

In [52]:

kser = ks.Series(pser)

In [53]:

kser = kser.cummax()

In [54]:

kser.plot()

Out[54]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f95cfeea2b0>

On a DataFrame, the plot() method is a convenient way to plot all of the columns with labels:

In [55]:

pdf = pd.DataFrame(np.random.randn(1000, 4), index=pser.index, 
                   columns=['A', 'B', 'C', 'D'])

In [56]:

kdf = ks.from_pandas(pdf)

In [57]:

kdf = kdf.cummax()

In [58]:

kdf.plot()

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f95cdb5e278>

Getting Data In/Out

See the Input/Output docs.

CSV

CSV is straightforward and easy to use. Use to_csv() to write a CSV file and read_csv() to read one.

In [59]:

kdf.to_csv('foo.csv') 
ks.read_csv('foo.csv').head(10)

Out[59]:

	A	B	C	D
0	0.496167	0.716324	0.055572	0.956235
1	0.496167	0.716324	0.055572	0.956235
2	1.188582	0.716324	0.055572	0.956235
3	1.188582	0.763502	1.351446	0.956235
4	1.188582	1.583660	1.351446	2.841457
5	1.188582	1.583660	1.351446	2.841457
6	1.188582	1.583660	1.351446	2.841457
7	1.188582	1.583660	1.351446	2.841457
8	1.188582	1.583660	1.351446	2.841457
9	1.188582	1.583660	1.351446	2.841457

Parquet

Parquet is an efficient and compact file format that is fast to read and write. Use to_parquet() to write a Parquet file and read_parquet() to read one.

In [60]:

kdf.to_parquet('bar.parquet') 
ks.read_parquet('bar.parquet').head(10)

Out[60]:

	A	B	C	D
0	0.496167	0.716324	0.055572	0.956235
1	0.496167	0.716324	0.055572	0.956235
2	1.188582	0.716324	0.055572	0.956235
3	1.188582	0.763502	1.351446	0.956235
4	1.188582	1.583660	1.351446	2.841457
5	1.188582	1.583660	1.351446	2.841457
6	1.188582	1.583660	1.351446	2.841457
7	1.188582	1.583660	1.351446	2.841457
8	1.188582	1.583660	1.351446	2.841457
9	1.188582	1.583660	1.351446	2.841457

Spark IO

In addition, Koalas fully supports Spark's various data sources, such as ORC and external data sources. Use to_spark_io() to write to a specified data source and read_spark_io() to read from one.

In [61]:

kdf.to_spark_io('zoo.orc', format="orc") 
ks.read_spark_io('zoo.orc', format="orc").head(10)

Out[61]:

	A	B	C	D
0	0.496167	0.716324	0.055572	0.956235
1	0.496167	0.716324	0.055572	0.956235
2	1.188582	0.716324	0.055572	0.956235
3	1.188582	0.763502	1.351446	0.956235
4	1.188582	1.583660	1.351446	2.841457
5	1.188582	1.583660	1.351446	2.841457
6	1.188582	1.583660	1.351446	2.841457
7	1.188582	1.583660	1.351446	2.841457
8	1.188582	1.583660	1.351446	2.841457
9	1.188582	1.583660	1.351446	2.841457

In [62]:

!ls -lrt

total 227888
drwxr-xr-x 13 1000 1000      4096 Sep  8 05:48 spark-2.4.7-bin-hadoop2.7
-rw-r--r--  1 root root 233333392 Sep  8 07:13 spark-2.4.7-bin-hadoop2.7.tgz
drwxr-xr-x  1 root root      4096 Oct 14 16:31 sample_data
drwxr-xr-x  2 root root      4096 Oct 23 07:16 foo.csv
drwxr-xr-x  2 root root      4096 Oct 23 07:18 bar.parquet
drwxr-xr-x  2 root root      4096 Oct 23 07:21 zoo.orc

How can we use Selenium WebDriver on colab.research.google.com?

2020. 6. 5. 16:57

Source: https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com

I want to use Selenium WebDriver on colab.research.google.com for fast processing. I was able to install Selenium with !pip install selenium, but the Chrome web driver requires a path to webdriverChrome.exe. How can I use it?

P.S. colab.research.google.com is an online platform that provides GPUs for fast computation on deep learning problems. Please refrain from solutions such as webdriver.Chrome(path).

Only 1 of the 4 answers is included here.

You can do this by installing the Chromium web driver and adjusting a few options so that it does not crash in Google Colab.

!pip install selenium
!apt-get update # update the Ubuntu package index so that apt install runs correctly
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')           # Colab has no display
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# chromedriver was copied to /usr/bin above, so it is found on PATH;
# options= replaces the deprecated chrome_options= keyword.
wd = webdriver.Chrome(options=chrome_options)
wd.get("https://www.webite-url.com")

Get Started: 3 Ways to Load CSV Files into Colab

2020. 5. 23. 22:49

Source: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92

Data science is nothing without data. Yes, that is obvious. What is less obvious is the series of steps needed to get the data into a format that lets you explore it. You may have a dataset in CSV (short for comma-separated values) format but no idea what to do next. This post will help you get started in data science by loading a CSV file into Colab.

Colab (short for Colaboratory) is a free platform from Google for coding in Python. Colab is essentially the Google Suite version of a Jupyter Notebook. Some of its advantages over Jupyter include sharing notebooks as documents and easier package installation. Still, loading files such as CSV files requires some extra coding. I will show you three ways to load a CSV file into Colab and insert it into a Pandas dataframe.

(Note: there are Python packages that carry common datasets. I will not discuss loading those datasets in this article.)

To start, log in to your Google Account and go to Google Drive. Click the New button on the left and select Colaboratory if it is installed (if not, click Connect more apps, search for Colaboratory, and install it). From there, import Pandas as shown below (Colab has it installed already).

import pandas as pd

1) From GitHub (files < 25 MB)

The easiest way to upload a CSV file is from your GitHub repository. Click the dataset in your repository, then click View Raw. Copy the link to the raw dataset and store it in Colab as a string variable called url, as shown below (a cleaner method, but not necessary). The last step is to load the URL into Pandas read_csv to get the dataframe.

url = 'copied_raw_GitHub_link'
df1 = pd.read_csv(url)
# The dataset is now stored in a Pandas DataFrame.
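If you only have the regular GitHub page address of a file, the raw address can usually be derived from it. The sketch below assumes the common github.com/<user>/<repo>/blob/<branch>/<path> layout and rewrites it to the raw.githubusercontent.com form; the helper name `to_raw_url` and the example address are hypothetical.

```python
from urllib.parse import urlsplit

def to_raw_url(page_url):
    """Convert a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    parts = urlsplit(page_url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {page_url}")
    user, repo, _blob, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])

# Hypothetical repository and file, for illustration only.
url = to_raw_url("https://github.com/some-user/some-repo/blob/master/data/sample.csv")
print(url)
```

Branch names that contain a slash are not handled by this sketch; copying the link from View Raw, as described above, avoids that ambiguity.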

2) From a local drive

To upload from your local drive, start with the following code:

from google.colab import files
uploaded = files.upload()

It will prompt you to select a file. Click "Choose Files", then select and upload the file. Wait for the file to be 100% uploaded. You should see the name of the file once Colab has uploaded it.

Finally, type in the following code to import it into a dataframe (make sure the file name matches the name of the uploaded file).

import io
df2 = pd.read_csv(io.BytesIO(uploaded['Filename.csv']))
# The dataset is now stored in a Pandas DataFrame.
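files.upload() returns a dict that maps each uploaded file name to its contents as bytes, which is why the bytes are wrapped in io.BytesIO above. The sketch below imitates that dict with a hand-made one (the contents are made up) and picks the first uploaded name instead of hard-coding it; it uses the standard csv module so that it runs outside Colab.

```python
import csv
import io

# Stand-in for the dict returned by files.upload(): file name -> raw bytes.
uploaded = {"Filename.csv": b"name,score\nkim,90\nlee,85\n"}

# Take whichever file was uploaded rather than typing its name by hand.
filename = next(iter(uploaded))

# Wrap the bytes in a file-like object and decode them as text for the parser.
text_stream = io.TextIOWrapper(io.BytesIO(uploaded[filename]), encoding="utf-8", newline="")
records = list(csv.DictReader(text_stream))

print(filename, records)
```

In Colab, the same file name can be passed on to pandas as pd.read_csv(io.BytesIO(uploaded[filename])).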

3) From Google Drive via PyDrive

This is the most complicated of the three methods. I will show it for those who upload CSV files to Google Drive for workflow control. First, type in the following code:

# Code to read a CSV file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

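When prompted, click the link to authenticate so that Google can access your Drive. You should see a screen with "Google Cloud SDK wants to access your Google Account" at the top. After you allow permission, copy the given verification code and paste it into the box in Colab.

Once verification is complete, go to the CSV file in Google Drive, right-click it, and select "Get shareable link". The link will be copied to your clipboard. Paste this link into a string variable in Colab.

link = 'https://drive.google.com/open?id=1DPZZQ43w8brRhbEMolgLqOWKbZbE-IQu' # The shareable link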

What you want is the ID portion after the equals sign. To get that portion, type in the following code:

fluff, id = link.split('=')
print (id) # Verify that you have everything after '='
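Splitting on '=' works for the link format shown above, but the two-name unpacking raises an error if the link carries more than one '='. A slightly more defensive sketch parses the query string instead; the helper name `drive_file_id` is hypothetical, and it assumes the ID is carried in an id query parameter as in the link above.

```python
from urllib.parse import urlsplit, parse_qs

def drive_file_id(link):
    """Return the value of the 'id' query parameter from a Drive share link."""
    query = parse_qs(urlsplit(link).query)
    if "id" not in query:
        raise ValueError(f"no 'id' parameter in link: {link}")
    return query["id"][0]

link = 'https://drive.google.com/open?id=1DPZZQ43w8brRhbEMolgLqOWKbZbE-IQu'
file_id = drive_file_id(link)
print(file_id)
```

Using a name such as file_id also avoids shadowing Python's built-in id() function.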

Finally, type in the following code to get the file into a dataframe:

downloaded = drive.CreateFile({'id':id})
downloaded.GetContentFile('Filename.csv')
df3 = pd.read_csv('Filename.csv')
# The dataset is now stored in a Pandas DataFrame.

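Final Thoughts

These are three approaches to uploading CSV files into Colab. Each has its benefits depending on the size of the file and how you want to organize your workflow. Once the data is in a nicer format such as a Pandas dataframe, you are ready to go to work.

Bonus Method - My Drive

Thank you so much for your support. In honor of this article reaching 50,000 views and 25,000 reads, I am offering a bonus method for getting CSV files into Colab. This one is quite simple and clean. In your Google Drive ("My Drive"), create a folder called data in the location of your choosing. This is where you will upload your data.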

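From a Colab notebook, type the following:

from google.colab import drive
drive.mount('/content/drive')

Just like with the third method, the command takes you through a Google authentication step. You should see a screen with "Google Drive File Stream wants to access your Google Account". After you allow permission, copy the given verification code and paste it into the box in Colab.

In the notebook, click the > at the top left of the notebook and click Files. Locate the data folder you created earlier and find your data. Right-click your data and select Copy Path. Store this copied path in a variable and you are ready to go.

path = "copied path"
df_bonus = pd.read_csv(path)
# The dataset is now stored in a Pandas DataFrame.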

The advantage of this method is that you can access a dataset from a separate dataset folder you created in your own Google Drive, without the extra steps involved in the third method.
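Instead of copying the path by hand, the mounted folder can also be searched from code. The sketch below is an illustration under assumptions: `list_csv_files` is a hypothetical helper, and the folder you would pass in Colab depends on where your data folder lives under /content/drive. A temporary directory stands in for the mounted drive so the example runs anywhere.

```python
import tempfile
from pathlib import Path

def list_csv_files(folder):
    """Return the CSV files found anywhere under `folder`, sorted by path."""
    return sorted(Path(folder).rglob("*.csv"))

# Build a throwaway folder that mimics a mounted `data` directory.
with tempfile.TemporaryDirectory() as root:
    data_dir = Path(root) / "data"
    data_dir.mkdir()
    (data_dir / "train.csv").write_text("a,b\n1,2\n")
    (data_dir / "notes.txt").write_text("not a csv\n")

    found = [p.name for p in list_csv_files(root)]

print(found)
```

Each returned path can then be handed to pd.read_csv in the same way as the copied path above.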

Ryan's Blog
