How do you randomly read 1% of a 100 GB dataset first?

This topic was created 1979 days ago; the information mentioned may have changed since then.

How do you randomly read 1% of a 100 GB dataset first? Today 番茄加速 walks you through it.

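Datasets routinely run to tens or hundreds of GB. When reading data that large, is there a way to randomly pick a small portion, load it into memory, and quickly get a feel for the data and start EDA?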

With Pandas' skiprows parameter and a little probability, there is. Here is how it works, reading a 100 GB file called big_data.csv.

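Pass a function to the skiprows parameter. It is called with each line index x, and the line is skipped when it returns True:

x > 0 makes sure the first line (the header) is always read,

np.random.rand() > 0.01 means every other line is skipped with probability 0.99, so about 99% of the data is randomly filtered out.

In other words, only about 1% of the data ever gets loaded into memory.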

    import pandas as pd
    import numpy as np

    # Keep line 0 (the header); skip each other line with probability 0.99.
    df = pd.read_csv("big_data.csv",
                     skiprows=lambda x: x > 0 and np.random.rand() > 0.01)

    print("The shape of the df is {}. It has been reduced 100 times!".format(df.shape))
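To try the idea without a 100 GB file, here is a minimal sketch that runs the same skiprows trick on a small in-memory CSV. The column names, the 100,000-row size, the skip_row helper and the seeded generator are all assumptions made for illustration; none of them come from the original post.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for big_data.csv: a header plus 100,000 data rows.
n_rows = 100_000
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(n_rows))

# A seeded generator (an assumption, not in the post) makes the sample repeatable.
rng = np.random.default_rng(seed=42)


def skip_row(x):
    # x is the 0-based line index; line 0 is the header and is never skipped.
    return x > 0 and rng.random() > 0.01


sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row)

print(list(sample.columns))    # the header survives
print(len(sample) / n_rows)    # fraction of rows kept, close to 0.01
```

Seeding the generator is a design choice for repeatable EDA; the unseeded np.random.rand() from the post gives a different sample on every run.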

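With this approach, the amount of data loaded into memory shrinks to roughly 1% of the original, which helps you get a data analysis going quickly. Note that the file itself is still scanned from start to finish, so the saving is in memory and parsing work rather than in the disk read.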
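The sample is only approximately 1%. Each row is kept independently with probability 0.01, so the number of rows kept follows a binomial distribution. The sketch below computes its mean and standard deviation; the sample_size_stats helper and the one-billion row count are hypothetical, since the post does not say how many rows the file has.

```python
import math


def sample_size_stats(n_rows, keep_prob=0.01):
    """Mean and standard deviation of the number of rows kept (binomial)."""
    mean = n_rows * keep_prob
    std = math.sqrt(n_rows * keep_prob * (1 - keep_prob))
    return mean, std


# Hypothetical row count, chosen only for illustration.
mean, std = sample_size_stats(1_000_000_000)
print(f"expected rows: {mean:.0f}, std dev: {std:.0f}")
```

For large files the spread is tiny compared with the mean, so the sample size will land very close to 1% even though it is never exact.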
